Simplifying Many-Core-Based Heterogeneous SoC Programming With Offload Directives

Andrea Marongiu, Member, IEEE, Alessandro Capotondi, Student Member, IEEE, Giuseppe Tagliavini, and Luca Benini, Fellow, IEEE

IEEE Transactions on Industrial Informatics, vol. 11, no. 4, August 2015, pp. 957-967. DOI: 10.1109/TII.2015.2449994

Abstract—Multiprocessor systems-on-chip (MPSoC) are evolving into heterogeneous architectures based on one host processor plus many-core accelerators. While heterogeneous SoCs promise higher performance/watt, they are programmed at the cost of major code rewrites with low-level programming abstractions (e.g., OpenCL). We present a programming model based on OpenMP, with additional directives to program the accelerator from a single host program. As a test case, we evaluate an implementation of this programming model for the STMicroelectronics STHORM development board. We obtain near-ideal throughput for most benchmarks, very close performance to hand-optimized OpenCL codes at a significantly lower programming complexity, and up to 30× speedup versus host execution time.

Index Terms—Heterogeneous systems-on-chip (SoC), many core, nonuniform memory access (NUMA), OpenMP.

Manuscript received May 07, 2014; revised February 05, 2015; accepted June 08, 2015. This work was supported by EU projects ERC-AdG MultiTherman (291125) and FP7 P-SOCRATES (611016). A. Marongiu and L. Benini are with the Department of Electrical, Electronic, and Information Engineering "Guglielmo Marconi" (DEI), University of Bologna, Bologna, Italy, and also with the Department of Information Technology and Electrical Engineering, Swiss Federal Institute of Technology Zurich (ETH Zurich), Zurich, Switzerland. A. Capotondi and G. Tagliavini are with DEI, University of Bologna, Bologna, Italy.

I. INTRODUCTION

THE EVER-INCREASING demand for computational power within tight energy budgets has recently led to radical evolutions of multiprocessor systems-on-chip (MPSoC). Two design paradigms have proven particularly effective in increasing performance and energy efficiency of such systems: 1) architectural heterogeneity; and 2) many-core processors. Power-aware design [1], [2] and mapping [3], [4] for heterogeneous MPSoCs are being widely studied by the research community, and many-core-based heterogeneous MPSoCs are now a reality [5]-[8].

A common embodiment of architectural heterogeneity is a template where a powerful general-purpose processor (usually called the host), featuring a sophisticated cache hierarchy and a full-fledged operating system, is coupled to programmable many-core accelerators composed of several tens of simple processors, to which highly parallel computation kernels of an application can be offloaded to improve overall performance/watt. Unfortunately, these advantages are traded off for an increased programming complexity: extensive and time-consuming rewrites of applications are required, using specialized programming paradigms. For example, in the general-purpose and high-performance computing (HPC) domains, graphics processing unit (GPU)-based systems are programmed with OpenCL (http://www.khronos.org/opencl). OpenCL aims at providing a standardized way of programming such accelerators; however, it offers a very low-level programming style. Programmers must write and compile separate programs for the host system and for the accelerator. Data transfers to/from the many-core and synchronization must also be manually orchestrated. In addition, OpenCL is not performance-portable: specific optimizations have to be recoded for the accelerator at hand. This generates the need for a higher level programming style [9], [10]. OpenMP [11] has included in its latest specification extensions to manage accelerators, and Xeon Phi coprocessors offer all standard programming models [OpenMP, Pthreads, message passing interface (MPI)] [12], where the accelerator appears like a symmetric multiprocessor (SMP) on a single chip.

In the embedded domain, such proposals are still lacking, but there is a clear trend toward designing embedded SoCs in the same way it is happening in the HPC domain [13], which will eventually call for the same programming solutions.

In this paper, we present a programming model, compiler, and runtime system for a heterogeneous embedded platform template featuring a host system plus a many-core accelerator. The many-core relies on a multicluster design, where each cluster features several simple cores sharing L1 scratchpad memory. Intercluster communication is subject to nonuniform memory access (NUMA) effects. The programming model consists of an extended OpenMP, where additional directives allow the programmer to efficiently program the accelerator from a single host program, rather than writing separate host and accelerator programs, and to distribute the workload among clusters in a NUMA-aware manner, thus improving performance. The proposed OpenMP extensions are only partly in line with the latest OpenMP v4.0 specifications. The latter are in our view too tailored to the characteristics of today's GPUs, as they emphasize data-level accelerator parallelism (modern GPUs being conceived for that) and copy-based host-to-accelerator communication (modern GPUs being based on private-memory designs). Our focus is on many-core accelerators which efficiently support more types of parallelism (e.g., tasks) and leverage shared-memory communication with the host, which is where the heterogeneous system architecture (HSA, http://www.hsafoundation.com) and all GPU roadmaps are heading in the longer term.

We discuss how to provide efficient communication with the host on top of shared memory by transparently relying on pointer exchange when virtual memory paging is natively supported by the many-core, and by leveraging software virtual address translation plus copies into contiguous shared memory (to overcome paging issues) when such support is lacking. We also comment on how copies can be used to implement offload on top of a private accelerator memory space. To achieve our goals, we propose minimal extensions to the previous OpenMP v3.1, emphasizing ease of programming.

For validation, we present a concrete embodiment of our proposal targeting the first STMicroelectronics STHORM development board [5]. This board couples an ARM9 host system and main memory (based on the Zynq7 device) to a 69-core STHORM chip. We present a multi-ISA compilation toolchain that hides the whole process of outlining an accelerator program from the host application, compiling it for the STHORM platform, offloading the execution binary, and implementing data sharing between the host and the accelerator. Two separate OpenMP runtime systems are developed, one for the host and one for the STHORM accelerator.

Our experiments thoroughly assess the performance of the proposed programming framework, considering six representative benchmarks from the computer vision, image processing, and linear algebra domains. The evaluation is articulated in three parts.
1) We relate the achieved throughput to each benchmark's operational intensity using the Roofline methodology [14]. Here, we observe near-ideal throughput for most benchmarks.
2) We compare the performance of our OpenMP to OpenCL, natively supported by the STHORM platform, achieving very close performance to hand-optimized OpenCL codes, at a significantly lower programming complexity.
3) We measure the speedup of our OpenMP versus sequential execution on the ARM host, which exhibits peaks of 30×.

This paper is organized as follows. In Section II, we describe the target heterogeneous embedded system template and the STHORM board. In Section III, we describe our programming model, discussing differences with the OpenMP v4.0 specifications. The STHORM implementation is described in Section IV. In Section V, we provide an experimental evaluation of the proposed OpenMP implementation. Section VI discusses related work. Section VII concludes this paper.

II. TARGET ARCHITECTURE

Fig. 1. Heterogeneous embedded SoC template.

Fig. 1 shows the block diagram of the heterogeneous embedded system template targeted in this work. In this template, a powerful general-purpose processor (the host), featuring a sophisticated cache hierarchy, virtual memory, and a full-fledged operating system, is coupled to programmable many-core accelerators composed of several tens of simple processors, where critical computation kernels of an application can be offloaded to improve overall performance/watt [5]-[8], [15].

The type of many-core accelerator that we consider here has a few key characteristics.
1) It leverages a multicluster design to overcome scalability limitations [5], [6]. Processors within a cluster are tightly coupled to local L1 scratchpad memory, which implies low-latency and high-bandwidth communication. Globally, the many-core accelerator leverages a partitioned global address space (PGAS). Every remote memory can be directly accessed by each processor, but intercluster communication travels through a network-on-chip (NoC), and is subject to NUMA latency and bandwidth.
2) The processors within a cluster are not GPU-like data-parallel cores with common fetch/decode phases, which imply performance loss when parallel cores execute out of lock-step mode. The accelerator processors considered here are simple independent RISC cores, perfectly suited to execute both single instruction, multiple data (SIMD) and multiple instruction, multiple data (MIMD) types of parallelism. This allows a programming model that leverages not only data-level parallelism but also sophisticated forms of dynamic and irregular parallelism (e.g., tasking) to be supported efficiently.
3) The host processor and the many-core accelerator physically share the main DRAM memory, meaning that they both have a physical communication channel to DRAM, as opposed to a more traditional accelerator model where communication with the host takes place via direct memory access (DMA) transfers into a private memory. This type of zero-copy communication, pioneered by the Advanced Micro Devices (AMD) accelerated processing unit (APU) and nowadays widely adopted or included in the roadmaps of all major SoC vendors [13], improves memory utilization and simplifies programming. Note that our extensions for computation offloading can also be implemented on top of a traditional DMA copy-based system, as we explain later on.

To improve the performance of data sharing, an input/output memory management unit (IO-MMU) block may be placed in front of the accelerator.

The presence of an IO-MMU allows the host and the many-core to exchange virtual shared data pointers. In the absence of this block, the many-core is only capable of addressing contiguous (nonpaged) main memory regions. Sharing data between the host and the accelerator in this scenario requires a data copy from the paged to the contiguous memory regions. Virtual-to-physical address translation is done on the host side; then, the pointer to the physical address can be passed to the accelerator.
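As an illustration of this copy-based fallback, the sketch below goes through the three steps in C. The helpers contig_alloc(), virt_to_phys(), and accel_pass_pointer() are hypothetical placeholders for platform services; they are not part of the STHORM software stack described in this paper.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical platform services (assumptions, not the actual host/STHORM API). */
    void     *contig_alloc(size_t size);            /* allocate non-paged, contiguous memory  */
    uintptr_t virt_to_phys(const void *host_vaddr); /* host-side virtual-to-physical lookup   */
    void      accel_pass_pointer(uintptr_t phys);   /* hand a physical pointer to the offload */

    /* Share a paged host buffer with an IO-MMU-less accelerator. */
    void share_without_iommu(const void *paged_buf, size_t size)
    {
        void *contig_buf = contig_alloc(size);        /* 1) reserve a contiguous region        */
        memcpy(contig_buf, paged_buf, size);          /* 2) copy out of paged virtual memory   */
        accel_pass_pointer(virt_to_phys(contig_buf)); /* 3) translate and pass the physical ptr */
    }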
A. Test Platform: STHORM

Fig. 2. Four-cluster STHORM SoC.

STHORM, previously known as Platform 2012 [5], is a many-core organized as a globally asynchronous, locally synchronous (GALS) fabric of multicore clusters (see Fig. 2). A STHORM cluster contains (up to) 16 STxP70-v4 processing elements (PEs), each of which has a 32-bit RISC load-store architecture, with dual issue and a seven-stage pipeline, plus a private instruction cache (16 kB). PEs communicate through a shared multiported, multibank tightly coupled data memory (TCDM, a scratchpad memory). The interconnection between the PEs and the TCDM was explicitly designed to be ultra-low latency. It supports up to 16 concurrent processor-to-memory transactions within a single clock cycle, given that the target addresses belong to different banks (one port per bank).

The STHORM fabric is composed of four clusters, plus a fabric controller (FC) responsible for global coordination of the clusters. The FC and the clusters are interconnected via two asynchronous NoCs (ANoC). The first ANoC is used for accessing a multibanked, multiported L2 memory. The second ANoC is used for intercluster communication via the L1 TCDMs and to access the offchip main memory (L3 DRAM). Note that all the memories are mapped into a global address space, visible from every PE. L3 access requests are transported offchip via a synchronous NoC link (SNoC).

Fig. 3. STHORM heterogeneous system.

The first STHORM-based heterogeneous system is a prototype board based on the Xilinx Zynq 7000 FPGA device (see Fig. 3), which features an ARM CA9 dual-core host processor, main (L3) DDR3 memory, plus programmable logic (FPGA). The ARM subsystem on the Zynq is connected to an AMBA AXI matrix, through which it accesses the DRAM controller. To grant STHORM access to the L3 memory, and the ARM system access into the STHORM L1/L2 memories, a bridge is implemented in the FPGA, which has three main functions.
1) It translates STHORM transactions from the SNoC protocol to the AXI protocol (and ARM transactions from AXI to SNoC).
2) It implements address translation logic in the remap address block (RAB). This is required to translate addresses generated by STHORM into virtual addresses as seen by the host application and vice versa. Indeed, the host system features paged virtual memory and MMU support, while STHORM operates on physical addresses. Thus, the RAB acts as a very simple IO-MMU.
3) It implements a synchronization control channel by conveying interrupts in two directions through the FPGA logic and into dedicated offchip wires. The FPGA bridge is clocked conservatively at 40 MHz in this first board. This currently constitutes the main system bottleneck, as we explain in Section V (like any realistic heterogeneous SoC design, STHORM is clearly intended for same-die integration with the host, with an orders-of-magnitude faster bridge and larger memory bandwidth).

III. PROGRAMMING MODEL

The work presented in this paper was conducted within an FP7 EU project kicked off in 2011, when OpenMP v3.1 had just been released. During the course of the project, we designed the extensions (presented here) that we considered key to handle the two most critical aspects of heterogeneous SoC programming: the management of a shared-memory many-core accelerator and the management of thread affinity over its NUMA clusters. In July 2013, OpenMP v4.0 was released, which introduces new directives to address these very issues. Aligning our own specification for affinity control to the official OpenMP v4.0 was natural; the same was not true for the accelerator management directives. OpenMP v4.0 focuses on an accelerator model based on existing GPU-like coprocessors and associated programming models [16]. This, in our view, has made the new directives for data sharing and parallelism deployment more complicated to use. Our custom OpenMP extensions, designed with next-generation many-core devices in mind [13], emphasize simplicity, as we explain in Section III-C.

A. OpenMP Extensions

Traditionally, writing code for a heterogeneous SoC (e.g., with OpenCL) requires the programmer to manually write a program into separate files (at least one for the host, one for the accelerator), and to manually compile them into different binaries. The host program must also explicitly include instructions to load the accelerator binaries, start the computation, transfer data, and synchronize. In our proposal, the programmer writes a single OpenMP host application, where a custom offload directive is used to abstract away the procedure of outlining a program for the accelerator, compiling it into a separate accelerator executable, offloading code and data to the accelerator, and synchronizing with the accelerator.

    #pragma omp offload [clause [, ...]]
        structured-block

where clause is one of the following:

    name (string, integer-var)
    private (list)
    shared (list)
    firstprivate (list)
    lastprivate (list)
    nowait

The name clause is used to univocally identify a kernel to be offloaded to the accelerator. This is achieved through a literal (string) parameter, plus an integer variable whose declaration is visible from the code block immediately enclosing the offload directive. The integer variable is used for synchronization purposes. If an offload request is successful, an integer value is returned, which specifies the unique ID of the offloaded job. A negative return value indicates failure; in this case, the offload block is executed on the host itself. The same integer variable specified in the name clause can be used to synchronize at specific program points with the custom wait directive:

    #pragma omp wait (integer-var)

Note that in case the nowait clause is not specified, the offloaded block executes synchronously (i.e., the offloading host thread will block until the accelerator execution is completed).

The private, shared, firstprivate, and lastprivate clauses can be used to specify data sharing between host and accelerator, and work in the same way as the standard OpenMP constructs for parallelism. private variables are duplicated in the accelerator memory space. The code executing on the accelerator only refers to these private copies and does not access the host memory. firstprivate variables work in the same way, but they are initialized at the beginning of the offload block to the value of the original variables from the enclosing host execution context. Similarly, lastprivate variables have local storage in the accelerator memory space. Their content is determined during the execution of the offload block and copied back to the original variable in the host memory space at the end of the accelerator execution. shared variables identify truly shared main memory storage. Both the host and the accelerator directly access these locations.
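For concreteness, a minimal usage sketch combining these clauses is shown below. The kernel name, array size, and variables are invented for illustration and do not come from the paper; the directive syntax follows the definitions above.

    #include <omp.h>

    #define N 1024

    void host_side(float a[N], float b[N])
    {
        int job_id;          /* receives the unique job ID on success, negative on failure */
        float checksum = 0;  /* produced on the accelerator, copied back at the end        */
        float scale = 2.0f;  /* copied in when the offload starts                          */

        /* Asynchronous offload: the host thread continues right away. */
        #pragma omp offload name("vec_scale", job_id) shared(a, b) \
                            firstprivate(scale) lastprivate(checksum) nowait
        {
            #pragma omp parallel for reduction(+ : checksum)
            for (int i = 0; i < N; i++) {
                b[i] = scale * a[i];   /* a and b live in shared main memory (zero-copy) */
                checksum += b[i];
            }
        }

        /* ... independent host work here ... */

        #pragma omp wait(job_id)       /* block until the offloaded kernel has finished  */
    }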

Within an offload block, all regular OpenMP v3.1 constructs can be used, including tasks (this is a major difference with OpenMP v4.0, which does not allow tasks to be offloaded to the accelerator). The target accelerator is designed as a set of clusters, with NUMA remote communication. Nested parallelism represents a powerful programming abstraction for cluster-based architectures, as it allows efficient exploitation of a large number of processors and a NUMA memory hierarchy. Nested parallelism offers the ability to cluster threads hierarchically, where outer levels of coarse-grained parallelism can be distributed among clusters, and data parallelism can be used to distribute work within a cluster. The OpenMP architecture review board has included in the recent specification v4.0 the definition of a new proc_bind construct, to be coupled to the parallel directive:

    proc_bind ( master | close | spread )

The master policy assigns every thread in the team to the same place as the master thread. The close policy assigns the threads to places close to the place of the parent's thread. The master thread executes on the parent's place and the remaining threads in the team execute on places from the place list consecutive from the parent's position in the list, with wrap around with respect to the place list. The spread policy creates a sparse distribution for a team of T threads among the P places of the parent's place partition. Fig. 4 illustrates how proc_bind allows a nested parallel region to be easily mapped over the target multicluster.

Fig. 4. Nested parallel team deployment among clusters with the proc_bind clause.

Using proc_bind(spread) at the outermost parallel construct recruits threads from different clusters (outer parallel team). Using proc_bind(close) at the innermost parallel construct recruits threads from within the same cluster (nested parallel teams).

Concerning locality, it is only effective to use as many nesting levels as the depth of the system interconnect (two in the target platform). However, additional nesting levels within a cluster can be used to get more flexibility in creating parallelism, by dynamically creating more threads only when the workload actually requires so.

As an example, let us consider Strassen matrix multiplication. It is organized in three main computation stages, to be executed in sequence. The first stage consists of nine matrix sums, the second of seven matrix multiplications, and the third of four matrix sums. Within each stage, sum or multiplication blocks are coarse-grained tasks that can be executed in parallel. Within each of these tasks, there is additional fine-grained data (loop) parallelism. Suppose that we need to perform N distinct matrix multiplications. We can use a first level of parallelism to distribute the N matrix multiplication instances among different clusters. Using proc_bind(spread) ensures that each instance will execute in isolation within a single cluster. Locally to each cluster, we can use a second level of parallelism to distribute coarse-grained tasks to cores, and a third level to distribute inner loop iterations to additional threads only when this is beneficial (see Fig. 5). The proc_bind(close) clause ensures that the threads for the two innermost-nested parallel regions are recruited from the same cluster, thus ensuring high computation locality.

Fig. 5. Nested parallel Strassen matrix multiplication deployment within a cluster.
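A simplified sketch of this nesting pattern follows. It is not the authors' Strassen implementation: strassen_stage(), the thread counts, and the loop bound are placeholders standing in for the coarse- and fine-grained work described above.

    #include <omp.h>

    #define N_INSTANCES 4   /* distinct matrix multiplications, one per cluster (illustrative) */
    #define N_TASKS     7   /* coarse-grained blocks within a stage (illustrative)             */

    void strassen_stage(int instance, int task, int iter);  /* placeholder for the real work */

    void run_on_accelerator(void)
    {
        omp_set_nested(1);  /* enable nested parallel regions */

        /* Outer team: one thread per cluster, spread across the fabric. */
        #pragma omp parallel num_threads(N_INSTANCES) proc_bind(spread)
        {
            int instance = omp_get_thread_num();

            /* Inner team: threads recruited from the same cluster for locality. */
            #pragma omp parallel num_threads(N_TASKS) proc_bind(close)
            {
                int task = omp_get_thread_num();

                /* Third level: fine-grained loop parallelism, still within the cluster. */
                #pragma omp parallel for num_threads(2) proc_bind(close)
                for (int i = 0; i < 64; i++)
                    strassen_stage(instance, task, i);
            }
        }
    }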

B. Host Program Transformation

Fig. 6. Program with OpenMP extensions.

Fig. 7. Transformed OpenMP program.

Fig. 6 shows an example host program which uses our OpenMP extensions. The offload construct outlines the kernel to be accelerated (lines 8-22). This kernel requires two clusters: the first executes TASK_A; the second executes TASK_B. This is specified with the parallel sections directive (lines 13-15). num_threads(2) specifies the number of clusters, as we use the proc_bind(spread) clause. TASK_A and TASK_B contain inner parallelism which is distributed among all the 16 cores in each cluster. This is specified with the parallel for directive, coupled to the num_threads(16) and proc_bind(close) clauses (lines 35-37). The host executes the offload asynchronously, sharing arrays a and b with the accelerator. This is specified with the clauses nowait and shared(a, b) (lines 9-11).

Fig. 7 shows how the compiler transforms the code. The offload block is replaced with a marshaling procedure to implement data sharing between the host and the accelerator (lines 12-22).

Data marshaling packs information about shared, firstprivate, and lastprivate variables into three instances of an mdata data structure, which hold the number of variables of each type, plus an array of data_desc structures, whose elements contain the base address and size of each variable of that type (lines 11-16).

    struct data_desc {
        unsigned int *ptr;
        unsigned int size;
    };

    struct mdata {
        unsigned int n_data;
        struct data_desc data[];   /* n_data elements */
    };

The size field is necessary for IO-MMU-less systems, where data sharing is implemented with a (transparent) copy from paged virtual memory into the contiguous memory region (see Fig. 1). The same mechanism can also be used to implement data sharing on top of a traditional distributed memory system via DMA copies. When an IO-MMU is available, the size field is ignored, as the virtual shared data pointer can be safely propagated to the accelerator. The three mdata instances are finally collected into an otask structure, along with the kernel name (lines 18-22).

    struct otask {
        char *name;
        struct mdata *shared_data;
        struct mdata *fprivate_data;
        struct mdata *lprivate_data;
    };

The offload block is outlined into a new function (line 39 onward), similar to the expansion of standard OpenMP parallel blocks. This function is compiled both for the host and for the accelerator. The host tries to offload a task via a call to a custom GOMP_offload_task runtime function (line 24). If a negative value is returned, the host version is executed (lines 25-27).

The simplified code for the STHORM implementation of GOMP_offload_task is shown in Fig. 8. First, the target kernel object file name (.so) is resolved (line 9). A native runtime function (LoadInBanks) is invoked to dynamically link and load the executable into the accelerator L2 memory (line 12). Then, firstprivate data are handled. For each data element in the corresponding descriptor, memory is allocated in the accelerator L2, then a DMA transfer is triggered. The pointer to the STHORM copy is then inserted into a context data structure (lines 16-18). For shared data no copy is involved, and only pointers to the host main memory are annotated into the context data structure (note that adding a DMA copy at this point allows our offload mechanism to be supported on traditional distributed memory systems). Finally, the CallMain function is invoked to start the main method on the accelerator (lines 21-22). In case of a synchronous offload, lastprivate data are copied back to the host main memory after the end of the kernel execution (lines 25-27). When the nowait clause is specified, lastprivate data are dealt with inside the GOMP_wait primitive.

Fig. 8. Runtime function for an offload.
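To make the transformation more concrete, the fragment below sketches compiler-generated marshaling for an offload with one shared and one firstprivate variable, using the data_desc, mdata, and otask structures defined above. The GOMP_offload_task signature is not given in the paper, so the one assumed here (a pointer to the otask descriptor, returning the job ID or a negative value) and the helper make_mdata() are illustrative assumptions.

    #include <stdlib.h>

    /* Uses struct data_desc, mdata, and otask as defined above. */
    extern int GOMP_offload_task(struct otask *task);   /* assumed signature            */
    void kernel_x_host_version(float *a, float scale);  /* host copy of the outlined block */

    static struct mdata *make_mdata(unsigned int n)
    {
        struct mdata *md = malloc(sizeof(*md) + n * sizeof(struct data_desc));
        md->n_data = n;
        return md;
    }

    int offload_kernel_x(float *a, unsigned int a_size, float scale)
    {
        struct mdata *shared_md = make_mdata(1);
        shared_md->data[0] = (struct data_desc){ (unsigned int *)a, a_size };

        struct mdata *fpriv_md = make_mdata(1);
        fpriv_md->data[0] = (struct data_desc){ (unsigned int *)&scale, sizeof(scale) };

        struct mdata *lpriv_md = make_mdata(0);   /* no lastprivate variables here */

        struct otask task = { "kernel_x", shared_md, fpriv_md, lpriv_md };

        int id = GOMP_offload_task(&task);        /* negative return value: offload failed */
        if (id < 0)
            kernel_x_host_version(a, scale);      /* fall back to the host version         */
        return id;
    }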

C. Comparison With OpenMP Specifications v4.0

1) Data Sharing: The way data sharing is specified in OpenMP v4.0 suggests that it assumes an offload model with segregated host and accelerator memory spaces. In this model, typical of traditional GPGPU-based systems, data sharing relies on memory transfers. The map clause lists program variables that can be marked with the attributes to or from. A data item can appear in both lists, or just in one list, indicating that it is read-only or write-only within the block. Using separate lists allows the number of implied transfers to be optimized.

Copy-based data sharing is also evident in the new target data construct, used in the code before the actual offload (or to bind multiple offloads to the same data context) to schedule data transfers ahead of time.

Supporting this accelerator model requires many new directives, clauses, and original execution model semantics. In contrast, our proposal aims at maintaining the traditional OpenMP clauses for data sharing. Copies can be specified (e.g., for performance) on read-only and write-only data using the familiar firstprivate and lastprivate clauses, respectively. shared variables are implemented with zero-copy, embracing an accelerator model which, following the HSA roadmap, assumes physical data sharing. Zero-copy communication simplifies the offload mechanism to marshaling and exchanging pointers, which has a much lower cost (see Section V-B).
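The contrast can be illustrated with two sketches of the same kernel: the first uses standard OpenMP v4.0 map clauses, the second the zero-copy clauses proposed here. Variable and kernel names are illustrative.

    /* OpenMP v4.0 style: copy-based sharing expressed with map lists. */
    void scale_v40(float *in, float *out, int n)
    {
        #pragma omp target map(to: in[0:n]) map(from: out[0:n])
        {
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                out[i] = 2.0f * in[i];
        }
    }

    /* Proposed style: zero-copy shared data; explicit copies would use
     * firstprivate/lastprivate instead (illustrative variables). */
    void scale_offload(float *in, float *out, int n)
    {
        int id;
        #pragma omp offload name("scale", id) shared(in, out) firstprivate(n)
        {
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                out[i] = 2.0f * in[i];
        }
    }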

2) Parallelism Deployment: Within an offload region, OpenMP v4.0 allows specific constructs to leverage the features of GPU-like accelerator hardware. Such features include SIMD processing in the ALUs, or their organization in clusters. Specifically, the new notion of leagues represents an abstraction of accelerator clusters. Similarly, teams abstract parallel cores within a cluster. A league can be specified with the teams directive, where the num_teams clause allows the programmer to specify how many teams the league will be composed of (i.e., how many clusters we want to use). A team and its size can be specified with the parallel directive and the associated num_threads clause. Distributing workload in a cluster-aware manner can be done using the distribute directive. These new directives were introduced to bridge a gap with GP-GPU programming abstractions (e.g., CUDA grids and blocks), but they logically represent yet another abstraction of nested parallelism, already supported in OpenMP v3.1. Leagues can be represented with an outer parallel directive; teams can be specified with an inner parallel directive. Distributing workload in a cluster-aware manner can be done with the proc_bind clause. The example code that we have already presented in Fig. 6 shows how this can be easily specified with standard OpenMP v3.1 directives plus the extensions we proposed. Moreover, our proposal allows all OpenMP constructs to be used within an offload block, as the accelerators we are targeting do not have the limitations of GPU cores in executing MIMD types of parallelism. In particular, we foresee the tasking execution model to be a very valuable abstraction for extracting high degrees of parallelism from such accelerators [17].

3) Asynchronous Offload: Specifying asynchronous offloads can be done in OpenMP v4.0 by enclosing a target directive within a task directive. The thread executing the task encounters a task scheduling point while waiting for the completion of the target region, allowing the thread to be rescheduled to other independent tasks. This is evidently not the most intuitive way to specify asynchronous offload. The nowait clause that we propose for this goal is a construct already present in OpenMP v3.1, used in association with work-sharing constructs (for, sections) to specify that thread synchronization at the end of such constructs is unnecessary, and with which programmers are familiar.

Note that this does not prevent the use of the former approach. Enclosing a target directive within a task directive may enable, in our proposal (where tasks can execute on the accelerator), an elegant means of specifying hierarchical tasking, allowing parts of a task graph generated on the host program to run on the accelerator.
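The two ways of expressing an asynchronous offload can be sketched as follows; do_kernel() and the variables are illustrative, and the second fragment relies on the offload extensions proposed in this paper.

    void do_kernel(float *buf, int n);   /* illustrative kernel */

    /* OpenMP v4.0 style: asynchrony obtained by wrapping target in a task. */
    void async_v40(float *buf, int n)
    {
        #pragma omp task
        {
            #pragma omp target map(tofrom: buf[0:n])
            do_kernel(buf, n);
        }
        /* ... independent host work ... */
        #pragma omp taskwait
    }

    /* Proposed style: nowait on the offload, explicit wait on the job ID. */
    void async_offload(float *buf, int n)
    {
        int id;
        #pragma omp offload name("do_kernel", id) shared(buf) firstprivate(n) nowait
        do_kernel(buf, n);

        /* ... independent host work ... */
        #pragma omp wait(id)
    }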
IV. STHORM PROTOTYPE IMPLEMENTATION

The proposed OpenMP extensions have been implemented in a multi-ISA toolchain for the STHORM board (see Fig. 9).

Fig. 9. Multi-ISA toolchain.

All the OpenMP expansion is based on the GNU Compiler Collection (GCC) (v4.8), which provides a mature and full-fledged implementation of OpenMP v3.1. The STxP70 backend toolkit is based on the Clang + Low-Level Virtual Machine (LLVM) (http://www.llvm.org) compilation infrastructure. The GCC compilation pipeline produces the final ARM host executable, while offload blocks and function calls therein (including those implicitly created by the expansion of parallel directives) are translated into the LLVM IR using a customized version of DragonEgg (http://dragonegg.llvm.org), and finally compiled into xP70 executables. To do so, we derive from the original program call graph as many LLVM translation units as there are offload blocks, as follows. First, all the functions created by GCC expansion of offload blocks are marked with a name attribute (derived from the name clause associated with the offload directive). Second, a custom LLVM analysis pass visits the call graph and collects the marked functions (plus associated global data and type declarations) within distinct subcall-graphs and into separate translation units.

To avoid data copies from paged virtual memory into contiguous memory upon offload, we force the allocation of data marked as shared in contiguous memory at compile time. The OpenMP runtime relies on a custom library for lightweight nested fork/join, based on our previous work [18].

V. EXPERIMENTAL RESULTS

We evaluate our programming model using the six benchmarks briefly described in Table I.
1) We measure the maximum throughput [Gops/sec] achieved for the various benchmarks. The focus is on capturing the effects on peak performance of offchip memory bandwidth, the constraining resource in the first STHORM board. To this aim, we adopt a methodology that relates processor performance to offchip memory traffic: the Roofline model.
2) We compare the cost of our offload mechanism and the performance (execution time) of our runtime layer to the corresponding support provided by OpenCL, currently the de facto standard for accelerator programming. The official STHORM SDK provides optimized support for OpenCL v1.1, which we leverage for our characterization.
3) We discuss the performance of the acceleration as compared to sequential execution of the benchmarks on the host processor. Specifically, we show how the speedup (accelerator vs. host execution time) scales as the number of repetitions of the offloaded kernels increases.

TABLE I BENCHMARKS

TABLE II PARAMETERS FOR THE STHORM ROOFLINE MODEL

TABLE III OPERATIONS PER BYTE (OPB) FOR DIFFERENT BENCHMARKS

A. Program Throughput and the Roofline Model

The Roofline model [14] defines operational intensity (hereafter OPB: operations per byte) as an estimate of the DRAM bandwidth needed by a kernel on a particular computer (ops/byte). A Roofline plot is a 2-D graph which ties together operational intensity on the x-axis, and peak processor performance (ops/sec) plus memory performance [bytes/sec = (ops/sec)/(ops/byte)] on the y-axis. Peak performance is a horizontal line, whereas memory performance is a line of unit slope. The two lines intersect at the point of peak computational performance and peak memory bandwidth (the ridge). The composition of the two lines is a roof-shaped curve which provides an upper bound on performance for a kernel depending on its operational intensity. If a kernel's operational intensity is below the ridge point, the kernel will be memory bound on that platform; otherwise, it will be compute bound. The x-coordinate of the ridge point is the minimum operational intensity required to achieve maximum performance on that platform.

To characterize the Roofline curves for STHORM, we use the following model:

$$\mathrm{Perf}\left[\tfrac{\mathrm{Gops}}{\mathrm{sec}}\right] = \min\begin{cases} N_{\mathrm{PEs}} \cdot \mathrm{IPC}\left[\tfrac{\mathrm{ops}}{\mathrm{cycle}}\right] \cdot f_c\left[\tfrac{\mathrm{cycles}}{\mathrm{sec}}\right] \\ \mathrm{DMA}_{\mathrm{bw}}\left[\tfrac{\mathrm{bytes}}{\mathrm{sec}}\right] \cdot \mathrm{OPB}\left[\tfrac{\mathrm{ops}}{\mathrm{byte}}\right]. \end{cases}$$

The peak processor performance is computed as the product of 1) the maximum number of instructions (ops) that a single processor can retire per cycle (IPC); 2) the number of processors available ($N_{\mathrm{PEs}}$); and 3) the processor's clock frequency ($f_c$). The peak memory bandwidth is computed as the product of the available DMA bandwidth ($\mathrm{DMA}_{\mathrm{bw}}$) and the operational intensity (OPB). The numerical values for all the parameters are summarized in Tables II and III.

These values come from hardware specifications, with the exception of $\mathrm{DMA}_{\mathrm{bw}}$, for which we designed a custom microbenchmark that measures the cost (in clock cycles) for DMA transfers of increasing sizes. This cost increases linearly with the size of the transfer, and we can extrapolate a slope value $Sl\left[\tfrac{\mathrm{cycles}}{\mathrm{byte}}\right]$ with linear regression. The available DMA bandwidth is finally computed as follows:

$$\mathrm{DMA}_{\mathrm{bw}} = \frac{f_c}{Sl}\left[\tfrac{\mathrm{Mbytes}}{\mathrm{sec}}\right].$$

The empirical measurement reports a maximum bandwidth of 320 MB/s for read operations and 180 MB/s for write operations (for reference, the Nvidia Kepler K40 GPU has 288 GB/s and the Intel Xeon Phi has 320 GB/s).

Fig. 10. Roofline for real benchmarks.

Fig. 10 shows the roofline for the STHORM platform. Real benchmarks are placed along the x-axis based on their (measured) OPB. In most cases, the workload is strictly memory bound (low OPB). MAH and SHT1 do not achieve peak (roof) performance even if their OPB is past the ridge. The measured IPC when running the benchmarks sequentially on a single core (an upper bound for the parallel benchmarks) is 0.6 for MAH and 0.7 for SHT1. The reasons for this small IPC are multiple. First, the compiler is rarely capable of scheduling two instructions at every cycle. Other limiting factors are pipeline stalls, branch mispredictions, and access conflicts on the L1 shared memory. Besides the low IPC, the results achieved on the parallel benchmarks are very close to the upper bound.
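As a small numerical illustration of the model above, the helper below evaluates the bound for a given set of parameters. The example values passed in main() are arbitrary and are not the ones reported in Tables II and III.

    #include <stdio.h>

    /* Roofline bound: min(compute roof, memory roof), following the model above.
     * n_pes: number of processors; ipc: ops/cycle; fc_hz: clock frequency (Hz);
     * dma_bw: DMA bandwidth (bytes/s); opb: operational intensity (ops/byte). */
    static double roofline_gops(double n_pes, double ipc, double fc_hz,
                                double dma_bw, double opb)
    {
        double compute_roof = n_pes * ipc * fc_hz;   /* ops/s  */
        double memory_roof  = dma_bw * opb;          /* ops/s  */
        double bound = compute_roof < memory_roof ? compute_roof : memory_roof;
        return bound / 1e9;                          /* Gops/s */
    }

    int main(void)
    {
        /* Example numbers only (not taken from the paper's tables). */
        double gops = roofline_gops(64, 0.8, 430e6, 320e6, 2.0);
        printf("Roofline bound: %.2f Gops/s\n", gops);
        return 0;
    }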

B. Comparison Between OpenMP and OpenCL

Fig. 11. OpenMP versus OpenCL.

Our proposal aims at simplifying accelerator programming through the simple OpenMP directive-based programming style; a streamlined offload implementation aims at achieving identical performance to OpenCL. Fig. 11 shows execution time for OpenCL and OpenMP, normalized to OpenCL. We highlight the cost for offload, and the time spent in kernel execution. Offload is costlier for OpenCL, while kernel execution time seems longer for OpenMP. This is due to the difference in execution models. OpenCL completely demands parallelism creation for the accelerator on the host side (memory allocation for data buffers, thread creation/startup, etc.). Once the accelerator is started, no additional parallelism can be created without transferring the control back to the host, so the kernel execution time only includes benchmark-specific computation. For our OpenMP, an offload sequence only consists of transferring function and data pointers to the accelerator, but the offloaded function is a standard OpenMP program: computation starts on a single accelerator processor, and parallelism is created dynamically (similar to memory allocation). Overall, our extended OpenMP achieves very close performance to OpenCL, and is up to 10% faster in some cases. In general, the comparison between OpenMP and OpenCL is not straightforward, nor is it easy to generalize the results to different implementations/platforms. On one hand, this is due to the fact that OpenMP allows many more types and "flavors" of parallelism to be expressed than OpenCL, which ultimately impacts the way a program is written. On the other hand, the degree of optimization of the runtime support for a programming model on the target platform also impacts the relative results. In this experiment, we have kept the OpenMP and OpenCL parallelization schemes as similar as possible to mitigate the first effect. Moreover, the native runtime services used to implement the two programming models are the same, so the second effect is also mitigated. In the presence of a similar setup, our results can be broadly generalized to other similar platforms.

C. Comparison With the ARM Host

Fig. 12. Comparison between ARM host and STHORM execution (OpenMP).

Fig. 12 shows the speedup achieved by accelerating target kernels versus their sequential execution on the ARM host. On the x-axis, we report the number of times each benchmark is repeated. The higher the number of repetitions, the lower the impact of the initial offload cost, as most of the operations (e.g., program binary marshaling) need not be repeated for successive kernel executions. Clearly, the data used in different repetitions are different, but data marshaling can be overlapped with the execution of the previous kernel instance, which completely hides its cost in all the considered benchmarks. To estimate the achievable speedup in a realistic STHORM-based SoC, we also run the experiments on the STHORM simulator, Gepop. Gepop allows us to model a realistic bandwidth to the DRAM main memory, here set to 10 GB/s. Solid lines in Fig. 12 refer to results obtained on the board (BRD); dashed lines refer to Gepop (SIM). On average, on the real system (the STHORM board), our offload-enabled OpenMP achieves ≈16× speedup versus ARM sequential execution, and up to 30×. The experiments on the simulator suggest that a realistic channel for accelerator-to-DRAM communication increases these values to ≈28× speedup on average, and up to 35×.

VI. RELATED WORK

Heterogeneous systems have long been used to improve the energy efficiency of embedded SoCs. ARM has witnessed this trend in the past years, with products such as big.LITTLE [19] or the AMBA AXI4 interconnect [20]. Nowadays, it is widely accepted that heterogeneous integration is key to attack technology and utilization walls at nanoscale regimes. Numerous published results show the advantages of heterogeneous systems, indicating, for instance, an average execution time reduction of 41% in CMPs when compared to homogeneous counterparts, or 2× energy reduction when using specialized cores for representative applications [21]. Standardization initiatives such as the HSA foundation [13] also demonstrate a general consensus among industrial and academic players about the advantages of designing SoCs as heterogeneous systems.

In the context of multi- to many-core parallel processing, a plethora of programming models has seen the light in the past decade [22]. In particular, several researchers have explored OpenMP extensions: for dynamic power management [23], tasks with dependencies [24], explicitly managed memory hierarchies [25], etc. Focusing on heterogeneous programming, OpenCL attempts to standardize application development for accelerator-based systems, at the cost of a very low-level coding style. To simplify the programming interface, OpenACC [10] and PGI Accelerator [9] borrowed the directive-based programming style of OpenMP. The focus is still on GPU-like accelerators and loop-level parallelism.

Mitra et al. [26] describe an implementation of OpenMP v4.0 for the Texas Instruments Keystone II K2H heterogeneous platform. The proposed toolchain transforms OpenMP directives into an OpenCL program, thus insisting on a GPU-specific accelerator model. Similarly, Liao et al. [27] propose an OpenMP v4.0 implementation which is in essence a wrapper to the CUDA programming model, targeted at NVIDIA GPUs rather than shared memory accelerators. Ozen et al. [28] explore the roles of the programmer, the compiler, and the runtime system in OpenMP v4.0, trying to identify which features should be made transparent to application developers. However, the angle is simply that of specifying computational kernels in a more productive way, while the assumed offload model is still heavily biased toward GPU-like accelerators. In all these cases, the target architecture and the implemented execution model are thus very different from the ones we discuss in this paper.


Cramer et al. analyze the cost of extensions to OpenMP v4.0 for the Xeon Phi [29], similar to ours. The main differences are in the available hardware (HW) and SW stacks, and thus in the OpenMP implementation. Since the Xeon Phi is based on the same ISA as the host system, multi-ISA compilation is not necessary, and an OpenMP implementation can leverage standard full-fledged operating system services, different from STHORM and similar many-cores. A direct comparison to the latest OmpSs release [30] (which supports the target OpenMP v4.0 directive) is also not feasible, as the platforms they target are an Intel Xeon server (SMP, with 24 cores) and a machine with two NVIDIA GTX285 GPUs, which have very different HW and SW architectures from ours. Ayguadé [31] and White [32] also proposed OpenMP extensions to deal with heterogeneous systems. Their work is, however, mostly focused on syntax specification (and semantics definition), while implementation aspects and experiments are absent.

VII. CONCLUSION

In this paper, we have presented a programming model, compiler, and runtime system for a heterogeneous embedded system template featuring a general-purpose host processor coupled to a many-core accelerator. Our programming model is based on an extended version of OpenMP, where additional directives allow computation to be efficiently offloaded to the accelerator from within a single OpenMP host program. A multi-ISA compilation toolchain hides from the programmer the cumbersome details of outlining an accelerator program, compiling and loading it to the many-core, and implementing data sharing between the host and the accelerator. As a specific embodiment of the approach, we present an implementation for the STMicroelectronics STHORM development board. Our experimental results show that we achieve near-ideal throughput for most benchmarks, very close performance to hand-optimized OpenCL codes at a significantly lower programming complexity, and up to 30× speedup versus host execution time.

REFERENCES

[1] R. Ben Atitallah, E. Senn, D. Chillet, M. Lanoe, and D. Blouin, "An efficient framework for power-aware design of heterogeneous MPSoC," IEEE Trans. Ind. Informat., vol. 9, no. 1, pp. 487-501, Feb. 2013.
[2] Z. Zilic, P. Mishra, and S. K. Shukla, "Guest editors' introduction: Special section on system-level design and validation of heterogeneous chip multiprocessors," IEEE Trans. Comput., vol. 62, no. 2, pp. 209-210, Feb. 2013.
[3] J. Castrillon, R. Leupers, and G. Ascheid, "MAPS: Mapping concurrent dataflow applications to heterogeneous MPSoCs," IEEE Trans. Ind. Informat., vol. 9, no. 1, pp. 527-545, Feb. 2013.
[4] A. Schranzhofer, J.-J. Chen, and L. Thiele, "Dynamic power-aware mapping of applications onto heterogeneous MPSoC platforms," IEEE Trans. Ind. Informat., vol. 6, no. 4, pp. 692-707, Nov. 2010.
[5] D. Melpignano et al., "Platform 2012, a many-core computing accelerator for embedded SoCs: Performance evaluation of visual analytics applications," in Proc. 49th ACM/EDAC/IEEE Design Autom. Conf. (DAC'12), Jun. 2012, pp. 1137-1142.
[6] B. de Dinechin et al., "A clustered manycore processor architecture for embedded and accelerated applications," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC'13), Sep. 2013, pp. 1-6.
[7] L. Gwennap, "Adapteva: More flops, less watts," Microprocess. Rep., vol. 6, no. 13, pp. 11-02, 2011.
[8] A. Heinecke, M. Klemm, and H.-J. Bungartz, "From GPGPU to many-core: Nvidia Fermi and Intel Many Integrated Core architecture," Comput. Sci. Eng., vol. 14, no. 2, pp. 78-83, Mar. 2012.
[9] M. Wolfe, "Implementing the PGI accelerator model," in Proc. 3rd Workshop Gen. Purpose Comput. Graph. Process. Units, 2010, pp. 43-50.
[10] R. Reyes, I. Lopez, J. Fumero, and F. de Sande, "An early evaluation of the OpenACC standard," in Proc. Int. Conf. Comput. Math. Methods Sci. Eng., 2012, pp. 1024-1035.
[11] OpenMP Architecture Review Board. (2013). OpenMP Application Program Interface Version 4.0. [Online]. Available: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
[12] M. Noack, F. Wende, T. Steinke, and F. Cordes, "A unified programming model for intra- and inter-node offloading on Xeon Phi clusters," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC'14), Nov. 2014, pp. 203-214.
[13] L. Su, "Architecting the future through heterogeneous computing," in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Pap. (ISSCC'13), Feb. 2013, pp. 8-11.
[14] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65-76, Apr. 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498785
[15] A. Gara et al., "Overview of the Blue Gene/L system architecture," IBM J. Res. Develop., vol. 49, no. 2.3, pp. 195-212, 2005.
[16] Nvidia Inc. (2012). NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. [Online]. Available: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

[17] P. Burgio, G. Tagliavini, F. Conti, A. Marongiu, and L. Benini, "Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters," in Proc. Des. Autom. Test Eur. Conf. Exhibit. (DATE'14), Mar. 2014, pp. 1-6.
[18] A. Marongiu, P. Burgio, and L. Benini, "Fast and lightweight support for nested parallelism on cluster-based embedded many-cores," in Proc. Des. Autom. Test Eur. Conf. Exhibit. (DATE'12), Mar. 2012, pp. 105-110.
[19] P. Greenhalgh, "Big-Little processing with ARM Cortex-A15 & Cortex-A7," ARM White Paper, 2011, pp. 1-8. [Online]. Available: http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf
[20] A. Stevens, "Introduction to AMBA 4 ACE," ARM White Paper, Jun. 2011. [Online]. Available: http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_6June2011.pdf
[21] V. Saripalli, G. Sun, A. Mishra, Y. Xie, S. Datta, and V. Narayanan, "Exploiting heterogeneity for energy efficiency in chip multiprocessors," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 1, no. 2, pp. 109-119, Jun. 2011.
[22] J. Diaz, C. Munoz-Caro, and A. Nino, "A survey of parallel programming models and tools in the multi and many-core era," IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 8, pp. 1369-1386, Aug. 2012.
[23] Y.-S. Hwang and K.-S. Chung, "Dynamic power management technique for multicore based embedded mobile devices," IEEE Trans. Ind. Informat., vol. 9, no. 3, pp. 1601-1612, Aug. 2013.
[24] P. Larsen, S. Karlsson, and J. Madsen, "Expressing coarse-grain dependencies among tasks in shared memory programs," IEEE Trans. Ind. Informat., vol. 7, no. 4, pp. 652-660, Nov. 2011.
[25] A. Marongiu and L. Benini, "An OpenMP compiler for efficient use of distributed scratchpad memory in MPSoCs," IEEE Trans. Comput., vol. 61, no. 2, pp. 222-236, Feb. 2012.
[26] G. Mitra, E. Stotzer, A. Jayaraj, and A. P. Rendell, "Implementation and optimization of the OpenMP accelerator model for the TI Keystone II architecture," in Using and Improving OpenMP for Devices, Tasks, and More. New York, NY, USA: Springer, 2014, pp. 202-214.
[27] C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. Chapman, "Early experiences with the OpenMP accelerator model," in OpenMP in the Era of Low Power Devices and Accelerators. New York, NY, USA: Springer, 2013, pp. 84-98.
[28] G. Ozen, E. Ayguadé, and J. Labarta, "On the roles of the programmer, the compiler and the runtime system when programming accelerators in OpenMP," in Using and Improving OpenMP for Devices, Tasks, and More. New York, NY, USA: Springer, 2014, pp. 215-229.
[29] T. Cramer, D. Schmidl, M. Klemm, and D. an Mey, "OpenMP programming on Intel Xeon Phi coprocessors: An early performance comparison," in Proc. Many Core Appl. Res. Community (MARC) Symp., 2012, pp. 38-44.
[30] A. Duran et al., "OmpSs: A proposal for programming heterogeneous multi-core architectures," Parallel Process. Lett., vol. 21, no. 2, pp. 173-193, 2011.
[31] E. Ayguadé et al., "A proposal to extend the OpenMP tasking model for heterogeneous architectures," in Evolving OpenMP in an Age of Extreme Parallelism. New York, NY, USA: Springer, 2009, pp. 154-167.
[32] L. White, "OpenMP extensions for heterogeneous architectures," in OpenMP in the Petascale Era. New York, NY, USA: Springer, 2011, pp. 94-107.

Andrea Marongiu (M'12) received the M.S. degree in electronic engineering from the University of Cagliari, Cagliari, Italy, in 2006, and the Ph.D. degree in electronic engineering from the University of Bologna, Bologna, Italy, in 2010. He is currently a Postdoctoral Researcher with the University of Bologna. He also holds a Postdoctoral position with the Swiss Federal Institute of Technology Zurich, Zurich, Switzerland. He has authored more than 50 papers in peer-reviewed international journals and conferences. His research interests concern parallel programming model and architecture design in the single-chip multiprocessors domain, with special emphasis on compilation for heterogeneous architectures, efficient usage of on-chip memory hierarchies, and system-on-chip virtualization.

Alessandro Capotondi (S'15) received the M.S. degree in computer engineering from the University of Bologna, Bologna, Italy, in 2012. He is currently pursuing the Ph.D. degree in electrical, electronic, and information engineering at the University of Bologna. His research interests include parallel programming models and code optimization for heterogeneous many-core architectures.

Giuseppe Tagliavini received the M.S. degree in computer engineering from the University of Bologna, Bologna, Italy, in 2010. He is currently pursuing the Ph.D. degree in electrical, electronic, and information engineering at the University of Bologna. His research interests include parallel programming models, run-time support for many-core accelerators, and compile-time optimization.

Luca Benini (S'94-M'97-SM'04-F'07) received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 1997. He is a Professor of Digital Circuits and Systems with the Swiss Federal Institute of Technology Zurich, Zurich, Switzerland, and is also a Professor with the University of Bologna, Bologna, Italy. He has authored more than 700 papers in peer-reviewed international journals and conferences, four books, and several book chapters. His research interests include energy-efficient system design and multicore system-on-chip design, and energy-efficient smart sensors and sensor networks for biomedical and ambient intelligence applications. Mr. Benini is a member of the Academia Europaea.