Simplifying Many-Core-Based Heterogeneous SoC Programming With Offload Directives
Andrea Marongiu, Member, IEEE, Alessandro Capotondi, Student Member, IEEE, Giuseppe Tagliavini, and Luca Benini, Fellow, IEEE

Abstract—Multiprocessor systems-on-chip (MPSoCs) are evolving into heterogeneous architectures based on one host processor plus many-core accelerators. While heterogeneous SoCs promise higher performance/watt, they are programmed at the cost of major code rewrites with low-level programming abstractions (e.g., OpenCL). We present a programming model based on OpenMP, with additional directives to program the accelerator from a single host program. As a test case, we evaluate an implementation of this programming model for the STMicroelectronics STHORM development board. We obtain near-ideal throughput for most benchmarks, performance very close to that of hand-optimized OpenCL codes at a significantly lower programming complexity, and up to 30× speedup versus host execution time.

Index Terms—Heterogeneous systems-on-chip (SoC), many-core, nonuniform memory access (NUMA), OpenMP.

I. INTRODUCTION

The ever-increasing demand for computational power within tight energy budgets has recently led to radical evolutions of multiprocessor systems-on-chip (MPSoCs). Two design paradigms have proven particularly effective in increasing the performance and energy efficiency of such systems: 1) architectural heterogeneity and 2) many-core processors. Power-aware design [1], [2] and mapping [3], [4] for heterogeneous MPSoCs are being widely studied by the research community, and many-core-based heterogeneous MPSoCs are now a reality [5]–[8].

A common embodiment of architectural heterogeneity is a template in which a powerful general-purpose processor (usually called the host), featuring a sophisticated cache hierarchy and a full-fledged operating system, is coupled to programmable many-core accelerators composed of several tens of simple processors, onto which highly parallel computation kernels of an application can be offloaded to improve overall performance/watt. Unfortunately, these advantages are traded off for increased programming complexity: an extensive and time-consuming rewrite of applications is required, using specialized programming paradigms. For example, in the general-purpose and high-performance computing (HPC) domains, graphics processing unit (GPU)-based systems are programmed with OpenCL (http://www.khronos.org/opencl). OpenCL aims at providing a standardized way of programming such accelerators; however, it imposes a very low-level programming style. Programmers must write and compile separate programs for the host system and for the accelerator. Data transfers to/from the many-core and synchronization must also be orchestrated manually. In addition, OpenCL is not performance-portable: specific optimizations have to be recoded for each target accelerator. This generates the need for a higher-level programming style [9], [10]. OpenMP [11] has included extensions to manage accelerators in its latest specification, and Xeon Phi coprocessors offer all the standard programming models [OpenMP, Pthreads, message passing interface (MPI)] [12], where the accelerator appears as a symmetric multiprocessor (SMP) on a single chip.
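To make the contrast concrete, the following minimal host-side sketch (illustrative, not taken from the paper) shows the boilerplate that a single OpenCL offload typically requires; the kernel, buffer sizes, and device selection are placeholders, and error handling is omitted.

    #include <CL/cl.h>

    /* The accelerator code is a separate program, carried around as a
       string and built at run time. */
    static const char *kernel_src =
        "__kernel void scale(__global float *d) {"
        "    size_t i = get_global_id(0);"
        "    d[i] = d[i] * 2.0f;"
        "}";

    void offload_scale(float *data, size_t n)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        /* Data transfers and synchronization are orchestrated by hand. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), data, 0, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), data, 0, NULL, NULL);
    }

Even for a three-line kernel, the setup code dwarfs the computation, and the kernel-level optimizations must still be retuned for each device.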
In the embedded domain, such proposals are still lacking, but there is a clear trend toward designing embedded SoCs in the same way as is happening in the HPC domain [13], which will eventually call for the same programming solutions.

In this paper, we present a programming model, compiler, and runtime system for a heterogeneous embedded platform template featuring a host system plus a many-core accelerator. The many-core relies on a multicluster design, where each cluster features several simple cores sharing an L1 scratchpad memory. Intercluster communication is subject to nonuniform memory access (NUMA) effects. The programming model is an extended OpenMP, whose additional directives make it possible to program the accelerator efficiently from a single host program, rather than writing separate host and accelerator programs, and to distribute the workload among clusters in a NUMA-aware manner, thus improving performance. The proposed OpenMP extensions are only partly in line with the latest OpenMP v4.0 specification. The latter is, in our view, too tailored to the characteristics of today's GPUs, as it emphasizes data-level parallelism (modern GPUs being conceived for that) and copy-based host-to-accelerator communication (modern GPUs being based on private-memory designs). Our focus is instead on many-core accelerators that efficiently support more types of parallelism (e.g., tasks) and leverage shared-memory communication with the host, which is where the heterogeneous system architecture (HSA) and all GPU roadmaps are heading in the longer term. We discuss how to provide efficient communication with the host on top of shared memory: by transparently relying on pointer exchange when virtual memory paging is natively supported by the many-core, and by leveraging software virtual address translation plus copies into contiguous shared memory (to overcome paging issues) when such support is lacking. We also comment on how copies can be used to implement offload on top of a private accelerator memory space. To achieve our goals, we propose minimal extensions to the previous OpenMP v3.1 specification, emphasizing ease of programming.
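For contrast with the OpenCL sketch above, the sketch below illustrates the single-source, directive-based style advocated here. Since this excerpt does not show the authors' actual directive syntax (their extensions build on OpenMP v3.1 and only partly overlap v4.0), the standard OpenMP 4.0 target construct is used as a stand-in.

    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        /* One annotated region in the host program replaces the separate
           kernel source, the manual buffer management, and the explicit
           transfers that OpenCL requires. On a shared-memory SoC the map()
           clauses can degenerate to plain pointer exchange, with no copies. */
        #pragma omp target map(to: a, b) map(from: c)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[0] = %f\n", c[0]);
        return 0;
    }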
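The two data-sharing strategies just described boil down to one runtime decision, sketched below. The helper names (accel_supports_paging, contig_alloc, host_to_accel_addr) are hypothetical placeholders, not the paper's API; the stubs exist only to make the sketch compile.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical runtime hooks: placeholders, not the paper's API. */
    static int accel_supports_paging(void) { return 0; }                  /* stub */
    static void *contig_alloc(size_t size) { return malloc(size); }       /* stub */
    static uintptr_t host_to_accel_addr(void *p) { return (uintptr_t)p; } /* stub */

    /* Make a host buffer visible to an offloaded kernel; returns the
       address the accelerator should dereference. */
    uintptr_t share_buffer(void *host_buf, size_t size)
    {
        if (accel_supports_paging()) {
            /* The accelerator can use host virtual addresses directly:
               a plain pointer exchange, no copy. */
            return (uintptr_t)host_buf;
        }
        /* No paging support: copy into contiguous shared memory and
           translate the address in software. */
        void *shadow = contig_alloc(size);
        memcpy(shadow, host_buf, size);
        return host_to_accel_addr(shadow);
    }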
For validation, we present a concrete embodiment of our proposal targeting the first STMicroelectronics STHORM development board [5]. This board couples an ARM9 host system and main memory (based on the Zynq7 device) to a 69-core STHORM chip. We present a multi-ISA compilation toolchain that hides the entire process of outlining an accelerator program from the host application, compiling it for the STHORM platform, offloading the execution binary, and implementing data sharing between the host and the accelerator. Two separate OpenMP runtime systems are developed, one for the host and one for the STHORM accelerator.

Our experiments thoroughly assess the performance of the proposed programming framework, considering six representative benchmarks from the computer vision, image processing, and linear algebra domains. The evaluation is articulated in three parts.
1) We relate the achieved throughput to each benchmark's operational intensity using the roofline methodology [14] (see the worked roofline example at the end of this section). Here, we observe near-ideal throughput for most benchmarks.
2) We compare the performance of our OpenMP to OpenCL, natively supported by the STHORM platform, achieving performance very close to that of hand-optimized OpenCL codes at a significantly lower programming complexity.
3) We measure the speedup of our OpenMP versus sequential execution on the ARM host, which exhibits peaks of 30×.

This paper is organized as follows. In Section II, we describe the target heterogeneous embedded [...]

[Fig. 1. Heterogeneous embedded SoC template.]

[...] where critical computation kernels of an application can be offloaded to improve overall performance/watt [5]–[8], [15]. The type of many-core accelerator that we consider here has a few key characteristics.
1) It leverages a multicluster design to overcome scalability limitations [5], [6]. Processors within a cluster are tightly coupled to a local L1 scratchpad memory, which implies low-latency, high-bandwidth communication. Globally, the many-core accelerator leverages a partitioned global address space (PGAS). Every remote memory can be directly accessed by each processor, but intercluster communication travels through a network-on-chip (NoC) and is subject to NUMA latency and bandwidth.
2) The processors within a cluster are not GPU-like data-parallel cores with common fetch/decode phases, which imply a performance loss when parallel cores execute out of lock-step mode. The accelerator processors considered here are simple, independent RISC cores, perfectly suited to execute both single instruction, multiple data (SIMD) and multiple instruction, multiple data (MIMD) types of parallelism. This makes it possible to efficiently support a [...]
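As a closing note on part 1) of the evaluation outline above: the roofline methodology [14] bounds attainable throughput by the lower of the machine's peak compute rate and the memory bandwidth times the kernel's operational intensity,

    P(I) = \min\left( P_{\text{peak}},\; \beta \cdot I \right)

where I is the operational intensity (operations per byte of memory traffic) and \beta is the peak memory bandwidth. With purely illustrative numbers (not STHORM's): for \beta = 1 GB/s and P_{\text{peak}} = 10 Gops/s, a kernel with I = 2 ops/byte is bandwidth-bound at 2 Gops/s, while any kernel with I \geq 10 ops/byte can reach the 10 Gops/s compute peak.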