Optimizing Memory Bandwidth Exploitation for OpenVX Applications on Embedded Many-Core Accelerators
Real-Time Image Processing manuscript No.
(will be inserted by the editor)

Giuseppe Tagliavini · Germain Haugou · Andrea Marongiu · Luca Benini

Optimizing Memory Bandwidth Exploitation for OpenVX Applications on Embedded Many-Core Accelerators

Received: date / Revised: date

Abstract In recent years image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution to efficiently execute highly parallel kernels. However, architectural constraints impose hard limits on the main memory bandwidth and push for software techniques which optimize the memory usage of complex multi-kernel applications. In this work we propose a set of techniques, mainly based on graph analysis and image tiling, targeted at accelerating the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant with the OpenVX standard, based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces the recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to a massive reduction of execution time and bandwidth, even when the main memory bandwidth available to the accelerator is severely constrained.

Keywords OpenVX, OpenCL, embedded computer vision, bandwidth reduction, many-core accelerators

This work has been supported by the EU-funded research projects P-SOCRATES (g.a. 611016) and MULTITHERMAN (g.a. 291125).

Giuseppe Tagliavini
Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
E-mail: [email protected]

Germain Haugou
Integrated System Laboratory, ETH Zurich, Switzerland
E-mail: [email protected]

Andrea Marongiu
Integrated System Laboratory, ETH Zurich, Switzerland
Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
E-mail: [email protected]

Luca Benini
Integrated System Laboratory, ETH Zurich, Switzerland
Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
E-mail: [email protected]

1 Introduction

The evolution of imaging sensors and the growing requirements of applications are pushing hardware platform developers to incorporate advanced image processing capabilities into a wide range of embedded systems, ranging from smartphones to wearable devices. In particular, we have focused our attention on three main classes of computationally intensive image processing tasks: Embedded Computer Vision (ECV) [14], brain-inspired visual processing [37], and computational photography [21].

Considering the current market trend toward HD formats and real-time video analysis, these algorithms require hardware acceleration. Pushed by the need for extreme energy efficiency, embedded systems are embracing architectural heterogeneity, where a multi-core host processor is coupled with programmable accelerators specialized for various application domains. Several companies and research groups are exploring alternative solutions, ranging from special functional units on the host CPU [39], to embedded GPGPUs [32], to dedicated vector units [35], to many-core accelerators [4].

Many-core accelerators provide tens to hundreds of small processing elements (PEs), typically organized in clusters sharing on-chip L1 memory and communicating via low-latency, high-throughput on-chip interconnections. Some examples of accelerators featuring this architectural paradigm are STM STHORM [4], Plurality HAL [38], KALRAY MPPA [24], Adapteva Epiphany-IV [1] and PULP [10]. In these architectures the PEs are simpler w.r.t. common multi-core architectures and offer a good trade-off between highly parallel computation and power consumption, so they are a promising target for running image processing workloads.

Many-core accelerators differ from GPGPUs in two main traits. First, PEs are not restricted to running the same instruction on different data, in an effort to improve the execution efficiency of branch-rich computations and to support a more flexible workload-to-PE distribution. Second, embedded many-core accelerators do not rely on massive multithreading to hide memory latency; they rely instead on DMA engines and double buffering, which give more control over the bandwidth vs. latency trade-off but require more programming effort.

From the software viewpoint, a number of programming models for GPGPUs and many-core accelerators have been proposed in recent years [49]. Among others, the very successful Khronos OpenCL standard introduces platform and execution models which are particularly suitable for programming at the emerging intersection between multi-core CPUs and many-core accelerators. OpenCL combines a task-parallel programming model, with a run-to-completion approach, and data parallelism, supported by global synchronization mechanisms. An OpenCL application runs on the host processor and distributes kernels to computing devices. Kernels are programmed in the OpenCL C language, which is based on the C99 standard. On the host side, applications are written in C/C++ and invoke standard API calls to orchestrate the distribution and execution of kernels on devices, using a mechanism based on command queues.

A common issue when using OpenCL on embedded systems is the mandatory use of the global memory space to share intermediate data between kernels. As the number of interacting kernels grows, the main memory bandwidth required to fulfill the data requests originated by the PEs becomes much higher than the available one, causing a bottleneck. In addition, unlike accelerators for desktop computing environments (e.g., the Intel Many Integrated Core architecture [23]), SoCs have unified host and global memory spaces and a common data path connecting host processor and accelerator to L3 memory [42]. As a direct consequence, applications experience high contention for off-chip memory access, which may severely limit the final speed-up.

For instance, consider a platform with an accelerator and a DDR3-1600 memory (6400MB/s per channel). If we reasonably assume that the accelerator gets half of the available bandwidth (3200MB/s) and our application needs to process a 1920x1080 video source at 60fps, then a single image access uses 1920 x 1080 x 60 B/s ≈ 124.4MB/s, assuming one byte per pixel. Hence, the available bandwidth is saturated after accessing just 26 image buffers in one frame time, which may not be enough for a complex multi-kernel application.

Overall, DMA is difficult to use, but it can be managed quite easily if its usage is limited to single-frame copies. However, this approach cannot be generalized, because (i) there is not enough internal memory even to keep a single image, and (ii) the memory bandwidth is easily saturated in the presence of multiple kernels. DMA can instead be used for tiled transfers, using the async_work_group_copy function to handle asynchronous copies between global and local memory and vice versa. In this case programmers need to interleave DMA and computation, which is much more complicated: for basic kernels, the lines of code devoted to DMA orchestration and the subsequent workload distribution exceed 50% of the total [2]. For applications composed of multiple kernels, this normally implies multiple tile sizes, because each kernel may require a different tile. At that point, managing the orchestration by hand becomes totally unmanageable, and programmers need tools. This is exactly what our approach provides.
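To give a concrete idea of the manual effort involved, the following OpenCL C sketch shows the double-buffered tiling pattern described above for a trivial pointwise operation (pixel inversion). It is an illustrative example only, not code from our framework: the tile constants are assumptions, the image height is assumed to be a multiple of the tile height, the width is assumed not to exceed MAX_W, and all error handling is omitted.

    /* Illustrative sketch only: manual double-buffered tiling with
     * async_work_group_copy for a pointwise kernel (pixel inversion). */
    #define TILE_ROWS 8
    #define MAX_W     512

    __kernel void invert_tiled(__global const uchar *in,
                               __global uchar *out,
                               int width, int height)
    {
        __local uchar tin[2][TILE_ROWS * MAX_W];   /* input double buffer  */
        __local uchar tout[2][TILE_ROWS * MAX_W];  /* output double buffer */
        event_t ev_in[2], ev_out[2];

        const int ntiles = height / TILE_ROWS;
        const size_t npix = (size_t)TILE_ROWS * width;

        /* Prefetch the first tile from global (off-chip) to local memory. */
        ev_in[0] = async_work_group_copy(tin[0], in, npix, 0);

        for (int t = 0; t < ntiles; ++t) {
            const int cur = t & 1, nxt = cur ^ 1;

            /* Start fetching tile t+1 while tile t is being processed. */
            if (t + 1 < ntiles)
                ev_in[nxt] = async_work_group_copy(tin[nxt],
                                 in + (size_t)(t + 1) * npix, npix, 0);

            wait_group_events(1, &ev_in[cur]);
            if (t >= 2)  /* reusing tout[cur]: wait for its last write-back */
                wait_group_events(1, &ev_out[cur]);

            /* Compute: each work-item handles a strided slice of the tile. */
            for (size_t i = get_local_id(0); i < npix; i += get_local_size(0))
                tout[cur][i] = (uchar)(255 - tin[cur][i]);

            barrier(CLK_LOCAL_MEM_FENCE);  /* tile fully computed */

            /* Write the processed tile back to global memory. */
            ev_out[cur] = async_work_group_copy(out + (size_t)t * npix,
                                                tout[cur], npix, 0);
        }

        /* Drain the last one or two outstanding write-backs. */
        for (int t = (ntiles > 2) ? ntiles - 2 : 0; t < ntiles; ++t)
            wait_group_events(1, &ev_out[t & 1]);
    }

Even in this pointwise case, buffer rotation and event handling dominate the code; a stencil kernel would additionally need halo rows, and a multi-kernel graph would need a consistent tile size per kernel, which is precisely the orchestration our framework automates.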
Other programming paradigms have been proposed to implement image processing applications on embedded systems, based on data-flow graphs [48][15] or functional models [40]. Overall, these solutions tackle the issue of memory bandwidth using a tiling-based approach, but in most cases the related execution models are not suitable for building complex applications which include irregular algorithms and data access patterns.

In this work we introduce a framework that implements a set of optimizations specifically targeted at accelerating the execution of graph-based image processing applications on many-core accelerators. The framework front-end is based on the OpenVX standard [26]. OpenVX is a cross-platform API which aims at enabling hardware vendors to implement and optimize low-level image processing primitives, with a strong focus on mobile and embedded systems. In our framework, data accesses are performed on local buffers in the L1 scratchpad memory of the reference architecture, which is what we call localized execution. To satisfy this condition, we have taken into account all the data access patterns that can be found in the OpenVX standard-defined kernels (e.g., local and statistical operators), which are also the most common in image processing algorithms, and we have defined a set of techniques to support automatic image tiling. Coupling tiling with double buffering, we achieve a good overlap between data communication and kernel execution on the accelerator, which guarantees higher efficiency in terms of PE usage.
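From the application programmer's perspective, none of this machinery is visible: the workload is expressed as a standard OpenVX graph. The following minimal host-side sketch (with an illustrative two-node pipeline and image size, not one of our benchmarks) shows the structure such applications share; in particular, intermediate results are declared as virtual images, which have no host-visible backing store and can therefore be kept on-chip as tiles.

    /* Illustrative sketch of a standard OpenVX host program. */
    #include <VX/vx.h>

    int main(void)
    {
        vx_context ctx   = vxCreateContext();
        vx_graph   graph = vxCreateGraph(ctx);

        /* Input and output images are regular images in global memory. */
        vx_image in  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
        vx_image out = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);

        /* The intermediate edge of the graph is a virtual image: it has
         * no host-visible backing store, so a runtime is free to keep it
         * in L1 as a sequence of tiles instead of a full off-chip buffer. */
        vx_image tmp = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);

        /* Two standard-defined local operators chained through the graph. */
        vxGaussian3x3Node(graph, in, tmp);
        vxMedian3x3Node(graph, tmp, out);

        /* Verification exposes the whole graph to the runtime at once,
         * which can then derive tile sizes, buffer placement and DMA
         * scheduling before execution starts. */
        if (vxVerifyGraph(graph) == VX_SUCCESS)
            vxProcessGraph(graph);   /* execute the complete pipeline */

        vxReleaseContext(&ctx);      /* also releases the graph and images */
        return 0;
    }

Since the whole graph is known at vxVerifyGraph time, producer-consumer relations between kernels can be analyzed ahead of execution, which is the hook our graph-analysis and tiling techniques rely on.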
The novelty of our work derives from three main contributions: (i) the introduction of a low-level OpenCL extension that can be effectively used to support efficient execution of graph-structured workloads with ex-