Enabling OpenVX support in mW-scale parallel accelerators
Giuseppe Tagliavini, DEI, University of Bologna ([email protected])
Germain Haugou, IIS, ETH Zurich ([email protected])
Andrea Marongiu and Luca Benini, DEI, University of Bologna and IIS, ETH Zurich ({a.marongiu,l.benini}@iis.ee.ethz.ch)

ABSTRACT

mW-scale parallel accelerators are a promising target for application domains such as the Internet of Things (IoT), which require strict compliance with a limited power budget combined with high performance capabilities. An important use case is given by smart sensing devices featuring increasingly sophisticated vision capabilities, at the cost of an increasing amount of near-sensor computation power. OpenVX is an emerging standard for embedded vision, and provides a C-based application programming interface and a runtime environment. OpenVX is designed to maximize functional and performance portability across diverse hardware platforms. However, state-of-the-art implementations rely on memory-hungry data structures, which cannot be supported in constrained devices. In this paper we propose an alternative and novel approach to providing OpenVX support in mW-scale parallel accelerators. Our main contributions are: (i) an extension to the original OpenVX model to support static management of application graphs in the form of binary files; (ii) the definition of a companion runtime environment providing lightweight support for executing binary graphs in a resource-constrained environment. Our approach achieves a 68% memory footprint reduction and a 3x execution speed-up compared to a baseline implementation. At the same time, data memory bandwidth is reduced by 10% and energy efficiency is improved by 2x.

1. INTRODUCTION
Market forecasts [14][6] report that there will be more than 25 billion Internet-of-Things (IoT) devices by 2020. A key enabler for the IoT is the development of next-generation smart sensors, which combine standard analog/digital transducers and communication interfaces with powerful data-processing capabilities. In its early days, the IoT paradigm was characterized by the critical role of cloud infrastructures as providers of computational power, and its terminal devices were relatively unsophisticated. With the advent of smart sensors this trend is evolving rapidly [5] towards the edge computing paradigm, which moves the execution environment of applications away from centralized nodes and towards the logical extremes of the network. This is made possible by recent designs for IoT devices, which combine complex processing capabilities with ultra-low-power operation. Sophisticated data manipulation is thus enabled at the edges of the cloud, significantly reducing the amount of data to be sent through bandwidth-limited communication interfaces (e.g., the serial peripheral interface).

To deliver the required performance/watt targets, the most promising platforms available on the market focus on a class of heterogeneous systems that couple a microcontroller unit (MCU) with a programmable, mW-scale parallel accelerator [4][1][2][10][33]. The adoption of ultra-low-power parallel accelerators as co-processors provides a hundred-fold increase in OPS/W [9] compared to state-of-the-art microcontrollers. The advent of such devices, combined with the widespread diffusion of miniaturized cameras [7], is becoming the key enabler for building sensor nodes capable of running computer vision (CV) workloads, which will be at the heart of tomorrow's most ambitious frontier of the IoT (smart cities, the internet of vehicles [3]).

Similar to what has happened in high-end heterogeneous systems, programmability will become a key issue in this domain as well. The adoption of mainstream programming paradigms is an appealing solution, but it is complicated by the constrained nature of IoT designs (power, memory, etc.). OpenVX [20] represents the state of the art for embedded vision programming, as witnessed by its widespread adoption in commercial products. It provides a C-based application programming interface (API) and a runtime environment (RTE) aimed at enabling optimized implementations of numerous CV algorithms on embedded systems. OpenVX relies on a graph-based execution model, which simplifies programmability by exposing basic components and their relations to application developers.

In the most common case, an OpenVX framework relies on a graph-based RTE, assuming that a data structure describing the graph is allocated in the accelerator memory [38]. A key trait of mW-scale accelerators is the strongly limited amount of available on-chip memory, which requires dedicated memory management techniques to enable the efficient execution of graphs of arbitrarily large size. To this aim, OpenVX supports data tiling [17], a well-known technique that exploits spatial locality in a program by partitioning large data structures into smaller chunks that are brought in and out of the target memory via DMA transfers. When tiling is applied, the number of nodes in the graph data structure becomes much larger than the number of application kernels, due to the tiles and to the computation artifacts generated by their presence (e.g., image border, corner or inner tiles, double buffering). Consequently, the scarce amount of on-chip memory can be insufficient to contain both application and RTE data and code.

In this paper we propose a novel approach to provide OpenVX support in mW-scale parallel accelerators. Experimenting with real-life applications, we realized that standard data management techniques (e.g., tiling and double buffering) are not enough to meet the stringent memory constraints outlined for this category of devices, and we had to rethink the way such an execution model could be supported. Our proposal leverages a new API to support static management of application graphs. An OpenVX graph can be created, verified and executed on the developer workstation using the standard OpenVX execution flow, which guarantees full portability across all platforms for which an OpenVX framework is available.
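To make this flow concrete, the following minimal sketch builds, verifies and executes a two-node OpenVX graph on a workstation. It uses only standard OpenVX 1.x API calls; the image size, the choice of kernels and the omission of error checking are simplifications for illustration purposes.

    #include <VX/vx.h>

    int main(void)
    {
        vx_context ctx = vxCreateContext();
        vx_graph graph = vxCreateGraph(ctx);

        /* Virtual images exist only inside the graph, so the runtime is
           free to tile them and never materialize them in full. */
        vx_image in  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
        vx_image tmp = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
        vx_image out = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);

        /* Two-node pipeline: Gaussian smoothing followed by a 3x3 median. */
        vxGaussian3x3Node(graph, in, tmp);
        vxMedian3x3Node(graph, tmp, out);

        /* Verification checks types and connectivity and lets the runtime
           plan its internal data structures; it must precede execution. */
        if (vxVerifyGraph(graph) == VX_SUCCESS)
            vxProcessGraph(graph);

        vxReleaseImage(&in);
        vxReleaseImage(&tmp);
        vxReleaseImage(&out);
        vxReleaseGraph(&graph);
        vxReleaseContext(&ctx);
        return 0;
    }

In the approach proposed here, a graph verified in this way would additionally be exported to a platform-dependent binary through the new API extension, so that the on-device runtime never has to build or traverse the graph data structure itself.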
Once a program has been developed and tested following this standard flow, the proposed API extension makes it possible to save the resulting graph as a static representation in a platform-dependent binary file. The OpenVX runtime data structure management is replaced by a control-code generation approach, which reduces the total RTE footprint (code + data) to a few kilobytes. This approach also drastically reduces platform energy consumption, by replacing costly and frequent accesses to the runtime data with cheaper control instructions (ALU operations). In addition, it makes effective use of low-bandwidth data channels by means of software techniques aimed at maximizing computation locality (thus minimizing the demand for external bandwidth). The control-code generation is complemented by a minimal RTE design for the target platform, called milliVX, which includes a small subset of the OpenVX specification to read a generated binary and offload its execution to the accelerator.

It is important to underline that our approach takes advantage of the static structure extracted from an OpenVX program to optimize the execution stage in terms of memory footprint and execution time, while at the same time fully preserving the dynamic features of the original OpenVX standard, namely graph updates and node callbacks. Graph updates are dynamic modifications to the graph data structure, generating a different graph that requires a further verification stage before its execution. Node callbacks are used to control the graph execution flow by calling functions upon termination of a particular node (a minimal example is sketched at the end of this section). milliVX provides low-cost yet full-fledged support for these features.

To assess our approach, we provide a reference implementation of the OpenVX extension and the milliVX specification using a publicly available research tool [38] for OpenVX development on ultra-low-power parallel accelerators [33]. Experimental results show that our approach achieves a 68% memory footprint reduction and a 3x execution speed-up compared to a baseline implementation.

Figure 1: Heterogeneous architecture model

Figure 1 shows the generalized architecture model considered in the rest of this work. It consists of an MCU host coupled with a multi-core mW-scale accelerator. The host is the main control unit of the full system, and it has the option to offload computation-intensive workloads to the accelerator. The link between the MCU and the accelerator uses the SPI protocol, a common interface for off-the-shelf MCUs which fully satisfies mW-scale power constraints. External sensors and communication channels are managed by the MCU, while data to/from the accelerator are stored in the external memory.

The accelerator is a parallel platform featuring a number n of PEs, which are fully independent cores supporting multiple-instruction multiple-data (MIMD) parallelism. Each core is equipped with an instruction cache (I$), which can be private or shared. To avoid memory coherency overhead and to increase energy efficiency, the PEs do not have private data caches; instead, they share an L1 scratchpad memory (SCM). [...] an L1 data memory and a DMA engine to enable data transfers to and from higher memory levels. Peripherals (e.g., SPI) and higher memory levels (e.g., L2 or DDR) are accessible via off-chip communication channels.
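The tiling scheme discussed in the introduction maps directly onto this memory hierarchy: tiles are moved between external memory and the shared SCM by the DMA engine, typically with double buffering so that the transfer of the next tile overlaps the processing of the current one. The following sketch illustrates the pattern; dma_in_async, dma_out_async and dma_wait are hypothetical non-blocking DMA primitives standing in for whatever the target platform provides, and the tile geometry is arbitrary.

    #include <stdint.h>

    #define TILE_BYTES 2048
    #define N_TILES    64

    /* Hypothetical non-blocking DMA primitives (placeholder names). */
    extern int  dma_in_async(void *l1_dst, const void *ext_src, uint32_t size);
    extern int  dma_out_async(void *ext_dst, const void *l1_src, uint32_t size);
    extern void dma_wait(int transfer_id);

    extern void process_tile(const uint8_t *in, uint8_t *out, uint32_t size);

    /* Double-buffered tile loop: while tile i is processed out of one
       buffer pair, tile i+1 is prefetched into the other, hiding DMA
       latency behind computation. The buffers are assumed to reside in
       the L1 SCM (e.g., placed there via a linker section). */
    void run_tiled(const uint8_t *ext_in, uint8_t *ext_out)
    {
        static uint8_t in_buf[2][TILE_BYTES];
        static uint8_t out_buf[2][TILE_BYTES];

        int fetch = dma_in_async(in_buf[0], ext_in, TILE_BYTES);

        for (uint32_t i = 0; i < N_TILES; i++) {
            uint32_t cur = i & 1;
            dma_wait(fetch); /* tile i is now in L1 */

            if (i + 1 < N_TILES) /* prefetch the next tile */
                fetch = dma_in_async(in_buf[cur ^ 1],
                                     ext_in + (i + 1) * TILE_BYTES, TILE_BYTES);

            process_tile(in_buf[cur], out_buf[cur], TILE_BYTES);

            /* For brevity the write-back is synchronous; a full version
               would also double-buffer the output transfers. */
            dma_wait(dma_out_async(ext_out + i * TILE_BYTES,
                                   out_buf[cur], TILE_BYTES));
        }
    }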
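As for the node callbacks mentioned earlier, the standard OpenVX API exposes them through vxAssignNodeCallback, and this is the interface milliVX has to preserve. A minimal sketch follows (standard OpenVX 1.x; the node itself is assumed to be created elsewhere, and the callback name is illustrative):

    #include <VX/vx.h>

    /* Invoked by the runtime when the node completes; the return value
       tells the runtime whether to continue with the rest of the graph
       or abandon its execution. */
    static vx_action VX_CALLBACK on_node_done(vx_node node)
    {
        (void)node; /* a real callback would inspect the node's outputs */
        return VX_ACTION_CONTINUE; /* or VX_ACTION_ABANDON */
    }

    /* Registration on a node created elsewhere in the graph. */
    static vx_status register_callback(vx_node node)
    {
        return vxAssignNodeCallback(node, on_node_done);
    }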