Enabling OpenVX support in mW-scale parallel accelerators
Giuseppe Tagliavini (DEI, University of Bologna), Germain Haugou (IIS, ETH Zurich), Andrea Marongiu (DEI, University of Bologna / IIS, ETH Zurich), Luca Benini (DEI, University of Bologna / IIS, ETH Zurich)
[email protected], [email protected], {a.marongiu,l.benini}@iis.ee.ethz.ch
ABSTRACT

mW-scale parallel accelerators are a promising target for application domains such as the Internet of Things (IoT), which require strong compliance with a limited power budget combined with high performance capabilities. An important use case is given by smart sensing devices featuring increasingly sophisticated vision capabilities, at the cost of an increasing amount of near-sensor computation power. OpenVX is an emerging standard for embedded vision, and provides a C-based application programming interface and a runtime environment. OpenVX is designed to maximize functional and performance portability across diverse hardware platforms. However, state-of-the-art implementations rely on memory-hungry data structures, which cannot be supported in constrained devices. In this paper we propose an alternative and novel approach to provide OpenVX support in mW-scale parallel accelerators. Our main contributions are: (i) an extension to the original OpenVX model to support static management of application graphs in the form of binary files; (ii) the definition of a companion runtime environment providing lightweight support to execute binary graphs in a resource-constrained environment. Our approach achieves a 68% memory footprint reduction and a 3× execution speed-up compared to a baseline implementation. At the same time, data memory bandwidth is reduced by 10% and energy efficiency is improved by 2×.

CASES '16, October 01-07, 2016, Pittsburgh, PA, USA. © 2016 ACM. ISBN 978-1-4503-4482-1/16/10. DOI: http://dx.doi.org/10.1145/2968455.2968518

1. INTRODUCTION

Market forecasts [14][6] report that there will be more than 25 billion Internet-of-Things (IoT) devices by 2020. A key enabler for the IoT is the development of next-generation smart sensors, which combine standard analog/digital transducers and communication interfaces with powerful data-processing capabilities. In its early days, the IoT paradigm was characterized by the critical role of cloud infrastructures as providers of computational power, and its terminal constituent devices were relatively unsophisticated. With the advent of smart sensors this trend is rapidly evolving [5] towards the edge computing paradigm, which moves the execution environment of applications away from centralized nodes and to the logical extremes of a network. This is made possible by recent designs for IoT devices, which combine complex processing capability with ultra-low-power operation. Sophisticated data manipulation is thus enabled at the edges of the cloud, significantly reducing the amount of data to be sent through bandwidth-limited communication interfaces (e.g., serial peripheral interface).

To deliver the required performance/watt targets, the most promising platforms available on the market focus on a class of heterogeneous systems including a microcontroller unit (MCU) coupled to a programmable, mW-scale parallel accelerator [4][1][2][10][33]. The adoption of ultra-low-power parallel accelerators as co-processors provides a hundred-fold increase in OPS/W [9] compared to state-of-the-art microcontrollers. The advent of such devices, combined with the widespread diffusion of miniaturized cameras [7], is becoming the key enabler for building sensor nodes capable of running computer vision (CV) workloads, which will be at the heart of tomorrow's most ambitious frontier of the IoT (smart cities, the internet of vehicles [3]).

Similar to what has happened in high-end heterogeneous systems, programmability will become a key issue in this domain as well. The adoption of mainstream programming paradigms is an appealing solution, but it is complicated by the constrained nature of IoT designs (power, memory, etc.). OpenVX [20] represents the state of the art for embedded vision programming, as witnessed by its widespread adoption in commercial products. It provides a C-based application programming interface (API) and a runtime environment (RTE) aimed at enabling optimized implementations of numerous CV algorithms on embedded systems. OpenVX relies on a graph-based execution model, which simplifies programmability by exposing basic components and their relations to application developers.

In the most common case, an OpenVX framework relies on a graph-based RTE, assuming that a data structure describing the graph is allocated in the accelerator memory [38]. A key trait of mW-scale accelerators is the strongly limited amount of available on-chip memory, which requires dedicated memory management techniques to enable the efficient execution of graphs of arbitrarily large size. To this aim OpenVX supports data tiling [17], a well-known technique that exploits spatial locality in a program by partitioning large data structures into smaller chunks that are brought in and out of the target memory via DMA transfers. When tiling is applied, the number of nodes in the graph data structure becomes much larger than the number of application kernels, due to the tiles and to the computation artifacts generated by their presence (e.g., image borders, corner or inner tiles, double buffering). Consequently, the scarce amount of on-chip memory can be insufficient to contain both application and RTE data and code.

In this paper we propose a novel approach to provide OpenVX support in mW-scale parallel accelerators. Experimenting with real-life applications, we realized that standard data management techniques (e.g., tiling and double buffering) are not enough to guarantee the stringent memory constraints that have been outlined for this category of devices, and we had to re-think the way such an execution model could be supported. Our proposal leverages a new API to support static management of application graphs. An OpenVX graph can be created, verified and executed on the developer workstation using the standard OpenVX execution flow, which guarantees full portability on any platform an OpenVX framework is available for. Once a program has been developed and tested following this standard flow, the proposed API extension allows saving the resulting graph into a static representation as a platform-dependent binary file. The OpenVX runtime data structure management is replaced by a control-code generation approach, which reduces the total RTE footprint (code + data) to a few kilobytes. This approach also drastically reduces the platform energy consumption, by replacing costly and frequent accesses to the runtime data with cheaper control instructions (ALU operations). It also makes effective usage of low-bandwidth data channels by means of software techniques aimed at maximizing computation locality (thus minimizing the request for external bandwidth).

The control-code generation is complemented by a minimal RTE design for the target platform, called milliVX, which includes a small subset of the OpenVX specification to read a generated binary and offload its execution to the accelerator. It is important to underline that our approach takes advantage of the static structure extracted from an OpenVX program to optimize the execution stage in terms of memory footprint and execution time, but at the same time it fully preserves the dynamic features of the original OpenVX standard, namely graph updates and node callbacks. Graph updates are dynamic modifications to the graph data structure, generating a different graph that requires a further verification stage before its execution. Node callbacks are used to control the graph execution flow by calling functions upon termination of a particular node. milliVX provides low-cost yet full-fledged support for those features.

To assess our approach, we provide a reference implementation for the OpenVX extension and the milliVX specification using a publicly available research tool [38] for OpenVX development on ultra-low-power parallel accelerators [33]. Experimental results show that our approach achieves a 68% memory footprint reduction and about a 3× execution speed-up compared to a baseline. Moreover, the data memory bandwidth towards off-chip memory is further reduced by 10% and energy efficiency is improved by 2×.

The rest of the paper is organized as follows. In Section 2 we introduce the background and analyze the limitations of standard OpenVX frameworks. In Section 3 we describe our approach. Section 4 presents the experimental results. Section 5 discusses the related work. Finally, we conclude and introduce our future work in Section 6.

2. BACKGROUND

2.1 Platform template

Today the market offers several products that can be included in the class of mW-scale parallel accelerators, mostly in the segment of licensable IP cores [4][1][2], but hardware platforms from research institutions are also available [10][33]. The size, performance, and power consumption of these solutions greatly depend on the core configuration, synthesis flags, physical-IP libraries, technology node, and other variables. Overall, these accelerators are characterized by a common design. An architecture provides a set of homogeneous PEs, that are CPUs or general-purpose DSPs, commonly supporting a VLIW instruction set or a vector extension. Each architecture also includes an L1 code memory, an L1 data memory and a DMA engine to enable data transfers with greater memory levels. Peripherals (e.g., SPI) and greater memory levels (e.g., L2 or DDR) are accessible via off-chip communication channels.

Figure 1: Heterogeneous architecture model

Figure 1 shows the generalized architecture model considered in the rest of this work. It consists of an MCU host coupled with a multi-core mW-scale accelerator. The host is the main control unit of the full system, and it has the option to offload computation-intensive workloads to the accelerator. The link between MCU and accelerator uses the SPI protocol, a common interface for off-the-shelf MCUs which fully satisfies mW-scale power constraints. External sensors and communication channels are managed by the MCU, while data to/from the accelerator are stored in the external memory.

The accelerator is a parallel platform featuring a number n of PEs, that are fully independent cores supporting multiple instruction multiple data (MIMD) parallelism. Each core is equipped with an instruction cache (I$), which can be private or shared. To avoid memory coherency overhead and increase energy efficiency the PEs do not have private data caches, but they share an L1 scratchpad memory (SCM). Communication between the cores and the SCM is based on an interconnect (Local interconnect) that guarantees minimum access latency (1 or 2 cycles), implementing a word-level interleaving scheme to reduce access contention on the memory banks. An L2 scratchpad memory is shared among all cores, to host code and data, with an access latency in the order of 10 cycles. A DMA engine enables fast and flexible communication with the L2 memory and external peripherals. In this template, the external memory is intended to store input/output data. A typical example of such an IP could be a generic flash memory or a more specific frame buffer where various sensors place sampled data. It is accessible through the SPI, which provides a low-bandwidth, long-latency serial IO channel.

2.2 The OpenVX programming paradigm

The OpenVX standard defines a set of API functions and a library of kernels implementing vision primitives. The execution model is based on a directed acyclic graph, where nodes are instances of vision kernels and edges are images that represent input/output dependencies on data. An important class of images includes virtual images, which are introduced to force a dependency between kernels but are not backed up by physical storage. Following the standard flow, a graph is (i) created (as a data structure), (ii) populated by interconnecting kernel nodes, (iii) verified to guarantee the consistency of its topology and data flow, (iv) executed to process the input data and produce the desired result. An OpenVX implementation consists of a program-

13 status = vxVerifyGraph(graph);
14 while(

Figure 2: Example of an OpenVX graph and its expansion after tiling
void entry_point_function(int cfv1) {
    graph_control_function1(…);
    if (cfv1)
        graph_control_function2(…);
    else
        graph_control_function3(…);
}
Figure 5: Example of graph modifications
Graph updates are handled by introducing control flow variables. These are integer variables that are passed as input parameters to the entry point function. The actual value of a control flow variable discriminates which version of the related sub-graph must be executed. This behaviour is equivalent to having multiple variants of the same graph, where each control flow variable selects a specific variant. Figure 5 shows an example of graph modification. Node n1 is removed, and nodes n2 and n3 are added. Consequently, two alternative sub-graphs are generated: the common sub-graph 1 and the alternative ones 2 and 3. The generated code for the entry point function enables switching between 2 and 3 on the basis of the control flow variable cfv1 (the next section describes how to practically use this mechanism in the milliVX framework). When the algorithm fails to generate alternative sub-graphs (i.e., sub-graphs deriving from distinct graph updates intersect), the framework generates totally distinct graphs. In this case a warning informs the programmer that the graph modifications were too pervasive and sub-graph generation was not possible, since this condition could increase the final binary size. This particular condition is related to complex polymorphic behaviors that are not common in OpenVX applications, but nevertheless this corner case is correctly supported by our framework. Overall, this methodology totally preserves the semantics of the original program and ensures full code portability.

(a) Control function code  (b) Graph representation
Figure 4: Example of a generated control function

12 ...
13 status = vxVerifyGraph(graph);
14 #ifdef CODE_GENERATION
15 while(

(n − 1) × memsize(ker_i) < α × Σ_{j=1}^{n} memsize(ker_j)    (3)

The parameter α is the percentage of the total kernel footprint (supposing no other inlining) that we must not exceed to inline the current kernel.

3.2 milliVX framework

In the context of our target mW-scale architecture, milliVX is a lightweight framework available to the MCU host to load a program binary corresponding to an OpenVX static graph and then offload its computation to the parallel accelerator. The milliVX API specification includes the following functions:
• mvxCreateContext – Creates a lightweight context for the RTE. The implementation details are strictly dependent on the target platform.
• mvxCreateImageReference – Creates a reference to an image location, providing a pointer. The address value is required to enable the graph binary to access input/output images. An image reference must be instantiated for each concrete image in the original OpenVX program. Virtual images just represent dependencies, and the generated code already handles these dependencies internally.
• mvxLoadGraph – Loads a graph into the accelerator L2 memory, in the format produced by vxSaveGraph. This function returns a handle to the loaded graph.
• mvxProcessGraph – Starts the execution of the graph on the accelerator. The required parameters are the graph handle returned by mvxLoadGraph, the graph input/output image references and the control flow variables. The control variables are set by the program logic with the aim of executing a specific graph variant.
• mvxAssignNodeCallback – Sets a function callback. The required parameters are the callback identifier (provided by the extended OpenVX RTE) and the function to execute. The function could be the same provided in a standard OpenVX RTE, with the only difference that API functions must be replaced with their equivalents in milliVX; in most cases, the access to a framework object is replaced with a direct memory access. The communication protocol between the host and the accelerator, used to manage the callback behaviour, is handled by the milliVX RTE. The communication internals are based on platform-specific mechanisms (e.g., shared memory or communication channels), and the synchronization can also be achieved using alternative mechanisms (e.g., a software interrupt or a polling thread).

The structure of a milliVX program is much simpler than that of an equivalent OpenVX program. The creation and verification stages are performed by the extended RTE producing a static graph, and milliVX only handles the execution stage. A milliVX program can be automatically derived from the corresponding OpenVX code. Pointers to input/output images are provided as global external variables keeping the original names, and these symbols must be resolved at link time. The actual parameters for control flow variables when invoking mvxProcessGraph are derived by a static control flow analysis of the source code. Listing 5 shows the structure of a milliVX application using callbacks.

14 mvx_context ctx = mvxCreateContext();
15 mvx_graph graph = mvxLoadGraph(

4. EXPERIMENTAL RESULTS

This section describes a set of experiments performed to compare our approach to a standard dynamic graph-based framework. First, we measure the impact of code and runtime data footprint when comparing CG-RTE and GB-RTE. Second, we show how CG-RTE enables performance speed-ups due to the link-time optimizations enabled by the approach. Third, we discuss how the reduced memory accesses due to graph removal enable important energy savings for CG-RTE. Finally, we discuss the impact of tiling on reducing the bandwidth pressure on the slow SPI channel.

4.1 Setup

The benchmarks used for the experiments include a set of representative CV kernels for constrained embedded systems:
• Sobel is a gradient-based edge detector (nodes: Sobel 3×3, gradient magnitude, thresholding);
• FAST9 implements the FAST9 algorithm [34] (nodes: FAST9, find maxima, non-maxima suppression);
• Pyramid creates a set of scaled and blurred images (nodes: Gaussian blur, half scale, ... repeated 3 times);
• Canny implements an edge detector algorithm (nodes: Sobel M×N, element-wise norm, phase, non-maxima suppression, edge tracking by hysteresis);
• Harris implements a corner detector algorithm (nodes: Sobel M×N, Harris score computation, Euclidean non-maxima suppression, corner lister);
• NCC is an algorithm to detect abandoned/removed objects in a set of adjacent video frames [25] (nodes: NCC filter, erode);
• Disparity computes the stereo-matching disparity between two images (nodes: subtraction, multiplication, integral image, area sum, disparity computation);
• CNN is a convolutional neural network [23] including four layers made of 48 total nodes (node types: convolution, add, max-pooling).

The reference image size used for the experiments is VGA, typical of ultra-low-cost imagers.

As a reference platform for our experiments we use accurate simulation models for a heterogeneous system coupling an MCU host with a PULP multi-core accelerator, based on the open source tool ADRENALINE [38]. In our target platform, we include an external frame memory (FM) intended to store input/output images, while L2 is dedicated to code and run-time data. FM is an off-chip component, as its size could scale up to several MBs depending on the target image format. The system is configured as follows:
• Host: STM32-L476 MCU [36] – number of cores = 1 (ARM Cortex M4), core frequency = 26 MHz
• External memory: memory size = 1 MB, access latency = 50 cycles, bandwidth = 0.125 bytes/cycle
• PULP v3 cluster [33]: number of cores = 4 (OpenRISC OR10N), core frequency = 200 MHz, SCM size = 64 KB, L2 size = 128 KB, L2 access latency = 5 cycles, L2 bandwidth = 4 bytes/cycle

The power consumption of a PULP cluster in 28nm FD-
Figure 8: Execution time (GB-RTE vs CG-RTE)
Figure 7: Total memory savings (CG-RTE vs GB-RTE)
Sourcery Linux GNU toolchain (version 4.8.2) for the ARM Cortex-M target and the OR10N LLVM/Clang toolchain (based on LLVM 3.7) for PULP.

Figure 9: Energy efficiency of the STM32-L476 host compared to the PULP accelerator
4.2 Memory footprint both runtime versions. The speed-up of CG-RTE over GB- Figure 6 compares the code footprints on GB-RTE and RTE varies from 1.04 to 7.89, and this behavior depends on CG-RTE, highlighting the percentage savings of the second three main factors. First, it is proportional to a C/C fac- solution. Both RTEs allocate code and runtime data struc- tor, which includes the impact of the computation intensity tures on the L2 on-chip memory. Common includes a set of the kernel in terms of computation/communication ra- of low-level primitives for parallelism management that are tio. Second, it is proportional to the number of tiles, as the used by both runtime environments. The variations on ker- overhead introduced by graph interpretation in GB-RTE is nel footprint are related to the degree of code optimization higher. Third, it is inversely proportional to the number of enabled at compile and link time by the two approaches. In nodes, as the overhead of generated control loop in CG-RTE GB-RTE the kernels are standalone shared objects, and they increases with this metric. The data table in Figure 8 reports are dynamically loaded at execution time. Consequently, these factors, and the resulting speed-up can be computed their total memory footprint is the sum of all kernel bina- by the formula C/C ∗ T iles/Nodes. The resulting speed-up ries. In CG-RTE the kernels are merged with the generated is an additional benefit of our approach, mainly due to the control code at link time to produce a single binary for the aggressive link-time optimizations that are possible in the accelerator. This allows to enable aggressive LTO passes in CG-RTE runtime (described in Section 3.1). The execution the toolchain, with an impact on the code size. In the bench- time on the MCU is also reported for comparison. marks the code size is increasing in Canny, which contains multiple inlined instances of the same kernels. 
4.4 Energy efficiency Figure 7 compares the memory footprint of CG-RTE and The reported MCU setup implies an average power con- GB-RTE. The figure also reports percent L2 memory savings sumption of 8.64 mW, close to the 8.10 mW value computed using CG-RTE instead of GB-RTE (these gaps are high- for the PULP accelerator. Considering this operating point, lighted by vertical dashed lines). GB-RTE includes the Figure 9 shows the energy efficiency of both OpenVX RTEs graph interpreter (Section 2.3), that is independent of the on the mW-scale accelerator with respect to the execution executed benchmark. CG-RTE includes the control code on the STM32-L476 MCU. To make a fair comparison, we generated for the specific benchmark (Section 3.1). The min- compiled the optimized code generated by CG-RTE for the imum difference between the runtime supports is variable, MCU target, using an intermediate C representation which and in general CG-RTE could exceed GB-RTE. This effect is includes advanced inlining. In addition, we used a basic evident for CNN, which has a high number of kernels whose instruction set for both cores to avoid the effects of vector- orchestration requires more lines of generated code w.r.t. ization or special-purpose acceleration. Overall, the energy other benchmarks. A horizontal dotted line highlights this efficiency of the MCU is two order of magnitude less than overhead (about 1.30 KB). In any case its value is always the one measured on the accelerator. Figure 9 also re- lower than the sum of GB-RTE and the corresponding run- ports the ratio between the energy efficiency of CG-RTE time graph. Overall the removal of the RTE runtime graph and that of GB-RTE. On average, CG-RTE improves en- is the major contribution to on-chip memory saving, that is ergy efficiency by 2.14×. This increase in energy efficiency a primary goal of our work. 
is mainly due to the lower number of executed instructions and to the reduced number of L1/L2 accesses to runtime 4.3 Execution time data which are replaced with cheaper control instructions Figure 8 reports the execution time of the benchmarks for (ALU operations). tively in several contexts. In the context of Domain Specific Embedded Languages, code generation is commonly used to transform high-level patterns and structures into efficient parallel code for different architectures, such as CPUs [39], GPUs [16] or DSPs [30]. Machine learning techniques can be also used to generate efficient code for a specific algorithm [34]. Another technique which is strictly connected to code generation is partial evaluation. A computer program can be modeled as a mapping of input data into output data. A new mapping (i.e., a new program) can be obtained by removing from the input space all the dimensions corresponding to static inputs that are totally known at compile time. This is Figure 10: Frame memory bandwidth the principle that we have considered in this paper, initially introduced by the first Futamura projection [13] in the con- text of code interpreters. In this work we have extended this basic principle in two ways, by generating the code at the 4.5 Bandwidth reduction runtime level of a meta-model (that is the OpenVX program Figure 10 shows the bandwidth required by both RTEs executing on the developer workstation) and by supporting compared to a baseline implementation that access the ex- the dynamic aspects of the original execution model with ternal memory to store all the intermediate results. The specific control code. bandwidth is computed as a ratio between the memory traf- OpenVX support has been provided for different devices. fic required to compute a single image and its corresponding The Khronos website [20] provides a sample implementation execution time on the accelerator. 
This consistent reduction of the OpenVX specification targeting x86 architectures. enables the use of a low bandwidth SPI serial memory, while VisionWorks [27] is software development toolkit that im- the access latency is hidden by double-buffering. There is plements the OpenVX standard, targeting CUDA-capable no significant difference between GB-RTE and CG-RTE re- GPUs and SOCs. Another OpenVX implementation sup- garding the access patterns on the external memory. The ports the PAAG array processor (Polymorphic Array Archi- bandwidth requirements of CG-RTE are lightly lower since tecture for Graphics and image processing) [15], a polymor- the tiles are typically greater (L1 memory does not contain phous architecture specifically tailored for graphics render- a reserved area for RTE data) and consequently the border ing and image processing. An OpenVX framework tailored effects are less evident. for low-power many-core accelerators has been presented in Conversely, the bandwidth requirements of L2 and L1 [37]. Most of these solutions are characterized by a power memories are affected. In GB-RTE a runtime graph must consumption from 500 mW up to 5 W with no specific mem- be read from L2 and written to L1 to be used by the graph ory limit, while the techniques described in this paper are interpreter. This amount of data is equal to the size of the intended for heavily-constrained mW-scale devices. runtime graphs, as reported by Figure 7. Using code gen- State-of-the-art mW-scale MCUs (e.g., STMicroelectron- eration these accesses to L1 and L2 are totally removed, as ics STM32-L476 [36], SiliconLabs EFM32 [35] and Texas runtime parameters are directly encoded as instruction im- Instruments MSP430 [40]) already target a power budget mediates with no additional redundancy. lower than 50 mW, but they cannot guarantee high comput- ing performance for the embedded vision domain. To bridge this gap some MCUs provide fixed-function hardware blocks 5. 
RELATED WORK [43][29] or partially programmable accelerators [8], but their The role of OpenVX to achieve performance optimization programmability is very limited and they cannot support a at system level has been initially highlighted in [32]. Its full OpenVX framework. In this paper we have preferred a execution model assumes a graph-based application descrip- more general approach, coupling to the MCU a fully pro- tion. This is a very common approach in literature, and grammable accelerator able to execute diverse and complex it has been extensively used to derive foundational models workloads with a limited budget for silicon area and power such as task graphs [42] and data-flow graphs [24]. The se- consumption. Some multi-core MCUs [28] [12] are available, mantic of OpenVX defines a dependency graph, that is a but they employ multiple cores with the objective to save structure that describes a partial evaluation order among power distributing heterogeneous tasks to specialized units. kernel nodes. This approach is common to other modern Today the market offers several products that can be in- programming models, such as OpenMP 4.0 tasking [41] and cluded in the class of mW-scale parallel accelerator, mostly Intel TBB library [21]. The OpenVX standard has been in the segment of licensable IP cores. DesignWare EV52 supported by several major industries interested into CV ac- and EV54 processors [2] by Synopsys integrate two or four celeration on parallel computation devices, such as NVIDIA, 32-bit ARC HS cores with up to eight programmable ac- AMD and Synopsys [18]. celerators optimized for CV and convolutional neural net- Alternatives to OpenVX are Halide [31] and HIPAcc [26], works. CEVA-XM4 processor [1] is based on a general- which allow programmers to specify a functional description purpose DSPs, with a VLIW support up to eight parallel with a domain-specific language. However these solutions operations on 4,096 bits. 
Cadence IVP processor [4] is based present major limitations when applied to generic CV al- on the configurable Xtensa CPU/DSP, supporting three par- gorithms, in particular (i) irregular data patterns are not allel operations on 512 bits. Considering platform proposed supported, (ii) composability of software modules is limited by academic institutions, possible candidates to the role of to pipeline patterns, and (iii) schedule management requires multi-core accelerator are Centip3de [10] and PULP [33]. to write platform-specific code. OpenCL [19] allows appli- Centip3de is a clustered-based fabric of Cortex M3 cores, cations to use pre-built binaries, or alternatively to load and while PULP presents a similar design based on OpenRISC compile the program source at runtime. We apply the first cores. These solutions rely on near-threshold and parallel method on the host side to load a pre-built OpenVX appli- computing to increase performance and energy efficiency. cation, but in our optimized approach the source code for In this work we used PULP as a target, for two main rea- the accelerator is automatically generated from an OpenVX sons: (i) its architecture is representative of our device class, program. and (ii) a virtual platform with an OpenVX RTE was al- The principle of code generation has been applied effec- ready available for tests and comparisons. To the best of our knowledge, no optimized OpenVX support is provided International Conference on Communication Technology for any platform including a MCU and a mW-scale parallel (ICCT), pages 598–601, 2015. [16] J. Holewinski, L.-N. Pouchet, and P. Sadayappan. accelerator. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM international conference on Supercomputing, pages 311–320. 6. CONCLUSIONS ACM, 2012. In this paper we propose an alternative and novel ap- [17] I. Kadayif and M. Kandemir. Data Space-oriented Tiling for Enhancing Locality. 
6. CONCLUSIONS
In this paper we propose an alternative and novel approach to providing OpenVX support in heterogeneous systems that include an MCU and a parallel accelerator. Our main contributions are an extension of the original OpenVX model to support static management of application graphs, and the definition of the milliVX specification, which provides lightweight support for executing static graphs in a resource-constrained environment without renouncing the dynamic features provided by OpenVX. Experimental results show that our approach drastically reduces the memory footprint (-68%) and the required bandwidth (-10%). Moreover, we obtain an average 3× execution speed-up and 2× higher energy efficiency compared to a baseline implementation.
From a theoretical point of view, our approach is fully scalable w.r.t. the number of nodes and processing elements in the system, the only limitation being the available system resources. Our future work will focus on two main aspects. First, specific LTO passes using advanced heuristics could further reduce the execution time with limited effects on the code footprint. Second, we will consider more complex architectures with multiple IoT nodes and/or accelerator clusters, including RTE-to-RTE communication channels in the platform model.

7. ACKNOWLEDGEMENTS
This work is supported by the EU FP7 ERC Advanced project MULTITHERMAN (g.a. 291125) and by the IcySoC and YINS RTD projects, evaluated by the Swiss NSF and funded by Nano-Tera.ch with Swiss Confederation financing.

8. REFERENCES
[1] CEVA-XM4 Intelligent Vision Processor. http://www.ceva-dsp.com/CEVA-XM4.
[2] DesignWare EV Family of Vision Processors. https://www.synopsys.com/dw/ipdir.php?ds=ev52-ev54.
[3] IoT - From Research and Innovation to Market Deployment. http://www.internet-of-things-research.eu/.
[4] Tensilica Customizable Processor IP. http://ip.cadence.com/ipportfolio/tensilica-ip.
[5] ABIresearch. Edge Analytics in IoT. https://www.abiresearch.com/market-research/product/1021642-edge-analytics-in-iot/.
[6] ABIresearch. More Than 30 Billion Devices Will Wirelessly Connect to the Internet of Everything in 2020. https://www.abiresearch.com/press/more-than-30-billion-devices-will-wirelessly-conne/.
[7] S. Banerjee and D. O. Wu. Final report from the NSF workshop on future directions in wireless networking. 2013.
[8] Carnegie Mellon University. CMUcam. http://www.cmucam.org/.
[9] A. Y. Dogan et al. Power/performance exploration of single-core and multi-core processor approaches for biomedical signal processing. In Integrated Circuit and System Design. Power and Timing Modeling, Optimization, and Simulation, pages 102–111. Springer, 2011.
[10] D. Fick et al. Centip3De: A 3930DMIPS/W configurable near-threshold 3D stacked system with 64 ARM Cortex-M3 cores. In IEEE International Solid-State Circuits Conference Digest of Technical Papers, pages 190–192. IEEE, 2012.
[11] P. Flatresse et al. Ultra-wide body-bias range LDPC decoder in 28nm UTBB FDSOI technology. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International, pages 424–425. IEEE, 2013.
[12] Freescale. MC9S12XDP512 Datasheet. Rev. 2.21.
[13] Y. Futamura. Partial evaluation of computation process – an approach to a compiler-compiler. Higher-Order and Symbolic Computation, 12(4):381–391, 1999.
[14] Gartner. Gartner Says the Internet of Things Installed Base Will Grow to 26 Billion Units By 2020. http://www.gartner.com/newsroom/id/2636073.
[15] Z. Guo, J. Han, and T. Li. Implementing OpenVX on a polymorphous array processor. In 2015 IEEE 16th International Conference on Communication Technology (ICCT), pages 598–601, 2015.
[16] J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing, pages 311–320. ACM, 2012.
[17] I. Kadayif and M. Kandemir. Data space-oriented tiling for enhancing locality. ACM Trans. Embed. Comput. Syst., 4(2):388–414, 2005.
[18] Khronos Group. OpenVX resources. http://www.khronos.org/openvx/resources.
[19] Khronos Group. The OpenCL 1.1 Specifications. http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf.
[20] Khronos Group. The OpenVX API for hardware acceleration. http://www.khronos.org/openvx.
[21] A. Kukanov and M. J. Voss. The foundations for scalable multi-core software in Intel Threading Building Blocks. Intel Technology Journal, 11(4), 2007.
[22] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.
[23] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.
[24] E. A. Lee and D. G. Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. Computers, IEEE Transactions on, 100(1):24–35, 1987.
[25] M. Magno et al. Multimodal abandoned/removed object detection for low power video surveillance systems. In Advanced Video and Signal Based Surveillance, Sixth IEEE International Conference on, pages 188–193. IEEE, 2009.
[26] R. Membarth et al. HIPAcc: A domain-specific language and compiler for image processing. IEEE Transactions on Parallel and Distributed Systems, (99):1–14.
[27] NVIDIA. NVIDIA Jetson TX1 Supercomputer-on-Module Drives Next Wave of Autonomous Machines. http://devblogs.nvidia.com/parallelforall/nvidia-jetson-tx1.
[28] NXP. LPC5410x Datasheet. Rev. 2.1.
[29] J. Oh, S. Lee, and H.-J. Yoo. 1.2-mW online learning mixed-mode intelligent inference engine for low-power real-time object recognition processor. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, pages 921–933, 2013.
[30] M. Püschel et al. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232–275, 2005.
[31] J. Ragan-Kelley et al. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In ACM SIGPLAN Notices, volume 48, pages 519–530. ACM, 2013.
[32] E. Rainey et al. Addressing system-level optimization with OpenVX graphs. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 658–663. IEEE, 2014.
[33] D. Rossi et al. PULP: A parallel ultra-low-power platform for next generation IoT applications. In HotChips 2015.
[34] E. Rosten, R. Porter, and T. Drummond. Faster and better: A machine learning approach to corner detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(1):105–119, 2010.
[35] SiliconLabs. EFM32G210 Datasheet. Rev. 1.90.
[36] STMicroelectronics. STM32L476xx Datasheet. Rev. 2.
[37] G. Tagliavini, G. Haugou, and L. Benini. Optimizing memory bandwidth in OpenVX graph execution on embedded many-core accelerators. In Design and Architectures for Signal and Image Processing (DASIP), 2014 Conference on, pages 1–8. IEEE, 2014.
[38] G. Tagliavini, G. Haugou, A. Marongiu, and L. Benini. ADRENALINE: An OpenVX environment to optimize embedded vision applications on many-core accelerators. In IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pages 289–296, 2015.
[39] A. T. Tan, J. Falcou, D. Etiemble, and H. Kaiser. Automatic task-based code generation for high performance domain specific embedded language. International Journal of Parallel Programming, pages 1–17, 2014.
[40] Texas Instruments. MSP430F161 Datasheet. Rev. G.
[41] P. Virouleau et al. Evaluation of OpenMP dependent tasks with the KASTORS benchmark suite. In Using and Improving OpenMP for Devices, Tasks, and More, pages 16–29. Springer, 2014.
[42] T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951–967, 1994.
[43] J.-S. Yoon et al. A unified graphics and vision processor with a 0.89 μW/fps pose estimation engine for augmented reality. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 21(2):206–216, 2013.