Enabling OpenVX support in mW-scale parallel accelerators

Giuseppe Tagliavini, DEI, University of Bologna ([email protected])
Germain Haugou, IIS, ETH Zurich ([email protected])
Andrea Marongiu and Luca Benini, DEI, University of Bologna / IIS, ETH Zurich ({a.marongiu,l.benini}@iis.ee.ethz.ch)

ABSTRACT

mW-scale parallel accelerators are a promising target for application domains such as the Internet of Things (IoT), which require strong compliance with a limited power budget combined with high performance capabilities. An important use case is given by smart sensing devices featuring increasingly sophisticated vision capabilities, at the cost of an increasing amount of near-sensor computation power. OpenVX is an emerging standard for embedded vision, and provides a C-based application programming interface and a runtime environment. OpenVX is designed to maximize functional and performance portability across diverse hardware platforms. However, state-of-the-art implementations rely on memory-hungry data structures, which cannot be supported in constrained devices. In this paper we propose an alternative and novel approach to provide OpenVX support in mW-scale parallel accelerators. Our main contributions are: (i) an extension to the original OpenVX model to support static management of application graphs in the form of binary files; (ii) the definition of a companion runtime environment providing a lightweight support to execute binary graphs in a resource-constrained environment. Our approach achieves 68% memory footprint reduction and 3× execution speed-up compared to a baseline implementation. At the same time, data memory bandwidth is reduced by 10% and energy efficiency is improved by 2×.

1. INTRODUCTION

Market forecasts [14][6] report that there will be more than 25 billion Internet-of-Things (IoT) devices by 2020. A key enabler for the IoT is the development of next-generation smart sensors, which combine standard analog/digital transducers and communication interfaces with powerful data-processing capabilities. At the beginning of its story, the IoT paradigm was characterized by the critical role of cloud infrastructures as providers of computational power, and its terminal constituent devices were relatively unsophisticated. With the advent of smart sensors this trend is evolving rapidly [5] towards the edge computing paradigm, which moves the execution environment of applications away from centralized nodes and to the logical extremes of a network. This is made possible by recent designs for IoT devices, which combine complex processing capability with ultra-low-power operation. Sophisticated data manipulation is thus enabled at the edges of the cloud, significantly reducing the amount of data to be sent through bandwidth-limited communication interfaces (e.g., serial peripheral interface).

To deliver the required performance/watt targets, the most promising platforms available on the market are focusing on a class of heterogeneous systems including a microcontroller unit (MCU) coupled to a programmable, mW-scale parallel accelerator [4][1][2][10][33]. The adoption of ultra-low-power parallel accelerators as co-processors provides a hundred-fold increase in OPS/W [9] compared to state-of-the-art microcontrollers. The advent of such devices, combined with the widespread diffusion of miniaturized cameras [7], is becoming the key enabler for building sensor nodes capable of running computer vision (CV) workloads, which will be at the heart of tomorrow's most ambitious frontier of the IoT (smart cities, the internet of vehicles [3]).

Similar to what has happened to high-end heterogeneous systems, programmability will become a key issue in this domain as well. The adoption of mainstream programming paradigms is an appealing solution, but it is complicated by the constrained nature of IoT designs (power, memory, etc.). OpenVX [20] represents the state of the art for embedded vision programming, as witnessed by its widespread adoption in commercial products. It provides a C-based application programming interface (API) and a runtime environment (RTE) aimed at enabling optimized implementations of numerous CV algorithms on embedded systems. OpenVX relies on a graph-based execution model, which simplifies programmability by exposing basic components and their relations to application developers.

In the most common case, an OpenVX framework relies on a graph-based RTE, assuming that a data structure describing the graph is allocated in the accelerator memory [38]. A key trait of mW-scale accelerators is the strongly limited amount of available on-chip memory, which requires dedicated memory management techniques to enable the efficient execution of graphs of arbitrarily large size. To this aim OpenVX supports data tiling [17], a well-known technique that exploits spatial locality in a program by partitioning large data structures into smaller chunks that are brought in and out of the target memory via DMA transfers. When tiling is applied, the number of nodes in the graph data structure becomes much larger than the number of application kernels, due to the tiles and to computation artifacts generated by their presence (e.g., image borders, corners or inner tiles, double-buffering). Consequently, the scarce amount of on-chip memory can be insufficient to contain both application and RTE data and code.

In this paper we propose a novel approach to provide OpenVX support in mW-scale parallel accelerators. Experimenting with real-life applications, we realized that standard data management techniques (e.g., tiling and double buffering) are not enough to guarantee the stringent memory constraints that have been outlined for this category of devices, and we had to re-think the way such an execution model could be supported. Our proposal leverages a new API to support static management of application graphs. An OpenVX graph can be created, verified and executed on the developer workstation using the standard OpenVX execution flow, which guarantees full portability on any platform an OpenVX framework is available for. Once a program has been developed and tested following this standard flow, the proposed API extension allows saving the resulting graph into a static representation as a platform-dependent binary file. The OpenVX runtime data structure management is replaced by a control-code generation approach, which reduces the total RTE footprint (code + data) to a few kilobytes. This approach also drastically reduces the platform energy consumption, by replacing costly and frequent accesses to the runtime data with cheaper control instructions (ALU operations). It also makes effective usage of low-bandwidth data channels by means of software techniques aimed at maximizing computation locality (thus minimizing the request for external bandwidth). The control-code generation is complemented by a minimal RTE design for the target platform, called milliVX, which includes a small subset of the OpenVX specification to read a generated binary and offload its execution to the accelerator.

It is important to underline that our approach takes advantage of the static structure extracted by an OpenVX program to optimize the execution stage in terms of memory footprint and execution time, but at the same time it fully preserves the dynamic features of the original OpenVX standard, namely graph updates and node callbacks. Graph updates are dynamic modifications to the graph data structure, generating a different graph that requires a further verification stage before its execution. Node callbacks are used to control the graph execution flow by calling functions upon termination of a particular node. milliVX provides low-cost yet full-fledged support for those features.

To assess our approach, we provide a reference implementation for the OpenVX extension and the milliVX specification using a publicly available research tool [38] for OpenVX development on ultra-low-power parallel accelerators [33]. Experimental results show that our approach achieves 68% memory footprint reduction and about 3× execution speed-up compared to a baseline. Moreover, the data memory bandwidth towards off-chip memory is further reduced by 10% and energy efficiency is improved by 2×.

The rest of the paper is organized as follows. In Section 2 we introduce the background and analyze the limitations of standard OpenVX frameworks. In Section 3 we describe our approach. Section 4 presents the experimental results. Section 5 discusses the related work. Finally, we conclude and introduce our future work in Section 6.

2. BACKGROUND

2.1 Platform template

Today the market offers several products that can be included in the class of mW-scale parallel accelerators, mostly in the segment of licensable IP cores [4][1][2], but hardware platforms from research institutions are also available [10][33]. The size, performance, and power consumption of these solutions greatly depend on the core configuration, synthesis flags, physical-IP libraries, technology node, and other variables. Overall, these accelerators are characterized by a common design. An architecture provides a set of homogeneous PEs, that are CPUs or general-purpose DSPs, commonly supporting a VLIW instruction set or a vector extension. Each architecture also includes an L1 code memory, an L1 data memory and a DMA engine to enable data transfers with greater memory levels. Peripherals (e.g., SPI) and greater memory levels (e.g., L2 or DDR) are accessible via off-chip communication channels.

Figure 1: Heterogeneous architecture model

Figure 1 shows the generalized architecture model considered in the rest of this work. It consists of an MCU host coupled with a multi-core mW-scale accelerator. The host is the main control unit of the full system, and it has the option to offload computation-intensive workloads to the accelerator. The link between MCU and accelerator uses the SPI protocol, a common interface for off-the-shelf MCUs which fully satisfies mW-scale power constraints. External sensors and communication channels are managed by the MCU, while data to/from the accelerator are stored in the external memory.

The accelerator is a parallel platform featuring a number n of PEs, that are fully independent cores supporting multiple instruction multiple data (MIMD) parallelism. Each core is equipped with an instruction cache (I$), which can be private or shared. To avoid memory coherency overhead and increase energy efficiency the PEs do not have private data caches, but they share an L1 scratchpad memory (SCM). Communication between the cores and the SCM is based on an interconnect (Local interconnect) that guarantees minimum access latency (1 or 2 cycles), implementing a word-level interleaving scheme to reduce the access contention on memory banks. An L2 scratchpad memory is shared among all cores, to host code and data, with an access latency in the order of 10 cycles. A DMA engine enables fast and flexible communication with L2 memory and external peripherals. In this template, the external memory is intended to store input/output data. A typical example of such an IP could be a generic flash memory or a more specific frame buffer where various sensors place sampled data. It is accessible through the SPI, which provides a low-bandwidth, long-latency serial IO channel.

2.2 The OpenVX programming paradigm

The OpenVX standard defines a set of API functions and a library of kernels implementing vision primitives. The execution model is based on a directed acyclic graph, where nodes are instances of vision kernels and edges are images that represent input/output dependencies on data. An important class of images includes virtual images, which are introduced to force a dependency between kernels but are not backed up by a physical storage. Following the standard flow, a graph is (i) created (as a data structure), (ii) populated by interconnecting kernel nodes, (iii) verified to guarantee the consistency of its topology and data flow, (iv) executed to process the input data and produce the desired result.

An OpenVX implementation consists of a programming API and a runtime execution environment (RTE) that supports dynamic creation and verification of a graph, plus the capability of offloading the execution of the graph itself to the parallel accelerator (where available) [20]. Listing 1 shows the typical structure of an OpenVX application:

• an execution context is created (line 1);
• a graph instance is created (line 2);
• each image is defined with a call to vxCreateImage (lines 3-4) or vxCreateVirtualImage (5-6), specifying size and type (e.g., RGB or grayscale);
• each kernel is added to the graph as a node with a call to vx<Kernel>Node (lines 8-9), specifying a list of one or more input images, a list of scalar parameters (e.g., a threshold) and a list of output images;
• the vxVerifyGraph function (line 13) checks the graph consistency;
• the vxProcessGraph function (line 15) executes the graph on the target device, using a specific framework.

1  vx_context ctx = vxCreateContext();
2  vx_graph graph = vxCreateGraph(ctx);
3  vx_image img0 = vxCreateImage(ctx, <width>,
4      <height>, <type>);
5  vx_image vimg0 = vxCreateVirtualImage(ctx, <width>,
6      <height>, <type>);
7  ...
8  vx_node node0 = vx<Kernel>Node(graph,
9      <input images>, ...,
10     <params>, ...,
11     <output images>, ...);
12 ...
13 status = vxVerifyGraph(graph);
14 ...
15 status = vxProcessGraph(graph);

Listing 1: OpenVX program template

The full specification contains many other functions to create and manipulate framework objects, which are not central to the presented approach. The interested reader is referred to the official specification document [20]. OpenVX also includes advanced features that allow for more dynamic behavior, such as graph updates and node callbacks.

Graph updates are dynamic modifications to a graph data structure following its first execution. This can be accomplished by nesting graph creation constructs (vx<Kernel>Node and vxRemoveNode, which removes a node from its parent graph) within conditional control flow in the program. When the original execution path is altered due to control flow, the OpenVX standard requires a further verification stage before executing a modified graph. Listing 2 shows the typical structure of an OpenVX application using graph updates, extending the code of Listing 1 with additional lines. The graph is executed multiple times in a loop structure, and when an application-specific condition is met (e.g., environment conditions are changing) the graph is modified accordingly (e.g., a set of nodes implementing a specific algorithm is replaced with a different one). An else clause or additional if blocks can apply alternative or additional modifications.

12 ...
13 status = vxVerifyGraph(graph);
14 while(<condition>)
15 {
16     status = vxProcessGraph(graph);
17     ...
18     if(<condition>)
19     {
20         vx<Kernel>Node(graph, ...);
21         ...
22         vxRemoveNode(<node>);
23         ...
24         status = vxVerifyGraph(graph);
25     }
26     else
27         ...
28 }

Listing 2: Graph modifications

Node callbacks represent a mechanism to control the graph execution flow on the host side by specifying a function to be called after the execution of a particular node. Node callbacks are set through the vxAssignNodeCallback function with two parameters, the graph node and the callback function. Callbacks are intended to provide simple early exit conditions, based on the return value of the passed function. If VX_ACTION_CONTINUE is returned, the graph execution on the target device will continue. If VX_ACTION_ABANDON is returned, execution is unspecified for all nodes whose execution must follow the callback owner. In practice, the execution of the graph on the device may be aborted due to application constraints (e.g., a deadline was missed or an intermediate result was sufficient to take some decisions). Listing 3 shows the typical usage of a node callback.

12 ...
13 vxAssignNodeCallback(<node>, func);
14 status = vxVerifyGraph(graph);
15 status = vxProcessGraph(graph);
16 ...
17 }
18
19 vx_nodecomplete_f func(vx_node n)
20 {
21     ...
22     if(<condition>)
23         return VX_ACTION_ABANDON;
24     return VX_ACTION_CONTINUE;
25 }

Listing 3: Node callback
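As a concrete instance of the template above, the following fragment sketches how the three-node Sobel graph used as a benchmark in Section 4 (Sobel 3×3, gradient magnitude, thresholding) could be built. It follows the creation calls of Listing 1; the VGA image size, the virtual image formats and the threshold object are our choices, not prescribed by the text.

vx_context ctx   = vxCreateContext();
vx_graph   graph = vxCreateGraph(ctx);

/* Concrete input/output images (VGA, grayscale). */
vx_image in  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
vx_image out = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);

/* Virtual images only express inter-kernel dependencies and are
   not backed by physical storage. */
vx_image gx  = vxCreateVirtualImage(ctx, 640, 480, VX_DF_IMAGE_VIRT);
vx_image gy  = vxCreateVirtualImage(ctx, 640, 480, VX_DF_IMAGE_VIRT);
vx_image mag = vxCreateVirtualImage(ctx, 640, 480, VX_DF_IMAGE_VIRT);

vx_threshold thr = vxCreateThreshold(ctx, VX_THRESHOLD_TYPE_BINARY,
                                     VX_TYPE_UINT8);

/* One vx<Kernel>Node call per kernel; edges are the shared images. */
vx_node n0 = vxSobel3x3Node(graph, in, gx, gy);
vx_node n1 = vxMagnitudeNode(graph, gx, gy, mag);
vx_node n2 = vxThresholdNode(graph, mag, thr, out);

vx_status status = vxVerifyGraph(graph);
if (status == VX_SUCCESS)
    status = vxProcessGraph(graph);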
2.3 OpenVX execution model and RTE

OpenVX was conceived for heterogeneous systems including a host processor and a parallel accelerator; however, it can be compiled and executed on any machine. Typically, programmers develop and test their vision applications on a development platform (e.g., an x86 workstation). Once the application code is debugged and tuned, it can be executed as-is on the target OpenVX installation for heterogeneous systems. When deployed to a heterogeneous system, an OpenVX graph is created and verified on the host, while graph execution is offloaded to the accelerator.

Generally, OpenVX relies on a graph-based runtime environment (GB-RTE), which leverages a graph interpreter running on the accelerator side, that is, a small software layer capable of reading an OpenVX graph description and orchestrating the execution of the corresponding kernels. Since local memory in mW-scale parallel accelerators is a scarce resource, it is important to design the OpenVX RTE with minimal memory footprint (RTE code and metadata). Table 1 reports the contributions to the memory footprint associated with the three main operations in an OpenVX program. The main contribution to the RTE metadata footprint is the graph data structure, whose size depends on the application.

Stage         Code footprint   RTE metadata
Creation      API support      Graph
Verification  API support      Graph
Execution     Interpreter      Graph

Table 1: OpenVX program phases

Data tiling is a common optimization for scratchpad-based architectures, used to enable high-locality computation on the fast L1 memory via explicit DMA transfers. The OpenVX standard incorporates tiling as the main technique to manage diverse memory hierarchies with a single program source. This technique is essential to execute a full graph with any image size on the target accelerators, but at the same time tiling policies affect the size of an OpenVX graph. Figure 2 shows an example of how data tiling increases the size of a graph. Two node types are shown, DMA transfers and executable kernels. Points where the graph interpreter is invoked are also highlighted.

Figure 2: Example of an OpenVX graph and its expansion after tiling

After tiling, the graph includes a number of nodes much larger than the number of application kernels, because the graph structure is replicated to consider different configurations (e.g., image borders, corners or inner tiles) and different target buffers (e.g., double-buffering) for each sub-graph generated by applying the tiling policies. In addition, nodes to program DMA transfers are added to the graph.

To further complicate the scenario, there are applications for which no tiling scheme is feasible, as they contain kernels that cannot be processed in parts (e.g., histograms) or because the tiling algorithm cannot fit all the buffers in the SCM. In these cases the graph is automatically partitioned by the RTE into multiple sub-graphs during the verification stage. Inside each sub-graph tiling is applied, and intermediate results at the sub-graph frontier are saved to a temporary buffer out of the small L1 SCM (e.g., the L2 memory).

An estimation of the graph footprint inflation due to data tiling can be achieved with this formula:

    a * N_intiles * N_nodes + b * N_tiles + c    (1)

The terms are:
• a is the average size of a kernel node.
• N_intiles is the number of input tiles.
• N_nodes is the number of nodes in the application graph.
• b is the average size of a transfer node.
• N_tiles is the total number of tiles (input + output).
• c is the total size of additional helper nodes.

Concerning the code footprint, the graph interpreter represents the main contribution. The graph interpreter is invoked each time a node is completed, marking the output dependencies as satisfied and looking for the next node to execute. A node is ready for execution when all its input dependencies are satisfied; to support a generic topology, a full graph visit is performed at each interpreter call, and this implies a full iteration on the node set. As a consequence, the graph expansion due to tiling does not have an impact on the code footprint of the interpreter. However, it does have an impact on the time overhead (and associated energy) of executing the interpreter more often.

3. CODE GENERATION-BASED RTE

Applying Formula 1 to existing GB-RTE OpenVX implementations for multi- and many-core accelerators [38], the size required to represent a kernel node can be computed as the total memory requirement of actual kernel parameters plus a number of words (4 bytes) equal to the output dependencies. The size required to represent a DMA transfer node is a fixed amount (40 bytes) plus a number of words (4 bytes) equal to the output dependencies. Additional helper nodes (start, end) require from 96 to 128 bytes. As a quantification, applying this formula to the Sobel filter used in Section 4, which has 3 kernels and 64 input tiles, the final graph size is 29.68 KB.

Our alternative to GB-RTE is to replace the graph data structure and its management (i.e., the graph interpreter) with control-code generation (CG-RTE). This has the potential to reduce the total RTE footprint (metadata) and to reduce the management overheads. The binary footprint of the generated code can be computed using this formula:

    (a * N_images + b * N_nodes + c) * d    (2)

The terms are:
• a is the number of C lines required per each input or output tile, and its average value is 8.
• N_images is the number of defined images.
• b is the number of C lines required for each node. It can be computed as the average of p_i + 2, where p_i is the number of parameters required by node i.
• N_nodes is the number of nodes in the application graph.
• c is a constant number of C lines, and its value is 35.
• d characterizes the average density of assembly instructions per C line on the target platform. An average value valid for the experiments of Section 4 is 11.9.

Applying this formula to the same Sobel filter considered for GB-RTE, the final graph size is 25.88 KB, that is, a 15% reduction in code size even for a very small graph. In practical cases, N_intiles is much greater than N_images, and so the result of Formula 1 is greater than the result of Formula 2.
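To make the two estimates easy to compare, the following C helpers transcribe Formulas 1 and 2 directly. The function names and parameter packaging are ours; the constants cited in the comments are the averages reported above.

#include <stddef.h>

/* Formula 1: GB-RTE graph data structure footprint.
   a = average kernel-node size, b = average transfer-node size,
   c = total size of the helper (start/end) nodes, all in bytes. */
size_t gb_rte_graph_footprint(size_t a, size_t n_intiles,
                              size_t n_nodes, size_t b,
                              size_t n_tiles, size_t c)
{
    return a * n_intiles * n_nodes + b * n_tiles + c;
}

/* Formula 2: CG-RTE generated-code footprint.
   a = C lines per input/output tile (avg. 8), b = avg(p_i + 2) lines
   per node, c = 35 constant lines, d = assembly instructions per
   C line (avg. 11.9 in Section 4); the calibration of d determines
   the final binary size. */
double cg_rte_code_footprint(double a, unsigned n_images, double b,
                             unsigned n_nodes, double c, double d)
{
    return (a * n_images + b * n_nodes + c) * d;
}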
Figure 3 describes the main steps of our approach. The program source is compiled on the developer workstation, and linked with a standard OpenVX RTE (libopenvx.so) and a node implementation (libXYZ.so) targeting the workstation environment. Testing and debugging are performed on the developer workstation. These initial steps are totally equivalent to the standard OpenVX workflow.

Figure 3: Developer workflow using our approach

In addition to that, to deploy the program on the target architecture we require the call to vxProcessGraph to be replaced with a call to vxSaveGraph. With this new function, the source is processed by a cross-compiling toolchain to generate two binaries: (i) a program for the target host, obtained by linking the original program against a lightweight runtime called milliVX (libmillivx.a); (ii) a program for the target accelerator, obtained by linking a static version of the program graph generated via a code-generation approach to the kernel library implementation for the target accelerator (libXYZ.a). Dynamic features of OpenVX are preserved by this approach, since proper support for graph updates and callbacks is provided by the milliVX RTE.

3.1 Extension for static graph support

Using the API extension, a call to vxSaveGraph produces a binary file, using a compilation toolchain to generate a sequence of intermediate artifacts:

1. graph control function – A C function is created for each OpenVX graph, including the code required to orchestrate the kernel executions and the DMA transfers specific to the graph instance.
2. entry point function – A C function is created as an entry point for the execution of the OpenVX application. It orchestrates the execution of multiple sub-graphs deriving from a single application.
3. intermediate linked object – An object file is generated by linking the single artifacts (i.e., control functions, entry point function, and kernel functions).

Code generation, compiling and linking steps are executed transparently by vxSaveGraph. This solution is based on a set of standard features common to modern compilation toolchains. In our implementation, we use the LLVM [22] toolchain libraries (libClang and libLLVM) to generate C code and translate it into LLVM intermediate representation (IR) artifacts.

3.1.1 Graph control function

A graph control function includes the capabilities provided by the union of a graph data structure and a graph interpreter from the GB-RTE. This approach enables the use of different algorithms for data tiling and kernel scheduling. In this work we reuse the algorithms provided by the verification stage of the baseline GB-RTE with minimal modifications, with the aim to perform a fair comparison limited to the execution phase. The code of a graph control function is generated on the basis of a common template, which is described by Algorithm 1.

1  – Allocate memory for SCM buffers;
2  – Program the first set of input DMA transfers;
3  – Initialize the data structures containing the kernel function parameters;
4  – Wait the first set of input DMA transfers;
5  – Program the second set of input DMA transfers;
6  – Initialize to 0 the double-buffering state;
7  forall the input tile sets i do
8      – Update kernel parameters for the current set i;
9      – Execute kernels (respecting the scheduling order);
10     – Wait the previous set i − 1 of output DMA transfers;
11     – Program the current set i of output DMA transfers;
12     – Wait the next set i + 1 of input DMA transfers;
13     – Program the future set i + 2 of input DMA transfers;
14     – Update the double-buffering state for set i + 1;
15 end
16 – Wait the last set of output DMA transfers;

Algorithm 1: Control function generation (pseudo-code)

Lines 1-6 contain the initialization phase of the algorithm. The local buffers are allocated in SCM, the first set of input DMA transfers is performed and the second one is programmed targeting a set of shadow buffers. The double-buffering technique enables the overlap between data transfer and computation. The loop in lines 7-15 drives the computation on all input sets, using data tiling. Line 8 updates the data structures containing the actual parameters for the kernels executed at iteration i, and it involves a limited number of fields changing between adjacent iterations. In lines 10-13, the algorithm flow awaits the completion of the required input transfers (input set i + 1) and the output transfers related to the buffers to be reused (output set i − 1). The resulting code implements the control logic of the graph for which it has been generated, and does not require complex data structures since all the instance-specific constants are encapsulated. Moreover, for each node there is a single copy of the data structures containing the kernel actual parameters, and the required fields are updated when executing different tiles. This requires a limited amount of stack memory, just proportional to the number of nodes in the longest graph schedule. In practice, the average stack requirement for the standard OpenVX kernels is limited to 64 bytes per graph node.

Figure 4: Example of a generated control function ((a) Control function code; (b) Graph representation)

Figure 4a shows an example of generated code, corresponding to the graph depicted in Figure 2. The regions are colored to highlight the same steps described in the code generation template. All the uppercase identifiers are constant values, computed at code generation time on the basis of graph analysis results (e.g., considering an 80×80 tiling schema with overlapping borders, IMG0_TILE_WIDTH could be 82). Consequently the resulting code is highly optimized for a specific instance, as the compiler can apply constant value optimization passes. Region 1 includes the code to allocate the required space in the SCM, and each single buffer is computed in terms of offset. Region 2 includes the first set of DMA transfers, one for each input image. The dma_memcpy_2d function programs the DMA to perform a 2D transfer from the external memory to the SCM (EXT2LOC), specifying the full size of the data (IMG0_TILE0_SIZE), the stride between adjacent lines (IMG0_STRIDE) and the line width (IMG0_TILE0_WIDTH). Region 3 contains the instructions to initialize the kernel-specific parameters, including width and height of the tile to compute. Region 4 includes a cumulative wait instruction for the DMA transfers of region 2. Region 5 includes the second set of DMA transfers, one for each input image. Region 6 initializes the variable buffer_index, used to maintain the double-buffering state. The subsequent regions (7-13) contain the code of the tiling loop, whose iterations correspond to distinct sets of tiles as provided by the tiling algorithm. Region 7 includes the code to initialize the parameters that change at every iteration (e.g., buffer location), so that in Region 8 all the kernels are invoked in the exact order provided by the scheduler algorithm. Regions 9-12 include the management code for the DMA transfers: (9) await the previous set of output transfers to guarantee the availability of the corresponding output buffers, (10) program the output DMA transfers for the last computed result, (11) await the input transfers of data required by the next cycle and (12) schedule the next set of input transfers. Region 13 updates the double-buffering state. Finally, region 14 waits for the remaining DMA output transfers programmed in the last iteration of the tiling loop.

Figure 4b shows a graph representation of the control function code, which can be compared to the graph-based approach depicted in Figure 2. Multiple calls to the graph interpreter are replaced with two application-specific control code blocks, and the tiling policy is enforced using a loop.
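Since Figure 4a itself is not reproduced here, the following is a minimal C sketch of a control function in its spirit, assuming a single input image, a single Sobel kernel and simplified 1D tile addressing. The dma_memcpy_2d/EXT2LOC names and the uppercase constants are those cited in the text; everything else (dma_wait, the kernel entry point, the parameter struct, the tile count) is our assumption, not the actual generated output.

#include <stdint.h>

enum dma_dir { EXT2LOC, LOC2EXT };
extern void dma_memcpy_2d(enum dma_dir dir, void *dst, const void *src,
                          uint32_t size, uint32_t stride, uint32_t width);
extern void dma_wait(int transfer_set);
typedef struct { const uint8_t *in; uint8_t *out;
                 uint16_t width, height; } sobel_args_t;
extern void kernel_sobel3x3(sobel_args_t *args);

#define NUM_TILE_SETS    64  /* e.g., the 64 input tiles of Section 4 */
#define IMG0_TILE_WIDTH  82  /* 80x80 tiles with overlapping borders  */
#define IMG0_TILE_HEIGHT 82
#define IMG0_TILE_SIZE   (IMG0_TILE_WIDTH * IMG0_TILE_HEIGHT)
#define IMG0_STRIDE      640 /* VGA line stride                       */

static uint8_t scm_in[2][IMG0_TILE_SIZE];  /* Region 1: SCM buffers,   */
static uint8_t scm_out[2][IMG0_TILE_SIZE]; /* offsets in the real SCM  */

void graph_control_function1(const uint8_t *img_ext, uint8_t *out_ext)
{
    /* Region 3: kernel parameters; constant fields are set once. */
    sobel_args_t args = { 0, 0, IMG0_TILE_WIDTH, IMG0_TILE_HEIGHT };

    /* Region 2: first input set; Region 4: wait; Region 5: second set. */
    dma_memcpy_2d(EXT2LOC, scm_in[0], img_ext,
                  IMG0_TILE_SIZE, IMG0_STRIDE, IMG0_TILE_WIDTH);
    dma_wait(0);
    dma_memcpy_2d(EXT2LOC, scm_in[1], img_ext + IMG0_TILE_SIZE,
                  IMG0_TILE_SIZE, IMG0_STRIDE, IMG0_TILE_WIDTH);

    int buffer_index = 0;                            /* Region 6     */
    for (int i = 0; i < NUM_TILE_SETS; i++) {        /* Regions 7-13 */
        args.in  = scm_in[buffer_index];             /* Region 7     */
        args.out = scm_out[buffer_index];
        kernel_sobel3x3(&args);                      /* Region 8     */
        if (i > 0)
            dma_wait(i - 1);                         /* Region 9     */
        dma_memcpy_2d(LOC2EXT, out_ext + (uint32_t)i * IMG0_TILE_SIZE,
                      scm_out[buffer_index], IMG0_TILE_SIZE,
                      IMG0_STRIDE, IMG0_TILE_WIDTH); /* Region 10    */
        dma_wait(i + 1);                             /* Region 11    */
        if (i + 2 < NUM_TILE_SETS)                   /* Region 12    */
            dma_memcpy_2d(EXT2LOC, scm_in[1 - buffer_index],
                          img_ext + (uint32_t)(i + 2) * IMG0_TILE_SIZE,
                          IMG0_TILE_SIZE, IMG0_STRIDE, IMG0_TILE_WIDTH);
        buffer_index = 1 - buffer_index;             /* Region 13    */
    }
    dma_wait(NUM_TILE_SETS - 1);                     /* Region 14    */
}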
3.1.2 Entry point function

An entry point function orchestrates the execution of the multiple sub-graphs derived by the tiling policies. In addition, the entry point function manages graph updates and node callbacks.

In practical cases, graph updates affect a limited portion of an OpenVX graph, otherwise it would be more convenient to create a totally new graph instance. Starting from this assumption, we extended the OpenVX context to be aware of modifications to the graph structure applied after one or more executions. When a single node or a connected subset is removed and replaced with another node or connected subset, our algorithm generates additional sub-graphs. All alternative paths in the original OpenVX graph must be explored and verified to generate all the code variants. For instance, Listing 2 is modified as depicted in Listing 4.

12 ...
13 status = vxVerifyGraph(graph);
14 #ifndef CODE_GENERATION
15 while(<condition>)
16 #endif
17 {
18 #ifndef CODE_GENERATION
19     status = vxProcessGraph(graph);
20 #endif
21     ...
22 #ifndef CODE_GENERATION
23     if(<condition>)
24 #endif
25     {
26         vx<Kernel>Node(graph, ...);
27         ...
28         vxRemoveNode(<node>);
29         ...
30         status = vxVerifyGraph(graph);
31     }
32 #ifndef CODE_GENERATION
33     else
34 #endif
35     ...
36 #ifdef CODE_GENERATION
37     vxSaveGraph(graph, <file>, <options>);
38 #endif
39 }

Listing 4: Graph modifications

A call to vxSaveGraph following all the modifications creates the additional graph control functions corresponding to the new sub-graphs and generates a new entry point function introducing control flow variables. These are integer variables that are passed as input parameters to the entry point function. The actual value of a control flow variable discriminates which version of the related sub-graph must be executed. This behaviour is equivalent to having more variants of the same graph, and each control flow variable selects a specific variant. Figure 5 shows an example of graph modification. Node n1 is removed, and nodes n2 and n3 are added. Consequently, two alternative sub-graphs are generated, the common sub-graph 1 and the alternative ones 2 and 3. The generated code for the entry point function enables switching between 2 and 3 on the basis of the control flow variable cfv1 (the next section describes how to practically use this mechanism in the milliVX framework).

vxRemoveNode(n1);
n2 = vxT2Node(...);
n3 = vxT3Node(...);

void entry_point_function(int cfv1) {
    graph_control_function1(...);
    if (cfv1)
        graph_control_function2(...);
    else
        graph_control_function3(...);
}

Figure 5: Example of graph modifications

When the algorithm fails in generating alternative sub-graphs (i.e., sub-graphs deriving from distinct graph updates intersect), the framework generates totally distinct graphs. In this case a warning is generated to inform the programmer that graph modifications were too pervasive and sub-graph generation was not possible, since this condition could increase the final binary size. This particular condition is related to complex polymorphic behaviors that are not common in OpenVX applications, but nevertheless this corner case is correctly supported by our framework. Overall, this methodology totally preserves the semantics of the original program and ensures full code portability.

In our execution model, the support for callbacks is provided by means of a communication protocol between the host and the accelerator. A notification is fired to the host in the entry point function exactly after the execution of the sub-graph that includes the involved node, and the execution is suspended waiting for a message back from the host. The response message contains a Boolean value representing the continuation status; if false, the execution of the current binary on the accelerator is aborted. Since the node abstraction is not available in the milliVX RTE, a numerical identifier is returned for each callback. The described behavior is fully compliant with the OpenVX standard, which specifies that callbacks are not guaranteed to be called immediately after the node completes.

3.1.3 Intermediate linked object

An intermediate linked object is the final step in the binary generation flow. It is compiled by the toolchain back-end for the target architecture into a single binary file, applying link-time optimization (LTO) passes, such as basic inlining and dead-code elimination. In addition, we designed a new LTO pass to maximize the execution performance while limiting the binary size. This pass forces the inlining of a kernel function ker_i when (i) it is invoked once (whatever its size) or (ii) it is invoked n times and this property holds:

    (n − 1) × memsize(ker_i) < α × Σ_{j=1..n} memsize(ker_j)    (3)

The parameter α is the percentage of the total kernel footprint (supposing no other inlining) that we cannot exceed to inline the current kernel.
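The test can be transcribed almost literally in C. The sketch below is illustrative only: the kernel descriptor, its fields, the function name and the interpretation of the sum as ranging over the kernel set are our assumptions, while the decision logic follows cases (i) and (ii) above.

#include <stddef.h>

typedef struct {
    size_t mem_size;  /* memsize(ker_i): binary footprint of the kernel */
    unsigned calls;   /* n: number of call sites in the generated code  */
} kernel_info_t;

/* Returns non-zero when kernel k must be force-inlined.  alpha is the
   fraction of the total kernel footprint (supposing no other inlining)
   that the extra inlined copies may not exceed. */
int should_force_inline(const kernel_info_t *k, const kernel_info_t *all,
                        size_t num_kernels, double alpha)
{
    if (k->calls == 1)            /* case (i): a single invocation */
        return 1;

    size_t total = 0;             /* sum of memsize(ker_j) */
    for (size_t j = 0; j < num_kernels; j++)
        total += all[j].mem_size;

    /* case (ii): (n - 1) * memsize(ker_i) < alpha * total */
    return (double)(k->calls - 1) * (double)k->mem_size
               < alpha * (double)total;
}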

3.2 milliVX framework

In the context of our target mW-scale architecture, milliVX is a lightweight framework available to the MCU host to load a program binary corresponding to an OpenVX static graph and then offload its computation to the parallel accelerator. The milliVX API specification includes the following functions:

• mvxCreateContext – Create a lightweight context for the RTE. The implementation details are strictly dependent on the target platform.
• mvxCreateImageReference – Create a reference to an image location, providing a pointer. The address value is required to enable the graph binary to access input/output images. An image reference must be instantiated for each concrete image in the original OpenVX program. Virtual images just represent dependencies, and the generated code already handles these dependencies internally.
• mvxLoadGraph – Load a graph into the accelerator L2 memory, in the format provided by vxSaveGraph. This function returns a handle to the loaded graph.
• mvxProcessGraph – Start the execution of the graph on the accelerator. The required parameters are the graph handle returned by mvxLoadGraph, the graph input/output image references and the control flow variables. The control variables are set by the program logic with the aim to execute a specific graph variant.
• mvxAssignNodeCallback – Set a function callback. The required parameters are the callback identifier (provided by the extended OpenVX RTE) and the function to execute. The function could be the same provided in a standard OpenVX RTE, with the only difference that API functions must be replaced with their equivalents in milliVX; in most cases, the access to a framework object is replaced with a direct memory access. The communication protocol between the host and the accelerator, used to manage the callback behaviour, is handled by the milliVX RTE. The communication internals are based on platform-specific mechanisms (e.g., shared memory or communication channels), and also the synchronization can be achieved using alternative mechanisms (e.g., a software interrupt or a polling thread).

The structure of a milliVX program is much simpler than that of an equivalent OpenVX program. Creation and verification stages are performed by the extended RTE producing a static graph, and milliVX only handles the execution stage. A milliVX program can be automatically derived from the corresponding OpenVX code. Pointers to input/output images are provided as global external variables keeping the original names, and these symbols must be resolved at link time. The actual parameters for the control flow variables when invoking mvxProcessGraph are derived by a static control flow analysis of the source code. Listing 5 shows the structure of a milliVX application using callbacks; a complete host-side sketch follows the listing.

14 mvx_context ctx = mvxCreateContext();
15 mvx_graph graph = mvxLoadGraph(<file>);
16 mvx_image img0 = mvxCreateImageReference(ctx, <address>);
17 mvxAssignNodeCallback(<callback id>, func);
18 ...
19 status = mvxProcessGraph(graph, <image references>,
20     <control flow variables>);
21 }
22
23 mvx_nodecomplete_f func(int callback_id)
24 {
25     ...
26     if(<condition>)
27         return MVX_ACTION_ABANDON;
28     return MVX_ACTION_CONTINUE;
29 }

Listing 5: milliVX callback support
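For reference, a complete host-side program is sketched below under stated assumptions: the binary file name, the image addresses, the control flow value and the int status type are illustrative placeholders of ours, not part of the milliVX specification.

/* Hypothetical host-side milliVX program; the file name, the image
   addresses and the control flow value are illustrative only. */
#define FRAME_BUFFER_ADDR 0x80000000u  /* assumed frame memory address  */
#define RESULT_ADDR       0x80100000u  /* assumed result buffer address */

int main(void)
{
    mvx_context ctx = mvxCreateContext();

    /* Load the binary produced by vxSaveGraph into the accelerator L2. */
    mvx_graph graph = mvxLoadGraph("sobel_graph.bin");

    /* One reference per concrete image; virtual images need none. */
    mvx_image img_in  = mvxCreateImageReference(ctx, (void *)FRAME_BUFFER_ADDR);
    mvx_image img_out = mvxCreateImageReference(ctx, (void *)RESULT_ADDR);

    /* Offload execution: image references plus the control flow
       variable selecting the graph variant (cf. entry_point_function). */
    int cfv1 = 1;
    int status = mvxProcessGraph(graph, img_in, img_out, cfv1);
    return status;
}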
4. EXPERIMENTAL RESULTS

This section describes a set of experiments performed to compare our approach to a standard dynamic graph-based framework. First, we measure the impact of code and runtime data footprint when comparing CG-RTE and GB-RTE. Second, we show how CG-RTE enables performance speed-ups due to the link-time optimizations enabled by the approach. Third, we discuss how the reduced memory accesses due to graph removal enable important energy savings for CG-RTE. Finally, we discuss the impact of tiling on reducing the bandwidth pressure on the slow SPI channel.

4.1 Setup

The benchmarks used for the experiments include a set of representative CV kernels for constrained embedded systems:
• Sobel is a gradient-based edge detector (nodes: Sobel 3×3, gradient magnitude, thresholding);
• FAST9 implements the FAST9 algorithm [34] (nodes: FAST9, find maxima, non-maxima suppression);
• Pyramid creates a set of scaled and blurred images (nodes: Gaussian blur, half scale, ... repeated 3 times);
• Canny implements an edge detector algorithm (nodes: Sobel M×N, element-wise norm, phase, non-maxima suppression, edge tracking by hysteresis);
• Harris implements a corner detector algorithm (nodes: Sobel M×N, Harris score computation, Euclidean non-maxima suppression, corner lister);
• NCC is an algorithm to detect abandoned/removed objects in a set of adjacent video frames [25] (nodes: NCC filter, erode);
• Disparity computes the stereo-matching disparity between two images (nodes: subtraction, multiplication, integral image, area sum, disparity computation);
• CNN is a convolutional neural network [23] including four layers made of 48 total nodes (node types: convolution, add, max-pooling).

The reference image size used for the experiments is VGA, typical of ultra-low-cost imagers.

As a reference platform for our experiments we use accurate simulation models for a heterogeneous system coupling an MCU host with a PULP multi-core accelerator, based on the open source tool ADRENALINE [38]. In our target platform, we include an external frame memory (FM) intended to store input/output images, while L2 is dedicated to code and run-time data. FM is an off-chip component, as its size could scale up to several MBs depending on the target image format. The system is configured as follows:
• Host: STM32-L476 MCU [36] – number of cores = 1 (ARM Cortex M4), core frequency = 26 MHz
• External memory: memory size = 1 MB, access latency = 50 cycles, bandwidth = 0.125 bytes/cycle
• PULP v3 cluster [33]: number of cores = 4 (OpenRISC OR10N), core frequency = 200 MHz, SCM size = 64 KB, L2 size = 128 KB, L2 access latency = 5 cycles, L2 bandwidth = 4 bytes/cycle

The power consumption of a PULP cluster in 28nm FD-SOI technology [11] (running at 200 MHz, with a voltage supply of 0.7 V) is 9 mW. The power consumption of the SoC periphery, IO pads and support circuits is around 2 mW. The bandwidth and latency to the external memory are modeled after those of a SPI interface providing 100 Mbit/s transfers @ 100 MHz. For realistic performance and power consumption measurement, the simulation platform has been augmented with a performance monitoring unit that is used to measure active and idle cycles for cores, DMA, interconnects and external memory accesses. Power numbers for the host + SPI are pre-characterized with real measures on a STM32-L476 MCU. Leakage and dynamic power numbers for the PULP accelerator are extracted from a post-layout back-annotated timing and power analysis (PULP v3 chip @ 200 MHz). To compile the benchmarks, we used the ARM Sourcery Linux GNU toolchain (version 4.8.2) for the ARM Cortex-M target and the OR10N LLVM/Clang toolchain (based on LLVM 3.7) for PULP.

4.2 Memory footprint

Figure 6: Code footprint (GB-RTE vs CG-RTE)

Figure 6 compares the code footprints of GB-RTE and CG-RTE, highlighting the percentage savings of the second solution. Both RTEs allocate code and runtime data structures in the L2 on-chip memory. Common includes a set of low-level primitives for parallelism management that are used by both runtime environments. The variations in kernel footprint are related to the degree of code optimization enabled at compile and link time by the two approaches. In GB-RTE the kernels are standalone shared objects, and they are dynamically loaded at execution time. Consequently, their total memory footprint is the sum of all kernel binaries. In CG-RTE the kernels are merged with the generated control code at link time to produce a single binary for the accelerator. This allows enabling aggressive LTO passes in the toolchain, with an impact on the code size. In the benchmarks the code size increases for Canny, which contains multiple inlined instances of the same kernels.

Figure 7: Total memory savings (CG-RTE vs GB-RTE)

Figure 7 compares the memory footprint of CG-RTE and GB-RTE. The figure also reports percent L2 memory savings using CG-RTE instead of GB-RTE (these gaps are highlighted by vertical dashed lines). GB-RTE includes the graph interpreter (Section 2.3), which is independent of the executed benchmark. CG-RTE includes the control code generated for the specific benchmark (Section 3.1). The minimum difference between the runtime supports is variable, and in general CG-RTE could exceed GB-RTE. This effect is evident for CNN, which has a high number of kernels whose orchestration requires more lines of generated code w.r.t. other benchmarks. A horizontal dotted line highlights this overhead (about 1.30 KB). In any case its value is always lower than the sum of GB-RTE and the corresponding runtime graph. Overall the removal of the RTE runtime graph is the major contribution to on-chip memory saving, which is a primary goal of our work.

4.3 Execution time

Figure 8: Execution time (GB-RTE vs CG-RTE)

Figure 8 reports the execution time of the benchmarks for both runtime versions. The speed-up of CG-RTE over GB-RTE varies from 1.04 to 7.89, and this behavior depends on three main factors. First, it is proportional to a C/C factor, which includes the impact of the computation intensity of the kernel in terms of computation/communication ratio. Second, it is proportional to the number of tiles, as the overhead introduced by graph interpretation in GB-RTE is higher. Third, it is inversely proportional to the number of nodes, as the overhead of the generated control loop in CG-RTE increases with this metric. The data table in Figure 8 reports these factors, and the resulting speed-up can be computed by the formula C/C × Tiles/Nodes. The resulting speed-up is an additional benefit of our approach, mainly due to the aggressive link-time optimizations that are possible in the CG-RTE runtime (described in Section 3.1). The execution time on the MCU is also reported for comparison.

4.4 Energy efficiency

The reported MCU setup implies an average power consumption of 8.64 mW, close to the 8.10 mW value computed for the PULP accelerator. Considering this operating point, Figure 9 shows the energy efficiency of both OpenVX RTEs on the mW-scale accelerator with respect to the execution on the STM32-L476 MCU. To make a fair comparison, we compiled the optimized code generated by CG-RTE for the MCU target, using an intermediate C representation which includes advanced inlining. In addition, we used a basic instruction set for both cores to avoid the effects of vectorization or special-purpose acceleration. Overall, the energy efficiency of the MCU is two orders of magnitude less than the one measured on the accelerator.

Figure 9: Energy efficiency of the STM32-L476 host compared to the PULP accelerator
Figure 9 also reports the ratio between the energy efficiency of CG-RTE and that of GB-RTE. On average, CG-RTE improves energy efficiency by 2.14×. This increase in energy efficiency is mainly due to the lower number of executed instructions and to the reduced number of L1/L2 accesses to runtime data, which are replaced with cheaper control instructions (ALU operations).

4.5 Bandwidth reduction

Figure 10: Frame memory bandwidth

Figure 10 shows the bandwidth required by both RTEs compared to a baseline implementation that accesses the external memory to store all the intermediate results. The bandwidth is computed as the ratio between the memory traffic required to compute a single image and its corresponding execution time on the accelerator. This consistent reduction enables the use of a low-bandwidth SPI serial memory, while the access latency is hidden by double-buffering. There is no significant difference between GB-RTE and CG-RTE regarding the access patterns on the external memory. The bandwidth requirements of CG-RTE are slightly lower since the tiles are typically greater (L1 memory does not contain a reserved area for RTE data) and consequently the border effects are less evident.

Conversely, the bandwidth requirements of the L2 and L1 memories are affected. In GB-RTE a runtime graph must be read from L2 and written to L1 to be used by the graph interpreter. This amount of data is equal to the size of the runtime graphs, as reported by Figure 7. Using code generation these accesses to L1 and L2 are totally removed, as runtime parameters are directly encoded as instruction immediates with no additional redundancy.

5. RELATED WORK

The role of OpenVX in achieving performance optimization at system level was initially highlighted in [32]. Its execution model assumes a graph-based application description. This is a very common approach in the literature, and it has been extensively used to derive foundational models such as task graphs [42] and data-flow graphs [24]. The semantics of OpenVX defines a dependency graph, that is, a structure that describes a partial evaluation order among kernel nodes. This approach is common to other modern programming models, such as OpenMP 4.0 tasking [41] and the TBB library [21]. The OpenVX standard has been supported by several major industries interested in CV acceleration on parallel computation devices, such as NVIDIA, AMD and Synopsys [18].

Alternatives to OpenVX are Halide [31] and HIPAcc [26], which allow programmers to specify a functional description with a domain-specific language. However these solutions present major limitations when applied to generic CV algorithms, in particular (i) irregular data patterns are not supported, (ii) composability of software modules is limited to pipeline patterns, and (iii) schedule management requires writing platform-specific code. OpenCL [19] allows applications to use pre-built binaries, or alternatively to load and compile the program source at runtime. We apply the first method on the host side to load a pre-built OpenVX application, but in our optimized approach the source code for the accelerator is automatically generated from an OpenVX program.

The principle of code generation has been applied effectively in several contexts. In the context of Domain Specific Embedded Languages, code generation is commonly used to transform high-level patterns and structures into efficient parallel code for different architectures, such as CPUs [39], GPUs [16] or DSPs [30]. Machine learning techniques can also be used to generate efficient code for a specific algorithm [34]. Another technique which is strictly connected to code generation is partial evaluation. A computer program can be modeled as a mapping of input data into output data. A new mapping (i.e., a new program) can be obtained by removing from the input space all the dimensions corresponding to static inputs that are totally known at compile time. This is the principle that we have considered in this paper, initially introduced by the first Futamura projection [13] in the context of code interpreters. In this work we have extended this basic principle in two ways, by generating the code at the runtime level of a meta-model (that is, the OpenVX program executing on the developer workstation) and by supporting the dynamic aspects of the original execution model with specific control code.

OpenVX support has been provided for different devices. The Khronos website [20] provides a sample implementation of the OpenVX specification targeting x86 architectures. VisionWorks [27] is a software development toolkit that implements the OpenVX standard, targeting CUDA-capable GPUs and SOCs. Another OpenVX implementation supports the PAAG array processor (Polymorphic Array Architecture for Graphics and image processing) [15], a polymorphous architecture specifically tailored for graphics rendering and image processing. An OpenVX framework tailored for low-power many-core accelerators has been presented in [37]. Most of these solutions are characterized by a power consumption from 500 mW up to 5 W with no specific memory limit, while the techniques described in this paper are intended for heavily-constrained mW-scale devices.

State-of-the-art mW-scale MCUs (e.g., STMicroelectronics STM32-L476 [36], SiliconLabs EFM32 [35] and Texas Instruments MSP430 [40]) already target a power budget lower than 50 mW, but they cannot guarantee high computing performance for the embedded vision domain. To bridge this gap some MCUs provide fixed-function hardware blocks [43][29] or partially programmable accelerators [8], but their programmability is very limited and they cannot support a full OpenVX framework. In this paper we have preferred a more general approach, coupling to the MCU a fully programmable accelerator able to execute diverse and complex workloads with a limited budget for silicon area and power consumption. Some multi-core MCUs [28][12] are available, but they employ multiple cores with the objective to save power by distributing heterogeneous tasks to specialized units.

Today the market offers several products that can be included in the class of mW-scale parallel accelerators, mostly in the segment of licensable IP cores. The DesignWare EV52 and EV54 processors [2] by Synopsys integrate two or four 32-bit ARC HS cores with up to eight programmable accelerators optimized for CV and convolutional neural networks. The CEVA-XM4 processor [1] is based on a general-purpose DSP, with VLIW support for up to eight parallel operations on 4,096 bits. The Cadence IVP processor [4] is based on the configurable Xtensa CPU/DSP, supporting three parallel operations on 512 bits. Considering platforms proposed by academic institutions, possible candidates for the role of multi-core accelerator are Centip3de [10] and PULP [33]. Centip3de is a cluster-based fabric of Cortex M3 cores, while PULP presents a similar design based on OpenRISC cores. These solutions rely on near-threshold and parallel computing to increase performance and energy efficiency. In this work we used PULP as a target, for two main reasons: (i) its architecture is representative of our device class, and (ii) a virtual platform with an OpenVX RTE was already available for tests and comparisons. To the best of our knowledge, no optimized OpenVX support is provided for any platform including an MCU and a mW-scale parallel accelerator.

6. CONCLUSIONS

In this paper we propose an alternative and novel approach to provide OpenVX support in heterogeneous systems including an MCU and a parallel accelerator. Our main contributions are an extension to the original OpenVX model to support static management of application graphs, and the definition of the milliVX specification, which provides a lightweight support to execute static graphs in a resource-constrained environment, without renouncing the dynamic features provided by OpenVX. Experimental results show that our approach drastically reduces the memory footprint (-68%) and the required bandwidth (-10%). Moreover, there is an average 3× execution speed-up and a 2× energy efficiency improvement compared to a baseline implementation.

From a theoretical point of view our approach is fully scalable w.r.t. the number of nodes and processing elements in the system, with the only limitation given by the system resources. Our future work will be focused on two main aspects. First, specific LTO passes using advanced heuristics could further reduce the execution time with limited effects on the code footprint. Second, we will consider more complex architectures with multiple IoT nodes and/or accelerator clusters, including in the platform model RTE-to-RTE communication channels.

7. ACKNOWLEDGEMENTS

This work is supported by the EU FP7 ERC Advanced project MULTITHERMAN (g.a. 291125) and by the IcySoC and YINS RTD projects, evaluated by the Swiss NSF and funded by Nano-Tera.ch with Swiss Confederation financing.

8. REFERENCES

[1] CEVA-XM4 Intelligent Vision Processor. http://www.ceva-dsp.com/CEVA-XM4.
[2] DesignWare EV Family of Vision Processors. https://www.synopsys.com/dw/ipdir.php?ds=ev52-ev54.
[3] IoT - From Research and Innovation to Market Deployment. http://www.internet-of-things-research.eu/.
[4] Tensilica Customizable Processor IP. http://ip.cadence.com/ipportfolio/tensilica-ip.
[5] ABIresearch. Edge Analytics in IoT. https://www.abiresearch.com/market-research/product/1021642-edge-analytics-in-iot/.
[6] ABIresearch. More Than 30 Billion Devices Will Wirelessly Connect to the Internet of Everything in 2020. https://www.abiresearch.com/press/more-than-30-billion-devices-will-wirelessly-conne/.
[7] S. Banerjee and D. O. Wu. Final report from the NSF workshop on future directions in wireless networking. 2013.
[8] Carnegie Mellon University. CMUcam. http://www.cmucam.org/.
[9] A. Y. Dogan et al. Power/performance exploration of single-core and multi-core processor approaches for biomedical signal processing. In Integrated Circuit and System Design. Power and Timing Modeling, Optimization, and Simulation, pages 102–111. Springer, 2011.
[10] D. Fick et al. Centip3De: A 3930DMIPS/W configurable near-threshold 3D stacked system with 64 ARM Cortex-M3 cores. In IEEE International Solid-State Circuits Conference Digest of Technical Papers, pages 190–192. IEEE, 2012.
[11] P. Flatresse et al. Ultra-wide body-bias range LDPC decoder in 28nm UTBB FDSOI technology. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International, pages 424–425. IEEE, 2013.
[12] Freescale. MC9S12XDP512 Datasheet. Rev. 2.21.
[13] Y. Futamura. Partial evaluation of computation process - an approach to a compiler-compiler. Higher-Order and Symbolic Computation, 12(4):381–391, 1999.
[14] Gartner. Gartner Says the Internet of Things Installed Base Will Grow to 26 Billion Units By 2020. http://www.gartner.com/newsroom/id/2636073.
[15] Z. Guo, J. Han, and T. Li. Implementing OpenVX on a polymorphous array processor. In 2015 IEEE 16th International Conference on Communication Technology (ICCT), pages 598–601, 2015.
[16] J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing, pages 311–320. ACM, 2012.
[17] I. Kadayif and M. Kandemir. Data Space-oriented Tiling for Enhancing Locality. ACM Trans. Embed. Comput. Syst., 4(2):388–414, 2005.
[18] Khronos Group. OpenVX resources. http://www.khronos.org/openvx/resources.
[19] Khronos Group. The OpenCL 1.1 Specifications. http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf.
[20] Khronos Group. The OpenVX API for hardware acceleration. http://www.khronos.org/openvx.
[21] A. Kukanov and M. J. Voss. The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks. Intel Technology Journal, 11(4), 2007.
[22] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.
[23] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.
[24] E. A. Lee and D. G. Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. Computers, IEEE Transactions on, 100(1):24–35, 1987.
[25] M. Magno et al. Multimodal abandoned/removed object detection for low power video surveillance systems. In Advanced Video and Signal Based Surveillance, Sixth IEEE International Conference on, pages 188–193. IEEE, 2009.
[26] R. Membarth et al. HIPAcc: A domain-specific language and compiler for image processing. IEEE Transactions on Parallel and Distributed Systems, (99):1–14.
[27] NVIDIA. NVIDIA Jetson TX1 Supercomputer-on-Module Drives Next Wave of Autonomous Machines. http://devblogs.nvidia.com/parallelforall/nvidia-jetson-tx1.
[28] NXP. LPC5410x Datasheet. Rev. 2.1.
[29] J. Oh, S. Lee, and H.-J. Yoo. 1.2-mW Online Learning Mixed-Mode Intelligent Inference Engine for Low-Power Real-Time Object Recognition Processor. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, pages 921–933, 2013.
[30] M. Püschel et al. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232–275, 2005.
[31] J. Ragan-Kelley et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Volume 48, pages 519–530. ACM, 2013.
[32] E. Rainey et al. Addressing System-Level Optimization with OpenVX Graphs. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 658–663. IEEE, 2014.
[33] D. Rossi et al. PULP: A Parallel Ultra-Low-Power Platform for Next Generation IoT Applications. In HotChips 2015.
[34] E. Rosten, R. Porter, and T. Drummond. Faster and better: A machine learning approach to corner detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(1):105–119, 2010.
[35] SiliconLabs. EFM32G210 Datasheet. Rev. 1.90.
[36] STMicroelectronics. STM32L476xx Datasheet. Rev. 2.
[37] G. Tagliavini, G. Haugou, and L. Benini. Optimizing memory bandwidth in OpenVX graph execution on embedded many-core accelerators. In Design and Architectures for Signal and Image Processing (DASIP), 2014 Conference on, pages 1–8. IEEE, 2014.
[38] G. Tagliavini, G. Haugou, A. Marongiu, and L. Benini. ADRENALINE: An OpenVX Environment to Optimize Embedded Vision Applications on Many-core Accelerators. In IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pages 289–296, 2015.
[39] A. T. Tan, J. Falcou, D. Etiemble, and H. Kaiser. Automatic task-based code generation for high performance domain specific embedded language. International Journal of Parallel Programming, pages 1–17, 2014.
[40] Texas Instruments. MSP430F161 Datasheet. Rev. G.
[41] P. Virouleau et al. Evaluation of OpenMP dependent tasks with the KASTORS benchmark suite. In Using and Improving OpenMP for Devices, Tasks, and More, pages 16–29. Springer, 2014.
[42] T. Yang and A. Gerasoulis. DSC: scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951–967, 1994.
[43] J.-S. Yoon et al. A Unified Graphics and Vision Processor With a 0.89 μW/fps Pose Estimation Engine for Augmented Reality. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 21(2):206–216, 2013.