Real-Time Image Processing manuscript No. (will be inserted by the editor)

Giuseppe Tagliavini · Germain Haugou · Andrea Marongiu · Luca Benini Optimizing Memory Bandwidth Exploitation for OpenVX Applications on Embedded Many-Core Accelerators

Received: date / Revised: date

Abstract In recent years image processing has been a width, even when the main memory bandwidth for the key application area for mobile and embedded computing accelerator is severely constrained. platforms. In this context, many-core accelerators are a viable solution to efficiently execute highly-parallel ker- Keywords OpenVX, OpenCL, embedded computer nels. However architectural constraints impose hard lim- vision, bandwidth reduction, many-core accelerators its on the main memory bandwidth, and push for soft- ware techniques which optimize the memory usage of complex multi-kernel applications. In this work we propose a set of techniques, mainly 1 Introduction based on graph analysis and image tiling, targeted to accelerate the execution of image processing applica- The evolution of imaging sensors and the growing re- tions expressed as standard OpenVX graphs on cluster- quirements of applications are pushing hardware plat- based many-core accelerators. We have developed a run- form developers to incorporate advanced image process- time framework which implements these techniques us- ing capabilities into a wide range of embedded systems, ing a front-end compliant to the OpenVX standard, and ranging from smartphones to wearable devices. In par- based on an OpenCL extension that enables more ex- ticular, we have focused our attention on three main plicit control and efficient reuse of on-chip memory and classes of computationally intensive image processing greatly reduces the recourse to off-chip memory for stor- tasks: Embedded Computer Vision (ECV) [14], brain- ing intermediate results. Experiments performed on the inspired visual processing [37], and computational pho- STHORM many-core accelerator demonstrate that our tography [21]. approach leads to massive reduction of time and band- Considering the actual market trend toward HD for- mats and real-time video analysis, these algorithms re- quire hardware acceleration. Pushed by the need for This work has been supported by the EU-funded research projects P-SOCRATES (g.a. 611016) and extreme energy efficiency, embedded systems are em- MULTITHERMAN (g.a. 291125). bracing architectural heterogeneity, where a multi-core host processor is coupled with programmable accelera- Giuseppe Tagliavini Department of Electrical Electronic and Information Engi- tors specialized for various application domains. Several neering (DEI), University of Bologna, Italy companies and research groups are looking for alterna- E-mail: [email protected] tive solutions, ranging from using special functional units Germain Haugou on the host CPU [39], to embedded GPGPUs [32], to Integrated System Laboratory, ETH Zurich, Switzerland dedicated vector units [35], to many-core accelerators [4]. E-mail: [email protected] Many-core accelerators provide tens to hundreds of Andrea Marongiu small processing elements (PEs), typically organized in Integrated System Laboratory, ETH Zurich, Switzerland clusters sharing on-chip L1 memory and communicat- Department of Electrical Electronic and Information Engi- ing via low-latency, high-throughput on-chip intercon- neering (DEI), University of Bologna, Italy E-mail: [email protected] nections. Some examples of accelerators featuring this architectural paradigm are STM STHORM [4], Plurality Luca Benini Integrated System Laboratory, ETH Zurich, Switzerland HAL [38], KALRAY MMPA [24], Adapteva Epiphany- Department of Electrical Electronic and Information Engi- IV [1] and PULP [10]. 
In these architectures the PEs are neering (DEI), University of Bologna, Italy simpler w.r.t. common multi-core architectures and of- E-mail: [email protected] fer a good trade-off between highly parallel computation 2 and power consumption, so they are a promising target Overall DMA is difficult to use, but it can be man- for running image processing workloads. aged quite easily if its usage is limited to single frame Many-core accelerators differ from GPGPUs in two copies. However this approach cannot be generalized, main traits. First, PEs are not restricted to run the same because (i) there is not enough internal memory even instruction on different data, in an effort to improve exe- to keep a single image, and (ii) the memory band- cution efficiency of branch-rich computations and to sup- width is easily saturated in presence of multiple ker- port a more flexible workload-to-PE distribution. Sec- nels. DMA can be used for tiled transfers, using the ond, embedded many-core accelerators do not rely on async work group copy function to handle asynchronous massive multithreading to hide memory latency, but they copies between global and local memory and vice versa. rely instead on DMA engines and double buffering, which In this case programmers need to interleave DMA and give more control on the bandwidth vs. latency trade-off, computation and this gets much more complicated. For but require more programming effort to be managed. basic kernels, the lines of code used for DMA orches- From the software viewpoint, in the last years a num- tration and subsequent workload distribution are over ber of programming models for GPGPUs and many- 50% of the total [2]. Considering applications composed core accelerators have been proposed [49]. Among oth- by multiple kernels, this normally implies multiple tile ers, the very successful Khronos OpenCL standard in- sizes, because each kernel may require a different tile. In troduces platform and execution models which are par- this case, the process to manage this orchestration by ticularly suitable for programming at the emerging in- hand becomes totally unmanageable, and programmers tersection between multi-core CPUs and many-core ac- need tools. This is exactly what our approach provides. celerators. OpenCL offers a conjunction of task parallel Other programming paradigms have been proposed programming model, with a run-to-completion approach, to implement image processing applications on embed- and data parallelism, supported by global synchroniza- ded systems, based on data-flow graphs [48] [15] or func- tion mechanisms. An OpenCL application runs on the tional models [40]. Overall, these solutions tackle the is- host processor and distributes kernels on computing de- sue of memory bandwidth using a tiling-based approach, vices. Kernels are programmed in the OpenCL lan- but in most cases the related execution models are not guage, which is based on the C99 standard. On the host suitable to build complex applications which include ir- side, applications are written in C/C++ language, and regular algorithms and data access patterns. invoke standard API calls to orchestrate the distribution In this work we introduce a framework that imple- and execution of kernels on devices, using a mechanism 1 ments a set of optimizations specifically targeted to ac- based on command queues. celerate the execution of graph-based image process- A common issue of using OpenCL on embedded sys- ing applications on many-core accelerators. 
The frame- tems is related to the mandatory use of global memory work front-end is based on the OpenVX standard [26]. space to share intermediate data between kernels. When OpenVX is a cross-platform API which aims at enabling increasing the number of interacting kernels, the main hardware vendors to implement and optimize low-level memory bandwidth required to fulfill data requests orig- image processing primitives, with a strong focus on mo- inated by PEs is much higher than the available one, bile and embedded systems. In our framework, data ac- causing a bottleneck. In addition, unlike accelerators for cesses are performed on local buffers in the L1 scratch- desktop computing environments (e.g. Many Inte- pad memory of the reference architecture, that is what grated Core architecture [23]), SoCs have unified host we call a localized execution. To satisfy this condition, and global memory spaces, and have a common data we have taken into account all the data access patterns path connecting host processor and accelerator with L3 that can be found in the OpenVX standard-defined ker- memory [42]. As a direct consequence, applications expe- nels (e.g., local and statistical operators), and that are rience high contention for off-chip memory access, that the most common in image processing algorithms, and may severely limit the final speed-up. then we have defined a set of techniques to support auto- For instance, we consider a platform with an accelera- matic image tiling. Coupling tiling with double-buffering, tor and a DDR3-1600 memory (6400MB/s per channel). we achieve a good overlap between data communication If we reasonably assume that the accelerator has half of and kernel execution on the accelerator, that guarantees the available bandwidth (3200MB/s), and our applica- a higher efficiency in terms of PEs usage. tion need to process a 1920x1080 video source at 60fps, The novelty of our work derives from three main con- then a single image access uses 123.12MB/s. Hence, after tributions: (i) the introduction of a low-level OpenCL accessing 26 image buffers in one frame time the available extension, that can be effectively used to support effi- bandwidth is saturated, but this could not be enough for cient execution of graph-structured workloads with ex- a complex application that instantiates many kernels and plicit DMA transfers; (ii) the automatic mapping of an requires many intermediate results. OpenVX program to a list of host kernels and OpenCL 1 The OpenCL 2.0 standard also enables dynamic paral- low-level graphs; (iii) an algorithm for computing opti- lelism on device side, but most programming environments mal tile sizes for the kernels, taking into account on-chip do not support it yet. memory size limitations, while minimizing main mem- 3

64-bit G-ANoC-L2 SoC PE0 PE1 … PEn GALS GALS GALS GALS I/F CCI I/F CCI I/F CCI I/F CCI CMA Host DMA CC CC CC CC L 2 L1 mem controller (1MB) Cluster0 … ClusterN cluster cluster cluster cluster 0 1 2 3 DDR mem controller L2 mem L3 mem 64-bit G-ANoC-L3

STxP70 STxP70 STxP70-V4B NoC NoC Fig. 1 Reference architectural template 32-KB TCDM IF IF I$ I$ DMA LOW-LATENCY INTERCONNECT FABRIC 2D SNoC links AK N BANK AK 0 BANK IOs CONTROLLER (master + slave) ory utilization. Our framework supports applications of NI Shared any complexity level, with no limitations related to data DMA 0 Tightly-Coupled access patterns. An OpenCL low-level graph is used on DMA 1 Data Memory (TCDM) the accelerator side, with the aim to achieve better tim- ing performance yet minimizing the required bandwidth. Fig. 2 STHORM architecture The use of host kernels is limited to irregular access pat- terns, and the overall orchestration is provided by the memory managed by the DMA controller, which im- framework without any hint by the programmer. plies low-latency and high-bandwidth communication. Experiments have assessed that our solution provides The platform also provides a L2 scratchpad memory at huge benefits in terms of speed-up when compared to the SoC level and an external DDR accessible by means the execution of standard OpenCL code. For instance of a memory controller. Both host cores and PEs can we achieve up to 9.6× speed-up w.r.t. OpenCL, and this access the whole memory space, that is modeled as a is mainly due to a drastic bandwidth reduction. This partitioned global address space (PGAS). reduction also implies high execution efficiency, having more than 95% of the accelerator time spent in active processing. 2.2 Hardware platform

The rest of this paper is organized as follows. In Sec- The real-life platform used to validate our framework tion 2 we present the target architectural template and from an experimental perspective is the STHORM ac- the platform used for experiments. In Section 3 we in- celerator by STMicroelectronics, previously known as troduce the OpenVX programming model. In Section 4 P2012 [4]. STHORM is a many-core computing accelera- we describe our extension to the OpenCL run-time. Sec- tor fabricated in 28nm bulk CMOS technology. Its design tion 5 explains in detail the core of our approach. We is based on clusters interconnected by an asynchronous discuss the experimental results in Section 6. In Section network-on-chip (Figure 2). The accelerator fabric also 7 we discuss the related works. Finally, we summarize includes a fabric controller (FC) core, intended to man- our conclusions and discuss future research directions in age fabric-level run-time and interaction with the host Section 8. processor. Each cluster features 16+1 cores, 16 PEs to perform general-purpose computation and a cluster con- troller (CC) to handle cluster-level run-time. All the 2 Architecture cores (FC, CC, PEs) are dual-issue STxP70 proces- sors, supporting MPMD instruction streams. Moreover, 2.1 Architectural template each cluster contains a multi-banked one-cycle access L1 scratchpad memory, connected by a multi-level logarith- Figure 1 shows the block diagram of a generic architec- mic interconnect, and a dual-channel DMA engine that tural template that can be targeted by our framework. can handle both linear and rectangular transfers. A L2 scratchpad memory is shared by all clusters, and it is This architecture consists of a general-purpose host currently used by software run-time to store accelera- processor coupled with a clustered many-core accelera- tor binaries. The architecture has no data cache, as it is tor inside an on chip (SoC) platform. designed to minimize SoC size and energy consumption. The multi-cluster design is a common solution applied To perform our experiments we have used an eval- to overcome scalability limitations in modern many-core uation board (Figure 3) which is based on ZedBoard accelerators, such as STM STHORM [4], Plurality HAL open-hardware design [51]. This board includes a Xil- [38], KALRAY MMPA [24], Adapteva Epiphany-IV [1] inx Zynq 7020 chip, featuring an ARM Cortex A9 dual and PULP [10]. The processing elements (PEs) inside core host processor operating at 667MHz plus FPGA a cluster are fully independent cores, supporting both programmable logic, and a STHORM chip clocked at SIMD and MIMD parallelism. In addition, the PEs in 430MHz. The ARM subsystem on the Zynq is connected a single cluster are tightly-coupled to a L1 scratchpad to an AMBA AXI interconnection matrix, through which 4

FPGA Table 1 OpenVX framework objects IO periph Context Container for all object instances SNoC SNoC S RAB ARM ARM Kernel Vision kernel implementation, used to create nodes 2AXI IO Graph DAG of nodes implicitly connected by data usage A9 A9 periph L1 L1 Node Instance of a kernel inside a specific graphs FPGA bridge snoop Parameter Reference to a data object used as node parameter

interconnect L2

SNoC L2 ctrl

AXI2 AXI GIC RAB M SNoC DRAM DRAM Table 2 OpenVX framework objects

STHORM STHORM SoC ctrl GPIOs GPIO Scalar Scalar type (integer, floating point, enum) ZYNQ Array Array of scalar or structured types Image Image (including one or ore data planes) STHORM BOARD Matrix MxN matrix Convolution MxN matrix with an associated scaling factor Fig. 3 STHORM evaluation board Distribution 1D or 2D histogram Pyramid Set of images with a fixed scale ratio LUT Lookup table Remap Map of source points to destination points it accesses the DRAM controller. The latter is connected Threshold Set of thresholding values to the on-board DDR3 (500MB), which is the third Delay Time-delayed set of images or arrays memory level in the system (L3) for both ARM and STHORM cores. To allow transactions generated inside the STHORM chip to reach the L3 memory, and transac- ment, defining framework objects (Table 1) and data tions generated inside the ARM system to reach internal objects (Table 2). Framework objects are model en- STHORM L1 and L2 memories, part of the FPGA area tities, while data objects are input/output parame- is used to implement an access bridge. ters which are processed at execution time. The first The FPGA bridge is clocked very conservatively step of an OpenVX program is the creation of a at 40MHz; consequently, the main memory bandwidth valid context by calling vxCreateContext, followed available to the STHORM chip is limited to 250MB/s by the declaration of the data objects which are re- for the read channel and 125MB/s for the write channel, quired as node parameters. Data objects are created via with an access latency of about 450 cycles. Compared vxCreate or retrieved via vxGet. To with ARM processor on the Zynq, which access DDR3 enforce consistency, access to data objects is regulated by through a bus at 533MHz, STHORM chip is severely pe- an acquire/release protocol, via vxAccess and nalized. Clearly, in a full production SoC scenario host vxCommit functions. Data objects exist at the and accelerator share the same silicon die, and conse- context level, they have transparent reference counts and quently the accelerator gets a much larger share of the are not destroyed until their reference count is zero. Ina main memory bandwidth (e.g.. 1/2 instead of 1/12 as any case, data objects are forcibly destroyed at context in the evaluation board). Hence, the evaluation board destruction. represents a very challenging and interesting scenario for any software optimization focusing on reducing main A key feature is the possibility to declare virtual data memory bandwidth needs. objects. These objects are not guaranteed to reside in main memory on a permanent basis, and they may have null size and undefined format (VX DF IMAGE VIRT) at declaration time. Basically, virtual data are used to set 3 OpenVX programming model a dependency between adjacent kernel nodes, and are not associated with any memory area accessible by read- OpenVX [26] is a cross-platform C-based Application /write operations. Programming Interface (API) which aims at enabling hardware vendors to implement and optimize low-level OpenVX graphs are composed of one or more nodes image processing and CV primitives. The final OpenVX that are added by calling node creation functions (in 1.0 framework specification is available as an open, the form vxCreateNode). Nodes are linked to- royalty-free standard ratified by the . gether via data dependencies, without specifying any ex- Most image processing applications can be easily struc- plicit ordering. 
The OpenVX standard defines a library tured as a set of vision kernels (i.e. basic features or algo- of predefined vision kernels which can be used to cre- rithms) that interact on the basis of input/output data ate nodes, but it also supports the definition of user de- dependencies. Considering this usage scenario, OpenVX fined kernels. The standard defines 41 predefined kernels, promotes a graph-oriented execution model, based on which are fully supported in our implementation. Directed Acyclic Graphs (DAGs) of kernel instances. Graphs must be verified calling vxVerifyGraph be- The standard introduces the software abstractions fore execution, with the aim to guarantee some manda- that are mandatory for an OpenVX execution environ- tory properties: 5

– Input and output requirements must be compliant to – A set of nodes is created and added to the graph as the node interface (data direction, data type, required instances of vision kernels (lines 12-15). vs optional flag). – The vxVerifyGraph function (line 17) checks the – No cycles are allowed in the graph. graph consistency and propagates constraints on vir- – Only a single writer node to any data object is al- tual images; for instance, the format of the virtual lowed. images defined in lines 6-7 is set to VX DF IMAGE S16, – Writes have higher priorities than reads. as required by Sobel kernel validators after a VX DF IMAGE U8 input image. During the verification stage, a validator callback – The vxProcessGraph function (line 19) executes the is called for each node parameter to verify the above- graph in synchronous blocking mode inside a loop, mentioned properties. This is a function defined at kernel that is a typical programming pattern to process an level, and it is also responsible to set dimension and for- incoming stream of input images. mat of virtual data with the aim to respect all functional constraints. Figure 4 shows the DAG derived by this program, out- Graphs can be processed as many times as needed lining input and output images of the program. after their verification. Changes are possible but require a further verification. A graph can be processed in two modes: (i) synchronous blocking mode, which blocks the program execution until the graph processing is com- pleted; (ii) asynchronous single-issue mode, which is non Fig. 4 Edge detector (DAG) blocking and enables the parallel execution of multiple graphs. To introduce OpenVX programming, we consider a basic edge detector. The OpenVX code for this applica- 4 Extended OpenCL run-time tion is shown in Listing 1. In this example, all the ref- erenced kernels are contained in the OpenVX standard. In most cases (see [4], [38], [1] and [10]) the program- Note that an OpenVX program is much more abstract ming environments for many-core accelerators support and concise than an OpenCL program, and at the same OpenCL 1.1 [25]. Figure 5 shows the comparison between time the underlying OpenCL run-time used in our ap- the OpenCL logical model and the STHORM architec- proach is completely hidden to the programmer. More- ture. over, we require no additional information from the pro- In this scenario, data must be transferred into shared grammer to guide the accelerator tuning. local memory prior to computation in order to take ad- 1 vx_context ctx = vxCreateContext(); vantage of low-latency accesses, and it has to be done 2 vx_graph graph = vxCreateGraph(); 3 vx_image imgs[] = { explicitly using OpenCL built-in functions for asyn- 4 vxCreateImage(ctx, width, height, VX_DF_IMAGE_RGB), chronous work-group copy. 
STHORM cores are hardware 5 vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_U8), mono-threaded, hence the best performance is achieved 6 vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT), 7 vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT), by exactly matching any OpenCL ND-Range with the 8 vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT), STHORM architectural parameters to avoid expensive 9 vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8), 10 }; context switches: that is, programmers should use as 11 vx_node nodes[] = { many work-groups as the number of clusters, and as 12 vxColorConvertNode(graph, imgs[0], imgs[1]), many work-items as the number of processing elements 13 vxSobel3x3Node(graph, imgs[1], imgs[2], imgs[3]), 14 vxMagnitudeNode(graph, imgs[2], imgs[3], imgs[4]), in a cluster. 15 vxThresholdNode(graph, imgs[4], thresh, imgs[5]), In this work we introduce an extended OpenCL run- 16 }; 17 status = vxVerifyGraph(graph); time (referred as CLE), which enables the creation of 18 while (/* input images? */) { low-level graphs containing nodes of different types: 19 /* capture data into imgs[0] */ 20 status = vxProcessGraph(graph); – CreateBuffer – a node that allocates a buffer at the 21 /* use data from imgs[5] */ 22 } specified memory level (L1, L3) associated with a nu- 23 vxReleaseContext(ctx); merical identifier. Listing 1 Edge detector (C code with OpenVX API calls) – CopyBuffer – a node that enqueues a DMA transfer to copy data from a source buffer to a destination The program follows these steps: buffer, specifying their numerical identifiers. – ExecKernel – a node that enqueues the execution of – A context is initially created (line 1) and then re- an OpenCL kernel. leased at the end (line 23). – ReleaseBuffer – a node that releases the specified – A graph is created (line 2). buffer. – Images are defined (lines 4-9), some of them as virtual – EndGraph – a node that triggers the end of the graph (lines 5-8). execution. 6

Fig. 5 STHORM OpenCL Mapping

For each node, the programmer has to specify the ac- Fig. 6 Execution of a single kernel on CLE run-time tual kernel parameters and a set of dependencies: if there is a dependency edge between node A and node B, node A must terminate its execution before node B can start. titioning and scheduling, that provide buffer allocation, OpenCL kernels intended for CLE graphs directly access buffer sizing and CLE graphs creation. The verification buffer parameters (identified by global access modifier stage of an OpenVX graph is performed at run-time, in the kernel source code) without managing data trans- that gives the capabilities of changing the application fers and local buffers in the kernel body. graph with an adaptive approach and supporting dy- The memory management is totally explicit, includ- namic hardware resources. To limit the time of on-line ing the allocation of the stack area used by cores to ex- partition and scheduling phases, we have privileged the ecute. The cleGraphSetBuffer primitive associates a use of heuristics algorithms, but different policies can be standard OpenCL buffer to a graph using a numerical easily plugged in. identifier, which can be used by CLE functions to ad- Overall, nodes can be executed by the host processor dress input/output data provided by the host using L3 or by the accelerator. On the host side, the kernels are memory space. Using clEnqueueGraph, a CLE graph is implemented as plain C functions, using the OpenVX pushed in OpenCL command queues for execution. Each API to access data objects (see Section 4). Our imple- graph is executed on a single cluster, while other clusters mentation for host kernels is based on the reference im- can serve different requests (different graphs or standard plementation provided by Khronos. On the accelerator OpenCL tasks). When the EndGraph node terminates, a side, we provide a set of CLE kernels which follow the notification is sent back to the host side, and the con- guidelines introduced in Section 4. All the boilerplate cerned cluster is made available. An example of a CLE code to enable localized execution is managed by the graph that includes a single kernel is depicted in Figure run-time on the basis of kernel data structures and graph 6. In real applications, images do not fit entirely in L1 verification steps. memory. In the context of our framework, CLE graphs All the presented algorithms are focused on single- contain multiple kernel nodes, and the same set of nodes cluster optimization. The framework supports the execu- is executed multiple times on complementary data sub- tion of OpenVX programs on multi-cluster accelerators sets. This mechanism is explained in greater detail in by applying two methodologies: (i) the use of different Section 5.6. clusters to compute different input sets, and (ii) the par- tition of an input set into multiple parts that can be com- puted independently. The second approach can use the function vxCreateImageFromROI to create a new image 5 Optimization framework for many core object referencing a rectangular region of another im- accelerators age. In both cases, to take effective advantage from the multi-cluster execution OpenVX programmers must or- In this work the main goal is the maximization of exe- chestrate a global execution schema using asynchronous cution efficiency of the accelerator, which is defined as single-issue mode, and then synchronize the execution the amount of time that PEs spend to execute kernel status with the vxVerifyGraph API. 
code over the total execution time. This goal implies to minimize the total waiting time due to memory trans- fers to/from L3 memory. For this purpose, virtual im- 5.1 Data access patterns ages are not allocated in L3 memory, but they are partly allocated in a set of L1 buffers managed by the frame- In an OpenVX run-time targeting the host processor of work. After the verification steps required by OpenVX an embedded platform, such as the reference implemen- semantics, we integrate a set of algorithms for graph par- tation provided by Khronos, images normally reside in 7

Fig. 8 Image tiling schema for a single kernel Fig. 7 Structure of a tiling descriptor main memory. In our framework, images which repre- sent an input/output for the graph are partitioned into smaller blocks, called tiles, to fit in L1 buffers. Using this approach, virtual images representing an intermediate result are not allocated to L3 memory, at least whenever it is possible without breaking other constraints. The al- lowed size for tiles strictly depends on the data access patterns used by kernels. To describe these patterns, we associate a tiling de- scriptor to each input/output port of OpenVX kernels. For input data, this structure specifies the minimum set of points necessary to compute an output value, in terms of both computing area and neighboring area. For the sake of illustration, we consider the common case of a single point per output value, but in some cases the in- Fig. 9 Classes of image processing kernels put set could contain more adjacent points (e.g., scaling operator), and this condition is managed by the frame- work. Figure 7 describes the structure of a tiling descrip- quire any tile overlap by construction (W = 1,H = tor. W and H are the dimensions of the computing area, 1, ∀ix[i] = 0). that is the set of points used to compute a single output B) Local neighbor operators (e.g. linear operators, mor- value; for output tiles, these values represent the mini- phological operators) compute the value of a point mum number of output points generated by a single iter- in the output image that corresponds to the input ation. x and y values describe the neighboring area, that tile. Local neighbor operators require a complete is the set of additional points contributing to the com- handling of the tile overlap, based on the parame- putation of a single output value, but that may belong ters of the kernel neighboring area (W = 1,H = to other computing or neighboring areas. 1, ∃ix[i] >= 1). The distinction between computing and neighboring C) Recursive neighbor operators (e.g. integral image) are area is really important, since it has a major impact on similar to the previous ones, but in addition they also data partitioning. The partition of input images into tiles consider the previously computed values in the out- is correct when the juxtaposition of the output tile pro- put tile. The managing of tiling is equivalent to local duces a complete output image, equivalent to execute neighbor operators, but we also need to save state the graph in one step, that is when the tiles have the data between tiles (in common cases the borders of same size of the images. In this context, x and y values previous output tiles). exactly correspond to the horizontal/vertical overlap be- D) Global operators (e.g. DFT) compute the value of tween adjacent tiles which is required to compute all the a point in the output image using the whole input points in the output tile. Taking into account a single image. In this case it is impossible to apply tiling to kernel, the size of output tiles is implied by input tiles, input data. and output tiles do not require any overlap (see Figure E) Geometric operators (e.g. affine transforms) com- 8). pute the value of a point in the output image us- Referring to the literature on image processing func- ing a non-rectangular input area. 
In the most gen- tions [45], we have identified five different classes of op- eral case, we cannot apply a classical input tiling erators (see Figure 9), which cover 100% of the OpenVX due to the generic shape of the neighboring area. For standard defined kernels: some transformations we can specify a tile defining A) Point operators (e.g. color conversion, threshold) a bounding box, even if this causes an overhead in compute the value of each output point from the cor- terms of data that are transferred and not used, and responding input point. These operators do not re- we can derive an equivalent local neighbor operator. 8

F) Statistical operators (e.g. mean, histogram) compute L3 L1 L3 L1 L3

statistical functions of image points. Tiling can be ac- V2 K3 V4 tivated on input images, and we can use a persistent

buffer to implement a reduction pattern “walking” I K1 V1 K2 K5 O through the tiles. V3 K4 V5 To handle the computation of recursive neighbor op- erators, the framework supports the use of state buffers. Each kernel implementation must specify the amount of Fig. 10 Application graph partitioning bytes needed to maintain its state across tiles, as a con- stant value or as a linear combination of the tile border size. After computing the final tile size (this will be ex- approach. Kernels of classes D and E do not support plained in detail in Section 5.4), the framework allocates any tiling scheme, consequently they are executed on the state buffers of the proper size and provide to the ker- host processor and memory boundaries are forced before nel function a pointer to the memory area (in case of and after. Additional memory boundaries are added to borders, a single pointer for each border state). Hence- switch the memory domain of image parameters, in par- forward, each node is responsible to handle the content ticular: (i) a boundary from L3 to L1 is required in input of its state buffer. To satisfy their goal, state buffers are for kernels of class A,B,C or F, (ii) a boundary from L1 persistent for a specific node over multiple tile execu- to L3 is required in output for graph final results, and tions. For instance, we have used the state buffers to im- (iii) all input/output parameters of a node must reside plement the integral image kernel. In this case the area in the same memory domain. sum of the borders is propagated to the adjacent tiles For instance, in the application graph depicted in Fig- using a state buffer with a number of elements exactly ure 10 we suppose that K3 is a statistical kernel (e.g., a equal to the border size. The first PE is responsible to histogram). Tiling cannot be used on its output image update the content of the buffer corresponding to left V4, because each input tile contributes to sparse data and bottom state. in the result set (the histogram bins). Consequently, a State buffers are also used to handle reduction pat- memory boundary is added to its output, and this in- terns when executing statistical operators. For example, cludes all the images read by K5. In this example K5 is the mean kernel can use different accumulator variables a kernel of classes A, B or C, so a memory boundary is (sum of values, number of points) for each executing core, inserted to switch back its input images to L1 domain. and after the last tile a single core is responsible to com- Finally, I and O must reside on L3 domain. pute the reduction and perform a division. At the current stage of development, the framework provides a pointer to the state buffer and some flags (first tile/last tile), and the vendor which provides an accelerated kernel must 5.3 Node scheduling implement the reduction patterns directly in the kernel code. Overall, the tiling approach is totally transparent For each sub-graph extracted at the partitioning stage, to the OpenVX final user. a node scheduling must be determined. The current ver- sion of the framework forces the processing of a graph on a single cluster, and allocates all the processing ele- 5.2 Graph partitioning ments to a single running node. Consequently, the node scheduling algorithm selects a single node at each itera- There are applications for which the resultant graph can- tion, and its final schedule is an ordered list. 
not be executed allocating all the intermediate tiles in L1 Experimental results on STHORM show that the buffers, basically for two reasons: (i) the graph contains contention for L1 memory is very limited when all the a kernel of classes D, E, or F, or (ii) the buffer sizing al- cores are active, due to the low-latency of the logarithmic gorithm fails to fit all buffers in L1 memory (see Section interconnect and the address interleaving across a large 5.5). In these cases. the graph is automatically parti- number of memory banks. In this scenario, having all the tioned into multiple sub-graphs, each one corresponding PEs executing the same kernel guarantees that prece- to a CLE graph, and the intermediate images that con- dence constraints bound to active kernel can be satisfied nect different graphs are saved into L3 memory. Hence, faster, and the time gaps in the schedule are minimized. the execution of an OpenVX graph is divided into mul- At the same time, the number of output buffers that are tiple stages at run-time level, and the tiling is applied at currently active is the lowest schedulable, accordingly to each stage independently. This process is totally trans- the policy described in Section 5.5. parent to the programmer. To compute the schedule, the algorithm in Figure 11 To support graph partitioning, a memory boundary considers an active set of nodes, that initially contains is added after a kernel of class F, as it presents a cu- all the kernels connected to the input data (head nodes). mulative output which is not compatible with a tiling In the example of Listing 1, we have a single active node 9 at each iteration, and the resulting schedule is trivial (Figure 12).

Fig. 13 Example of tile size propagation 5.4 Tile size propagation

To respect all the constraints imposed by node access but we have observed with the experiments that this as- patterns, the final tile size for each image must be de- pect does not affect the general benefits of data locality. termined taking into account the effect of redundant re- computation. To provide data to kernels which follow in the schedule, a kernel could be required to compute the same values multiple times for adjacent tiles. The final 5.5 Buffer allocation and sizing tile size for all images is computed into two passes: (i) the first pass analyzes the graph forward, simulating an The buffer allocation policy specifies the maximum num- execution (based on computed scheduling) and simulta- ber of buffers that are allocated in L1 memory and their neously collecting the tiling constraints for each kernel; association to input/output kernel image parameters. (ii) the second pass performs a backward analysis, start- The visit order used by the scheduling algorithm guaran- ing from the last simulated node, and sets the buffer tees that allocated buffers are used as soon as possible, final overlap according to all collected constraints. For and nodes that release more data references are executed instance, considering the code of Listing 1 and an output first, so that we can promote buffer reuse to save L1 tile size of 160x120, we get the results depicted in Figure memory space. 13. The Sobel 3x3 kernel specifies a neighboring area of one point (W = 1,H = 1, ∀ix[i] = 1), while other kernels 1. The number of L1 buffers that are initially allocated are simple point operators (W = 1,H = 1, ∀ix[i] = 0). is equal to the number of input images to the graph. On the backward pass, the connection between Color 2. When a kernel is added to the schedule list, we al- Convert and Sobel 3x3 adds a constraint on the inter- locate output images to buffers. If there is a buffer mediate tile B1; in practice, a tile with no overlap and that is no longer used, we reuse it, otherwise we in- a tile with a fixed overlap deal with the same image. crement the buffer count; Due to the double buffering To satisfy this inter-kernel constraint, tile B0 is enlarged policy, buffers that have been use for inputs cannot of the absolute difference between overlapping areas and be reused for outputs. the color conversion kernel is called multiple times to 3. Using a reference counter, we verify whether images re-compute the points on the borders (exactly once per are used by other nodes; if there is no further refer- including tile). ence, we can reuse it, hence we add the buffer to the When applied, redundant re-computation enforces free list. Buffers associated with graph outputs can data locality for intermediate buffers at the cost of trans- not be reused, so they are never added to the free ferring and computing the tile borders multiple times, list. In addition, a buffer usage data structure is allocated 1: while active set is not empty do for each element in the associative map, saving informa- 2:– Select from the active set the kernel tion for next steps. The graph in the example (Listing with more input dependencies; 1) requires three buffers, adding a single buffer to fulfill 3:– Append the selected kernel to the data requirements of virtual images (Figure 14). schedule list; 4:– Remove the selected kernel from the active set, and add it to the visited set; 1. The first buffer is allocated for RGB graph input. 5: if active set is empty then 2. 
The second buffer is allocated for Color Convert 6:– Compute a new active set, including output. all the nodes that are not in the 3. The first buffer enters the free list. visited set and are connected to graph 4. Sobel 3x3 can reuse the first buffer for its first out- input data or to visited nodes; put, then allocates a third buffer for its second out- 7: end if put. 8: end while 5. The second buffer enters the free list. 6. Magnitude can reuse the second buffer. Fig. 11 Node scheduling algorithm 7. First and third buffers enter the free list. 8. Threshold can reuse the third buffer for its output (the first buffer has been initially allocated to graph ColorConvert −→ Sobel3x3 −→ Magnitude −→ T hreshold input, so it cannot be reused for graph output

Fig. 12 Scheduling order for edge detector Fig. 14 Buffer allocation for edge detector 10

The buffer sizing algorithm computes the maximum buffers, and then program the DMA engine to trans- size for allocated buffers in L1 memory. The heuristic fer the first two input tiles tiles from L3 memory to L1 algorithm that is currently used is depicted in Figure buffers (1). When the first transfer is completed (2), the 15. This approach differs from the typical buffer sizing CC is notified, and then the computation of the kernels is triggered (3). The CC is notified one more time when the second transfer is completed (4), but it does not take 1:– Set the size of each buffer equal to its any immediate action at this time (c), because the PEs upper bound (maximum image size); are still executing the kernels for the first tile. When the 2:– Compute the total memory footprint of the input buffer is no more occupied by any intermediate re- buffers; 3: while total memory footprint > available L1 sults, the CC (d) is notified (5), and a new input data quota do transfer is triggered (6). When the last kernel terminates 4: if iteration is even then its execution, the CC is notified again (7), and the DMA 5:– Halve the width of each buffer when engine is programmed to transfer an output tile from L1 the result is greater than minimum tile width; buffer to L3 memory (8); in the most general case, each 6: else kernel can produce an output image, and so this block 7:– Halve the height of each buffer when could be triggered at the end of any kernel. If a DMA the result is greater than minimum tile input transfer is completed, that is our case, a new tile height; 8: end if computation is triggered (9); however, the next kernel 9: if No changes to buffer size then execution is triggered when both events have occurred. 10:– Infeasible (backtracking); 11: end if 12: end while 5.7 Nested graphs Fig. 15 Buffer sizing algorithm Our framework supports the definition of nested graphs. Each OpenVX kernel can be associated to a child graph, problem for data-flow graphs presented in [19]. Fixing a and the execution of an associated node implies its pro- point on the time axis, each buffer contains a tile of a cessing. A nested graph can be created at node execution specific image, but the buffer is unique and the referred or initialization time. When created at execution time, image changes due to the buffer reuse policy. nested graphs do not require any additional API sup- port w.r.t. standard graphs. The creation of framework and data objects is performed inside the kernel func- tion, and this enables the dynamic execution of differ- 5.6 Run-time graph ent graphs based on parameter run-time values. Nested graphs which are created at initialization time require The generation algorithm for CLE graphs interprets all two additional features, initialization callbacks and graph the decision made in previous steps to build a CLE graph parameters. for sub-graph to be executed on the accelerator. In addi- Initialization callbacks are kernel specific functions tion to preceding constraint and buffer policies, this algo- which are automatically called at node creation time. rithm applies double buffering to achieve a good overlap They can be used to create and validate a nested graph, between data transfer and kernel execution. Figure 16 in this case the node execution just includes the graph depicts the execution schema of an OpenVX application processing. This solution is less dynamic than the pre- equivalent to Listing 1. 
When considering a target archi- vious one but at the same time it is much more effi- tecture without a cluster controller processor, the same cient, since graph creation and validation are performed tasks could be performed by a thread running on the once in the application time-line, regardless of the num- host side without any loss of generality. ber of graph executions. Graph parameters are a ref- The Cluster Controller (CC) initializes the execution erence to a specific node parameter within the graph. environment (a), in particular the allocation of the L1 These parameters are created with a specific API call (vxAddParameterToGraph), and they can be modified between executions (vxSetGraphParameterByIndex). Using graph parameters no knowledge of the internal structure is required to set node parameters, and this approach is used to support nested graphs. A graph can be created in the initialization callback using exclusively virtual data types, and then a set of graph parameters is defined representing inputs and outputs at graph level. As a further step, graph parameters are associated to Fig. 16 Example of CLE run-time schedule actual parameters already defined in the context of the 11

Benchmark Nodes Accelerator Images I CANNY O (acc./host) sub-graphs (in/out/virtual) Random graph 10 / 0 1 1 / 1 / 10 Edge detector 4 / 0 1 1 / 1 / 4 Object detection 4 / 0 1 2 / 1 / 3 Super resolution 8 / 0 1 3 / 1 / 5 FAST9 4 / 0 1 1 / 1 / 3 82x62 82x62 Disparity 5 / 0 1 2 / 1 / 6 1 N 3 Pyramid 6 / 1 1 1 / 1 / 4 Canny 4 / 1 1 1 / 1 / 5 I S NM 5 ET O Optical 4 / 4 4 1 / 1 / 2 84x64 80x60 Diparity S4 20 / 0 1 2 / 1 / 28 2 P 4 Retina preproc. 165 / 0 8 1 / 4 / 120 82x62 82x62 Table 3 Details on OpenVX benchmarks Fig. 17 OpenVX graph of Canny benchmark with annotated tile size external graph node, which are passed to the initializa- tion callback. After the type match between actual pa- – Random graph is a synthetic benchmark which in- rameters and virtual data, the graph can be verified and cludes 10 morphological nodes, exposing a wider its processing is finally demanded to node execution. branching schema compared to real applications with Since the use of nested graphs implies to call a set the specific aim to stress allocation and scheduling al- of OpenVX API functions, this feature is limited to the gorithms; kernels that are executed by the host. In turn, nested – Edge detector is a basic edge detector including 4 ker- graphs can be executed (in part or totally) on the ac- nels (RGB to gray-scale conversion, Sobel 3x3 filter, celerator, and this methodology can be also applied re- magnitude and thresholding); cursively. Overall, nested graphs are a viable solution to – Object detection is an algorithm to detect objects implement OpenVX kernels that provide higher level fea- that have been abandoned/removed in a set of ad- tures, also facilitating their reuse as software components jacent video frames, and it is based on NCC back- from a software engineering perspective. ground subtraction and morphological operators (as described in [33]); – Super resolution represents the recombination phase 5.8 User-defined nodes typical of a computational photography algorithm, which is used to increase the quality of an image using OpenVX enables a programmer to specify custom nodes, multiple overlapping pictures of a scene [44]; and our framework supports this feature. On the host – FAST9 implements the FAST9 corner detection al- side, programmers have to specify the validator callbacks gorithm [43]; used at verification stage, and a data descriptor specify- – Disparity computes the stereo-matching disparity be- ing image parameters, tiling behavior and state require- tween left and right images; ments. For the kernels that are intended to execute on – Pyramid creates a set of 4 images which are derived the host side, a C function implementing the kernel must from the input one, weighted using a 5x5 Gaussian be provided. On the accelerator side, programmers must kernel and then scaled down; implement a CLE kernel that accesses image parameters – Canny implements a standard Canny edge detector directly in global memory space, without using any in- [7]; termediate local buffer (as described in Section 5). The – Optical is an implementation of the Lucas-Kanade benchmarks defined in Section 6 reference several user- algorithm [31], used to measure optical flow field for defined kernel. a set of keypoints on two adjacent video frames; – Disparity S4 is an extension of Disparity, using four shifted versions of the right image to support a wider 6 Experimental results range for disparity. 
– Retina preprocessing implements the retina prepro- We have implemented the full framework described in cessing filter described in [37]. the previous section to target the STHORM evaluation board (see Section 2.2). The framework supports the Table 3 reports some implementation details about these data access patterns described in Section 5.1, enabling benchmarks. It specifies the number of nodes executed by the execution of all the kernels included in the OpenVX accelerator and host, the number of sub-graphs executed standard. In addition, we have implemented a library of on the accelerator and the number of involved images user-defined kernels. (organized by type). Pyramid, Canny and Optical are To assess the benefits of our approach on real applica- implemented as a single host node including a nested tions, we have selected a set of benchmarks representing graph which is executed multiple times. In this case, the the main fields of image processing domain, with the ad- values reported in Table 3 are related to the cumulative dition of a single synthetic benchmark: executions of the inner graph. 12

As an example, Figure 17 depicts the OpenVX graph Speed-up w.r.t. OpenCL of Canny benchmark. The nested graph contains five 12.00 nodes, which are Sobel 3x3 (S), elementwise norm (N), 9.61 phase (P), non-maxima suppression (NM) and edge trac- 10.00 ing (ET). The last node (ET) is executed by the host, 8.00 6.73 6.00 5.04 3.86 3.46 while the other ones are scheduled on the accelerator. ET 4.00 3.12 scans the full image and inserts into a stack data struc- 2.00 ture the coordinates of the points that are over a high 0.00 threshold (with a YES status), and then the points be- tween low and high thresholds (with a MAYBE status). Then, it traverses this stack and evaluates the MAYBE points considering the status of their neighborhood. Hav- ing a low number of edge points w.r.t. the full image size, this kernel allows to limit the memory bandwidth in the Fig. 18 Speed-up of OVX CLE w.r.t. standard OpenCL ap- final algorithm stage, and it is computed by the host due proach to its irregular access pattern. The images are annotated with the tile size computed by our algorithm (see Sec- Bandwidth reduction w.r.t. OpenCL tion 5.4), that is not the same for all nodes. Tile 5 is the output of the accelerator sub-graph, and its size (80×60) 12.00 is an exact multiple of the full output image (640×480). 10.00 8.00 Tiles 1-5 are two points wider (82×62) to support the 6.00 application of non-maxima suppression kernel, which re- 4.00 quires a 3×3 window. Tile I is even wider (84×64), with 2.00 the aim of providing proper input to Sobel kernel in ac- 0.00 cordance with its output size. The host kernel (ET) reads the full graph output to produce the final image O , di- rectly accessing L3 memory through Cortex-A9 cache hierarchy. This type of result cannot be obtained by fus- ing kernels and optimally tiling the generated loop, as in the most general case tiling requirements of different ker- Fig. 19 Bandwidth reduction using OVX CLE nels are not homogeneous. Overall, the use of OpenVX graphs enables a global level of optimization which is not possible under a single-function paradigm. implement complex applications, nevertheless it is often disattended by many tools. Retina preprocessing is a brain-inspired visual pro- Using our approach, the orchestration of multiple ker- cessing algorithm described in [37]. It is composed by nel nodes and accelerator sub-graphs is totally transpar- 24 building blocks, which can be represented with an ent to the programmer. In Optical, a single host ker- OpenVX graph including both standard and user-defined nel executes a nested graph multiple times. In Retina nodes. The final graph size is much larger than typical preprocessing, the presence of statistical kernels used for OpenVX applications, but we have included it to give an image normalization (mean and standard deviation) in- idea of how our approach scales very well also to future duces the partition of the OpenVX graph into eight sub- applications that will emerge when OpenVX will start to graphs at CLE run-time level. In this specific case, all the be heavily utilized in industry. In our implementation, referenced kernels are member of classes A, B and C (see some of these blocks (complementary, sum, subtraction) Figure 9), and our framework orchestrates the execution are directly mapped on OpenVX nodes, while the other on the accelerator to get the maximum speed-up. ones have been implemented as the composition of mul- All tests are performed with an input image size of tiple nodes. 
In particular, we have defined two functions 640×480 pixels. This setup is sufficient to already see the to describe specific building blocks (retina opponency effects of memory bandwidth, due to the limited main and LGN opponency), and another one (LGN) to instan- memory bandwidth available for the STHORM acceler- tiate a sequence of identical blocks that are recurring in ator on the evaluation board. the last algorithm stage. Each function call adds a set of nodes to the OpenVX graph, and overall they are in- voked multiple times to build the full application. This 6.1 Comparison with OpenCL example demonstrates a major benefit of our approach, that is the support to composability. Our framework pro- Figure 18 shows the speed-up of the OpenVX accelerated vides an extendable set of software components which versions w.r.t. the same applications implemented on the can be assembled in various combinations to satisfy spe- standard OpenCL 1.1 run-time provided by STHORM cific functional requirements. This is a key feature to software environment. “OVX CLE” denotes the version 13

Accelerator Efficiency Required bandwidth 100 10000 90 922 1391 80 779 70 1000 290 307 307 359 199 60 71 71 100 34 36 38 44 31 50 24 22 40 MB/s 15 15 18 10 8 8 30 Wait time 20 Executing time 1 10 0

OVX CLE OpenCL Available BW Fig. 21 Breakdown analysis of accelerator efficiency Fig. 20 Bandwidth required by applications for both OVX Accelerator Efficiency CLE and OpenCL 100 90 80 executed using our framework. Each OpenCL application 70 is built using a library of image processing kernels, with 60 Wait time the aim to mimic the component-like approach promoted 50 Executing time by OpenVX. 40 Figure 19 depicts the L3 bandwidth reduction, which 30 is computed as the ratio between the amount of 20 data transferred by the two implementations (OVX 10 CLE/OpenCL). Since each OpenCL kernel copies its out- 0 puts to L3 memory in order to pass data to the next Edge detector Retina preproc. one, the trend of the speed-up is closely related to the L3 Fig. 22 Efficiency of Edge detector and Retina preprocessing bandwidth reduction. This is particularly evident for Op- with memory bridge at 400MHz (simulation) tical, which is composed by a single kernel (Scharr 3×3) executed multiple times. In this case we get no advantage from bandwidth reduction, and the related speed-up is 6.2 Execution efficiency very close to one. This is an expected result, because the baseline OpenCL implementation exploits parallelism as Figure 21 depicts the execution efficiency related to effectively as our run-time if main memory effects are benchmarks’ execution on the accelerator, that is com- not important. puted as the percentage of total graph execution time. Figure 20 reports the bandwidth requirements of In execution time, cores are actually executing kernel in- the benchmarks for both OVX CLE and OpenCL. structions, while the wait time is the time required for These values have been computed as a ratio between transfers and not overlapped to the execution time. the total transferred bytes and the related computa- Edge detector and Retina preprocessing are charac- tion time required by benchmarks. This is an upper- terized by a significant level of inefficiency. In both cases bound case, which suppose a total overlap between the wait time is high, and the execution on the accel- data transfer and computation. The bandwidth required erator is dominated by data transfer times (see Figure by an the OpenCL application exceeds the available 20). In Edge detector this is due to the low computa- one in four cases (Edge detector, Disparity, Retina tional intensity of the algorithms. In Retina preprocess- preprocessing, Disparity S4), limiting the speed-up ing this effect is related to the high number of statistical of the OpenCL solution. In OVX CLE benchmarks, kernels that imply multiple splits in the graph, and the the bandwidth threshold is exceeded twice for a limited traffic to/from L3 memory is increased w.r.t. an appli- amount (Edge detector, Retina preprocessing), and cation with the same number of nodes and a single run- this effect impacts on execution efficiency (see next sec- time graph. These outcomes are not limitations of our tion). framework, since in both cases it guarantees the max- All the measures do not include the initialization time imum overlap for data transfers and computation, but for OpenVX and OpenCL contexts. In OVX CLE, the they are due to the limited accelerator bandwidth of algorithms which verify the graph are polynomial and the STHORM evaluation board. As explained in Sec- the computational complexity is O(n ∗ e), where n and tion 2.2, this is a worst case scenario compared to a fully e are the number of nodes and edges in the CLE graph. 
None of the reported measurements includes the initialization time of the OpenVX and OpenCL contexts. In OVX CLE, the algorithms which verify the graph are polynomial, with computational complexity O(n * e), where n and e are the numbers of nodes and edges in the CLE graph. These algorithms are executed on the host side, and they have not shown any relevant impact on the execution of the selected benchmarks.

6.2 Execution efficiency

Figure 21 depicts the execution efficiency of the benchmarks running on the accelerator, computed as a percentage of the total graph execution time. During the executing time the cores are actually executing kernel instructions, while the wait time is the time required by data transfers that is not overlapped with execution.

[Fig. 21 Breakdown analysis of accelerator efficiency (executing time vs. wait time)]
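Stated explicitly (again only restating the definition above, with symbol names of our choosing), the efficiency plotted in Figures 21 and 22 is the share of the total graph execution time spent executing kernel code:

\[
\mathrm{Efficiency} = \frac{T_{\mathrm{exec}}}{T_{\mathrm{exec}} + T_{\mathrm{wait}}} \times 100\%,
\]

where T_exec is the executing time and T_wait is the non-overlapped transfer time.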

Edge detector and Retina preprocessing are characterized by a significant level of inefficiency. In both cases the wait time is high, and the execution on the accelerator is dominated by data transfer times (see Figure 20). In Edge detector this is due to the low computational intensity of the algorithms. In Retina preprocessing this effect is related to the high number of statistical kernels, which imply multiple splits in the graph; as a result, the traffic to/from L3 memory is increased with respect to an application with the same number of nodes and a single run-time graph. These outcomes are not limitations of our framework, which in both cases guarantees the maximum overlap between data transfers and computation; they are due to the limited accelerator bandwidth of the STHORM evaluation board. As explained in Section 2.2, this is a worst-case scenario compared to a fully integrated SoC, where the accelerator will have a much greater share of the L3 bandwidth. Nevertheless, we can simulate a more realistic scenario using the simulation tool in the STHORM SDK. Figure 22 reports the efficiency of Edge detector and Retina preprocessing when the L3 memory interface is clocked at 400MHz. As expected, the wait time is drastically reduced in both cases. This approach is equivalent to moving up the reference line in Figure 20.

[Fig. 22 Efficiency of Edge detector and Retina preprocessing with memory bridge at 400MHz (simulation)]

6.3 Comparison with similar tools

KernelGenius [30] is a tool that enables the high-level description of vision kernels using a custom programming language. Starting from an application described as a set of vision kernels, KernelGenius aims at generating an optimized OpenCL kernel for the STHORM platform, with a totally transparent management of the DMA data transfers. The structure of the tiling problem for a single kernel is analogous to the formulation we are using in this work, but there is no support for most data patterns (classes C, D, E and F of Figure 9). Table 4 reports a comparison between OVX CLE and KernelGenius for a subset of our benchmarks that are compatible with KernelGenius' current limitations.

Table 4 Comparison with KernelGenius

Benchmark           KernelGenius (ms)   OVX CLE (ms)
Sobel 3×3           22.2                20.42
Convolution 5×5     30.1                25.58
Erode/Dilate 5      25.6                12.84
FAST9               5.4                 5.9
Canny               182.0               58.2

Halide [40] is a tool specifically designed to describe image processing pipelines, with a strong focus on computational photography. To implement an algorithm with Halide, the programmer must specify a functional description using a domain-specific language embedded in Python or C++. Halide defines a model based on stencil pipelines, with the aim of finding a trade-off between locality, exploitation of parallelism and redundant re-computation. Compared with our approach, Halide presents the following limitations:
– Irregular algorithms and data patterns are not supported by the language.
– Composability of software modules is limited, as programmers can only express single algorithms or independent pipelines.
– Schedule and tile size are always explicit; there is no default mechanism to implicitly select the best choice for a specific target.

Halide targets high-end platforms such as Intel multi-cores and high-performance GPGPUs, while OVX CLE is focused on many-core accelerators for the embedded market. Since Halide and OVX CLE do not share a common target platform, a comparison based on benchmark timings cannot be reported.

HIPAcc [34] is a framework for image processing that includes a domain-specific language embedded in C++ and a source-to-source compiler. Its model of computation includes a reduction pattern, which can be used to compute statistical and recursive neighbor operators. Host code can be used to implement other unsupported patterns, so the coverage is equivalent to our solution. HIPAcc supports multiple architectural targets, generating the image processing code through a visit of the Abstract Syntax Tree (AST) using the compiler front-end. This approach is different from the one promoted by OpenVX, which is based on run-time steps (a minimal code sketch of these steps follows the list below). Some relevant differences are:
– An OpenVX graph can be modified at run-time based on external parameters, while HIPAcc code has to be completely re-compiled.
– An OpenVX framework can transparently support load balancing using dynamic resource management, while in HIPAcc this feature is not target agnostic and must be guided by the programmer.
– The validation step in OpenVX programs requires an initial overhead, but it simplifies debugging and enables additional features (such as nested levels of execution, which are not supported by HIPAcc).
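To make these run-time steps concrete, the following minimal sketch builds, verifies and executes a two-node graph using the standard OpenVX 1.x C API. It is an illustrative example only, not code from our framework; the 640×480 image size and the choice of Sobel/magnitude nodes are arbitrary.

#include <VX/vx.h>
#include <stdio.h>

int main(void)
{
    vx_context ctx  = vxCreateContext();
    vx_graph  graph = vxCreateGraph(ctx);

    /* Input/output images live in the context; intermediates are virtual
     * images, so an implementation may keep them in on-chip memory. */
    vx_image in  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image gx  = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_S16);
    vx_image gy  = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_S16);
    vx_image mag = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_S16);

    /* Graph construction: nodes and parameters can be chosen at run-time. */
    vxSobel3x3Node(graph, in, gx, gy);
    vxMagnitudeNode(graph, gx, gy, mag);

    /* Validation step: the host-side checks discussed in Section 6.1. */
    if (vxVerifyGraph(graph) != VX_SUCCESS) {
        fprintf(stderr, "graph verification failed\n");
        return 1;
    }

    vxProcessGraph(graph);      /* execution on the accelerator */

    vxReleaseImage(&in);  vxReleaseImage(&gx);
    vxReleaseImage(&gy);  vxReleaseImage(&mag);
    vxReleaseGraph(&graph);
    vxReleaseContext(&ctx);
    return 0;
}

Everything before vxVerifyGraph can be driven by run-time parameters, which is precisely the flexibility contrasted above with HIPAcc's compile-time code generation.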
Both Halide and HIPAcc include OpenCL in their target list, but the generated code is not intended to run on a many-core accelerator. Future extensions of these works would enable us to compare them with our OpenVX framework.

7 Related Works

The role of OpenVX for performance optimization has been introduced in [41]. The authors are active actors of the standardization process, and they discuss how OpenVX run-times could provide both kernel-level and system-level optimizations. In the context of embedded vision systems, many FPGA-based solutions have been presented in recent years (e.g., [6], [29] and [18]). We can also find several examples of domain-specific architectures. CHARM [9] and AXR-CMP [8] propose frameworks for composable accelerators assembled from accelerator building blocks with dedicated DMA engines, while NeuFlow [15] is a special-purpose data-flow processor tailored for vision algorithms. Both FPGA and domain-specific architectures have emerged to satisfy demands for power-efficient and high-performance multi-processing. Our solution is based on a general-purpose accelerator, using a programming model which takes into account the specific needs of the CV domain while still promoting a general-purpose programming paradigm. Darkroom [22] is a tool that synthesizes hardware descriptions for ASIC or FPGA, or optimized CPU code, using an optimally scheduled pipeline. Its tiling approach is similar to our solution, but there are some limitations: it considers just two access patterns (point-wise and stencil), which are the only ones compatible with its pipelined execution.

OpenCV [36] is an open-source and cross-platform library featuring high-level APIs for Computer Vision. OpenCV is the de-facto standard in desktop computing environments; its mainstream version is optimized for multi-core processors, but it is not suitable for acceleration on embedded many-core systems. Some vendors provide accelerated versions of OpenCV which have been optimized for their hardware (e.g. OpenCV for Texas Instruments embedded platforms [11] or OpenCV for NVIDIA Tegra [47]). As an alternative to OpenCV, Qualcomm provides a specific library for ECV which includes the most frequently used vision processing functions. This library is called FastCV [39], and it is optimized for ARM-based processors and tuned to take advantage of Qualcomm's Snapdragon processors. As a matter of fact, OpenCV needs a lower-level middleware for accelerating image processing primitives. This is precisely the goal of OpenVX, which aims at providing a standardized set of accelerated primitives, thereby enabling platform-agnostic acceleration.

OpenCL [46] is a very widespread programming environment for both many-core accelerators and GPGPUs, and it is supported by an increasing number of heterogeneous architectures; for instance, Altera supports OpenCL on its FPGA architecture [12]. The OpenCL memory model is too constrained for SoC solutions, hence many extensions have been proposed by vendors. For instance, AMD provides a zero-copy mechanism to share data between host and GPU in Fusion APU products, also enabling access to GPU local memory from the host side through a unified north-bridge with full cache coherence [5]. In a many-core accelerator we need even more control on data allocation, because cores are not working in lock-step. In addition, we need the possibility to map the logical global space at different levels of the memory hierarchy, to efficiently maintain state between kernels.

There are well-known extensions of imperative programming languages which provide mechanisms to execute code on accelerators; it is worth mentioning OpenMP [3], OpenACC [50] and Sequoia [16]. With these approaches, the compiler usually performs most optimizations at loop level. Conversely, vision graphs assume the form of a sequence of loop nests, potentially located in different compile units. The connections among kernels are not taken into account, and accordingly applications may not be optimized at a global level, considering issues like data locality. Our focus is limited to the image processing domain, since in this context we can define an application with a component-based approach, and most details are implicit (including the memory transfers). Moreover, the above are general-purpose parallel programming environments, while in this work we take the route of domain-specific languages, since image processing is a large and very active domain that justifies this kind of approach [27].

Graph-structured program abstractions have been studied for years in the context of streaming languages (e.g. StreamIt [48]). In these approaches, static graph analysis enables stream compilers to optimize data locality by interleaving computation and communication between nodes. However, most research has focused on 1D streams, while image processing kernels can be modeled as programs on 2D and 3D streams. The model of computation required by image processing is also more constrained than general streams, because it is characterized by specific data access patterns. Some good results have been achieved with special-purpose data-flow processors targeted at vision algorithms (e.g. NeuFlow [15]). Our model takes into account the specific characteristics of the image processing domain, but we target a general-purpose accelerator.

Stencil kernels are a class of algorithms applied to multi-dimensional arrays, in which an output point is updated with weighted contributions from a subset of neighbor input points (called window or stencil). Our definition of tiles is equivalent to a 2D stencil. Many optimization techniques have been proposed to execute stencil kernels on multi-core platforms [13], but an effective solution for many-core accelerators executing heterogeneous vision kernels has not been proposed yet. Such a solution has to consider all the data access patterns specific to this domain, handling the possible overlapping of input windows and providing a solution for the access patterns that cannot be properly described in terms of stencil computation.
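To fix ideas, the following plain-C sketch applies a generic 3×3 stencil to a 2D buffer. It is only an illustration of the access pattern, not code from our run-time; the function name, data types and border policy (the outermost pixels are simply skipped) are arbitrary choices.

#include <stddef.h>

/* Apply a 3x3 stencil: each output pixel is the weighted sum of the
 * 3x3 input window centered on it. Border pixels are left untouched. */
void stencil_3x3(const unsigned char *in, short *out,
                 size_t width, size_t height, const int weights[3][3])
{
    for (size_t y = 1; y + 1 < height; y++) {
        for (size_t x = 1; x + 1 < width; x++) {
            int acc = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    acc += weights[dy + 1][dx + 1] *
                           in[(y + dy) * width + (x + dx)];
            out[y * width + x] = (short)acc;
        }
    }
}

When such a kernel is executed tile by tile, each tile must also load a one-pixel border from its neighbors, which is exactly the overlapping of input windows mentioned above.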
Many previous works have proposed specific optimizations for architectures with explicitly managed scratchpad memories, such as the Cell BE [28, 20, 17]. In this paper we propose a totally transparent approach for the well-defined scope of image processing applications. We do not introduce any new programming model which programmers should learn in addition to the standard OpenVX API, and we also manage all the cases in which a localized execution is not directly possible.

8 Conclusions and future works

In this paper we have presented a framework which aims at improving the execution efficiency of image processing algorithms on many-core accelerators. The proposed solution applies a set of algorithmic steps to map an OpenVX application into an OpenCL low-level graph. Experimental results on the STHORM platform show that our approach provides huge benefits in terms of speed-up, considering both a sequential version and an accelerated OpenCL version, and also in terms of execution efficiency and bandwidth reduction.

Our future work will focus on the study of new techniques to automatically split and execute a vision graph over multiple accelerators. Moreover, we are planning to support different architectural targets (GPU and FPGA).
References

1. Adapteva, Inc (2015) Epiphany-IV 64-core 28nm Microprocessor. http://www.adapteva.com/products/silicon-devices/e64g401/
2. Agosta G, Barenghi A, Pelosi G, Scandale M (2014) Towards Transparently Tackling Functionality and Performance Issues across Different OpenCL Platforms. In: Computing and Networking (CANDAR), 2014 Second International Symposium on
3. Ayguadé E, Badia RM, Bellens P, Cabrera D, Duran A, Ferrer R, Gonzàlez M, Igual F, Jiménez-González D, Labarta J, et al (2010) Extending OpenMP to survive the heterogeneous multi-core era. International Journal of Parallel Programming
4. Benini L, Flamand E, Fuin D, Melpignano D (2012) P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator. In: Design, Automation Test in Europe Conference Exhibition (DATE)
5. Boudier P, Sellers G (2011) Memory System on Fusion APUs. AMD Fusion Developer Summit
6. Canis A, Choi J, Aldham M, Zhang V, Kammoona A, Anderson JH, Brown S, Czajkowski T (2011) LegUp: high-level synthesis for FPGA-based processor/accelerator systems. In: Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
7. Canny J (1986) A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence
8. Cong J, Liu C, Ghodrat MA, Reinman G, Gill M, Zou Y (2011) AXR-CMP: Architecture Support in Accelerator-Rich CMPs
9. Cong J, Ghodrat MA, Gill M, Grigorian B, Reinman G (2012) CHARM: A Composable Heterogeneous Accelerator-rich Microprocessor. In: Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design
10. Conti F, Rossi D, Pullini A, Loi I, Benini L (2014) Energy-efficient vision on the PULP platform for ultra-low power parallel computing. In: 2014 IEEE Workshop on Signal Processing Systems (SiPS)
11. Coombs J, Prabhu R (2011) OpenCV on TI's DSP+ARM platforms: Mitigating the challenges of porting OpenCV to embedded platforms. Texas Instruments
12. Czajkowski TS, Aydonat U, Denisenko D, Freeman J, Kinsner M, Neto D, Wong J, Yiannacouras P, Singh DP (2012) From OpenCL to high-performance hardware on FPGAs. In: 22nd International Conference on Field Programmable Logic and Applications (FPL)
13. Datta K, Kamil S, Williams S, Oliker L, Shalf J, Yelick K (2009) Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Review
14. Embedded Vision Alliance (2015) http://www.embedded-vision.com/
15. Farabet C, Martini B, Corda B, Akselrod P, Culurciello E, LeCun Y (2011) NeuFlow: A runtime reconfigurable dataflow processor for vision. In: 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
16. Fatahalian K, Horn DR, Knight TJ, Leem L, Houston M, Park JY, Erez M, Ren M, Aiken A, Dally WJ, et al (2006) Sequoia: programming the memory hierarchy. In: Proceedings of the 2006 ACM/IEEE conference on Supercomputing
17. Franceschelli A, Burgio P, Tagliavini G, Marongiu A, Ruggiero M, Lombardi M, Bonfietti A, Milano M, Benini L (2011) MPOpt-Cell: A High-performance Data-flow Programming Environment for the CELL BE Processor. In: Proceedings of the 8th ACM International Conference on Computing Frontiers
18. Gehrig SK, Eberli F, Meyer T (2009) A real-time low-power stereo vision engine using semi-global matching. In: Computer Vision Systems, Springer
19. Geilen M, Basten T, Stuijk S (2005) Minimising buffer requirements of synchronous dataflow graphs with model checking. In: Proceedings of the 42nd annual Design Automation Conference
20. Gonzàlez M, Vujic N, Martorell X, Ayguadé E, Eichenberger AE, Chen T, Sura Z, Zhang T, O'Brien K, O'Brien K (2008) Hybrid access-specific software cache techniques for the Cell BE architecture. In: Proceedings of the 17th international conference on Parallel architectures and compilation techniques
21. Greengard S (2014) Computational photography comes into focus. Commun ACM
22. Hegarty J, Brunhaver J, DeVito Z, Ragan-Kelley J, Cohen N, Bell S, Vasilyev A, Horowitz M, Hanrahan P (2014) Darkroom: Compiling high-level image processing code into hardware pipelines. In: Proceedings of the 41st International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH)
23. Heinecke A, Klemm M, Bungartz H (2012) From GPGPU to Many-Core: Fermi and Intel Many Integrated Core Architecture. Computing in Science & Engineering
24. KALRAY Corporation (2015) http://www.kalray.eu/
25. Khronos Group (2015) The OpenCL 1.1 Specifications. http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
26. Khronos Group (2015) The OpenVX API for hardware acceleration. http://www.khronos.org/openvx
27. Lee H, Brown KJ, Sujeeth AK, Chafi H, Rompf T, Odersky M, Olukotun K (2011) Implementing domain-specific languages for heterogeneous parallel computing. IEEE Micro
28. Lee J, Seo S, Kim C, Kim J, Chun P, Sura Z, Kim J, Han S (2008) COMIC: a coherent shared memory interface for Cell BE. In: Proceedings of the 17th international conference on Parallel architectures and compilation techniques
29. Lei Y, Gang Z, Si-Heon R, Choon-Young L, Sang-Ryong L, Bae KM (2008) The platform of image acquisition and processing system based on DSP and FPGA. In: International Conference on Smart Manufacturing Application
30. Lepley T, Paulin P, Flamand E (2013) A Novel Compilation Approach for Image Processing Graphs on a Many-core Platform with Explicitly Managed Memory. In: Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
31. Lucas BD, Kanade T, et al (1981) An iterative image registration technique with an application to stereo vision. In: IJCAI
32. Maghazeh A, Bordoloi UD, Eles P, Peng Z (2013) General purpose computing on low-power embedded GPUs: Has it come of age? In: 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII)
33. Magno M, Tombari F, Brunelli D, Di Stefano L, Benini L (2009) Multimodal abandoned/removed object detection for low power video surveillance systems. In: Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance
34. Membarth R, Reiche O, Hannig F, Teich J, Körner M, Eckert W (2015) HIPAcc: A Domain-Specific Language and Compiler for Image Processing. IEEE Transactions on Parallel and Distributed Systems
35. Movidius Ltd (2015) Myriad 1 Mobile Vision Processor. http://www.movidius.com/our-technology/myriad-2-platform/
36. OpenCV Library Homepage (2015) http://www.opencv.com/
37. Park S, Maashri AA, Irick KM, Chandrashekhar A, Cotter M, Chandramoorthy N, Debole M, Narayanan V (2012) System-on-chip for biologically inspired vision applications. Information Processing Society of Japan: Transactions on System LSI Design Methodology
38. Plurality Ltd (2015) The HyperCore Processor. http://www.plurality.com/hypercore.html
39. Qualcomm (2015) Computer Vision (FastCV). https://developer.qualcomm.com/computer-vision-fastcv
40. Ragan-Kelley J, Barnes C, Adams A, Paris S, Durand F, Amarasinghe S (2013) Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: Proceedings of the 34th ACM SIGPLAN conference on Programming Language Design and Implementation
41. Rainey E, Villarreal J, Dedeoglu G, Pulli K, Lepley T, Brill F (2014) Addressing System-Level Optimization with OpenVX Graphs. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
42. Rogers P (2013) Heterogeneous system architecture overview. In: Hot Chips
43. Rosten E, Porter R, Drummond T (2010) Faster and better: A machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence
44. Schubert F, Schertler K, Mikolajczyk K (2009) A hands-on approach to high-dynamic-range and superresolution fusion. In: Applications of Computer Vision (WACV), 2009 Workshop on
45. Sonka M, Hlavac V, Boyle R, et al (2008) Image processing, analysis, and machine vision. Thomson, Toronto
46. Stone JE, Gohara D, Shi G (2010) OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering
47. Tegra Android Development Documentation Website (2015) http://docs.nvidia.com/tegra/index.html
48. Thies W, Karczmarek M, Amarasinghe S (2002) StreamIt: A language for streaming applications. In: Compiler Construction, Springer
49. Vajda A (2011) Programming many-core chips. Springer
50. Wienke S, Springer P, Terboven C, an Mey D (2012) OpenACC First Experiences with Real-World Applications. In: Euro-Par 2012 Parallel Processing, Springer
51. Zedboard.org (2015) ZedBoard product page. http://zedboard.org/product/zedboard