Diplomat: Mapping of Multi-Kernel Applications Using a Static Dataﬂow Abstraction

Diplomat: Mapping of Multi-kernel Applications Using a Static Dataflow Abstraction Bruno Bodin Luigi Nardi & Paul H. J. Kelly Michael F. P. O’Boyle The University of Edinburgh, Imperial College London, The University of Edinburgh, Edinburgh, United Kingdom London, United Kingdom Edinburgh, United Kingdom Email: [email protected] Email: fl.nardi,[email protected] Email: [email protected] Abstract—In this paper we propose a novel approach to They provide an automatic mapping, yet these models are heterogeneous embedded systems programmability using a task- overly restrictive. As an example, many vision applications graph based framework called Diplomat. Diplomat is a task- include dynamic behavior, such as early exit conditions, graph framework that exploits the potential of static dataflow modeling and analysis to deliver performance estimation and which are not supported by those models. Tools such as CPU/GPU mapping. An application has to be specified once, and Qualcomm Symphony [3] (previously known as MARE) are then the framework can automatically propose good mappings. more expressive. Symphony lets a programmer represent an We evaluate Diplomat with a computer vision application on two application using a task-graph, a more general representation embedded platforms. Using the Diplomat generation we observed of the dataflow model where communications are no longer a 16% performance improvement on average and up to a 30% improvement over the best existing hand-coded implementation. static. However, Symphony tasks are dynamically defined and mapping decisions are performed manually. Ideally we would like a framework that is both flexible and able to provide high I. INTRODUCTION performance mappings automatically. Mobile and embedded systems on chip (SoC) integrate In this paper, we address the mapping problem of multi- multiple computing resources. They contain multiple CPU kernel applications. For this purpose we introduce the Diplo- cores, a GPU and in some cases also a DSP. Such complex mat framework, a task-graph model embedded in the Python architectures have the potential for good performances but are language and combined with off-line performance charac- hard to program. They have a mapping problem, i.e. how terization to compute relevant mappings. It mainly targets to associate computational tasks to hardware resources in streaming applications, which are common in computer vision. order to meet the application’s response time, throughput In Diplomat, an application is defined as a task-graph. In objectives and power budget. OpenCL [1] has emerged as contrast with static dataflow modeling, the tasks and com- a standard to program heterogeneous systems and allows munication channels are not necessarily known at compile- program portability, i.e. the ability of a single program to time. Diplomat automatically collects performance profiling be executed on different devices. The OpenCL programming information to evaluate tasks’ workload; then it uses a static interface allows a programmer to express a task under the form dataflow model to evaluate performance and explores different of a function (called kernel in OpenCL) and to execute it on mappings for a given platform. The library generates C++ compatible resources. However, developing applications using code, which can then be integrated into an embedded system’s such systems and achieving acceptable performance is a non- firmware. This code is derived automatically from the task- trivial task. Even if a programmer has perfect knowledge of graph, and uses a runtime library to implement the necessary the target architecture and a deep understanding of OpenCL, communication and buffering between tasks. We show that a performance portability remains a challenge. class of computer vision applications can be abstracted to a Meanwhile, the emergence of computationally intensive static dataflow representation via the Diplomat task-graph. We mobile vision applications makes the use of embedded het- observe up to 30% speed improvement over the best existing erogeneous platforms increasingly important. Mobile vision hand-written implementation. applications have performance constraints and need efficient The contributions of this paper are: mappings. They are excellent candidates to evaluate new • A task-graph model, suitable for dynamic streaming mapping techniques. applications, and which can be easily abstracted into a Prior work on mapping exists. While most work deals with Synchronous Dataflow (SDF) model. single-kernel applications, recent research focuses on multi- • A framework which implements this abstraction and kernel applications. These represent more realistic real-world performs workload profiling together with static analysis applications. Static dataflow frameworks such as StreamIt [2] to find a mapping that achieves best performance on provide languages to statically express multi-kernel applica- heterogeneous platforms. tions and communication channels with specific constraints • We show that our approach adapts automatically to differ- on the amount of transferred data through the channels. ent architectures for truly performance-portable software. II. MOTIVATION 1 2 To illustrate choices that have to be made while designing an embedded system application on modern devices, we will Tasks' implementation in DIplomat DSL C++, OpenMP, OpenCL, ... Task graph representation consider the SLAMBench benchmarking framework [4] and in particular the KFusion benchmark. KFusion performs real- time localization and scene reconstruction for a camera mov- ing through an unknown environment, i.e. it estimates the pose Time profiling of a depth camera while building a highly detailed 3D model 3 of the environment. This application is composed of more than 10 different communicating tasks (refer to Figure 1). To im- Static dataflow plement this application and to reach acceptable performance, 4 abstraction a designer must optimize its algorithm while considering a good mapping at the same time. Thus, in Figure 2 we present Dataflow analysis, several possible mapping solutions. 5 e.g. mapping Running the KFusion benchmark on just one 2GHz ARM A15 core of an embedded SoC (Samsung Exynos 5422, see Code generation Table II), we observe low performance on a single CPU core, less than 2.5 Frames Per Second (FPS). By exploiting data 6 parallelism (using the KFusion OpenMP implementation), the Generated source code speed-up is 2.5x. We can also take advantage of the task paral- C++, OpenMP, OpenCL, ... lelism through a task-graph runtime (namely Symphony [3]), and again, we improve performance by a further 20%. Thus, Fig. 3: Overview of the Diplomat framework. The user pro- with these optimizations and by only using the multi-core vides (1) the task implementations in various languages and CPU, this application can reach 7.4 FPS. Alternatively, we (2) the dependencies between the tasks. Then in (3) Diplomat can use the GPU implementation (with OpenCL), and yields performs timing analysis on the target platform and in (4) a 2x speed-up comparing with the best CPU version, which is abstracts the task-graph as a static dataflow model. Finally, a 14.9 FPS. dataflow model analysis step is performed in (5) and in (6) It is evidently appealing to use both CPU and GPU at the the Diplomat compiler performs the code generation. Input frames same time and to combine those speed-ups through a more complex mapping, but implementing such behavior is not Tracking depth2vertex easy. The communication between CPU and GPU becomes preprocessing vertex2normal Integrate Raycast mm2meter halfsample integrate raycast a potential bottleneck, and the resulting performance might be bilateralfilter track reduce worse. For single-kernel applications, various solutions have been proposed to handle the mapping problem. But these are Reconstructed not well suited to multi-kernel applications. As an example, RenderDepth RenderTrack RenderVolume 3D scene following a speed-up based mapping methodology [5] we obtain a 16% slow-down compared to the OpenCL version. A partitioning technique similar to the one presented in [6] Fig. 1: The KFusion algorithm is composed of 7 main tasks, also results in a 16% slow-down. a total of 12 different OpenCL kernels. In contrast, using our approach, we are able to generate a mapping that reaches 17.2 FPS, which is a 15% speed-up Sequential (CPU) 2.5 compared to the OpenCL version. This example shows that OpenMP (CPU) 6.3 mapping using dataflow analysis tools can outperform existing Task-graph (CPU) 7.4 techniques. OpenCL (GPU) 14.9 Smart Mapping (GPU/CPU) 12.8 III. DIPLOMAT FRAMEWORK Partitioning (GPU+CPU) 12.8 Diplomat (GPU+CPU) 17.2 A. Framework Overview 0 5 10 15 Frames Per Second (FPS) The Diplomat framework combines task-graph modeling and static dataflow analysis to perform the mapping of stream- Fig. 2: Performance comparison of different possible mappings ing applications. An overview of the framework is given in for KFusion (higher FPS is better). Figure 3. 1) Front-end: The Diplomat front-end allows the frame- Bilateral work to gather fundamental information about the application: Filter the different possible implementations of the tasks, their expected input and output data sizes, and the existing data dependencies between each of them. 0 2) Static analysis: At compile-time, the framework per- ScaledDepth Vertex Normal forms static analysis. In order to benefit from existing dataflow i-1 i i i analysis techniques, the initial task-graph needs to be turned into a dataflow model. As

Diplomat: Mapping of Multi-Kernel Applications Using a Static Dataﬂow Abstraction

Setting up Odroid C2 – Throughtput Testing

Improving the Beaglebone Board with Embedded Ubuntu, Enhanced GPMC Driver and Python for Communication and Graphical Prototypes

Suzanne's Microcluster Slides

Passmark - Android Device List

Security Monitor for Mobile Devices: Design and Applications

Cubieboard Cubieboard2 Cubietruck Beaglebone Black

Passmark Android Benchmark Charts - CPU Rating

Building a Datacenter with ARM Devices

Smart Walker IV

An ARM Cluster for Running CMSSW Jobs Lirim Osmani1 and Tomas Linden´ 2 1Department of Computer Science, University of Helsinki, 2Helsinki Institute of Physics

Machine Learning/AI Processing on Mobile and Iot Edge Devices

For the ODROID-N2  September 1, 2019