Diplomat: Mapping of Multi-kernel Applications Using a Static Dataflow Abstraction

Bruno Bodin Luigi Nardi & Paul H. J. Kelly Michael F. P. O’Boyle The University of Edinburgh, Imperial College London, The University of Edinburgh, Edinburgh, United Kingdom London, United Kingdom Edinburgh, United Kingdom Email: [email protected] Email: {l.nardi,p.kelly}@imperial.ac.uk Email: [email protected]

Abstract—In this paper we propose a novel approach to They provide an automatic mapping, yet these models are heterogeneous embedded systems programmability using a task- overly restrictive. As an example, many vision applications graph based framework called Diplomat. Diplomat is a task- include dynamic behavior, such as early exit conditions, graph framework that exploits the potential of static dataflow modeling and analysis to deliver performance estimation and which are not supported by those models. Tools such as CPU/GPU mapping. An application has to be specified once, and Qualcomm Symphony [3] (previously known as MARE) are then the framework can automatically propose good mappings. more expressive. Symphony lets a programmer represent an We evaluate Diplomat with a computer vision application on two application using a task-graph, a more general representation embedded platforms. Using the Diplomat generation we observed of the dataflow model where communications are no longer a 16% performance improvement on average and up to a 30% improvement over the best existing hand-coded implementation. static. However, Symphony tasks are dynamically defined and mapping decisions are performed manually. Ideally we would like a framework that is both flexible and able to provide high I.INTRODUCTION performance mappings automatically. Mobile and embedded systems on (SoC) integrate In this paper, we address the mapping problem of multi- multiple computing resources. They contain multiple CPU kernel applications. For this purpose we introduce the Diplo- cores, a GPU and in some cases also a DSP. Such complex mat framework, a task-graph model embedded in the Python architectures have the potential for good performances but are language and combined with off-line performance charac- hard to program. They have a mapping problem, i.e. how terization to compute relevant mappings. It mainly targets to associate computational tasks to hardware resources in streaming applications, which are common in computer vision. order to meet the application’s response time, throughput In Diplomat, an application is defined as a task-graph. In objectives and power budget. OpenCL [1] has emerged as contrast with static dataflow modeling, the tasks and com- a standard to program heterogeneous systems and allows munication channels are not necessarily known at compile- program portability, i.e. the ability of a single program to time. Diplomat automatically collects performance profiling be executed on different devices. The OpenCL programming information to evaluate tasks’ workload; then it uses a static interface allows a programmer to express a task under the form dataflow model to evaluate performance and explores different of a function (called kernel in OpenCL) and to execute it on mappings for a given platform. The library generates C++ compatible resources. However, developing applications using code, which can then be integrated into an embedded system’s such systems and achieving acceptable performance is a non- firmware. This code is derived automatically from the task- trivial task. Even if a programmer has perfect knowledge of graph, and uses a runtime library to implement the necessary the target architecture and a deep understanding of OpenCL, communication and buffering between tasks. We show that a performance portability remains a challenge. class of computer vision applications can be abstracted to a Meanwhile, the emergence of computationally intensive static dataflow representation via the Diplomat task-graph. We mobile vision applications makes the use of embedded het- observe up to 30% speed improvement over the best existing erogeneous platforms increasingly important. Mobile vision hand-written implementation. applications have performance constraints and need efficient The contributions of this paper are: mappings. They are excellent candidates to evaluate new • A task-graph model, suitable for dynamic streaming mapping techniques. applications, and which can be easily abstracted into a Prior work on mapping exists. While most work deals with Synchronous Dataflow (SDF) model. single-kernel applications, recent research focuses on multi- • A framework which implements this abstraction and kernel applications. These represent more realistic real-world performs workload profiling together with static analysis applications. Static dataflow frameworks such as StreamIt [2] to find a mapping that achieves best performance on provide languages to statically express multi-kernel applica- heterogeneous platforms. tions and communication channels with specific constraints • We show that our approach adapts automatically to differ- on the amount of transferred data through the channels. ent architectures for truly performance-portable software. II.MOTIVATION 1 2 To illustrate choices that have to be made while designing an embedded system application on modern devices, we will Tasks' implementation in DIplomat DSL C++, OpenMP, OpenCL, ... Task graph representation consider the SLAMBench benchmarking framework [4] and in particular the KFusion benchmark. KFusion performs real- time localization and scene reconstruction for a camera mov- ing through an unknown environment, i.e. it estimates the pose Time profiling of a depth camera while building a highly detailed 3D model 3 of the environment. This application is composed of more than 10 different communicating tasks (refer to Figure 1). To im- Static dataflow plement this application and to reach acceptable performance, 4 abstraction a designer must optimize its algorithm while considering a good mapping at the same time. Thus, in Figure 2 we present Dataflow analysis, several possible mapping solutions. 5 e.g. mapping Running the KFusion benchmark on just one 2GHz ARM

A15 core of an embedded SoC (Samsung 5422, see Code generation Table II), we observe low performance on a single CPU core, less than 2.5 Frames Per Second (FPS). By exploiting data 6 parallelism (using the KFusion OpenMP implementation), the Generated source code speed-up is 2.5x. We can also take advantage of the task paral- C++, OpenMP, OpenCL, ... lelism through a task-graph runtime (namely Symphony [3]), and again, we improve performance by a further 20%. Thus, Fig. 3: Overview of the Diplomat framework. The user pro- with these optimizations and by only using the multi-core vides (1) the task implementations in various languages and CPU, this application can reach 7.4 FPS. Alternatively, we (2) the dependencies between the tasks. Then in (3) Diplomat can use the GPU implementation (with OpenCL), and yields performs timing analysis on the target platform and in (4) a 2x speed-up comparing with the best CPU version, which is abstracts the task-graph as a static dataflow model. Finally, a 14.9 FPS. dataflow model analysis step is performed in (5) and in (6) It is evidently appealing to use both CPU and GPU at the the Diplomat compiler performs the code generation.

Input frames same time and to combine those speed-ups through a more complex mapping, but implementing such behavior is not Tracking depth2vertex easy. The communication between CPU and GPU becomes preprocessing vertex2normal Integrate Raycast mm2meter halfsample integrate raycast a potential bottleneck, and the resulting performance might be bilateralfilter track reduce worse. For single-kernel applications, various solutions have been proposed to handle the mapping problem. But these are Reconstructed not well suited to multi-kernel applications. As an example, RenderDepth RenderTrack RenderVolume 3D scene following a speed-up based mapping methodology [5] we obtain a 16% slow-down compared to the OpenCL version. A partitioning technique similar to the one presented in [6] Fig. 1: The KFusion algorithm is composed of 7 main tasks, also results in a 16% slow-down. a total of 12 different OpenCL kernels. In contrast, using our approach, we are able to generate a mapping that reaches 17.2 FPS, which is a 15% speed-up Sequential (CPU) 2.5 compared to the OpenCL version. This example shows that OpenMP (CPU) 6.3 mapping using dataflow analysis tools can outperform existing Task-graph (CPU) 7.4 techniques. OpenCL (GPU) 14.9 Smart Mapping (GPU/CPU) 12.8 III.DIPLOMAT FRAMEWORK Partitioning (GPU+CPU) 12.8 Diplomat (GPU+CPU) 17.2 A. Framework Overview 0 5 10 15 Frames Per Second (FPS) The Diplomat framework combines task-graph modeling and static dataflow analysis to perform the mapping of stream- Fig. 2: Performance comparison of different possible mappings ing applications. An overview of the framework is given in for KFusion (higher FPS is better). Figure 3. 1) Front-end: The Diplomat front-end allows the frame- Bilateral work to gather fundamental information about the application: Filter the different possible implementations of the tasks, their expected input and output data sizes, and the existing data dependencies between each of them. 0 2) Static analysis: At compile-time, the framework per- ScaledDepth Vertex Normal forms static analysis. In order to benefit from existing dataflow i-1 i i i analysis techniques, the initial task-graph needs to be turned into a dataflow model. As the dataflow graph will not be used to generate the code, a representation of the application does HalfSample Depth2Vertex Vertex2Normal not need to be precise. But it needs to be accurate enough to obtain good performance estimations. Diplomat performs the for i = 1 to N for i = 0 to N following steps. First, the initial task-graph is abstracted into a Fig. 4: The Diplomat model is a composition of kernels, static dataflow formalism. This includes a timing profiling step buffers, and loops. In this figure kernels are represented with to estimate task durations and communication delays. Then, by rounded boxes, buffers with straight boxes and loops with using existing static analysis techniques [7], [8], a throughput dotted boxes. This graph is related to a subgraph issued from evaluation and a mapping of the application are performed. the KFusion application visible on Figure 7. 3) Code generation: Once a potential mapping has been se- lected, a functional C++ code is automatically generated. This produced implementation takes advantage of task-parallelism 2) Buffer: The Buffer class represents memory locations and data-parallelism. It can use OpenMP, OpenCL and it as logical regions. To make the code generation possible, data may apply partitioning between CPU and GPU when it is type and data size of these memory locations are required. beneficial. After this short overview, we now discuss the Diplomat Buffer(name="floatDepth", framework in further detail. ptr= True, type="float", B. Diplomat front-end size="n") Diplomat is composed of three classes, namely Kernel, Buffer and Loop. These classes enable the definition of In this example, floatDepth is a float pointer. In an application as a task-graph, such as in Figure 4. Using the previous Kernel example, the variables inBuffer and the Diplomat framework implies two requirements: a) to outBuffer were referring to Buffer instances. identify the tasks that compose an application; b) to provide 3) Loop: In the Diplomat task-graph representation, a task different implementations of these tasks for CPU and GPU can only be executed once. Because this requirement is restric- hardware resources, i.e. C++, OpenMP and OpenCL. Once tive, we introduce the Loop operator, and two more concepts, these requirements are fulfilled, and the task-graph expressed, the buffer instances and the buffer initialization values. all the remaining work is automated. We now define the three A Loop expresses the replication of Kernel. Let us classes of Diplomat. consider a classical loop of N − 1 iterations of a function 1) Kernel: The Kernel class defines tasks. First, it speci- (e.g. halfSample): fies the name of the C++ function or of the equivalent OpenCL kernel that this task is referring to. In order to respect data for(int i=1;i

22: integrate(volume,floatDepth,P) integrate reduce 23: end if 24: raycast(V,N,volume,P) renderVolume renderTrack 25: renderVolume(out,volume) 26: renderTrack(out,tracking) 27: renderDepth(out,floatDepth) Fig. 7: Graphical representation of KFusion Diplomat internal 28: end for representation. Here buffers are represented by arrows. Algorithm 1: KFusion algorithm model: the bigger this number is, the slower an update. -l is the tracking precision: the smaller this value is, the In our work, we focus on the SLAMBench [4] implemen- longer the tracking step. tation of KFusion. SLAMBench is a benchmark that pro- vides portable KFusion implementations in C++ (sequential), These parameters imply that the configurations 2 and 3 are the OpenMP, CUDA and OpenCL, that have been evaluated on fastest, but also less accurate. In contrast, the configuration 0 desktop and embedded sytems [4]. This fulfills the Diplomat and 1 are slower, but more accurate. prerequisites of implementations. A simplified pseudo-code of B. Targeted heterogeneous platforms KFusion is given in Algorithm 1. For the sake of brevity, we removed arguments and behaviors that do not induce any data As the motivation of this work is to ease mapping of applica- dependencies. tion on heterogeneous platforms, we applied our methodology 2) Configuration selection: To fulfill static analysis require- on two embedded platforms, the Arndale and the HardKernel ments, four KFusion configurations have been selected (see ODROID-XU3 boards. A brief summary of these platforms is Table I). We denote them by the configurations 0, 1, 2 and 3. available on Table II. These configurations have been selected by a (external to our work) computer vision expert. Each parameter is extremely C. Implementing KFusion with Diplomat sensitive, and a wrong combination of them could result in With a basic knowledge of the KFusion algorithm, describ- incorrect behavior (i.e. a loss of the camera position or a poor ing KFusion with the Diplomat framework took less than scene 3D reconstruction). These parameters also have a strong a day. This includes the debugging effort. Furthermore, the impact on the application speed and workload. whole description is less than 200 lines of code. However, -c is the input image ratio: the bigger this number is, the we do not claim this positive result can be generalized to any smaller is the input data size. applications. The purpose of Diplomat is not to be directly -v is the data size of the 3D model representation. used by a programmer, but to provide efficient mapping -r is the update frequency of the 3D model: an update is solution for higher-level languages, such as DSLs for vision. computationally intensive, the smaller this number is, the Once an application is described using Diplomat, an internal slower the application. representation of the task-graph is available; Figure 7 shows -m is a distance that defines the update region in the 3D this graph for the KFusion application. 1-CPU Perf. (FPS) 1-GPU Perf. (FPS) Config. Arguments Accuracy (cm) Arndale ODROID Arndale ODROID 0 -c 1 -m 0.2 -l 1e-04 -v 128 -r 20 < 4.1cm 0.24 0.33 3.48 3.76 1 -c 2 -m 0.2 -l 1e-05 -v 128 -r 10 < 4.2cm 0.76 1.28 9.26 10.43 2 -c 4 -m 0.075 -l 1e-06 -v 256 -r 5 < 4.4cm 1.49 2.50 11.69 14.79 3 -c 4 -m 0.1 -l 1e-05 -v 128 -r 10 < 4.6cm 2.34 3.91 15.69 20.04 TABLE I: Selected configurations of KFusion for the evaluation of Diplomat. Each of these parameters (-c,-m...) affects the application behavior and its workload.

Diplomat (CPU) 18 Diplomat (CPU/GPU) Manual-OpenMP (CPU) Speedup-Mapping (CPU/GPU) 1.5 16 +2.2% +25.2% Partitioning (CPU/GPU) +4.5% +5.9% +0.7% +0.4% 14 Manual-OpenCL (GPU) 12 1.0 +30.5% 10 +40.8% 8 6 0.5 Speedup over sequential Speedup over sequential 4 2 0.0 0 ARN ARN ARN ARN ARN ARN ARN ARN 0 1 2 3 0 1 2 3 (a) CPU-Only on Arndale (b) CPU and GPU on Arndale

Diplomat (CPU) Diplomat (CPU/GPU) 16 5.0 Manual-OpenMP (CPU) Speedup-Mapping (CPU/GPU) 4.5 +7.5% 14 +25.0% Partitioning (CPU/GPU) +31.8% Manual-OpenCL (GPU) 4.0 12 3.5 +28.3% 10 +19.2% 3.0 +17.2% 8 2.5 +16.5% 2.0 6 +13.5% Speedup over sequential Speedup over sequential 1.5 4 1.0 0.5 2 0.0 0 ODR ODR ODR ODR ODR ODR ODR ODR 0 1 2 3 0 1 2 3 (c) CPU-Only on ODROID (d) CPU and GPU on ODROID Fig. 8: Evaluation of the best result obtained with Diplomat for CPU and GPU configurations, and compared with hand- written solutions (OpenMP, OpenCL) and automatic heuristics (Partitioning, Speed-up mapping). The associated number is the configuration number and the percent on top of Diplomat bars are the speed-up over the MANUAL implementation.

V. PERFORMANCE RESULTS • SPEED-UP-MAPPING is a mapping strategy [5] which selects the fastest resource to compute each kernel. We evaluated four configurations of KFusion on two differ- The performance of the mapping obtained with Diplomat are ent platforms. We compared the performance of Diplomat with shown in Figure 8. For each case, the mapping selected by two families of solutions. We considered the original OpenMP Diplomat is the best known, after having evaluated the 20 and OpenCL implementations of SLAMBench [4] which are best estimated mappings. hand-written implementations, and we also considered two automatic mapping methods. A. CPU only

• MANUAL-OpenMP and MANUAL-OpenCL are respec- To compare the performance of Diplomat on CPU-only tively the OpenMP and OpenCL hand-written version platforms, we used MANUAL-OpenMP, the OpenMP imple- from SLAMBench. mentation provided in SLAMBench. This implementation is • PARTITIONING is a partitioning [6] of every kernel not using vectorization. Diplomat does not apply code trans- between CPU and GPU in order to maximize the speed formation on the kernel code, thus it does not use vectorization and based on timing profiling. either. 1) Arndale: The Arndale has a two-core CPU and, as PARTITIONING and SPEED-UP-MAPPING can provide visible in Figure 8a, the benefit of MANUAL-OpenMP on this performance improvement. For the configurations 0, thanks to device is limited (+30% on average). With this reduced number an important workload, the PARTITIONING provides 25% of cores the mapping possibilities for Diplomat are small. But speed-up over the manual. For configurations 1,2 and 3, as Diplomat takes advantage of task-parallelism and provides a their workloads diminish, the PARTITIONING performance small performance improvement on OpenMP (no more than decreases. The SPEED-UP-MAPPING performs badly with 5%). the large workloads of configuration 0: this is due to the 2) ODROID: As shown in Figure 8c, MANUAL-OpenMP higher impact of bad mapping decisions. But when the is more efficient on the ODROID. The ODROID has a workload is smaller, this mapping strategy can schedule tasks heterogeneous CPU composed of 4 big and 4 LITTLE cores. more efficiently; we see performance improvement of 13% MANUAL-OpenMP yields a 4x speed-up for the configuration over the manual with configuration 3. 0. This configuration’s workload is large, this is the best Because Diplomat has sufficient knowledge of the ap- condition for OpenMP data-parallelism. The performance of plication, it can avoid bad mapping decisions. For the OpenMP diminishes when the workload is reduced, but for large workload case, it will privilege a strategy similar to the smallest workload from configuration 3 this is still a 3x PARTITIONING, while for the other configurations, it will speed-up. The benefit of Diplomat in this context remains adapt the mapping decision in consequence. We also no- important and exploits task-parallelism. With the increase in ticed that for two cases (configurations 0 and 3), Diplo- number of cores, Diplomat is able to exploit more parallelism mat just selected strategies similar to PARTITIONING and across configurations. On average, it provides a 20% speed- SPEED-UP-MAPPING because it did not find better solutions. up. While the MANUAL-OpenMP will sequentially execute preprocessing tracking and rendering, Diplomat executes them VI.ANALYSIS simultaneously. In this section, we analyse the evaluation process of Diplo- B. CPU and GPU mat and its prediction accuracy. We compared Diplomat with A. Timing profiling • MANUAL-OpenCL, the OpenCL implementation of For each platform, Diplomat profiled the task duration and SLAMBench, the communications overhead of each configuration within less • PARTITIONING, a partitioning solution [6] based on than one hour. timing profiling, • and SPEED-UP-MAPPING, a speed-up based mapping B. Throughput evaluation technique [5]. For each configuration and each platform, an upper bound 1) Arndale: In Figure 8b the performance of of the maximum possible throughput is computed. Those are MANUAL-OpenCL is good: we see a speed-up between 7x available in Table III. The upper bound is not realistic, as and 14x depending on the configuration. This implementation it considers instant memory copy and an unlimited number has been made in a way that it minimizes the data movement, of resources. But this analysis provides potentially useful most of the computation happens on the GPU. On the Arndale information about the application behavior. Typically, the platform, both PARTITIONING and SPEED-UP-MAPPING static dataflow scheduler identifies the critical circuit of the perform worse than MANUAL-OpenCL. The main reason for application for these platforms which is passing through the this performance drop is the communication between kernels. integrate, raycast, track and reduce kernels. In a If automatic mapping decisions are locally good for a kernel, scenario where a programmer is hand-tuning the application, these imply communications in the whole application. This in order to improve performance, one of the kernels in the is also amplified because the Arndale CPU is not powerful critical circuit must be improved. This circuit is visible in enough to provide substantial benefits. Figure 7. For this platform Diplomat achieved performance improve- ment mostly due to the code generation method we used. C. Mapping exploration Indeed the best mapping selected by Diplomat on the Arndale The mapping heuristic used in Diplomat provided more board still uses the GPU only. The main difference with than 20 potentially good mappings in less than an hour. the hand-written version concerns the way OpenCL jobs are distributed. They are concurrently launched by several POSIX Throughput estimation (FPS) Configuration threads automatically generated by Diplomat. Such a way of Arndale ODROID running OpenCL jobs gives a more efficient use of the Arndale 0 2.55 2.85 GPU. This modification could have been done manually, but 1 9.16 9.53 2 10.11 10.03 this clearly is something a programmer would not consider. 3 24.24 19.50 2) ODROID: With the ODROID, Figure 8d, the MANUAL-OpenCL performs well too. However this device TABLE III: This table shows the maximum throughput esti- has a better CPU, thus classical mapping techniques such as mated by Diplomat. meddsraigssesadi rbbytems similar most the probably is and systems streaming embedded actual of instead model prediction ordering mapping the execution. that the using means 9), performed This similar. be Figure is can in performance actual done them the considering (as of by performance that, by is ordered predictions, observa- those important execution regarding The case application. tion worst is dynamic and a there model predict static to that performance. a times see actual use we can and because prediction we is This between same, gap the if significant is Even a magnitude boards. of ODROID ac- order the 9. and predictions the the mapping Arndale Figure compared the the in represent on and 9(b) respectively performance Figure them predicted and 9(a) of the Figure to best performance the tual ran manually We case last The 2. the configuration on using performance II ( actual Table their of platforms with possible several compared on performance mappings mapping of Estimation 9: Fig. 4C1624 The Period (ms) Period (ms) 1000 1000 1100 1200 1300 1400 1500 100 200 300 400 500 600 700 800 900 100 200 300 400 500 600 700 800 900 0 0 edlsRT Deadalus

stesqeta case. sequential the is ) 83B095 1E925D 9A143F D484FF 653793 851444 7AA045 56929B minimum estimation C0868E minimum estimation 1093C8 40310A 0B4DC0

I.R VII. 34F25F 583B04 4BAE6A a ODROID-XU3 (a) b rdl board Arndale (b) 1]faeoktreshard-real-time targets framework [12] EEFB59 E403D5 150AAC 47A4DA D484FF 11ABB4 LTDWORK ELATED F3830E AC0307 Mappings (id) Mappings (id) E7D286 83B095 FEB454 A668D0 E403D5 1EDD2E 1E925D A60A6A AD53A5 C8D4B9 052BCC E69A54 A53DE2 851444 07E6F5 BE7749 40E211 A53EBC 9FAE22 59A457 BBD138 A223EA 8A128B 315553 680FF1 F4E4EA 6A1F32 AED40B 241E19 FCF3CC 07AFC3 6A1F32 4C1624 4C1624 ie oee,ti olrqie omnal pcf the specify manually same to the requires at tool DSP mapping. and this GPU However, CPU, time. both heterogeneous using target SoC to proposed embedded the been recently, has runtime policies More task-graph efficient. dynamic less their platforms, becoming to are heterogeneous also applies With aforementioned platforms. runtime extension. limitation embedded the same target [20], to the Yet, extension improved is StarPU policy recent scheduling mappings a good In predict or beforehand. information performance enough achievable cur- deliver estimate the not to do of they knowledge Then, static application. consider rent limited only The they with that platforms. scheduling is data homogeneous runtime frameworks these by on of connected limitation focused main tasks initially dynamically tasks of were schedule to set and dynamically possible frameworks a These it as dependencies. make application that an express systems runtime are (DAG) Diplomat. in Graph used sophisticated Acyclic one less the a the Directed is than For which representation a application, (CP). the mapping use Program of a representation they Constraint then a part, and solving first mapping by performed is provided kernels is the of tuning is Diplomat multi-application. application. with single are dealing a devices not serve currently where to designed systems longer embedded no partitioning modern task on is analysis supports important a MAPS applications that Concurrent nor note analysis. to multi-application back-end important is OpenCL the it trace- Meanwhile, than policy. an using accurate provide Furthermore, possible less not Diplomat. remains only by collected which are information tools, mapping Through- analysis language. and based (CPN) networks analysis described process are put for Applications C model. the KPN using the using limitation inrgrigteepesvt,adde o aepartitioning take not does limita- account. and same into expressivity, the the from regarding suffers of tion representation framework SDF This an application. from an C++) and OpenCL (using code considering. are we support applications to make vision enough support modern expressive to not not the required are do and models tools generation, these OpenCL analysis But static decisions. mapping the efficient less all is provide model They this as C like re- task-graph toolchains will the Similarly, considered of real- such expressive. model abstraction no the SDF is an produce The there main to contrary, Diplomat. used the in be On constraint programs schedule. can loop PPN application generated nested the time affine CSDF But static the application. Cyclo-Static to and the limited on Poly- is analyse relies the expressivity to and language, (CSDF) (PPN), input Dataflow Network specific Process a hedric provides It approach. The contrast, In The proposed. is framework two-step similar a [16] In The heterogeneous generate to proposed is framework a [14], In 1]epesapiain hog ttcdtflwmodel. dataflow static a through applications express [13] PetaBricks MAPS StarPU 1]faeokudrae hsexpressivity this undertakes framework [15] 2]faeoksae omnfeatures common shares framework [21] [17], OmpSs 1]and [18] StreamIt OpenStream 2 or [2] Symphony MAPS Sigma- does [19] [3] with Diplomat but it mainly deals with single-kernel applica- [9] G. Kahn, “The semantics of a simple language for parallel program- tions. In PetaBricks, a DSL is used to represent an algorithm ming,” Information processing, 1974. [10] E. A. Lee and D. G. Messerschmitt, “Synchronous dataflow,” Proceed- as the combination of several methods and transformation ings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987. rules. In a recent extension [22], it also considers OpenCL [11] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. code generation and partitioning. Expressing a multi-kernel Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “KinectFu- sion: Real-time dense surface mapping and tracking,” in ISMAR, 2011. application in this formalism is difficult: it requires to express [12] M. Bamakhrama, J. Zhai, H. Nikolov, and T. Stefanov, “A methodology too many rules and the search space becomes too big. for automated design of hard-real-time embedded streaming systems,” To improve the information provided by OpenCL, He- in Design, Automation Test in Europe Conference Exhibition (DATE), 2012, March 2012, pp. 941–946. lium [23] provides a high-level representation of the appli- [13] T. Goubier, R. Sirdey, S. Louise, and V. David, “ΣC: A programming cation. It avoids unnecessary data movements and integrates model and language for embedded manycores,” in Algorithms and multi-kernel transformations such like kernel merging, which Architectures for Parallel Processing. Springer, 2011, pp. 385–394. [14] L. Schor, A. Tretter, T. Scherer, and L. Thiele, “Exploiting the par- are not considered in Diplomat. This framework is dynamic allelism of heterogeneous systems using dataflow graphs on top of and it has the advantage to be transparent for the user. How- ,” in Embedded Systems for Real-time Multimedia (ESTIMedia), ever, Helium does not gather required information, i.e. timings, 2013 IEEE 11th Symposium on, Oct 2013, pp. 41–50. [15] J. Castrillon, R. Leupers, and G. Ascheid, “Maps: Mapping concurrent to make mapping decisions. dataflow applications to heterogeneous mpsocs,” Industrial Informatics, IEEE Transactions on, vol. 9, no. 1, pp. 527–545, Feb 2013. ONCLUSION VIII.C [16] E. Paone, F. Robino, G. Palermo, V. Zaccaria, I. Sander, and C. Silvano, In this work we apply dataflow static analysis to perform “Customization of OpenCL Applications for Efficient Task Mapping under Heterogeneous Platform Constraints,” in Design, Automation & static mapping of multi-kernel applications in heterogeneous Test in Europe (DATE), 2015. embedded platforms. Using our methodology, we observed [17] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “StarPU: performance improvement over state-of-the-art for a streaming a unified platform for task scheduling on heterogeneous multicore architectures,” Concurrency and Computation: Practice and Experience, computer vision application. vol. 23, pp. 187–198, 2011. In future work we will consider more expressive dataflow [18] A. Duran, E. Ayguade,´ R. M. Badia, J. Labarta, L. Martinell, X. Mar- models. To cope with the poor expressivity of the SDF torell, and J. Planas, “Ompss: a proposal for programming heterogeneous multi-core architectures,” Parallel Processing Letters, vol. 21, no. 02, pp. model used in this work, several dynamic variables have been 173–193, 2011. fixed to perform static analysis of the application. To fix [19] A. Pop and A. Cohen, “A OpenStream: Expressiveness and Data-Flow these values automatically, Diplomat has to consider more Compilation of OpenMP Streaming Programs,” TACOS, vol. V, no. January, 2013. expressive models such as the PSDF [24], the SADF [25] [20] H. Zhou and C. Liu, “Task mapping in heterogeneous embedded or PiMM [26]. From the HPC community, DAGuE [27] is systems for fast completion time,” Proceedings of the 14th International a runtime system modeled using an intermediate language, Conference on Embedded Software - EMSOFT ’14, pp. 1–10, 2014. [21] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and namely JDF. This language is based on the compact DAG S. Amarasinghe, “Petabricks: A language and compiler for algorithmic representation [28], which is similar to the model proposed choice,” in Proceedings of the 30th ACM SIGPLAN Conference on by Diplomat. An interesting research direction is to consider Programming Language Design and Implementation, ser. PLDI ’09. New York, NY, USA: ACM, 2009, pp. 38–49. JDF as a Diplomat front-end and to directly tackle the static [22] P. M. Phothilimthana, J. Ansel, J. Ragan-Kelley, and S. Amarasinghe, analysis from this language. “Portable performance on heterogeneous architectures,” in Proceedings of the Eighteenth International Conference on Architectural Support REFERENCES for Programming Languages and Operating Systems, ser. ASPLOS ’13. [1] Khronos, “The OpenCL 1.2 specification,” 2012. New York, NY, USA: ACM, 2013, pp. 431–444. [2] W. Thies, M. Karczmarek, and S. Amarasinghe, “StreamIt: A language [23] T. Lutz, C. Fensch, and M. Cole, “Helium: a transparent inter-kernel for streaming applications,” Compiler Construction, pp. 179–196, 2002. optimizer for opencl,” in Proceedings of the 8th Workshop on General [3] Qualcomm, “Qualcomm MARE: Enabling Applications for Heteroge- Purpose Processing using GPUs, GPGPU@PPoPP 2015, San Fran- neous Mobile Devices,” Qualcomm Inc., Tech. Rep., 2014. cisco, CA, USA, February 7, 2015, 2015, pp. 70–80. [4] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, [24] B. Bhattacharya and S. Bhattacharyya, “Parameterized dataflow model- A. J. Davison, M. Lujan,´ M. F. P. O’Boyle, G. Riley, N. Topham, ing for dsp systems,” Signal Processing, IEEE Transactions on, vol. 49, and S. Furber, “Introducing SLAMBench, a performance and accuracy no. 10, pp. 2408–2421, Oct 2001. benchmarking methodology for SLAM,” in IEEE Intl. Conf. on Robotics [25] S. Stuijk, M. Geilen, B. D. Theelen, and T. Basten, “Scenario-aware and Automation (ICRA), May 2015, arXiv:1410.2167. dataflow: Modeling, analysis and implementation of dynamic appli- [5] V. T. Ravi, M. Becchi, W. Jiang, G. Agrawal, and S. Chakradhar, cations,” in 2011 International Conference on Embedded Computer “Scheduling concurrent applications on a cluster of cpu-gpu nodes,” in Systems: Architectures, Modeling, and Simulation, SAMOS XI, Samos, Proceedings of the 2012 12th IEEE/ACM International Symposium on Greece, July 18-21, 2011, 2011, pp. 404–411. Cluster, Cloud and Grid Computing (Ccgrid 2012), ser. CCGRID ’12. [26] K. Desnos, M. Pelcat, J.-F. Nezan, S. Bhattacharyya, and S. Aridhi, Washington, DC, USA: IEEE Computer Society, 2012, pp. 140–147. “Pimm: Parameterized and interfaced dataflow meta-model for mpsocs [6] K. Kofler, I. Grasso, B. Cosenza, and T. Fahringer, “An automatic input- runtime reconfiguration,” in Embedded Computer Systems: Architec- sensitive approach for heterogeneous task partitioning,” in Proceedings tures, Modeling, and Simulation (SAMOS XIII), 2013 International of the 27th International ACM Conference on International Conference Conference on, July 2013, pp. 41–48. on Supercomputing, ser. ICS ’13. ACM, 2013, pp. 149–160. [27] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and [7] A. H. Ghamarian, M. Geilen, S. Stuijk, T. Basten, A. Moonen, M. J. J. Dongarra, “Dague: A generic distributed dag engine for high perfor- Bekooij, B. D. Theelen, and M. Mousavi, “Throughput Analysis of mance computing,” ICL, University of Tennessee, Tech. Rep., 2010. Synchronous Data Flow Graphs,” in International Conference on Appli- [28] M. Cosnard, E. Jeannot, and T. Yang, “Compact dag representation and cation of Concurrency to System Design (ACSD’06), 2006, pp. 25–36. its symbolic scheduling,” J. Parallel Distrib. Comput., vol. 64, no. 8, [8] A. Singh, M. Shafique, A. Kumar, and J. Henkel, “Mapping on pp. 921–935, Aug. 2004. multi/many-core systems: Survey of current and emerging trends,” in Design Automation Conference (DAC), May 2013, pp. 1–10.