Shared-variable synchronization approaches for dynamic dataflow programs

Apostolos Modas (1), Simone Casale-Brunet (2), Robert Stewart (3), Endri Bezati (2), Junaid Ahmad (4)*, Marco Mattavelli (1)

(1) EPFL SCI STI MM, École Polytechnique Fédérale de Lausanne, Switzerland
(2) SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
(3) Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, United Kingdom
(4) J Nomics, Manchester, United Kingdom

* This work was done while the author was with EPFL SCI STI MM.

Abstract: This paper presents shared-variable synchronization approaches for dataflow programming. The mechanisms do not require any substantial model of computation (MoC) modification, and are portable across both hardware (HW) and software (SW) low-level code synthesis. With the shared-variable formalization, the benefits of the dataflow MoC are maintained, while the space and energy efficiency of an application can be significantly improved. The approach targets Dynamic Process Network (DPN) dataflow applications, which also makes it suitable for the less expressive models that DPN subsumes, e.g. synchronous and cyclo-static dataflow. The approach is validated through the analysis and optimization of a High-Efficiency Video Coding (HEVC) decoder implemented in the RVC-CAL dataflow language and targeting a multi-core platform. Experimental results show that, starting from an initial design that does not use the shared-variable formalism, throughput in frames per second is increased by a factor of 21.

I. INTRODUCTION

In recent years, there has been a renewed interest in the field of dataflow programming. This has been driven by the limits on frequency increases in deep sub-micron CMOS silicon technology, which have shifted the evolution of processing platforms towards systems comprising heterogeneous arrays of parallel processors. The emergence of these manycore architectures poses new problems and challenges for compiling applications efficiently to them. A major challenge is the portability of applications across heterogeneous architectures, in particular the portability of the parallelism present in each specific application [1]. Dataflow programming, in all its different models of computation (MoCs), is well placed to overcome the challenges of exploiting heterogeneous architectures efficiently.

Dataflow MoCs are widely used for the specification of data-driven algorithms in many application areas, e.g. video and audio processing, bioinformatics, financial trading, and packet switching applications. In these domains, scalability and composability of systems are increasingly important requirements, and dataflow MoCs are an efficient way of implementing algorithms in standardized high-level languages [2], [3], [4]. Dataflow MoCs are architecture agnostic, making them highly valuable for the specification of performance-portable applications that can be deployed on a wide variety of computing platforms. Dataflow programs are mapped to specific architectural components with low-level, target-specific code synthesis.

By relying on isolation and non-sharing, an actor can access its own state without fear of data races. This approach can, however, introduce inefficiencies in the generated code, e.g. allocations causing high memory use, and high levels of actor idle time whilst data is explicitly copied. As an example, in the context of video compression, the output data generally depends on intermediate data structures that different actors are obliged to replicate, since they cannot be shared. This can severely impact the time and space performance and the energy efficiency of an application.

In this paper, we present a set of shared-variable synchronization approaches that do not require any substantial MoC modification and that are portable across both HW and SW low-level code synthesis. The main advantage of this formalization is that the benefits of the dataflow MoC are maintained, while the space and time performance and the energy efficiency of an application can be significantly improved. The method targets Dynamic Process Network (DPN) dataflow applications, which also makes it suitable for the less expressive models that DPN subsumes, e.g. synchronous and cyclo-static dataflow.

The paper is structured as follows: Section II provides an overview of current dataflow MoCs and shared-variable approaches. Section III presents a novel and generic shared-variable paradigm that can be introduced in a dataflow MoC without fear of data races. Section IV describes how this methodology has been implemented in the standardized RVC-CAL dataflow language. Experimental results are provided in Section V, where an HEVC video decoder implemented in the RVC-CAL dataflow language has been optimized with the use of shared variables. Finally, Section VI concludes the paper and discusses future work directions.

II. BACKGROUND WORK

A. Dataflow models of computation

Dataflow programming models have a long and rich history dating back to the early 1970s [5], [6]. As depicted in Fig. 1a, a dataflow program is defined as a (hierarchical) directed graph in which nodes (called actors) represent the computational kernels and directed edges (called buffers) represent the lossless, order-preserving, point-to-point communication channels between actors. Buffers are used to communicate sequences of atomic data packets (called tokens). In the literature, several variants of dataflow models of computation (MoCs) have been introduced [6], [7], [8]. One of their common properties is that individual actors encapsulate their own state, which is not shared with other actors of the same program. Instead, actors communicate with each other exclusively by sending and receiving tokens through the buffers connecting them. The absence of race conditions makes the behavior of dataflow programs more robust to different execution policies, whether those be truly parallel or some interleaving of the individual actors.

    Producer --b1--> Filter --b2--> Consumer

(a) An example of a dataflow network with three actors (i.e. Producer, Filter and Consumer) and two buffers (i.e. b1 and b2).

    actor Producer() ==> int Y :
        int cnt := 0;
        produce: action ==> Y:[cnt]
        guard cnt < 3
        do
            cnt := cnt + 1;
        end
    end

(b) Producer.cal: at each firing it produces a token on output port Y.

    actor Filter() int X ==> int Y :
        copy: action X:[a] ==> Y:[a] end
        invert: action X:[a] ==> Y:[-a] end
        schedule fsm state1 :
            state1 (copy) --> state2;
            state2 (invert) --> state1;
        end
    end

(c) Filter.cal: at each firing it consumes a token from input port X and it produces a token on output port Y.

    actor Consumer() int X ==> :
        consume: action X:[a] ==> end
    end

(d) Consumer.cal: at each firing it consumes a token from input port X.

Fig. 1. RVC-CAL program example: dataflow network configuration and actors source code.

DPN is an expressive MoC in comparison with other dataflow models, e.g. it supports dynamic branching with actor firings predicated on token values. This makes the DPN model sufficient for expressing dynamic, complex algorithms, but comes at the cost of analysis and optimization opportunities. To execute, DPN actors perform a sequence of discrete computational steps (or firings). During each step, an actor can consume a finite number of input tokens, produce a finite number of output tokens, and modify its own internal state variables if it has any.

[...] need to be present for the actor to execute a step (i.e., for it to be enabled). For a given input sequence, the firing functions determine a sequence/state combination for which the actor is enabled according to the firing rule, the output tokens produced in that step, and, if applicable, the next actor state. Note that at each step only one action can be selected and fired. In general, DPN actors may be non-deterministic, which means that the firing function may yield more than one combination of outputs and next states. Furthermore, the execution can be totally dynamic, meaning that the number of consumed and produced tokens may vary according to the input sequence, which severely limits any compile-time analysis of DPN actors.

B. Dataflow programming languages

In the last decades, a plethora of different programming languages has been used to model and implement dataflow programs [9]. Imperative languages (e.g. C/C++, Java, Python) have been extended with parallel constructs, and pure dataflow languages (e.g. Ptolemy, Esterel) have been formalized. The RVC-CAL language [10] is the sole standardized dataflow programming language which fully captures the behavioral features of DPN. As an example, the RVC-CAL program reported in Fig. 1a is composed of three actors (i.e., Producer, Filter and Consumer) and two buffers (i.e., b1 and b2).

The Producer actor (see Fig. 1b) has an output port Y connected to b1, the actor internal variable cnt, and the action (i.e., firing function) labeled produce. During each firing of the action, the value of cnt is modified and a token is produced on b1. Furthermore, the execution of the action is guarded by the guard condition on cnt.

The Filter actor (see Fig. 1c) has an input port X connected to b1, an output port Y connected to b2, the two actions copy and invert, and an actor finite state machine (FSM) which drives the action selection. The selected action can be fired only if a token is available on b1 and there is space for at least one token on b2. During each firing, a token is consumed from b1 and a token is produced on b2.

The Consumer actor (see Fig. 1d) is composed of the input port X connected to b2 and the action consume. The action can be fired only if a token is available on b2. During each firing, a token is consumed from b2.

C. Shared-variable approaches

Conventional dataflow encapsulates isolated state inside actors. Their only means of communication is by sending and receiving data through dataflow edges. For example, the only way for the Filter and Consumer actors to know the value of the cnt variable in Producer is via explicit