Design-time application mapping and platform exploration for MP-SoC customised run-time management

Ch. Ykman-Couvreur, V. Nollet, Th. Marescaux, E. Brockmeyer, Fr. Catthoor and H. Corporaal

Abstract: In a Multi-Processor System-on-Chip (MP-SoC) environment, a customised run-time management layer should be incorporated on top of the basic services to alleviate the run-time decision-making and to globally optimise costs (e.g. energy consumption) across all active applications, according to application constraints (e.g. performance, user requirements) and available platform resources. To that end, to avoid conservative worst-case assumptions, while also eliminating the large run-time overheads of state-of-the-art RTOS kernels, a Pareto-based approach is proposed, combining a design-time application and platform exploration with a low-complexity run-time manager. The design-time exploration phase of this approach is the main contribution of this work. It is substantiated with two real-life applications (image processing and video codec multimedia). These are simulated on an MP-SoC platform simulator and used to illustrate the optimal trade-offs offered by the design-time exploration to the run-time manager.

1 Introduction

An Operating System (OS, also called run-time management layer) is a middleware acting as a glue layer between both application and platform layers. Just like ordinary glue, an ideal OS should be adapted to the properties and requirements of the environment it has to be used for. In a Multi-Processor System-on-Chip (MP-SoC) environment, this OS should efficiently combine different aspects already present in different disciplines: to implement dynamic sets of applications as in the workstation environment, to manage different types of platform resources as in the parallel and distributed environment, and to handle non-functional aspects as in the embedded environment:

† First, mobile systems are typically battery-powered and have to support a wide and dynamic set of multimedia applications (e.g. video messaging, web browsing, video conferencing), three-dimensional games and many other compute-intensive tasks [1]. These applications are becoming more heterogeneous, dynamic with multiple use cases, and data-intensive. Hence, MP-SoC platforms have to be flexible and to fulfil the Quality-of-Service (QoS) requirements of the user (e.g. reliability, performance, energy consumption and video quality). Also the OS must be able to run all active applications in an optimal way.

† Second, the OS has to support platforms (e.g. TI OMAP and ST [1, 2]) which consist of a large number of heterogeneous processing elements (PE), each with its own set of capabilities. These platforms combine the advantages of parallel computing on multiple processors with the single-chip integration of SoCs. They provide high computational performance at a low energy cost, whereas typical embedded systems (e.g. handheld devices such as Personal Digital Assistants (PDAs)) are limited by the restricted amount of processing power and memory. As the application complexity grows, the major challenge is still the right parallelisation (both data level and functional level, both coarse grain and fine grain) of these applications and their mapping on the MP-SoC platform.

† Third, the PEs in the platform communicate with each other independently and concurrently. Traditional shared-medium communication architectures (e.g. buses) cannot support the massive data traffic. A flexible interconnect, a Network-on-Chip (NoC) [3, 4], must be adopted to provide reliable and scalable communication [5, 6]. Growing SoC complexity makes communication subsystem design as important as computation subsystem design [7]. The communication infrastructure must efficiently accommodate the communication needs of the integrated computation and storage elements. In application domains such as multimedia processing, the bandwidth requirements are already in the range of several hundred Mbps and are continuously growing [8]. In switched NoCs, switches set up communication paths that can change over time, and run-time channel and bandwidth reservation must be supported by the OS. Designing such an NoC becomes a major task for future MP-SoCs, where the communication cost is becoming much larger than the computation cost. A large fraction of the timing delay is spent on the signal propagation on the interconnect, and a significant amount of energy is also dissipated on the wires. Therefore an optimised NoC floorplan is of great importance for MP-SoC performance and energy consumption.

† Finally, for memory-intensive applications such as multimedia applications, the memory subsystem represents an important component in the overall energy cost. In the memory subsystem, ScratchPad Memories (SPM) are used [9, 10], as they perform better than caches in terms of energy per access, performance, on-chip area and predictability. However, unlike caches, SPMs require complex design-time application analysis to carefully decide which data to assign to the SPM, and software allocation techniques [11, 12]. Design-time allocation loads the SPM once at the start of the application execution, whereas run-time allocation changes the SPM contents during the execution to reflect the dynamic behaviour of the application.

To alleviate the OS in its run-time decision-making, and to avoid conservative worst-case assumptions, a very intimate link between the application layer and the platform layer is required. To this end, we have shown in previous work [13, 14] that it is better to add a customised ultra-lightweight run-time management layer on top of the basic OS services. This layer was developed for scheduling concurrent tasks on embedded systems. It was intended to optimise the energy consumption while respecting the application deadlines only. Our approach is currently extended in the MP-SoC context to map the applications on the platform, according to application constraints (e.g. performance, user requirements) and available platform resources. Our extended approach consists of the two following phases:

† First, a design-time mapping and platform exploration per application leads to a multi-dimensional Pareto set of optimal mappings (Fig. 1). Each mapping is characterised by a code version together with an optimal combination of application constraints, used platform resources and costs. The different code versions refer to different parallelisations of the application into concurrent tasks and to different data transfers between SPMs and local memories.

† Second, a low-complexity run-time manager, incorporated on top of the basic OS services, maintains the high quality of the exploration. Whenever the environment is changing (e.g. when a new application/use case starts, or when the user requirements change), for each active application, our run-time manager reacts as follows:

1. It selects in a predictable way a mapping from its Pareto set, according to the available platform resources, in order to minimise the total energy consumption of the platform, while respecting all application constraints.
2. It performs Pareto point switches (Fig. 2, restricted to two dimensions), that is, it assigns the platform resources, adapts the platform parameters, loads the task binaries from the shared memory into the corresponding local memories and issues the execution of the code versions according to the newly selected Pareto points. When Application A starts, it is assigned to three PEs with a slow clock (ck2). As soon as Application B starts, a Pareto point switch is needed to map A on only two PEs. By speeding up the clock (ck1), the application deadline is still met. After A stops, B can be spread over three PEs in order to reduce the energy consumption.

The design-time exploration phase of our approach is the main contribution of our study. The resulting Pareto set of optimal mappings, to be stored on the MP-SoC platform and used as input for our run-time manager, is essential to alleviate the OS in its run-time decision-making and to avoid conservative worst-case assumptions. Two representative real-life applications, an image processing algorithm and a video codec multimedia one, simulated on our MP-SoC platform simulator, are also used to illustrate the optimal trade-offs offered by the design-time exploration to the run-time manager.

2 Related work

In recent years, industrial MP-SoC components have been introduced by companies like TI and ST Microelectronics. Current OSs like the TI DSP/BIOS kernel with the DSP/BIOS link, the Quadros RTXC RTOS and the Enea Systems OSE RTOS are clearly focused on low-level run-time management (i.e. multiplexing the hardware and providing uniform communication primitives). They only provide an abstraction layer on top of the hardware; they expand and link together existing technologies, but they are not designed for the emerging MP-SoC environment. Support for SPMs, NoCs, dynamic power management, and QoS-aware and application-specific run-time management is lacking. Hence, none of these existing RTOSs represents the ideal glue layer for MP-SoCs. The user is supposed to implement his own run-time manager on top of the RTOS kernel.

State-of-the-art tools and design practice are also not yet in a shape to meet the needs presented previously. Currently, two diverging strategies are developed to cope with the design complexity of application-specific and heterogeneous MP-SoC platforms: either the IP-driven approach, or the design-flow-driven approach [15].

Fig. 1 Pareto set generated by our design-time exploration (energy against performance, for mappings using 1 to 4 PEs at clocks ck1 and ck2)

Fig. 2 Pareto point switch (Applications A and B mapped on Proc 0 to Proc 3 over time, at the events 'A starts', 'B starts', 'A stops' and 'B stops')
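To make Figs. 1 and 2 concrete, the sketch below shows one possible C representation of a mapping point and of the dominance filter that retains only Pareto-optimal points at design time. The struct fields and function names are illustrative assumptions only; the actual data structures stored on the platform are out of the scope of this study.

    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative mapping point: one code version plus the resources it
     * uses and the resulting constraint/cost values (cf. Fig. 1). */
    typedef struct {
        int    code_version;   /* parallelisation / block-transfer variant */
        int    num_pes;        /* processors used                          */
        double clock_ns;       /* processor clock period                   */
        double exec_time;      /* performance (e.g. ms per frame)          */
        double energy;         /* cost to be minimised (e.g. mJ per frame) */
    } mapping_point;

    /* a dominates b if it is no worse in every dimension considered here
     * and strictly better in at least one (lower is better for all three). */
    static bool dominates(const mapping_point *a, const mapping_point *b)
    {
        bool no_worse = a->num_pes   <= b->num_pes   &&
                        a->exec_time <= b->exec_time &&
                        a->energy    <= b->energy;
        bool better   = a->num_pes   <  b->num_pes   ||
                        a->exec_time <  b->exec_time ||
                        a->energy    <  b->energy;
        return no_worse && better;
    }

    /* Keep only non-dominated points (in place); returns the new count. */
    size_t pareto_filter(mapping_point *p, size_t n)
    {
        if (n == 0)
            return 0;
        bool keep[n];                       /* C99 VLA; modest point counts assumed */
        for (size_t i = 0; i < n; i++) {
            keep[i] = true;
            for (size_t j = 0; j < n && keep[i]; j++)
                if (i != j && dominates(&p[j], &p[i]))
                    keep[i] = false;
        }
        size_t kept = 0;
        for (size_t i = 0; i < n; i++)      /* compact the retained points */
            if (keep[i])
                p[kept++] = p[i];
        return kept;
    }

In this view, the filter is run once per application at design time over all explored mappings, and only the retained Pareto points are handed to the run-time manager.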


Related to the IP-driven approach, to reuse cores on different MP-SoC platforms, a single standard interface should be agreed upon. Several innovative design methodologies have recently appeared:

† The approach developed at Philips, Eindhoven is based on a task-level interface [16], called TTL. This is used both for modelling concurrent applications and as a platform interface for integrating hardware and software cores on MP-SoC platforms. As static task graph models are no longer sufficient, new reconfiguration services have been introduced in [17]. The underlying inter-task communication protocol is presented in [18].

† However, when the application imposes performance constraints and cost optimisations, a design customisation is required in addition to the pre-designed component integration. To solve this issue, the approach developed at TIMA, Grenoble, proposes a unified model [19] to represent all interfaces for SoC design. This model, called Service Dependency Graph (SDG), describes interactions between hardware and software as services, and it enables full system simulation at different abstraction levels. The associated design methodology, based on a two-layer hardware-dependent software (HdS) generated from the SDG, is described in [20].

† Other academic (Metropolis, SysExplorer) and commercial tools (CoWare ConvergenC, Summit Design Visual Elite) also exist supporting modelling and analysis of SoCs.

In these IP-driven approaches, synthesis is still component-centric and has no integral view on the entire system on the MP-SoC platform. Each application is synthesised separately, without considering the interacting data-dependent tasks and the shared global resources. Scheduling the accesses to the global resources is determined by local decisions. The designer has to ensure manually that the system communicates properly via buses and shared memories.

Related to the design-flow-driven approach, several global optimisation issues are considered in the academic world: application parallelisation, task scheduling and dynamic reconfiguration. Next we focus on task scheduling and dynamic reconfiguration, for which our design-time exploration offers trade-offs.

2.1 Task scheduling

Task scheduling is needed to guarantee application performances and platform resource constraints, and to optimise objectives such as response time, energy consumption and battery lifetime [21]. For distributed systems [22], it has the following characteristics. First, it assumes a homogeneous architecture. Second, the most important optimisation objective is performance. Third, the focus is mainly on computation rather than on communication. For MP-SoC platforms, task scheduling becomes more complicated [23]. Its impact on the energy consumption becomes more significant, whereas performance is a constraint to be satisfied. In this context, task scheduling mainly consists of the following actions: (a) spatial mapping, determining on which processor a task must be executed; (b) temporal mapping, deciding the order in which those tasks are executed; (c) communication routing, both spatial and temporal; (d) dynamic voltage/frequency scaling (DVS/DFS), determining the processor supply voltage and clock frequency, if it is allowed; and (e) dynamic power management (DPM), either shutting down or swapping platform components into some sleep mode whenever these enter an idle state.

Scheduling algorithms can be divided into design-time (also called static or off-line) and run-time (also called dynamic or on-line) algorithms. However, at design time it is unknown which applications are simultaneously active and which resources are available on the platform. So, design-time scheduling alone cannot solve the problem efficiently.

Energy consumption is increasingly an issue, and not only for battery-operated devices. Even if unlimited power is available, a large number of components tightly packed onto a chip poses cooling and reliability problems. An important way to reduce the energy consumption is to shut down or slow down functional components which are idle or underutilised, by combining DVS with DPM. Surveys of system-level design techniques can be found in [14] and [24], respectively. The energy management problem for real-time applications decomposed into concurrent tasks and implemented on MP-SoC platforms with DVS has already been addressed for several years. The most recent scheduling approaches, combining application mapping and DVS, can be found in [13, 25-28].

Other approaches combine application mapping with some run-time communication management. Reference [29] evaluates a run-time spatial and temporal mapping algorithm. This finds the right processor for a certain task and the appropriate communication path, using a library of different implementations per application (for an ARM, or a DSP, or an FPGA). It minimises the global energy consumption, while guaranteeing all application performances and platform resource constraints. The application is mapped when it is started. However, when certain events happen, the mapping might be reconsidered and/or the communication links might be re-routed. Reference [30] presents a unified single-objective algorithm, coupling spatial mapping of tasks, spatial routing of communication, and temporal mapping assigning time slots to these routes. The real-time communication requirements are considered to guarantee that application performances are met. Reference [31] introduces a novel energy-aware algorithm which statically schedules application-specific communication transactions and computation tasks onto heterogeneous NoC architectures.

2.2 Dynamic reconfiguration

Multimedia applications are becoming more complex, and multiple use cases and optimal mappings need to be supported. Moving from one mapping to another results from user interactions or from changes in the platform resource availability when new applications are activated. Adapting the mapping of an active application is called dynamic reconfiguration, or task migration. The key challenge is of course to maintain the real-time behaviour and the data integrity of the overall set of active applications. Nevertheless, dynamic reconfiguration is a powerful mechanism to improve the MP-SoC platform utilisation, avoiding that some computing resources are idle while others are overloaded.

Dynamic reconfiguration has been an important topic in distributed systems [32-37], where the main focus is to achieve transparent system maintenance and evolution. Although our context is MP-SoC platforms, the distributed nature of the applications and the multitude of processors share many characteristics with distributed systems. Some common issues are task state representation and message consistency during reconfiguration [38].


For MP-SoC platforms, the following issues also receive attention. Reference [39] supports incremental reconfiguration, at the level of tasks, ports and channels. Reference [40] offers high reconfiguration freedom at the communication level. References [41] and [42] present different reconfiguration interfaces. References [43] and [44] propose efficient run-time support for reconfiguration of inter-task communication and for resource management, that is, workload sharing and admission control, to improve the platform resource utilisation while preserving all timing constraints. To perform task migration, tasks are often assumed to be suspendable at any time, which can be difficult to achieve, and possibly for significant periods of time, which is unacceptable for real-time behaviour. Reference [45] introduces a technique to migrate tasks without suspending them.

Other important issues for both task scheduling and dynamic reconfiguration in MP-SoC platforms are: (a) the design-time application mapping and platform exploration, and (b) the run-time decision-making for the overall set of active applications, according to application constraints and platform resources. Design-time exploration allows conservative worst-case assumptions to be avoided, while also eliminating the large run-time overheads of state-of-the-art RTOS kernels. Reference [46] proposes an exploration technique restricted to SoC platforms with only one processor core, exploring SoC configurations with different power and performance trade-offs for any application to be mapped on the SoC platform. In contrast to [46], our study is intended for MP-SoC platforms, and it proposes a combined application mapping and platform exploration approach.

3 Overview of our customised run-time management

To meet the needs presented earlier, that is, to alleviate the OS in its run-time decision-making and to avoid conservative worst-case assumptions, we propose a customised run-time management approach to map the applications on the platform, consisting of two phases: (a) a design-time mapping and platform exploration per application; and (b) a low-complexity run-time manager incorporated on top of the basic OS services. This run-time manager globally optimises costs (e.g. energy consumption) across all active applications, according to their constraints (e.g. performance, user requirements) and the available platform resources. It also performs low-cost switches between possible mappings of the same application, as required by environment changes.

A similar conceptual approach [13, 14] was already developed for scheduling concurrent tasks on embedded systems. However, this was intended to optimise the energy consumption while respecting the application deadlines only. Other differences are presented in the following too.

In contrast to the conventional approaches that generate only one solution for each application, the first phase of our approach is a design-time application mapping and platform exploration. For each application, this exploration generates a set of optimal mappings in a multi-dimensional design space (Fig. 1), instead of a two-dimensional one as in [13, 14]. Current dimensions are application constraints (e.g. performance, user requirements), used platform resources (e.g. memory usage, number of used processors per processor type, communication bandwidth, clocks and processor supply voltage, if it is allowed) and costs (e.g. energy consumption). Only points being better than the other ones in at least one dimension are retained. They are called Pareto points. The resulting set of Pareto points is called the Pareto set. This design-time exploration phase is the main contribution of our study. It is detailed later on.

Dependent on the application constraints and on the availability of the platform resources, any one of these Pareto points, representing application mappings, can be selected by the run-time manager. Unlike [13, 14], each Pareto point is also annotated with a code version. The different code versions refer to different parallelisations of the application into parallel tasks and to data transfers (also called block transfers [47]) between SPMs and local memories. The characterisation and efficient merging of all these code versions, called Pareto-based application specification, is presented in [48].

Hence, in total, our Pareto set is made up, for any application, of optimal mappings characterised by a code version together with an optimal combination of application constraints, used platform resources and costs. The description of the data structures storing the information related to this Pareto set and the Pareto points is out of the scope of this study.

The full exploration is done at design time, whereas the critical decisions are taken during the second phase of our approach by a low-complexity run-time manager (Fig. 3). The latter provides the following services:

† Whenever a new application is activated, our run-time manager parses its Pareto set provided by the design-time exploration and stores it in the shared memory of the MP-SoC platform, including all task binaries.

† Whenever the environment is changing (e.g. when a new application/use case starts, or when the user requirements change), for each active application, our run-time manager reacts as follows. First, it selects in a predictable way a mapping from its Pareto set, according to the available platform resources, in order to minimise the total energy consumption of the platform, while respecting all application constraints. A heuristic solving this optimisation problem, and being fast enough for MP-SoC platforms, is presented in [49]. Second, it performs Pareto point switches (Fig. 2, restricted to two dimensions), as explained earlier. The Pareto point switch technique bears some resemblance to dynamic reconfiguration. It can switch to other mappings, but, in contrast to dynamic reconfiguration, it exploits the more complex run-time trade-offs offered by the multi-dimensional Pareto sets.

Fig. 3 Our MP-SoC run-time management (per application, a refined Pareto set of code versions annotated with performance, energy, memory and PE usage is passed from the design-time exploration to the customised run-time manager, running with constraints and platform information on top of the low-complexity run-time layer and RTOS kernel)
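As a rough illustration of the first of these run-time services (the real selection uses the fast multi-dimension multi-choice knapsack heuristic of [49], not this code), the following greedy sketch picks, for each active application, the lowest-energy Pareto point that meets its deadline and still fits in the processors left on the platform. All names and fields are hypothetical, mirroring the mapping-point sketch given earlier.

    #include <stddef.h>

    typedef struct {
        double energy;      /* cost to minimise (e.g. mJ per frame)   */
        double exec_time;   /* resulting performance                  */
        int    num_pes;     /* processors this mapping needs          */
    } pareto_point;

    typedef struct {
        const pareto_point *points;    /* the application's Pareto set        */
        size_t              n_points;
        double              deadline;  /* application performance constraint  */
        int                 selected;  /* chosen point index, -1 if none      */
    } app_state;

    /* Greedy selection over all active applications, within the PE budget. */
    int select_mappings(app_state *apps, size_t n_apps, int pes_available)
    {
        for (size_t a = 0; a < n_apps; a++) {
            int best = -1;
            for (size_t k = 0; k < apps[a].n_points; k++) {
                const pareto_point *p = &apps[a].points[k];
                if (p->exec_time > apps[a].deadline)  continue;  /* misses deadline */
                if (p->num_pes   > pes_available)     continue;  /* does not fit    */
                if (best < 0 || p->energy < apps[a].points[best].energy)
                    best = (int)k;
            }
            if (best < 0)
                return -1;          /* no feasible mapping: the request is rejected */
            apps[a].selected  = best;
            pes_available    -= apps[a].points[best].num_pes;
        }
        return 0;   /* every active application received a feasible mapping */
    }

Such a greedy pass ignores the interaction between applications that the heuristic of [49] handles; it is only meant to show the shape of the decision the run-time manager takes at every environment change.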


4 Demonstrator

An important contribution of this study is also the practical flow that is required for our design-time exploration, applied to two real-life applications and simulated on our MP-SoC platform simulator.

4.1 Cavity detector application

The first driver application (Fig. 4) that this study uses is a cavity detector algorithm. This is an image-processing application used mainly in the medical field for detecting tumour cavities in computer tomography pictures [50]. The starting algorithm, expressed in C code, has one image frame of M × N pixels as input, and one as output. The main bottlenecks of this algorithm are: (a) important data transfer and storage because of large data arrays (for an image of 256 K pixels, as the one shown in Fig. 4 and used for our experiments, the internal data size is about 2 MB); (b) bad performance because of the presence of a complex loop (M × N iterations and a large body with many data dependencies).

4.2 QSDPCM application

As second driver application, an inter-frame compression technique for video images, called Quadtree Structured Difference Pulse Code Modulation (QSDPCM), is used [51]. It is representative of many of today's video codec multimedia applications. It involves a three-stage hierarchical motion estimation (ME4, ME2 and ME1), followed by a quadtree-based encoding of the motion-compensated frame-to-frame difference signal, a quantisation, and a Huffman-based compression (QC). Two image resolutions are allowed: either QCIF, with an image size of 176 × 144 pixels, or VGA, with an image size of 640 × 480 pixels. In our experiments, the QCIF resolution is used. The starting algorithm, expressed in C code too, has two image frames (the previous and current ones) as input, and one bit stream as output (Fig. 6a). It presents similar bottlenecks to the ones in the cavity detector.

4.3 MP-SoC platform simulator

Our MP-SoC platform simulator (Fig. 5) assumes a platform composed of: (a) processor nodes with local memories and local buses; (b) distributed shared memory nodes; (c) communication assists, similar to Direct Memory Access (DMA) controllers, providing high-level services to processors and shared memories for efficient data transfers; (d) input/output (I/O) nodes; and (e) a communication architecture, being the AEthereal NoC.

Fig. 5 Our platform simulator

The main platform parameters that can be explored at present are: the network clock, the maximum number of time slots, the number of routers, the processor clock and supply voltage if it is allowed, the memory clock, the read/write communication bandwidth between a processor and a shared memory, the number of processors to be used by the application, the local memory usage and some QoS requirements (either guaranteed throughput or best effort). Here, we use an in-house TI C62 DSP ISS to simulate processes, a network clock period of 4 ns, and guaranteed throughput as QoS requirement.

Fig. 4 Cavity detector application: a Starting code; b Tuned code

Fig. 6 QSDPCM application: a Tuned algorithm; b Relevant parallelisations
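The platform parameters listed in Section 4.3 can be thought of as one configuration record per simulation run. The sketch below is a hypothetical C rendering of such a record; only the values fixed in our experiments (a 4 ns network clock and guaranteed-throughput QoS) are filled in, every other field being a variable of the exploration.

    typedef enum { QOS_GUARANTEED_THROUGHPUT, QOS_BEST_EFFORT } qos_t;

    typedef struct {
        double network_clock_ns;   /* NoC clock period                          */
        int    max_time_slots;     /* maximum number of NoC time slots          */
        int    num_routers;
        double proc_clock_ns;      /* processor clock period                    */
        double proc_vdd;           /* processor supply voltage, if allowed      */
        double mem_clock_ns;       /* memory clock                              */
        double rw_bandwidth_mbs;   /* read/write BW processor <-> shared memory */
        int    num_processors;     /* processors used by the application        */
        int    local_mem_bytes;    /* local memory usage                        */
        qos_t  qos;                /* guaranteed throughput or best effort      */
    } platform_config;

    /* Baseline fixed in our experiments; the remaining fields are swept per
     * run by the exploration (the record layout itself is an assumption). */
    static const platform_config baseline = {
        .network_clock_ns = 4.0,                       /* 4 ns network clock    */
        .qos              = QOS_GUARANTEED_THROUGHPUT, /* guaranteed throughput */
    };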


5 Design-time application and platform exploration

In our run-time management approach, presented earlier, a multi-dimensional Pareto set of mappings is generated by a design-time exploration for any application to be mapped on the MP-SoC platform. This exploration is done in the following steps.

5.1 Application code tuning

Code tuning is a preprocessing step needed to derive a clean application specification with efficient data management and processing. Ad hoc code transformations, commonly used by designers, consist in:

† Minimising the size of large internal arrays. This can be done with the help of the MEMORYCOMPACTION tool [52, 53] in ATOMIUM [54].

† Optimising the performance, mainly the loop performance, especially by reducing the number of clock cycles per iteration and by achieving software pipelining. Typical transformations to improve the performance are: avoid function calls by explicitly inlining relevant functions, unroll loops with a small number of iterations, remove conditions, split loops with a large body into smaller ones, and simplify modulo operations, mainly in array index computations. This last transformation can be done with the help of the RACE tool [55] in ATOMIUM.

In our approach, to preserve these optimisations in later code refinements, and to derive several code versions with minimum code duplication, any optimised loop is encapsulated in a function, called a kernel, as sketched below.
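As an illustration only, assuming a line length and a filtering operation that are placeholders rather than the real cavity detector computation, the following C fragment shows the encapsulation style: each optimised loop lives in its own kernel function, so that later refinements (block-transfer insertion, new code versions) never touch the tuned loop body.

    #define N_PIXELS_PER_LINE 512     /* example line length, not the real M x N */

    /* kernel1: one optimised loop wrapped in its own function, manipulating a
     * complete pixel line at a time so the compiler can software-pipeline it. */
    static void kernel1(const unsigned char in_line[N_PIXELS_PER_LINE],
                        unsigned char out_line[N_PIXELS_PER_LINE])
    {
        out_line[0] = in_line[0];
        out_line[N_PIXELS_PER_LINE - 1] = in_line[N_PIXELS_PER_LINE - 1];
        for (int x = 1; x < N_PIXELS_PER_LINE - 1; x++)   /* no calls, no conditions */
            out_line[x] = (unsigned char)((in_line[x - 1] + 2 * in_line[x] +
                                           in_line[x + 1]) / 4);
    }

    /* The tuned cavity detector chains four such kernels per pixel line
     * (Fig. 4b); only the first is sketched here. */
    void process_line(const unsigned char in_line[N_PIXELS_PER_LINE],
                      unsigned char out_line[N_PIXELS_PER_LINE])
    {
        kernel1(in_line, out_line);
    }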

Cavity detector experiments: The internal arrays can be compacted, and for an image of 256 K pixels this memory compaction yields a reduction of 99.6% in the internal data size: three internal arrays can be compacted from ar[M][N] into r[4][N], whereas three other ones can be reduced to a scalar. For the starting code, the assembly code generated by the TI C62 compiler indicates no feasible software pipelining, and an average number of clock cycles per pixel of 1139. To improve the performance, four kernels are identified (Fig. 4b), where each kernel manipulates a complete pixel line at a time. Now, in contrast to the starting code, the assembly code generated by the TI C62 compiler indicates that software pipelining is possible inside each kernel. The average number of clock cycles per pixel is now 51, yielding a performance gain of 95.5%. For this application, the code size is negligible: 116 Kbytes after compilation with the TI C62 compiler.

QSDPCM experiments: The starting code requires 45 M cycles per QCIF frame. The tuned algorithm is illustrated in Fig. 6a, where each module is a loop manipulating two pixel blocks at each iteration (one from the current frame, and the other from the previous frame). The optimised code requires 6 M cycles per QCIF frame, yielding a performance gain of 86.6%. The code size is still negligible: 263 Kbytes.

5.2 Parallelisation exploration

Parallelising an application can be done both at the functional and the data level. At the functional level, the algorithm is partitioned into smaller tasks, and synchronisation requirements between them are identified to allow pipelined execution of these tasks. At the data level, for instance in video applications, images can be divided into blocks of rows. Any task parallelised at the data level deals with its own block of rows.

Experiments: Related to the functional-level parallelisation, the QSDPCM can be naturally partitioned into either three tasks (ME42, ME1 and QC) or two tasks (ME42, and ME1 merged with QC). To further alleviate the computation effort of ME1, the input frames can be divided into row blocks to parallelise ME1 at the data level. Up to five parallel ME1 tasks have been considered, beyond which no performance gain is reached any more because of very large task synchronisation and block transfer (BT) overhead. This is illustrated later. These QSDPCM parallelisations are illustrated in Fig. 6b.

Parallelising the cavity detector is not relevant enough and only one processor is considered.

5.3 Block transfer exploration

To optimise both performance and energy consumption in the memory subsystem, parts of the data arrays stored in the SPM are copied into the processor local memory, from where they are accessed multiple times [47]. These copy operations (also called BTs) are performed through function calls in the application code, first to issue a BT, and next to synchronise its completion with the processing. This allows BTs to be performed in parallel with processing and hence improves the application performance. This is illustrated in Fig. 7, where a BT into a copy cp_prev_frame is performed in parallel with a for loop processing; a sketch of this issue/synchronise pattern is also given below. This reduces the waiting time for the BT completion and, in this small example, reaches a performance gain of 16 cycles per iteration.

Hence, the application tuned code must still be further refined as follows: (a) select the data arrays to be stored in SPMs; (b) explore the size of the copies, needed to access these arrays and stored in the processor local memory; and (c) explore the places in the code where to insert the BT calls (to preserve previous performance optimisations, these BT calls may not be inserted inside kernels). This exploration can be done with the help of the extended MHLA tool [47] in ATOMIUM. Several efficient solutions, yielding different local memory usages and performances, exist for the copy sizes and the places in the code where to insert these BT calls.

Hence, considering all combinations of BT solutions in all tasks of any parallelised application gives rise to a huge number of different application code versions. A Pareto-based specification, merging all of them and allowing efficient loading of any task binary into the platform, is required. This specification is characterised in [48].

Fig. 7 Block transfers from scratchpad memories to processor local memory
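The sketch below illustrates the issue/synchronise block-transfer pattern of Fig. 7 in C. bt_issue() and bt_sync() are hypothetical names standing in for the platform's communication-assist (DMA-like) calls, and the copy size is an arbitrary example; only the double-buffered cp_prev_frame copy named in the figure is taken from the text. While iteration i is being processed, the copy needed by iteration i+1 is already in flight, which hides the BT waiting time.

    /* Hypothetical communication-assist primitives (not a real API). */
    extern void bt_issue(int id, void *dst, const void *src, unsigned bytes);
    extern void bt_sync(int id);            /* wait until BT 'id' has completed */

    #define BLOCK_BYTES 1024                             /* example copy size */
    static unsigned char cp_prev_frame[2][BLOCK_BYTES];  /* double buffer     */

    void process_frame(const unsigned char *prev_frame_spm, int n_blocks)
    {
        int buf = 0;
        bt_issue(0, cp_prev_frame[buf], prev_frame_spm, BLOCK_BYTES);
        for (int i = 0; i < n_blocks; i++) {
            bt_sync(i % 2);                  /* copy for this iteration is ready */
            if (i + 1 < n_blocks)            /* prefetch the next block          */
                bt_issue((i + 1) % 2, cp_prev_frame[buf ^ 1],
                         prev_frame_spm + (unsigned)(i + 1) * BLOCK_BYTES,
                         BLOCK_BYTES);
            /* ... kernel processing of cp_prev_frame[buf] runs here,
             *     overlapping with the BT issued just above ... */
            buf ^= 1;
        }
    }

The same source can trivially be turned into the sequential variant (issue immediately followed by sync, no double buffer), which is exactly the trade-off between local memory usage and BT waiting time explored in the code versions below.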


Cavity detector experiments: The compacted internal arrays are small enough to be stored in the processor local memory. However, the arrays Image_in and Image_out, storing the input and output images, respectively, must still be stored in the SPM. To access them, two copies are needed, to allow read (resp. write) BTs: Image_in_cp (resp. Image_out_cp).

Related to the copy sizes, two relevant possibilities emerge: (a) either the copy is able to store only one pixel line; then the BT must be completed during the current iteration, before the kernel functions start to manipulate it; (b) or the copy is able to store two pixel lines: one manipulated during the current iteration, and another used by the BT performed in parallel with this iteration, which must be ready for the next iteration only. Related to the BT calls, the following considerations must be taken into account: (a) Image_in_cp being read in kernel1() implies that the corresponding read BT must be synchronised before calling this kernel; (b) Image_out_cp being written in kernel4() implies that the corresponding write BT can be issued only after the execution of this kernel, and that it must be synchronised before the next call of this kernel. Apart from these constraints, some freedom is left where to place the BT calls, and this is explored.

Five code versions are retained, with different local memory usages (storing both the initial internal data arrays and the copies) and BT calls (Table 1). These BTs are performed either sequentially or in parallel with processing, yielding different performances.

Table 1: Block transfer exploration for the cavity detector
Code version              1       2       3       4       5
Loc. mem. usage (bytes)   12 800  17 920  15 360  15 360  15 360

QSDPCM experiments: Three arrays (storing the current image frame, the previous one, and some internal data required in ME42()) are too large and must be stored in the SPM. Several efficient BT solutions are explored. Table 2 reports, for the task ME1, the resulting processor local memory usage. Similar BT solutions are derived for the tasks QC and ME1_QC, whereas only one efficient BT solution is derived for ME42.

Table 2: Block transfer exploration for the ME1 task in QSDPCM
Code version              1     2     3     4
Loc. mem. usage (bytes)   1540  1796  1972  2228

5.4 Platform exploration

In our multi-dimensional design space, the dimensions related to the used platform resources are the memory usage, the number of used processors per processor type, the read/write communication bandwidth (BW) between the SPM and the processor local memory, the clocks and the processor supply voltage if it is allowed. The main platform parameters that can still be explored at this step are the processor clock period and the requested read/write communication BW. The number of used processors is fixed by the previous parallelisation exploration. The memory usage is fixed by the previous block transfer exploration.

We can observe that:

† The processor speed influences the BT waiting time as follows: (a) the more BTs in parallel with processing, the smaller the BT waiting time; (b) for any BT in parallel with processing, the slower the processor, the smaller the BT waiting time. Hence, the best performing code is the one with the most BTs in parallel with processing. But this has a price, which is the local memory usage, with larger data copies to transfer data in advance for future processing.

† Requesting more read/write BW also allows the performance to be increased, but only up to some threshold from which the maximum throughput of the application is achieved.

Cavity detector experiments: In Fig. 8, the processor speed is explored and its influence on both BT waiting and processing times is reported. This exploration [Note 1] is performed on three different application codes: Code 0, where both read and write BTs are performed sequentially with processing, which implies that the BT waiting time is independent of the processor speed; and Code 1 and Code 2 (Table 1), with larger data copies, but with BTs in parallel with processing. As mentioned earlier, the processing time is only determined by the processor clock period and the number of processing cycles of the application (i.e. 28.7 per pixel). Code 2 is the best performing code. However, compared with Code 1, it needs 28.6% more local memory to be 29.3% faster (for a processor clock period of 8 ns).

Fig. 8 Influence of code version and processor speed on performance

In Fig. 9, the requested read/write BW to the SPM is explored [Note 2] for Code 1, and the shown Pareto points indicate the threshold from which requesting still more BW does not make sense any more. Fig. 10 illustrates the code distribution [Note 3] in the Pareto set. Dependent on the requested read BW, the required performance of the application, and the available local memory space, any of the six codes can be best selected by the run-time manager.

QSDPCM experiments: In Fig. 11, the number of processors is explored [Note 4] for different read/write BW. The shown Pareto points indicate the threshold from which requesting still more BW or more processors does not make sense any more. As already mentioned, no further performance gain is reached beyond five parallel ME1 tasks, because of the very large task synchronisation and BT overhead.

Fig. 9 Bandwidth exploration
Fig. 10 Code distribution
Fig. 11 Processor exploration

Note 1: A read/write bandwidth to the SPM of 40 MB/s is requested.
Note 2: A processor clock period of 2 ns is considered.
Note 3: For a processor clock period of 2 ns and a requested write BW of 100 MB/s.
Note 4: A processor clock period of 2 ns is considered.
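As a small illustration of how the run-time manager can use this exploration data, the sketch below checks which of the five cavity detector code versions of Table 1 fit in the local memory still available on the selected processor; which feasible version is actually chosen then depends on the requested BW and on the performance/energy values stored in the Pareto set (cf. Fig. 10). The function name is hypothetical; only the memory figures come from Table 1.

    #define N_VERSIONS 5

    /* Local memory usage per cavity detector code version, from Table 1 (bytes). */
    static const unsigned loc_mem_bytes[N_VERSIONS] = {
        12800, 17920, 15360, 15360, 15360
    };

    /* Returns a bitmask of the code versions (1..5, bit 0 = version 1) that
     * fit in 'available_bytes' of processor local memory. */
    unsigned feasible_versions(unsigned available_bytes)
    {
        unsigned mask = 0;
        for (int v = 0; v < N_VERSIONS; v++)
            if (loc_mem_bytes[v] <= available_bytes)
                mask |= 1u << v;
        return mask;
    }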


6 Conclusion

In this study, we describe our design-time application and platform exploration that enables a customised run-time management for MP-SoCs. This exploration generates for each application a multi-dimensional Pareto set of optimal mappings. The optimal trade-offs offered by the design-time exploration to the run-time manager are illustrated on two real-life applications (image processing and video codec multimedia). After tuning the starting code, several code versions are derived, trading different performances for memory and processor usage, according to the available platform resources. Our future work includes the run-time support to allow Pareto point switches and the analysis of the resulting run-time overhead.

7 Acknowledgment

This work has partly been funded by the 4S project (IST 001908), financed by the European Commission in the context of Framework Programme 6.

8 References

1 Cumming, P.: 'The TI OMAP platform approach to SoC' (Kluwer Academic, 2003)
2 Wolf, W.: 'The future of multiprocessor systems-on-chips'. Proc. Design Automation Conference, 2004, pp. 681-685
3 Dielissen, J., Radulescu, A., Goossens, K., and Rijpkema, E.: 'Concepts and implementation of the Philips network-on-chip'. Proc. IP-based SoC Design, November 2003
4 Adriahantenaina, A., Charley, H., Greiner, A., Mortiez, L., and Zeferino, C.A.: 'SPIN: a scalable, packet switched, on-chip micro-network'. Proc. Conference on Design, Automation and Test in Europe, 2003, pp. 70-73
5 Benini, L., and Micheli, G.D.: 'Networks on chips: a new SoC paradigm', IEEE Comput., 2002, pp. 70-78
6 Ye, T., and Micheli, G.D.: 'Physical planning for on-chip multiprocessor networks and switch fabrics'. Proc. Application-Specific Systems, Architectures, and Processors, 2003
7 Bertozzi, D., Jalabert, A., Srinivasan, M., Tamhankar, R., Stergiou, S., Benini, L., and Micheli, G.D.: 'NoC synthesis flow for customized domain specific multiprocessor systems-on-chip', IEEE Trans. Parallel Distributed Syst., 2005, 16, (2), pp. 113-129
8 Murali, S., and Micheli, G.D.: 'Bandwidth-constrained mapping of cores onto NoC architectures'. Proc. Conference on Design, Automation and Test in Europe, Paris, France, February 2004
9 Tanenbaum, A.S.: 'Distributed operating systems' (Prentice-Hall, New Jersey, 1996)
10 Verma, M., Wehmeyer, L., and Marwedel, P.: 'Dynamic overlay of scratchpad memory for energy minimization'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, 2004, pp. 104-109
11 Mamagkakis, S., Atienza, D., Poucet, C., Catthoor, F., Soudris, D., and Mendias, J.: 'Custom design of multi-level dynamic memory management subsystem for embedded systems'. Proc. IEEE Workshop on Signal Processing Systems, October 2004, pp. 170-175
12 Poletti, F., Marchal, P., Atienza, D., Benini, L., Catthoor, F., and Mendias, J.: 'An integrated hardware/software approach for run-time scratchpad management'. Proc. Design Automation Conference, San Diego, CA, USA, June 2004, pp. 238-243
13 Yang, P., and Catthoor, F.: 'Dynamic mapping and ordering tasks of embedded real-time systems on multiprocessor platforms'. Proc. International Workshop on Software and Compilers for Embedded Systems, Springer, Lecture Notes in Computer Science, September 2004, vol. 3199, pp. 167-181
14 Ykman-Couvreur, C., Catthoor, F., Vounckx, J., Folens, A., and Louagie, F.: 'Energy-aware dynamic task scheduling applied to a real-time multimedia application on an Xscale board', J. Low Power Electron., 2005, 1, (3), pp. 226-237
15 Kogel, T., and Meyr, H.: 'Heterogeneous MP-SoC - the solution to energy-efficient signal processing'. Proc. Design Automation Conference, 2004, pp. 686-691
16 van der Wolf, P., de Kock, E., Henriksson, T., Kruitzer, W., and Essink, G.: 'Design and programming of embedded multiprocessors: an interface-centric approach'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Stockholm, Sweden, September 2004, pp. 206-217
17 Henriksson, T., Kang, J., and van der Wolf, P.: 'Implementation of dynamic streaming applications on heterogeneous multi-processor applications'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp. 57-62
18 Reyes, V., Bautista, T., Marrero, G., and Nunez, A.: 'A multicast inter-task communication protocol for embedded multiprocessor systems'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp. 267-272
19 Sarmento, A., Kriaa, L., Grasset, A., Youssef, M.-W., Bouchhima, A., Rousseau, F., Cesario, W., and Jerraya, A.: 'Service dependency graph, an efficient model for hardware/software interfaces modeling and generation for SoC design'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp. 261-266
20 Yoo, S., Youssef, M., Bouchhima, A., and Jerraya, A.: 'Multi-processor SoC design methodology using a concept of two-layer hardware-dependent software'. Proc. Conference on Design, Automation and Test in Europe, Paris, February 2004
21 Ahmed, J., and Chakrabarti, C.: 'A dynamic task scheduling algorithm for battery powered DVS systems'. Proc. International Symposium on Circuits and Systems, May 2004, vol. 2, pp. 813-816
22 Brucker, P.: 'Scheduling algorithms' (Springer-Verlag, 2001, 3rd edn.), ISBN 3-540-41510-6


23 Cho, Y., Yoo, S., Choi, K., Zergainoh, N.-E., and Jerraya, A.: 'Scheduler implementation in MP SoC design'. Proc. Asia South Pacific Design Automation Conference, Shanghai, China, January 2005
24 Benini, L., Bogliolo, A., and Micheli, G.D.: 'A survey of design techniques for system-level dynamic power management', IEEE Trans. VLSI Syst., June 2000, 8, (3), pp. 299-316
25 Andrei, A., Schmitz, M., Eles, P., Peng, Z., and Al-Hashimi, B.: 'Overhead-conscious voltage selection for dynamic and leakage energy reduction of time-constrained systems'. Proc. Conference on Design, Automation and Test in Europe, Paris, France, February 2004, pp. 518-523
26 Leung, L.-F., Tsui, C.-Y., and Ki, W.-H.: 'Minimizing energy consumption of hard real-time systems with simultaneous tasks scheduling and voltage assignment using statistical data'. Proc. Asia South Pacific Design Automation Conference, 2004, pp. 663-665
27 Schaumont, P., Lai, B.-C.C., Qin, W., and Verbauwhede, I.: 'Cooperative multithreading on embedded multiprocessor architectures enables energy-scalable design'. Proc. Design Automation Conference, June 2005, pp. 27-30
28 Yaldiz, S., Demir, A., Tasiran, S., Ienne, P., and Leblebici, Y.: 'Characterizing and exploiting task load variability and correlation for energy management in multi core systems'. Proc. Workshop on Embedded Systems for Real-Time Multimedia, New York, USA, September 2005, pp. 135-140
29 Smit, L., Smit, G., Hurink, J., Boersma, H., Paulusma, D., and Wolkotte, P.: 'Run-time mapping of applications to a heterogeneous reconfigurable tiled system on chip architecture'. Proc. International Symposium on System-on-Chip, Tampere, Finland, November 2005
30 Hansson, A., Goossens, K., and Radulescu, A.: 'A unified approach to constrained mapping and routing on Network-on-Chip architectures'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp. 75-80
31 Hu, J., and Marculescu, R.: 'Communication and task scheduling of application-specific networks-on-chip', IEE Proc. - Comput. Digital Tech., September 2005, 152, (5), pp. 643-651
32 Hofmeister, C.: 'Dynamic reconfiguration of distributed applications'. PhD thesis, Dept of Computer Science, University of Maryland, College Park, USA, 1993
33 Hofmeister, C., and Purtilo, J.: 'A framework for dynamic reconfiguration of distributed programs'. Proc. International Conference on Distributed Computing Systems, 1991, pp. 560-571
34 Hofmeister, C., and Purtilo, J.: 'Dynamic reconfiguration in distributed systems: adapting software modules for replacement'. Proc. International Conference on Configurable Distributed Systems, May 1996, pp. 62-69
35 Kramer, J., and Magee, J.: 'The evolving philosophers problem: dynamic change management', IEEE Trans. Software Eng., 1990, 16, (11), pp. 1293-1306
36 Mitchell, S., Naguib, H., Coulouris, G., and Kindberg, T.: 'Dynamically reconfiguring multimedia components: a model-based approach'. Proc. ACM SIGOPS European Workshop, Sintra, Portugal, September 1998, pp. 40-47
37 Webb, D., Wendelborn, A., and Varyssiere, J.: 'A study of computational reconfiguration in a process network'. Proc. Workshop on Integrated Data Environments Australia, Victor Harbor, Australia, February 2000, pp. 51-55
38 Steketee, C., Zhu, W., and Moseley, P.: 'Implementation of process migration in Amoeba'. Proc. International Conference on Distributed Computing Systems, 1994, pp. 194-201
39 Nieuwland, A., Kang, J., and Gangwal, O.P.: 'C-HEAP: a heterogeneous multi-processor architecture template and scalable and flexible protocol for the design of embedded signal processing systems', Design Automat. Embedded Syst., 2002, 7, (3), pp. 233-270
40 Goossens, K.: 'A protocol and memory manager for on-chip communication'. Proc. International Symposium on Circuits and Systems, Sydney, IEEE Circuits and Systems Society, May 2001, vol. 2, pp. 225-228
41 Rutten, M.J., Pol, E.J., Van Eijndhoven, J., Walters, K., and Essink, G.: 'Dynamic reconfiguration of streaming graphs on a heterogeneous multiprocessor architecture'. Proc. IS&T/SPIE Annual Symposium on Electronic Imaging: Multimedia Processing and Applications, San Jose, California, USA, January 2005, pp. 101-106
42 Kang, J., Henriksson, T., and van der Wolf, P.: 'An interface for the design and implementation of dynamic applications on multi-processor architectures'. Proc. Workshop on Embedded Systems for Real-Time Multimedia, New York, USA, September 2005, pp. 101-106
43 Nollet, V., Marescaux, T., Avasare, P., Mignolet, J.-Y., and Verkest, D.: 'Centralized run-time resource management in a network-on-chip containing reconfigurable hardware tiles'. Proc. Conference on Design, Automation and Test in Europe, Munich, Germany, March 2005, pp. 252-253
44 Pellizzoni, R., and Caccamo, M.: 'Adaptive allocation of software and hardware real-time tasks for FPGA-based embedded systems'. Proc. IEEE Real-Time and Embedded Technology and Applications Symposium, 2006
45 Gericota, M., Alves, G., Silva, M., and Ferreira, J.: 'On-line defragmentation for run-time partially reconfigurable FPGAs'. Proc. International Conference on Field Programmable Logic and Applications, Montpellier, France, September 2002
46 Givargis, T., Vahid, F., and Henkel, J.: 'System-level exploration for Pareto-optimal configurations in parameterized system-on-a-chip', IEEE Trans. VLSI Syst., 2002, 10, (4), pp. 416-422
47 Brockmeyer, E., Miranda, M., Corporaal, H., and Catthoor, F.: 'Layer assignment techniques for low energy in multi-layered memory organisations'. Proc. Conference on Design, Automation and Test in Europe, 2003, pp. 1070-1075
48 Ykman-Couvreur, C., Nollet, V., Marescaux, T., Brockmeyer, E., Catthoor, F., and Corporaal, H.: 'Pareto-based application specification for MP-SoC customized run-time management'. Proc. International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, Samos, Greece, July 2006, pp. 78-84
49 Ykman-Couvreur, C., Nollet, V., Catthoor, F., and Corporaal, H.: 'Fast multi-dimension multi-choice knapsack heuristic for MP-SoC run-time management'. Proc. International Symposium on System-on-Chip, Tampere, Finland, November 2006
50 Bister, M., Taeymans, Y., and Cornelis, J.: 'Automatic segmentation of cardiac MR images', Comput. Cardiol., 1989, pp. 215-218
51 Strobach, P.: 'QSDPCM - a new technique in scene adaptive coding'. Proc. European Signal Processing Conference, Grenoble, France, September 1988, pp. 1141-1144
52 Greef, E.D., Catthoor, F., and Man, H.D.: 'Optimization for embedded parallel multimedia applications', Parallel Comput., 1997, 23, pp. 1811-1837
53 Greef, E.D., Catthoor, F., and Man, H.D.: 'Program transformation strategies for memory size and power reduction of pseudo-regular multimedia subsystems mapped on multi-processor architectures', Trans. Circuits Syst. Video Technol., 1998, 8, (6), pp. 719-723
54 Atomium. http://www.imec.be/atomium
55 Miranda, M., Catthoor, F., Janssen, M., and Man, H.D.: 'ADOPT: efficient hardware address generation in distributed memory architectures'. Proc. International Symposium on System Synthesis, 1996, pp. 20-25

