Design-time application mapping and platform exploration for MP-SoC customised run-time management

Ch. Ykman-Couvreur, V. Nollet, Th. Marescaux, E. Brockmeyer, Fr. Catthoor and H. Corporaal

Abstract: In a Multi-Processor System-on-Chip (MP-SoC) environment, a customised run-time management layer should be incorporated on top of the basic services to alleviate the run-time decision-making and to globally optimise costs (e.g. energy consumption) across all active applications, according to application constraints (e.g. performance, user requirements) and available platform resources. To that end, to avoid conservative worst-case assumptions, while also eliminating the large run-time overheads of state-of-the-art RTOS kernels, a Pareto-based approach is proposed, combining a design-time application and platform exploration with a low-complexity run-time manager. The design-time exploration phase of this approach is the main contribution of this work. It is substantiated with two real-life applications (image processing and video codec multimedia). These are simulated on an MP-SoC platform simulator and used to illustrate the optimal trade-offs offered by the design-time exploration to the run-time manager.

1 Introduction

An Operating System (OS, also called run-time management layer) is a middleware acting as a glue layer between both application and platform layers. Just like ordinary glue, an ideal OS should be adapted to the properties and requirements of the environment it has to be used for. In a Multi-Processor System-on-Chip (MP-SoC) environment, this OS should efficiently combine different aspects already present in different disciplines: to implement dynamic sets of applications as in the workstation environment, to manage different types of platform resources as in the parallel and distributed environment, and to handle non-functional aspects as in the embedded environment:

† First, mobile systems are typically battery-powered and have to support a wide and dynamic set of multimedia applications (e.g. video messaging, web browsing, video conferencing), three-dimensional games and many other compute-intensive tasks [1]. These applications are becoming more heterogeneous, dynamic with multiple use cases, and data-intensive. Hence, MP-SoC platforms have to be flexible and to fulfil the Quality-of-Service (QoS) requirements of the user (e.g. reliability, performance, energy consumption and video quality). Also the OS must be able to run all active applications in an optimal way.

† Second, the OS has to support platforms (e.g. TI OMAP and ST [1, 2]) which consist of a large number of heterogeneous processing elements (PE), each with its own set of capabilities. These platforms combine the advantages of parallel computing on multiple processors with the single-chip integration of SoCs. They provide high computational performance at a low energy cost, whereas typical embedded systems (e.g. handheld devices such as Personal Digital Assistants (PDAs)) are limited by the restricted amount of processing power and memory. As the application complexity grows, the major challenge is still the right parallelisation (both data level and functional level, both coarse grain and fine grain) of these applications and their mapping on the MP-SoC platform.

† Third, the PEs in the platform communicate with each other independently and concurrently. Traditional shared-medium communication architectures (e.g. buses) cannot support the massive data traffic. A flexible interconnect, a Network-on-Chip (NoC) [3, 4], must be adopted to provide reliable and scalable communication [5, 6]. Growing SoC complexity makes communication subsystem design as important as computation subsystem design [7]. The communication infrastructure must efficiently accommodate the communication needs of the integrated computation and storage elements. In application domains such as multimedia processing, the bandwidth requirements are already in the range of several hundred Mbps and are continuously growing [8]. In switched NoCs, switches set up communication paths that can change over time, and run-time channel and bandwidth reservation must be supported by the OS. Designing such an NoC becomes a major task for future MP-SoCs, where the communication cost is becoming much larger than the computation cost. A large fraction of the timing delay is spent on the signal propagation on the interconnect, and a significant amount of energy is also dissipated on the wires. Therefore an optimised NoC floorplan is of great importance for MP-SoC performance and energy consumption.

† Finally, for memory-intensive applications such as multimedia applications, the memory subsystem represents an important component in the overall energy cost. In the memory subsystem, ScratchPad Memories (SPM) are used [9, 10], as they perform better than caches in terms of energy per access, performance, on-chip area and predictability. However, unlike caches, SPMs require complex design-time application analysis to carefully decide which data to assign to the SPM, and software allocation techniques [11, 12]. Design-time allocation loads the SPM once at the start of the application execution, whereas run-time allocation changes the SPM contents during the execution to reflect the dynamic behaviour of the application.

To alleviate the OS in its run-time decision-making, and to avoid conservative worst-case assumptions, a very intimate link between the application layer and the platform layer is required. To this end, we have shown in previous work [13, 14] that it is better to add a customised ultra-lightweight run-time management layer on top of the basic OS services. This layer was developed for scheduling concurrent tasks on embedded systems. It was intended to optimise the energy consumption while respecting the application deadlines only. Our approach is currently extended in the MP-SoC context to map the applications on the platform, according to application constraints (e.g. performance, user requirements) and available platform resources. Our extended approach consists of the two following phases:

† First, a design-time mapping and platform exploration per application leads to a multi-dimensional Pareto set of optimal mappings (Fig. 1). Each mapping is characterised by a code version together with an optimal combination of application constraints, used platform resources and costs. The different code versions refer to different parallelisations of the application into concurrent tasks and to different data transfers between SPMs and local memories.

† Second, a low-complexity run-time manager, incorporated on top of the basic OS services, maintains the high quality of the exploration. Whenever the environment is changing (e.g. when a new application/use case starts, or when the user requirements change), for each active application, our run-time manager reacts as follows:

1. It selects in a predictable way a mapping from its Pareto set, according to the available platform resources, in order to minimise the total energy consumption of the platform, while respecting all application constraints.
2. It performs Pareto point switches (Fig. 2, restricted to two dimensions), that is, it assigns the platform resources, adapts the platform parameters, loads the task binaries from the shared memory into the corresponding local memories and issues the execution of the code versions according to the newly selected Pareto points. When Application A starts, it is assigned to three PEs with a slow clock (ck2). As soon as Application B starts, a Pareto point switch is needed to map A on only two PEs. By speeding up the clock (ck1), the application deadline is still met. After A stops, B can be spread over three PEs in order to reduce the energy consumption.

The design-time exploration phase of our approach is the main contribution of our study. The resulting Pareto set of optimal mappings, to be stored on the MP-SoC platform and used as input for our run-time manager, is essential to alleviate the OS in its run-time decision-making and to avoid conservative worst-case assumptions. Two representative real-life applications, an image processing algorithm and a video codec multimedia one, simulated on our MP-SoC platform simulator, are also used to illustrate the optimal trade-offs offered by the design-time exploration to the run-time manager.

2 Related work

In recent years, industrial MP-SoC components have been introduced by companies like TI and ST Microelectronics. Current OSs like the TI DSP/BIOS kernel with the DSP/BIOS link, the Quadros RTXC RTOS and the Enea Systems OSE RTOS are clearly focused on low-level run-time management (i.e. multiplexing the hardware and providing uniform communication primitives). They only provide an abstraction layer on top of the hardware; they expand and link together existing technologies, but they are not designed for the emerging MP-SoC environment. Support for SPMs, NoCs, dynamic power management, and QoS-aware and application-specific run-time management is lacking. Hence, none of these existing RTOSs represents the ideal glue layer for MP-SoCs. The user is supposed to implement his own run-time manager on top of the RTOS kernel.

State-of-the-art tools and design practice are also not yet in a shape to meet the needs presented previously. Currently, two diverging strategies are developed to cope with the design complexity of application-specific and heterogeneous MP-SoC platforms: either the IP-driven approach, or the design-flow-driven approach [15].

Fig. 1 Pareto set generated by our design-time exploration (energy against performance, for mappings using 1 to 4 PEs at clocks ck1 and ck2)

Fig. 2 Pareto point switch (Applications A and B mapped on Proc 0 to Proc 3 over time, at the events 'A starts', 'B starts', 'A stops' and 'B stops')
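To make Figs. 1 and 2 concrete, the sketch below shows one possible C representation of a mapping point and of the dominance filter that retains only Pareto-optimal points at design time. The struct fields and function names are illustrative assumptions only; the actual data structures stored on the platform are out of the scope of this study.

    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative mapping point: one code version plus the resources it
     * uses and the resulting constraint/cost values (cf. Fig. 1). */
    typedef struct {
        int    code_version;   /* parallelisation / block-transfer variant */
        int    num_pes;        /* processors used                          */
        double clock_ns;       /* processor clock period                   */
        double exec_time;      /* performance (e.g. ms per frame)          */
        double energy;         /* cost to be minimised (e.g. mJ per frame) */
    } mapping_point;

    /* a dominates b if it is no worse in every dimension considered here
     * and strictly better in at least one (lower is better for all three). */
    static bool dominates(const mapping_point *a, const mapping_point *b)
    {
        bool no_worse = a->num_pes   <= b->num_pes   &&
                        a->exec_time <= b->exec_time &&
                        a->energy    <= b->energy;
        bool better   = a->num_pes   <  b->num_pes   ||
                        a->exec_time <  b->exec_time ||
                        a->energy    <  b->energy;
        return no_worse && better;
    }

    /* Keep only non-dominated points (in place); returns the new count. */
    size_t pareto_filter(mapping_point *p, size_t n)
    {
        if (n == 0)
            return 0;
        bool keep[n];                       /* C99 VLA; modest point counts assumed */
        for (size_t i = 0; i < n; i++) {
            keep[i] = true;
            for (size_t j = 0; j < n && keep[i]; j++)
                if (i != j && dominates(&p[j], &p[i]))
                    keep[i] = false;
        }
        size_t kept = 0;
        for (size_t i = 0; i < n; i++)      /* compact the retained points */
            if (keep[i])
                p[kept++] = p[i];
        return kept;
    }

In this view, the filter is run once per application at design time over all explored mappings, and only the retained Pareto points are handed to the run-time manager.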


Related to the IP-driven approach, to reuse cores on different MP-SoC platforms, a single standard interface should be agreed upon. Several innovative design methodologies have recently appeared:

† The approach developed at Philips, Eindhoven is based on a task-level interface [16], called TTL. This is used both for modelling concurrent applications and as a platform interface for integrating hardware and software cores on MP-SoC platforms. As static task graph models are no longer sufficient, new reconfiguration services have been introduced in [17]. The underlying inter-task communication protocol is presented in [18].

† However, when the application imposes performance constraints and cost optimisations, a design customisation is required in addition to the pre-designed component integration. To solve this issue, the approach developed at TIMA, Grenoble, proposes a unified model [19] to represent all interfaces for SoC design. This model, called Service Dependency Graph (SDG), describes interactions between hardware and software as services, and it enables full system simulation at different abstraction levels. The associated design methodology, based on a two-layer hardware-dependent software (HdS) generated from the SDG, is described in [20].

† Other academic (Metropolis, SysExplorer) and commercial tools (CoWare ConvergenC, Summit Design Visual Elite) also exist supporting modelling and analysis of SoCs.

In these IP-driven approaches, synthesis is still component-centric and has no integral view on the entire system on the MP-SoC platform. Each application is synthesised separately, without considering the interacting data-dependent tasks and the shared global resources. Scheduling the accesses to the global resources is determined by local decisions. The designer has to ensure manually that the system communicates properly via buses and shared memories.

Related to the design-flow-driven approach, several global optimisation issues are considered in the academic world: application parallelisation, task scheduling and dynamic reconfiguration. Next we focus on task scheduling and dynamic reconfiguration, for which our design-time exploration offers trade-offs.

2.1 Task scheduling

Task scheduling is needed to guarantee application performances and platform resource constraints, and to optimise objectives such as response time, energy consumption and battery lifetime [21]. For distributed systems [22], it has the following characteristics. First, it assumes a homogeneous architecture. Second, the most important optimisation objective is performance. Third, the focus is mainly on computation rather than on communication. For MP-SoC platforms, task scheduling becomes more complicated [23]. Its impact on the energy consumption becomes more significant, whereas performance is a constraint to be satisfied. In this context, task scheduling mainly consists of the following actions: (a) spatial mapping, determining on which processor a task must be executed; (b) temporal mapping, deciding the order in which those tasks are executed; (c) communication routing, both spatial and temporal; (d) dynamic voltage/frequency scaling (DVS/DFS), determining the processor supply voltage and clock frequency, if it is allowed; and (e) dynamic power management (DPM), either shutting down or swapping platform components into some sleep mode whenever these enter an idle state.

Scheduling algorithms can be divided into design-time (also called static or off-line) and run-time (also called dynamic or on-line) algorithms. However, at design time it is unknown which applications are simultaneously active and which resources are available on the platform. So, design-time scheduling alone cannot solve the problem efficiently.

Energy consumption is increasingly an issue, and not only for battery-operated devices. Even if unlimited power is available, a large number of components tightly packed onto a chip poses cooling and reliability problems. An important way to reduce the energy consumption is to shut down or slow down functional components which are idle or underutilised, by combining DVS with DPM. Surveys of system-level design techniques can be found in [14] and [24], respectively. The energy management problem for real-time applications decomposed into concurrent tasks and implemented on MP-SoC platforms with DVS has already been addressed for several years. The most recent scheduling approaches, combining application mapping and DVS, can be found in [13, 25-28].

Other approaches combine application mapping with some run-time communication management. Reference [29] evaluates a run-time spatial and temporal mapping algorithm. This finds the right processor for a certain task and the appropriate communication path, using a library of different implementations per application (for an ARM, or a DSP, or an FPGA). It minimises the global energy consumption, while guaranteeing all application performances and platform resource constraints. The application is mapped when it is started. However, when certain events happen, the mapping might be reconsidered and/or the communication links might be re-routed. Reference [30] presents a unified single-objective algorithm, coupling spatial mapping of tasks, spatial routing of communication, and temporal mapping assigning time slots to these routes. The real-time communication requirements are considered to guarantee that application performances are met. Reference [31] introduces a novel energy-aware algorithm which statically schedules application-specific communication transactions and computation tasks onto heterogeneous NoC architectures.

2.2 Dynamic reconfiguration

Multimedia applications are becoming more complex, and multiple use cases and optimal mappings need to be supported. Moving from one mapping to another results from user interactions or from changes in the platform resource availability when new applications are activated. Adapting the mapping of an active application is called dynamic reconfiguration, or task migration. The key challenge is of course to maintain the real-time behaviour and the data integrity of the overall set of active applications. Nevertheless, dynamic reconfiguration is a powerful mechanism to improve the MP-SoC platform utilisation, avoiding that some computing resources are idle while others are overloaded.

Dynamic reconfiguration has been an important topic in distributed systems [32-37], where the main focus is to achieve transparent system maintenance and evolution. Although our context is MP-SoC platforms, the distributed nature of the applications and the multitude of processors share many characteristics with distributed systems. Some common issues are task state representation and message consistency during reconfiguration [38].


For MP-SoC platforms, the following issues also receive attention. Reference [39] supports incremental reconfiguration, at the level of tasks, ports and channels. Reference [40] offers high reconfiguration freedom at the communication level. References [41] and [42] present different reconfiguration interfaces. References [43] and [44] propose efficient run-time support for reconfiguration of inter-task communication and for resource management, that is, workload sharing and admission control, to improve the platform resource utilisation while preserving all timing constraints. To perform task migration, tasks are often assumed to be suspendable at any time, which can be difficult to achieve, and possibly for significant periods of time, which is unacceptable for real-time behaviour. Reference [45] introduces a technique to migrate tasks without suspending them.

Other important issues for both task scheduling and dynamic reconfiguration in MP-SoC platforms are: (a) the design-time application mapping and platform exploration, and (b) the run-time decision-making for the overall set of active applications, according to application constraints and platform resources. Design-time exploration allows conservative worst-case assumptions to be avoided, while also eliminating the large run-time overheads of state-of-the-art RTOS kernels. Reference [46] proposes an exploration technique restricted to SoC platforms with only one processor core, exploring SoC configurations with different power and performance trade-offs for any application to be mapped on the SoC platform. In contrast to [46], our study is intended for MP-SoC platforms, and it proposes a combined application mapping and platform exploration approach.

3 Overview of our customised run-time management

To meet the needs presented earlier, that is, to alleviate the OS in its run-time decision-making and to avoid conservative worst-case assumptions, we propose a customised run-time management approach to map the applications on the platform, consisting of two phases: (a) a design-time mapping and platform exploration per application; and (b) a low-complexity run-time manager incorporated on top of the basic OS services. This run-time manager globally optimises costs (e.g. energy consumption) across all active applications, according to their constraints (e.g. performance, user requirements) and the available platform resources. It also performs low-cost switches between possible mappings of the same application, as required by environment changes.

A similar conceptual approach [13, 14] was already developed for scheduling concurrent tasks on embedded systems. However, this was intended to optimise the energy consumption while respecting the application deadlines only. Other differences are presented in the following too.

In contrast to the conventional approaches that generate only one solution for each application, the first phase of our approach is a design-time application mapping and platform exploration. For each application, this exploration generates a set of optimal mappings in a multi-dimensional design space (Fig. 1), instead of a two-dimensional one as in [13, 14]. Current dimensions are application constraints (e.g. performance, user requirements), used platform resources (e.g. memory usage, number of used processors per processor type, communication bandwidth, clocks and processor supply voltage, if it is allowed) and costs (e.g. energy consumption). Only points being better than the other ones in at least one dimension are retained. They are called Pareto points. The resulting set of Pareto points is called the Pareto set. This design-time exploration phase is the main contribution of our study. It is detailed later on.

Dependent on the application constraints and on the availability of the platform resources, any one of these Pareto points, representing application mappings, can be selected by the run-time manager. Unlike [13, 14], each Pareto point is also annotated with a code version. The different code versions refer to different parallelisations of the application into parallel tasks and to data transfers (also called block transfers [47]) between SPMs and local memories. The characterisation and efficient merging of all these code versions, called Pareto-based application specification, is presented in [48].

Hence, in total, our Pareto set is made up, for any application, of optimal mappings characterised by a code version together with an optimal combination of application constraints, used platform resources and costs. The description of the data structures storing the information related to this Pareto set and the Pareto points is out of the scope of this study.

The full exploration is done at design time, whereas the critical decisions are taken during the second phase of our approach by a low-complexity run-time manager (Fig. 3). The latter provides the following services:

† Whenever a new application is activated, our run-time manager parses its Pareto set provided by the design-time exploration and stores it in the shared memory of the MP-SoC platform, including all task binaries.

† Whenever the environment is changing (e.g. when a new application/use case starts, or when the user requirements change), for each active application, our run-time manager reacts as follows. First, it selects in a predictable way a mapping from its Pareto set, according to the available platform resources, in order to minimise the total energy consumption of the platform, while respecting all application constraints. A heuristic solving this optimisation problem, and being fast enough for MP-SoC platforms, is presented in [49]. Second, it performs Pareto point switches (Fig. 2, restricted to two dimensions), as explained earlier. The Pareto point switch technique bears some resemblance to dynamic reconfiguration. It can switch to other mappings, but, in contrast to dynamic reconfiguration, it exploits the more complex run-time trade-offs offered by the multi-dimensional Pareto sets.

Fig. 3 Our MP-SoC run-time management (per application, a refined Pareto set of code versions annotated with performance, energy, memory and PE usage is passed from the design-time exploration to the customised run-time manager, running with constraints and platform information on top of the low-complexity run-time layer and RTOS kernel)
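As a rough illustration of the first of these run-time services (the real selection uses the fast multi-dimension multi-choice knapsack heuristic of [49], not this code), the following greedy sketch picks, for each active application, the lowest-energy Pareto point that meets its deadline and still fits in the processors left on the platform. All names and fields are hypothetical, mirroring the mapping-point sketch given earlier.

    #include <stddef.h>

    typedef struct {
        double energy;      /* cost to minimise (e.g. mJ per frame)   */
        double exec_time;   /* resulting performance                  */
        int    num_pes;     /* processors this mapping needs          */
    } pareto_point;

    typedef struct {
        const pareto_point *points;    /* the application's Pareto set        */
        size_t              n_points;
        double              deadline;  /* application performance constraint  */
        int                 selected;  /* chosen point index, -1 if none      */
    } app_state;

    /* Greedy selection over all active applications, within the PE budget. */
    int select_mappings(app_state *apps, size_t n_apps, int pes_available)
    {
        for (size_t a = 0; a < n_apps; a++) {
            int best = -1;
            for (size_t k = 0; k < apps[a].n_points; k++) {
                const pareto_point *p = &apps[a].points[k];
                if (p->exec_time > apps[a].deadline)  continue;  /* misses deadline */
                if (p->num_pes   > pes_available)     continue;  /* does not fit    */
                if (best < 0 || p->energy < apps[a].points[best].energy)
                    best = (int)k;
            }
            if (best < 0)
                return -1;          /* no feasible mapping: the request is rejected */
            apps[a].selected  = best;
            pes_available    -= apps[a].points[best].num_pes;
        }
        return 0;   /* every active application received a feasible mapping */
    }

Such a greedy pass ignores the interaction between applications that the heuristic of [49] handles; it is only meant to show the shape of the decision the run-time manager takes at every environment change.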


4 Demonstrator

An important contribution of this study is also the practical flow that is required for our design-time exploration, applied to two real-life applications and simulated on our MP-SoC platform simulator.

4.1 Cavity detector application

The first driver application (Fig. 4) that this study uses is a cavity detector algorithm. This is an image-processing application used mainly in the medical field for detecting tumour cavities in computer tomography pictures [50]. The starting algorithm, expressed in C code, has one image frame of M × N pixels as input, and one as output. The main bottlenecks of this algorithm are: (a) important data transfer and storage because of large data arrays (for an image of 256 K pixels, as the one shown in Fig. 4 and used for our experiments, the internal data size is about 2 MB); (b) bad performance because of the presence of a complex loop (M × N iterations and a large body with many data dependencies).

4.2 QSDPCM application

As second driver application, an inter-frame compression technique for video images, called Quadtree Structured Difference Pulse Code Modulation (QSDPCM), is used [51]. It is representative of many of today's video codec multimedia applications. It involves a three-stage hierarchical motion estimation (ME4, ME2 and ME1), followed by a quadtree-based encoding of the motion-compensated frame-to-frame difference signal, a quantisation, and a Huffman-based compression (QC). Two image resolutions are allowed: either QCIF, with an image size of 176 × 144 pixels, or VGA, with an image size of 640 × 480 pixels. In our experiments, the QCIF resolution is used. The starting algorithm, expressed in C code too, has two image frames (the previous and current ones) as input, and one bit stream as output (Fig. 6a). It presents similar bottlenecks to the ones in the cavity detector.

4.3 MP-SoC platform simulator

Our MP-SoC platform simulator (Fig. 5) assumes a platform composed of: (a) processor nodes with local memories and local buses; (b) distributed shared memory nodes; (c) communication assists, similar to Direct Memory Access (DMA) controllers, providing high-level services to processors and shared memories for efficient data transfers; (d) input/output (I/O) nodes; and (e) a communication architecture, being the AEthereal NoC.

Fig. 5 Our platform simulator

The main platform parameters that can be explored at present are: the network clock, the maximum number of time slots, the number of routers, the processor clock and supply voltage if it is allowed, the memory clock, the read/write communication bandwidth between a processor and a shared memory, the number of processors to be used by the application, the local memory usage and some QoS requirements (either guaranteed throughput or best effort). Here, we use an in-house TI C62 DSP ISS to simulate processes, a network clock period of 4 ns, and guaranteed throughput as QoS requirement.

Fig. 4 Cavity detector application: a Starting code; b Tuned code

Fig. 6 QSDPCM application: a Tuned algorithm; b Relevant parallelisations
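The platform parameters listed in Section 4.3 can be thought of as one configuration record per simulation run. The sketch below is a hypothetical C rendering of such a record; only the values fixed in our experiments (a 4 ns network clock and guaranteed-throughput QoS) are filled in, every other field being a variable of the exploration.

    typedef enum { QOS_GUARANTEED_THROUGHPUT, QOS_BEST_EFFORT } qos_t;

    typedef struct {
        double network_clock_ns;   /* NoC clock period                          */
        int    max_time_slots;     /* maximum number of NoC time slots          */
        int    num_routers;
        double proc_clock_ns;      /* processor clock period                    */
        double proc_vdd;           /* processor supply voltage, if allowed      */
        double mem_clock_ns;       /* memory clock                              */
        double rw_bandwidth_mbs;   /* read/write BW processor <-> shared memory */
        int    num_processors;     /* processors used by the application        */
        int    local_mem_bytes;    /* local memory usage                        */
        qos_t  qos;                /* guaranteed throughput or best effort      */
    } platform_config;

    /* Baseline fixed in our experiments; the remaining fields are swept per
     * run by the exploration (the record layout itself is an assumption). */
    static const platform_config baseline = {
        .network_clock_ns = 4.0,                       /* 4 ns network clock    */
        .qos              = QOS_GUARANTEED_THROUGHPUT, /* guaranteed throughput */
    };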


5 Design-time application and platform exploration

In our run-time management approach, presented earlier, a multi-dimensional Pareto set of mappings is generated by a design-time exploration for any application to be mapped on the MP-SoC platform. This exploration is done in the following steps.

5.1 Application code tuning

Code tuning is a preprocessing step needed to derive a clean application specification with efficient data management and processing. Ad hoc code transformations, commonly used by designers, consist in:

† Minimising the size of large internal arrays. This can be done with the help of the MEMORYCOMPACTION tool [52, 53] in ATOMIUM [54].

† Optimising the performance, mainly the loop performance, especially by reducing the number of clock cycles per iteration and by achieving software pipelining. Typical transformations to improve the performance are: avoid function calls by explicitly inlining relevant functions, unroll loops with a small number of iterations, remove conditions, split loops with a large body into smaller ones, and simplify modulo operations, mainly in array index computations. This last transformation can be done with the help of the RACE tool [55] in ATOMIUM.

In our approach, to preserve these optimisations in later code refinements, and to derive several code versions with minimum code duplication, any optimised loop is encapsulated in a function, called a kernel, as sketched below.
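As an illustration only, assuming a line length and a filtering operation that are placeholders rather than the real cavity detector computation, the following C fragment shows the encapsulation style: each optimised loop lives in its own kernel function, so that later refinements (block-transfer insertion, new code versions) never touch the tuned loop body.

    #define N_PIXELS_PER_LINE 512     /* example line length, not the real M x N */

    /* kernel1: one optimised loop wrapped in its own function, manipulating a
     * complete pixel line at a time so the compiler can software-pipeline it. */
    static void kernel1(const unsigned char in_line[N_PIXELS_PER_LINE],
                        unsigned char out_line[N_PIXELS_PER_LINE])
    {
        out_line[0] = in_line[0];
        out_line[N_PIXELS_PER_LINE - 1] = in_line[N_PIXELS_PER_LINE - 1];
        for (int x = 1; x < N_PIXELS_PER_LINE - 1; x++)   /* no calls, no conditions */
            out_line[x] = (unsigned char)((in_line[x - 1] + 2 * in_line[x] +
                                           in_line[x + 1]) / 4);
    }

    /* The tuned cavity detector chains four such kernels per pixel line
     * (Fig. 4b); only the first is sketched here. */
    void process_line(const unsigned char in_line[N_PIXELS_PER_LINE],
                      unsigned char out_line[N_PIXELS_PER_LINE])
    {
        kernel1(in_line, out_line);
    }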

Cavity detector experiments: The internal arrays can be compacted, and for an image of 256 K pixels this memory compaction yields a reduction of 99.6% in the internal data size: three internal arrays can be compacted from ar[M][N] into r[4][N], whereas three other ones can be reduced to a scalar. For the starting code, the assembly code generated by the TI C62 compiler indicates no feasible software pipelining, and an average number of clock cycles per pixel of 1139. To improve the performance, four kernels are identified (Fig. 4b), where each kernel manipulates a complete pixel line at a time. Now, in contrast to the starting code, the assembly code generated by the TI C62 compiler indicates that software pipelining is possible inside each kernel. The average number of clock cycles per pixel is now 51, yielding a performance gain of 95.5%. For this application, the code size is negligible: 116 Kbytes after compilation with the TI C62 compiler.

QSDPCM experiments: The starting code requires 45 M cycles per QCIF frame. The tuned algorithm is illustrated in Fig. 6a, where each module is a loop manipulating two pixel blocks at each iteration (one from the current frame, and the other from the previous frame). The optimised code requires 6 M cycles per QCIF frame, yielding a performance gain of 86.6%. The code size is still negligible: 263 Kbytes.

5.2 Parallelisation exploration

Parallelising an application can be done both at the functional and the data level. At the functional level, the algorithm is partitioned into smaller tasks, and synchronisation requirements between them are identified to allow pipelined execution of these tasks. At the data level, for instance in video applications, images can be divided into blocks of rows. Any task parallelised at the data level deals with its own block of rows.

Experiments: Related to the functional-level parallelisation, the QSDPCM can be naturally partitioned into either three tasks (ME42, ME1 and QC) or two tasks (ME42, and ME1 merged with QC). To further alleviate the computation effort of ME1, the input frames can be divided into row blocks to parallelise ME1 at the data level. Up to five parallel ME1 tasks have been considered, beyond which no performance gain is reached any more because of very large task synchronisation and block transfer (BT) overhead. This is illustrated later. These QSDPCM parallelisations are illustrated in Fig. 6b.

Parallelising the cavity detector is not relevant enough and only one processor is considered.

5.3 Block transfer exploration

To optimise both performance and energy consumption in the memory subsystem, parts of the data arrays stored in the SPM are copied into the processor local memory, from where they are accessed multiple times [47]. These copy operations (also called BTs) are performed through function calls in the application code, first to issue a BT, and next to synchronise its completion with the processing. This allows BTs to be performed in parallel with processing and hence improves the application performance. This is illustrated in Fig. 7, where a BT into a copy cp_prev_frame is performed in parallel with a for loop processing; a sketch of this issue/synchronise pattern is also given below. This reduces the waiting time for the BT completion and, in this small example, reaches a performance gain of 16 cycles per iteration.

Hence, the application tuned code must still be further refined as follows: (a) select the data arrays to be stored in SPMs; (b) explore the size of the copies, needed to access these arrays and stored in the processor local memory; and (c) explore the places in the code where to insert the BT calls (to preserve previous performance optimisations, these BT calls may not be inserted inside kernels). This exploration can be done with the help of the extended MHLA tool [47] in ATOMIUM. Several efficient solutions, yielding different local memory usages and performances, exist for the copy sizes and the places in the code where to insert these BT calls.

Hence, considering all combinations of BT solutions in all tasks of any parallelised application gives rise to a huge number of different application code versions. A Pareto-based specification, merging all of them and allowing efficient loading of any task binary into the platform, is required. This specification is characterised in [48].

Fig. 7 Block transfers from scratchpad memories to processor local memory
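The sketch below illustrates the issue/synchronise block-transfer pattern of Fig. 7 in C. bt_issue() and bt_sync() are hypothetical names standing in for the platform's communication-assist (DMA-like) calls, and the copy size is an arbitrary example; only the double-buffered cp_prev_frame copy named in the figure is taken from the text. While iteration i is being processed, the copy needed by iteration i+1 is already in flight, which hides the BT waiting time.

    /* Hypothetical communication-assist primitives (not a real API). */
    extern void bt_issue(int id, void *dst, const void *src, unsigned bytes);
    extern void bt_sync(int id);            /* wait until BT 'id' has completed */

    #define BLOCK_BYTES 1024                             /* example copy size */
    static unsigned char cp_prev_frame[2][BLOCK_BYTES];  /* double buffer     */

    void process_frame(const unsigned char *prev_frame_spm, int n_blocks)
    {
        int buf = 0;
        bt_issue(0, cp_prev_frame[buf], prev_frame_spm, BLOCK_BYTES);
        for (int i = 0; i < n_blocks; i++) {
            bt_sync(i % 2);                  /* copy for this iteration is ready */
            if (i + 1 < n_blocks)            /* prefetch the next block          */
                bt_issue((i + 1) % 2, cp_prev_frame[buf ^ 1],
                         prev_frame_spm + (unsigned)(i + 1) * BLOCK_BYTES,
                         BLOCK_BYTES);
            /* ... kernel processing of cp_prev_frame[buf] runs here,
             *     overlapping with the BT issued just above ... */
            buf ^= 1;
        }
    }

The same source can trivially be turned into the sequential variant (issue immediately followed by sync, no double buffer), which is exactly the trade-off between local memory usage and BT waiting time explored in the code versions below.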


Cavity detector experiments: The compacted internal arrays are small enough to be stored in the processor local memory. However, the arrays Image_in and Image_out, storing the input and output images, respectively, must still be stored in the SPM. To access them, two copies are needed, to allow read (resp. write) BTs: Image_in_cp (resp. Image_out_cp).

Related to the copy sizes, two relevant possibilities emerge: (a) either the copy is able to store only one pixel line; then the BT must be completed during the current iteration, before the kernel functions start to manipulate it; (b) or the copy is able to store two pixel lines: one manipulated during the current iteration, and another used by the BT performed in parallel with this iteration, which must be ready for the next iteration only. Related to the BT calls, the following considerations must be taken into account: (a) Image_in_cp being read in kernel1() implies that the corresponding read BT must be synchronised before calling this kernel; (b) Image_out_cp being written in kernel4() implies that the corresponding write BT can be issued only after the execution of this kernel, and that it must be synchronised before the next call of this kernel. Apart from these constraints, some freedom is left where to place the BT calls, and this is explored.

Five code versions are retained, with different local memory usages (storing both the initial internal data arrays and the copies) and BT calls (Table 1). These BTs are performed either sequentially or in parallel with processing, yielding different performances.

Table 1: Block transfer exploration for the cavity detector
Code version              1       2       3       4       5
Loc. mem. usage (bytes)   12 800  17 920  15 360  15 360  15 360

QSDPCM experiments: Three arrays (storing the current image frame, the previous one, and some internal data required in ME42()) are too large and must be stored in the SPM. Several efficient BT solutions are explored. Table 2 reports, for the task ME1, the resulting processor local memory usage. Similar BT solutions are derived for the tasks QC and ME1_QC, whereas only one efficient BT solution is derived for ME42.

Table 2: Block transfer exploration for the ME1 task in QSDPCM
Code version              1     2     3     4
Loc. mem. usage (bytes)   1540  1796  1972  2228

5.4 Platform exploration

In our multi-dimensional design space, the dimensions related to the used platform resources are the memory usage, the number of used processors per processor type, the read/write communication bandwidth (BW) between the SPM and the processor local memory, the clocks and the processor supply voltage if it is allowed. The main platform parameters that can still be explored at this step are the processor clock period and the requested read/write communication BW. The number of used processors is fixed by the previous parallelisation exploration. The memory usage is fixed by the previous block transfer exploration.

We can observe that:

† The processor speed influences the BT waiting time as follows: (a) the more BTs in parallel with processing, the smaller the BT waiting time; (b) for any BT in parallel with processing, the slower the processor, the smaller the BT waiting time. Hence, the best performing code is the one with the most BTs in parallel with processing. But this has a price, which is the local memory usage, with larger data copies to transfer data in advance for future processing.

† Requesting more read/write BW also allows the performance to be increased, but only up to some threshold from which the maximum throughput of the application is achieved.

Cavity detector experiments: In Fig. 8, the processor speed is explored and its influence on both BT waiting and processing times is reported. This exploration [Note 1] is performed on three different application codes: Code 0, where both read and write BTs are performed sequentially with processing, which implies that the BT waiting time is independent of the processor speed; and Code 1 and Code 2 (Table 1), with larger data copies, but with BTs in parallel with processing. As mentioned earlier, the processing time is only determined by the processor clock period and the number of processing cycles of the application (i.e. 28.7 per pixel). Code 2 is the best performing code. However, compared with Code 1, it needs 28.6% more local memory to be 29.3% faster (for a processor clock period of 8 ns).

Fig. 8 Influence of code version and processor speed on performance

In Fig. 9, the requested read/write BW to the SPM is explored [Note 2] for Code 1, and the shown Pareto points indicate the threshold from which requesting still more BW does not make sense any more. Fig. 10 illustrates the code distribution [Note 3] in the Pareto set. Dependent on the requested read BW, the required performance of the application, and the available local memory space, any of the six codes can be best selected by the run-time manager.

QSDPCM experiments: In Fig. 11, the number of processors is explored [Note 4] for different read/write BW. The shown Pareto points indicate the threshold from which requesting still more BW or more processors does not make sense any more. As already mentioned, no further performance gain is reached beyond five parallel ME1 tasks, because of the very large task synchronisation and BT overhead.

Fig. 9 Bandwidth exploration
Fig. 10 Code distribution
Fig. 11 Processor exploration

Note 1: A read/write bandwidth to the SPM of 40 MB/s is requested.
Note 2: A processor clock period of 2 ns is considered.
Note 3: For a processor clock period of 2 ns and a requested write BW of 100 MB/s.
Note 4: A processor clock period of 2 ns is considered.
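As a small illustration of how the run-time manager can use this exploration data, the sketch below checks which of the five cavity detector code versions of Table 1 fit in the local memory still available on the selected processor; which feasible version is actually chosen then depends on the requested BW and on the performance/energy values stored in the Pareto set (cf. Fig. 10). The function name is hypothetical; only the memory figures come from Table 1.

    #define N_VERSIONS 5

    /* Local memory usage per cavity detector code version, from Table 1 (bytes). */
    static const unsigned loc_mem_bytes[N_VERSIONS] = {
        12800, 17920, 15360, 15360, 15360
    };

    /* Returns a bitmask of the code versions (1..5, bit 0 = version 1) that
     * fit in 'available_bytes' of processor local memory. */
    unsigned feasible_versions(unsigned available_bytes)
    {
        unsigned mask = 0;
        for (int v = 0; v < N_VERSIONS; v++)
            if (loc_mem_bytes[v] <= available_bytes)
                mask |= 1u << v;
        return mask;
    }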


6 Conclusion

In this study, we describe our design-time application and platform exploration that enables a customised run-time management for MP-SoCs. This exploration generates for each application a multi-dimensional Pareto set of optimal mappings. The optimal trade-offs offered by the design-time exploration to the run-time manager are illustrated on two real-life applications (image processing and video codec multimedia). After tuning the starting code, several code versions are derived, trading different performances for memory and processor usage, according to the available platform resources. Our future work includes the run-time support to allow Pareto point switches and the analysis of the resulting run-time overhead.

7 Acknowledgment

This work has partly been funded by the 4S project (IST 001908), financed by the European Commission in the context of Framework Programme 6.

8 References

1 Cumming, P.: 'The TI OMAP platform approach to SoC' (Kluwer Academic, 2003)
2 Wolf, W.: 'The future of multiprocessor systems-on-chips'. Proc. Design Automation Conference, 2004, pp. 681-685
3 Dielissen, J., Radulescu, A., Goossens, K., and Rijpkema, E.: 'Concepts and implementation of the Philips network-on-chip'. Proc. IP-based SoC Design, November 2003
4 Adriahantenaina, A., Charley, H., Greiner, A., Mortiez, L., and Zeferino, C.A.: 'SPIN: a scalable, packet switched, on-chip micro-network'. Proc. Conference on Design, Automation and Test in Europe, 2003, pp. 70-73
5 Benini, L., and Micheli, G.D.: 'Networks on chips: a new SoC paradigm', IEEE Comput., 2002, pp. 70-78
6 Ye, T., and Micheli, G.D.: 'Physical planning for on-chip multiprocessor networks and switch fabrics'. Proc. Application-Specific Systems, Architectures, and Processors, 2003
7 Bertozzi, D., Jalabert, A., Srinivasan, M., Tamhankar, R., Stergiou, S., Benini, L., and Micheli, G.D.: 'NoC synthesis flow for customized domain specific multiprocessor systems-on-chip', IEEE Trans. Parallel Distributed Syst., 2005, 16, (2), pp. 113-129
8 Murali, S., and Micheli, G.D.: 'Bandwidth-constrained mapping of cores onto NoC architectures'. Proc. Conference on Design, Automation and Test in Europe, Paris, France, February 2004
9 Tanenbaum, A.S.: 'Distributed operating systems' (Prentice-Hall, New Jersey, 1996)
10 Verma, M., Wehmeyer, L., and Marwedel, P.: 'Dynamic overlay of scratchpad memory for energy minimization'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, 2004, pp. 104-109
11 Mamagkakis, S., Atienza, D., Poucet, C., Catthoor, F., Soudris, D., and Mendias, J.: 'Custom design of multi-level dynamic memory management subsystem for embedded systems'. Proc. IEEE Workshop on Signal Processing Systems, October 2004, pp. 170-175
12 Poletti, F., Marchal, P., Atienza, D., Benini, L., Catthoor, F., and Mendias, J.: 'An integrated hardware/software approach for run-time scratchpad management'. Proc. Design Automation Conference, San Diego, CA, USA, June 2004, pp. 238-243
13 Yang, P., and Catthoor, F.: 'Dynamic mapping and ordering tasks of embedded real-time systems on multiprocessor platforms'. Proc. International Workshop on Software and Compilers for Embedded Systems, Springer, Lecture Notes in Computer Science, September 2004, vol. 3199, pp. 167-181
14 Ykman-Couvreur, C., Catthoor, F., Vounckx, J., Folens, A., and Louagie, F.: 'Energy-aware dynamic task scheduling applied to a real-time multimedia application on an Xscale board', J. Low Power Electron., 2005, 1, (3), pp. 226-237
15 Kogel, T., and Meyr, H.: 'Heterogeneous MP-SoC - the solution to energy-efficient signal processing'. Proc. Design Automation Conference, 2004, pp. 686-691
16 van der Wolf, P., de Kock, E., Henriksson, T., Kruitzer, W., and Essink, G.: 'Design and programming of embedded multiprocessors: an interface-centric approach'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Stockholm, Sweden, September 2004, pp. 206-217
17 Henriksson, T., Kang, J., and van der Wolf, P.: 'Implementation of dynamic streaming applications on heterogeneous multi-processor applications'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp. 57-62
18 Reyes, V., Bautista, T., Marrero, G., and Nunez, A.: 'A multicast inter-task communication protocol for embedded multiprocessor systems'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp. 267-272
19 Sarmento, A., Kriaa, L., Grasset, A., Youssef, M.-W., Bouchhima, A., Rousseau, F., Cesario, W., and Jerraya, A.: 'Service dependency graph, an efficient model for hardware/software interfaces modeling and generation for SoC design'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp. 261-266
20 Yoo, S., Youssef, M., Bouchhima, A., and Jerraya, A.: 'Multi-processor SoC design methodology using a concept of two-layer hardware-dependent software'. Proc. Conference on Design, Automation and Test in Europe, Paris, February 2004
21 Ahmed, J., and Chakrabarti, C.: 'A dynamic task scheduling algorithm for battery powered DVS systems'. Proc. International Symposium on Circuits and Systems, May 2004, vol. 2, pp. 813-816
22 Brucker, P.: 'Scheduling algorithms' (Springer-Verlag, 2001, 3rd edn.), ISBN 3-540-41510-6


23 Cho, Y., Yoo, S., Choi, K., Zergainoh, N.-E., and Jerraya, A.: 'Scheduler implementation in MP SoC design'. Proc. Asia South Pacific Design Automation Conference, Shanghai, China, January 2005
24 Benini, L., Bogliolo, A., and Micheli, G.D.: 'A survey of design techniques for system-level dynamic power management', IEEE Trans. VLSI Syst., June 2000, 8, (3), pp. 299-316
25 Andrei, A., Schmitz, M., Eles, P., Peng, Z., and Al-Hashimi, B.: 'Overhead-conscious voltage selection for dynamic and leakage energy reduction of time-constrained systems'. Proc. Conference on Design, Automation and Test in Europe, Paris, France, February 2004, pp. 518-523
26 Leung, L.-F., Tsui, C.-Y., and Ki, W.-H.: 'Minimizing energy consumption of hard real-time systems with simultaneous tasks scheduling and voltage assignment using statistical data'. Proc. Asia South Pacific Design Automation Conference, 2004, pp. 663-665
27 Schaumont, P., Lai, B.-C.C., Qin, W., and Verbauwhede, I.: 'Cooperative multithreading on embedded multiprocessor architectures enables energy-scalable design'. Proc. Design Automation Conference, June 2005, pp. 27-30
28 Yaldiz, S., Demir, A., Tasiran, S., Ienne, P., and Leblebici, Y.: 'Characterizing and exploiting task load variability and correlation for energy management in multi core systems'. Proc. Workshop on Embedded Systems for Real-Time Multimedia, New York, USA, September 2005, pp. 135-140
29 Smit, L., Smit, G., Hurink, J., Boersma, H., Paulusma, D., and Wolkotte, P.: 'Run-time mapping of applications to a heterogeneous reconfigurable tiled system on chip architecture'. Proc. International Symposium on System-on-Chip, Tampere, Finland, November 2005
30 Hansson, A., Goossens, K., and Radulescu, A.: 'A unified approach to constrained mapping and routing on Network-on-Chip architectures'. Proc. International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp. 75-80
31 Hu, J., and Marculescu, R.: 'Communication and task scheduling of application-specific networks-on-chip', IEE Proc. - Comput. Digital Tech., September 2005, 152, (5), pp. 643-651
32 Hofmeister, C.: 'Dynamic reconfiguration of distributed applications'. PhD thesis, Dept of Computer Science, University of Maryland, College Park, USA, 1993
33 Hofmeister, C., and Purtilo, J.: 'A framework for dynamic reconfiguration of distributed programs'. Proc. International Conference on Distributed Computing Systems, 1991, pp. 560-571
34 Hofmeister, C., and Purtilo, J.: 'Dynamic reconfiguration in distributed systems: adapting software modules for replacement'. Proc. International Conference on Configurable Distributed Systems, May 1996, pp. 62-69
35 Kramer, J., and Magee, J.: 'The evolving philosophers problem: dynamic change management', IEEE Trans. Software Eng., 1990, 16, (11), pp. 1293-1306
36 Mitchell, S., Naguib, H., Coulouris, G., and Kindberg, T.: 'Dynamically reconfiguring multimedia components: a model-based approach'. Proc. ACM SIGOPS European Workshop, Sintra, Portugal, September 1998, pp. 40-47
37 Webb, D., Wendelborn, A., and Varyssiere, J.: 'A study of computational reconfiguration in a process network'. Proc. Workshop on Integrated Data Environments Australia, Victor Harbor, Australia, February 2000, pp. 51-55
38 Steketee, C., Zhu, W., and Moseley, P.: 'Implementation of process migration in Amoeba'. Proc. International Conference on Distributed Computing Systems, 1994, pp. 194-201
39 Nieuwland, A., Kang, J., and Gangwal, O.P.: 'C-HEAP: a heterogeneous multi-processor architecture template and scalable and flexible protocol for the design of embedded signal processing systems', Design Automat. Embedded Syst., 2002, 7, (3), pp. 233-270
40 Goossens, K.: 'A protocol and memory manager for on-chip communication'. Proc. International Symposium on Circuits and Systems, Sydney, IEEE Circuits and Systems Society, May 2001, vol. 2, pp. 225-228
41 Rutten, M.J., Pol, E.J., Van Eijndhoven, J., Walters, K., and Essink, G.: 'Dynamic reconfiguration of streaming graphs on a heterogeneous multiprocessor architecture'. Proc. IS&T/SPIE Annual Symposium on Electronic Imaging: Multimedia Processing and Applications, San Jose, California, USA, January 2005, pp. 101-106
42 Kang, J., Henriksson, T., and van der Wolf, P.: 'An interface for the design and implementation of dynamic applications on multi-processor architectures'. Proc. Workshop on Embedded Systems for Real-Time Multimedia, New York, USA, September 2005, pp. 101-106
43 Nollet, V., Marescaux, T., Avasare, P., Mignolet, J.-Y., and Verkest, D.: 'Centralized run-time resource management in a network-on-chip containing reconfigurable hardware tiles'. Proc. Conference on Design, Automation and Test in Europe, Munich, Germany, March 2005, pp. 252-253
44 Pellizzoni, R., and Caccamo, M.: 'Adaptive allocation of software and hardware real-time tasks for FPGA-based embedded systems'. Proc. IEEE Real-Time and Embedded Technology and Applications Symposium, 2006
45 Gericota, M., Alves, G., Silva, M., and Ferreira, J.: 'On-line defragmentation for run-time partially reconfigurable FPGAs'. Proc. International Conference on Field Programmable Logic and Applications, Montpellier, France, September 2002
46 Givargis, T., Vahid, F., and Henkel, J.: 'System-level exploration for Pareto-optimal configurations in parameterized system-on-a-chip', IEEE Trans. VLSI Syst., 2002, 10, (4), pp. 416-422
47 Brockmeyer, E., Miranda, M., Corporaal, H., and Catthoor, F.: 'Layer assignment techniques for low energy in multi-layered memory organisations'. Proc. Conference on Design, Automation and Test in Europe, 2003, pp. 1070-1075
48 Ykman-Couvreur, C., Nollet, V., Marescaux, T., Brockmeyer, E., Catthoor, F., and Corporaal, H.: 'Pareto-based application specification for MP-SoC customized run-time management'. Proc. International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, Samos, Greece, July 2006, pp. 78-84
49 Ykman-Couvreur, C., Nollet, V., Catthoor, F., and Corporaal, H.: 'Fast multi-dimension multi-choice knapsack heuristic for MP-SoC run-time management'. Proc. International Symposium on System-on-Chip, Tampere, Finland, November 2006
50 Bister, M., Taeymans, Y., and Cornelis, J.: 'Automatic segmentation of cardiac MR images', Comput. Cardiol., 1989, pp. 215-218
51 Strobach, P.: 'QSDPCM - a new technique in scene adaptive coding'. Proc. European Signal Processing Conference, Grenoble, France, September 1988, pp. 1141-1144
52 Greef, E.D., Catthoor, F., and Man, H.D.: 'Optimization for embedded parallel multimedia applications', Parallel Comput., 1997, 23, pp. 1811-1837
53 Greef, E.D., Catthoor, F., and Man, H.D.: 'Program transformation strategies for memory size and power reduction of pseudo-regular multimedia subsystems mapped on multi-processor architectures', Trans. Circuits Syst. Video Technol., 1998, 8, (6), pp. 719-723
54 Atomium. http://www.imec.be/atomium
55 Miranda, M., Catthoor, F., Janssen, M., and Man, H.D.: 'ADOPT: efficient hardware address generation in distributed memory architectures'. Proc. International Symposium on System Synthesis, 1996, pp. 20-25

