Lecture Notes in Computer Science 6448 Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

René van Leuken, Gilles Sicard (Eds.)
Integrated Circuit and System Design
Power and Timing Modeling, Optimization and Simulation
20th International Workshop, PATMOS 2010 Grenoble, France, September 7-10, 2010 Revised Selected Papers
Volume Editors
René van Leuken Delft University of Technology 2628 CD Delft, The Netherlands E-mail: [email protected]
Gilles Sicard TIMA Laboratory 38031 Grenoble, France E-mail: [email protected]
Library of Congress Control Number: 2010940964
CR Subject Classification (1998): C.4, I.6, D.2, C.2, F.3, D.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743 ISBN-10 3-642-17751-4 Springer Berlin Heidelberg New York ISBN-13 978-3-642-17751-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2011 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
Welcome to the proceedings of the 20th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2010. Over the years, PATMOS has evolved into an important European event, where researchers from both industry and academia discuss and investigate the emerging challenges in future and contemporary applications, design methodologies, and tools required for the development of the upcoming generations of integrated circuits and systems. PATMOS 2010 was organized by the TIMA Laboratory, France, with the sponsorship of Joseph Fourier University, CEA LETI, Minalogic, CNRS, Grenoble Institute of Technology, and the technical co-sponsorship of the IEEE France Section. Further information about the workshop is available at: http://patmos2010.imag.fr. The technical program of PATMOS 2010 contained state-of-the-art technical contributions, three invited keynotes, a special session organized by the "Beyond DREAMS (Catrene 2A717)" project on "High-Level Modeling of Power-Aware Heterogeneous Designs in SystemC-AMS," and a special session organized by Minalogic presenting the results of four projects. The technical program focused on timing, performance, and power consumption, as well as architectural aspects, with particular emphasis on modeling, design, characterization, analysis, and optimization in the nanometer era. The Technical Program Committee, with the assistance of additional expert reviewers, selected the 24 papers presented at PATMOS. The papers were organized into six oral sessions. As is customary for the PATMOS workshops, full papers were required for review, and a minimum of three reviews was received per manuscript. Beyond the presentations of the papers, the PATMOS technical program was enriched by a series of talks offered by world-class experts on important emerging research issues of industrial relevance.
Kiyoo Itoh, Fellow of the Central Research Laboratory, Hitachi, Ltd., spoke about "Variability-Conscious Circuit Designs for Low-Voltage Memory-Rich Nano-Scale CMOS LSIs"; Marc Belleville of CEA, LETI, MINATEC spoke about "3D Integration for Digital and Imagers Circuits: Opportunities and Challenges"; and Sébastien Marchal of STMicroelectronics spoke about "Signing off Industrial Designs on Evolving Technologies." We would like to thank our colleagues who voluntarily worked to make this edition of PATMOS possible: the expert reviewers; the members of the Technical Program and Steering Committees; the invited speakers; and, last but not least, the local personnel who offered their skill, time, and extensive knowledge to make PATMOS 2010 a memorable event.
September 2010 René van Leuken Gilles Sicard
Organization
Organizing Committee
René van Leuken, TU Delft, The Netherlands (Program Chair)
Gilles Sicard, TIMA Laboratory, France (General Chair)
Anne-Laure Fourneret-Itie, TIMA Laboratory, France
Laurent Fesquet, TIMA Laboratory, France
Katell Morin-Allory, TIMA Laboratory, France
Florent Ouchet, TIMA Laboratory, France
Julie Correard, TIMA Laboratory, France
Technical Program Committee
Atila Alvandpour, Linköping University, Sweden
David Atienza, EPFL, Switzerland
Nadine Azemard, University of Montpellier, France
Peter Beerel, USC, USA
Davide Bertozzi, University of Ferrara, Italy
Naehyuck Chang, Seoul University, Korea
Jorge Juan Chico, University of Seville, Spain
Joan Figueras, University of Catalonia, Spain
Eby Friedman, University of Rochester, USA
Costas Goutis, University of Patras, Greece
Eckhard Grass, IHP, Germany
José Luís Güntzel, University of Santa Catarina, Brazil
Oscar Gustafsson, Linköping University, Sweden
Shiyan Hu, Michigan Technological University, USA
Nathalie Julien, University of Bretagne-Sud, France
Domenik Helms, OFFIS Research Institute, Germany
René van Leuken, TU Delft, The Netherlands
Philippe Maurine, University of Montpellier, France
Jose Monteiro, INESC-ID / IST, Portugal
Vasily Moshnyaga, University of Fukuoka, Japan
Tudor Murgan, Infineon, Germany
Wolfgang Nebel, University of Oldenburg, Germany
Dimitris Nikolos, University of Patras, Greece
Antonio Nunez, University of Las Palmas, Spain
Vojin Oklobdzija, University of Texas at Dallas, USA
Vassilis Paliouras, University of Patras, Greece
Davide Pandini, STMicroelectronics, Italy
Antonis Papanikolaou, NTUA, Greece
Christian Piguet, CSEM, Switzerland
Massimo Poncino, Politecnico di Torino, Italy
Ricardo Reis, University of Porto Alegre, Brazil
Donatella Sciuto, Politecnico di Milano, Italy
Gilles Sicard, TIMA Laboratory, France
Dimitrios Soudris, NTUA, Athens, Greece
Zuochang Ye, Tsinghua University, Beijing, China
Robin Wilson, STMicroelectronics, France
Steering Committee
Antonio J. Acosta, University of Seville, Spain
Nadine Azemard, University of Montpellier, France
Joan Figueras, University of Catalonia, Spain
Reiner Hartenstein, TU Kaiserslautern, Germany
Jorge Juan-Chico, University of Seville, Spain
Enrico Macii, Politecnico di Torino, Italy
Philippe Maurine, University of Montpellier, France
Jose Monteiro, INESC-ID / IST, Portugal
Wolfgang Nebel, OFFIS, Germany
Vassilis Paliouras, University of Patras, Greece
Christian Piguet, CSEM, Switzerland
Dimitrios Soudris, NTUA, Athens, Greece
René van Leuken, TU Delft, The Netherlands
Diederik Verkest, IMEC, Belgium
Roberto Zafalon, STMicroelectronics, Italy
Executive Steering Committee
Vassilis Paliouras, University of Patras, Greece
Nadine Azemard, University of Montpellier, France
Jose Monteiro, INESC-ID / IST, Portugal

Table of Contents
Session 1: Design Flows
A Power-Aware Online Scheduling Algorithm for Streaming Applications in Embedded MPSoC ...... 1 Tanguy Sassolas, Nicolas Ventroux, Nassima Boudouani, and Guillaume Blanc
An Automated Framework for Power-Critical Code Region Detection and Power Peak Optimization of Embedded Software ...... 11 Christian Bachmann, Andreas Genser, Christian Steger, Reinhold Weiß, and Josef Haid
System Level Power Estimation of System-on-Chip Interconnects in Consideration of Transition Activity and Crosstalk ...... 21 Martin Gag, Tim Wegner, and Dirk Timmermann
Residue Arithmetic for Designing Low-Power Multiply-Add Units ...... 31 Ioannis Kouretas and Vassilis Paliouras
Session 2: Circuit Techniques 1
An On-chip Flip-Flop Characterization Circuit ...... 41 Abhishek Jain, Andrea Veggetti, Dennis Crippa, and Pierluigi Rolandi
A Low-Voltage Log-Domain Integrator Using MOSFET in Weak Inversion ...... 51 Lida Ramezani
Physical Design Aware Comparison of Flip-Flops for High-Speed Energy-Efficient VLSI Circuits ...... 62 Massimo Alioto, Elio Consoli, and Gaetano Palumbo
A Temperature-Aware Time-Dependent Dielectric Breakdown Analysis Framework ...... 73 Dimitris Bekiaris, Antonis Papanikolaou, Christos Papameletis, Dimitrios Soudris, George Economakos, and Kiamal Pekmestzi
Session 3: Low Power Circuits
An Efficient Low Power Multiple-Value Look-Up Table Targeting Quaternary FPGAs ...... 84 Cristiano Lazzari, Jorge Fernandes, Paulo Flores, and José Monteiro
On Line Power Optimization of Data Flow Multi-core Architecture Based on Vdd-Hopping for Local DVFS ...... 94 Pascal Vivet, Edith Beigne, Hugo Lebreton, and Nacer-Eddine Zergainoh
Self-Timed SRAM for Energy Harvesting Systems ...... 105 Abdullah Baz, Delong Shang, Fei Xia, and Alex Yakovlev
L1 Data Cache Power Reduction Using a Forwarding Predictor ...... 116 P. Carazo, R. Apolloni, F. Castro, D. Chaver, L. Pinuel, and F. Tirado
Session 4: Self-Timed Circuits
Statistical Leakage Power Optimization of Asynchronous Circuits Considering Process Variations ...... 126 Mohsen Raji, Alireza Tajary, Behnam Ghavami, Hossein Pedram, and Hamid R. Zarandi
Optimizing and Comparing CMOS Implementations of the C-Element in 65nm Technology: Self-Timed Ring Case ...... 137 Oussama Elissati, Eslam Yahya, Sébastien Rieubon, and Laurent Fesquet
Hermes-A – An Asynchronous NoC Router with Distributed Routing ...... 150 Julian Pontes, Matheus Moreira, Fernando Moraes, and Ney Calazans
Practical and Theoretical Considerations on Low-Power Probability-Codes for Networks-on-Chip ...... 160 Alberto Garcia-Ortiz and Leandro S. Indrusiak
Session 5: Process Variation
Logic Architecture and VDD Selection for Reducing the Impact of Intra-die Random VT Variations on Timing ...... 170 Bahman Kheradmand-Boroujeni, Christian Piguet, and Yusuf Leblebici
Impact of Process Variations on Pulsed Flip-Flops: Yield Improving Circuit-Level Techniques and Comparative Analysis ...... 180 Marco Lanuzza, Raffaele De Rose, Fabio Frustaci, Stefania Perri, and Pasquale Corsonello
Transistor-Level Gate Modeling for Nano CMOS Circuit Verification Considering Statistical Process Variations ...... 190 Qin Tang, Amir Zjajo, Michel Berkelaar, and Nick van der Meijs
White-Box Current Source Modeling Including Parameter Variation and Its Application in Timing Simulation ...... 200 Christoph Knoth, Irina Eichwald, Petra Nordholz, and Ulf Schlichtmann
Session 6: Circuit Techniques 2
Controlled-Precision Pure-Digital Square-Wave Frequency Synthesizer ...... 211 Abdelkrim Kamel Oudjida, Ahmed Liacha, Mohamed Lamine Berrandjia, and Rachid Tiar
An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis ...... 218 Oliver Schrape, Frank Winkler, Steffen Zeidler, Markus Petri, Eckhard Grass, and Ulrich Jagdhold
Clock Network Synthesis with Concurrent Gate Insertion ...... 228 Jingwei Lu, Wing-Kai Chow, and Chiu-Wing Sham
Modeling Time Domain Magnetic Emissions of ICs ...... 238 Victor Lomné, Philippe Maurine, Lionel Torres, Thomas Ordas, Mathieu Lisart, and Jérôme Toublanc
Special Session 1: High-Level Modeling of Power-Aware Heterogeneous Designs in SystemC-AMS (Abstracts)
Power Profiling of Embedded Analog/Mixed-Signal Systems ...... 250 Jan Haase and Christoph Grimm
Open-People: Open Power and Energy Optimization PLatform and Estimator ...... 251 Daniel Chillet
Early Power Estimation in Heterogeneous Designs Using SoCLib and SystemC-AMS ...... 252 François Pêcheux, Khouloud Zine El Abidine, and Alain Greiner
Special Session 2: Minalogic (Abstracts)
ASTEC: Asynchronous Technology for Low Power and Secured Embedded Systems ...... 253 Prof. Marc Renaudin
OPENTLM and SOCKET: Creating an Open EcoSystem for Virtual Prototyping of Complex SOCs ...... 254 Laurent Maillet-Contoz
Keynotes (Abstracts)
Variability-Conscious Circuit Designs for Low-Voltage Memory-Rich Nano-Scale CMOS LSIs ...... 255 Kiyoo Itoh
3D Integration for Digital and Imagers Circuits: Opportunities and Challenges ...... 256 Marc Belleville
Signing off Industrial Designs on Evolving Technologies ...... 257 Sébastien Marchal
Author Index ...... 259
A Power-Aware Online Scheduling Algorithm for Streaming Applications in Embedded MPSoC
Tanguy Sassolas, Nicolas Ventroux, Nassima Boudouani, and Guillaume Blanc
CEA, LIST, Embedded Computing Laboratory, 91191 Gif-sur-Yvette CEDEX, France [email protected]
Abstract. As application complexity grows, embedded systems move to multiprocessor architectures to cope with the computation needs. The issue for multiprocessor architectures is to optimize processing resource usage and power consumption to reach a higher energy efficiency. These optimizations are handled by scheduling techniques. To tackle this issue we propose a global online scheduling algorithm for streaming applications. It takes into account data dependencies between pipeline tasks to optimize processor usage and reduce power consumption through the use of DPM and DVFS modes. An implementation of the algorithm on a virtual platform, executing a WCDMA application, demonstrates up to 45% power consumption gain while guaranteeing regular data throughput.
Index Terms: scheduling, low-power, multiprocessor, streaming applications.
1 Introduction
As embedded applications become more complex, future embedded architectures will have to provide higher computing performance, while respecting strong surface and consumption constraints. Embedded devices will not only execute more computing-intensive applications but also cross-domain ones, including telecom and video processing applications. To cope with these demands, an emerging trend in embedded system design lies in the conception of MultiProcessor Systems-on-Chip (MPSoC). These new architectures, with a high density of processing elements, have a strong energy dissipation. This dissipation must be taken into account to match an embedded-compliant power budget and to limit aging phenomena. To handle these thermal and energy issues, MPSoC designers integrate DVFS and DPM capabilities in their platforms. To leverage MPSoC processing capabilities, applications need to be highly parallelized. A simple way to increase application parallelism and data throughput is to pipeline sequential applications into streaming ones. This applies to the WCDMA application, whose parallelism can be drastically increased. Then pipeline stages must be efficiently allocated to the processing resources while
taking into account data dependencies between them. As applications become more prone to execution time variation, online control solutions are needed to dynamically schedule tasks and increase processor load. These variations can stem from differences in input data for data processing applications, or from the application structure itself. For instance, the WCDMA application processes a pilot frame differently from a user frame. Only a global scheduler with a complete view of the computation resource and task states can perform an optimal scheduling. The choice of global scheduling pushes forward the use of a central control solution. In addition, an online central control solution must react quickly to platform events. Therefore, online scheduling must remain simple and must find a balance between accuracy and execution speed. In this article, we propose an online power-aware scheduling algorithm that matches these conditions. This algorithm focuses on the scheduling of streaming applications. Our scheduling algorithm also tackles power consumption issues through an efficient use of the Dynamic Voltage and Frequency Scaling (DVFS) and Dynamic Power Management (DPM) modes of the processing resources. This paper is organized as follows: Section 2 will study existing solutions in the field of power-aware streaming application scheduling. Then, Section 3 will describe the proposed power-aware scheduling algorithm. Section 4 will detail implementation issues, focusing on the simulation framework and the targeted MPSoC platform. Results will be presented in Section 5, where the impact of our scheduling algorithm in terms of Quality of Service (QoS) and power consumption gain will be evaluated. Finally, Section 6 will discuss this new streaming application scheduling algorithm's capabilities and its future improvements.
2 Related Work
We focus our study on power-aware scheduling algorithms that rely on DVFS and DPM techniques [1]. First of all, we will briefly present the DPM and DVFS techniques and their impact on energy consumption. Then we will present a survey of previous works in the field of offline power-aware scheduling techniques for streaming processing. Finally, we will expose online low-power scheduling techniques for dependent tasks. The dissipated power in a CMOS design can be divided into two major sources: the dynamic power consumption and the static one. The dynamic consumption part is mainly due to transistor state switching, and it can be drastically reduced by lowering the supply voltage. As the transistor delay is a function of the supply voltage, lowering the supply voltage imposes an adapted frequency reduction. This technique is called DVFS. The static consumption is due to various current leakages in the transistor. The DVFS technique has some impact on the static power consumption thanks to the supply voltage reduction. Nonetheless, this is not sufficient to drastically reduce static power consumption. To cut down static power consumption, the only viable solution consists in switching off unused parts of a circuit. This technique is called DPM. Contrary to the DVFS technique, the resource is made unavailable. The main drawback of these two techniques lies in the timing and consumption mode switching penalties. If the timing penalties for DVFS are rather constrained, it is not the same for DPM, where the wake-up time can reach a hundred milliseconds (136 ms for the PXA270 [2]). Therefore, for a processor implementing both techniques, the issue is to find when reducing the voltage and frequency couple is more energy efficient than running at full speed and then switching off the processor. This matter is summarized in Fig. 1.
For a given technological process, the issue is thus to evaluate the duration of future inactivity periods of the resource. Having introduced the DVFS and DPM techniques and the optimization problem they imply, we will now present offline low-power scheduling techniques for streaming applications.
Fig. 1. DPM (left) and DVFS (right) technique timing issues
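The trade-off illustrated in Fig. 1 can be captured by a short break-even computation. The sketch below is illustrative only (the function name and all numbers are our assumptions, not values from the paper or the PXA270 datasheet): it returns the minimum idle duration beyond which entering the DPM sleep state saves energy over staying idle.

```python
def dpm_breakeven_time(p_idle, p_sleep, e_switch, t_switch):
    """Return the shortest idle period for which the DPM sleep state wins.

    Staying idle for a duration T costs        p_idle * T.
    Sleeping costs the mode-switch penalty plus the (lower) sleep power
    over the remainder of the period:          e_switch + p_sleep * (T - t_switch).
    Equating the two gives the break-even duration solved for below.
    """
    t_be = (e_switch - p_sleep * t_switch) / (p_idle - p_sleep)
    # An idle period shorter than the transition itself can never be slept through.
    return max(t_be, t_switch)


# Illustrative numbers only (arbitrary units): idle power 1.0, sleep power 0.1,
# 5.0 energy units spent entering and leaving sleep, 2.0 time units of latency.
threshold = dpm_breakeven_time(1.0, 0.1, 5.0, 2.0)
```

A scheduler would compare its estimate of the upcoming inactivity period against this threshold before choosing DPM over a slower DVFS mode.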
Given the fact that scheduling in a multiprocessor environment is an NP-complete problem [3], adding power consumption optimization makes the issue of power-aware scheduling for multiprocessors even harder to solve. A streaming application can be seen as a set of tasks linked by their data dependencies. Thus, scheduling dependent tasks allows to schedule streaming applications. Many offline solutions have been proposed to solve this optimality issue, assuming task dependencies and their execution lengths are available. They mainly vary in the way they describe the problem, changing which parameters have to be taken into account, and in the computing optimization method used to solve the problem, like in [4]. To the authors' knowledge, no previous work has been done to find an offline low-power multiprocessor scheduling dedicated to streaming applications. Nonetheless, an interesting line of work has been developed with the same scope but for a monoprocessor environment. In [5] the authors study power optimization by using the DVFS technique on a streaming application described as a directed acyclic graph with a constant output rate. Their solution allows finding the lowest-consumption scheduling given a buffer size, or finding the buffer size given a power budget. A similar approach is taken in [6] with DPM utilization. To match more realistic applications, they describe the production rate as a random variable following a given probability rule. Nonetheless, variations in the effective execution time limit the performance of offline solutions. To handle this dynamism, online low-power solutions have been proposed for streaming applications.
Many online solutions have been designed for the case of independent tasks [7,8], but they cannot apply to streaming applications. Online schedulings that handle task dependency issues are uncommon. Interesting solutions for dependent task scheduling have been proposed by [9,10]. Nonetheless, these solutions rely on a partitioning of resources. Partitioning solutions are necessarily sub-optimal as they only handle resources separately. A global scheduling can potentially reach a better resource usage. We remind, for the reader's knowledge, a few online power management techniques used for monoprocessor architectures in the case of streaming applications described with a Directed Acyclic Graph (DAG). In [11] the authors take into account potentially blocking communication between tasks, to always run the data producer at full speed in that case but lower the energy consumption otherwise. [12] presents another example of inter-task communication buffer size optimization, this time with an online scheduler handling the slack time accumulated with buffer use. None of the strategies listed above take into account the online scheduling of streaming applications that allow a pipelined execution and potential output rate improvements in an MPSoC environment.
3 Power-Aware Streaming Application Scheduling
We believe that a more power-efficient scheduling for dynamic streaming applications can be found by the use of an online global scheduling. In this section, we will first remind the application description used by our algorithm. Then we will explain the grounds of our algorithm, before presenting it in detail. Our scheduling algorithm has been written to handle streaming applications described in a specific way. An application is a set of tasks with consumer/producer relationships. Data is transferred from a producer task to a consumer task through a circular buffer. Only one task can write to a buffer, while it can be read by multiple consumer tasks. This creates a divergence in the data flow. A consumer task can also read multiple input buffers, creating a convergence in the data flow. This allows the description of parallelism in the processing flow of a given data. Given the previously described application model, one can make a few observations. A streaming application throughput is constrained by the duration of its slowest stage. As a result, other pipeline stages can be slowed down to meet the same output rate as the slowest stage. This can be performed by using a slower DVFS mode for the resources with a too high output rate. Besides, tasks that are further in the pipeline stream than the slowest task are bound to be blocked waiting for data. These tasks should be preempted if other tasks can execute instead, or the resource should be shut down if not. This implies the use of DPM functionalities. Given these observations, our algorithm will use DVFS to balance the pipeline stage lengths and DPM to shut down unused resources. Our objective is to maintain the same data throughput as if the tasks were executing at full speed, while making substantial energy savings. To be able to balance an application pipeline, we need additional information on the dynamic output rate of a task.
Thus we introduce monitors on every communication buffer. For every buffer we specify how many datasets it can contain. We also specify two thresholds. When the higher threshold is reached, we assume that the producer is executing too fast. When the lower threshold is reached, we assume that the producer is not executing fast enough. A specific event is sent to the scheduler when a threshold is crossed. It contains the writing task identifier. An event is also sent when a task is blocked reading an empty buffer, as well as when a task is blocked writing a full buffer. The buffer monitors are summarized in Fig. 2. One objective of balancing pipeline stage lengths is to prevent buffers from getting full, which would block the producer, and to never reach an empty buffer, which would block the consumer and could result in an increase of the data processing length.
Fig. 2. Summary of buffer monitors and scheduling implications
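As a concrete illustration of these monitors, the sketch below models a bounded buffer that emits the threshold and blocking events described above. Class, method, and event names are our own hypothetical choices, not SESAM's API:

```python
class BufferMonitor:
    """Sketch of a per-buffer monitor: capacity, low, and high are dataset
    counts; events mimic those sent to the scheduler on threshold crossings
    and on blocked reads/writes."""

    def __init__(self, capacity, low, high):
        assert 0 <= low <= high <= capacity
        self.capacity, self.low, self.high = capacity, low, high
        self.count = 0       # datasets currently stored
        self.events = []     # events that would be sent to the scheduler

    def produce(self):
        if self.count == self.capacity:
            self.events.append("producer_blocked")   # writer stalls on a full buffer
            return False
        self.count += 1
        if self.count == self.high:
            self.events.append("high_threshold")     # producer executing too fast
        return True

    def consume(self):
        if self.count == 0:
            self.events.append("consumer_blocked")   # reader stalls on an empty buffer
            return False
        self.count -= 1
        if self.count == self.low:
            self.events.append("low_threshold")      # producer not executing fast enough
        return True
```

In the actual platform these events trigger the scheduler, which reacts by preempting blocked tasks or adjusting DVFS/DPM modes rather than merely recording a string.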
To keep our scheduling algorithm as simple as possible, the task priorities are made of a static and a dynamic part. We will list the different priority parts by level of importance. First we check the blocked task status, as we do not want to give the priority to a blocked task. Then the application priority is taken into account. After that, we study the pipeline position priority. Every task is given a priority depending on its position in the streaming pipeline. This allows giving the priority to tasks handling older datasets, i.e., the ones that are deeper in the pipeline. Finally, for tasks that have the same pipeline position priority, we give the priority to the task with the emptier buffer. The complete scheduling loop is described in Algorithm 1.
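The four-level ordering above can be expressed as a composite sort key. The following is a hypothetical sketch (the field names are ours, not the paper's data structures):

```python
def priority_key(task):
    """Composite key for the priority levels listed above, most significant
    first. Python sorts ascending, so each field is oriented so that a more
    urgent task compares smaller."""
    return (
        task["blocked"],         # runnable (False) sorts before blocked (True)
        -task["app_priority"],   # higher application priority first
        -task["pipeline_pos"],   # deeper pipeline stage (older dataset) first
        -task["buffer_bit"],     # tie-break: low-threshold (starved) buffer first
    )

def sort_tasks_by_priority(tasks):
    # Sketch counterpart of sort_task_by_priority in Algorithm 1.
    return sorted(tasks, key=priority_key)
```

A single tuple key keeps the comparison cheap, which matters for an online scheduler that must react quickly to platform events.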
4 Implementation
To study and validate our algorithm, we implemented it on a virtual MPSoC. In this section we will first present the SESAM simulation framework. Then, we will describe the specificities of the simulated MPSoC. Finally, we will shortly present the WCDMA application used for our performance analysis. SESAM [13] is a tool that has been specifically built up to ease the design of asymmetric multiprocessor architectures. This framework is described with the SystemC description language, and allows MPSoC exploration at the TLM level with fast and cycle-accurate simulation. Besides, SESAM uses approximate-timed TLM with explicit time to provide a fast and accurate simulation of complex NoC communications [14]. It performs simulations with an accuracy of 90%
Algorithm 1. The Power-Aware Streaming Application Scheduling Loop

procedure scheduling(task_to_schedule[nb_tasks], status_proc[nb_proc])
    ♦ First we take into account buffer events
    for all tasks to schedule do
        if task is waiting for data then
            remove task from task_to_schedule
        else if task output buffer reached Higher Threshold then
            reset task's buffer priority bit
        else if task output buffer reached Lower Threshold then
            set task's buffer priority bit
        end if
    end for
    ♦ Then we order the tasks by priority
    ordered_tasks[nb_proc] ← sort_task_by_priority(task_to_schedule)
    ♦ We handle tasks already in execution to limit preemption/migration
    for all task already in execution in ordered_tasks do
        remove task from ordered_tasks
        remove proc executing task from free_proc
    end for
    ♦ We allocate tasks not yet in execution on any processor
    for all task left in ordered_tasks do
        execute task on free_proc
    end for
    ♦ Finally we handle the consumption
    for all proc do
        if proc is free then
            proc_mode ← idle_mode
        else if task on proc reached lower threshold then
            proc_mode ← turbo_mode
        else if task on proc reached higher threshold then
            proc_mode ← half_mode
        end if
    end for
end procedure

compared to fully cycle-accurate models. In addition, the programming model of SESAM is specifically adapted to dynamic applications and global scheduling methods. It is based on the explicit separation of the control and the computation parts. The processing elements of the SESAM simulator are functional Instruction Set Simulators (ISS) generated by the ArchC tool. Thus, we extended the ArchC ISS to integrate DVFS and DPM models into the SESAM environment. To avoid multiple context switches and accelerate simulation, every ArchC ISS executes multiple instructions at a time, then waits for the time it should have spent executing them.
For every DVFS mode, we calculate the smallest couple (a, b) so that a/b equals the DVFS mode slowing factor. Then, we multiply the number of instructions to be executed by a and the time to wait for these instructions by b.
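The smallest such (a, b) couple is simply the slowing factor reduced to lowest terms. A minimal sketch, assuming the factor is supplied as a ratio of the mode frequency to the full-speed frequency (function names and numbers are ours):

```python
from math import gcd

def dvfs_couple(freq_mode, freq_full):
    """Smallest (a, b) with a/b equal to the slowing factor
    freq_mode/freq_full, obtained by dividing out the GCD.
    E.g. a half-speed mode gives (1, 2)."""
    g = gcd(freq_mode, freq_full)
    return freq_mode // g, freq_full // g

def paced_quantum(n_instructions, freq_mode, freq_full, t_full):
    """Instructions executed per simulation quantum and the wall-clock wait:
    the instruction count is multiplied by a and the wait time by b, as the
    ISS extension described above does."""
    a, b = dvfs_couple(freq_mode, freq_full)
    return n_instructions * a, t_full * b
```

Keeping the couple in lowest terms lets the ISS pace itself with exact integer arithmetic instead of accumulating floating-point rounding error across billions of instructions.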
We also calculate the energy spent during the execution of a set of instructions and keep the total energy consumption for each ISS. A DVFS mode switch is modelled as an interruption for the ISS. When it occurs, the ISS computes the time and energy spent in its previous mode. Then, it waits for the adequate switching latency, takes into account its switching energy penalty, and finally resumes its execution with the (a, b) couple of the new DVFS mode. So as to model realistic processors, we used the PXA270 Power State Machine (PSM) values [2]. We chose to use only two DVFS modes, Turbo and Half-Turbo, and one DPM mode, Deep Idle, as they have acceptable switching latencies compared to our task execution times. To perform a realistic analysis of our scheduling algorithm, we modelled an asymmetric MPSoC platform with the SESAM simulator. This platform is built of a set of Processing Elements (PE), each made of a processor equipped with a TLB, a 1 KB instruction cache and a 1 KB data one. They are connected to a set of shared 2 ns-latency L2 memories through a 2 ns-latency multibus. Communications between tasks are made possible thanks to HAL functions. Data coherency is guaranteed by a Memory Management Unit (MMU). The buffers used for our algorithm are modelled using a specific HAL, and the buffer thresholds are handled by the MMU. Preemption and migration of tasks are possible, and their cost is reduced thanks to the shared memory and the virtualization of the memory space enabled by the use of TLBs [13]. The central controller is made of a processor with its own caches and memory. It is connected to the PEs and the MMU through another timed multibus. Its specific HAL enables it to send configuration, execution, preemption or consumption mode switch orders. It can also be interrupted by any PE to be informed of a task execution end.
The MMU also interrupts the controller whenever a task is blocked (or no longer blocked) waiting for input data or output space, as well as when a buffer threshold is crossed. We did not fix the number of PEs, so as to study how our scheduling algorithm copes with different processor loads. To evaluate our algorithm's impact on a streaming application, we used a well-known telecommunication application: a WCDMA encoder/decoder [15]. The application was pipelined and implemented on the simulated target MPSoC. The WCDMA application integrates an encoder followed by a decoder and is consequently built of 13 tasks. This allows having more tasks than resources on the SCMP platform, to stress the potential scheduling anomalies. This application is characterized by an unbalanced pipeline whose slowest tasks are the FIR filters. In addition, dynamism is found in the task execution length, as pilot frames get processed instead of actual data.
5 Results
To study the impact of our scheduling algorithm we compare it to two simpler versions of the algorithm. The first version does not handle power issues: it simply schedules tasks relying on pipeline stage position and blocked states, and all processors are kept in Turbo mode. It is referred to as the no energy handling scheduling. The second version is called DPM-only scheduling. This corresponds to a naive power-aware approach: unused resources and resources executing blocked tasks are put into Deep Idle mode. Finally, our proposed algorithm is referred to as DPM + DVFS scheduling.

8 T. Sassolas et al.

Fig. 3. Figures (a), (b), (c) and (d) were obtained with the same WCDMA application sending 256 frames. The communication buffers were 8 frames long and had a higher threshold identical to the lower one, set to 2 frames. (a) Total execution time of the WCDMA application as a function of the number of processing resources and the scheduling algorithm used; execution time overhead of our solution compared to the no energy handling algorithm. (b) Total processor effective occupancy and energy saving as a function of the number of processing resources and the scheduling algorithm used. (c) Average time spent in Deep Idle mode compared to the time spent unused or waiting for data, for a processor using our proposed algorithm. (d) Comparison of the average time a processor spends waiting for data under the no power saving algorithm and under our solution (DPM+DVFS): influence of Half-Turbo mode usage on blocking states.

As shown in Fig. 3(a), the total execution time of the WCDMA application is not affected by our scheduling algorithm, no matter how many processing resources there are: the variation in execution time always stays below 1.2%. In addition, our algorithm preserves the good acceleration that streaming applications obtain from additional processing resources. While we maintained the execution time of the scheduling without energy awareness, Fig. 3(b) shows that substantial energy savings were made.

A Power-Aware Online Scheduling Algorithm for Streaming Applications 9
As soon as processor effective occupancy drops, the loss is directly compensated by our power saving method. With 13 processors we reduced the power consumption by 45%. In addition, our method obtains better results than the DPM-only scheduling, which only reaches 37% energy saving in that case.

Fig. 3(c) illustrates how our scheduling algorithm uses the DPM mode in a real application case. The figure shows that when processors spend little time waiting for data or in the unused state (below 17%), the Deep Idle mode is seldom used. When the wasted time increases, the DPM usage curve follows the unused-or-blocked processor curve as planned. In fact, when the number of processing elements is small, there is often another task ready to be executed immediately. For low PE numbers the wasted time corresponds to the control overhead: the controller lacks the reactivity to reach higher computing performance or power saving.

Finally, Fig. 3(d) studies the impact of DVFS mode usage on the application execution. We compare the execution of our algorithm to the no energy handling scheduling. The analysis shows that when DVFS modes are used they drastically reduce the amount of time spent in blocking states (42% reduction for 13 processors). Thus, our algorithm succeeds in balancing the streaming pipeline stage execution lengths efficiently when the processor usage drops. As a result, the processor load is increased with our algorithm compared to the no energy handling scheduling, as Fig. 3(b) shows.
6 Conclusion
In this paper we presented a new power-aware scheduling algorithm for pipelined applications in MPSoC environments. The algorithm was implemented on a virtual MPSoC platform simulated with the SESAM environment. Substantial energy consumption gains were made compared to a classic data dependency scheduling that only takes blocking states into account. For a WCDMA application executing on a platform with 13 PEs, our scheduling algorithm reduced the power consumption of the processing resources by 45%. In addition, the use of DVFS and DPM did not impact the application execution speed: the variation in execution speed was maintained below 2%. Moreover, our algorithm succeeded in maintaining a high processor load. As a result, our algorithm allows a good acceleration of the execution speed of streaming applications in MPSoCs while efficiently managing power consumption through the use of DVFS and DPM capabilities. In addition, as our algorithm is fully online and can handle the scheduling of more tasks than processors, some processing resources can be manually shut down to lower the power budget while guaranteeing a correct execution.
Acknowledgements
Part of the research leading to these results has received funding from the ARTEMIS Joint Undertaking under grant agreement no. 100029.
References
1. Venkatachalam, V., Franz, M.: Power Reduction Techniques For Microprocessor Systems. ACM Computing Surveys (CSUR) 37(3), 195–237 (2005)
2. Intel PXA27x Processor Family, Electrical, Mechanical, and Thermal Specification (2005)
3. Dertouzos, M.L., Mok, A.K.: Multiprocessor Online Scheduling of Hard-Real-Time Tasks. IEEE Transactions on Software Engineering 15(12), 1497–1506 (1989)
4. Benini, L., Bertozzi, D., Guerri, A., Milano, M.: Allocation, Scheduling and Voltage Scaling on Energy Aware MPSoCs. In: Beck, J.C., Smith, B.M. (eds.) CPAIOR 2006. LNCS, vol. 3990, pp. 44–58. Springer, Heidelberg (2006)
5. Lu, Y.-H., Benini, L., De Micheli, G.: Dynamic Frequency Scaling with Buffer Insertion for Mixed Workloads. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 21(5), 1284–1305 (2002)
6. Pettis, N., Cai, L., Lu, Y.-H.: Statistically Optimal Dynamic Power Management for Streaming Data. IEEE Transactions on Computers 55(7), 800–814 (2006)
7. Kim, K.H., Buyya, R., Kim, J.: Power Aware Scheduling of Bag-of-Tasks Applications with Deadline Constraints on DVS-enabled Clusters. In: IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp. 541–548 (2007)
8. Zhang, F., Chanson, S.T.: Power-Aware Processor Scheduling under Average Delay Constraints. In: IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pp. 202–212 (2005)
9. Choudhury, P., Chakrabarti, P.P., Kumar, R.: Online Dynamic Voltage Scaling using Task Graph Mapping Analysis for Multiprocessors. In: International Conference on VLSI Design (VLSID), pp. 89–94 (2007)
10. Hua, S., Qu, G., Bhattacharyya, S.S.: Energy-Efficient Embedded Software Implementation on Multiprocessor System-on-Chip with Multiple Voltages. ACM Transactions on Embedded Computing Systems (TECS) 5(2), 321–341 (2006)
11. Zhang, F., Chanson, S.T.: Blocking-Aware Processor Voltage Scheduling for Real-Time Tasks. ACM TECS 3(2), 307–335 (2004)
12. Im, C., Kim, H., Ha, S.: Dynamic Voltage Scheduling Technique for Low-Power Multimedia Applications Using Buffers. In: ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 34–39 (2001)
13. Ventroux, N., Guerre, A., Sassolas, T., Moutaoukil, L., Bechara, C., David, R.: SESAM: An MPSoC Simulation Environment for Dynamic Application Processing. In: IEEE International Conference on Embedded Software and Systems (ICESS) (2010)
14. Guerre, A., Ventroux, N., David, R., Merigot, A.: Approximate-Timed Transactional Level Modeling for MPSoC Exploration: A Network-on-Chip Case Study. In: IEEE Euromicro Symposium on Digital Systems Design (DSD), pp. 390–397 (2009)
15. Richardson, A.: WCDMA Design Handbook (2006)

An Automated Framework for Power-Critical Code Region Detection and Power Peak Optimization of Embedded Software
Christian Bachmann1, Andreas Genser1, Christian Steger1, Reinhold Weiß1, and Josef Haid2
1 Institute for Technical Informatics, Graz University of Technology, Austria 2 Infineon Technologies Austria AG, Design Center Graz, Austria
Abstract. In power-constrained mobile systems such as RF-powered smart-cards, power consumption peaks can lead to supply voltage drops threatening the reliability of these systems. In this paper we focus on the automated detection and reduction of power consumption peaks caused by embedded software. We propose a complete framework for automatically profiling embedded software applications by means of the power emulation technique and for identifying the power-critical software source code regions causing power peaks. Depending on the power management features available on the given device, an optimization strategy is chosen and automatically applied to the source code. In comparison to the manual optimization of power peaks, the automatic approach decreases the execution time overhead while only slightly increasing the required code size.
1 Introduction
The power consumption of embedded systems is increasingly dependent on software applications determining the utilization of system components and peripherals. Furthermore, the embedded software actuates power management features such as voltage and frequency scaling as well as dedicated sleep or hibernation states. Hence, software applications impact the average as well as the peak power consumption, which in turn affects the reliability, stability and security of embedded systems. Especially for RF-powered devices such as contactless smart-cards, power peaks threaten the system reliability by impacting the power supply circuit and leading to supply voltage drops [1]. These supply voltage drops can in turn result in system resets or, even worse, in erroneous system states. Therefore, power peak reduction and elimination methods for embedded software have been proposed [2–4]. Furthermore, power peak reduction techniques have been studied for the purpose of power profile flattening in hardware implementations [5–7]. For security applications, the profile flattening serves as a countermeasure against power analysis attacks.

In this paper we propose an automated methodology for profiling a software application's power consumption and deriving a power peak optimized implementation. Based on an integrated supply voltage simulation, critical code regions are
R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 11–20, 2011. © Springer-Verlag Berlin Heidelberg 2011

detected and optimized. While existing software optimization methods employ either instruction-level power simulators [2–4] or physical on-chip power measurements [5–7] to obtain power profiles, our approach utilizes a high-level power emulation technique previously introduced in [8]. Using this technique, cycle-accurate run-time power estimates are derived from the system-under-test's functional emulation. In comparison to measurement-based approaches, the joint functional and power emulation offers the advantage of inherent power profile to functional execution trace correspondence, i.e., a power consumption value can be determined for each executed instruction. Furthermore, the emulation is cycle-accurate while still allowing for rapid profiling of long program sequences. This constitutes an advantage over simulation-based approaches, which lack either simulation detail (and hence accuracy) or simulation speed.

In contrast to hardware power profile flattening approaches, no additional on-chip measurement and control hardware is required. Furthermore, as opposed to power peak reduction methods modifying intermediate language representations of the given software application [2, 3], our approach operates on and modifies the original C or assembler source code. The resulting power peak optimized source code can afterwards still be manually modified by the software engineer if required. In the context of embedded software power peak optimization, the novel contributions of this paper are as follows:
– We present a framework for detecting source code regions causing power peaks by analyzing the power consumption as well as the functional debug information obtained during software execution.
– We derive an optimization algorithm, actuating power management features for these power-critical source code regions and hence reducing the number of power peaks.
– Finally, we illustrate the feasibility of our approach on a power-constrained deep-submicron smart-card controller system.
This paper is structured as follows. In Section 2 we discuss related work on power peak optimization and power profile flattening. Section 3 presents our automated framework for power-critical code region detection and optimization. We illustrate the effectiveness of our approach in Section 4. Finally, conclusions drawn from our current work are summarized in Section 5.
2 Related Work
Due to the large influence of software on both the average and the peak power consumption of embedded systems, numerous works have studied power- and energy-aware software optimization methods. With regard to power-constrained devices, power profile flattening and the optimization of power consumption peaks are of particular interest. These power peaks are often caused by power-critical events during software execution. Especially in battery- and RF-powered devices, these peaks can severely impact the power supply circuit and can lead to supply voltage drops [1]. These supply voltage drops seriously jeopardize the stability and hence the reliability of the given system.

Power profile flattening hardware implementations have been studied in the context of security-related applications. In the security domain, the reduction of profile variability is of increased interest as a countermeasure against power analysis attacks [9]. For the purpose of reliability enhancement, the reduction of power peaks has been investigated in [3] by means of a simulation-based peak elimination framework using iterative compilation. Other attempts at power peak reduction have focused on instruction reordering to minimize the switching activity due to circuit state changes [2] as well as non-functional instruction (NFI) insertion [4]. Power profile flattening in security applications, aiming at hindering power analysis attacks by means of NFI insertion, was studied in [5]; both software and hardware implementations were shown. In [6] a current-injection-based real-time flattening method has been proposed. This approach has been extended in [7] by a voltage scaling capability for improved flattening performance.
3 Automated Power-Critical Code Region Detection and Power Peak Optimization of Embedded Software
Our automated power profiling and power-critical code region detection methodology, as depicted in Figure 1, builds upon a standard software development flow (A) and our run-time power profiling approach (B). The power estimates, alongside the functional traces, are analyzed to detect power-critical code regions (C). After these regions have been detected, an optimization algorithm is used to reduce the power consumption, and hence the power peaks, during these critical code regions (D).
Fig. 1. Automated flow for power profiling, power-critical code region detection and optimization
3.1 Run-Time Power Profiling Based on Power Emulation

For the purpose of detecting power-critical code regions, power profiling of the given software application has to be performed in the first place. In contrast to existing software power peak optimization approaches, we employ the power emulation technique previously introduced in [8] to obtain power profiles of the software application's execution. The principle of power emulation, as depicted in Figure 2, is to augment the functionally emulated system-under-test with special power estimation hardware. This power estimation hardware monitors the state of the system and its subcomponents. Based on these state data, the power estimator derives cycle-accurate run-time power estimates according to an integrated high-level power model.
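The estimator's principle can be sketched as a table lookup: a per-component, per-state power coefficient is summed every cycle. The components, states and coefficient values below are invented for illustration; the actual power model of [8] is characterized for the target hardware.

```c
/* Illustrative component/state enumeration; not the real system's. */
enum { COMP_CPU, COMP_MEM, COMP_COPROC, NUM_COMPONENTS };
enum { STATE_IDLE, STATE_ACTIVE, NUM_STATES };

/* Power coefficient table [component][state], arbitrary units.
 * In a real power model these values come from characterization. */
static const double power_lut[NUM_COMPONENTS][NUM_STATES] = {
    { 0.10, 1.00 }, /* CPU          */
    { 0.05, 0.60 }, /* memory       */
    { 0.02, 1.50 }, /* co-processor */
};

/* Cycle-accurate estimate: one observed state per component per
 * cycle, summed over all monitored components. */
double estimate_cycle_power(const int state[NUM_COMPONENTS])
{
    double p = 0.0;
    for (int c = 0; c < NUM_COMPONENTS; c++)
        p += power_lut[c][state[c]];
    return p;
}
```

In hardware, each "power sensor" of Figure 2 contributes one such lookup, and an adder tree produces the per-cycle estimate.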
Fig. 2. Embedded software power profiling utilizing power emulation: run-time power estimation and functional execution trace generation (adapted from [8])
As compared to low-level simulation-based power profiling, the power emulation technique largely reduces profiling time. This allows for the profiling of complex software applications and elaborate program sequences, such as the booting process of an operating system. In contrast to high-level simulators, power emulation offers the benefit of cycle-accuracy that instruction- or system-level simulators fail to deliver. Furthermore, power emulation offers the advantage of inherent power profile to functional execution trace correspondence as compared to measurement-based approaches.
3.2 Power-Critical Code Region Detection

Our power-critical code region detection approach, as depicted in Figure 1, consists of multiple stages. First, the functional execution trace obtained in the joint functional and power emulation step is used to establish the source code correlation, i.e., to identify the source code region corresponding to each execution trace message. Second, using the power emulation trace as input data, a supply voltage simulation employing a numerical model of the RF-supply is performed1. Third, the resulting supply voltage profile is utilized to identify power peaks leading to critical voltage drops and to find the source code regions causing these drops.

Figure 3 depicts the inductively coupled power supply of a contact-less smart-card device. The impact of power peaks on the supply voltage level, however, depends on the duration, power level and rate of these peaks, as shown in Figure 4. We define power-critical source code regions as parts of an embedded software application resulting in power peaks that lead to supply voltage drops below a critical limit. These peaks can be caused by, e.g., phases of high processor activity, a number of consecutive memory read or write accesses, and co-processor as well as power-intensive peripheral activity. In order to identify power peaks that actually lead to critical supply voltage drops on the given system, a supply voltage simulation based on the emulated power profile is performed.

1 Due to the limited computational complexity of the numerical RF-supply model, a simulation-based implementation is adequate.
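The third stage can be sketched as a simple scan over the simulated voltage profile. This is an illustrative C fragment, not the framework's code; the function name, the per-cycle source-line array and the duplicate-suppression rule are our own assumptions.

```c
#include <stddef.h>

/* Scan the simulated supply voltage profile for samples below the
 * critical limit and report the source line (taken from the functional
 * trace correlation) executing at each drop. Consecutive hits on the
 * same line are reported only once. */
size_t find_critical_lines(const double *voltage, const int *src_line,
                           size_t n_samples, double v_limit,
                           int *critical_lines, size_t max_out)
{
    size_t n_found = 0;
    for (size_t i = 0; i < n_samples && n_found < max_out; i++) {
        if (voltage[i] < v_limit) {
            /* avoid reporting the same source line twice in a row */
            if (n_found == 0 || critical_lines[n_found - 1] != src_line[i])
                critical_lines[n_found++] = src_line[i];
        }
    }
    return n_found;
}
```

The resulting list of source lines corresponds to the "critical code region report" of Figure 1 that feeds the optimization step.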
Fig. 3. Inductively coupled power supply of RF-powered smart-card embedded system (adapted from [10])

Fig. 4. Impact of different power peaks on the supply voltage (voltage drops)
3.3 Optimization of Power-Critical Source Code Regions
The subsequent power-critical code region optimization algorithm, shown in Algorithm 1, aims at applying code modifications for power peak reduction to the original C or assembler source code. Depending on the power management features available on the given system, the frequency scaling and NFI insertion techniques are applied to these power-critical regions. Listing 1.1 illustrates the insertion of frequency scaling control instructions around the call-site2 of a function causing power peaks, whereas Listing 1.2 shows the use of NFI insertion within a loop causing short power peaks.

    start_f_scaling();
    power_critical_function();
    stop_f_scaling();

Listing 1.1. f-scaling example

    while (loop_condition) {
        short_loop_instruction;
        nop(); // NFI
    }

Listing 1.2. NFI insertion example

The algorithm operates in three major stages: (1) The power-critical code regions for each function are determined. If a large part of a function constitutes the power-critical code region, the algorithm chooses to optimize the entire function; in this case the call-sites of the function are searched and marked for modification instead of the function itself. (2) Consecutive source code lines marked for modification are grouped into modification clusters. For each of those clusters, the algorithm chooses an optimization strategy based on the cluster's number of power peaks and their respective duration: short power peaks are likely to be resolved by NFI insertion, while longer power peaks or longer groups of peaks can be reduced by applying frequency scaling. (3) Each of the found source code clusters is then modified in the chosen way and the modified code is written back to the source files.

2 The source code line calling a particular function.
Algorithm 1. Power-Critical Source Code Region Optimization
Input: Set of application source code S; list of power-critical code regions L; threshold Th_clpf for the maximum percentage of power-critical lines per function; threshold Th_f-scale for the f-scaling time penalty
Output: Set of optimized application source code S_o

Step 1, group by function:
    List of affected source code lines L_sl := {}
    foreach function f in S do
        find source code lines of f in L
        if found source code lines > 0 then
            calculate percentage of power-critical code region in function
            if percentage > Th_clpf then
                find call-sites of function f; add source code lines of call-sites to L_sl
            else
                add source code lines to L_sl

Step 2, cluster lines to modify and choose optimization strategy:
    L_slc := cluster consecutive source code lines in L_sl
    foreach source code cluster C in L_slc do
        if duration of C > Th_f-scale then
            mark cluster C for f-scaling
        else
            mark cluster C for NFI insertion

Step 3, perform modification:
    S_o := S
    foreach source code cluster C in L_slc do
        modify S_o by inserting the selected optimization instructions
4 Experimental Results
For evaluating our framework, a smart-card microcontroller test-system supplied by our industrial partner was employed. For different benchmarking applications, power profiles were recorded using the power emulation technique. Afterwards, these benchmarks were optimized both in a manual as well as in an automated way utilizing the presented framework. This allows for evaluating the effectiveness of our method.
4.1 Test System for Power Peak Optimization

The used smart-card microcontroller test system consists of a 16-bit pipelined cache architecture. It comprises volatile and non-volatile memories as well as a number of peripherals, e.g., cryptographic coprocessors, timers, and random number generators. The system has been augmented with a power emulation unit, as depicted in Figure 5, to allow for the generation of run-time power estimates.

For detecting power peaks leading to problematic supply voltage drops, we have implemented an RF power supply equivalent circuit model as proposed in [1] and depicted in Figure 6. Based on power consumption changes in the microcontroller test-system, the load current il(t) changes and affects the load voltage vl(t). In phases of high power consumption, and thus high load currents, when the required load current is higher than the supplied source current is(t), the energy storage capacitor delivers the missing fraction ic(t). However, for longer power peaks or a longer series of short power peaks, the capacitor fails to deliver the required current, resulting in a critical supply voltage drop.
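As a rough illustration of this behaviour, the equivalent circuit of Fig. 6 can be discretized with a forward-Euler step: the capacitor recovers through Ri towards the source voltage Vs and sags whenever the load current exceeds the source current. This is a generic sketch with placeholder component values, not the model of [1].

```c
/* Forward-Euler simulation of a source Vs behind resistance Ri
 * feeding a storage capacitor C and a load current trace il[].
 *   is(t) = (Vs - vl(t)) / Ri          source current through Ri
 *   ic(t) = is(t) - il(t)              net current into the capacitor
 *   C * dvl/dt = ic(t)                 capacitor voltage update
 * Writes the resulting load voltage samples into vl[]. */
void simulate_supply(const double *il, double *vl, int n,
                     double vs, double ri, double c, double dt)
{
    double v = vs; /* start from the unloaded source voltage */
    for (int k = 0; k < n; k++) {
        double is_k = (vs - v) / ri; /* current through Ri   */
        double ic_k = is_k - il[k];  /* net current into C   */
        v += dt * ic_k / c;          /* Euler step on C dv/dt */
        vl[k] = v;
    }
}
```

For a constant load current I, the voltage settles at Vs - I*Ri; a short burst of high il pulls vl down transiently, which is exactly the voltage-drop mechanism the detection stage looks for (the step dt must be small against Ri*C for stability).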
Fig. 5. 16-bit smart-card microcontroller test system augmented by power emulation unit (adapted from [11])

Fig. 6. Equivalent circuit of the RF power supply of the test system (adapted from [1])
4.2 Comparison of Original and Optimized Power Consumption and Supply Voltage Profiles

We illustrate the optimization result by comparing the power consumption and the respective supply voltage profiles of a given software application. Figure 7 presents the results obtained during profiling of the original application. After the power-critical code region detection and optimization, the power profiling and supply voltage simulation was repeated, yielding the profiles depicted in Figure 8.
Fig. 7. Unoptimized power consumption and resulting supply voltage profiles of authentication benchmarking application3

Fig. 8. Optimized power consumption and resulting supply voltage profiles of authentication benchmarking application3
The results illustrate how a number of power peaks result in supply voltage drops below the critical limit. By applying frequency scaling and NFI insertion to the code regions causing these peaks, their power consumption and hence their supply voltage impact can be diminished. Note that this modification, while improving system stability and reliability, comes at the cost of a slightly increased execution time. However, as illustrated in the subsequent section, the additionally required execution time is smaller for the automatically than for the manually optimized version because the frequency scaling and the NFI insertion are applied more selectively.
4.3 Impact of Power Peak Optimization on Execution Time and Code Size

We have applied the power peak optimization algorithm to various benchmarking applications in order to evaluate its impact on the execution time and the code size. For comparison we have also manually optimized the given benchmarking applications by applying frequency scaling to the entire benchmark. For both the manual and the automatic approach, all power peaks resulting in critical supply voltage drops have been eliminated. Figure 9 illustrates these results for two general-purpose microcontroller benchmarks (Coremark [12] and Dhrystone) as well as for two domain-specific ones (Authentication and Crypto).
3 Data normalized due to existing NDA.
Fig. 9. Execution time and code size of original, manually as well as automatically modified benchmarks4
The results show that in terms of execution time the automatic approach outperforms the manual optimization due to the finer granularity of code modifications. For the manual optimization approach the execution time increases by ∼10% due to the minimally required frequency reduction of ∼10% for eliminating all critical supply voltage drops. However, for the automatic approach this increase is in the range of only 1.2% (Crypto) up to 6.8% (Authentication), depending on the number and duration of power peaks. Note that the increase in execution time also depends on the ratio of code regions affected by power peaks that need to be optimized to regions requiring no optimization. Furthermore, we compare the increase in code size caused by the insertion of frequency scaling control instructions and NFIs. This increase is almost negligible for the manual approach (∼1% or smaller for all testcases). For the automatic approach, the increase is slightly higher, in the range of 0.2% (Crypto) up to 3.2% (Dhrystone).
5 Conclusions
The power consumption of embedded systems is to a large extent determined by software applications, actuating power management features as well as controlling the overall system activity. Power peaks, caused by power-critical software events, can seriously impact the supply voltage and lead to critical supply voltage drops. These voltage drops pose a threat to the reliability of power-constrained mobile devices such as RF-powered smart cards. In this paper we have outlined an automated framework aimed at power peak detection, utilizing emulation-based power profiling of given embedded software applications. By identifying the software code regions causing power peaks, the framework is able to selectively apply power reduction strategies, such
as frequency scaling and non-functional instruction insertion, to the affected regions. Furthermore, we have evaluated the effectiveness of this automated power peak optimization framework on a number of benchmarking applications. For these benchmarks the inherent execution time increase is in the range of only 1.2% up to 6.8% for the automatic modifications as compared to ∼10% for the manual ones.

4 Data normalized due to existing NDA.
Acknowledgements
We would like to thank the Austrian Federal Ministry for Transport, Innovation, and Technology for providing us with funding for the POWERHOUSE project under FIT-IT contract FFG 815193, as well as our industrial partners Infineon Technologies Austria AG and Austria Card GmbH for their enduring support.
References
1. Haid, J., Kargl, W., Leutgeb, T., Scheiblhofer, D.: Power management for RF-powered vs. battery-powered devices. In: TMCS (2005)
2. Grumer, M., Wendt, M., Steger, C., Weiss, R., Neffe, U., Muehlberger, A.: Automated software power optimization for smart card systems with focus on peak reduction. In: AICCSA (2007)
3. Grumer, M., Wendt, M., Lickl, S., Steger, C., Weiss, R., Neffe, U., Muehlberger, A.: Software power peak reduction on smart card systems based on iterative compiling. In: Emerging Directions in Embedded and Ubiquitous Computing (2007)
4. Wendt, M., Grumer, M., Steger, C., Weiss, R., Neffe, U., Muehlberger, A.: System level power profile analysis and optimization for smart cards and mobile devices. In: SAC (2008)
5. Muresan, R., Gebotys, C.: Current flattening in software and hardware for security applications. In: CODES+ISSS (2004)
6. Li, X., Vahedi, H., Muresan, R., Gregori, S.: An integrated current flattening module for embedded cryptosystems. In: ISCAS (2005)
7. Vahedi, H., Muresan, R., Gregori, S.: On-chip current flattening circuit with dynamic voltage scaling. In: ISCAS (2006)
8. Genser, A., Bachmann, C., Haid, J., Steger, C., Weiss, R.: An emulation-based real-time power profiling unit for embedded software. In: SAMOS (2009)
9. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, p. 388. Springer, Heidelberg (1999)
10. Finkenzeller, K.: RFID Handbook. John Wiley & Sons Ltd., Chichester (2003)
11. Bachmann, C., Genser, A., Steger, C., Weiss, R., Haid, J.: Automated power characterization for run-time power emulation of SoC designs. In: 13th Euromicro DSD (2010) (in press)
12. http://www.coremark.org/

System Level Power Estimation of System-on-Chip Interconnects in Consideration of Transition Activity and Crosstalk
Martin Gag, Tim Wegner, and Dirk Timmermann
Institute of Applied Microelectronics and Computer Engineering, University of Rostock
[email protected]
www.networks-on-chip.com
Abstract. As technology reaches the nanoscale, interconnection systems account for the largest part of the power consumption in Systems-on-Chip. Hence, an early and sufficiently accurate power estimation technique is needed to make the right design decisions. In this paper we present a method for system-level power estimation of interconnection fabrics in Systems-on-Chip. Estimations based on simple average assumptions about the data stream are compared against estimations that take bit-level statistics into account, thereby capturing low-level effects such as activity factors and crosstalk capacitances. By examining different data patterns and traces of a video decoding system as a realistic example, we find that data-dependent effects have a non-negligible influence on the power consumption of the interconnection system of nanoscale chips. Because our approach relies on statistical data, it does not degrade simulation speed.
1 Introduction
Lowering the power consumption of microsystems is one of the main topics in chip design and technology development. This challenge has to be tackled not only because of the demand for energy saving and extended run times of mobile devices, but also to avoid problems concerning cooling and reliability. Technology scaling and further process enhancements lower both the dynamic power consumption and the size of transistors. As logic devices become smaller and less energy-dissipative, integration density rises, and therefore more interconnects between these elements are needed. The power consumption of the wires, however, largely remains at the same level: wires cannot be shrunk to the same extent and must be routed at small distances from each other, which raises the coupling capacitances even when ultra-low-k materials are used. Consequently, the share of energy consumed in the interconnection system grows relative to the overall energy dissipation. In the Intel 80-core chip, for example, the communication system is responsible for over 28% of the overall power budget [1]. Hence, the energy consumed in the interconnection systems of microchips is becoming increasingly important.
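The dependence of interconnect power on switching activity and coupling can be made concrete with the classic dynamic-power relation P = α·C·V²·f. The sketch below is our own illustration, not the tool presented in this paper; the capacitance values, supply voltage, and frequency are assumed placeholders. It extracts a per-wire activity factor and a worst-case opposing-transition factor from a bit-level data trace and folds both into an effective capacitance:

```python
# Illustrative per-wire parameters (assumed placeholder values, not from the paper)
C_SELF = 50e-15     # self (ground) capacitance per wire [F]
C_COUPLE = 100e-15  # coupling capacitance between adjacent wires [F]
VDD = 1.0           # supply voltage [V]
FREQ = 500e6        # clock frequency [Hz]

def switching_stats(trace, width):
    """From a trace of bus words, compute the average self-transition
    activity per wire and the rate of opposing transitions on adjacent
    wire pairs (the worst case for crosstalk)."""
    self_toggles = 0
    opposing = 0
    for prev, curr in zip(trace, trace[1:]):
        diff = prev ^ curr
        for i in range(width):
            if (diff >> i) & 1:
                self_toggles += 1
            # Adjacent wires switching in opposite directions see roughly
            # twice the coupling capacitance (Miller effect).
            if i + 1 < width and (diff >> i) & 1 and (diff >> (i + 1)) & 1:
                if ((curr >> i) & 1) != ((curr >> (i + 1)) & 1):
                    opposing += 1
    cycles = len(trace) - 1
    return (self_toggles / (cycles * width),
            opposing / (cycles * (width - 1)))

def dynamic_power(alpha_self, alpha_couple, width):
    """P = (alpha_s * C_s * W + alpha_c * 2*C_c * (W-1)) * VDD^2 * f"""
    c_eff = (alpha_self * C_SELF * width
             + alpha_couple * 2 * C_COUPLE * (width - 1))
    return c_eff * VDD ** 2 * FREQ

trace = [0b0000, 0b1111, 0b0101, 0b1010]  # toy 4-bit data stream
a_s, a_c = switching_stats(trace, 4)
print(dynamic_power(a_s, a_c, 4))  # estimated dynamic power in watts
```

Because the power model consumes only aggregate statistics (α values) rather than the raw trace, evaluating it adds essentially no simulation overhead, which mirrors the speed argument made in the abstract.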
R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 21–30, 2011.