Lecture Notes in Computer Science 6448
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

René van Leuken
Gilles Sicard (Eds.)

Integrated Circuit and System Design

Power and Timing Modeling, Optimization and Simulation

20th International Workshop, PATMOS 2010
Grenoble, France, September 7-10, 2010
Revised Selected Papers

Volume Editors

René van Leuken Delft University of Technology 2628 CD Delft, The Netherlands E-mail: [email protected]

Gilles Sicard TIMA Laboratory 38031 Grenoble, France E-mail: [email protected]

Library of Congress Control Number: 2010940964

CR Subject Classification (1998): C.4, I.6, D.2, C.2, F.3, D.3

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

ISSN 0302-9743 ISBN-10 3-642-17751-4 Springer Berlin Heidelberg New York ISBN-13 978-3-642-17751-4 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2011 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180

Preface

Welcome to the proceedings of the 20th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2010. Over the years, PATMOS has evolved into an important European event, where researchers from both industry and academia discuss and investigate the emerging challenges in future and contemporary applications, design methodologies, and tools required for the development of the upcoming generations of integrated circuits and systems. PATMOS 2010 was organized by the TIMA Laboratory, France, with the sponsorship of Joseph Fourier University, CEA LETI, Minalogic, CNRS, Grenoble Institute of Technology and the technical co-sponsorship of the IEEE France Section. Further information about the workshop is available at: http://patmos2010.imag.fr.

The technical program of PATMOS 2010 contained state-of-the-art technical contributions, three invited keynotes, a special session organized by the “Beyond DREAMS (Catrene 2A717)” project on “High-Level Modeling of Power-Aware Heterogeneous Designs in SystemC-AMS” and a special session organized by Minalogic presenting the results of four projects. The technical program focused on timing, performance, and power consumption, as well as architectural aspects with particular emphasis on modeling, design, characterization, analysis, and optimization in the nanometer era.

The Technical Program Committee, with the assistance of additional expert reviewers, selected the 24 papers presented at PATMOS. The papers were organized into six oral sessions. As is customary for the PATMOS workshops, full papers were required for review, and a minimum of three reviews were received per manuscript.

Beyond the presentations of the papers, the PATMOS technical program was enriched by a series of talks offered by world-class experts on important emerging research issues of industrial relevance.
Kiyoo Itoh, Fellow of the Central Research Laboratory, Hitachi, Ltd., spoke about “Variability-Conscious Circuit Designs for Low-Voltage Memory-Rich Nano-Scale CMOS LSIs,” Marc Belleville of CEA, LETI, MINATEC, spoke about “3D Integration for Digital and Imagers Circuits: Opportunities and Challenges,” and Sébastien Marchal of STMicroelectronics spoke about “Signing off Industrial Designs on Evolving Technologies.”

We would like to thank our colleagues who voluntarily worked to make this edition of PATMOS possible: the expert reviewers; the members of the Technical Program and Steering Committees; the invited speakers; and last but not least, the local personnel who offered their skill, time, and extensive knowledge to make PATMOS 2010 a memorable event.

September 2010
René van Leuken
Gilles Sicard

Organization

Organizing Committee

René van Leuken, TU Delft, The Netherlands (Program Chair)
Gilles Sicard, TIMA Laboratory, France (General Chair)
Anne-Laure Fourneret-Itie, TIMA Laboratory, France
Laurent Fesquet, TIMA Laboratory, France
Katell Morin-Allory, TIMA Laboratory, France
Florent Ouchet, TIMA Laboratory, France
Julie Correard, TIMA Laboratory, France

Technical Program Committee

Atila Alvandpour, Linköping University, Sweden
David Atienza, EPFL, Switzerland
Nadine Azemard, University of Montpellier, France
Peter Beerel, USC, USA
Davide Bertozzi, University of Ferrara, Italy
Naehyuck Chang, Seoul University, Korea
Jorge Juan Chico, University of Seville, Spain
Joan Figueras, University of Catalonia, Spain
Eby Friedman, University of Rochester, USA
Costas Goutis, University of Patras, Greece
Eckhard Grass, IHP, Germany
José Luís Güntzel, University of Santa Catarina, Brazil
Oscar Gustafsson, Linköping University, Sweden
Shiyan Hu, Michigan Technical University, USA
Nathalie Julien, University of Bretagne-Sud, France
Domenik Helms, OFFIS Research Institute, Germany
René van Leuken, TU Delft, The Netherlands
Philippe Maurine, University of Montpellier, France
Jose Monteiro, INESC-ID / IST, Portugal
Vasily Moshnyaga, University of Fukuoka, Japan
Tudor Murgan, Infineon, Germany
Wolfgang Nebel, University of Oldenburg, Germany
Dimitris Nikolos, University of Patras, Greece
Antonio Nunez, University of Las Palmas, Spain
Vojin Oklobdzija, University of Texas at Dallas, USA
Vassilis Paliouras, University of Patras, Greece
Davide Pandini, ST Microelectronics, Italy
Antonis Papanikolaou, NTUA, Greece
Christian Piguet, CSEM, Switzerland
Massimo Poncino, Politecnico di Torino, Italy
Ricardo Reis, University of Porto Alegre, Brazil
Donatella Sciuto, Politecnico di Milano, Italy
Gilles Sicard, TIMA Laboratory, France
Dimitrios Soudris, NTUA, Athens, Greece
Zuochang Ye, Tsinghua University, Beijing, China
Robin Wilson, ST Microelectronics, France

Steering Committee

Antonio J. Acosta, University of Seville, Spain
Nadine Azemard, University of Montpellier, France
Joan Figueras, University of Catalonia, Spain
Reiner Hartenstein, TU Kaiserslautern, Germany
Jorge Juan-Chico, University of Seville, Spain
Enrico Macii, Politecnico di Torino, Italy
Philippe Maurine, University of Montpellier, France
Jose Monteiro, INESC-ID / IST, Portugal
Wolfgang Nebel, OFFIS, Germany
Vassilis Paliouras, University of Patras, Greece
Christian Piguet, CSEM, Switzerland
Dimitrios Soudris, NTUA, Athens, Greece
René van Leuken, TU Delft, The Netherlands
Diederik Verkest, IMEC, Belgium
Roberto Zafalon, ST Microelectronics, Italy

Executive Steering Committee

Vassilis Paliouras, University of Patras, Greece
Nadine Azemard, University of Montpellier, France
Jose Monteiro, INESC-ID / IST, Portugal

Table of Contents

Session 1: Design Flows

A Power-Aware Online Scheduling Algorithm for Streaming Applications in Embedded MPSoC
Tanguy Sassolas, Nicolas Ventroux, Nassima Boudouani, and Guillaume Blanc

An Automated Framework for Power-Critical Code Region Detection and Power Peak Optimization of Embedded Software
Christian Bachmann, Andreas Genser, Christian Steger, Reinhold Weiß, and Josef Haid

System Level Power Estimation of System-on-Chip Interconnects in Consideration of Transition Activity and Crosstalk
Martin Gag, Tim Wegner, and Dirk Timmermann

Residue Arithmetic for Designing Low-Power Multiply-Add Units
Ioannis Kouretas and Vassilis Paliouras

Session 2: Circuit Techniques 1

An On-chip Flip-Flop Characterization Circuit
Abhishek Jain, Andrea Veggetti, Dennis Crippa, and Pierluigi Rolandi

A Low-Voltage Log-Domain Integrator Using MOSFET in Weak Inversion
Lida Ramezani

Physical Design Aware Comparison of Flip-Flops for High-Speed Energy-Efficient VLSI Circuits
Massimo Alioto, Elio Consoli, and Gaetano Palumbo

A Temperature-Aware Time-Dependent Dielectric Breakdown Analysis Framework
Dimitris Bekiaris, Antonis Papanikolaou, Christos Papameletis, Dimitrios Soudris, George Economakos, and Kiamal Pekmestzi

Session 3: Low Power Circuits

An Efficient Low Power Multiple-Value Look-Up Table Targeting Quaternary FPGAs
Cristiano Lazzari, Jorge Fernandes, Paulo Flores, and José Monteiro

On Line Power Optimization of Data Flow Multi-core Architecture Based on Vdd-Hopping for Local DVFS
Pascal Vivet, Edith Beigne, Hugo Lebreton, and Nacer-Eddine Zergainoh

Self-Timed SRAM for Energy Harvesting Systems
Abdullah Baz, Delong Shang, Fei Xia, and Alex Yakovlev

L1 Data Cache Power Reduction Using a Forwarding Predictor
P. Carazo, R. Apolloni, F. Castro, D. Chaver, L. Pinuel, and F. Tirado

Session 4: Self-Timed Circuits

Statistical Leakage Power Optimization of Asynchronous Circuits Considering Process Variations
Mohsen Raji, Alireza Tajary, Behnam Ghavami, Hossein Pedram, and Hamid R. Zarandi

Optimizing and Comparing CMOS Implementations of the C-Element in 65nm Technology: Self-Timed Ring Case
Oussama Elissati, Eslam Yahya, Sébastien Rieubon, and Laurent Fesquet

Hermes-A – An Asynchronous NoC Router with Distributed Routing
Julian Pontes, Matheus Moreira, Fernando Moraes, and Ney Calazans

Practical and Theoretical Considerations on Low-Power Probability-Codes for Networks-on-Chip
Alberto Garcia-Ortiz and Leandro S. Indrusiak

Session 5: Process Variation

Logic Architecture and VDD Selection for Reducing the Impact of Intra-die Random VT Variations on Timing
Bahman Kheradmand-Boroujeni, Christian Piguet, and Yusuf Leblebici

Impact of Process Variations on Pulsed Flip-Flops: Yield Improving Circuit-Level Techniques and Comparative Analysis
Marco Lanuzza, Raffaele De Rose, Fabio Frustaci, Stefania Perri, and Pasquale Corsonello

Transistor-Level Gate Modeling for Nano CMOS Circuit Verification Considering Statistical Process Variations
Qin Tang, Amir Zjajo, Michel Berkelaar, and Nick van der Meijs

White-Box Current Source Modeling Including Parameter Variation and Its Application in Timing Simulation
Christoph Knoth, Irina Eichwald, Petra Nordholz, and Ulf Schlichtmann

Session 6: Circuit Techniques 2

Controlled-Precision Pure-Digital Square-Wave Frequency Synthesizer
Abdelkrim Kamel Oudjida, Ahmed Liacha, Mohamed Lamine Berrandjia, and Rachid Tiar

An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis
Oliver Schrape, Frank Winkler, Steffen Zeidler, Markus Petri, Eckhard Grass, and Ulrich Jagdhold

Clock Network Synthesis with Concurrent Gate Insertion
Jingwei Lu, Wing-Kai Chow, and Chiu-Wing Sham

Modeling Time Domain Magnetic Emissions of ICs
Victor Lomné, Philippe Maurine, Lionel Torres, Thomas Ordas, Mathieu Lisart, and Jérome Toublanc

Special Session 1: High-Level Modeling of Power-Aware Heterogeneous Designs in SystemC-AMS (Abstracts)

Power Profiling of Embedded Analog/Mixed-Signal Systems
Jan Haase and Christoph Grimm

Open-People: Open Power and Energy Optimization PLatform and Estimator
Daniel Chillet

Early Power Estimation in Heterogeneous Designs Using SoCLib and SystemC-AMS
François Pêcheux, Khouloud Zine El Abidine, and Alain Greiner

Special Session 2: Minalogic (Abstracts)

ASTEC: Asynchronous Technology for Low Power and Secured Embedded Systems
Pr. Marc Renaudin

OPENTLM and SOCKET: Creating an Open EcoSystem for Virtual Prototyping of Complex SOCs
Laurent Maillet-Contoz

Keynotes (Abstracts)

Variability-Conscious Circuit Designs for Low-Voltage Memory-Rich Nano-Scale CMOS LSIs
Kiyoo Itoh

3D Integration for Digital and Imagers Circuits: Opportunities and Challenges
Marc Belleville

Signing off Industrial Designs on Evolving Technologies
Sébastien Marchal

Author Index

A Power-Aware Online Scheduling Algorithm for Streaming Applications in Embedded MPSoC

Tanguy Sassolas, Nicolas Ventroux, Nassima Boudouani, and Guillaume Blanc

CEA, LIST, Embedded Computing Laboratory, 91191 Gif-sur-Yvette CEDEX, France [email protected]

Abstract. As application complexity grows, embedded systems move to multiprocessor architectures to cope with the computation needs. The issue for multiprocessor architectures is to optimize the usage of processing resources and the power consumption to reach a higher energy efficiency. These optimizations are handled by scheduling techniques. To tackle this issue we propose a global online scheduling algorithm for streaming applications. It takes into account data dependencies between pipeline tasks to optimize processor usage and reduce power consumption through the use of DPM and DVFS modes. An implementation of the algorithm on a virtual platform, executing a WCDMA application, demonstrates up to 45% power consumption gain while guaranteeing regular data throughput.

Index Terms: scheduling, low-power, multiprocessor, streaming applications.

1 Introduction

As embedded applications become more complex, future embedded architectures will have to provide higher computing performance while respecting strong surface and consumption constraints. Embedded devices will not only execute more computing-intensive applications but also cross-domain ones, including telecom and video processing applications. To cope with these demands, an emerging trend in design lies in the conception of MultiProcessor Systems-on-Chips (MPSoC). These new architectures, with their high density of processing elements, dissipate a large amount of energy. This dissipation must be taken into account to match an embedded-compliant power budget and to limit ageing phenomena. To handle these thermal and energy issues, MPSoC designers integrate DVFS and DPM capabilities into their platforms.

To leverage MPSoC processing capabilities, applications need to be highly parallelized. A simple way to increase application parallelism and data throughput is to pipeline sequential applications into streaming ones. This applies to the WCDMA application, whose parallelism can be drastically increased. Pipeline stages must then be efficiently allocated to the processing resources while taking into account data dependencies between them. As applications become more prone to execution time variation, online control solutions are needed to dynamically schedule tasks and increase processor load. These variations can stem from differences in the input data of data processing applications, or from the application structure itself. For instance, the WCDMA application processes a pilot frame differently from a user frame.

Only a global scheduler with a complete view of the computation resources and task states can perform an optimal scheduling. The choice of global scheduling pushes forward the use of a central control solution. In addition, an online central control solution must react quickly to platform events. Therefore, online scheduling must remain simple and must find a balance between accuracy and execution speed. In this article, we propose an online power-aware scheduling algorithm that matches these conditions. This algorithm focuses on the scheduling of streaming applications. It also tackles power consumption issues through an efficient use of the Dynamic Voltage and Frequency Scaling (DVFS) and Dynamic Power Management (DPM) modes of the processing resources.

This paper is organized as follows: section 2 studies existing solutions in the field of power-aware streaming application scheduling. Section 3 describes the proposed power-aware scheduling algorithm. Section 4 details implementation issues, focusing on the simulation framework and the targeted MPSoC platform. Results are presented in section 5, where the impact of our scheduling algorithm in terms of Quality of Service (QoS) and power consumption gain is evaluated. Finally, section 6 discusses the capabilities of this new streaming application scheduling algorithm and its future improvements.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 1–10, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Related Work

We focus our study on power-aware scheduling algorithms that rely on DVFS and DPM techniques [1]. First, we briefly present the DPM and DVFS techniques and their impact on energy consumption. Then we survey previous work on offline power-aware scheduling techniques for stream processing. Finally, we review online low-power scheduling techniques for dependent tasks.

The dissipated power in a CMOS design can be divided into two major sources: the dynamic power consumption and the static one. The dynamic part is mainly due to transistor state switching, and it can be drastically reduced by lowering the supply voltage. As the transistor delay is a function of the supply voltage, lowering the supply voltage imposes a matching frequency reduction. This technique is called DVFS. The static consumption is due to various leakage currents in the transistor. DVFS has some impact on static power consumption thanks to the supply voltage reduction, but this is not sufficient to reduce it drastically. To cut down static power consumption, the only viable solution is to switch off unused parts of a circuit. This technique is called DPM. Contrary to DVFS, DPM makes the resource unavailable.

The main drawback of these two techniques lies in the time and energy penalties of mode switching. While the timing penalties of DVFS are rather limited, the same is not true of DPM, where the wake-up time can reach a hundred milliseconds (136 ms for the PXA270 [2]). Therefore, for a processor implementing both techniques, the issue is to determine when reducing the voltage and frequency couple is more energy efficient than running at full speed and then switching off the processor. This matter is summarized in Fig. 1. For a given technological process, the issue is thus to evaluate the duration of future inactivity periods of the resource. Having introduced the DVFS and DPM techniques and the optimization problem they imply, we now present offline low-power scheduling techniques for streaming applications.

Fig. 1. DPM (left) and DVFS (right) technique timing issues
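The trade-off between the two techniques can be illustrated with a small numerical sketch. It is only an illustration under simplifying assumptions: each mode draws a constant power, the whole cost of entering and leaving the sleep state is lumped into a single switching energy, and a processor that cannot sleep stays at active power. The names `Mode`, `energy_run_then_sleep`, `energy_slow_down` and `best_strategy`, as well as every figure used with them, are hypothetical and are not taken from the PXA270 data.

```python
from dataclasses import dataclass


@dataclass
class Mode:
    """One operating point (illustrative values only)."""
    freq_hz: float   # clock frequency in this mode
    power_w: float   # average power drawn while powered in this mode


def energy_run_then_sleep(cycles, period_s, fast, sleep_power_w,
                          wake_time_s, switch_energy_j):
    """Energy over one period when running at full speed, then sleeping (DPM)."""
    busy_s = cycles / fast.freq_hz
    idle_s = period_s - busy_s
    if idle_s < wake_time_s:
        # The idle gap is shorter than the wake-up latency, so the processor
        # cannot sleep; it is assumed to stay at active power.
        return fast.power_w * period_s
    # switch_energy_j lumps together the sleep and wake-up transition costs.
    return (fast.power_w * busy_s + switch_energy_j
            + sleep_power_w * (idle_s - wake_time_s))


def energy_slow_down(cycles, period_s, slow):
    """Energy over one period when staying in a slower DVFS mode (no sleep)."""
    if cycles / slow.freq_hz > period_s:
        return float("inf")  # the slow mode cannot finish the work in time
    return slow.power_w * period_s


def best_strategy(cycles, period_s, fast, slow, sleep_power_w,
                  wake_time_s, switch_energy_j):
    """Return "dvfs" or "dpm", whichever costs less energy in this model."""
    e_dpm = energy_run_then_sleep(cycles, period_s, fast, sleep_power_w,
                                  wake_time_s, switch_energy_j)
    e_dvfs = energy_slow_down(cycles, period_s, slow)
    return "dvfs" if e_dvfs < e_dpm else "dpm"
```

With these made-up figures, a short idle gap favours slowing down, while a long one favours sleeping, which is why the duration of future inactivity periods is the quantity to estimate.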

Given that scheduling in a multiprocessor environment is an NP-complete problem [3], adding power consumption optimization makes power-aware multiprocessor scheduling even harder to solve. A streaming application can be seen as a set of tasks linked by their data dependencies. Thus, scheduling dependent tasks makes it possible to schedule streaming applications. Many offline solutions have been proposed to solve this optimality issue, assuming that task dependencies and execution lengths are available. They mainly differ in the way they describe the problem, in the parameters taken into account, and in the optimization method used to solve it, as in [4].

To the authors' knowledge, no previous work has addressed offline low-power multiprocessor scheduling dedicated to streaming applications. Nonetheless, an interesting line of work has been developed with the same scope but for monoprocessor environments. In [5] the authors study power optimization using DVFS on a streaming application described as a directed acyclic graph with a constant output rate. Their solution finds either the lowest-consumption schedule for a given buffer size or the buffer size for a given power budget. A similar approach is taken in [6] with DPM. To model more realistic applications, they describe the production rate as a random variable following a given probability law. Nonetheless, variations in the effective execution time limit the performance of offline solutions. To handle this dynamism, online low-power solutions have been proposed for streaming applications.

Many online solutions have been designed for the case of independent tasks [7,8], but they cannot be applied to streaming applications. Online scheduling solutions that handle task dependencies are uncommon. Interesting solutions for dependent task scheduling have been proposed in [9,10]. Nonetheless, these solutions rely on a partitioning of resources. Partitioning solutions are necessarily sub-optimal, as they only handle resources separately. A global scheduling can potentially reach a better resource usage.

For the reader's reference, we recall a few online power management techniques used for monoprocessor architectures in the case of streaming applications described with a Directed Acyclic Graph (DAG). In [11] the authors take into account potentially blocking communication between tasks, always running the data producer at full speed in that case and lowering the energy consumption otherwise. [12] presents another example of inter-task communication buffer size optimization, this time with an online scheduler handling the slack time accumulated through buffer use. None of the strategies listed above addresses the online scheduling of streaming applications in a way that allows pipelined execution and potential output rate improvements in an MPSoC environment.

3 Power-Aware Streaming Application Scheduling

We believe that a more power-efficient scheduling for dynamic streaming applications can be found through online global scheduling. In this section, we first recall the application description used by our algorithm. Then we explain the grounds of our algorithm, before presenting it in detail.

Our scheduling algorithm has been written to handle streaming applications described in a specific way. An application is a set of tasks with consumer/producer relationships. Data is transferred from a producer task to a consumer task through a circular buffer. Only one task can write to a buffer, while it can be read by multiple consumer tasks. This creates a divergence in the data flow. A consumer task can also read multiple input buffers, creating a convergence in the data flow. This allows the description of parallelism in the processing flow of a given data item.

Given the previously described application model, one can make a few observations. The throughput of a streaming application is constrained by the duration of its slowest stage. As a result, other pipeline stages can be slowed down to meet the same output rate as the slowest stage. This can be done by using a slower DVFS mode for the resources whose output rate is too high. Besides, tasks that are further along the pipeline than the slowest task will end up blocked waiting for data. These tasks should be preempted if other tasks can execute instead; otherwise, the resource should be shut down. This implies the use of DPM functionalities. Given these observations, our algorithm uses DVFS to balance the pipeline stage lengths and DPM to shut down unused resources. Our objective is to maintain the same data throughput as if the tasks were executing at full speed while making substantial energy savings.

To be able to balance an application pipeline, we need additional information on the dynamic output rate of a task. Thus we introduce monitors on every communication buffer. For every buffer we specify how many datasets it can contain. We also specify two thresholds. When the higher threshold is reached, we assume that the producer is executing too fast. When the lower threshold is reached, we assume that the producer is not executing fast enough. A specific event is sent to the scheduler when a threshold is crossed. It contains the writing task identifier. An event is also sent when a task is blocked reading an empty buffer, as well as when a task is blocked writing a full buffer. The buffer monitors are summarized in Fig. 2. One objective of balancing pipeline stage lengths is to prevent buffers from getting full, which would block the producer, and from becoming empty, which would block the consumer and could increase the data processing length.
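The buffer monitors described above can be sketched in a few lines. This is a minimal illustration, not the MMU-based mechanism of the platform: the class `MonitoredBuffer`, the `BufferEvent` names and the `notify` callback are hypothetical, a single reader is assumed for simplicity, and a threshold is taken to be "reached" when the fill level becomes equal to it (from below for the higher threshold, from above for the lower one).

```python
from collections import deque
from enum import Enum, auto


class BufferEvent(Enum):
    HIGH_THRESHOLD = auto()  # producer assumed to be running too fast
    LOW_THRESHOLD = auto()   # producer assumed to be running too slowly
    WRITER_BLOCKED = auto()  # write attempted on a full buffer
    READER_BLOCKED = auto()  # read attempted on an empty buffer


class MonitoredBuffer:
    """Bounded FIFO that reports monitor events to a scheduler callback."""

    def __init__(self, capacity, low, high, writer_id, notify):
        if not 0 <= low <= high <= capacity:
            raise ValueError("expected 0 <= low <= high <= capacity")
        self.capacity = capacity
        self.low = low
        self.high = high
        self.writer_id = writer_id
        self.notify = notify          # called as notify(event, task_id)
        self._datasets = deque()

    def write(self, dataset):
        """Store one dataset; return False if the writer is blocked."""
        if len(self._datasets) == self.capacity:
            self.notify(BufferEvent.WRITER_BLOCKED, self.writer_id)
            return False
        self._datasets.append(dataset)
        if len(self._datasets) == self.high:
            # Fill level just rose to the higher threshold.
            self.notify(BufferEvent.HIGH_THRESHOLD, self.writer_id)
        return True

    def read(self, reader_id):
        """Remove and return the oldest dataset, or None if the reader is blocked."""
        if not self._datasets:
            self.notify(BufferEvent.READER_BLOCKED, reader_id)
            return None
        dataset = self._datasets.popleft()
        if len(self._datasets) == self.low:
            # Fill level just fell to the lower threshold.
            self.notify(BufferEvent.LOW_THRESHOLD, self.writer_id)
        return dataset
```

Threshold events carry the writer identifier, as in the text; passing the reader identifier with the blocked-read event is a choice made for this sketch only.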

Fig. 2. Summary of buffer monitors and scheduling implications

To keep our scheduling algorithm as simple as possible, the task priorities are made of a static and a dynamic part. We list the different priority parts by order of importance. First we check the blocked status of the task, as we do not want to give priority to a blocked task. Then the application priority is taken into account. After that, we consider the pipeline position priority: every task is given a priority depending on its position in the streaming pipeline. This gives priority to tasks handling older datasets, i.e. the ones that are deeper in the pipeline. Finally, among tasks that have the same pipeline position priority, we give priority to the task with the emptier buffer. The complete scheduling loop is described in Algorithm 1.
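As an illustration, the priority ordering and one pass of the scheduling loop of Algorithm 1 might be sketched as follows. This is a simplified reading, not the authors' implementation: the `Task` fields, the mode names `"turbo"`, `"half"` and `"idle"`, and the convention that larger numbers mean higher application priority and deeper pipeline position are assumptions. Algorithm 1 does not say which mode a processor takes when it leaves idle for a task that has crossed no threshold; this sketch arbitrarily picks `"turbo"`.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    app_priority: int          # larger = more important application (assumed)
    pipeline_depth: int        # larger = deeper in the pipeline, older data
    blocked: bool = False      # waiting for input data or output space
    buffer_low: bool = False   # output buffer reached the lower threshold
    buffer_high: bool = False  # output buffer reached the higher threshold


def priority_key(task):
    """Sort key: application priority, then pipeline depth, then emptier buffer."""
    return (-task.app_priority, -task.pipeline_depth, not task.buffer_low)


def schedule(tasks, running, modes):
    """One scheduling pass.

    running maps each processor to the name of its current task (or None);
    modes maps each processor to its current power mode.
    Returns the new (running, modes) pair.
    """
    by_name = {t.name: t for t in tasks}
    # Blocked tasks are not candidates; keep as many tasks as processors.
    ready = sorted((t for t in tasks if not t.blocked), key=priority_key)
    selected = ready[:len(running)]
    selected_names = {t.name for t in selected}

    # Keep selected tasks where they already run to limit preemption/migration.
    new_running, free = {}, []
    for proc, name in running.items():
        if name in selected_names:
            new_running[proc] = name
        else:
            free.append(proc)

    # Allocate the remaining selected tasks to the free processors.
    kept = set(new_running.values())
    waiting = [t for t in selected if t.name not in kept]
    for proc, task in zip(free, waiting):
        new_running[proc] = task.name
    for proc in free[len(waiting):]:
        new_running[proc] = None

    # Choose the power mode of every processor.
    new_modes = {}
    for proc, name in new_running.items():
        if name is None:
            new_modes[proc] = "idle"
        elif by_name[name].buffer_low:
            new_modes[proc] = "turbo"
        elif by_name[name].buffer_high:
            new_modes[proc] = "half"
        else:
            # No threshold event: keep the mode, or wake up in turbo (assumption).
            new_modes[proc] = modes[proc] if modes[proc] != "idle" else "turbo"
    return new_running, new_modes
```

Because the sort key is a plain tuple, adding or reordering priority criteria only requires changing `priority_key`.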

4 Implementation

To study and validate our algorithm, we implemented it on a virtual MPSoC. In this section we first present the SESAM simulation framework. Then we describe the specificities of the simulated MPSoC. Finally, we briefly present the WCDMA application used for our performance analysis.

SESAM [13] is a tool that has been specifically built to ease the design of asymmetric multiprocessor architectures. This framework is described in the SystemC description language and allows MPSoC exploration at the TLM level with fast and cycle-accurate simulation. Besides, SESAM uses approximate-timed TLM with explicit time to provide a fast and accurate simulation of complex NoC communications [14]. It performs simulations with an accuracy of 90% compared to fully cycle-accurate models. In addition, the programming model of SESAM is specifically adapted to dynamic applications and global scheduling methods. It is based on the explicit separation of the control and computation parts.

Algorithm 1. The Power-Aware Streaming Application Scheduling Loop

procedure scheduling(task_to_schedule[nb_tasks], status_proc[nb_proc])
    ♦ First we take into account buffer events
    for all tasks to schedule do
        if task is waiting for data then
            remove task from task_to_schedule
        else if task output buffer reached Higher Threshold then
            reset task's buffer priority bit
        else if task output buffer reached Lower Threshold then
            set task's buffer priority bit
        end if
    end for
    ♦ Then we order the tasks by priority
    ordered_tasks[nb_proc] ← sort_task_by_priority(task_to_schedule)
    ♦ We handle already in execution tasks to limit preemption/migration
    for all task already in execution in ordered_tasks do
        remove task from ordered_tasks
        remove proc executing task from freeproc
    end for
    ♦ We allocate tasks not in execution on any processor yet
    for all task left in ordered_tasks do
        execute task on freeproc
    end for
    ♦ Finally we handle the consumption
    for all proc do
        if proc is free then
            proc_mode ← idle_mode
        else if task on proc reached lower threshold then
            proc_mode ← turbo_mode
        else if task on proc reached higher threshold then
            proc_mode ← half_mode
        end if
    end for
end procedure

The processing elements of the SESAM simulator are functional Instruction Set Simulators (ISS) generated by the ArchC tool. Thus, we extended the ArchC ISS to integrate DVFS and DPM models into the SESAM environment. To avoid multiple context switches and accelerate simulation, every ArchC ISS executes multiple instructions at a time, then waits for the time it should have spent executing them.
For every DVFS mode, we calculate the smallest couple (a, b) such that a/b equals the DVFS mode slowing factor. Then, we multiply the number of instructions to be executed by a and the time to wait for these instructions by b.
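The smallest couple (a, b) is simply the slowing factor written as a fraction in lowest terms. The sketch below assumes that the slowing factor is the ratio of the mode frequency to the full-speed frequency, which is one reading of the text; the function names and the frequencies used to exercise them are hypothetical.

```python
from fractions import Fraction


def scaling_couple(mode_freq_hz, full_freq_hz):
    """Return the smallest integers (a, b) with a / b == mode_freq / full_freq."""
    ratio = Fraction(mode_freq_hz, full_freq_hz)  # reduced to lowest terms
    return ratio.numerator, ratio.denominator


def scaled_quantum(instructions, wait_time_ns, couple):
    """Scale one simulation quantum: a times the instructions in b times the wait."""
    a, b = couple
    return instructions * a, wait_time_ns * b
```

For a mode running at three quarters of full speed, the couple is (3, 4): the ISS executes three times as many instructions per quantum and waits four times as long, so the simulated instruction rate drops to 3/4 of the full-speed rate while both quantities stay integers.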

We also calculate the energy spent during the execution of a set of instructions and keep the total energy consumption for each ISS. A DVFS mode switch is modelled as an interruption for the ISS. When it occurs, the ISS computes the time and energy spent in its previous mode. Then it waits for the adequate switching latency, takes into account its switching energy penalty, and finally resumes its execution with the (a, b) couple of the new DVFS mode. In order to model realistic processors, we used the PXA270 Power State Machine (PSM) values [2]. We chose to use only two DVFS modes, Turbo and Half-turbo, and one DPM mode, Deep Idle, as they have acceptable switching latencies compared to our task execution times.

To perform a realistic analysis of our scheduling algorithm, we modelled an asymmetric MPSoC platform with the SESAM simulator. This platform is built from a set of Processing Elements (PE), each made of a processor equipped with a TLB, a 1KB instruction cache and a 1KB data cache. They are connected to a set of shared 2ns-latency L2 memories through a 2ns-latency multibus. Communication between tasks is made possible by HAL functions. Data coherency is guaranteed by a Memory Management Unit (MMU). The buffers used for our algorithm are modelled using a specific HAL, and the buffer thresholds are handled by the MMU. Preemption and migration of tasks are possible, and their cost is reduced thanks to the shared memory and the virtualization of the memory space enabled by the use of TLBs [13].

The central controller is made of a processor with its own caches and memory. It is connected to the PEs and the MMU through another timed multibus. Its specific HAL enables it to send configuration, execution, preemption or consumption mode switch orders. It can also be interrupted by any PE to be informed of the end of a task execution. The MMU also interrupts the controller whenever a task is blocked (or no longer blocked) waiting for input data or output space, as well as when a buffer threshold is crossed. We did not fix the number of PEs, so as to study how our scheduling algorithm copes with different processor loads.

To evaluate the impact of our algorithm on a streaming application, we used a well-known telecommunication application: a WCDMA encoder/decoder [15]. The application was pipelined and implemented on the simulated target MPSoC. The WCDMA application integrates an encoder followed by a decoder and is consequently built of 13 tasks. This allows having more tasks than resources on the SCMP platform, to stress the potential scheduling anomalies. This application is characterized by an unbalanced pipeline whose slowest tasks are the FIR filters. In addition, dynamism is found in the task execution length, as pilot frames are processed instead of actual data.
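The per-ISS energy bookkeeping described earlier in this section (energy accumulated in the current mode, then a latency and an energy penalty charged at each mode switch) can be sketched as below. The class `EnergyMeter`, its method names and the power values used to exercise it are hypothetical; constant power per mode is assumed, and the real model relies on the PXA270 PSM values, which are not reproduced here.

```python
class EnergyMeter:
    """Accumulates the energy of one simulated processor across mode switches."""

    def __init__(self, mode_power_w, start_mode, now_s=0.0):
        self.mode_power_w = dict(mode_power_w)  # mode name -> power in watts
        self.mode = start_mode
        self.since_s = now_s    # time at which the current mode became active
        self.total_j = 0.0

    def switch(self, new_mode, now_s, latency_s, penalty_j):
        """Account for the old mode, charge the switch, return the resume time."""
        # Energy spent in the previous mode since it became active.
        self.total_j += self.mode_power_w[self.mode] * (now_s - self.since_s)
        # Fixed energy penalty of the transition itself.
        self.total_j += penalty_j
        self.mode = new_mode
        # Execution resumes in the new mode only after the switching latency.
        self.since_s = now_s + latency_s
        return self.since_s
```

A simulator loop would call `switch` from its mode-change interruption handler and stall the ISS until the returned resume time.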

5 Results

To study the impact of our scheduling algorithm, we compare it to two simpler versions of the algorithm. The first version does not handle power issues. It simply schedules tasks relying on pipeline stage position and blocked states. All processors are kept in Turbo mode. It is referred to as the no energy handling scheduling. The second version is called DPM-only scheduling. This corresponds to a naive power-aware approach: unused resources and resources executing blocked tasks are put into Deep Idle mode. Finally, our proposed algorithm is referred to as DPM + DVFS scheduling.

Fig. 3. Panels (a), (b), (c) and (d) were obtained with the same WCDMA application sending 256 frames. The communication buffers were 8 frames long and had a higher threshold identical to the lower one, set to 2 frames. (a) Total execution time for the WCDMA application as a function of the number of processing resources and the scheduling algorithm used; execution time overhead of our solution compared to the no energy handling algorithm. (b) Total processor effective occupancy and energy saving as a function of the number of processing resources and the scheduling algorithm used. (c) Average time spent in Deep Idle mode compared to the time spent in unused state or waiting for data for a processor when using our proposed algorithm. (d) Comparison of the average time a processor spends waiting for data in the case of the no power saving algorithm and of our solution (DPM+DVFS): influence of the Half-Turbo mode usage on blocking states.

As shown in Fig. 3(a), the total execution time of the WCDMA application is not affected by our scheduling algorithm, no matter how many processing resources there are. The variation in execution time is always kept below 1.2%. In addition, our algorithm allowed a good acceleration of the processing for streaming applications. While we managed to maintain the execution time of the scheduling without energy awareness, Fig. 3(b) shows that substantial energy savings were made.

As soon as the processor effective occupancy drops, it is directly compensated by our power saving method. With 13 processors we reduced the power consumption by 45%. In addition, our method obtains better results than the DPM-only scheduling, which only reaches 37% energy saving in that case. Fig. 3(c) illustrates how our scheduling algorithm uses the DPM mode in a real application case. The figure shows that when processors spend little time waiting for data or in unused state (below 17%), the Deep Idle mode is seldom used. When the wasted time increases, the DPM usage curve follows the unused or blocked processor curve as planned. In fact, when the number of processing elements is small, there is often another task ready to be executed immediately. For low PE numbers the wasted time corresponds to the control overhead. The controller lacks reactivity to reach higher computing performance or power saving. Finally, Fig. 3(d) studies the impact of DVFS mode usage on the application execution. We compare the execution of our algorithm to the no energy handling scheduling. The analysis shows that when DVFS modes are used, they drastically reduce the amount of time spent in blocking states (42% reduction for 13 processors). Thus, our algorithm succeeds in balancing the execution lengths of the streaming pipeline stages efficiently when the processor usage drops. As a result, the processor load is increased with our algorithm compared to the no energy handling scheduling, as Fig. 3(b) shows.

6 Conclusion

In this paper we presented a new power-aware scheduling algorithm for pipelined applications in MPSoC environments. The algorithm was implemented on a virtual MPSoC platform simulated with the SESAM environment. A substantial energy consumption gain was obtained compared to a classic data dependency scheduling that only takes into account blocking states. For a WCDMA application executing on a platform with 13 PEs, our scheduling algorithm reduced the power consumption of the processing resources by 45%. In addition, the use of DVFS and DPM did not impact the application execution speed. The variation in execution speed was kept below 2%. Moreover, our algorithm succeeded in maintaining a high processor load. As a result, our algorithm allows a good acceleration of the execution speed of streaming applications in MPSoCs while efficiently managing power consumption issues through the use of DVFS and DPM capabilities. In addition, as our algorithm is fully online and can handle the scheduling of more tasks than processors, we can manually shut down some processing resources to lower the power budget while guaranteeing a correct execution.

Acknowledgements

Part of the research leading to these results has received funding from the ARTEMIS Joint Undertaking under grant agreement no. 100029.

References

1. Venkatachalam, V., Franz, M.: Power Reduction Techniques For Microprocessor Systems. ACM Computing Surveys (CSUR) 37(3), 195–237 (2005)
2. Intel PXA27x Processor Family, Electrical, Mechanical, and Thermal Specification (2005)
3. Dertouzos, M.L., Mok, A.K.: Multiprocessor Online Scheduling of Hard-Real-Time Tasks. IEEE Transactions on Software Engineering 15(12), 1497–1506 (1989)
4. Benini, L., Bertozzi, D., Guerri, A., Milano, M.: Allocation, Scheduling and Voltage Scaling on Energy Aware MPSoCs. In: Beck, J.C., Smith, B.M. (eds.) CPAIOR 2006. LNCS, vol. 3990, pp. 44–58. Springer, Heidelberg (2006)
5. Lu, Y.-H., Benini, L., De Micheli, G.: Dynamic Frequency Scaling with Buffer Insertion for Mixed Workloads. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 21(5), 1284–1305 (2002)
6. Pettis, N., Cai, L., Lu, Y.-H.: Statistically Optimal Dynamic Power Management for Streaming Data. IEEE Transactions on Computers 55(7), 800–814 (2006)
7. Kim, K.H., Buyya, R., Kim, J.: Power Aware Scheduling of Bag-of-Tasks Applications with Deadline Constraints on DVS-enabled Clusters. In: IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp. 541–548 (2007)
8. Zhang, F., Chanson, S.T.: Power-Aware Processor Scheduling under Average Delay Constraints. In: IEEE Real Time on Embedded Technology and Applications Symposium (RTAS), pp. 202–212 (2005)
9. Choudhury, P., Chakrabarti, P.P., Kumar, R.: Online Dynamic Voltage Scaling using Task Graph Mapping Analysis for Multiprocessors. In: International Conference on VLSI Design (VLSID), pp. 89–94 (2007)
10. Hua, S., Qu, G., Bhattacharyya, S.S.: Energy-Efficient Embedded Software Implementation on Multiprocessor System-on-Chip with Multiple Voltages. ACM Transactions on Embedded Computing Systems (TECS) 5(2), 321–341 (2006)
11. Zhang, F., Chanson, S.T.: Blocking-Aware Processor Voltage Scheduling for Real-Time Tasks. ACM TECS 3(2), 307–335 (2004)
12. Im, C., Kim, H., Ha, S.: Dynamic Voltage Scheduling Technique for Low-Power Multimedia Applications Using Buffers. In: ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 34–39 (2001)
13. Ventroux, N., Guerre, A., Sassolas, T., Moutaoukil, L., Bechara, C., David, R.: SESAM: an MPSoC Simulation Environment for Dynamic Application Processing. In: IEEE International Conference on Embedded Software and Systems, ICESS (2010)
14. Guerre, A., Ventroux, N., David, R., Merigot, A.: Approximate-Timed Transactional Level Modeling for MPSoC Exploration: A Network-on-Chip Case Study. In: IEEE Euromicro Symposium on Digital Systems Design (DSD), pp. 390–397 (2009)
15. Richardson, A.: WCDMA Design Handbook (2006)

An Automated Framework for Power-Critical Code Region Detection and Power Peak Optimization of Embedded Software

Christian Bachmann¹, Andreas Genser¹, Christian Steger¹, Reinhold Weiß¹, and Josef Haid²

¹ Institute for Technical Informatics, Graz University of Technology, Austria
² Infineon Technologies Austria AG, Design Center Graz, Austria

Abstract. In power-constrained mobile systems such as RF-powered smart-cards, power consumption peaks can lead to supply voltage drops threatening the reliability of these systems. In this paper we focus on the automated detection and reduction of power consumption peaks caused by embedded software. We propose a complete framework for automatically profiling embedded software applications by means of the power emulation technique and for identifying the power-critical software source code regions causing power peaks. Depending on the power management features available on the given device, an optimization strategy is chosen and automatically applied to the source code. In comparison to the manual optimization of power peaks, the automatic approach decreases the execution time overhead while only slightly increasing the required code size.

1 Introduction

The power consumption of embedded systems is increasingly dependent on software applications determining the utilization of system components and peripherals. Furthermore, the embedded software actuates power management features such as voltage and frequency scaling as well as dedicated sleep or hibernation states. Hence, software applications impact the average as well as the peak power consumption, which in turn affects the reliability, stability and security of embedded systems. Especially for RF-powered devices such as contactless smart-cards, power peaks threaten the system reliability by impacting the power supply circuit and leading to supply voltage drops [1]. These supply voltage drops can in turn result in system resets or, even worse, in erroneous system states. Therefore, power peak reduction and elimination methods for embedded software have been proposed [2–4]. Furthermore, power peak reduction techniques have been studied for the purpose of power profile flattening in hardware implementations [5–7]. For security applications, profile flattening serves as a countermeasure against power analysis attacks.
In this paper we propose an automated methodology for profiling a software application's power consumption and deriving a power peak optimized implementation. Based on an integrated supply voltage simulation, critical code regions are detected and optimized. While existing software optimization methods employ either instruction-level power simulators [2–4] or physical on-chip power measurements [5–7] to obtain power profiles, our approach utilizes a high-level power emulation technique previously introduced in [8]. Using this technique, cycle-accurate run-time power estimates are derived from the functional emulation of the system-under-test. In comparison to measurement-based approaches, the joint functional and power emulation offers the advantage of an inherent correspondence between the power profile and the functional execution trace, i.e., a power consumption value can be determined for each executed instruction. Furthermore, the emulation is cycle-accurate while still allowing for rapid profiling of long program sequences. This constitutes an advantage over simulation-based approaches, which lack either simulation detail, and hence accuracy, or simulation speed.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 11–20, 2011. © Springer-Verlag Berlin Heidelberg 2011

In contrast to hardware power profile flattening approaches, no additional on-chip measurement and control hardware is required. Furthermore, as opposed to power peak reduction methods that modify intermediate language representations of the given software application [2, 3], our approach operates on and modifies the original C or assembler source code. The resulting power peak optimized source code can afterwards still be manually modified by the software engineer if required. In the context of embedded software power peak optimization, the novel contributions of this paper are as follows:

– We present a framework for detecting source code regions causing power peaks by analyzing the power consumption as well as the functional debug information obtained during software execution.
– We derive an optimization algorithm, actuating power management features for these power-critical source code regions and hence reducing the number of power peaks.
– Finally, we illustrate the feasibility of our approach on a power-constrained deep-submicron smart-card controller system.

This paper is structured as follows. In Section 2 we discuss related work on power peak optimization and power profile flattening. Section 3 presents our automated framework for power-critical code region detection and optimization. We illustrate the effectiveness of our approach in Section 4. Finally, conclusions drawn from our current work are summarized in Section 5.

2 Related Work

Due to the large influence of software on both the average and the peak power consumption of embedded systems, numerous works have studied power- and energy-aware software optimization methods. With regard to power-constrained devices, power profile flattening and the optimization of power consumption peaks are of increased interest. These power peaks are often caused by the occurrence of power-critical events during software execution. Especially in battery- and RF-powered devices, these peaks can severely impact the power supply circuit and can lead to supply voltage drops [1]. These supply voltage drops seriously jeopardize the stability and hence the reliability of the given system. Power profile flattening hardware implementations have been studied in the context of security-related applications. In the security domain, the reduction of profile variability is of increased interest as a countermeasure against power analysis attacks [9].
For the purpose of reliability enhancements, the reduction of power peaks has been investigated in [3] by means of a simulation-based peak elimination framework using iterative compilation. Other attempts at power peak reduction have focused on instruction reordering to minimize the switching activity due to circuit state changes [2] as well as non-functional instruction (NFI) insertion [4]. Power profile flattening in security applications, aiming at hindering power analysis attacks by means of NFI insertion, was studied in [5]. Both software and hardware implementations were shown. In [6] a current-injection-based real-time flattening method has been proposed. This approach has been extended in [7] by a voltage scaling capability for improved flattening performance.

3 Automated Power-Critical Code Region Detection and Power Peak Optimization of Embedded Software

Our automated power profiling and power-critical code region detection methodology, as depicted in Figure 1, builds upon a standard software development flow (A) and our run-time power profiling approach (B). The power estimates, alongside the functional traces, are analyzed to detect power-critical code regions (C). After these regions have been detected, an optimization algorithm is used to reduce the power consumption and hence the power peaks during these critical code regions (D).

[Figure 1: flow diagram with four blocks: (A) Standard Software Development Flow, (B) Run-Time Power Profiling, (C) Detection of Power-Critical Code Regions, (D) Power Peak Code Optimization.]

Fig. 1. Automated flow for power profiling, power-critical code region detection and optimization

3.1 Run-Time Power Profiling Based on Power Emulation

For the purpose of detecting power-critical code regions, power profiling of the given software application has to be performed first. In contrast to existing software power peak optimization approaches, we employ the power emulation technique previously introduced in [8] to obtain power profiles for the software application's execution. The principle of power emulation, as depicted in Figure 2, is to augment the functionally emulated system-under-test with special power estimation hardware. This power estimation hardware monitors the state of the system and its subcomponents. Based on these state data, the power estimator derives cycle-accurate run-time power estimates according to an integrated high-level power model.
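The idea of deriving a per-cycle power estimate from monitored component states can be illustrated with a small software sketch. It assumes that the high-level power model is a plain per-component state lookup whose contributions are summed; the actual model of [8] is not given in this section, and the component names, state names and coefficients below are hypothetical.

```python
# Hypothetical per-state power coefficients in arbitrary units (not measured data).
POWER_MODEL = {
    "cpu": {"idle": 1.0, "active": 4.0},
    "ram": {"idle": 0.5, "read": 2.0, "write": 2.5},
    "coproc": {"off": 0.0, "busy": 6.0},
}


def estimate_cycle(states, model=POWER_MODEL):
    """Sum the coefficient of each component's current state for one clock cycle."""
    return sum(model[component][state] for component, state in states.items())


def estimate_trace(state_trace, model=POWER_MODEL):
    """Return one power estimate per cycle for a list of per-cycle state snapshots."""
    return [estimate_cycle(states, model) for states in state_trace]


# Three consecutive cycles of an invented execution.
trace = [
    {"cpu": "active", "ram": "read", "coproc": "off"},
    {"cpu": "active", "ram": "idle", "coproc": "busy"},
    {"cpu": "idle", "ram": "idle", "coproc": "off"},
]
profile = estimate_trace(trace)
```

Because each estimate is indexed by the cycle it was computed for, it can be paired directly with the instruction executed in that cycle, which is the trace correspondence the text refers to.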

[Figure 2: block diagram. On the FPGA board, the functionally emulated system (CPU, coprocessors, memories) passes component states to power sensors, each with its own power model; the power estimator, with power averaging and a debug trace generator, delivers the trace of power estimates and the trace of functional execution to the host PC for power and functional verification.]

Fig. 2. Embedded software power profiling utilizing power emulation: Run-time power estimation and functional execution trace generation (adapted from [8])

As compared to low-level simulation-based power profiling, the power emulation technique largely reduces profiling time. This allows for the profiling of complex software applications and elaborate program sequences, such as the booting process of an operating system. In contrast to high-level simulators, power emulation offers the benefit of cycle-accuracy that instruction- or system-level simulators fail to deliver. Furthermore, power emulation offers the advantage of an inherent correspondence between the power profile and the functional execution trace as compared to measurement-based approaches.

3.2 Power-Critical Code Region Detection

Our power-critical code region detection approach, as depicted in Figure 1, consists of multiple stages. First, the functional execution trace obtained in the joint functional and power emulation step is used to establish the source code correlation, i.e., to identify the source code region corresponding to each execution trace message. Second, using the power emulation trace as input data, a supply voltage simulation employing a numerical model of the RF-supply is performed¹. Third, the resulting supply voltage profile is utilized to identify power peaks leading to critical voltage drops and to find the source code regions causing these drops.
Figure 3 depicts the inductively coupled power supply of a contactless smart-card device. The impact of power peaks on the supply voltage level, however, depends on the duration, power level and rate of these peaks, as shown in Figure 4. We define power-critical source code regions as parts of an embedded software application resulting in power peaks that lead to supply voltage drops below a critical limit. These peaks can be caused by, e.g., phases of high processor activity, a number of consecutive memory read or write accesses, and co-processor as well as power-intensive peripheral activity. In order to identify power peaks that actually lead to critical supply voltage drops on the given system, a supply voltage simulation based on the emulated power profile is performed.

¹ Due to the limited computational complexity of the numerical RF-supply model, a simulation-based implementation is adequate.
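The last detection stage can be sketched as follows, under simplifying assumptions: the supply voltage profile and the source correlation are given as two equally long per-cycle lists, and a source line is reported whenever it executes while the voltage is below the limit. Attributing a drop to the code that built up the preceding power peak would need a look-back window, which is omitted here. The function name and the data layout are hypothetical.

```python
def find_critical_lines(voltage, trace, v_limit):
    """Count, per source location, the cycles executed below the voltage limit.

    voltage[i] is the simulated supply voltage in cycle i and trace[i] is the
    (file, line) pair correlated with the instruction executed in cycle i.
    """
    if len(voltage) != len(trace):
        raise ValueError("voltage profile and trace must have the same length")
    critical = {}
    for v, location in zip(voltage, trace):
        if v < v_limit:
            critical[location] = critical.get(location, 0) + 1
    return critical


# Invented normalized data: the voltage falls below 0.8 while line 3 executes.
voltage = [1.0, 0.9, 0.75, 0.7, 0.85]
trace = [("a.c", 1), ("a.c", 2), ("a.c", 3), ("a.c", 3), ("a.c", 4)]
report = find_critical_lines(voltage, trace, v_limit=0.8)
```

The per-location cycle count gives a rough duration for each critical region, which a later optimization step can use to choose between strategies.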

[Figure 3: schematic of a reader device and a smart-card embedded system coupled through the magnetic field H. Figure 4: power [normalized] and supply voltage [normalized] over time [normalized], with the supply voltage limit marked.]

Fig. 3. Inductively coupled power supply of RF-powered smart-card embedded system (adapted from [10])

Fig. 4. Impact of different power peaks on the supply voltage (voltage drops)

3.3 Optimization of Power-Critical Source Code Regions

The subsequent power-critical code region optimization algorithm, shown in Algorithm 1, applies code modifications for power peak reduction to the original C or assembler source code. Depending on the power management features available on the given system, the frequency scaling and the NFI insertion techniques are applied to these power-critical regions. Listing 1.1 illustrates the insertion of frequency scaling control instructions around the call-site² of a function causing power peaks, whereas Listing 1.2 shows the use of NFI insertion within a loop causing short power peaks.

    start_f_scaling();
    power_critical_function();
    stop_f_scaling();

Listing 1.1. f-scaling example

    while(loop_condition) {
        short_loop_instruction;
        nop(); //NFI
    }

Listing 1.2. NFI insertion example

The algorithm operates in three major stages: (1) The power-critical code regions for each function are determined. If a large part of a function constitutes the power-critical code region, the algorithm chooses to optimize the entire function. In this case the call-sites of the function are searched and marked for modification instead of the function itself. (2) Consecutive source code lines marked for modification are grouped into modification clusters. For each of those clusters, the algorithm chooses an optimization strategy based on the cluster's number of power peaks and their respective duration: short power peaks are likely to be resolved by NFI insertion, whereas longer power peaks or longer groups of peaks can be reduced by applying frequency scaling. (3) Each of the found source code clusters is then modified in the chosen way and the modified code is written back to the source files.

² The source code line calling a particular function.

Algorithm 1. Power-Critical Source Code Region Optimization

Input: Set of application source code S, List of power-critical code regions L, Threshold of max. percentage of power-critical lines per function Th_clpf, Threshold of f-scaling time penalty Th_f-scale
Output: Set of optimized application source code S_o

Step 1, group by function:
  List of affected source code lines L_sl := {}
  foreach Function f in S do
    Find source code lines of f in L
    if Found source code lines > 0 then
      Calculate percentage of power-critical code region in function
      if Percentage > Th_clpf then
        Find call-sites of function f, add source code lines of call-sites to L_sl
      else
        Add source code lines to L_sl

Step 2, cluster lines to modify & choose optimization strategy:
  L_slc := Cluster consecutive source code lines in L_sl
  foreach Source code cluster C in L_slc do
    if Duration C > Th_f-scale then
      Mark cluster C for f-scaling
    else
      Mark cluster C for NFI insertion

Step 3, perform modification:
  S_o := S
  foreach Source code cluster C in L_slc do
    Modify S_o by inserting selected optimization instructions
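Steps 1 and 2 of Algorithm 1 can be sketched in a few lines of code. The sketch below is an illustration under stated assumptions, not the authors' tool: it handles a single source file, represents functions and call-sites as plain line-number collections, uses a fraction instead of a percentage for the threshold, and measures peak duration in an arbitrary unit per line. All names and the example data are hypothetical. Step 3, the rewriting of the source files, is left out.

```python
def select_lines(functions, call_sites, critical, th_clpf):
    """Step 1: map each line to modify to the peak duration attributed to it.

    functions:  {function name: list of its source line numbers}
    call_sites: {function name: list of line numbers calling it}
    critical:   {power-critical line number: peak duration}
    th_clpf:    maximum fraction of critical lines before the whole
                function is optimized at its call-sites
    """
    marked = {}
    for name, body in functions.items():
        hit = [line for line in body if line in critical]
        if not hit:
            continue
        duration = sum(critical[line] for line in hit)
        if len(hit) / len(body) > th_clpf:
            # Large critical share: modify the call-sites instead.
            for site in call_sites.get(name, []):
                marked[site] = marked.get(site, 0) + duration
        else:
            for line in hit:
                marked[line] = marked.get(line, 0) + critical[line]
    return marked


def cluster_lines(marked):
    """Step 2a: group consecutive marked line numbers into clusters."""
    clusters = []
    for line in sorted(marked):
        if clusters and line == clusters[-1][-1] + 1:
            clusters[-1].append(line)
        else:
            clusters.append([line])
    return clusters


def choose_strategies(clusters, marked, th_fscale):
    """Step 2b: long clusters get f-scaling, short ones NFI insertion."""
    plan = []
    for cluster in clusters:
        duration = sum(marked[line] for line in cluster)
        strategy = "f-scaling" if duration > th_fscale else "nfi"
        plan.append((cluster[0], cluster[-1], strategy))
    return plan


# Invented example: f is mostly critical, g only has a short critical loop.
functions = {"f": [10, 11, 12, 13], "g": list(range(20, 30))}
call_sites = {"f": [40]}
critical = {10: 30, 11: 30, 12: 30, 22: 5, 23: 5}
marked = select_lines(functions, call_sites, critical, th_clpf=0.5)
plan = choose_strategies(cluster_lines(marked), marked, th_fscale=50)
```

In this example the call-site of `f` at line 40 is planned for frequency scaling, while the two adjacent critical lines inside `g` form one short cluster planned for NFI insertion.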

4 Experimental Results

For evaluating our framework, a smart-card microcontroller test-system supplied by our industrial partner was employed. For different benchmarking applications, power profiles were recorded using the power emulation technique. Afterwards, these benchmarks were optimized both in a manual as well as in an automated way utilizing the presented framework. This allows for evaluating the effectiveness of our method.

4.1 Test System for Power Peak Optimization

The smart-card microcontroller test system used consists of a 16-bit pipelined cache architecture. It comprises volatile and non-volatile memories as well as a number of peripherals, e.g., cryptographic coprocessors, timers, and random number generators. The system has been augmented with a power emulation unit, as depicted in Figure 5, to allow for the generation of run-time power estimates. For detecting power peaks leading to problematic supply voltage drops, we have implemented an RF power supply equivalent circuit model as proposed in [1] and depicted in Figure 6. Based on power consumption changes in the microcontroller test-system, the load current i_l(t) changes and affects the load voltage v_l(t). In phases of high power consumption and thus high load currents, when the required load current is higher than the supplied source current i_s(t), the energy storage capacitor delivers the missing fraction i_c(t). However, for longer power peaks or a longer series of short power peaks, the capacitor fails to deliver the required current, resulting in a critical supply voltage drop.
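The behaviour described above can be reproduced qualitatively with a very small numerical model. The sketch assumes a source voltage V_s behind an internal resistance R_i, a storage capacitor C at the load node and a load drawing the emulated power p, so that C · dv_l/dt = i_s − i_l with i_s = (V_s − v_l)/R_i and i_l = p/v_l, integrated with the forward Euler method. These equations, the omission of any voltage limiter and all parameter values are assumptions made for illustration; they are not taken from the model in [1].

```python
def simulate_supply(power, v_s=1.0, r_i=1.0, c=5.0, dt=1.0):
    """Return the load voltage after each step of a normalized power profile."""
    v = v_s  # start with the storage capacitor charged to the source voltage
    profile = []
    for p in power:
        i_s = (v_s - v) / r_i  # current delivered by the source
        i_l = p / v            # current drawn by the load at this power level
        # The capacitor supplies (or absorbs) the difference i_l - i_s.
        v += dt * (i_s - i_l) / c
        if v <= 0.0:
            raise RuntimeError("supply collapsed; model no longer valid")
        profile.append(v)
    return profile


# Invented normalized power profiles: no peak, a short peak and a long peak.
flat = simulate_supply([0.1] * 200)
short_peak = simulate_supply([0.1] * 100 + [0.2] * 2 + [0.1] * 98)
long_peak = simulate_supply([0.1] * 100 + [0.2] * 20 + [0.1] * 80)
```

With these invented parameters the short peak is largely buffered by the capacitor, whereas the long peak pulls the voltage much closer to its new, lower equilibrium, which mirrors the dependence on peak duration described in the text.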

Fig. 5. 16-bit smart-card microcontroller test system augmented by power emulation unit (adapted from [11])

Fig. 6. Equivalent circuit of the RF power supply of the test system (adapted from [1])

4.2 Comparison of Original and Optimized Power Consumption and Supply Voltage Profiles

We illustrate the optimization result by comparing the power consumption and the respective supply voltage profiles of a given software application. Figure 7 shows the results obtained during profiling of the original application. After the power-critical code region detection and optimization, the power profiling and supply voltage simulation were repeated, yielding the profiles depicted in Figure 8.

[Figures 7 and 8: each shows power [normalized] and supply voltage [normalized] over time [normalized], with the limit V_Limit marked in the voltage plot. Figure 7 highlights the power peaks and the resulting voltage drops; Figure 8 shows the reduced voltage drops.]

Fig. 7. Unoptimized power consumption and resulting supply voltage profiles of authentication benchmarking application³

Fig. 8. Optimized power consumption and resulting supply voltage profiles of authentication benchmarking application³

The results illustrate how a number of power peaks result in supply voltage drops below the critical limit. By applying frequency scaling and NFI insertion to the code regions causing these peaks, their power consumption and hence their supply voltage impact can be diminished. Note that this modification, while improving system stability and reliability, comes at the cost of a slightly increased execution time. However, as illustrated in the subsequent section, the additionally required execution time is smaller for the automatically than for the manually optimized version because the frequency scaling and the NFI insertion are applied more selectively.

4.3 Impact of Power Peak Optimization on Execution Time and Code Size

We have applied the power peak optimization algorithm to various benchmarking applications in order to evaluate its impact on the execution time and the code size. For comparison we have also manually optimized the given benchmarking applications by applying frequency scaling to the entire benchmark. For both the manual and the automatic approach, all power peaks resulting in critical supply voltage drops have been eliminated. Figure 9 illustrates these results for two general purpose microcontroller benchmarks (Coremark [12] and Dhrystone) as well as for two domain-specific ones (Authentication and Crypto).

³ Data normalized due to existing NDA.

[Figure 9: two bar charts, execution time [%] per test case and code size [%] per test case, for the Authentication, Coremark, Crypto and Dhrystone benchmarks, each comparing the original, the manual optimization and the automatic optimization.]

Fig. 9. Execution time and code size of original, manually as well as automatically modified benchmarks⁴

The results show that in terms of execution time the automatic approach outperforms the manual optimization due to the finer granularity of code modifications. For the manual optimization approach the execution time increases by ∼10% due to the minimally required frequency reduction of ∼10% for eliminating all critical supply voltage drops. However, for the automatic approach this increase is in the range of only 1.2% (Crypto) up to 6.8% (Authentication) depending on the number and duration of power peaks. Note that the increase in execution time also depends on the ratio of code regions affected by power peaks that need to be optimized to regions requiring no optimization.
Furthermore, we compare the increase in code size caused by the insertion of frequency scaling control instructions and NFIs. This increase is almost negligible for the manual approach (smaller than or ∼1% for all testcases). For the automatic approach, the increase is slightly higher and in the range of 0.2% (Crypto) up to 3.2% (Dhrystone).
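The dependence of the execution time increase on the share of affected code can be made concrete with a back-of-the-envelope estimate. The sketch assumes that execution time scales inversely with the clock frequency inside scaled regions and neglects the cost of NFIs and of the scaling control instructions themselves; the function name and the example fractions are hypothetical and are not fitted to the measured results.

```python
def execution_time_overhead(scaled_fraction, freq_reduction):
    """Relative execution time increase when only part of the run is slowed down.

    scaled_fraction: share of the original execution time spent in regions
                     that run at reduced frequency (0.0 to 1.0)
    freq_reduction:  relative frequency reduction in those regions (0.0 to <1.0)
    """
    if not 0.0 <= scaled_fraction <= 1.0:
        raise ValueError("scaled_fraction must lie in [0, 1]")
    if not 0.0 <= freq_reduction < 1.0:
        raise ValueError("freq_reduction must lie in [0, 1)")
    slowdown = 1.0 / (1.0 - freq_reduction) - 1.0
    return scaled_fraction * slowdown


# Scaling the entire benchmark versus only a quarter of it (invented shares).
whole = execution_time_overhead(1.0, 0.10)
quarter = execution_time_overhead(0.25, 0.10)
```

Under these assumptions, restricting frequency scaling to the regions that actually cause critical peaks reduces the overhead in proportion to the time spent in those regions, which is the effect attributed above to the finer granularity of the automatic approach.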

5 Conclusions

The power consumption of embedded systems is to a large extent determined by software applications, actuating power management features as well as controlling the overall system activity. Power peaks, caused by power-critical software events, can seriously impact the supply voltage and lead to critical supply voltage drops. These voltage drops pose a threat to the reliability of power-constrained mobile devices such as RF-powered smart cards.
In this paper we have outlined an automated framework aimed at power peak detection utilizing the emulation-based power profiling of given embedded software applications. By identifying the software code regions causing power peaks, the framework is able to selectively apply power reduction strategies, such as frequency scaling and non-functional instruction insertion, to the affected regions. Furthermore, we have evaluated the effectiveness of this automated power peak optimization framework on a number of benchmarking applications. For these benchmarks the inherent execution time increase is in the range of only 1.2% up to 6.8% for the automatic modifications as compared to ∼10% for the manual ones.

⁴ Data normalized due to existing NDA.

Acknowledgements

We would like to thank the Austrian Federal Ministry for Transport, Innovation, and Technology for providing us with funding for the POWERHOUSE project under FIT-IT contract FFG 815193, as well as our industrial partners Infineon Technologies Austria AG and Austria Card GmbH for their enduring support.

References

1. Haid, J., Kargl, W., Leutgeb, T., Scheiblhofer, D.: Power management for RF-powered vs. battery-powered devices. In: TMCS (2005)
2. Grumer, M., Wendt, M., Steger, C., Weiss, R., Neffe, U., Muehlberger, A.: Automated software power optimization for smart card systems with focus on peak reduction. In: AICCSA (2007)
3. Grumer, M., Wendt, M., Lickl, S., Steger, C., Weiss, R., Neffe, U., Muehlberger, A.: Software power peak reduction on smart card systems based on iterative compiling. Emerging Directions in Embedded and Ubiquitous Computing (2007)
4. Wendt, M., Grumer, M., Steger, C., Weiss, R., Neffe, U., Muehlberger, A.: System level power profile analysis and optimization for smart cards and mobile devices. In: SAC (2008)
5. Muresan, R., Gebotys, C.: Current flattening in software and hardware for security applications. In: CODES+ISSS (2004)
6. Li, X., Vahedi, H., Muresan, R., Gregori, S.: An integrated current flattening module for embedded cryptosystems. In: ISCAS (2005)
7. Vahedi, H., Muresan, R., Gregori, S.: On-chip current flattening circuit with dynamic voltage scaling. In: ISCAS (2006)
8. Genser, A., Bachmann, C., Haid, J., Steger, C., Weiss, R.: An emulation-based real-time power profiling unit for embedded software. In: SAMOS (2009)
9. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol. 1666, p. 388. Springer, Heidelberg (1999)
10. Finkenzeller, K.: RFID Handbook. John Wiley & Sons Ltd., Chichester (2003)
11. Bachmann, C., Genser, A., Steger, C., Weiss, R., Haid, J.: Automated power characterization for run-time power emulation of soc designs. In: 13th Euromicro DSD (2010) (in press)
12. http://www.coremark.org/

System Level Power Estimation of System-on-Chip Interconnects in Consideration of Transition Activity and Crosstalk

Martin Gag, Tim Wegner, and Dirk Timmermann

Institute of Applied Microelectronics and Computer Engineering, University of Rostock [email protected] www.networks-on-chip.com

Abstract. As technology reaches nanoscale order, interconnection systems account for the largest part of power consumption in Systems-on-Chip. Hence, an early and sufficiently accurate power estimation technique is needed for making the right design decisions. In this paper we present a method for system-level power estimation of interconnection fabrics in Systems-on-Chip. Estimations with simple average assumptions regarding the data stream are compared against estimations considering bit level statistics in order to include low level effects like activity factors and crosstalk capacitances. By examining different data patterns and traces of a video decoding system as a realistic example, we found that the data dependent effects are not negligible influences on power consumption in the interconnection system of nanoscale chips. Due to the use of statistical data there is no degradation of simulation speed in our approach.

1 Introduction

Lowering the power consumption of microsystems is one of the main topics in chip design and technology development. This challenge has to be tackled not only because of the demand for energy savings and extended run times of mobile devices, but also to avoid problems concerning cooling and reliability.

Shrinking and further enhancements of technology structures mainly lower the dynamic power consumption and the size of transistors. As logic devices dissipate less and less energy and become smaller, the integration density rises. Therefore, more interconnects between these elements are needed. The power consumption of the wires largely remains at the same level, because the wires cannot be made smaller and have to be placed close to each other, which raises the capacitances even when ultra low-k materials are used. The share of energy consumed in the interconnection system therefore increases compared to the overall energy dissipation. In the Intel 80-core processor, for example, the communication system is responsible for over 28 % of the overall power budget [1]. Hence, the energy consumed in the interconnection system of microchips is becoming more and more important.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 21–30, 2011.

Springer-Verlag Berlin Heidelberg 2011

During the design process power consumption has to be estimated in every design step to make sure that the constraints of every part of the system as well as of the whole system are met. The early phases of architectural, algorithmic and system design are very important parts of the whole process. Precise high level power estimation leads to better designs, as high level design changes are known to have more significant effects than enhancements at lower levels.

At early design stages wire mappings and cycle-accurate behavior are mostly not known, which makes system level power estimation difficult. We tackle this problem with a mixture of well accepted assumptions regarding technology parameters and statistical information that represents the characteristics of the data transmitted on-chip. For this purpose, different data patterns are evaluated to obtain significant statistics of transition probabilities and crosstalk effects. The resulting statistical data is provided to a power model. This mixture of high level information and low level assumptions facilitates more accurate power estimation than relying on high level design information alone.

The following section relates this paper to the state of the art. Then the power model used is described. Our simulations are explained and the results are discussed before the paper closes with a short conclusion.

2 Related Work

System level power estimation is already recognized as an important aspect in the field of chip design and system simulation. For design space exploration of Networks-on-Chip (NoCs) Kahng et al. give a high level power model of routers and links called Orion 2.0 [2]. This work is based on the Predictive Technology Model (PTM) [3] and calculations of capacitances by Wong et al. [4]. The inclusion of low level power models in system level NoC simulation is part of the work of Xi et al. [5]. Transition activity was included in their simulation framework, which is crucial for a correct treatment when transition encoding is utilized [6–9]. Nevertheless, no crosstalk effects were included in their simulation framework. This could be fatal, as the influence of coupling capacitances on on-chip buses is not negligible. Sotiriadis et al. derived a new low level bus model to take such deep submicron effects into account [10]. There is much work on so-called crosstalk avoidance codes [11–14] and even on the combination of transition and crosstalk avoidance [15] that would benefit from a system level power estimation technique respecting actual transition counts and cross coupling effects.

Using signal statistics to estimate transition activity and even crosstalk [16] is considered to consume many resources during simulation. In [17] the utilization of word level statistics was proposed as a solution. In this paper we will show that even bit level statistics are suitable to enhance the high level power estimation of on-chip interconnects at no cost in simulation performance.

3 Modeling of Dynamic Power Dissipation on Links

The power consumed by communicating links can be divided into static and dynamic dissipation. Here we want to concentrate on the dynamic power dissipation because the static part is not influenced by the transmitted data. The well known formula

$$P_{dyn} = \frac{1}{2} \cdot a \cdot f \cdot V^{2} \cdot C \qquad (1)$$

where $a$ is the transition probability, $f$ the frequency, $V$ the operating voltage and $C$ the switched load capacitance, represents the dynamic power model of every logic element in CMOS systems. In the case of wires, energy consumption originates from charging ground and cross coupling capacitances. In general, capacitances to the ground and top plates are constant. The coupling capacitances are created by the left and right neighbors of a wire, which are parallel wires building a bus in most cases. The signal changes on those neighboring wires affect the effective capacitance seen by the driver through capacitive coupling. This can be considered a special case of the Miller Effect. The calculation of the effective capacitance is a combination of ground and coupling capacitance:

$$C_{eff} = C_g + \sigma \cdot C_c \qquad (2)$$

Here $\sigma$ depends on the switching directions of the right and left neighbor of the wire and is called the Miller Coupling Factor (MCF). There are different possible combinations which can raise but also lower the value of the effective capacitance compared to a static MCF, which is 2 on average (Tab. 1). The MCF can be calculated using the following equation, where $v_i^{f}$ is one when the final value of the voltage on the $i$-th line is high and zero if it is low. $v_i^{i}$ stands for the initial value of that line.

$$\sigma = [-1,\ 2,\ -1] \cdot \begin{bmatrix} v_{i-1}^{f} - v_{i-1}^{i} \\ v_{i}^{f} - v_{i}^{i} \\ v_{i+1}^{f} - v_{i+1}^{i} \end{bmatrix} \qquad (3)$$

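The resulting dynamic power consumption can be calculated with Eq. (4), where $V^{i}$ and $V^{f}$ denote the initial and final voltage of a line.

$$P_{dyn} = a \cdot f \cdot V_{i}^{f} \cdot [-\lambda,\ 1+2\lambda,\ -\lambda] \cdot \begin{bmatrix} V_{i-1}^{f} - V_{i-1}^{i} \\ V_{i}^{f} - V_{i}^{i} \\ V_{i+1}^{f} - V_{i+1}^{i} \end{bmatrix} \cdot C_g \qquad (4)$$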

Similar to the Predictive Technology Model (PTM) [3] and Orion 2.0 [2], we are using the models of Wong et al. [4] to calculate the technology dependent values of ground and coupling capacitances. Together with the gathered MCF these values are used for dynamic power calculation. In addition, a component of static power is added to include leakage, as is done in Orion 2.0.

Table 1. Possible Miller Coupling Factors of a wire (i) switching from 0 to 1

                  i-1: 0→0   i-1: 0→1   i-1: 1→0   i-1: 1→1
i+1: 0→0              2          1          3          2
i+1: 0→1              1          0          2          1
i+1: 1→0              3          2          4          3
i+1: 1→1              2          1          3          2
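The entries of Table 1 follow directly from Eq. (3). The following minimal Python sketch recomputes them for a wire switching from 0 to 1 and evaluates Eq. (2) for a given factor. The function names and the capacitance values used in the usage note are illustrative assumptions and do not stem from the authors' software.

```python
def miller_coupling_factor(left, victim, right):
    """Eq. (3): each argument is an (initial, final) pair of logic values (0 or 1)."""
    d_left, d_victim, d_right = (final - initial for initial, final in (left, victim, right))
    return -d_left + 2 * d_victim - d_right


def effective_capacitance(c_ground, c_coupling, sigma):
    """Eq. (2): C_eff = C_g + sigma * C_c."""
    return c_ground + sigma * c_coupling


transitions = [(0, 0), (0, 1), (1, 0), (1, 1)]
victim = (0, 1)  # wire i switches from 0 to 1, as in Table 1

# Rows follow neighbor i+1, columns follow neighbor i-1.
table = [[miller_coupling_factor(left, victim, right) for left in transitions]
         for right in transitions]

for right, row in zip(transitions, table):
    print("i+1:", right, row)
```

For example, `effective_capacitance(1.0, 2.0, 2)` yields 5.0 in whatever unit the two (hypothetical) capacitance values are given.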

4 Bit Level Statistics

To get the most exact values for effective coupling capacitances and transition counts, it is necessary to evaluate every bit that traverses the data bus in the system and analyze its correlation to the previous bit at this position. This is possible for all signals in gate level simulations, because all signals are known and their probable mapping to wires can be estimated. Even at system level this is possible for links connecting main modules (e.g. a bus in SoCs or the interconnection network in NoCs), if a few assumptions concerning bus mappings are made. The evaluation of every bit transmitted through the communication system takes time during the simulation process. If done during system level simulation, this may cancel out the speed gain achieved through high level abstractions.

Instead, we propose to use signal statistics to account for transition activity and crosstalk effects on links. The necessary signal statistics can be obtained from a sample of data characterizing the traffic on the actual link before the system simulation starts. The time required to create these offline statistics depends on the evaluated system and signal parameters but usually should be much lower than the time taken to process the whole real data stream. The signal data can be acquired by deploying cycle accurate system models or architectural models and by exploiting knowledge of the algorithms used in the system modules. It has to be known whether the data is mostly random, like compressed data, or whether there are inter-word correlations, which are often found in uncompressed data. Of course, signal traces of lower level models can be used as well, if they are available.

In our experiments we used two ways to gather the bit level statistics of the data. In the first method stream based evaluation software is used to examine the characteristics of general data. First, the incoming data from a file is divided into chunks corresponding to the expected word width on the later bus structure. Then transitions between two successive words are counted and the MCF is calculated for every bit position in the data word in order to consider crosstalk. In the middle of the bus the needed energy is affected by two aggressors, while the victim lines at the fringes have only one aggressor (Fig. 1). When the stream comes to an end, the arithmetic averages of the transitions and MCFs of all bit positions are determined.

The second method is based on the interpretation of signal traces in Value Change Dump (VCD) format. A gate level simulation of a hardware design is used to generate the trace files. Our software extracts the signals of interest from the signal dump. These are the signals that run between main modules and are candidates for relatively long wires, i.e., wires with high capacitances in the data bus. These signals are analyzed in the same way as in the stream based evaluation.

In our simulations we used the first method for general investigations of bit level statistics of common data. The second approach was used to evaluate our estimation technique for an implemented SoC.

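[Figure: two successive cycles T1 and T2; a victim wire at the edge of the bus with one aggressor, and a victim wire in the middle of the bus with an aggressor on each side]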

Fig. 1. Crosstalk estimation in two successive cycles at fringes and in the middle of a bus [16]
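The stream based evaluation of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the function names are hypothetical, the word width is assumed to be a multiple of 8 bit, bit position k of a word is assumed to map to wire k of the bus, and the MCF is averaged over switching events only. Fringe wires are given a single aggressor, as in Fig. 1.

```python
def split_words(data: bytes, width_bits: int):
    """Cut a byte stream into words of width_bits (assumed to be a multiple of 8)."""
    step = width_bits // 8
    usable = len(data) - len(data) % step
    return [int.from_bytes(data[i:i + step], "big") for i in range(0, usable, step)]


def bus_statistics(words, width_bits):
    """Return (transition probability, average MCF over all switching events)."""
    transitions = 0
    mcf_sum = 0
    for prev, curr in zip(words, words[1:]):
        # Per-wire change between two successive words: -1, 0 or +1.
        delta = [((curr >> pos) & 1) - ((prev >> pos) & 1) for pos in range(width_bits)]
        for pos, d in enumerate(delta):
            if d == 0:
                continue
            transitions += 1
            # Fringe wires only have one neighbor (one aggressor).
            neighbors = [delta[n] for n in (pos - 1, pos + 1) if 0 <= n < width_bits]
            # Coupling term per neighbor as in Eq. (3), normalised to the
            # switching direction of the victim wire.
            mcf_sum += sum((d - dn) * d for dn in neighbors)
    slots = (len(words) - 1) * width_bits
    probability = transitions / slots if slots else 0.0
    average_mcf = mcf_sum / transitions if transitions else 0.0
    return probability, average_mcf


# Alternating 01010101 / 10101010 toggles every wire against its neighbors.
worst_case = split_words(bytes([0x55, 0xAA, 0x55, 0xAA]), 8)
print(bus_statistics(worst_case, 8))
```

With the alternating pattern every wire toggles in every cycle, the six middle wires see an MCF of 4 and the two fringe wires an MCF of 2, so the sketch reports a transition probability of 1.0 and an average MCF of 3.5.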

5 Simulation Results

To estimate the accuracy gain of power estimation with bit level statistics, different types of data were analyzed by our stream based program. As representatives of compressed data, JPEG and H.264 compressed image and video files as well as MPEG-Layer 3 encoded audio files were used. As a group of uncompressed data, decoded image, audio, video and text files were used. A more practical data stream with a mixture of compressed and uncompressed data is represented by a network stream recorded while browsing a webpage. The characteristic content of such a stream dump is uncompressed packet headers and compressed HTML text plus a few compressed graphics files. For comparison, we included a data pattern that raises crosstalk and transition probability to 100 %, representing the worst case of data patterns.

To get indications of the applicability of bit level statistics, the model of an application was investigated. The H.264 decoder [18] was simulated at register transfer level to extract signal dumps of the global connections of functional blocks like memories, entropy decoder, prediction unit etc. Those trace dumps were analyzed to extract the bit level signal statistics.

5.1 Simulation Accuracy

Traditional data independent power estimation assumes a transition probability of 50 %. Fig. 2 shows the results of our system level power estimation compared to a traditional one. In addition, we determined the estimated power values with the actually gathered transition probability without calculating crosstalk effects, in order to isolate the influence of the MCF.

As expected, the highly compressed data mostly consists of uncorrelated patterns. This corresponds to random data. The resulting power estimation with consideration of bit level statistics hardly differs from the traditional approach of assuming 50 % transition probability. This applies to random data as well as compressed images (JPEG), videos (H.264) and audio (MP3). The estimation error with respect to the most accurate method, which uses the real transition count and the crosstalk calculation, shows relatively low values of up to 7.1 % (Tab. 2).

The most accurate calculation, which respects the crosstalk capacitances including the MCF, shows slightly lower power values even in the case of completely random data. That is because the fringe capacitances, which are considered to be much lower than the coupling capacitances, were included only in this estimation mode, where the deep submicron bus model was used. The other two estimation modes assume coupling capacitances on both sides of the wire even at the fringes of the bus.

The uncompressed data shows higher autocorrelation. This results in lower power values due to fewer transitions on the wires in the cases of uncompressed video as well as images (BMP), audio (WAVE) and text files. The effect arises because the most significant bits switch less frequently than the less significant ones. In these cases it is very important to choose the right word width to exploit the data characteristics. This decision is mostly implied by the application, but information about this aspect can also be provided by our data analysis software.
As Fig. 3 shows, the transition probability of uncompressed data depends on the word width used. The optimal width for uncompressed image and video data is 24 bit because there are typically 3 bytes of color information per pixel in such a data structure. Our audio example consists of a 16 bit stereo wave file and shows an optimal word width of 32 bit. The text file is optimally segmented at every multiple of 8 bit because ASCII encoding is used, which utilizes 1 byte of data per character.

The highest difference between the power estimation values was reached by uncompressed video, which consists of a scene of an animated comic in 1080p format. The method of considering realistic transition counts and calculating the crosstalk activity differs by about 432.5 % from the estimation with a simple assumption of 50 % transition activity. Just considering transitions and ignoring the MCFs of crosstalk shows a deviation of only 2.2 %.

To get more realistic data patterns a SoC was examined. This hardware design implements an H.264 decoder and is divided into functional blocks. The signals connecting those modules are considered to be intermediate wires that are long enough to produce high capacitances and make a noticeable contribution to the overall energy consumption. The extracted signal statistics lead to power estimations that are significantly lower (deviation of 84.6 %) than those obtained by assuming an average transition rate of 50 %. The average transition rates between the main modules of the SoC are therefore closer to those of uncompressed data than to those of compressed data. This leads to a better power estimation when using real signal statistics.

As the simulation results show, the accuracy of the system level power estimation is raised by our approach of using signal statistics to predict transition probability.
By doing so, the error of up to 432.5 % in simulations using a general assumption of 50 % transition probability is avoided. The size of such estimation errors depends on the data itself and is higher the less compressed the data is. As our worst case data sample shows, the simple estimation can be too low by about 64.9 %, whereas in cases of practical data it is consistently too high.

Crosstalk effects are less important to the power estimation, as can be seen from the small deviations of the method using real transition counts without the application of crosstalk estimation. That is because the average MCF is mostly met by the data characteristics.

Table 2. Relative deviation of energy estimation techniques related to the method of considering real transition rate and crosstalk

data set              method using real tr. rate   method using 50 % tr. rate
worst case            0.313                        0.649
random                0.026                        0.027
JPEG                  0.007                        0.022
H.264                 0.028                        0.043
MP3                   0.013                        0.071
web surfing           0.032                        0.187
text (ASCII)          0.052                        0.520
BMP                   0.016                        1.266
video unenc.          0.022                        4.325
WAVE                  0.021                        0.422
H.264 decoder SoC     0.059                        0.846

5.2 Simulation Performance

During simulation, the method of using signal statistics reduces to evaluating the power equation. In this step the general time complexity of the simulation is not affected, so there is no speed penalty and system level power estimation finishes in fractions of a second.

The statistical data of possible signals must be gathered prior to the simulation. This step takes time and depends on the method of statistics acquisition. In our experiment with general data files the data stream analysis takes up to 5 seconds when processing up to 100 MB on an Intel Core2Duo workstation PC. It has to be mentioned that we did not optimize for runtime, as we assume that the statistics are gathered offline and that high level models with few design possibilities are then simulated in seconds.

[Figure: bar chart of the energy in fJ per bit (axis range 0 to 35) for the data sets worst case, random, JPEG, H.264, MP3, web surfing, WAVE, text (ASCII), BMP, video unenc. and H.264 decoder SoC, with three series: 50 % Transition Rate, Real Transition Rate, Real Tr. Rate and Crosstalk]

Fig. 2. Estimated average energy for transmitting one bit on an intermediate wire of 200 µm length (single spaced) in 65 nm technology for different data files evaluated by 3 different estimation techniques

[Figure: transition probability (axis range 0 to 0.6) over word width (8 to 64 bit in steps of 8) for WAVE, video, BMP, random and text data]

Fig. 3. Transition probability using different word width for transmission
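The dependency on word width shown in Fig. 3 can be illustrated with a deliberately extreme example. The Python sketch below assumes a synthetic image in which every pixel carries the same three color bytes, which is a hypothetical stand-in for highly correlated uncompressed image data and not one of the files evaluated in the paper. The function name and the byte values are illustrative.

```python
def transition_probability(data: bytes, width_bits: int) -> float:
    """Fraction of wires that toggle between successive words of width_bits."""
    step = width_bits // 8  # width_bits is assumed to be a multiple of 8
    usable = len(data) - len(data) % step
    words = [int.from_bytes(data[i:i + step], "big") for i in range(0, usable, step)]
    # XOR marks the wires that differ between two successive words.
    toggles = sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
    slots = (len(words) - 1) * width_bits
    return toggles / slots if slots else 0.0


# 64 identical pixels with 3 bytes of color information each (hypothetical values).
pixels = bytes([0x12, 0x34, 0x56]) * 64

for width in (8, 16, 24, 32, 48):
    print(width, round(transition_probability(pixels, width), 3))
```

In this sketch the widths 24 and 48 bit align with the pixel size, so successive words are identical and no wire toggles, whereas the other widths mix different color bytes on the same wires and produce transitions. Real image data is less uniform, so the effect is weaker there, but the alignment argument is the same.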

6 Conclusion

In this paper we showed how wrong system level power estimation can be if it is not aware of the data that will pass the interconnection system between the main modules. Our proposed technique takes bit level statistical data of a possible data stream in the system and makes it available to commonly accepted low level power models of interconnection links. By using this approach the actual transition activity of the interconnections and low level phenomena like cross coupling effects can be considered.

It turns out that, if mainly uncompressed data is transmitted between the system components, the deviations between the power estimations are not negligible. In consequence, the consideration of bit level statistics promises to facilitate more accurate estimations. As the investigation of a realistic system showed, our technique avoided an estimation error of 84.6 % that would result from assuming a general transition activity of 50 %.

The crosstalk feature of our power estimation technique showed no noteworthy effects when realistic data was used. The difference to the method considering real transition activities was 6.5 %. As we plan to integrate this work into a bigger simulation kit with different link level encoding features to exploit transition and crosstalk avoidance codes, the feature of cross coupling estimation is going to be essential for correct power estimations.

References

1. Vangal, S., Howard, J., Ruhl, G., Dighe, S., et al.: An 80-tile sub-100-W teraflops processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits 43(1), 29–41 (2008)
2. Kahng, A., Li, B., Peh, L., Samadi, K.: Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In: Design, Automation, and Test in Europe, pp. 423–428 (2009)
3. Predictive Technology Model, http://ptm.asu.edu/
4. Wong, S.C., Lee, G.Y., Ma, D.J.: Modeling of Interconnect Capacitance, Delay, and Crosstalk in VLSI. IEEE Transactions on Semiconductor Manufacturing 13, 108–111 (2000)
5. Xi, J., Zhong, P.: A System-level Network-on-Chip Simulation Framework Integrated with Low-level Analytical Models. In: 2006 International Conference on Computer Design, pp. 383–388 (October 2006)
6. Kretzschmar, C., Siegmund, R., Müller, D.: Adaptive bus encoding technique for switching activity reduced data transfer over wide system buses. In: Soudris, D.J., Pirsch, P., Barke, E. (eds.) PATMOS 2000. LNCS, vol. 1918, pp. 66–75. Springer, Heidelberg (2000)
7. Sotiriadis, P., Chandrakasan, A.: Bus energy minimization by transition pattern coding (TPC) in deep sub-micron technologies. In: Proceedings of the 2000 IEEE/ACM International Conference on Computer-Aided Design, pp. 322–328. IEEE Press, Los Alamitos (2000)
8. Ramprasad, S., Shanbhag, N., Hajj, I.: A coding framework for low-power address and data busses. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 7(2), 212–221 (1999)
9. Benini, L., Micheli, G., Macii, E., Sciuto, D., Silvano, C.: Address bus encoding techniques for system-level power optimization. In: Design, Automation, and Test in Europe, pp. 275–289. Springer, Heidelberg (1998)
10. Sotiriadis, P.P., Chandrakasan, A.: A Bus Energy Model For Deep Sub-Micron Technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 10, 341–350 (2002)
11. Pande, P., Ganguly, A., Zhu, H., Grecu, C.: Energy reduction through crosstalk avoidance coding in networks on chip. Journal of Systems Architecture 54(3-4), 441–451 (2008)
12. Rahaman, M., Chowdhury, M.: Crosstalk Avoidance and Error-Correction Coding for Coupled RLC Interconnects. Crosstalk, 141–144 (2009)
13. Duan, C., Cordero Calle, V.H., Khatri, S.P.: Efficient On-Chip Crosstalk Avoidance CODEC Design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17(4), 551–560 (2009)
14. Sankaran, H., Katkoori, S.: On-chip dynamic worst-case crosstalk pattern detection and elimination for bus-based macro-cell designs. In: 2009 10th International Symposium on Quality of Electronic Design, pp. 33–39 (March 2009)
15. Palesi, M., Fazzino, F., Ascia, G., Catania, V.: Data Encoding for Low-Power in Wormhole-Switched Networks-on-Chip. In: 2009 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, pp. 119–126 (2009)
16. Gupta, S., Katkoori, S.: Intra-bus crosstalk estimation using word-level statistics. In: 17th International Conference on VLSI Design, Proceedings, pp. 449–454 (2004)
17. Ramprasad, S., Shanbhag, N., Hajj, I.: Analytical estimation of transition activity from word-level signal statistics. In: Proceedings of the 34th, vol. 16(7), pp. 718–733 (1997)
18. Fleming, K., Dave, C., Arvind, N., Raghavan, G., Jamey, M.: H.264 Decoder: A Case Study in Multiple Design Points. In: 6th ACM/IEEE International Conference on Formal Methods and Models for Co-Design, MEMOCODE, pp. 165–174 (2008)

Residue Arithmetic for Designing Low-Power Multiply-Add Units

Ioannis Kouretas and Vassilis Paliouras

Electrical and Computer Engineering Dept., University of Patras, Greece

Abstract. In this paper an efficient way to exploit multi-Vdd standard-cell libraries is quantitatively investigated as a means to reduce power consumption of multiply-add units. It is shown that multi-Vdd library-based design is suitable for RNS systems due to their inherent modular organization. In particular the paths defined by the isolated moduli channels are clearly distinguished and the designer can easily and efficiently determine high- and low-voltage areas in the design. Three-, four- and five-moduli RNS bases have been used for the design of the RNS multiply-add units. Comparisons to synthesized circuits that do not use multi-Vdd libraries revealed power reduction up to 38%.

1 Introduction

A main challenge for the electronics industry is to provide extremely efficient and powerful devices for communications, video and network applications that meet strict power constraints of portable battery-operated devices. This requires effective design techniques to address both the power constraints and the increase of the computational complexity.

The use of alternative number representations such as the Logarithmic Number System (LNS) and the Residue Number System (RNS) is a promising technique for the implementation of computationally-intensive low-power systems [1, 17] using special-purpose dedicated circuits. In particular, RNS has been investigated as a possible choice for number representation in DSP applications [14, 15], since it offers parallel multiplication or addition and error correction properties [18]. Recently RNS has been shown to provide solutions in the field of wireless telecom applications [12]. RNS architectures for basic arithmetic circuits can be distinguished into memory table lookup-based ones, combinatorial logic-based ones, or combinations of both approaches [2]. Combinatorial RNS circuits are efficient especially for large moduli, and for moduli of the form $2^n - 1$ [7], $2^n$, and $2^n + 1$ [7, 8, 19]. Moduli of the form $2^n - 1$ and $2^n + 1$ offer low-complexity circuits for arithmetic operations due to the end-around carry property, while moduli of the form $2^n$ lead to simple and regular architectures due to the carry-ignore property.

Recent publications have shown that RNS can offer significant power savings when applied to the design of VLSI FIR digital filters [3–5]. In [13] it is theoretically shown that power minimization is possible in the RNS domain by using multiple supply voltages. The particular study focuses on Polynomial RNS for the implementation of low-power convolvers. In this paper a multi-voltage library is exploited to reduce power dissipation of RNS multiply-add units and a quantitative analysis is offered. In particular low-voltage cells are employed to implement specific paths, i.e., paths that are not maximum delay critical for the circuit.

The remainder of the paper is organized as follows. Section 2 offers RNS basics, while Section 3 reviews power dissipation basics. Section 4 describes the proposed multi-Vdd multiply-add units and a quantitative analysis takes place in Section 5. Section 6 offers some conclusions.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 31–40, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Review of RNS Basics

The RNS maps an integer $X$ to an $N$-tuple of residues $x_i$, as follows

$$X \xrightarrow{\text{RNS}} \{x_1, x_2, \ldots, x_N\}, \qquad (1)$$

where $x_i = \langle X \rangle_{m_i}$, $\langle \cdot \rangle_{m_i}$ denotes the mod $m_i$ operation, and $m_i$ is a member of a set of pair-wise co-prime integers $\{m_1, m_2, \ldots, m_N\}$, called base. Co-prime integers have the property that $\gcd(m_i, m_j) = 1$, $i \neq j$. The modulo operation $\langle x \rangle_m$ returns the integer remainder of the integer division $x$ div $m$, i.e., a number $k$ such that $x = m \cdot l + k$, where $l$ is an integer. Mapping (1) offers a unique representation of integer $X$, when $0 \leq X < \prod_{i=1}^{N} m_i$.

RNS is of interest because basic arithmetic operations can be performed in a carry-free manner. In particular the operation $Z = X \circ Y$, where $Y \xrightarrow{\text{RNS}} \{y_1, y_2, \ldots, y_N\}$, $Z \xrightarrow{\text{RNS}} \{z_1, z_2, \ldots, z_N\}$, and the symbol $\circ$ stands for addition, subtraction, or multiplication, can be implemented in RNS as $z_i = \langle x_i \circ y_i \rangle_{m_i}$, for $i = 1, 2, \ldots, N$. According to the above, each residue result $z_i$ does not depend on any of the $x_j$, $y_j$, $j \neq i$, thus allowing fast data processing in $N$ parallel independent residue channels. Inverse conversion is accomplished by means of the Chinese Remainder Theorem (CRT) or mixed-radix conversion [16].
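The mapping and the channel-wise arithmetic described in this section can be tried out with a small Python sketch. The base {7, 8, 9}, i.e. moduli of the form 2^n − 1, 2^n and 2^n + 1 with n = 3, and the operand values are illustrative assumptions, as are the function names. The inverse conversion uses the textbook form of the CRT and requires Python 3.8 or later for `math.prod` and the modular inverse via `pow`.

```python
from math import gcd, prod


def to_rns(x, base):
    """Forward conversion: X -> (x_1, ..., x_N) with x_i = X mod m_i."""
    return tuple(x % m for m in base)


def rns_op(xs, ys, base, op):
    """Apply op independently in every residue channel."""
    return tuple(op(x, y) % m for x, y, m in zip(xs, ys, base))


def from_rns(residues, base):
    """Inverse conversion by the Chinese Remainder Theorem."""
    dynamic_range = prod(base)
    total = 0
    for r, m in zip(residues, base):
        partial = dynamic_range // m
        total += r * partial * pow(partial, -1, m)  # pow(..., -1, m): modular inverse
    return total % dynamic_range


base = (7, 8, 9)  # pair-wise co-prime, dynamic range 7 * 8 * 9 = 504
assert all(gcd(a, b) == 1 for i, a in enumerate(base) for b in base[i + 1:])

x, y = to_rns(25, base), to_rns(12, base)
product = rns_op(x, y, base, lambda a, b: a * b)
total = rns_op(x, y, base, lambda a, b: a + b)
print(from_rns(product, base), from_rns(total, base))  # 300 and 37
```

Both results stay below the dynamic range of 504, which is the condition under which the representation is unique. No channel ever needs a value from another channel, which is the carry-free property exploited by the parallel residue channels.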

3 Low-Power in RNS

Dynamic power $P_{dyn}$ of a circuit is given by [10]

$$P_{dyn} = C_L \cdot V_{dd}^{2} \cdot f \cdot \alpha, \qquad (2)$$

where $C_L$ is the load capacitance, $V_{dd}$ is the supply voltage, $f$ is the frequency of transitions and $\alpha$ is the switching activity on each clock cycle. Eq. (2) shows that power is quadratically related to voltage. Therefore by reducing the power supply ($V_{dd}$), dynamic power decreases dramatically. The penalty for the reduction of $V_{dd}$ is that cells that operate at lower voltage are slower. Hence, the designer should identify the non-critical paths (i.e., the paths that do not define the maximum-delay critical path) and power the respective gates with a lower voltage.

For the case of a multi-Vdd system, power dissipation is given by

$$P_{dyn} = \sum_{i=1}^{p} C_{L,i} \cdot V_{dd,i}^{2} \cdot f_i \cdot \alpha_i, \qquad (3)$$

where $p$ is the number of power domains employed. The proposed technique builds on the modular organization of residue-based systems. In particular, it is here proposed that each independent moduli channel of an RNS architecture is mapped to an appropriate supply voltage. According to the proposed technique moduli channels that contain the longest path are mapped to higher supply voltages. It is noted that power minimization is achieved without any impact on the delay.

Due to its modular organization, RNS is ideally suited for the simple and efficient application of the aforementioned low-power design technique. Assume an $L$-moduli RNS base $\{m_1, m_2, \ldots, m_L\}$ implemented by an $L$-channel residue architecture, as shown in Fig. 2. Each modulus $m_i$ defines the complexity of the corresponding modulo channel, the delays of which are $\{d_1, d_2, \ldots, d_L\}$, respectively, assuming a high-voltage power supply denoted as $V_{dd}(H)$. Here we focus on the case of two power domains, i.e., $p = 2$, with two voltage values, $V_{dd}(H)$ and $V_{dd}(L)$. The maximum delay $d_{max} = \max(d_1, d_2, \ldots, d_L)$ determines the critical maximum delay of the design. Assume that $d_{max} = d_k$ and for the delays $d_l$, $l \neq k$, without loss of generality, it holds that

dk1

Replacing the high-voltage gates ($V_{dd}(H)$) that compose each one of the moduli channels $m_{k_i}$ with low-voltage gates ($V_{dd}(L)$) is permissible, provided that the imposed delay penalty in non-critical circuits does not affect the overall critical delay $d_{max}$, i.e., $d_{max} = d_k \geq \max\{d_{k_i}\}$. Subsequently the proposed multiply-add units are described and quantitative power dissipation and complexity results are derived. Comparisons are offered to both binary structures and residue multiply-add units without multi-voltage supply in terms of power dissipation and complexity.
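The channel-to-supply mapping and Eq. (3) can be illustrated with a short Python sketch. All numbers below are hypothetical: the supply voltages, the delay penalty of the low-voltage cells, and the per-channel delays, capacitances, frequencies and activities are assumptions chosen for illustration, not library data from the paper, and the function names are not the authors'. The sketch assumes that a low-voltage cell is slower by one common factor.

```python
def assign_supply(delays_high, slowdown_low):
    """Map a channel to Vdd(L) only if its slowed-down delay still fits within d_max."""
    d_max = max(delays_high)
    return ["L" if d * slowdown_low <= d_max else "H" for d in delays_high]


def dynamic_power(cap, vdd, freq, activity):
    """Eq. (3): sum over all channels of C_L,i * Vdd,i^2 * f_i * alpha_i."""
    return sum(c * v ** 2 * f * a for c, v, f, a in zip(cap, vdd, freq, activity))


VDD = {"H": 1.2, "L": 1.0}        # hypothetical supply voltages in volts
delays_high = [1.0, 0.6, 0.9]     # hypothetical channel delays at Vdd(H) in ns
cap = [1.0e-12, 0.5e-12, 0.8e-12] # hypothetical switched capacitances in farads
freq = [100e6] * 3                # hypothetical clock frequency in Hz
activity = [0.2] * 3              # hypothetical switching activities

domains = assign_supply(delays_high, slowdown_low=1.5)
p_single = dynamic_power(cap, [VDD["H"]] * 3, freq, activity)
p_multi = dynamic_power(cap, [VDD[d] for d in domains], freq, activity)

print(domains)
print("relative saving:", 1 - p_multi / p_single)
```

With these assumed values only the second channel moves to the low supply, because the other two would exceed the critical delay once slowed down. The critical delay of the design is unchanged, which mirrors the condition that the delay penalty in non-critical channels must not affect $d_{max}$.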

4 RNS and Binary Multiply-Add Units

This section describes the organization of RNS and binary multiply-add units. In the case of RNS, three-, four- and five-moduli bases of the form $\{2^{n_1} - 1, 2^{n_2}, 2^{n_3} + 1\}$, $\{2^{n_1}, 2^{n_2} - 1, 2^{n_3} - 1, 2^{n_4} + 1\}$ and $\{2^{n_1}, 2^{n_2} - 1, 2^{n_3} - 1, 2^{n_4} + 1, 2^{n_5} + 1\}$ are used, respectively. The binary multiply-add unit comprises a Wallace multiplier augmented by a step for the addition of a third operand. Figs. 1 and 3 depict the organization of a binary and a three-moduli RNS-based multiply-add unit respectively, while Fig. 4 shows possible 4-bit implementations for modulo-$(2^n - 1)$ MAC (Fig. 4(a)), modulo-$2^n$ (Fig. 4(c)) and binary MAC (Fig. 4(b)). Both architectures implement the multiply-add operation $a * b + c$.

Fig. 1. Organization of the binary multiply-add unit (inputs a, b, c; AND array; Wallace adder array)

Fig. 2. Architecture of multi-voltage RNS system (binary-to-RNS conversion, parallel mod-m_i processors of n_i bits each supplied by Vdd(H) or Vdd(L), RNS-to-binary conversion)

It is noted that in the case of RNS, binary-to-RNS and RNS-to-binary converters are required. Forward conversion is required at the start and reverse conversion at the end of a MAC-intensive operation, such as the computation of an N-point Fourier transform [11]. To illustrate this point, assume the FIR filter operation y(n) = b_0 x(n) + b_1 x(n − 1) + b_2 x(n − 2) + ··· + b_M x(n − M), where x(n) is the input signal, b_l are the coefficients and y(n) is the output signal. Let the RNS base be of the form {m_1, m_2, m_3, ..., m_N}. Then for the kth sample y(k) of the filter output, it holds that y(k) = b_0 x(k) + b_1 x(k − 1) + b_2 x(k − 2) + ··· + b_M x(k − M). In the RNS domain the same operation is performed in N parallel modulo-m_i channels as

$$\langle y(k) \rangle_{m_i} = \left\langle \sum_{l=0}^{M} \langle b_l \cdot x(k-l) \rangle_{m_i} \right\rangle_{m_i}, \qquad (5)$$

where m_i denotes the ith modulus, i = 1, 2, ..., N.

The procedure for the computation of y(k) is as follows. Initially the multiplication $c(0) = \langle b_0 x(k) \rangle_{m_i}$ is computed. Then the modulo-m_i result c(0) is added to the residue product $\langle b_1 x(k-1) \rangle_{m_i}$ to derive the intermediate quantity $c(1) = \langle c(0) + \langle b_1 x(k-1) \rangle_{m_i} \rangle_{m_i}$. The result $\langle y(k) \rangle_{m_i}$ is recursively derived after M multiply-add operations. Hence the final result y(k) is generated by the residue-to-binary conversion of the RNS result $\{\langle y(k) \rangle_{m_1}, ..., \langle y(k) \rangle_{m_N}\}$ after M multiply-add operations. For this reason the backward residue-to-binary conversion is performed once every M multiply-add operations. Furthermore, x and b are forward converted once and are reused throughout the computation of y. Therefore, for a sufficiently large amount of processing, the conversion cost can be compensated by the savings achieved due to more efficient processing. Due to the conversion overhead, applications suitable for RNS include multiply-add-intensive kernels such as digital filtering or discrete transforms.
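The per-channel recursion of Eq. (5) can be sketched in Python as follows. This is an illustrative model only, not the hardware organization of the paper: the function names are hypothetical, operands are assumed non-negative, and the reverse conversion uses the textbook Chinese Remainder Theorem rather than any specific converter architecture. The example base {255, 256, 257} has the three-moduli form with n1 = n2 = n3 = 8.

```python
from math import prod


def rns_fir_sample(b, x_window, moduli):
    """Residues of y(k) = sum_l b[l] * x(k - l), one accumulator per channel.

    x_window[l] holds x(k - l). Each channel only ever sees values reduced
    modulo its own modulus, as in Eq. (5).
    """
    residues = []
    for m in moduli:
        b_res = [v % m for v in b]          # forward conversion, done once
        x_res = [v % m for v in x_window]
        acc = 0
        for bl, xl in zip(b_res, x_res):
            acc = (acc + bl * xl) % m       # modulo-m multiply-add step
        residues.append(acc)
    return residues


def crt(residues, moduli):
    """Reverse conversion by the Chinese Remainder Theorem.

    Assumes pairwise coprime moduli and a result within the dynamic range.
    """
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = big_m // m
        total += r * m_hat * pow(m_hat, -1, m)  # modular inverse (Python 3.8+)
    return total % big_m


moduli = (255, 256, 257)
b = [3, 5, 7]
x_window = [10, 20, 30]
res = rns_fir_sample(b, x_window, moduli)
print(res, crt(res, moduli))
```

The single call to `crt` at the end mirrors the point made in the text: reverse conversion is needed only once per output sample, after all multiply-add steps have been carried out inside the channels.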

Fig. 3. Organization of RNS-based multiply-add unit: modulo $2^{n_1}-1$ channel [20], modulo $2^{n_2}$ channel, and modulo $2^{n_3}+1$ channel [6]. Each channel comprises an AND array, a modulo adder array and a modulo adder.

Fig. 4. Implementations of RNS and binary MAC units: (a) modulo $2^n-1$ MAC, (b) binary MAC, (c) modulo $2^n$ MAC.

5 Results and Comparisons

In this section a quantitative analysis and comparison of residue circuits to the equivalent binary multiply-add unit are offered for three-, four- and five-moduli bases. In particular, as a test case, a 50th-order FIR low-pass filter is used, with a cut-off frequency of 0.3 rad/sec. A zero-mean uncorrelated Gaussian random sequence is used as stimulus. The experiment assumes 1000 input data samples. For each modulo channel of the RNS circuit the corresponding input vectors are derived by the modulo operation on the input data samples and the coefficients of the FIR filter. Hence the inputs of the modulo circuits assume the values that a forward converter would generate.

Subsequently, the binary multiply-add unit equivalent to the RNS unit is defined. The signal-to-noise ratio (SNR) is used as the metric that defines the binary structure equivalent to RNS. SNR is estimated by using the filter and the input data described above. It is found that the 30-bit data range RNS FIR filter exhibits almost the same SNR as the binary FIR filter with 20-bit wordlength operands (SNR_BIN = 64.71, SNR_RNS = 65.38).

In this paper a multi-Vdd 90nm TSMC library, characterized for 1.2 Volts (high-voltage) and 1.0 Volts (low-voltage) power supply, and Prime Time of Synopsys [9] have been used. Power is estimated by using the stimuli derived by the FIR filter defined above with annotated switching activity, assuming a 5ns clock period for the simulation.

It is noted that high-voltage gates are faster than low-voltage gates. The proposed multi-Vdd based design technique distinguishes parts of the circuit that are not critical and may operate at reduced speed. Therefore low-voltage power supply can be used without affecting the critical path delay. In the following the residue number system is used for multi-Vdd design. Assume an L-moduli RNS base {m_1, m_2, ..., m_L} the delays of which are {d_1, d_2, ..., d_L}, respectively, for high-voltage power supply. The maximum delay d_max = max(d_1, d_2, ..., d_L) determines the critical delay of the design. Now assume that d_max = d_k and for the delays d_p, d_{p−1} and d_{p−2} of the moduli channels p, p − 1, and p − 2 respectively, it holds

dp

Regarding design constraints, legal replacement of the high-voltage gates that compose each one of the moduli channels p, p − 1 and p − 2 with low-voltage gates is achievable, provided that the derived delay penalty leaves the critical delay d_max unchanged, i.e., d_max = d_k ≥ max{d_p, d_{p−1}, d_{p−2}}.

Several RNS circuits have been synthesized using the multi-Vdd library, and the obtained results are presented in Tables 1, 2 and 3. The moduli followed by (∗) denote low-voltage (1.0 Volts) power-supply synthesis. Lack of (∗) means that the particular moduli circuits have been synthesized with high-voltage (1.2 Volts) power supply. The column labeled "power" contains power results for the RNS system before and after the application of the multi-Vdd low-power technique. The power savings percentage is computed as

$$\frac{\mathrm{Power}_{before} - \mathrm{Power}_{after}}{\mathrm{Power}_{before}} \cdot 100\%.$$

More specifically, Table 1 depicts results in the case of three-moduli RNS bases of the form $\{2^{n_1}-1, 2^{n_2}, 2^{n_3}+1\}$. It is shown that power savings range from 8.11% to 37.96%, in the case of bases {256∗, 2047, 2049} and {64, 8191∗, 1025∗}, respectively. In the case of the base {256, 2047, 1025} it is shown that by low-voltage supplying modulo-1025, deriving the base {256, 2047, 1025∗}, a 28.71% power saving is achieved, while in the case of low-Vdd application to both modulo-1025 and modulo-256, the power saving increases to 33.35%.

Regarding four-moduli bases of the form $\{2^{n_1}, 2^{n_2}-1, 2^{n_3}-1, 2^{n_4}+1\}$, Table 2 shows that power savings of up to 38.63% are achieved in the case of the base {16, 31∗, 2047∗, 1025∗}. Table 2 also demonstrates that the bases

Table 1. Power, delay and area results in case of multi-vdd application in three-moduli RNS bases

| base | power before (mW) | power after (mW) | area (μm²) | delay (ns) | power savings |
|---|---|---|---|---|---|
| {256, 2047, 1025∗} | 3.0577 | 2.1797 | 11427.6623 | 2 | 28.71% |
| {256, 511∗, 8193} | 3.3128 | 2.5658 | 12513.1888 | 2 | 22.55% |
| {64, 8191∗, 1025∗} | 3.2168 | 1.9957 | 20874.1566 | 2 | 37.96% |
| {256∗, 2047, 2049} | 1.7488 | 1.607 | 7166.2304 | 2 | 8.11% |
| {256∗, 2047, 1025∗} | 3.0577 | 2.0379 | 11190.0319 | 2 | 33.35% |
| {256∗, 1023∗, 4097} | 3.0823 | 2.2485 | 12060.9775 | 2 | 27.05% |
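The savings column of Table 1 follows directly from the power-savings formula given in the text. As a small consistency check, the following Python sketch (the function name is hypothetical) recomputes three of the tabulated percentages from the corresponding "before" and "after" power values.

```python
def power_savings_percent(p_before, p_after):
    """(Power_before - Power_after) / Power_before * 100, as defined in the text."""
    return (p_before - p_after) / p_before * 100.0


# (before, after) power values in mW taken from Table 1.
rows = {
    "{256, 2047, 1025*}": (3.0577, 2.1797),
    "{64, 8191*, 1025*}": (3.2168, 1.9957),
    "{256*, 2047, 2049}": (1.7488, 1.607),
}
for base, (before, after) in rows.items():
    print(base, f"{power_savings_percent(before, after):.2f}%")
```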

Table 2. Power, delay and area results in case of multi-vdd application in four-moduli RNS bases

| base | power before (mW) | power after (mW) | area (μm²) | delay (ns) | power savings |
|---|---|---|---|---|---|
| {16, 31, 2047, 1025∗} | 3.1598 | 2.282 | 12507.1519 | 2 | 27.79% |
| {32, 15, 511∗, 4097} | 3.1058 | 2.359 | 12056.0384 | 2 | 24.05% |
| {16, 31, 2047∗, 1025∗} | 3.1598 | 2.069 | 14390.0846 | 2 | 34.52% |
| {32, 511∗, 2047, 17} | 2.9866 | 2.240 | 11585.7168 | 2 | 25.01% |
| {16, 31∗, 2047, 1025∗} | 3.1598 | 2.152 | 12301.9007 | 2 | 31.90% |
| {32, 511∗, 2047∗, 17} | 2.9866 | 2.027 | 13468.6495 | 2 | 32.14% |
| {16, 31∗, 2047∗, 1025∗} | 3.1598 | 1.939 | 14184.8334 | 2 | 38.63% |
| {256∗, 31, 4095, 17} | 1.8238 | 1.682 | 7958.6976 | 2 | 7.77% |
| {16∗, 31, 2047, 1025∗} | 3.1598 | 2.247 | 12265.1311 | 2 | 28.89% |
| {32∗, 15, 511∗, 4097} | 3.1058 | 2.327 | 12082.3808 | 2 | 25.08% |
| {16∗, 31, 2047∗, 1025∗} | 3.1598 | 2.034 | 14148.0638 | 2 | 35.63% |
| {32∗, 511∗, 2047, 17} | 2.9866 | 2.208 | 11612.0592 | 2 | 26.08% |
| {16∗, 31∗, 2047, 1025∗} | 3.1598 | 2.117 | 12059.8799 | 2 | 33.00% |
| {32∗, 511∗, 2047∗, 17} | 2.9866 | 1.995 | 13494.9919 | 2 | 33.20% |

{16, 31, 2047, 1025∗} and {16, 31, 2047∗, 1025∗} achieve 27.79% and 34.52% power savings, respectively.

In Table 3, similar results are revealed in the case of five-moduli RNS multiply-add units. In particular the base {64, 31, 511∗, 17, 33}, which applies low-voltage supply to modulo-511, achieves a 23.66% power reduction, while the base {64, 31∗, 511∗, 17∗, 33} with three low-Vdd moduli channels, namely modulo-511, modulo-31 and modulo-17, achieves a 30.34% power reduction. Power savings range from 9.54% up to 38.03% in the case of the bases {512∗, 15, 31, 17, 257} and {16, 31∗, 63∗, 17∗, 1025∗}, respectively.

Table 4 presents the results for the binary FIR filter with 20-bit wordlength operands. It is shown that power consumption in the binary domain is 4.432mW while the maximum power result in the RNS domain is 3.373mW and 2.415mW in case of high-Vdd and low-Vdd supply voltage, respectively. Results reveal that multi-Vdd design is highly suited for RNS design of multiply-add units and hence for the implementation of low-power FIR VLSI filters.

Table 3. Power, delay and area results in case of multi-vdd application in five-moduli RNS bases

| base | power before (mW) | power after (mW) | area (μm²) | delay (ns) | power savings |
|---|---|---|---|---|---|
| {16, 31, 63, 17, 1025∗} | 3.3735 | 2.4955 | 13912.0799 | 1.67 | 26.03% |
| {64, 31, 127, 33∗, 65} | 2.1642 | 2.0944 | 10590.7424 | 1.30 | 3.23% |
| {16, 31, 63, 17∗, 1025∗} | 3.3735 | 2.4145 | 14082.7567 | 1.67 | 28.43% |
| {64, 31, 511∗, 17, 33} | 3.1573 | 2.4103 | 12844.1152 | 1.73 | 23.66% |
| {16, 31, 63∗, 17, 1025∗} | 3.3735 | 2.3016 | 13527.3711 | 1.67 | 31.77% |
| {64, 31, 511∗, 17∗, 33} | 3.1573 | 2.3293 | 13014.792 | 1.73 | 26.22% |
| {16, 31, 63∗, 17∗, 1025∗} | 3.3735 | 2.2206 | 13698.0479 | 1.67 | 34.18% |
| {64, 63∗, 127, 17, 65} | 2.2674 | 2.0735 | 10667.5744 | 1.47 | 8.55% |
| {16, 31∗, 63, 17, 1025∗} | 3.3735 | 2.3656 | 13706.8287 | 1.67 | 29.88% |
| {64, 63∗, 127, 17∗, 65} | 2.2674 | 1.9925 | 10838.2512 | 1.47 | 12.12% |
| {16, 31∗, 63, 17∗, 1025∗} | 3.3735 | 2.2846 | 13877.5055 | 1.67 | 32.28% |
| {64, 31∗, 511∗, 17, 33} | 3.1573 | 2.2804 | 12638.864 | 1.73 | 27.77% |
| {16, 31∗, 63∗, 17, 1025∗} | 3.3735 | 2.1717 | 13322.1199 | 1.67 | 35.62% |
| {64, 31∗, 511∗, 17∗, 33} | 3.1573 | 2.1994 | 12809.5408 | 1.73 | 30.34% |
| {16, 31∗, 63∗, 17∗, 1025∗} | 3.3735 | 2.0907 | 13492.7967 | 1.67 | 38.03% |
| {512∗, 15, 31, 17, 257} | 2.498 | 2.2598 | 11028.6848 | 2.23 | 9.54% |
| {16∗, 31, 63, 17, 1025∗} | 3.3735 | 2.46056 | 13670.0591 | 1.67 | 27.06% |
| {512∗, 15, 31, 17∗, 257} | 2.498 | 2.1788 | 11199.3616 | 2.23 | 12.78% |
| {16∗, 31, 63, 17∗, 1025∗} | 3.3735 | 2.37956 | 13840.7359 | 1.67 | 29.46% |
| {64∗, 31, 511∗, 17, 33} | 3.1573 | 2.3588 | 13127.8448 | 1.73 | 25.29% |
| {16∗, 31, 63∗, 17, 1025∗} | 3.3735 | 2.26666 | 13285.3503 | 1.67 | 32.81% |
| {32∗, 31, 511∗, 17∗, 65} | 3.275 | 2.41517 | 13784.7584 | 1.73 | 26.25% |
| {16∗, 31, 63∗, 17∗, 1025∗} | 3.3735 | 2.18566 | 13456.0271 | 1.67 | 35.21% |
| {256∗, 31∗, 127, 17, 33} | 2.0767 | 1.805 | 9426.7376 | 1.87 | 13.08% |
| {16∗, 31∗, 63, 17, 1025∗} | 3.3735 | 2.33066 | 13464.8079 | 1.67 | 30.91% |
| {256∗, 31∗, 127, 17∗, 33} | 2.0767 | 1.724 | 9597.4144 | 1.87 | 16.98% |
| {16∗, 31∗, 63, 17∗, 1025∗} | 3.3735 | 2.24966 | 13635.4847 | 1.67 | 33.31% |
| {64∗, 31∗, 511∗, 17, 33} | 3.1573 | 2.2289 | 12922.5936 | 1.73 | 29.40% |
| {16∗, 31∗, 63∗, 17, 1025∗} | 3.3735 | 2.13676 | 13080.0991 | 1.67 | 36.66% |
| {64∗, 31∗, 511∗, 17∗, 33} | 3.1573 | 2.1479 | 13093.2704 | 1.73 | 31.97% |

Table 4. Power, delay and area results for the binary 20-bit wordlength multiply-add unit with high-vdd supply voltage

| power (mW) | area (μm²) | delay (ns) |
|---|---|---|
| 4.432 | 19550.451 | 4.41 |

6 Conclusions

In this paper the low-power technique of multi-Vdd design has been applied to the design of multiply-add units in the residue number system. It is shown that the technique is well suited to RNS systems because the paths defined by the moduli channels are clearly distinguished, so the designer can easily define high- and low-voltage areas in the design.

Furthermore, binary and residue multiply-add units are quantitatively compared. RNS is shown to demonstrate substantial power savings due to the parallel structure of RNS and to the simple and effective application of the multi-Vdd design technique.

References

1. Basetas, C., Kouretas, I., Paliouras, V.: Low-Power Digital Filtering Based on the Logarithmic Number System. In: Azémard, N., Svensson, L. (eds.) PATMOS 2007. LNCS, vol. 4644, pp. 546–555. Springer, Heidelberg (2007)
2. Bayoumi, M.A., Jullien, G.A., Miller, W.C.: A VLSI implementation of residue adders. IEEE Transactions on Circuits and Systems 34, 284–288 (1987)
3. Bernocchi, G.L., Cardarilli, G.C., Re, A.D., Nannarelli, A., Re, M.: Low-power adaptive filter based on RNS components. In: ISCAS, pp. 3211–3214 (2007)
4. Cardarilli, G., Re, A.D., Nannarelli, A., Re, M.: Impact of RNS coding overhead on FIR filters performance. In: Proc. of 41st Asilomar Conference on Signals, Systems, and Computers (November 2007), http://www2.imm.dtu.dk/pubdb/p.php?5566
5. Cardarilli, G., Nannarelli, A., Re, M.: Reducing Power Dissipation in FIR Filters using the Residue Number System. In: Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems, vol. 1, pp. 320–323 (August 2000)
6. Efstathiou, C., Vergos, H.T., Dimitrakopoulos, G., Nikolos, D.: Efficient diminished-1 modulo 2^n + 1 multipliers. IEEE Transactions on Computers 54(4), 491–496 (2005)
7. Efstathiou, C., Vergos, H.T., Nikolos, D.: Modulo 2^n ± 1 adder design using select-prefix blocks. IEEE Transactions on Computers 52(11) (November 2003)
8. Hiasat, A.A.: High-speed and reduced area modular adder structures for RNS. IEEE Transactions on Computers 51(1), 84–89 (2002)
9. http://www.synopsys.com
10. Keating, M., Flynn, D., Aitken, R., Gibbons, A., Shi, K.: Low Power Methodology Manual: For System-on-Chip Design. Springer Publishing Company, Incorporated, Heidelberg (2007)
11. Kouretas, I., Paliouras, V.: Mixed radix-2 and high-radix RNS bases for low-power multiplication. In: Svensson, L., Monteiro, J. (eds.) PATMOS 2008. LNCS, vol. 5349, pp. 93–102. Springer, Heidelberg (2009)
12. Madhukumar, A.S., Chin, F.: Enhanced architecture for residue number system-based CDMA for high-rate data transmission. IEEE Transactions on Wireless Communications 3(5), 1363–1368 (2004)
13. Paliouras, V., Skavantzos, A., Stouraitis, T.: Multi-Voltage Low Power Convolvers Using the Polynomial Residue Number System. In: Proceedings of the 12th ACM Great Lakes Symposium on VLSI, GLSVLSI 2002, pp. 7–11. ACM, New York (2002)
14. Ramirez, J., Fernandez, P., Meyer-Base, U., Taylor, F., Garcia, A.: Index-Based RNS DWT architecture for custom IC designs. In: IEEE Workshop, Signal Processing Systems, pp. 70–79 (2001)
15. Ramirez, J., Garcia, A., Lopez-Buedo, S., Lloris, A.: RNS-enabled design. Electronics Letters 38, 266–268 (2002)
16. Soderstrand, M.A., Jenkins, W.K., Jullien, G.A., Taylor, F.J.: Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. IEEE Press, Los Alamitos (1986)
17. Stouraitis, T., Paliouras, V.: Considering the alternatives in low-power design. IEEE Circuits and Devices 17(4), 23–29 (2001)
18. Szabó, N., Tanaka, R.: Residue Arithmetic and its Applications to Computer Technology. McGraw-Hill, New York (1967)
19. Wang, Z., Jullien, G.A., Miller, W.C.: An algorithm for multiplication modulo (2^n + 1). In: Proceedings of 29th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, pp. 956–960 (1996)
20. Zimmermann, R.: Efficient VLSI implementation of modulo (2^n ± 1) addition and multiplication. In: Proceedings of the 14th IEEE Symposium on Computer Arithmetic, ARITH 1999, p. 158 (1999)

An On-Chip Flip-Flop Characterization Circuit

Abhishek Jain1, Andrea Veggetti2, Dennis Crippa2, and Pierluigi Rolandi2

1 STMicroelectronics Noida, India 2 STMicroelectronics Agrate, Italy [email protected], [email protected], [email protected], [email protected]

Abstract. The performance of a sequential digital circuit (speed, power consumption, etc.) depends upon the performance of the flip-flops used in the design. ASIC design flows use characterized data of flip-flops for final signoff. Therefore it is critical to know precisely the accuracy of the characterized data with respect to the actual behavior of flip-flops on silicon. An on-chip flip-flop characterization circuit (FCC) is presented here which gives an accurate estimation of various flip-flop parameters such as CP-Q delay, setup time, hold time and power consumption. The system consists of a digital controller and a characterization circuit, which are based upon a configurable oscillator that can be programmed to oscillate in different configurations or can be operated in functional mode for functional verification. The delay values are calculated by processing the value of the time period of the oscillator in different modes. The system was fabricated in 40nm CMOS technology and the flip-flop parameters are extracted from it.

Keywords: Flip-flop, CMOS, delay measurement, characterization, silicon validation, on-chip, setup-hold.

1 Introduction

Flip-flops and latches are the basic sequential logic elements used in ASIC design. These elements take a significant portion of the critical path timing in a high-speed digital circuit, and they also contribute heavily to the total system power, dynamic as well as static. The performance and complexity of modern designs make these components a vital part of the design. Therefore, there is a need to study the behavior of these components.

In general, the characteristics are measured using SPICE models and circuit simulators at the CAD level, and the data obtained are stored in different packaging formats. This data is used in the final signoff of the chip and thus needs to be validated against actual measured results on silicon. A direct off-chip measurement of the delay between waveforms of flip-flop/latch ports [1] can be used to validate the simulation models. However, an off-chip measurement approach has serious limitations, since the on-chip delays of flip-flops/latches in deep-submicron technologies are typically much smaller than that of the circuitry connecting the ports to the instrumentation. The measurement errors incurred by this circuitry can be comparable to the measured quantity. Other methods for on-chip delay estimation are the dummy path method and the ring oscillator method [2]. The dummy path method is again limited in accuracy since it is based upon off-chip measurements; the ring oscillator method, however, involves measurement of a square wave time period and gives accurate results. The ring oscillator method is very good for delay measurement of combinational cells and latches, but it is not well explored for the measurement of flip-flop parameters. Some other systems have also been proposed [3], [4], [5] involving complete characterization of flip-flops/latches, but they are based on multiple circuits for the characterization of different parameters.

In this paper we present a single on-chip measurement system for complete characterization of the sequential elements, which is based upon a ring oscillator configuration for estimation of data and clock-to-output delays and setup-hold timings, and a shift register configuration for estimation of power. In Section 2 the flip-flop/latch characteristics and parameters are explained, followed by the description of the measurement system in Section 3. Section 4 explains how the parameter extraction is done using the system of Section 3. Section 5 contains the measurement results based upon CAD simulations, the silicon results obtained from a test circuit implemented in 40nm CMOS technology, and the error analysis. Section 6 concludes the paper.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 41–50, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Sequential Element Characterization Parameters

In this section we describe the key parameters of a positive edge triggered D flip-flop circuit. These parameters are also valid for other configurations of sequential elements.

2.1 Timing Parameters

The functionality of the flip-flop circuit depends upon the time at which a change in the data input D of the flip-flop occurs with respect to the positive edge of the clock input CP. If the signal at the D input is stable within a window around the positive transition of the clock CP, then some time later the D value will propagate to the output Q of the flip-flop. As shown in Fig. 1, the time before the clock edge that the D input has to be stable is called the setup time (ts) and the time after the clock edge that the D input has to be stable is called the hold time (th). The delay from the positive clock input to the new value of the Q output is called the clock-to-Q delay or propagation delay (tCP-Q) [6].

Timing verification tools issue a timing violation if the data input D changes inside the window of setup and hold time as described above. This is a case of failure of the flip-flop since the flip-flop circuit could enter a metastable state. In Fig. 2, the clock-to-Q delay is plotted with respect to the time difference between the data and clock inputs of the flip-flop. For large values of delay between data and clock the clock-to-Q delay is constant, but as the delay approaches the setup and hold time window the clock-to-Q delay starts increasing, since internally the flip-flop circuit takes more time to resolve its state. There exists a failure window wherein a change in the data input does not have any effect on the flip-flop output. The setup and hold times are therefore defined at the point where the slope of the curve is equal to 1 [3].

In the presented measurement system, we have exploited this relation of clock-to-Q delay with data and clock input delay to measure the timing parameters. The clock-to-Q delay is measured when it is constant and the setup/hold times are measured at the points defined in Fig. 2.

Fig. 1. D Flip-flop Timing Parameters

2.2 Dynamic Power

Flip-flops are used in a wide variety of circuits targeting different applications where the data rates could be different. Therefore, it is important to study the power consumption of the flip-flop with respect to the switching activity of the data input, or data rate (which also results in a change in the output state). Here, dynamic power is measured with respect to different data rates at a constant clock frequency.

2.3 Static Power

As leakage power has become quite significant in submicron technologies, it is also important to know what current the flip-flop draws in the inactive state. Leakage power estimation is also useful for the case of retention flip-flops, which are used in power-down applications. Here the leakage power of the flip-flop can be measured under different configurations of inputs and outputs.


Fig. 2. Clock-to-Q delay v/s Delay between Data and Clock Inputs of D Flip-flop [3]

3 Measurement System

The measurement system consists of two main blocks, the controller circuit and the characterization circuit (FCC). The controller circuit is based upon a digital state machine generating the control signals for the FCC to operate in different configurations. The FCC can be made to operate with or without the controller circuit.

Fig. 3. FCC BASECELL Circuit Diagram

3.1 Characterization Circuit (FCC)

The FCC is a purely digital circuit which can be implemented using a basic standard cell library. It is based upon N stages of FCC BASECELL units as shown in Figure 4. The Basecell circuit consists of MUXes, programmable delay cells PDD and PDC, and the DUT (Device Under Test, which in the present case can be any D flip-flop) connected as shown in Figure 3. The signals to the clock and data inputs of the DUT can be configured through the select lines of the 4X1 MUXes, and their respective path delays can be varied through the PDC and PDD cells. The output of the Basecell can also be programmed to select the output of the DUT, the D input of the DUT or the CP input of the DUT. Depending upon the mode of operation, these inputs and the output can be configured accordingly, either by the controller circuit or by external IO.

The PDD and PDC cells, used in the data and clock path respectively, are based upon the programmable delay cell circuit shown in Figure 5. These cells are used to introduce delay between the data and clock inputs of the DUT for timing measurements. The PDD and PDC cells are made of different drives of the BUF cell, which form a vernier delay line between the clock and data paths, selectable through the SDD and SDCP select lines. The select lines are chosen so as to have a minimum delay difference between the two. The delay introduced by these cells can be characterized in the oscillator mode of the system, which is explained later. These two blocks are implemented with a full custom flow, in order to have minimum delay variation between different cells. To minimize the variation in delay due to different rise and fall delays of the cells in PDC and PDD, for every even-stage Basecell the positive edge of the signal is propagated and for every odd-stage Basecell the negative edge of the signal is propagated through the PDD and PDC cells.

The DUT in the circuit is connected to a different power domain, which is done by separating the rail connection of the DUT from the rest of the circuit and connecting it to a different power supply. The number of stages N of the system is limited by the minimum current measurement value of the tester. The N flip-flops should be able to produce a leakage current of that order.
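The constraint on the number of stages can be read as a simple inequality: N times the leakage current of one flip-flop should be at least the smallest current the tester can resolve. The following Python sketch shows that reading under stated assumptions; the function name and the numeric values are hypothetical and are not taken from the paper, and currents are expressed as integer picoamperes to avoid floating-point rounding in the ceiling operation.

```python
def min_stages(tester_min_current_pa, leakage_per_flip_flop_pa):
    """Smallest N such that N * leakage_per_flip_flop >= tester_min_current.

    Both currents are integers in picoamperes; -(-a // b) is an integer ceiling.
    """
    if leakage_per_flip_flop_pa <= 0:
        raise ValueError("leakage current must be positive")
    return -(-tester_min_current_pa // leakage_per_flip_flop_pa)


# Hypothetical numbers: tester resolves 1 uA, each DUT leaks 30 nA.
print(min_stages(1_000_000, 30_000))
```

In practice the leakage of a flip-flop varies strongly with voltage and temperature, so such an estimate would be made for the lowest-leakage corner of interest.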

Fig. 4. FCC Characterization Circuit

Fig. 5. Programmable Delay Cell Circuit Diagram (PDD and PDC)

3.2 Characterization Circuit Configurations

The system is based upon two different configurations, Oscillator and Shifter. The Oscillator configuration is used for extraction of timing parameters, and the Shifter configuration is used for extraction of static and dynamic power and for functional verification.

Oscillator Configuration:- In this configuration the inputs and output of the Basecell are configured to form a ring oscillator. The oscillator can be configured in three different modes to include or exclude the delay of certain paths.

(a) The delay of the clock path is characterized in this mode. The output BOUT of the Basecell passes the signal at the CP input of the DUT to the next stage Basecell. The delay of a single unit equals 1/(2*N*Frequency of Oscillation at System Output).

(b) The delay of the clock path and the clock-to-Q path of the DUT is characterized in this mode. The select lines for the MUXes are set to send the signal at the Q output of the DUT to the CP input of the next stage DUT. Here, a single edge (rise or fall) is propagated through N stages and the DUTs are reset (for rise delay measurement) or set (for fall delay measurement). The delay of a single unit equals 1/(N*Frequency of Oscillation at System Output).

(c) The delay of the data path is characterized in this mode. The BOUT output passes the signal at the D input of the DUT to the next stage. The delay of a single unit equals 1/(2*N*Frequency of Oscillation at System Output).

Shifter Configuration:- In this configuration the Clock input of DUT is controlled with external clock signal and its Q output goes to the D input of next stage cell. In this way signal available at the D input of first stage is available at Q output of Nth Stage after N clock cycles. This configuration is useful for dynamic and leakage power estimation.
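The behavior of the Shifter configuration can be illustrated with a short behavioral model in Python. This is a sketch only: the function name is hypothetical, all stages are assumed to start at 0, and every list element stands for one clock edge applied to all N DUTs at once.

```python
def shift_register(data_in, n_stages):
    """Return the Q output of the last stage after each clock edge.

    On every clock edge each stage captures the Q value of its predecessor,
    and the first stage captures the external data bit.
    """
    stages = [0] * n_stages
    last_q = []
    for bit in data_in:
        stages = [bit] + stages[:-1]
        last_q.append(stages[-1])
    return last_q


data = [1, 0, 1, 1, 0, 0]
out = shift_register(data, n_stages=3)
print(out)
```

With three stages, the bit applied at the first clock edge appears at the last Q output at the third edge, so `out[2:]` reproduces the beginning of `data`, in line with the N-cycle latency described above.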

4 Measurements

4.1 Clock-to-Q Delay Measurement

The circuit is operated in the Oscillator configuration in modes (a) and (b) as explained above. Here, the setup and hold constraints of the DUT are respected in order to have a stable value of clock-to-Q delay. The clock-to-Q delay value is given by (b)-(a).
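The per-stage delay expressions of the three oscillator modes and the difference (b)-(a) can be written as a small Python sketch. The function names and the frequencies used in the example are hypothetical; only the formulas 1/(2*N*f) for modes (a) and (c) and 1/(N*f) for mode (b) come from the text.

```python
def unit_delay(freq_hz, n_stages, mode):
    """Delay of a single Basecell from the oscillation frequency at the system output.

    Modes 'a' and 'c': 1 / (2 * N * f).  Mode 'b': 1 / (N * f).
    """
    if mode in ("a", "c"):
        return 1.0 / (2 * n_stages * freq_hz)
    if mode == "b":
        return 1.0 / (n_stages * freq_hz)
    raise ValueError("mode must be 'a', 'b' or 'c'")


def clock_to_q(freq_a_hz, freq_b_hz, n_stages):
    """Clock-to-Q delay as the per-stage delay of mode (b) minus that of mode (a)."""
    return unit_delay(freq_b_hz, n_stages, "b") - unit_delay(freq_a_hz, n_stages, "a")


# Hypothetical frequencies for a 100-stage chain.
print(unit_delay(50e6, 100, "a"))   # clock path delay per stage, in seconds
print(clock_to_q(50e6, 40e6, 100))  # clock-to-Q delay, in seconds
```

Because only frequencies are measured, the same two functions also give the clock-to-data delay of Section 4.2 as `unit_delay(f_c, N, "c") - unit_delay(f_a, N, "a")`.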

4.2 Setup and Hold Time Measurement

The circuit is operated in the oscillator configuration in modes (a), (b) and (c) as explained in Section 3.2. The data path selects the clock path signal to pass through instead of the signal coming to the 4X1 data MUX. The three measurements are performed for all combinations of polarity and delays of the clock and data paths. The clock-to-Q delay value is given by (b)-(a) and the delay between the clock and data signals is given by (c)-(a). These values are plotted and optimized setup and hold time values are extracted from the graph as explained in Section 2.
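One possible way to extract the setup or hold point from the measured pairs is to scan the curve from the stable region toward the failure window and stop at the first segment whose slope magnitude reaches 1, following the definition in Section 2.1. The Python sketch below shows this under stated assumptions: the function name, the finite-difference slope estimate, the midpoint convention and the sample data are all illustrative and are not taken from the paper.

```python
def unity_slope_point(skew_ps, tcq_ps):
    """Locate where |d(tCP-Q)/d(skew)| first reaches 1.

    skew_ps: data-to-clock delays, ordered from the stable region toward
             the failure window.
    tcq_ps:  clock-to-Q delay measured at each skew value.
    Returns the midpoint of the first segment with slope magnitude >= 1,
    or None if the curve never becomes that steep.
    """
    for i in range(1, len(skew_ps)):
        slope = (tcq_ps[i] - tcq_ps[i - 1]) / (skew_ps[i] - skew_ps[i - 1])
        if abs(slope) >= 1.0:
            return 0.5 * (skew_ps[i] + skew_ps[i - 1])
    return None


# Hypothetical measurement sweep (values in ps).
skew = [200, 150, 100, 80, 70, 60]
tcq = [132, 132, 134, 140, 152, 180]
print(unity_slope_point(skew, tcq))
```

The resolution of such an estimate is bounded by the step of the programmable delay cells, so a finer vernier step directly improves the extracted setup and hold values.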

4.3 Dynamic Power Measurement

The circuit is operated in the Shifter configuration. Data with different activity rates with respect to the clock frequency is passed through the shifter, and power measurements are performed for the two power supplies, i.e., the one supplying power to the DUTs and the one supplying power to the rest of the circuit.

4.4 Leakage Power Measurement

The circuit is again operated in the Shifter configuration. The DUTs are first fixed to a constant state and then leakage measurements are performed on the two power supplies.

4.5 Sources of Error and Improvements

The main sources of error in the timing measurements at the circuit level are the different path delays of the MUXes used in the PDD and PDC cells, and the difference in rise and fall delays of the cells used in the circuit. These errors can be minimized using the implementation method suggested in Section 3.1 but cannot be eliminated completely.

For the power measurements, since the power domain of the DUT is separated from the rest of the interface and control circuit, the results show the actual power dissipation without any external component. However, error in this case can be introduced by the measurement apparatus used for current measurement, since such apparatus has a limitation on the minimum measurable value. To overcome this limitation, a sufficient number of Basecell stages should be put in the circuit, especially in the case of static currents.

Further, the present measurement system targets characterization at a particular load and slope only. In order to characterize at different loads and clock transitions, additional MUX stages could be added at the output of the DUT and at the clock input of the DUT, which would give the programmability for selecting different loads and clock signal slopes.

5 Measurement Results

5.1 CAD Results

The analysis of the complete system is done at CAD level using the XA simulator and device models from a 40nm CMOS technology process. The circuit is implemented with a tristate buffer master-slave D flip-flop circuit [7] as DUT, and 100 stages of Basecell have been used to make the complete system. The simulation results shown are based on typical models. The misalignment between measured and actual values of clock-to-Q delay shown in Fig. 6 and Fig. 8 is due to the error introduced by the different path delays of the MUX lines for the different values of the selection inputs required for enabling oscillation in different modes, and to the difference in rise and fall delays of internal cells. The estimated error due to the MUXes is approximately 8-10ps and that due to the difference in rise and fall delay is 5-7ps, as obtained from the characterized library database. The measured hold time at 1V and 25C is around 5ps and the setup time is around 70ps. The measured clock-to-Q delay is 132ps. More analysis across different PVT corners is required for complete validation of the circuit.

The power values shown in Fig. 7 and Fig. 9 are obtained by calculating the average current flowing through the power supply of the DUT. For dynamic power the circuit is operated in shift register mode, wherein the input data rate is varied w.r.t. the clock frequency; for the static measurement, the average of the static current over different clock, data and output configurations is plotted. The power values give the actual power through the DUT excluding the power dissipation in the interface circuit and are therefore expected to be accurate.

Fig. 6. Clock-to-Q Delay (ps) vs. Clock-Data Path Delay (ps) for Hold Time Estimation. Plotted series: TCP-Q (Measured) and TCP-Q (Actual).

Fig. 7. Dynamic Current in amps through hundred DUT stages vs. Data Activity Rate w.r.t. Clock at 1V, 25C and 10MHz Clock Frequency. Plotted series: DUT Power.

Fig. 8. Clock-to-Q Delay (ps) vs. Clock-Data Path Delay (ps) for Setup Time Estimation. Plotted series: TCP-Q (Measured) and TCP-Q (Actual).

Fig. 9. Leakage Current in amps through hundred DUT stages vs. Applied Voltage (0.9V, 1.0V, 1.2V) at 150C. Plotted series: DUT Power.

Fig. 10. Mercury_C40LP (3mm X 3mm). Block labels in the figure: Low Power Block 1, Low Power Block 2, Ring Oscillator Structures, Ultra Low Voltage IPs, Fourtune Memory Cuts, Fourtune ALLCELL structures, BISC for Access Time Characterization.

5.2 Silicon Results

A subset of the system, for the measurement of the clock-to-Q delay of a tri-state latch based master-slave D flip-flop circuit [6], is implemented on the Mercury test chip in a 40nm CMOS process from SAMSUNG. The results, extracted across different voltages and temperatures on multiple dies at package level, are shown in Fig. 11. At lower voltages there is a larger misalignment between the CAD and silicon values, which is due to model misalignment. The average error is around 12% at lower voltages and reduces to 2% towards the higher-voltage side.
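The average error figures quoted above can be computed along the following lines. This is a sketch under an assumption: the silicon value is taken as the reference for each percentage, which the text does not state explicitly. The delay values in the example are hypothetical.

```python
def mean_percent_error(cad_values, silicon_values):
    """Mean absolute percentage difference between CAD and silicon data,
    using the silicon measurement as the reference (an assumption)."""
    if not cad_values or len(cad_values) != len(silicon_values):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [100.0 * abs(c - s) / s for c, s in zip(cad_values, silicon_values)]
    return sum(errors) / len(errors)

# Hypothetical clock-to-Q delays in picoseconds for two conditions.
err = mean_percent_error([110.0, 90.0], [100.0, 100.0])
print(err)
```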

Fig. 11. CAD vs. Silicon Results for Clock-to-Q Delay (sec). Plotted series: CP-Q Rise Arc Silicon, CP-Q Fall Arc Silicon, CP-Q Rise Arc CAD and CP-Q Fall Arc CAD, at T = -40, 25 and 125 and V = 0.90, 1.00 and 1.10.

6 Conclusion

An accurate on-chip measurement system has been presented for the characterization of flip-flops and latches; it is also useful for SPICE model validation and for the comparative analysis of different structures. Silicon results obtained from a 40nm CMOS process test chip have been presented for a subset of the measurement apparatus, and they validate the principle of measurement. The analysis of the complete system has been shown at the CAD level based on SPICE simulations, and is to be further validated on silicon. The silicon and CAD results show that the measurement apparatus gives accurate results for delay and power, and that the error in the measurements is within acceptable limits. The system could be improved further to support characterization at different output loads and clock transitions.

References

[1] Nikolic, B., et al.: Improved sense-amplifier-based flip-flop: Design and measurements. IEEE J. Solid-State Circuits 35, 876–884 (2000)
[2] Singh, A.P., Panwar, N.S., et al.: On Silicon Timing Validation of Digital Logic Gates - A Study of Two Generic Methods. In: 25th International Conference on Microelectronics, pp. 424–427 (2006)
[3] Nedovic, N., et al.: A Test Circuit for Measurement of Clocked Storage Element Characteristics. IEEE Journal of Solid State Circuits 39(8) (August 2004)
[4] Rosenberger, F., et al.: Flip-flop Resolving Time Test Circuit. IEEE Journal of Solid State Circuits SC-17(4) (August 1982)
[5] Veggetti, A., et al.: Random sampling for on-chip characterization of standard-cell propagation delay. In: Fourth International Symposium on Quality Electronic Design, pp. 41–45 (2003)
[6] Weste, N., Eshraghian, K.: Principles of CMOS VLSI Design, pp. 317–324. Pearson Education Asia
[7] Yuan, J., et al.: New Single-Clock CMOS Latches and Flipflops with improved Speed and Power Savings. IEEE Journal of Solid State Circuits 32(1), 62–69 (1997)

A Low-Voltage Log-Domain Integrator Using MOSFET in Weak Inversion

Lida Ramezani

Electrical & Computer Engineering Dept., Ryerson University, George Vari Engineering and Computing Center, 245 Church St., Toronto, Ontario, Canada, M5B 2K3 [email protected], [email protected]

Abstract. In this paper, a low-voltage integrator circuit using MOSFETs in the sub-threshold region is presented. The integrator is a current-mode log-domain circuit. The EKV MOSFET model is used for sub-threshold region simulations, with model parameters of IBM CMOS 130nm technology. The integrator works with a single 500mV supply voltage, and its input current range is as high as the bias current of the input transistor. According to CADENCE simulation results for a 1pF integrating capacitor and a bias current of 20nA, the cutoff frequency is 113.4 kHz and the power consumption is 45.44nW. The integrator's cutoff frequency is tuned from 1.083 kHz to 1.023MHz by varying the integrator capacitor value in the range of 10pF-0.1pF.

Keywords: Nonlinear electronics; Sub-threshold CMOS; Log-domain Integrator; Companding method; low voltage; low power.

1 Introduction

Low power integrated filters are required in portable systems such as telecommunication receivers and implanted biomedical integrated circuits. Transconductor-capacitor (Gm/C) filters are a kind of current-mode active filter that can be used over a wide range of frequencies, from a few Hz in biomedical systems to several MHz in the baseband or IF part of telecommunication receivers. In active Gm/C filters, passive inductors are replaced by active gyrator-C circuits. Active filters occupy a smaller silicon area than passive filters. The pass-band gain, cutoff and centre frequency, and quality factor of active filters are easily tuned, and higher quality factors can be achieved. However, active filters consume power and have a limited dynamic range. In most applications, the goal is to design low-voltage, low-power active filters with sufficient dynamic range and bandwidth.

Low-voltage and low-current techniques are used in low power circuits. Rail-to-rail designs, supply multipliers, multistage circuit designs and bulk-driven transistors are among the low-voltage strategies. Adaptive biasing and sub-threshold biasing are low-current design methods [1]. In [2], continuous-time low-voltage current-mode filters are discussed.

Low voltage circuits suffer from dynamic range limitations. The maximum input signal is limited to the linear range of the input circuit, and the minimum acceptable input signal is limited by the noise level. The input signal should be several times smaller than the bias level to reduce the harmonic distortion caused by the nonlinearity of the input circuit. At the same time, the input noise level should be kept as low as possible. A higher dynamic range requires a larger bias level, which causes larger power consumption. There are several linearization techniques, such as source degeneration, nonlinear term cancellation, adaptive biasing and class AB implementation. In these linearization methods, several transistors are added to the circuit. Each transistor adds several parasitic capacitors and further limits the bandwidth. The power consumption also increases with the transistor count. Since we intend to design a high-frequency, low-power circuit, we need simple circuits with fewer transistors.

In companding theory, externally linear, internally nonlinear (ELIN) circuits are used to improve the dynamic range. The companding method is useful for improving the dynamic range with fewer transistors [1]. The companding method is used in log-domain circuits, in which trans-linear devices are the key elements.

In this paper, a low-voltage, current-mode, log-domain integrator using MOSFETs biased in the sub-threshold region is presented. In part 2, the CMOS transistor in sub-threshold or weak inversion mode is discussed and used as a trans-linear element. The companding method and log-domain filters are also introduced in part 2. In part 3, the MOSFET realization of a first order log-companding filter or integrator is presented and CADENCE simulation results are given. Finally, comparison and conclusions are given in part 4.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 51–61, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 MOSFET Biased in Weak Inversion as a Trans-linear Element

In this part, the behavior of the MOSFET in weak inversion is reviewed. Then the trans-linear element and the trans-linear principle are described, and a trans-linear loop using MOSFETs in sub-threshold is presented. The companding method and log-domain filters are also introduced. These concepts and definitions are used in the log-domain MOSFET integrator circuit described in part 3.

2.1 MOSFET in Weak Inversion

When the gate-source voltage of a MOS transistor is less than the threshold voltage but high enough to create a depletion region at the surface of the silicon, the device operates in weak inversion. This is called the sub-threshold region, and the MOS transistor has an exponential voltage-current characteristic. The drain current in the weak inversion or sub-threshold region is given in (1) [3].

$$I_D = I_t \frac{W}{L} \exp\left(\frac{V_{GS} - V_{th}}{n V_T}\right)\left[1 - \exp\left(-\frac{V_{DS}}{V_T}\right)\right] \qquad (1)$$

In (1) W and L are transistor channel width and length respectively. Ispec=It×(W/L) is called specific current and depends on physical parameters and technology. Specific current relation is given in (2). [4]

$$I_{spec} = I_t \frac{W}{L} = 2 n \mu C_{ox} \frac{W}{L} V_T^2 = 2 n \beta V_T^2 \qquad (2)$$

VGS is the gate-to-source voltage, VDS is the drain-to-source voltage, Vth is the threshold voltage and VT is the thermal voltage, i.e. 25mV at room temperature. When VDS>>3VT, the drain current is independent of VDS. The drain current in sub-threshold is less than It×(W/L) [3]. In weak inversion, there is a voltage divider between the oxide capacitance (Cox) and the depletion region capacitance (Cjs). In (1), n is the coefficient of this voltage divider, as given in (3).

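$$n = 1 + \frac{C_{js}}{C_{ox}} \approx 1.5 \qquad (3)$$

The MOSFET trans-conductance gain in weak inversion is given in (4), and the transition frequency in weak inversion is given in (5) [3].

$$g_m = \frac{\partial I_D}{\partial V_{GS}} = \frac{I_D}{n V_T} \qquad (4)$$

$$f_T = \frac{1}{2\pi} \frac{I_D}{V_T} \frac{1}{W L C_{js}} \qquad (5)$$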
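The weak-inversion relations (1), (3), (4) and (5) can be checked numerically. The sketch below is illustrative: the threshold voltage, the specific current and the bias point are assumed example values, not parameters of the 130nm process used in this paper.

```python
import math

V_T = 0.025  # thermal voltage (V), the room-temperature value quoted in the text

def drain_current(v_gs, v_ds, v_th=0.3, i_spec=100e-9, n=1.5):
    """Equation (1). i_spec stands for I_t*(W/L); v_th, i_spec and n are
    assumed example values."""
    gate_term = math.exp((v_gs - v_th) / (n * V_T))
    drain_term = 1.0 - math.exp(-v_ds / V_T)
    return i_spec * gate_term * drain_term

def slope_factor(c_js, c_ox):
    """Equation (3): n = 1 + C_js / C_ox."""
    return 1.0 + c_js / c_ox

def transconductance(i_d, n=1.5):
    """Equation (4): g_m = I_D / (n * V_T)."""
    return i_d / (n * V_T)

def transition_frequency(i_d, w, l, c_js):
    """Equation (5); c_js is a capacitance per unit area (F/m^2)."""
    return i_d / (2.0 * math.pi * V_T * w * l * c_js)

# Compare (4) with a central-difference derivative of (1) at an assumed bias point.
v_gs, v_ds, dv = 0.25, 0.4, 1e-6
i_d = drain_current(v_gs, v_ds)
gm_numeric = (drain_current(v_gs + dv, v_ds)
              - drain_current(v_gs - dv, v_ds)) / (2.0 * dv)
gm_formula = transconductance(i_d)

# Fraction of the saturated current already reached at V_DS = 3 * V_T.
saturation_ratio = drain_current(v_gs, 3.0 * V_T) / drain_current(v_gs, 1.0)
print(gm_numeric, gm_formula, saturation_ratio)
```

The saturation ratio equals 1 - exp(-3), about 0.95, which shows why the drain current is treated as independent of VDS once VDS exceeds a few thermal voltages.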

2.2 Trans-linear Principle

A trans-linear element is a physical device whose trans-conductance gain is linearly related to the current through the device. In trans-linear elements, the current depends exponentially on the controlling voltage. Considering (1) and (4), a MOSFET biased in the sub-threshold region is a trans-linear element. A closed loop containing an equal number of oppositely connected trans-linear elements is called a trans-linear loop. According to the trans-linear principle [2], in a trans-linear loop the product of the current densities in the elements connected in the clockwise (CW) direction is equal to the corresponding product for the elements connected in the counter-clockwise (CCW) direction.

$$\prod_{n \in CW} I_n = \prod_{m \in CCW} I_m \qquad (6)$$

A CMOS trans-linear loop composed of MOS transistors biased in weak inversion is shown in Fig.1. The relation between the transistor drain currents in Fig.1 is given in (7).

$$i_{D1} \times i_{D2} = i_{D3} \times i_{D4} \qquad (7)$$
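The trans-linear relation (7) follows from summing gate-source voltages around the loop when every device obeys the exponential law (1) in saturation. The sketch below assumes four identical transistors; the slope factor, threshold voltage and specific current are hypothetical values chosen only for illustration.

```python
import math

V_T = 0.025      # thermal voltage (V)
N = 1.5          # assumed slope factor
V_TH = 0.3       # assumed threshold voltage (V)
I_SPEC = 100e-9  # assumed I_t*(W/L) (A), identical for all four devices

def gate_voltage(i_d):
    """Invert (1) in saturation: gate-source voltage for a drain current."""
    return V_TH + N * V_T * math.log(i_d / I_SPEC)

def saturated_current(v_gs):
    """Saturated weak-inversion drain current for a gate-source voltage."""
    return I_SPEC * math.exp((v_gs - V_TH) / (N * V_T))

def fourth_loop_current(i_d1, i_d2, i_d3):
    """Kirchhoff's voltage law around the loop,
    V_GS1 + V_GS2 = V_GS3 + V_GS4, fixes V_GS4 and hence i_D4."""
    v_gs4 = gate_voltage(i_d1) + gate_voltage(i_d2) - gate_voltage(i_d3)
    return saturated_current(v_gs4)

i_d4 = fourth_loop_current(10e-9, 20e-9, 5e-9)
print(i_d4, 10e-9 * 20e-9 / 5e-9)
```

The threshold voltage and the specific current cancel out of the result, which is why the relation holds regardless of their exact values as long as the devices match.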

2.3 Companding Method and Log-Domain Filters

In the companding method, compressor and expander circuits are used. The compressor circuit compresses the dynamic range of the input; it amplifies weak signals so that they can be transmitted with noise immunity. The expander circuit expands the dynamic range; it reduces the amplitude of the amplified signals and thus of the noise picked up during transmission [1]. The logarithm is a compressor function and the exponential is an expander function. The block diagram of a companding circuit is shown in Fig.2.

Fig. 1. A trans-linear loop with CMOS in sub-threshold

Fig. 2. Block diagram of a companding circuit

In 1990, Seevinck invented a circuit using bipolar junction transistors (BJTs) and called it a companding current-mode integrator. That circuit was effectively a first-order log-domain filter [5]. In a log-domain integrator, currents with an inherently large dynamic range are compressed logarithmically when transformed into voltages (prior to integration on a capacitor) and expanded exponentially afterwards when transformed back into currents [6]. Companding can be used in filters to enable supply voltage reduction without degrading the signal-to-noise ratio [6].

Log-domain filters are a type of externally linear, internally nonlinear (ELIN) filter. Log-domain and companded filter synthesis methods are discussed in [7]. Log-domain filters have the advantages of reduced circuit complexity, wider bandwidth, wider dynamic range and lower power consumption [7]. Different types of log-domain filters, including Class A, Class AB and syllabic companding, are described in [7]. One filter synthesis method is cascading, in which first order and second order building blocks are used. An integrator is a first order filter; in part 3, the design and simulation results of a first order log-companding filter using MOSFETs in weak inversion are given, which is a low-voltage, low-power integrator.

3 Circuit Design and Simulation Results

In this section, the CMOS realization of a log-domain integrator and its transfer function are presented. Then CADENCE simulation results using the EKV MOSFET model in the sub-threshold or weak inversion region are given.

3.1 Circuit Design

The MOSFET realization of a log-domain integrator (first order filter) that uses MOSFET transistors biased in sub-threshold region is shown in Fig. 3.

Fig. 3. MOSFET realization of CMOS log-domain integrator with ideal current sources

In Fig.3, M1 is used as a log compressor that converts the input current to the compressed voltage VGS1, M2 is a level shifter, M3 and C are the integrator core elements, and M4 is the expander transistor. M1, M2, M3 and M4 form a trans-linear loop, and according to (6) the relationship between their drain currents is given in (8).

$$\left(i_{in}(t) + I_1\right) I_2 = \left(i_C(t) + I_3\right) i_{out}(t) \qquad (8)$$

The capacitor voltage is equal to VGS4, i.e. the gate-source voltage of M4. M4 is biased in sub-threshold and, according to (1), VGS4 is a logarithmic function of the drain current of M4, as given in (9).

$$v_C(t) = V_{GS4} = n V_T \ln\left(\frac{i_{out}(t)}{I_t (W/L)}\right) + V_{th} \qquad (9)$$

The capacitor current is given in (10).

$$i_C(t) = C \frac{dv_C(t)}{dt} = \frac{C n V_T}{i_{out}(t)} \frac{di_{out}(t)}{dt} \qquad (10)$$

From (8), (9) and (10), the first order differential equation relating the input current and the output current follows, as shown in (11).

$$\left(i_{in}(t) + I_1\right) I_2 = \left(\frac{C n V_T}{i_{out}(t)} \frac{di_{out}(t)}{dt} + I_3\right) i_{out}(t) \qquad (11)$$

Dividing (11) by I3 gives the first order differential equation of the circuit in Fig.3, shown in (12).

$$\frac{C n V_T}{I_3} \frac{di_{out}(t)}{dt} + i_{out}(t) = \frac{I_2}{I_3}\left(i_{in}(t) + I_1\right) \qquad (12)$$

The integrator transfer function is given in (13), and its cutoff frequency and pass-band gain (kPB) are given in (14) and (15).

$$H(s) = \frac{i_{out}(s)}{i_{in}(s)} = \frac{k_{PB}}{1 + s/\omega_0} \qquad (13)$$

$$\omega_0 = \frac{I_3}{C n V_T} \qquad (14)$$

$$k_{PB} = \frac{I_2}{I_3} \qquad (15)$$

For a higher cutoff frequency, a smaller capacitor (C) and a larger bias current (I3) are needed. The cutoff frequency can be tuned by changing the bias current (I3) and the capacitor value (C).
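The externally linear, internally nonlinear behavior expressed by (8) to (12) can be illustrated by integrating the nonlinear capacitor-voltage dynamics and comparing the result with the step response of the linear equation (12). This is only a behavioral sketch with ideal devices: the slope factor, threshold voltage and specific current are assumed values, and forward Euler integration is used for simplicity.

```python
import math

V_T = 0.025           # thermal voltage (V)
N = 1.5               # assumed slope factor
V_TH = 0.3            # assumed threshold voltage (V)
I_SPEC = 100e-9       # assumed I_t*(W/L) of the expander M4 (A)
C = 1e-12             # integrating capacitor (F)
I1 = I2 = I3 = 20e-9  # bias currents (A)

TAU = C * N * V_T / I3  # time constant, the reciprocal of omega_0 in (14)

def expander_current(v_c):
    """Exponential expander: output current for a capacitor voltage."""
    return I_SPEC * math.exp((v_c - V_TH) / (N * V_T))

def simulate_step(i_in, i_out_start, t_end, steps=20000):
    """Forward-Euler integration of the capacitor voltage for a constant
    input current; returns the output current at t_end."""
    v_c = V_TH + N * V_T * math.log(i_out_start / I_SPEC)  # equation (9)
    dt = t_end / steps
    for _ in range(steps):
        i_out = expander_current(v_c)
        i_c = (i_in + I1) * I2 / i_out - I3  # equation (8) solved for i_C
        v_c += dt * i_c / C                  # equation (10)
    return expander_current(v_c)

def linear_step(i_in, i_out_start, t_end):
    """Closed-form step response of the linear equation (12)."""
    i_final = (I2 / I3) * (i_in + I1)
    return i_final + (i_out_start - i_final) * math.exp(-t_end / TAU)

# Step the input from 0 to 10 nA, starting from the quiescent output of 20 nA.
i_nonlinear = simulate_step(10e-9, 20e-9, TAU)
i_linear = linear_step(10e-9, 20e-9, TAU)
print(i_nonlinear, i_linear)
```

Although the capacitor voltage follows a nonlinear trajectory, the output current tracks the first order linear response, which is the property that (12) states.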

In Fig.4, the MOSFET log-domain integrator with non-ideal current sources is shown. M1, M2, M3 and M4 are the log-domain integrator elements described in Fig.3, and M5, M6 and M7 are current source transistors mirrored from the main bias current branch, which is composed of M9, M10 and RBIAS. M8 is an active load. All transistors except M8 are biased in sub-threshold.

3.2 CADENCE Simulation Results

In this section, CADENCE simulation results of the circuit shown in Fig.4 are given. The EKV model, which is a precise model for MOSFETs in weak inversion, is used in CADENCE.

Fig. 4. CMOS log-domain integrator circuit

All transistors have the minimum size, with a channel length of 480nm and a channel width of 120nm. The bias currents of the non-ideal current sources in Fig.4 are 20nA, provided by M9-M10 with a 50KΩ bias resistor (RBIAS). The supply voltage is 500mV and, according to the simulation results, all transistors are biased in region 3, i.e. the sub-threshold region. The input signal is a sine wave current with a frequency of 1 kHz and an amplitude of 20nA.

The maximum amplitude of the input signal is equal to the bias current, and it should be less than It×(W/L) to keep the input transistor in sub-threshold. For a higher input range, a larger bias current is needed for the input (compressor) transistor, and the power consumption increases. A larger compressor transistor ratio (W1/L1) is also needed to stay in sub-threshold, which increases the parasitic capacitances of the MOSFET. According to (5), the transition frequency of the transistor decreases as its size increases, so the maximum applicable cutoff frequency of the integrator circuit decreases. Naturally, a trade-off between power consumption, bandwidth and input range exists, but in this nonlinear integrator the maximum input range is as high as the compressor transistor bias current, and small-signal limitations and distortion issues do not arise as long as the input transistor works in sub-threshold.

The transient and frequency responses of the integrator circuit of Fig.4 with a 1pF integrating capacitor are shown in Fig.5. The right-side waveforms show the input signal (/Iin/MINUS), which is a 1 kHz sine wave with 20nA amplitude, and the output signal, which is the drain current of the expander transistor (/T4/D). The gate voltages of the compressor transistor M1 (/net012) and the expander transistor M4 (/net032) are also shown. In the transient response, the gate-source voltages of M1 and M4 are logarithmic functions of their drain currents, as given in (16).

$$V_{gs}(t) \propto \log\left(I_{max} \sin(\omega t) + I_{DC}\right) \qquad (16)$$

Imax and IDC are 20nA and ω corresponds to a frequency of 1 kHz. In Fig.6, the integrator cutoff frequency is tuned from 1.089 kHz to 1.023MHz by varying the capacitor over the range 100pF to 0.1pF, respectively. In Fig.7, the width of the integrator core transistor and of its current mirror transistor (W3=W7=k×480nm) is changed from 480nm to 4.8μm; the integrator pass-band gain increases from 0dB to 4.5dB and the cutoff frequency increases from 113.4 kHz to 974.2 kHz.
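According to (14), the cutoff frequency is inversely proportional to the capacitor value when the bias current is fixed. The short sketch below scales the simulated 113.4 kHz cutoff at 1 pF to the capacitor end points quoted above and compares the ideal prediction with the simulated values; the function name is hypothetical, and the comparison ignores any parasitic effects.

```python
def scaled_cutoff(f_ref, c_ref, c_new):
    """Cutoff frequency predicted by (14) when only the capacitor changes.
    c_ref and c_new must use the same unit."""
    return f_ref * c_ref / c_new

F_REF_HZ = 113.4e3  # simulated cutoff at 1 pF and 20 nA bias

# (capacitor in pF, simulated cutoff in Hz) taken from the text above
for c_pf, f_sim_hz in [(0.1, 1.023e6), (100.0, 1.089e3)]:
    f_ideal = scaled_cutoff(F_REF_HZ, 1.0, c_pf)
    deviation = 100.0 * (f_ideal - f_sim_hz) / f_sim_hz
    print(f"C = {c_pf} pF: ideal {f_ideal:.4g} Hz, "
          f"simulated {f_sim_hz:.4g} Hz, deviation {deviation:.1f}%")
```

The ideal 1/C scaling lands within roughly 11 percent of the simulated end points, which is consistent with (14) capturing the dominant tuning behavior.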

Fig. 5. Frequency response and transient waveforms of circuit in Fig.4

Fig. 6. Tuning of -3dB/cutoff frequency using variable capacitor in integrator circuit of Fig.4

Fig. 7. Cutoff frequency and pass-band gain in circuit of Fig.4 with 1pF capacitor for 3 different width sizes in M3, M7 (W3=W7=K*480nm)

4 Discussion and Conclusions

Low-voltage current-mode filters are motivated by the need for high frequency filters with low supply voltage in portable equipment. Low voltage designs suffer from dynamic range limitations due to the nonlinear behavior of transistors. In the companding method, externally linear, internally nonlinear (ELIN) circuits are used. This method is useful in low power and low voltage circuits to improve the maximum input range. In this paper, a new log-domain integrator circuit using MOSFETs in the sub-threshold region is introduced. MOSFETs in sub-threshold act as trans-linear elements, and the designed circuit is a first order ELIN filter. CADENCE simulation results with IBM 130nm technology parameters are given. This low voltage circuit works with a 500mV supply voltage.

The cutoff frequency of the integrator can be tuned in two ways. By changing the capacitor value from 0.1pF to 10pF, the cutoff frequency changes from 1.023MHz to 1.089 kHz. This approach is suggested for coarse frequency tuning; the power consumption remains nearly constant, between 45.4nW and 54nW. Alternatively, by changing RBIAS, the bias current in the integrator core changes and the cutoff frequency can be tuned. In this case the power consumption increases, and the approach is suggested for fine frequency tuning.

In this design, the bias current and size of the compressor transistor (M1) are chosen according to the maximum input range, while the bias current and size of the integrator core transistor (M3) are chosen according to the desired cutoff frequency, together with an appropriate capacitor value. Given the low-power constraint, the bias currents should be as low as possible; given the high-frequency constraint, the parasitic capacitances and transistor sizes should be as small as possible. Trade-offs therefore exist between power consumption, bandwidth and input range. A summary of the simulation results is given in Table 1.

Table 1. Summary

Integrator core     Integrator    Power           Pass band   Cutoff frequency
bias current (I3)   capacitor     consumption     gain
20nA                1pF           50nW            0dB         113.4kHz
20nA                0.1pF-10pF    45.44nW-54nW    0dB         1.023MHz-1.089kHz
20nA-0.20uA         1pF           50nW-133.8nW    0dB-4.5dB   113.4kHz-974.2kHz

Acknowledgment

The author wishes to thank the Department of Electrical and Computer Engineering at Ryerson University for its support in providing access to the workstations at the Microsystems research laboratory. The author also wishes to thank Professor Fei Yuan, supervisor of the ICS research group at Ryerson University, for his useful comments.

References

[1] Serra-Graells, F., Rueda, A., Huertas, J.L.: Low-Voltage CMOS Log Companding Analog Design. Kluwer Academic Publishers, Dordrecht (2003)
[2] Sanchez Sinencio, E., Andreou, A.G.: Low Voltage/Low Power Integrated Circuits and Systems, Low Voltage Mixed Signal Circuits. IEEE press series in microelectronic systems, ch.3, pp. 68–72 (1998)
[3] Gray, P.R., Meyer, R.G.: Analysis and Design of Analog Integrated Circuits, 5th edn. John Wiley & Sons Ltd., Chichester (2000)
[4] Enz, C.C., Vittoz, E.A.: Charge-based MOS Transistor Modeling, the EKV model for low-Power and RF IC design. John Wiley & Sons Ltd., Chichester (2006)
[5] Seevinck, E.: Companding Current-mode Integrator, a New Circuit Principle for Continuous Time Monolithic Filters. Electronics Letters 26(24), 2046–2047 (1990)
[6] Fried, R., Python, D., Enz, C.C.: Compact Log Domain Current Mode Integrator with High Transconductance-to-Bias Current Ratio. Electronics Letters 32(11), 952–953 (1996)
[7] Frey, D.: Future Implications on the Log Domain Paradigm. IEE Proc. Circuits Devices Syst. 147(1), 65–72 (2000)

Physical Design Aware Comparison of Flip-Flops for High-Speed Energy-Efficient VLSI Circuits

Massimo Alioto1,2, Elio Consoli3, and Gaetano Palumbo3

1 DIE, University of Siena, 53100 Siena, Italy 2 Currently also with BWRC, UC Berkeley, 94704-1302 Berkeley, California, USA [email protected], [email protected] 3 DIEES, University of Catania, 95100 Catania, Italy {econsoli,gpalumbo}@diees.unict.it

Abstract. In this paper, an extensive comparison of flip-flop (FF) topologies for high-speed applications is carried out in a 65-nm CMOS technology. This work goes beyond previous analyses in that traditional rankings do not include layout parasitics, which strongly affect both speed and energy and lead to drastic changes in the optimum transistor sizing. For this reason, in this work layout parasitics are included in the circuit design loop by adopting a novel strategy. The obtained results show that the energy efficiency and the performance of FFs are mainly determined by the regularity of their topology and layout. Finally, the area-delay tradeoff is also analyzed for the first time.

Keywords: Energy Efficiency, Clocking, Flip-Flops, High Speed, Energy-Delay, Nanometer CMOS, Interconnects, Layout Impact.

1 Introduction

The selection of flip-flop (FF) topologies is essential for the design of both high-speed and energy-efficient microprocessors [1]. Indeed, in fast micro-architectures with low logic depth, FF delay occupies a significant fraction of the clock cycle [2]. Moreover, together with the circuits devoted to clock generation and distribution, FFs are responsible for a large fraction of the whole chip energy budget [3]-[4].

Various high-speed FFs have been proposed in the past, mainly belonging to the Pulsed and Differential classes [2]. Usually, they feature a transparency window, which gives them clock-uncertainty absorption properties but also reduced race immunity [2]. However, both setup and hold time values can be set regardless of the FF delay value, since they depend on the sizing of gates that do not belong to the FF critical path. Therefore, the real figure of merit concerning the timing of such FFs is the minimum data-to-output delay, which measures the impact of FF speed on the clock cycle [5]-[6]. Given the presence of precharged nodes and the high switching activity in the pulse generator stages, high-speed FFs are characterized by high dissipation (e.g., compared to low-energy FFs, such as Master-Slave ones) [5]. Therefore, given that CMOS technology has entered a power-limited regime, identifying the most energy-efficient high-speed FFs is nowadays a decisive issue.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 62–72, 2011. © Springer-Verlag Berlin Heidelberg 2011

However, the most significant previous comparisons [5]-[10] have not considered nanometer technologies, thereby neglecting the increasing impact of layout parasitics associated with local interconnects, which severely degrade both speed and energy. In this paper, the ranking of the most representative high-speed FFs in a 65-nm CMOS technology is reconsidered by including this issue from the early design phases, in order to reach the optimum FF sizings corresponding to the energy-efficient designs in the Energy-Delay (E-D) space. The framework for FF analysis and design and the considered topologies are briefly presented in Section 2. The ranking of Pulsed and Differential topologies in the E-D space is discussed in Section 3, where the main differences with respect to previous results are pointed out. Section 4 considers FF area and its tradeoff with delay. Finally, conclusions are given in Section 5.

2 Framework for FFs Comparison and Selected FF Topologies

2.1 Adopted Analysis/ Design Strategies and Inclusion of Layout Impact

As previously stated, FF delay is identified with the minimum data-to-output delay (measuring the impact of FF timing on the speed of pipelined systems [2],[5]-[6]). FF energy is extracted by summing transient (i.e., dynamic and short-circuit) and static (i.e., leakage) contributions, weighted according to the data input switching activity and to the clock period duration (set to 10 times the delay of the FF), respectively. The test bench adopted to evaluate the FF energy is similar to that in [2],[5]-[6] and is summarized in Appendix D of [11].

Various application conditions [9]-[10] are considered in terms of small, medium and large load, equal to 4, 16 and 64 minimum symmetrical inverters (with 120 nm being the minimum channel width and 410 aF at the input), and small, medium and large data input activity, i.e. 0.10, 0.25 and 0.50, respectively. In the rest of the paper, we assume a load of 16 minimum inverters and a switching activity of 0.25 as the "reference case".

The comparison is carried out by analyzing the energy-efficient curves (EECs) of the FFs in the E-D space. Such curves are extracted by minimizing some figures of merit (FOMs) as described in [11] (due to the lack of space, please refer to that paper for procedures and examples concerning the detailed FF design strategy). To gain an intuitive understanding of the results independently of technology, they are normalized to reference values typical of the considered 65-nm CMOS technology. In particular, delays are normalized to FO4 = 18.27 ps, energies are normalized to 0.202 fJ (the energy dissipated by an unloaded symmetrical minimum inverter during a complete 0→1→0 transition cycle at its output), and areas are normalized to a reference area expressed in terms of the 200 nm minimum pitch of the Metal2 layer. For all the analyses, a 1 V supply voltage is adopted.
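The energy bookkeeping and the normalization described above can be sketched as follows. This is one plausible reading, not the exact procedure of [11]: the transient energy is assumed to scale linearly with the switching activity, the static energy is taken as leakage power times a clock period of 10 FF delays, and all function names are hypothetical. The two reference constants are the values quoted in the text.

```python
DELAY_REF_S = 18.27e-12    # delay normalization reference from the text (s)
ENERGY_REF_J = 0.202e-15   # energy normalization reference from the text (J)

def ff_energy(e_transient, p_static, delay, activity):
    """Energy per clock cycle under the stated assumptions.

    e_transient: transient energy for one data transition (J)
    p_static: leakage power (W)
    delay: minimum data-to-output delay of the FF (s)
    activity: data input switching activity (0.10, 0.25 or 0.50 in the text)
    """
    t_clk = 10.0 * delay  # clock period set to 10 times the FF delay
    return activity * e_transient + p_static * t_clk

def normalize(delay, energy):
    """Express a design point in units of the two reference values."""
    return delay / DELAY_REF_S, energy / ENERGY_REF_J

# Hypothetical design point in the reference case (activity 0.25).
energy = ff_energy(e_transient=8e-15, p_static=50e-9, delay=60e-12, activity=0.25)
print(normalize(60e-12, energy))
```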
The sizing strategy in [11] also accounts for capacitive parasitics due to local interconnects from the early design phases, for the first time in the literature on FF analysis and design. Indeed, among previous works [2],[5]-[10], few consider layout impact, and only a posteriori, while most neglect it altogether. This leads to strong differences between the adopted design strategies and the actual optimum ones, and to the unreliability of the previously reported results, given the huge influence that local wires have on both the energy and the delay of FFs.

The detailed methodology to extract capacitive parasitics is based on geometrical calculations performed on stick diagrams and on realistic modeling of the per-unit-length capacitances of the various interconnecting Poly and Metal layers (thereby including the effect of capacitive coupling between adjacent and stacked wires). The methodology is described in detail in Appendix A of [11] and has been validated through the realization of several actual layouts of the considered FFs, corresponding to the minimum -product designs in the reference case. Local interconnect parasitic capacitances are estimated with an error of 10-25%, while the error in the delay-energy estimation is lower (5-10%).

It is worth noting that the values of such capacitances are quite similar to those of the transistor-related (gate and drain) capacitances in the various FF nodes, i.e. they introduce extremely significant branching and parasitic effects. As a consequence, the optimization leads to larger transistor sizes (up to 2X) in order to compensate for the resulting speed degradation, and hence energy increases both because of the additional interconnect capacitances themselves and because of the larger transistor sizes. This confirms the huge impact that such parasitics have on both energy and delay.

2.2 High-Speed FF Topologies: Pulsed and Differential Classes

In this paper we focus on the comparison of high-speed FFs, and hence we consider the Pulsed and Differential topological classes, which feature small delays. On the whole, 11 of the most representative and best known FFs are selected.

The analyzed Pulsed topologies are the Hybrid Latch FF [12] (HLFF), the Semi-Dynamic FF [13] (SDFF), the UltraSPARC Semi-Dynamic FF [14] (USDFF), the Implicit Push-Pull FF [6] (IPPFF), the Conditional Precharge FF [15] (CPFF), the Static Explicit Pulsed FF [16] (SEPFF) and the Transmission Gate Pulsed Latch [17] (TGPL). The latter two are Explicit Pulsed (EP) circuits, i.e. they employ a pulse generator (PG) providing an actually pulsed clock, whereas the remaining ones are Implicit Pulsed (IP), i.e. they simulate a pulsed clock through the temporary enabling of some (typically two) transistors according to the delay of an inverter chain [2].

The Differential FFs investigated are the Modified Sense-Amplifier FF [18] (MSAFF), the Skew-Tolerant FF [19] (STFF), the Conditional Capture FF [20] (CCFF) and the Variable Sampling Window FF [21] (VSWFF). The operation of the latter two resembles that of Pulsed FFs, since they employ a transparency window. The FF schematics are reported in Fig. 1, together with the widths of the transistors in the data-to-output paths, which are optimized as independent design variables [11].

3 Energy-Delay Tradeoff and Energy-Efficient Curves

3.1 Pulsed FFs

The EEC of the IP-EP FFs, derived in the reference case, is reported in Fig. 2a. From this figure, the TGPL is clearly the most energy-efficient Pulsed FF in the high-speed region and in part of the low-energy one. This is expected from the simplicity of the basic latch structure of TGPL (and hence the low impact of layout parasitics). This good energy efficiency of TGPL is remarkable since here every FF is considered with its own Pulse Generator (PG), whereas in practice energy may be further reduced by sharing the PG among various FFs. From Fig. 2, in the deep low-energy region, the CPFF and IPPFF are the best Pulsed FFs. Indeed, both are Implicit Pulsed and hence do not require a PG. In addition, the CPFF employs a conditional technique to avoid unnecessary precharge [15], while the IPPFF reduces the load on the precharged node by using a push-pull second stage.

Fig. 1. Schematics of the analyzed FFs: HLFF (a), SDFF (b), USDFF (c), IPPFF (d), CPFF (e), SEPFF (f), TGPL (g), MSAFF (h), STFF (i), CCFF (j), VSWFF (k)

SEPFF is fast, but dissipates more than TGPL in all conditions and hence is less energy-efficient. Its average delay is also nearly 1.2X greater than that of TGPL. This is somewhat different from previous works [8], which predicted the same speed for a medium load (like 16). Again, this is due to the heavier impact of interconnects, since SEPFF has a slightly more complex layout than TGPL.

Among all the Pulsed FFs, the semi-dynamic ones (SDFF and USDFF) exhibit the worst performance in the whole E-D space. The reason is again related to the layout complexity. In contrast with [5],[8],[13], where it is stated that such FFs have E-D features very similar to those of the HLFF, we find that the latter is significantly more energy-efficient throughout the whole E-D space (except in the very high-speed region, where they are similar). Indeed, HLFF has a much simpler schematic and hence its layout has much shorter interconnects, thus reducing energy consumption. Moreover, in contrast to previous results [6],[14], USDFF does not outperform SDFF, again because of its more complex routing.
Given the mirror-like structure of the two circuits, the local wire capacitances can be compared by averaging the results over all the different nodes and all the considered sizing strategies. On average, we find that parasitics are nearly 60% larger for USDFF than for SDFF.

All SET IP FFs are slower than EP FFs. In particular, averaging the delays corresponding to the various optimized FOMs, IP FF delays are nearly 1.3X greater than EP FF delays. This happens mainly because IP FFs need stages with three stacked transistors in their critical path, whereas EP FFs exploit a real pulsed signal and need stages with only two stacked transistors. In particular, IPPFF has the worst minimum delay among IP FFs, since it exhibits three- and four-stage paths for the rising and falling data transitions, and this outweighs the advantages given by the push-pull stage [6].

To understand the dependence of the above results on the load value, the EECs of Pulsed FFs for loads of 64 and 4 are reported in Fig. 2b-c (in both cases with a switching activity of 0.25). The ranking of IP FFs does not change significantly, except for IPPFF which, having a greater number of stages in its data-to-output paths, becomes relatively faster for a large load. As concerns EP FFs, unlike [9], where the speed of a two-stage FF (TGPL) is overcome by that of a three-stage topology (SEPFF) when the load is large enough (64), the SEPFF still shows an average 1.1X (1.3X) delay increment even for a load of 64 (4). When the load is small (4), TGPL is the most energy-efficient Pulsed FF in practically all the E-D space.

To understand the effect of switching activity, the EECs for switching activities of 0.1 and 0.5 are reported in Fig. 2d-e (in both cases with a load of 16). The main changes occur in the low-energy region, where the CPFF becomes more energy-efficient for 0.1, since it takes advantage of the conditional precharge. Conversely, for 0.5, the IPPFF becomes the most energy-efficient Pulsed FF in the deep low-energy region, whereas CPFF and SEPFF (both exhibiting pseudo-static first stages) experience a considerable dissipation increase due to the high data activity rate.

As a final remark, the overall superiority of EP over IP FFs is explained by considering that, in nanometer technologies, IP FFs suffer from complex routing between the stages involved in the data-to-output paths, which thus need to be oversized to avoid a speed penalty. This must be emphasized, since EP FFs can benefit from a further energy reduction when the PG is shared among various FFs.

[Figure 2: five panels (a)-(e), each plotting curves for HLFF, SDFF, USDFF, IPPFF, CPFF, SEPFF and TGPL; vertical axis normalized to the minimum inverter (/min,inv), horizontal axis normalized to FO4 (/FO4)]

Fig. 2. Implicit-Explicit Pulsed FFs: reference case (a), load of 64 (b), load of 4 (c), switching activity of 0.1 (d), switching activity of 0.5 (e). In (b)-(c) the switching activity is 0.25. In (d)-(e) the load is 16.

3.2 Differential FFs

The EECs of the SET Differential FFs in the reference case are reported in Fig. 3a. From this figure, the E-D space is split into two regions: the high-speed one, where the STFF is the most energy-efficient, and the low-energy one, where the MSAFF is the best Differential FF. In particular, STFF is the fastest among all the analyzed FFs. For instance, the average delay of TGPL is 1.1X greater than that of the STFF, whereas those of MSAFF, CCFF and VSWFF are 1.8X, 1.3X and 1.4X greater, respectively.

These differences in the speed of the Differential FFs can be explained as follows: all of them have equal second (skewed inverter) and third (push-pull) stages, which are very fast. As regards the first stage, the speed of MSAFF is affected by the load imposed by the cross-coupled inverters, whose NMOS transistors belong to the complementary critical paths (although the sense-amplifier nature is useful for level-restoring). The first stage of CCFF and VSWFF does not have this drawback and is significantly faster, but not as fast as the first stage of STFF, where only two stacked NMOS transistors are employed thanks to the use of additional driving NOR gates.

The high energy efficiency of MSAFF in the low-energy region is due to the relatively simpler layout and to the lower impact of layout parasitics, which allows for downsizing transistors with a minor performance loss compared with STFF, CCFF and VSWFF. For analogous reasons, CCFF and VSWFF, which have extremely complex routing, are never the most energy-efficient. This is in contrast to what is claimed in many papers [2],[15],[20]-[21], where the conditional capture property is praised as a very efficient technique to reduce energy at a negligible speed penalty. This is no longer true in nanometer technologies, where the impact of local wires is considerable (to maintain a good speed, such FFs need to be strongly oversized).
Given the similar topology of the considered Differential FFs, the same ranking is obtained regardless of the load. Instead, switching activity has a significant impact on the comparison, as shown in Fig. 3b-c, where the EECs derived for switching activities of 0.1 and 0.5 are plotted (in both cases with a load of 16). In detail, for 0.1, CCFF and VSWFF become the most energy-efficient in the region around the minimum point. For 0.5, their EECs move far away from those of MSAFF and STFF, in contrast to [20], where it is stated that conditional capture FFs have a reasonable energy consumption even for such a data transition rate. Note that some of the considered Differential FFs [19]-[20] have complex IP single-ended counterparts, whose energy efficiency is always worse than that of the other single-ended topologies.

4 Area and Tradeoff with Delay

The silicon area occupied by FFs can be accurately estimated by using the same procedure used to estimate the interconnect lengths (previous works did not analyze this aspect [2],[4]-[10],[12]-[21]). Table 1 reports the absolute and normalized area of the various FFs under three typical optimum sizings, each minimizing a different FOM. Area is mostly dictated by the topological complexity, and we can draw the following main conclusions, which roughly hold for all the considered sizings:

− Conditional Differential FFs (CCFF and VSWFF) have the greatest area;
− HLFF and MSAFF have very small area. Indeed, MSAFF (despite its Differential nature) takes advantage of its regularity, and HLFF is the simplest considered FF.

As concerns EP FFs, the values in Table 1 are somewhat pessimistic. Indeed, when the PG is shared among an increasing number of FFs, the area increase of the PG is small.

[Figure 3: three panels (a)-(c), each plotting curves for MSAFF, STFF, CCFF and VSWFF; vertical axis normalized to the minimum inverter (/min,inv), horizontal axis normalized to FO4 (/FO4)]

Fig. 3. Differential FFs: reference case (a), switching activity of 0.1 (b), switching activity of 0.5 (c) (load of 16)

Table 1. Absolute and normalized area of the considered FFs for various optimum sizings

FF       Min : Area []      Min : Area []      Min : Area []
HLFF      681.6 (1.00x)      462.4 (1.00x)     462.4 (1.00x)
SDFF      869.6 (1.28x)      703.2 (1.52x)     588.0 (1.27x)
USDFF     983.2 (1.44x)      816.8 (1.77x)     644.8 (1.39x)
IPPFF     816.8 (1.20x)      624.0 (1.35x)     603.2 (1.30x)
CPFF      912.0 (1.34x)      704.0 (1.52x)     541.6 (1.17x)
SEPFF     946.4 (1.39x)      759.2 (1.64x)     644.0 (1.39x)
TGPL      780.8 (1.15x)      635.2 (1.37x)     552.0 (1.19x)
MSAFF     691.2 (1.01x)      504.0 (1.09x)     504.0 (1.09x)
STFF     1202.4 (1.76x)      765.6 (1.57x)     724.0 (1.57x)
CCFF     1397.6 (2.05x)     1106.4 (1.74x)     804.0 (1.74x)
VSWFF    1397.6 (2.05x)     1106.4 (1.74x)     804.0 (1.74x)
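The normalized figures in Table 1 appear to be each FF's area divided by the HLFF area under the same sizing. As a quick illustration, the sketch below recomputes the first column only; the dictionary and function names are chosen here for the example and do not come from the paper.

```python
# Absolute areas from the first sizing column of Table 1 (units as in the table).
AREA_FIRST_SIZING = {
    "HLFF": 681.6, "SDFF": 869.6, "USDFF": 983.2, "IPPFF": 816.8,
    "CPFF": 912.0, "SEPFF": 946.4, "TGPL": 780.8, "MSAFF": 691.2,
    "STFF": 1202.4, "CCFF": 1397.6, "VSWFF": 1397.6,
}


def normalize_area(areas, reference="HLFF"):
    """Return each area divided by the area of the reference FF."""
    base = areas[reference]
    return {name: area / base for name, area in areas.items()}


if __name__ == "__main__":
    for name, ratio in normalize_area(AREA_FIRST_SIZING).items():
        print(f"{name:6s} {ratio:.2f}x")
```

Normalizing against the simplest topology makes the area cost of each additional feature (conditional capture, differential structure) directly readable as a multiplier.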


The area-delay tradeoff is illustrated for the reference case in Fig. 4. From this figure, the area-delay tradeoff closely resembles the energy-delay tradeoff, since the overall energy dissipation is strongly related to the area and the size of the circuits. Note the very good tradeoff offered by the HLFF in the delay range from 3 to 6 FO4.

We also analyze the area degradation versus sizing (i.e., when optimizing FOMs where more emphasis is given to speed). The results in Fig. 5 (Differential and Pulsed FFs are depicted with dotted and dashed lines, respectively) refer to the reference case and are normalized with respect to the minimum area for each FF, obviously achieved when simply minimizing the energy. Differential FFs see the highest relative increase in their area (up to 1.8X) when they are progressively up-sized for smaller delays. Indeed, their complex layouts and the high branching effects due to local wire parasitics and additional gates (not lying in the data-to-output paths) require a significant transistor oversizing of their critical stages. Pulsed FFs (both IP and EP) show area increments of up to 1.4-1.7X.

[Figure 4: area, plotted as (Area)/χ2, versus delay D/FO4 for HLFF, SDFF, USDFF, IPPFF, CPFF, SEPFF, TGPL, MSAFF, STFF, CCFF and VSWFF; the STFF, TGPL and HLFF curves are labeled in the plot]

Fig. 4. Area-Delay tradeoff in the reference case

[Figure 5: area normalized to the area of the minimum-energy sizing, (Area)/(Area) Emin, for HLFF, SDFF, USDFF, IPPFF, CPFF, SEPFF, TGPL, MSAFF, STFF, CCFF and VSWFF; horizontal axis: optimized FOM, from delay-oriented products of E and D to minimum energy (E min)]

Fig. 5. Area degradation when considering the optimum sizings minimizing various FOMs

5 Conclusion

In this paper, a thorough comparison in the energy-delay-area space of several high-speed FFs (Pulsed and Differential) in nanometer (65-nm) CMOS technology has been carried out. The analysis showed that, in many cases, results differ from those of previous papers because the impact of local interconnect parasitics has been explicitly included since the early design phases. As a general remark, simpler basic structures are rewarded in nanometer technologies because of the strong impact of layout parasitics. In particular, EP topologies, and specifically the TGPL, have been recognized as the best high-speed FF topologies in a very wide range of applications.

References

1. Kurd, N., et al.: A Family of 32nm IA Processors. In: 2010 IEEE ISSCC (2010)
2. Oklobdzija, V., et al.: Digital System Clocking: High-Performance and Low Power Aspects. Wiley-IEEE Press (2003)
3. Alioto, M., et al.: Flip-Flop Energy/Performance versus Clock Slope and Impact on the Clock Network Design. In: Print on IEEE TCAS-I
4. Nedovic, N., et al.: Dual-Edge Triggered Storage Elements and Clocking Strategy for Low-Power Systems. IEEE TVLSI 13(5), 577–590 (2005)
5. Stojanovic, V., et al.: Comparative Analysis of Master-Slave Latches and Flip-Flops for High-Performance and Low-Power Systems. IEEE JSSC 34(4), 536–548 (1999)
6. Giacomotto, C., et al.: The Effect of the System Specification on the Optimal Selection of Clocked Storage Elements. IEEE JSSC 42(6), 1392–1404 (2007)
7. Markovic, D., et al.: Analysis and design of Low-Energy Flip-Flops. In: 2001 ISLPED, pp. 52–55 (2001)
8. Tschanz, J., et al.: Comparative Delay and Energy of Single Edge-Triggered and Dual Edge-Triggered Pulsed Flip-Flops for High-Performance Microprocessors. In: 2001 ISLPED, pp. 147–152 (2001)
9. Heo, S., et al.: Load-Sensitive Flip-Flop Characterization. In: 2001 IEEE CSW-VLSI, pp. 87–92 (2001)
10. Heo, S., et al.: Activity-Sensitive Flip-Flop and Latch Selection for Reduced Energy. IEEE TVLSI 15(9), 1060–1064 (2007)
11. Alioto, M., et al.: General Strategies to Design Nanometer Flip-Flops in the Energy-Delay Space. In: Print on IEEE TCAS-I
12. Partovi, H., et al.: Flow-Through Latch and Edge-Triggered Flip-Flop Hybrid Elements. In: 1996 IEEE ISSCC, pp. 138–139 (1996)
13. Klass, F., et al.: A New Family of Semidynamic and Dynamic Flip-Flops with Embedded Logic for High-Performance Processors. IEEE JSSC 34(5), 712–716 (1999)
14. Heald, R., et al.: A Third Generation SPARC V9 64-b Microprocessor. IEEE JSSC 35(11), 1526–1538 (2000)
15. Nedovic, N., et al.: Conditional Techniques for Low Power Consumption Flip-Flops. In: 2001 IEEE ICECS, vol. 2, pp. 803–806 (2001)
16. Zhao, P., et al.: Low Power and High Speed Explicit-Pulsed Flip-Flops. In: 2002 IEEE MSCS, pp. 477–480 (2002)
17. Naffziger, S., et al.: The Implementation of the Itanium 2 Microprocessor. IEEE JSSC 37(11), 1448–1460 (2002)
18. Nikolic, B., et al.: Improved Sense-Amplifier-Based Flip-Flop: Design and Measurements. IEEE JSSC 35(6), 876–884 (2000)
19. Nedovic, N., et al.: A Clock Skew Absorbing Flip-Flop. In: 2003 IEEE ISSCC, pp. 342–344 (2003)
20. Kong, B., et al.: Conditional-Capture Flip-Flop for Statistical Power Reduction. IEEE JSSC 36(8), 1263–1271 (2001)
21. Shin, S., et al.: Variable Sampling Window Flip-Flops for Low-Power High-Speed VLSI. In: 2005 IEE CDS, vol. 152(3), pp. 266–271 (2005)

A Temperature-Aware Time-Dependent Dielectric Breakdown Analysis Framework

Dimitris Bekiaris, Antonis Papanikolaou, Christos Papameletis, Dimitrios Soudris, George Economakos, and Kiamal Pekmestzi

Microprocessors and Digital Systems Lab, National Technical University of Athens
157 80, Zografou, Athens, Greece
{mpekiaris,antonis,xristos86,dsoudris,geconom,pekmes}@microlab.ntua.gr

Abstract. The shrinking of interconnect width and thickness due to technology scaling, along with the integration of low-k dielectrics, reveals novel reliability wear-out mechanisms that progressively affect the performance of complex systems. These phenomena progressively deteriorate the electrical characteristics, and therefore the delay, of interconnects, leading to violations in timing-critical paths. This work estimates the timing impact of Time-Dependent Dielectric Breakdown (TDDB) between wires of the same layer, considering temperature variations. The proposed framework is evaluated on a Leon3 MP-SoC design, implemented in a 45nm CMOS technology. The results evaluate the system’s performance drift due to TDDB, considering different physical implementation scenarios.

Keywords: Reliability, Time-Dependent Dielectric Breakdown, Inter-Metal Dielectric Leakage, Timing.

1 Introduction

The current trend of CMOS technology scaling aggressively reduces the physical dimensions of devices and interconnects, leading simultaneously to contiguous effects, which form novel threats to the reliability of modern integrated circuits. The shrinking of transistor channel length incurs an exponential growth of sub-threshold leakage, which increases power density and creates hot spots in congested areas of the chip. The reduction of gate oxide thickness in technology nodes beyond 65nm enhances the gate tunneling current, resulting in Negative-Bias Temperature Instability (NBTI) in PMOS transistors due to the gradual rise of threshold voltage.

Similar effects with a progressive impact also appear in interconnection structures. They are caused by the shrinking of geometrical dimensions and the saturation of the operating voltage at around 1V in sub-micron technologies [1]. The reduction of wire width and thickness increases current density, while the smaller pitch and spacing enhance the electric field between interconnects of the same metal layer. Thus, Back-End-of-Line (BEOL) reliability phenomena like Electro-Migration (EM), Stress Migration (SM) and Time-Dependent Dielectric Breakdown (TDDB) start to gain in significance with technology scaling, and they progressively degrade the electrical characteristics and structure of the affected interconnects.

The recent move from silica-based to porous, low-k dielectrics between copper lines in the interconnect stack comes along with the advent of nanoscale technologies and has further aggravated the potential TDDB problems. Copper tends to “leak” into the dielectrics and create conductive paths between wires of the same metal layer, leading to breakdowns in the dielectric and leakage current between wires. Moreover, the evolution of this leakage current is not abrupt. It seems to be a rather smooth function of operating time, until the magnitude of the current is large enough to create an electrical short between wires, which affects the functionality of the circuit.

In this paper, we present an analysis flow that can capture the impact of Time-Dependent Dielectric Breakdown of the low-k dielectrics of the interconnect stack on the delay of the individual wires and, furthermore, propagate this impact to the timing of the entire chip. Hence, we can estimate when the chip will present timing violations due to reliability problems on the interconnects.

The rest of the paper starts by presenting the related work in the literature and continues with the model used for the TDDB estimations. Section 4 presents the proposed reliability analysis framework and Section 5 demonstrates the experimental results, based on the application of this framework to layouts of an MP-SoC platform. Finally, a discussion of the results and hints for future work conclude the paper.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 73–83, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Related Work

Time-Dependent Dielectric Breakdown of low-k dielectrics has been identified as a potential reliability threat by many independent researchers since the decision to move from aluminum to copper wires for standard CMOS processes [2][3][4]. Significant effort is being invested at the process technology development level in order to determine the process steps and materials that can alleviate this phenomenon [5][6]. Up to now, however, no solution at the level of process technology seems to solve the problem completely. Hence, TDDB must be taken into account at the design stage as a potential threat, not only to the reliability of interconnects, but also to the circuit’s performance, since the flow of inter-metal leakage through the dielectric increases the wire delay and may cause the delay of the design’s critical paths to drift over time.

This has also been implicitly understood by process technology researchers, who have started working on modeling the impact of TDDB on the electrical properties of interconnects [7][8][9]. Although the design community has not yet taken up any of these models to evaluate the impact of TDDB at the level of an entire system, recent works present methodologies and tools that estimate the system’s performance drift over time [10][11], based on the extrapolation of accelerated inter-metal leakage measurements to normal operating conditions. In this work, we take these approaches a step further by exploring the impact of different place-and-route styles on the system’s timing degradation due to TDDB, while considering the temperature profile of the entire layout, which of course depends on the application.

3 Time-Dependent Dielectric Breakdown Mechanism

Time-Dependent Dielectric Breakdown (TDDB) of inter-metal dielectrics refers to the progressive destruction of the material insulating interconnects of the same metal layer, leading to the formation of “leaky” paths and therefore increasing the time required for charging and discharging wire capacitances. This mechanism is similar to the ones appearing in gate oxide structures and in parallel-plate capacitors with high-k dielectrics, also used in DRAMs.

However, TDDB becomes more significant for interconnects with the advent of low-k porous dielectric materials, mostly used in sub-micron manufacturing processes to reduce interconnect delay, while improving crosstalk and minimizing interconnect power dissipation. These gains come hand in hand with worse reliability characteristics, due to the porous nature of this type of dielectric. The gradual breakdown of low-k materials is aggravated as the electric field between neighboring wires rises, wire pitch scales down and the operating voltage saturates around 1V [1]. Hence, the inter-metal electric field grows stronger with technology scaling and constitutes the main reason for the formation of conductive paths through the dielectric, along with imperfections appearing in the interconnects. These defects appear in the low-k materials used in current nanometer technology processes, and their formation is mainly due to the dominant dielectric deposition methods used during the manufacturing process. Therefore, under accelerated conditions of voltage and temperature and an electric field lower than 6 MV/cm, which is a usual stress value for low-k metal-insulator-metal structures [7], free charges (holes) are trapped in the areas of the dielectric where these defects exist. The number of trapped holes rises progressively, until a critical value is reached. Then, the flow of inter-metal leakage becomes significantly stronger, leading to the dielectric’s breakdown and finally resulting in a short circuit.

The TDDB mechanism can be modeled by either Schottky or Frenkel-Poole emission, both of which have similar mathematical expressions for the inter-metal leakage current density [7] and depend exponentially on temperature. However, mainly because of the nature of this wear-out mechanism and the recent shift of interconnect technology to low-k dielectrics, there is little convergence on a specific model. Therefore, a common practice for the estimation of inter-metal leakage in operating conditions is the extrapolation of leakage measurements from experimental data, where wires are stressed for a certain number of hours under high voltage and temperature, resulting in strong electric fields.

The extrapolation approach has also been adopted in this work, where the wires have been stressed for about one hour. The leakage in operating conditions is extracted by performing linear extrapolation from the experimental measurements, and the derived values form the basis for the estimation of the delay impact of TDDB on individual interconnects. In the proposed reliability analysis framework, presented in the following section, we demonstrate how we use the information from the inter-metal leakage characterization libraries in stress conditions to guide the estimation of the additional delay in wires due to TDDB. This was a necessary step for the development of the proposed reliability framework, which predicts the design’s performance drift over time due to TDDB and therefore the shortening of the system’s operating lifetime under the required performance.
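For orientation, the textbook forms of the two conduction mechanisms named in this section are recalled below. These are the standard expressions and not necessarily the exact parameterization used in [7]; the symbols are introduced here only for illustration, with J the leakage current density, E the electric field in the dielectric, T the absolute temperature, \phi_B the barrier (or trap) height, \varepsilon the dielectric permittivity, A^{*} the effective Richardson constant, q the elementary charge and k the Boltzmann constant.

```latex
% Schottky (thermionic) emission
J_{\mathrm{S}} = A^{*} T^{2}
  \exp\!\left[ -\frac{q\left(\phi_B - \sqrt{qE/(4\pi\varepsilon)}\right)}{kT} \right]

% Frenkel-Poole emission
J_{\mathrm{FP}} \propto E\,
  \exp\!\left[ -\frac{q\left(\phi_B - \sqrt{qE/(\pi\varepsilon)}\right)}{kT} \right]
```

Both expressions share the same exponential structure in 1/(kT) and in the square root of the field, which is consistent with the similarity between the two models noted in the text.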

4 The Proposed Interconnect Reliability Framework

The proposed reliability analysis flow, which captures the impact of TDDB in interconnects with low-k dielectrics on a design’s timing, is illustrated in Fig. 1. Even though its structure is generic, we have customized this instance of the flow to capture the impact of TDDB on the delay of interconnects.

The flow of Fig. 1 takes four main inputs: (i) the layout of the circuit, which includes all the geometrical information of the interconnect stack, (ii) the timing constraints of the design, (iii) the standard-cell technology libraries, which include the information about the timing of the cells in the design’s post-layout netlist and the dimensions of cells and interconnects, and (iv) the layout’s power profile, which is needed to extract the temperature profile and establish the actual temperature on each net.

The first steps of the flow estimate the temperature and timing profile of the layout. For the temperature estimation, we used HotSpot [12], an open-source academic tool that produces the thermal map of the chip, taking as inputs the floorplan of the target design and the power consumption of the floorplan’s units. The power profile required for the temperature estimation is obtained via power analysis of the post-layout Verilog netlist in Synopsys PrimeTime PX, using an activity trace obtained through logic simulation in ModelSim, based on a testbench of a real application.

Static Timing Analysis (STA) is performed on the design’s post-layout Verilog netlist, using the SoC Encounter Timing System (ETS) tool, which finds the most timing-critical paths in the design. In our framework, we extract the nets from the 50 most timing-critical paths. These nets are the “key” interconnects, as they belong to timing paths susceptible to TDDB. These paths have a minimal slack (less than 2ns) and thus a delay overhead due to TDDB may lead to timing violations.

After these nets are identified, their geometrical properties are extracted, including the dimensions of the wires themselves and of their neighbors, as well as the spacing between them. This is performed by a Tcl script, which is executed in the SoC Encounter environment and reads the layout’s database using the SoC Encounter Database Access [13] command set. Hence, the script extracts the wires of the nets for the examined critical path, as well as their length, width and thickness, and finds the neighboring wires of the same metal layer, along with their physical dimensions and the distance between them and the wires of the examined net. All this information, which will be used in the computation of the additional delay due to TDDB, is dumped to an output file, named wire.report in our toolchain.

After extracting the physical information about the wires of the examined nets, the next step is to estimate the impact that TDDB is expected to have on the delay of these wires individually, based on the model outlined in the previous section, and to annotate the generated delay overhead due to TDDB on the design’s Standard Delay Format (SDF) file. Finally, the additional delay of each wire is taken into account in a chip-level timing analysis, in order to estimate the impact of TDDB on the timing of the entire layout, in a similar way as in the second step.
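The selection of the “key” interconnects described above can be sketched as follows. This is only an illustration of the idea: the `TimingPath` record and the `key_nets` helper are hypothetical names, and the actual flow obtains the paths from the ETS timing report rather than from Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingPath:
    """Hypothetical record for one path of a timing report."""
    name: str
    slack_ns: float
    nets: tuple


def key_nets(paths, n_paths=50, max_slack_ns=2.0):
    """Collect the nets lying on the n_paths worst-slack paths below max_slack_ns."""
    critical = sorted(
        (p for p in paths if p.slack_ns < max_slack_ns),
        key=lambda p: p.slack_ns,
    )[:n_paths]
    nets, seen = [], set()
    for path in critical:
        for net in path.nets:
            if net not in seen:  # a net may lie on several critical paths
                seen.add(net)
                nets.append(net)
    return nets


if __name__ == "__main__":
    report = [
        TimingPath("p0", 0.12, ("n1", "n2")),
        TimingPath("p1", 0.05, ("n2", "n3")),
        TimingPath("p2", 2.50, ("n4",)),
    ]
    print(key_nets(report, n_paths=2))
```

Deduplicating the nets keeps the later geometry extraction and delay computation from being repeated for interconnects shared by several critical paths.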

Fig. 1. The proposed temperature-aware interconnect reliability framework

4.1 Estimation of Delay Impact on Interconnects

For each of the wires identified in the 50 most timing-critical paths, our flow estimates the delay overhead due to TDDB, based on pre-computed inter-metal dielectric (IMD) leakage look-up table libraries, given in operating and accelerated conditions. This delay, computed for each of the nets of the examined path, is annotated to the Standard Delay Format (SDF) file of the design, to update the specific net delay with the new value. The computation of the additional delay due to TDDB is performed in three steps, as shown in Fig. 1. It is carried out by a Matlab script, based on the wire extraction information for the nets of the examined path, while taking into account the proper temperature, depending on the units through which the specific path passes. The final SDF file, including the new net delays, is then back-annotated, along with the post-layout netlist, to the static timing analyzer of ETS, to evaluate the impact of TDDB on the design’s performance. A detailed description of the three steps required for the TDDB impact annotation is given below:

Step 1 – IMD Leakage Extrapolation: Based on the neighboring wire information, a Matlab script computes the additional delay overhead due to inter-metal leakage and annotates the shifted delay to the SDF file of the design, for the TDDB timing impact evaluation. The additional delay calculation is divided into two steps. First, the script reads all the wires of the examined net from wire.report, as well as their neighboring wires, and obtains the IMD leakage in operating conditions from that in accelerated conditions by performing linear extrapolation, based on experimental look-up table libraries. These libraries contain IMD leakage information after having stressed the wires for up to one hour at voltages of 35V, 40V and 45V, under temperatures of 323K, 398K and 448K, respectively.
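Step 1 can be pictured as a straight-line fit through leakage values measured at the stress voltages, evaluated at the operating voltage. The sketch below makes several assumptions: it fits in voltage only (temperature is ignored), the leakage numbers are placeholders chosen for illustration rather than measurements, and the function names are hypothetical. It does not reproduce the paper's Matlab script or library format.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_leakage(stress_voltages, stress_leakage, operating_voltage):
    """Evaluate the fitted line at the operating voltage."""
    slope, intercept = fit_line(stress_voltages, stress_leakage)
    return slope * operating_voltage + intercept


if __name__ == "__main__":
    # Stress voltages from the text; leakage values are illustrative placeholders
    # in arbitrary units, NOT data from the paper.
    voltages = [35.0, 40.0, 45.0]
    leakage = [3.6, 4.1, 4.6]
    print(extrapolate_leakage(voltages, leakage, operating_voltage=0.9))
```

A linear fit is the simplest choice and matches the wording of the text; note that, with real data, a straight line extended this far below the stress range can yield non-physical (negative) values, so the result should be checked.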
Step 2 – Delay Increment Computation: The extrapolated leakage is used to estimate the additional delay on the net due to TDDB, based on another look-up table library, which provides the delay increment ratio for charging or discharging a wire, depending on the inter-metal leakage between two adjacent wires of varying length, spacing and overlap. To construct this library, we simulated the behavior of two neighboring wires in Synopsys HSPICE, in order to find the ratio of delay increment for charging and discharging a wire due to IMD leakage, for various possible adjacent wire patterns. This library was created once and is used in all the conducted experiments, since the on-the-fly extraction and simulation of adjacent wire patterns for all the timing paths of each layout would be time-consuming.

In the conducted experiments, the wire length ranges between 10um and 600um, in order to include wire patterns with a length equal to or greater than those met in the layouts of our case study. The spacing ranges between 0.06um and 0.5um, covering the range defined in the design rules of the 45nm standard-cell library used for the implementation of the layouts. Moreover, in order to measure the delay of wires which are not totally overlapped, we simulated wire patterns where the starting point of the neighboring wire differed from that of the wire for which the delay was measured. Hence, the neighboring wire’s starting point ranged from zero (total overlap of wires) to 75% of the target wire’s length (smallest overlap of wires). In order to simulate the inter-metal leakage in HSPICE, we used current sources distributed across the target wire at each R-C (Resistance-Capacitance) segment.
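The parameter sweep behind such a pattern library could be enumerated as in the sketch below. Only the range limits (10um to 600um, 0.06um to 0.5um, 0% to 75%) come from the text; the intermediate grid points and the function name are assumptions made for illustration.

```python
from itertools import product


def wire_patterns(lengths_um, spacings_um, start_fractions):
    """Yield (length_um, spacing_um, neighbour_start_um) for every combination."""
    for length, spacing, frac in product(lengths_um, spacings_um, start_fractions):
        if not 0.0 <= frac <= 0.75:
            raise ValueError("neighbour start must lie within 0-75% of the wire length")
        yield (length, spacing, frac * length)


if __name__ == "__main__":
    # Hypothetical grid: end points follow the text, inner points are placeholders.
    lengths = [10, 100, 600]
    spacings = [0.06, 0.5]
    starts = [0.0, 0.75]
    for pattern in wire_patterns(lengths, spacings, starts):
        print(pattern)
```

Each tuple would correspond to one HSPICE simulation whose delay increment ratio is stored in the library.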

[Figure 2: distributed R-C model of the target wire, divided into a pre-overlap region, an overlap region and a post-overlap region]

Fig. 2. The distributed RC model, simulating inter-metal leakage in HSPICE

The total leakage current for each wire in our simulations depends on the wire’s length and varies between 0 and 50uA, in order to cover a wide enough range. The value of each current source, given in uA, depends on the overlap length between the target wire and its adjacent one. In our approach, it is computed by dividing the total leakage current for each wire by the number of R-C stages corresponding to the overlap length between the target wire and the adjacent one. In Fig. 2, we show an example of an equivalent distributed R-C model of a wire that has two of its four R-C stages overlapping with its neighboring wire (overlap region) in the same metal layer.
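The division described above can be sketched in a few lines. The function name and arguments are hypothetical, and the example assumes that the two overlapping stages of the four-stage wire are the middle ones, which the text does not state explicitly.

```python
def stage_currents(total_leakage_ua, n_stages, overlap_first, overlap_count):
    """Split the total leakage evenly over the R-C stages inside the overlap region.

    Stages outside the overlap region get no current source (0 uA).
    """
    if overlap_count <= 0 or overlap_first < 0 or overlap_first + overlap_count > n_stages:
        raise ValueError("overlap region must be non-empty and lie within the wire")
    per_stage = total_leakage_ua / overlap_count
    return [
        per_stage if overlap_first <= i < overlap_first + overlap_count else 0.0
        for i in range(n_stages)
    ]


if __name__ == "__main__":
    # Four R-C stages, two of them overlapping, 50 uA total leakage.
    print(stage_currents(50.0, n_stages=4, overlap_first=1, overlap_count=2))
```

Keeping the total leakage fixed and spreading it over the overlapping stages means that a shorter overlap concentrates more current per stage.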

Step 3 – Interconnect Delay Computation: Based on the delay increment ratios extracted from the wire simulations, we constructed a look-up table library that includes the length of the wire whose delay is computed, along with the neighboring wire’s length, the spacing between the wires and the starting point of the neighboring wire, all given in um. Based on this library, named TDDB_LUT.lib in our flow, as well as on the leakage extrapolated from accelerated to operating conditions in the first step, we perform a linear interpolation in Matlab to compute the delay ratio for each wire of the examined net in the current timing path. It must be noted that only wires longer than 10um are considered in the additional delay calculation script, in correspondence with the range of wire lengths included in TDDB_LUT.lib.

In the linear extrapolation method performed in Matlab, we derive the wires of each net in the examined path of the design by reading wire.report, which contains the physical dimensions of the net’s wires and of their neighboring ones in the same metal stack, obtained in the initial layout extraction step. The arguments passed for the extrapolation are the wire’s width, thickness and length, as well as the starting point of each neighboring wire, its length and the distance between them, all obtained from the layout extraction. Hence, in order to find the additional delay for the specific wire, considering the neighboring ones from the layout, we perform a linear interpolation between these values and those of the wire patterns simulated in HSPICE, for which we have already computed the delay increment ratios and stored them in TDDB_LUT.lib, as mentioned above. The delay overhead for individual wires is shown in Fig. 3 as a function of the wires’ length and distance (right figure), as well as of temperature and operation time, given in years (left figure).
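The look-up step can be illustrated with a one-dimensional piecewise-linear interpolation over leakage for a single, fixed wire pattern. This is a simplified sketch: the actual library is indexed by several geometric parameters as well, the table values below are placeholders rather than simulation results, and the function name is hypothetical.

```python
from bisect import bisect_right


def interp_delay_ratio(leakage_ua, table):
    """Piecewise-linear interpolation in a sorted [(leakage_uA, delay_ratio), ...] table."""
    xs = [x for x, _ in table]
    if not xs[0] <= leakage_ua <= xs[-1]:
        raise ValueError("leakage outside the tabulated range")
    # Index of the upper bracketing entry, clamped so the last point is included.
    hi = min(bisect_right(xs, leakage_ua), len(xs) - 1)
    lo = hi - 1
    (x0, y0), (x1, y1) = table[lo], table[hi]
    return y0 + (y1 - y0) * (leakage_ua - x0) / (x1 - x0)


if __name__ == "__main__":
    # Placeholder entries spanning the 0-50 uA range used in the simulations.
    lut = [(0.0, 0.0), (25.0, 0.10), (50.0, 0.30)]
    print(interp_delay_ratio(12.5, lut))
```

Refusing values outside the tabulated range, rather than extending the last segment, keeps the look-up within the conditions that were actually simulated.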

Fig. 3. Delay impact on a wire due to TDDB depending on: temperature (left) and wire length and distance (right)

The calculated delay ratio is then multiplied by the quotient of the target wire’s length and the total net length, and the result is added to the wire’s initial delay due to IMD leakage, which is of course zero. Thus, the additional delay on the specific wire due to TDDB is computed. The same process is performed for all the wires of the examined net in the current path of the design. The total additional delay of the whole net due to TDDB is the weighted sum of the delays of all the net’s wires, where each weight is the wire’s length divided by the total net length. The updated net delay is then annotated into the design’s SDF file, by finding the specific net and adding the extra delay to the existing one. The resulting SDF, containing the delay overhead of all nets of the path, is then annotated to ETS to evaluate the total impact of TDDB on the design’s performance. This process is repeated for all the selected register-to-register paths in the design, and it is applicable to any other design with a reasonable number of gates.
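The length-weighted aggregation can be sketched as follows. The function names and numbers are hypothetical, and the way the net-level ratio is applied to the existing SDF delay is an assumption made for this example: the text does not state explicitly how the ratio is converted into an absolute delay.

```python
def net_delay_ratio(wires):
    """Length-weighted sum of per-wire delay ratios for one net.

    `wires` is a list of (length_um, delay_ratio) pairs.
    """
    total_length = sum(length for length, _ in wires)
    if total_length <= 0:
        raise ValueError("net has no wire length")
    # Each wire contributes its ratio scaled by its share of the net length.
    return sum(ratio * length / total_length for length, ratio in wires)


def updated_net_delay(existing_delay_ps, wires):
    # Assumption: the ratio scales the delay already annotated in the SDF.
    return existing_delay_ps * (1.0 + net_delay_ratio(wires))


# Hypothetical net made of two wires: 30 um with ratio 0.02, 70 um with ratio 0.04.
example_wires = [(30.0, 0.02), (70.0, 0.04)]
new_delay_ps = updated_net_delay(100.0, example_wires)
```

Longer wires dominate the sum, which matches the observation later in the paper that wire length drives the chip-level impact.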

5 Evaluation of the TDDB Framework to a LEON3-Based MP-SoC

The presented TDDB analysis flow is applied to an MP-SoC design, based on two LEON3 SPARC processor cores, both attached to the AMBA Advanced High Performance bus (AHB). Each processor has seven pipeline stages, while the internal caches include 2 sets of 4K bytes. The design’s RTL description is given in parameterized VHDL, configured via the Gaisler Research automated tools [14]. It is synthesized in Synopsys Design Compiler based on the TSMC 45nm standard-cell library (0.9V, 25 °C) and at a clock period constraint of 2ns, resulting in about 30K gates. The floorplanning and the place-and-route steps are implemented in Cadence SoC Encounter, while ETS is employed for Static Timing Analysis. The post-layout Verilog netlist simulation is performed in ModelSim [15], where we obtained switching activity from a matrix multiplication application, running on both processors, as well as from an MP-SoC benchmark initializing the two cores and the system’s peripherals, included in Gaisler’s suite. The power analysis is performed in PrimeTime PX, by annotating the .vcd (Value Change Dump) file with the design’s activity, derived from ModelSim.

In the proposed case study, we explore how the impact of TDDB on the performance of a LEON3-based MP-SoC design may change when different placement and routing scenarios are selected for the gate-level netlist obtained from synthesis. The dependence of inter-metal leakage on the length and distance of wires motivated us to look at different place-and-route strategies favoring either timing or congestion, to find out which scenario minimizes the timing impact of TDDB.

5.1 Experimental Results and Discussion

The main parameters that affect TDDB in the interconnect dielectrics are temperature, wire length and distance between adjacent wires. Temperature is mostly affected by the switching activity of the design. In our LEON3 layouts, which were implemented based on five different place-and-route scenarios, we observed minor temperature differences for the two application benchmarks of different computational effort mentioned above. This is due to the similar power traces extracted from power analysis for the two benchmarks, as well as to the fact that we used the same floorplan to implement the different placement and routing strategies.

On the other hand, the geometrical parameters of the interconnect stack, like lengths and distances, are mainly determined by how the circuit is placed and routed. In principle, a timing-optimized placement and routing approach will tend to lead to shorter wires, while a congestion-oriented physical implementation strategy will tend to result in longer wires due to coarser placement, as well as to the detouring of wires during routing to avoid the formation of over-congested areas. Therefore, it is likely that such a strategy will incur larger distances between wires in the same metal layer.

However, the results depicted in Fig. 4 indicate that when placement is congestion-aware (CPl-NR & CPl-CR), the delay overhead due to TDDB is very high. Such a placement scenario will spread out the standard cells and inevitably lead to longer wires at the routing stage, compared to the timing-driven approach and irrespective of the routing strategy. At the other extreme, timing-aware placement and routing seems to result in the minimum delay impact, because the wire lengths are minimal.

Combining these remarks with those of Fig. 3 (delay impact on a wire due to TDDB versus wire length and distance), we can draw interesting conclusions. Even though at the individual wire level the distance between wires is the most critical parameter for the delay impact of TDDB, at the entire chip level wire length is the only important parameter for our LEON3-based layouts. In the presented case study, the different routing strategies, favoring timing or congestion, tend to leave the distances between wires almost unaffected. Hence, a timing-optimal placement and routing approach will also lead to the best layout for TDDB. Since timing is usually the major design specification, layouts produced by a fully timing-driven place-and-route approach will also be optimal for TDDB.

Fig. 4. Chip-level timing overhead due to TDDB for different layout styles (C: congestion-aware, T: timing-aware, N: normal, Pl: placement, R: routing)

This does not imply that designers need not worry about TDDB, however. Timing-optimal layouts tend to have minimal slack between the data arrival time and the required time, so that the designs can run at the highest possible clock frequency. Even after 3 years of operation, the layout we used has incurred a critical path delay overhead of about 40ps, which might be enough to cause a timing violation. There is a trade-off between the actual clock frequency and the operating lifetime of the chip. If enough timing slack is left for TDDB tolerance, the expected operating lifetime will be longer, while the design’s operating frequency and consequently its performance will degrade, and vice versa.

6 Conclusion and Hints for Future Work

In this work, we introduced a reliability analysis framework that estimates the impact of Time-Dependent Dielectric Breakdown on the system’s performance, considering an MP-SoC design implemented in a nanometer CMOS technology with different place-and-route strategies. The proposed flow captures the timing violations induced by the inter-metal leakage of low-k interconnects on the examined paths and predicts the gradual performance degradation for each implementation scenario, considering the layout’s temperature profile for a specific application. Future work may focus on the framework’s automation, as well as on the selection of paths, depending on the temperature of design units, the congestion and the length of wires.

References

1. ITRS 2005 public reports (2005), http://public.itrs.net
2. Chen, F., et al.: Critical low-k reliability issues for advanced CMOS technologies. In: Proc. of the 2009 IRPS Symposium, Montreal, Canada, May 26-30, pp. 464–475 (2009)
3. Nitta, S., et al.: Copper BEOL interconnects for silicon CMOS logic technology. In: Davis, J.A., Meindl, J.D. (eds.) Interconnect Technology and Design for Gigascale Integration. Springer, Heidelberg (2003)
4. Gonella, R.: Key reliability issues for copper integration in damascene architecture. Journal of Microelectronic Engineering 55(1-4), 245–255 (2001)
5. Tan, T.L., Gan, C.L., Du, A.Y., Cheng, C.K., Gambino, J.P.: Dielectric degradation mechanism for copper interconnects capped with CoWP. Applied Physics Letters 92, 201916 (2008)
6. Takeda, K.-i., Ryuzaki, D., Mine, T., Hinode, K., Yoneyama, R.: Copper-induced dielectric breakdown in silicon oxide deposited by plasma-enhanced chemical vapor deposition using trimethoxysilane. Journal of Applied Physics 94, 2572 (2003)
7. Chen, F., et al.: Line-edge roughness and spacing effect on low-k TDDB characteristics. In: Proceedings of the 2008 International Reliability Physics Symposium (IRPS), April 27-May 1, pp. 132–138 (2008)
8. Chen, F., Shinosky, M.: Addressing Cu/Low-k Dielectric TDDB Reliability Challenges for Advanced CMOS Technologies. IEEE Transactions on Electron Devices 56(1), 2–12 (2009)
9. Li, Y.: Low-k dielectric reliability in copper interconnects, PhD Dissertation, Katholieke Universiteit Leuven (2007)
10. Guo, J., et al.: A Tool Flow for Predicting System-Level Timing Failures due to Interconnect Reliability Degradation. In: Proc. of the 2008 GLSVLSI International Symposium, Orlando, Florida, USA, May 4-6, pp. 291–296 (2008)
11. Guo, J., et al.: The Analysis of system level timing failures due to interconnect reliability degradation. IEEE Transactions on Device and Material Reliability (2009)
12. Huang, W., Ghosh, S., Velusamy, S., Sankaranarayanan, K., Skadron, K., Stan, M.R., Brown, C.L.: HotSpot: a compact thermal modeling methodology for early-stage VLSI design. IEEE Transactions on VLSI Systems 14(5) (May 2006)
13. Cadence SoC Encounter Database Access command reference, http://www.cadence.com
14. Aeroflex Gaisler Research, http://www.gaisler.com
15. Mentor Graphics ModelSim, http://www.model.com

An Efficient Low Power Multiple-Value Look-Up Table Targeting Quaternary FPGAs

Cristiano Lazzari1, Jorge Fernandes2, Paulo Flores2, and José Monteiro2

1 INESC-ID, Lisbon, Portugal 2 INESC-ID / IST, TU Lisbon, Lisbon, Portugal {lazzari,jorge.fernandes,pff,jcm}@inesc-id.pt

Abstract. FPGA structures are widely used as they enable early time-to-market and reduced non-recurring engineering costs in comparison to ASIC designs. Interconnections play a crucial role in modern FPGAs, because they dominate delay, power and area. Multiple-valued logic allows the reduction of the number of interconnections in the circuit, and hence can serve as a means to effectively curtail the impact of interconnections. In this work we propose a new look-up table structure based on a low-power high-speed quaternary voltage-mode device. The most important characteristics of the proposed architecture are that it is a voltage-mode structure, which allows reduced power consumption, and that it is implemented with a standard CMOS technology. Our quaternary implementation overcomes the drawbacks of previously proposed techniques with simple and efficient CMOS structures. Moreover, results show significant reductions in power consumption and timing in comparison to binary implementations with similar functionality.

Keywords: Multiple-value Logic, Quaternary Logic, Look-up Tables, FPGAs, Standard CMOS Technology.

1 Introduction

Designers face new challenges in modern systems on a chip (SoCs) due to the large number of components. The high integration of different systems increases the number and length of interconnections, which are becoming the dominant aspect of the circuit delay for state-of-the-art circuits due to the advent of deep sub-micron (DSM) technologies. This fact becomes more significant with each new technology generation [1]. In DSM technologies, the gate speed, density and power scaling follow Moore’s law. On the other hand, the interconnection resistance-capacitance product increases with the technology node, leading to an increase in network delay. Even after modifications in interconnections, from aluminum to copper and low-k inter-metal dielectric materials, the problem remains and is becoming more significant [2].

In particular, interconnections play a crucial role in Field Programmable Gate Arrays (FPGAs), because they not only dominate the delay, but also have a significant impact on power consumption [3] and occupied area [4]. Recent work suggests that in modern million-gate FPGAs, as much as 90% of chip area is dedicated to interconnections [5]. In order to keep the wide range of applications of FPGAs in the market, one must deal with their excessive power dissipation, and this must be reduced without compromising computational power. One way to deal with this problem is to reduce the area occupied by the interconnections by reducing not only the number of interconnections, but also their length.

Multiple-valued logic (MVL) has received increased attention in recent years because of the possibility of representing information with more than two discrete levels in a single wire. Hence, the number of interconnections can be significantly reduced, with major impact on all design parameters: less area dedicated to interconnections; more compact and shorter interconnections, leading to increased performance; lower interconnect switched capacitance, and hence lower global power dissipation [6]. MVL has been successfully applied to several types of devices such as adders [7] and multipliers [8], and programmable devices [5,9] have also been proposed.

The main drawbacks of these previous MVL implementations are that they are either based on current-mode devices or demand extra steps in the fabrication process (for the generation of transistors with different Vths). Current-based circuits achieve improvements in area, but their excessive power consumption and implementation complexity have prevented, until now, MVL systems from being a viable alternative to standard CMOS designs. On the other hand, while it is true that technologies with multiple Vths deal very well with the power dissipation problem, as stated in [5,10], the additional phases in the fabrication process make their implementation more difficult, more susceptible to variability problems and more expensive.

In this work we present a new implementation of a multiple-valued look-up table based on the quaternary representation, taking advantage of the analog nature of the multiple-valued representation. We implemented the quaternary look-up table by using a simple and efficient analog structure able to deal with the quaternary signals. Results show that our implementation overcomes the drawbacks of previous implementations and is competitive when compared to binary LUTs with the same functionality.

This paper is organized as follows. Section 2 discusses the differences between binary and quaternary look-up table implementations. Section 3 presents the new quaternary look-up table, giving details about the proposed structure. A comparison between the binary and quaternary look-up tables is presented in Section 4. Variability and the reduced noise margin effects in quaternary circuits are discussed in Section 5, and finally, Section 6 concludes the paper and outlines future work.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 84–93, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Binary and Quaternary Look-Up Tables Overview

General Look-Up Tables (LUT) are basically memories, which implement a logic function according to their configuration. Configuration values $C = (c_0, \cdots, c_i, \cdots, c_{k-1})$ are initially stored in the look-up table structure, and once inputs are applied to it, the logic value in the addressed position is assigned to the output. The capacity of a LUT $|C|$ is given by

$|C| = n \times b^k$ (1)

where $n$ is the number of outputs, $k$ is the number of inputs and $b$ is the number of logic values. For example, a 4-input binary look-up table with one output is able to store $1 \times 2^4 = 16$ Boolean values. For the purpose of this work, only 1-output LUTs ($n = 1$) are discussed in this paper.

A binary function implemented by a Binary Look-Up Table (BLUT) is defined as $f: B^k \rightarrow B$, over a set of variables $X = (x_0, \cdots, x_i, \cdots, x_{k-1})$, where each variable $x_i$ represents a Boolean value. The total number of different functions $|F|$ that can be implemented in a BLUT with $k$ input variables is given by

$|F| = b^{|C|}$ (2)

where $b = |B|$ ($b = 2$ in the binary case). For example, a look-up table with 4 inputs ($k = 4$) can implement one of $|F| = 65,536$ different functions.

Quaternary functions are basically generalizations of binary functions. A quaternary function implemented by a quaternary look-up table (QLUT) is defined as $g: Q^k \rightarrow Q$, over a set of quaternary variables $Y = (y_0, \cdots, y_i, \cdots, y_{k-1})$, where the values of a variable $y_i$, as well as the values of the function $g(Y)$, can be in $Q = \{0, 1, 2, 3\}$. As in the binary case, the number of possible functions in QLUTs is given by (2), where $b = 4$. In this case, the number of functions that can be represented is around $4.3 \times 10^9$ for a QLUT with only two quaternary inputs ($k = 2$), which is much larger than for the BLUT.

It is important to highlight that the function $g(Y)$ performs exactly the same function as two BLUTs, $f_0(Y)$ and $f_1(Y)$, where $f_0$ represents the least significant Boolean value and $f_1$ represents the most significant one. Following the same idea, the configuration values are also quaternary for the QLUT, each representing two binary configuration values.

Since a quaternary variable $y$ is capable of representing twice as much information as a binary variable $x$, we note that $|Q| = 2 \times |B|$ in our experiments. In other words, two binary variables with the same inputs can be grouped in order to represent a quaternary variable. This procedure aims at reducing both the total number of connections and the number of gates.
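The counting arguments of equations (1) and (2), and the equivalence between one QLUT and two BLUTs, can be checked with a short Python sketch. The function names are hypothetical, and encoding each quaternary value as a pair of bits (least significant bit for f0, most significant bit for f1) follows the description above.

```python
def lut_capacity(n_outputs: int, base: int, k_inputs: int) -> int:
    """Equation (1): number of configuration values stored in the LUT."""
    return n_outputs * base ** k_inputs


def function_count(base: int, capacity: int) -> int:
    """Equation (2): number of distinct functions the LUT can implement."""
    return base ** capacity


def split_qlut(config):
    """Split quaternary configuration values into two binary LUT configurations."""
    f0 = [c & 1 for c in config]         # least significant Boolean value
    f1 = [(c >> 1) & 1 for c in config]  # most significant Boolean value
    return f0, f1


blut_functions = function_count(2, lut_capacity(1, 2, 4))  # 4-input BLUT
qlut_functions = function_count(4, lut_capacity(1, 4, 2))  # 2-input QLUT
```

Both LUTs store 16 values, but each quaternary value carries two bits, which is why the QLUT count equals 4^16 = 2^32 rather than 2^16.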

3 Look-Up Tables Implementation

Binary and quaternary look-up tables were implemented with transmission gates. For the binary version, transmission gates are controlled by the BLUT inputs, while the QLUT is composed of transmission gates controlled by a new quaternary-to-binary device.

Fig. 1a shows a binary 4-input BLUT implementation ($b = 2$, $k = 4$, $|C| = 16$) where $x_i \in X$ are the inputs, $c_i \in C$ form the look-up table configuration and $z$ is the output. The BLUT is composed of four stages as a consequence of the number of inputs. Multiplexers (implemented using transmission gates) are responsible for propagating the configuration values to the BLUT output. The transmission gates receive selection signals from the four BLUT inputs and associated inverters.

A quaternary look-up table (QLUT) follows the same structure as the BLUTs. Fig. 1b illustrates the implementation of a 2-input QLUT ($b = 4$, $k = 2$, $|C| = 16$). As in the binary case, $c_i \in C$ are the look-up table configuration values, $y_i \in Y$ are the inputs and $w$ is the output. Due to the quaternary representation, only two stages of transmission gates are required. The transmission gates are controlled by binary signals. Therefore, we need a special circuit to convert the quaternary inputs $y_0$ and $y_1$ to the corresponding control signals: the quaternary-to-binary converter (Q-decoder).

Fig. 1. Binary and quaternary look-up table implementations: (a) 4-input BLUT; (b) 2-input QLUT

Table 1. The Q-decoder behavior as a function of the quaternary logic value at the input

Q       Q0      Q1      Q2      Q3
$0_4$   $1_2$   0       0       0
$1_4$   0       $1_2$   0       0
$2_4$   0       0       $1_2$   0
$3_4$   0       0       0       $1_2$

3.1 Quaternary-to-Binary Converter

Table 1 shows the Q-decoder binary output logic values as a function of the quaternary input Q. Outputs Q0 to Q3 determine which transmission gates (in Fig. 1b) propagate the configuration value $c_i \in C$ to the QLUT output $w$. Note that the values of the controlling signals Q0, Q1, Q2 and Q3 are binary, meaning 0 (0V) or $1_2$ (VDD).

The Q-decoder outputs may be considered as flags that determine which quaternary value is applied to the Q-decoder input. Once we are able to determine the quaternary value at the Q-decoder input Q, the transmission gates connected to the Q-decoder outputs may be properly controlled. In other words, with the Q-decoder structure we are able to convert a quaternary input to a 4-bit word in one-hot encoding and its inverted value.
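A behavioral sketch of the one-hot decoding and of the two-stage selection in a 2-input QLUT is given below in Python. It models logic values only, not the analog circuit. The function names are hypothetical, and the ordering of the configuration values (y0 selecting within groups of four, y1 selecting the group) is an assumption, since the exact wiring is only given in Fig. 1b.

```python
def q_decode(q: int):
    """One-hot flags (Q0, Q1, Q2, Q3) for a quaternary digit, as in Table 1."""
    if q not in (0, 1, 2, 3):
        raise ValueError("quaternary digit must be in 0..3")
    return tuple(1 if q == i else 0 for i in range(4))


def qlut_read(config, y0: int, y1: int) -> int:
    """Select one of 16 configuration values through two stages of gates."""
    sel0, sel1 = q_decode(y0), q_decode(y1)
    # Stage 1: y0 enables one transmission gate in each group of four values.
    stage1 = [sum(s * config[4 * g + i] for i, s in enumerate(sel0))
              for g in range(4)]
    # Stage 2: y1 enables one of the four stage-1 outputs.
    return sum(s * v for s, v in zip(sel1, stage1))


# Hypothetical configuration holding 16 quaternary values.
example_config = [(3 * i) % 4 for i in range(16)]
w = qlut_read(example_config, y0=2, y1=1)
```

Because exactly one flag is active per decoder, each stage passes a single value, which is the role of the transmission gates in the circuit.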

Fig. 2. The Q-decoder logic structure

The Q-decoder structure is shown in Fig. 2. The main advantage of this structure over previously proposed implementations is that it uses only standard CMOS structures. The Q-decoder is composed of two comparators, CP and CN, and other traditional digital circuits such as inverters, NANDs and NORs. CP and CN are the self-referenced analog comparators shown in Fig. 3. With these structures we are able to detect the four possible voltage levels. In a binary implementation, an inverter may be seen as a comparator where the voltage reference is VDD/2. For our quaternary device, we need three voltage references in order to determine a quaternary value, at 1/6 VDD, 3/6 VDD and 5/6 VDD, as depicted in Fig. 3a.
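The role of the three references can be illustrated with a purely behavioral Python sketch that maps a voltage to a quaternary digit. It does not model the CP and CN comparators themselves. The function name is hypothetical, the 1.2V supply is the value quoted later for this technology, and evenly spaced nominal levels are an assumption.

```python
def quantize(v: float, vdd: float = 1.2) -> int:
    """Map a voltage to a quaternary digit using the three references."""
    thresholds = (vdd / 6, 3 * vdd / 6, 5 * vdd / 6)
    # The digit equals the number of references the voltage exceeds.
    return sum(v > t for t in thresholds)


# Assumed nominal levels: 0, VDD/3, 2*VDD/3 and VDD.
digits = [quantize(level) for level in (0.0, 0.4, 0.8, 1.2)]
```

Each reference sits halfway between two adjacent nominal levels, in the same way that VDD/2 sits halfway between the two binary levels.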

Fig. 3. Quaternary logic levels and comparator details: (a) logic levels; (b) CP and CN transfer functions; (c) CP structure; (d) CN structure

One way to obtain this comparator behavior is by designing inverters with unbalanced PMOS and NMOS transistor widths. The main drawback of this technique is that it leads to large transistor widths with large gate capacitances, penalizing speed and power. Furthermore, in technologies with low VDD, reference voltage values are below Vth, which makes this sizing technique impracticable. To overcome this problem, we propose the use of the comparator circuits in Fig. 3c and Fig. 3d, which add an extra transistor connected as a “diode” to shift the supply voltage by Vth.

In a first-order approach, we consider simplified transistor models and assume that transistors are equally sized ($k_1 = k_2 \rightarrow \mu_n (W/L)_1 = \mu_p (W/L)_2$), with equal threshold voltages ($V_{th1} = V_{th2} = V_{th}$). This simplified analysis is confirmed by simulations with more accurate models that will be presented in the next sections.

Reference points are defined by calculating $v_x$ for $v_i = 0$ (3) and the transition points (4), leading to the transfer function curves represented in Fig. 3b.

$v_x \big|_{v_i = 0} \Rightarrow i_D = 0 \Rightarrow k_2 (V_{DD} - v_x - V_{th2})^2 = 0 \Rightarrow v_x = V_{DD} - V_{th2}$ (3)

$i_{D1} = i_{D2} \Rightarrow k_1 (v_i - V_{th1})^2 = k_2 (v_x - v_i - V_{th2})^2$, with $v_x = V_{DD} - V_{th}$
$\Rightarrow v_i - V_{th1} = V_{DD} - v_i - 2 V_{th} \Rightarrow 2 v_i = V_{DD} - V_{th} \Rightarrow v_i = \frac{V_{DD} - V_{th}}{2}$ (4)

The Q-decoder was implemented with the UMC 130nm technology. Simulation waveforms are shown in Fig. 4, where the Q-decoder outputs behave as expected and as described in Table 1. The largest propagation delay from the Q-decoder input to the outputs (Q → Q2) is 196ps for this technology. This result is very important, because an inverter connected to the same transmission gates (i.e., same output load) presents an 81ps propagation delay, and the transmission gates are the main contributors to the look-up table propagation delay. More details about the comparison of binary and quaternary LUTs are given in the next section.

4 Binary vs Quaternary Look-Up Tables

We also implemented the complete binary and quaternary look-up tables with the UMC 130nm technology in order to evaluate their performance and power consumption. The binary and quaternary LUTs were developed according to Fig. 1. Transistor widths were kept at the minimum value in order to have a fair comparison between the binary and quaternary versions.

We inserted buffers in the binary structure in order to reduce the impact of the gate capacitances. According to Fig. 1a, a cell connected to the BLUT input x0 should drive 16 transistors. We balanced these gate capacitances by inserting 4 buffers, thus improving the propagation delay. The power consumption was also reduced due to the faster transitions and, as a consequence, shorter short-circuit times.

Experimental results are shown in Table 2, where the quaternary structure proposed in this paper outperforms the binary implementation in both power consumption and propagation delay. These results were obtained through CADENCE Spectre simulation [11]. The propagation delay is simply the largest delay from an input to the output of each LUT. The average power consumption is obtained from the simulation of 1024 random input vectors, with the circuits running at 100MHz.

Fig. 4. The Q-decoder input and output waveforms (Q, Q0, Q1, Q2 and Q3 in V versus t in ns)

For the quaternary circuits, we carefully took into consideration every single voltage source (e.g., those used to drive the ci values of the QLUT), so that the results shown in Table 2 reflect the real power consumption (i.e., currents flowing from one voltage source to another are considered).

Results highlight that the quaternary look-up table proposed in this paper is very promising. In terms of delay, the quaternary LUT presents a very similar behavior, but better results are obtained when the load capacitance is 0.5pF or larger. The power consumption is the most important result. According to Table 2, the quaternary LUT presents gains ranging from 22% (Cl=0.2pF) to 39% (Cl=1pF) in terms of power consumption. Note that, as for the propagation delay, gains are larger when the load capacitance increases.

These power gains are obtained due to the reduced voltage levels. While binary transitions range from 0V to 1.2V (for this technology), quaternary transitions may vary from 0V→0.44V to 0V→1.2V, demanding different current flows. Considering that all the possible transitions have the same probability, quaternary transitions have a smaller average voltage transition, reducing the average current flow and consequently the power dissipation.
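The argument about the average voltage transition can be illustrated with a small Python calculation. It assumes four evenly spaced levels between 0V and 1.2V and equally likely transitions between distinct levels. The level values and names are assumptions made for this sketch, not simulation data.

```python
from itertools import permutations


def mean_swing(levels):
    """Average |dV| over all transitions between distinct levels."""
    swings = [abs(a - b) for a, b in permutations(levels, 2)]
    return sum(swings) / len(swings)


VDD = 1.2
binary_levels = [0.0, VDD]
quaternary_levels = [i * VDD / 3 for i in range(4)]  # assumed evenly spaced

binary_swing = mean_swing(binary_levels)
quaternary_swing = mean_swing(quaternary_levels)
```

Under these assumptions the average quaternary swing comes out at 5/9 of the binary one, which is consistent in direction with the power reduction reported in Table 2.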

Table 2. Delay and power consumption comparison of two 4-input BLUTs and one 2-input QLUT, both implemented with UMC 130nm process technology

Output       Two 4-input binary LUTs      One 2-input quaternary LUT
load (Cl)    Delay     Power@100MHz       Delay     Power@100MHz
0.2pF        0.91ns    45µW               0.95ns    35µW
0.5pF        1.9ns     68µW               1.7ns     43µW
1.0pF        3.4ns     94µW               3.0ns     57µW
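The power gains quoted in the text can be recomputed from the entries of Table 2. The short Python sketch below does so. The dictionary layout and function name are hypothetical, while the numbers are those of the table.

```python
# Load (pF) -> (power of two binary LUTs in uW, power of one quaternary LUT in uW)
TABLE2_POWER_UW = {
    0.2: (45, 35),
    0.5: (68, 43),
    1.0: (94, 57),
}


def power_gain_percent(binary_uw: float, quaternary_uw: float) -> float:
    """Relative power reduction of the quaternary LUT, in percent."""
    return 100.0 * (binary_uw - quaternary_uw) / binary_uw


gains = {load: round(power_gain_percent(b, q))
         for load, (b, q) in TABLE2_POWER_UW.items()}
```

The rounded values at 0.2pF and 1.0pF match the 22% and 39% figures given in the discussion.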

In a practical implementation of an FPGA, there will be a smaller number of interconnections due to the quaternary representation. Hence we will also be able to reduce the wire length and, as a consequence, the parasitic capacitance will be smaller. For this reason, we expect to obtain better results than the ones presented in this paper when developing a complete FPGA, based on the proposed circuits, to implement the quaternary logic.

5 Variability and Noise Margin in Quaternary Circuits

In current sub-micron and future technologies, process variability and reduced noise margin are important challenges for the development of multiple-valued devices. Voltage-mode multiple-valued logic devices use reduced voltage levels to represent logic values in comparison to binary circuits, and for this reason they may be, in theory, more susceptible to errors. However, we performed a Monte Carlo simulation with 500 runs to show that our quaternary LUT is robust to process variations when considering random process and mismatch variations. In these simulations, voltage variations are kept below 90mV for all the critical transition points (Q0 and Q3). Even with this variation range, we still have a 100mV gap between logic level transitions for other sources of noise or perturbations.

Noise margins are indeed reduced in quaternary circuits because we have four voltage levels while keeping the same supply voltage. However, we may argue from a different perspective. In recent years, supply voltages have been reduced from 5V to 3.3V, and recently to 1V. This is a huge reduction in the noise margin and circuits have successfully coped with it. It is important to highlight that the perturbations in quaternary devices should be smaller than in binary ones because of the smaller average voltage transitions, which leads to lower noise coupling between lines.

In summary, we may see the quaternary devices as a specific type of analog device. The knowledge and experience acquired by analog designers, applied to the development of these devices in sub-micron technologies, may be very useful in an effort to develop new multiple-valued devices.

6 Conclusions

This work presents important advances in the development of multi-valued circuits through the implementation of a quaternary look-up table targeting multiple-valued FPGAs. Results show that the proposed structure is competitive with the binary one, with significant reductions in power consumption and propagation delay. The technique proposed in this paper is simpler to implement than previously proposed multiple-valued circuits. Furthermore, as far as we know, no other proposed work is more efficient than our technique when compared to binary circuits. As future work, we are developing a complete FPGA (logic block, switch matrix, etc.). A functional quaternary FPGA will allow the study of viability and the comparison with current binary circuits. We are also planning to implement our quaternary device in more recent technologies such as 45nm and below.

Acknowledgments. This work was supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds and by the FCT project PTDC/EEA-ELC/72933/2006.

References

1. Gupta, A.K., Dally, W.J.: Topology optimization of interconnection networks. IEEE Comput. Archit. Lett. 5(1), 3 (2006)
2. Banerjee, K., Souri, S.J., Kapur, P., Saraswat, K.C.: 3-D ICs: a novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration. Proceedings of the IEEE 89(5), 602–633 (2001)
3. Li, F., Lin, Y., He, L., Chen, D., Cong, J.: Power modeling and characteristics of field programmable gate arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24(11), 1712–1724 (2005)
4. Singh, A., Marek-Sadowska, M.: Efficient circuit clustering for area and power reduction in FPGAs. In: Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, FPGA 2002, pp. 59–66. ACM, New York (2002)
5. da Silva, R., Lazzari, C., Boudinov, H., Carro, L.: CMOS voltage-mode quaternary look-up tables for multi-valued FPGAs. Microelectronics Journal 40(10), 1466–1470 (2009)
6. Dubrova, E.: Multiple-valued logic in VLSI: Challenges and opportunities. In: Proceedings of NORCHIP 1999, pp. 340–350 (1999)
7. Gonzalez, A., Mazumder, P.: Multiple-valued signed digit adder using negative differential resistance devices. IEEE Transactions on Computers 47(9), 947–959 (1998)
8. Hanyu, T., Kameyama, M.: A 200 MHz pipelined multiplier using 1.5 V-supply multiple-valued MOS current-mode circuits with dual-rail source-coupled logic. IEEE Journal of Solid-State Circuits 30(11), 1239–1245 (1995)
9. Zilic, Z., Vranesic, Z.: Multiple-valued logic in FPGAs. In: Proceedings of the 36th Midwest Symposium on Circuits and Systems, vol. 2, pp. 1553–1556 (August 1993)
10. Cunha, R., Boudinov, H., Carro, L.: Quaternary look-up tables using voltage-mode CMOS logic design. In: 37th International Symposium on Multiple-Valued Logic, ISMVL 2007, pp. 56–56 (May 2007)
11. Inc.: Virtuoso spectre simulator user guide (2010)

On Line Power Optimization of Data Flow Multi-core Architecture Based on Vdd-Hopping for Local DVFS

Pascal Vivet1, Edith Beigne1, Hugo Lebreton1, and Nacer-Eddine Zergainoh2

1 CEA-Leti, Minatec, Grenoble, France 2 TIMA, Grenoble, France {pascal.vivet,edith.beigne,hugo.lebreton}@cea.fr, [email protected]

Abstract. With growing integration, power consumption is becoming a major issue for multi-core chips. At system level, per-core DVFS is expected to save substantial energy, provided that the control is adapted. In this paper we propose a local on-line optimization technique to reduce energy in data-flow architectures, thanks to a Local Power Manager (LPM) using Vdd-Hopping for efficient local DVFS. The proposed control is a hybrid global and local scheme which respects throughput and latency constraints. The approach has been fully validated on a real MIMO Telecom application using a SystemC platform instrumented with power estimates. Local DVFS brings 45% power reduction compared to idle mode. When local on-line optimization benefits from computation time variations, 30% extra energy savings can be achieved.

Keywords: Low Power, DVFS, VDD-Hopping.

1 Introduction

In today’s Systems on Chip, power consumption is becoming a major issue. Dedicated mechanisms have been proposed in order to reduce both static and dynamic power consumption at different levels, from technology up to system level. At system level, Dynamic Power Management (DPM) techniques are classically used, such as advanced standby modes or efficient Dynamic Voltage and Frequency Scaling (DVFS). The main difficulty of DPM techniques is to design efficient dedicated control up to application level. Power management is often specific to the low-power design techniques and must take into account architecture and application. In future multi-cores, the Globally Asynchronous Locally Synchronous (GALS) paradigm is a natural enabler to help architecture partitioning and facilitate clock and power management [1][12]. In a GALS scheme, each IP unit has its own frequency and communicates asynchronously through a global interconnect. The GALS scheme enables local power management: each IP unit is an independent Voltage and Frequency Island (VFI). This is also commonly called “per-core DPM”: further energy savings are obtained, since the power optimum is not limited by the most constrained IP core but can be reached independently on each IP core.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 94–104, 2011. © Springer-Verlag Berlin Heidelberg 2011

Considering that energy depends on the square of Vdd, DVFS is the most promising technique in terms of overall energy reduction. Due to the usage of external DC-DC converters, today’s DVFS techniques are mostly CPU centric and not applied at IP level. Recently, a low-cost and efficient DVFS technique, called Vdd-Hopping, has been proposed [2][3]. By using only two external voltages and a dynamic voltage selector switch, DVFS can be efficiently offered locally to each IP core.

In this paper, we target heterogeneous data-flow-like architectures, with Telecom applications as an example [14]. Regarding the application, execution time variations are decisive. In non-real-time systems, voltage and frequency selection usually consists of a tradeoff between performance and energy. In the case of a real-time system with a data-flow application, timing constraints must be met and are twofold: a throughput constraint for each IP and an overall latency constraint on the whole data flow [4][5][6]. Heuristic algorithms can be used, based on the worst-case application scenarios [7][8]. In a heterogeneous architecture using dedicated IP engines, contrary to homogeneous multi-cores, task allocation is static and directly driven by the architecture. In that case, to reduce energy in a multi-application context and benefit from all available dynamic slack time, on-line optimization associated with a fast hardware DPM controller is required [10][11]. The Vdd-Hopping technique was introduced early on by T. Sakurai’s group, which proposed some software control techniques [16], but it has not yet been adapted to hardware heterogeneous architectures.

In this paper, we propose an on-line optimization technique to reduce energy in data-flow heterogeneous architectures by using a dedicated DPM controller, which uses the efficient Vdd-Hopping technique for local DVFS. This is a hybrid global and local technique, as in [11], which respects throughput and latency constraints and uses only two voltage/frequency points. The proposed technique has been applied to a real GALS NoC architecture targeting MIMO telecommunication applications [14]. Energy savings have been estimated on a SystemC simulation platform which has been instrumented with power estimates [15].

The outline of the paper is as follows: Section 2 introduces the targeted GALS NoC low-power architecture, and Section 3 describes the proposed Vdd-Hopping control for local DVFS. The local on-line optimization is described in Section 4. Finally, the experimental results are given in Section 5.

2 Low Power GALS NoC Architecture

The overall low-power architecture is organized within a complex GALS NoC fully implemented in asynchronous logic [14]. As shown in Figure 1, each synchronous IP unit of the SoC is integrated with advanced low-power mechanisms, such as in [12]. A programmable Local Clock Generator is implemented within each unit to generate a variable frequency F in a predefined applicative range. A local Power Supply Unit (PSU) manages the local unit voltage V, sharing a power switch between a Vdd-Hopping technique and a classical MTCMOS technique. The PSU uses two external voltages, VHIGH and VLOW, with two power switches; the voltages are automatically switched during DVFS phases. The Network Interface (NI) is in charge of communications with respect to the NoC protocol.


Fig. 1. Low Power GALS NoC overall Architecture

The Local Power Manager (LPM) implements the proposed DPM and on-line optimization techniques. The LPM is activated by the NI in a data-flow manner according to NoC traffic and HW tasks. The NoC architecture targets data flow applications, where task control and complex data flows are handled by the NI. For each executed task, the NI loads a configuration for the IP core and associated input/output data flows, and then computation starts.

2.1 IP Unit Integration for Power Optimization

Each synchronous IP unit is defined as an independent power domain (using its dedicated local voltage V) and an independent frequency domain (using its dedicated local clock frequency F). Each IP unit can be set in one of the 4 power supply modes:

• HIGH mode, local supply voltage V is VHIGH and core clock is on. This is the “nominal” high performance working mode.

• LOW mode, core clock is on, but supply is switched to VLOW. Frequency is lower than nominal, energy per cycle decreases. This is “low power” mode.

• IDLE mode, core clock is off and leakage power is further reduced thanks to VLOW supply voltage. This is the “low-power dormant” mode.

• OFF mode, the unit is switched off when not used in the application, to further reduce the leakage power.

For each unit, all power modes can be programmed through the Network Interface and the Local Power Manager, except the OFF mode which is programmed through top level signals (main CPU).
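The four power supply modes can be summarised as a small lookup table. The Python sketch below is only an illustration of the list above; the class, field and function names (`PowerMode`, `supply`, `clock_on`, `programmed_by`, `locally_programmable`) are hypothetical and are not taken from the actual design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerMode:
    """One power supply mode of an IP unit (illustrative names only)."""
    supply: str         # "VHIGH", "VLOW" or "OFF" (unit switched off)
    clock_on: bool      # whether the core clock is running
    programmed_by: str  # which path is allowed to select this mode


# The four modes as described in the text.
POWER_MODES = {
    "HIGH": PowerMode("VHIGH", True, "NI/LPM"),
    "LOW": PowerMode("VLOW", True, "NI/LPM"),
    "IDLE": PowerMode("VLOW", False, "NI/LPM"),
    "OFF": PowerMode("OFF", False, "top-level signals"),
}


def locally_programmable(mode: str) -> bool:
    """True if the mode can be set through the NI and the LPM."""
    return POWER_MODES[mode].programmed_by == "NI/LPM"
```

Under this encoding, only the OFF mode is outside the reach of the local controller, which matches the statement that OFF is driven by top-level signals.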

2.2 Local DVFS Using Two Voltage Set Points

In order to perform efficient local Dynamic Voltage Scaling (DVS), the main objective is to avoid low-level software control as much as possible, so as to ensure minimal latency cost. Within the Power Supply Unit, a hardware controller called Vdd-Hopping automatically switches between VHIGH and VLOW (Figure 2).

[Timing diagram: LPM PWM control signal (0/1), frequency (Fhigh/Flow), clock (Clk), and voltage (Vhigh/Vlow).]

Fig. 2. LPM control, Vdd-Hopping sequence example

During smooth DVFS transitions, the synchronous IP can continue its own computations or communications. To obtain an average value between VHIGH and VLOW, the LPM controls the target performance by switching between these two values. The power efficiency of the proposed Vdd-Hopping [2] is more than 95%. At a given voltage, VHIGH or VLOW, there are no losses except those in a standard power switch; energy is otherwise lost only during the transitions (less than 100 ns). There is no latency cost and no need for real-time software: fast and robust transitions are ensured by hardware. The Vdd-Hopping mechanism has been implemented and validated in a 65 nm test chip [13], which demonstrates high reliability. In order to minimize energy per operation, the IP unit should run at the maximum achievable frequencies fh and fl. The LPM objective is then to spend more time at VLOW to decrease energy, while respecting timing constraints. The proposed hybrid local and global DVFS principle and associated LPM schemes are introduced in the next section.

3 Local DVFS Control

On data-flow architectures with a latency constraint on the whole chain, global management is required to ensure the deadline. Because of dynamic variations of the computation on each core, latency cannot be guaranteed by centralized or software control alone, since it would not respond fast enough to handle all the dynamic variations. We choose a static management based on Worst Case Execution Cycles (WCEC) to select a set point for each task. A heuristic-based algorithm, as in [7], can be used. To benefit from the dynamic slack time induced when the actual number of cycles to complete a task is less than the WCEC, a local control is implemented. Such a hybrid (local and global) approach has also been adopted in [11].

Based on the worst case, a global power manager (such as the host processor) dispatches the available latency among tasks. Hence, each core is given a timeslot to complete its task. For each IP core, its Local Power Manager (LPM) controls the Vdd-Hopping by spreading the computation over the given timeslot. The LPM is activated by the NI in a data-flow manner according to NoC traffic and HW tasks. Two control schemes are proposed and presented below, with NI task or IP core task synchronization. Note that the NoC bandwidth must be sufficient to tolerate uncorrelated IP frequency variations and to smooth application traffic; this hypothesis holds for the addressed application and the corresponding NoC (see Section 5).

3.1 NI Task Synchronization

The first proposed solution interacts with the NoC platform programming model to control the power modes in a generic way. As soon as a new task is loaded in the NI, the Vdd-Hopping transitions can start. The LPM control of the IP is thus activated in a data-flow manner according to the NoC incoming traffic and task. Given the WCEC $N_{wcec}$, the number of cycles to spend at high voltage $N_h$ and at low voltage $N_l$ can be derived from the given timeslot $\tau$ for the task. Let $f_h$ and $f_l$ be the maximum available frequencies at the high and low set points, respectively. We have:

$$\tau = \frac{N_h}{f_h} + \frac{N_l}{f_l} = \frac{N_h}{f_h} + \frac{N_{wcec} - N_h}{f_l} \tag{1}$$

For the task computation, the number of cycles at high and low level is given by:

$$N_h = \frac{f_h}{f_h - f_l}\left(N_{wcec} - \tau f_l\right) \quad \text{and} \quad N_l = N_{wcec} - N_h \tag{2}$$

The timeslot is equivalent to a mean frequency: $f_{target} = N_{wcec} / \tau$.
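As an illustration of equations (1) and (2), the following Python sketch computes the cycle split for one task. It is a minimal sketch, not the authors' implementation: the function names, the feasibility check and the clamping to zero are assumptions added here, and the numeric values in the usage lines are purely illustrative.

```python
def split_cycles(n_wcec, timeslot_s, f_high_hz, f_low_hz):
    """Return (N_h, N_l) for a task, following equation (2).

    n_wcec     : worst-case execution cycles of the task
    timeslot_s : timeslot tau granted to the task, in seconds
    f_high_hz  : maximum frequency at the high set point
    f_low_hz   : maximum frequency at the low set point
    """
    if not 0 < f_low_hz < f_high_hz:
        raise ValueError("expected 0 < f_low_hz < f_high_hz")
    if timeslot_s * f_high_hz < n_wcec:
        # Assumption of this sketch: reject timeslots that cannot be met
        # even when every cycle runs at the high set point.
        raise ValueError("timeslot too short for the WCEC")
    n_high = f_high_hz / (f_high_hz - f_low_hz) * (n_wcec - timeslot_s * f_low_hz)
    # A negative value means the whole task fits at the low set point
    # (clamping is an assumption of this sketch).
    n_high = max(n_high, 0.0)
    return n_high, n_wcec - n_high


def timeslot_of(n_high, n_low, f_high_hz, f_low_hz):
    """Equation (1): time needed to run the given cycle split."""
    return n_high / f_high_hz + n_low / f_low_hz


# Illustrative values only: 1000 worst-case cycles, 1.5 us timeslot,
# 1 GHz at the high set point and 500 MHz at the low set point.
n_h, n_l = split_cycles(1000, 1.5e-6, 1e9, 500e6)
# Roughly 500 cycles end up at each level, and plugging the split back
# into equation (1) recovers the 1.5 us timeslot.
```

With these example numbers the mean frequency is $N_{wcec}/\tau \approx 667$ MHz, which lies between the two set points as expected.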

[Timing diagram: “Task 1 loaded” signal, “Core Active” signal, and supply voltage (Vhigh/Vlow) versus time t.]

Fig. 3. NI task synchronization

The LPM switches periodically from high to low while the task is loaded in the NI, so that the target frequency is reached when the core is actually computing (Figure 3). If the hopping frequency is increased while keeping the ratio of Nh to Nl, the mean frequency is not modified and NoC traffic is smoothed. Since extra energy is consumed during transitions, a tradeoff is required between the number of transitions, NoC traffic regularity and energy. In addition, if the targeted frequency is lower than the fastest frequency at Vlow, the frequency is decreased (this is DFS at Vlow). Finally, as seen in Figure 3, task loading in the NI may not match the actual computation phase, because the IP core may wait for additional data before starting. In that case, extra energy could be saved thanks to a tighter control.
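The claim that the mean frequency does not depend on the hopping frequency, as long as the ratio between Nh and Nl is kept, can be checked with a short Python sketch. The helper names (`hop_schedule`, `mean_frequency`) are hypothetical, and the sketch assumes ideal transitions: the transition time and energy mentioned above are ignored.

```python
def hop_schedule(n_high, n_low, hops):
    """Split the cycle budgets into `hops` HIGH/LOW periods.

    Returns a list of (level, cycles) pairs. Remainders are spread over
    the first periods so that the total budgets are preserved exactly.
    """
    if hops < 1:
        raise ValueError("hops must be at least 1")
    schedule = []
    for i in range(hops):
        high = n_high // hops + (1 if i < n_high % hops else 0)
        low = n_low // hops + (1 if i < n_low % hops else 0)
        schedule.append(("HIGH", high))
        schedule.append(("LOW", low))
    return schedule


def mean_frequency(schedule, f_high_hz, f_low_hz):
    """Total cycles divided by total time (transition time ignored)."""
    cycles = sum(n for _, n in schedule)
    duration = sum(
        n / (f_high_hz if level == "HIGH" else f_low_hz)
        for level, n in schedule
    )
    return cycles / duration


# Illustrative values: the same 500/500 cycle budget with 1 or 5 periods.
coarse = hop_schedule(500, 500, 1)
fine = hop_schedule(500, 500, 5)
# Both schedules give the same mean frequency; only the number of
# transitions (and hence the transition energy) differs.
```

In a real controller the number of periods would be bounded by the transition energy, which this sketch leaves out.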

3.2 Core Task Synchronization

Better control is obtained if the LPM is synchronized with actual IP core computation. In this case, a dedicated signal must be generated by the IP core to indicate its own activity/inactivity. The numbers of cycles Nh and Nl are still calculated as described in Section 3.1. Instead of following NI task activity, the LPM performs the Vdd-Hopping transitions according to IP core task activity. An atomic task is defined when the number of cycles and the number of input/output data are known. In order to balance the frequency of hops, the LPM is able to perform switching over several atomic tasks or within a single task. In case the actual number of cycles of the atomic task is less than the worst case, it is possible to start the computation at low level [17]. In Figure 4, the NI task consists of five atomic tasks, with only one low-to-high transition within each task. The unit gets back to low level as soon as the task is completed; most of the computation is spent at low level.

[Timing diagram: “Task 1 Loaded” signal, “Atomic Task” signal, and supply voltage (Vhigh/Vlow) versus time t.]

Fig. 4. Core task synchronization

4 Local On-Line Optimization

The Actual number of Execution Cycles (AEC) needed by a task may be less than the WCEC. The computation time may depend on data, the communication time is variable, and the architecture can have unpredictable events such as cache misses, leading to dynamic slack time. The LPM can exploit this dynamic slack time by reducing the speed of the unit. Even though it is possible to predict the number of cycles for the next task from the execution history, this approach may not meet the timing constraints: a prediction mistake will induce a timing violation. We instead assume that the current task still runs at WCEC and benefit from the dynamic slack time of the previous task. The cycle budgets at high and low levels are updated according to the remaining cycles at high and low levels.

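[Figure: two voltage chronograms (Vhigh/Vlow versus time t) for tasks k-1, k and k+1. The first shows the cycle budgets Nl and Nh; the second shows the updated budgets N'l and N'h together with wait and compute phases. Timeslots are labelled T, T' and T.]

Fig. 5. Local on-line optimization principle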

Figure 5 presents the on-line optimization principle. The first chronogram shows the LPM control without on-line optimization. The second uses the on-line optimization. The first task k-1 runs at WCEC while the following tasks do not use as many cycles. In this case, the third task is slowed down while respecting the deadline. When a task k is over and cycles are remaining, respectively $n_h$ at high level and $n_l$ at low level, the unit switches to low level and keeps on counting the number of elapsed cycles. Hence, when the next task starts, the remaining cycles $n_h$ and $n_l$ reflect the dynamic slack time. The timeslot for the following task is extended to:

$$\tau + \tau' = \tau + \frac{n_h}{f_h} + \frac{n_l}{f_l} \tag{3}$$

The updated number of cycles $N_h'$ is then given by:

$$N_h' = \frac{f_h}{f_h - f_l}\left(N_{wcec} - (\tau + \tau') f_l\right) = N_h - \frac{f_h}{f_h - f_l}\, n_l - \frac{f_l}{f_h - f_l}\, n_h \tag{4}$$

Thus, before the next task (k+1) begins, we compute its parameters with the extended time:

$$\begin{cases} N_h' = N_h - \dfrac{f_h}{f_h - f_l}\left(n_l + \dfrac{f_l}{f_h}\, n_h\right) \\ N_l' = N_{wcec} - N_h' \end{cases} \tag{5}$$

The above equations provide the main principles of the on-line optimization algorithm. In order to implement such control efficiently in the hardware LPM controller, some simplifications are required. The LPM mainly requires two counters to keep track of the number of elapsed cycles at high and low voltage. In order to have simple hardware, the computations of both ratios $f_h / (f_h - f_l)$ and $f_l / f_h$ must be either done in software or simplified to be done in hardware. The new budget $N_h'$ should not be overestimated; otherwise the deadline might be violated. If those ratios are underestimated, then the efficiency is reduced. Assuming $f_h = 2 f_l$, we obtain the following simplified equations for the updated cycle budgets at VHIGH and VLOW with regard to the dynamic slack time of the previous task:

$$\begin{cases} N_h' = N_h - 2\left(n_l + 0.5\, n_h\right) \\ N_l' = N_{wcec} - N_h' \end{cases} \tag{6}$$

The LPM controller is then programmed with the two input parameters: the timeslot $\tau$ and the target $N_{wcec}$. It implements two counters, and it can be implemented as a simple state machine to control any of the AEC mode, the NI mode or the CORE mode. The LPM controller has been fully modeled in SystemC. From the algorithmic complexity and the number of registers, the LPM is estimated to be less than 2 Kgates. The area cost of the PSU including the Vdd-Hopping is 3% of the core area for a 200 Kgates IP core.
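The budget update of equations (5) and (6) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the LPM implementation: the function names are hypothetical, the clamping of the high budget to zero is an addition of this sketch (a larger slack would presumably be handled by running entirely at the low set point), and the numeric values are made up for the example.

```python
def updated_budget(n_high, n_wcec, rem_high, rem_low, f_high_hz, f_low_hz):
    """Equation (5): new (N_h', N_l') from the slack of the previous task.

    rem_high, rem_low : cycles n_h and n_l left unused by the previous task
    """
    ratio = f_high_hz / (f_high_hz - f_low_hz)
    new_high = n_high - ratio * (rem_low + (f_low_hz / f_high_hz) * rem_high)
    new_high = max(new_high, 0.0)  # assumption: never go below zero
    return new_high, n_wcec - new_high


def updated_budget_simplified(n_high, n_wcec, rem_high, rem_low):
    """Equation (6), valid only when f_h = 2 * f_l.

    2 * (n_l + 0.5 * n_h) is written as 2 * n_l + n_h so that only
    integer arithmetic is needed.
    """
    new_high = max(n_high - (2 * rem_low + rem_high), 0)
    return new_high, n_wcec - new_high


# Illustrative values: budget of 500 high cycles out of 1000, and a
# previous task that left 100 high cycles and 50 low cycles unused.
general = updated_budget(500, 1000, 100, 50, 1e9, 500e6)
simple = updated_budget_simplified(500, 1000, 100, 50)
# Both forms move 200 cycles from the high budget to the low budget.
```

For these example values the extended timeslot of equation (3) is 1.7 µs, and running 300 cycles at 1 GHz plus 700 cycles at 500 MHz takes exactly that long, so the deadline is still met.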

5 Case Study on a 3GPP LTE Telecom Application

The targeted application and circuit [14] are based on the 3GPP LTE telecommunication protocol; we focus on the baseband demodulation of the downstream. Once the application is mapped onto the NoC architecture, it is divided into several sequential phases; a whole frame consists of 14 OFDM symbols. There are three main phases, separated by memory buffering. The IP core tasks are periodic and sequenced in a data-flow manner (Figure 6).

[Figure: NoC nodes numbered 00–04, 10–14 and 20–24, with NoC interfaces. Unit labels: CFO / Chan. estim., MC 8051, ARM, OFDM demod., MIMO decoding, Turbo-decoding, Deinterleav. / Demod., SME. Instance names: mep_10, mc8051_12, sme_10w, trx_ofdm_20, sme_21, mep_22, mep_23, asip_24, trx_ofdm_20s, mep_21s, sme_22s, rx_bit_23s.]

Fig. 6. Task mapping on the Low Power GALS NoC architecture

The GALS NoC architecture is built with dedicated hardware engines, such as TurboCode, RX/TX bit engines, OFDM modulation/demodulation, MEP engines (advanced configurable VLIW-like cores) and finally some SMEs (Smart Memory Engines) used to handle memory buffers. Each IP core is encapsulated with a PSU, an LPM and an LCG (Local Clock Generator) providing 16 frequencies in the [400 MHz, 1 GHz] range, with additional scaling factors.

5.1 Simulation Platform and Applicative Scenarios

The simulation platform used to qualify the energy savings is based on an existing timed SystemC/TLM platform. Power consumption has been included in the simulation platform, along with DVFS modeling [15]. The SystemC model takes into account leakage current, dynamic power, consumption during inactivity phases, and the variation of energy per operation due to Vdd-Hopping. For each IP block, power consumption values have been extracted from post-Place&Route gate-level simulation using the PrimePower® tool. As a result, fast power estimation and exploration at high level can be performed on a real application. The tool provides power profile traces (in vcd) and power statistics (per core, per mode, …). For the targeted 3GPP-LTE application, the global constraints (timeslot and NWCEC) have been derived manually for each IP, to enforce throughput and latency constraints. For all proposed LPM scenarios (Table 1), except the first two, IDLE mode is used as soon as the end of a task is reached. All scenarios respect the application timing constraints, except the first one at Low level, which is given as a reference.


Table 1. Power Mode Scenarios

Scenario    Description
Low         LOW mode at maximal achievable flow
High        HIGH mode at maximal achievable fhigh
On/Off      HIGH mode at fh max, and IDLE when tasks complete
DFS         HIGH mode using only Dynamic Frequency Scaling
DVFS NI     DVFS synchronized with NI
DVFS Core   DVFS synchronized with CORE
DVFS AEC    DVFS synchronized with CORE, plus on-line optimization using Actual Execution Cycle

5.2 Obtained Energy Savings

For each LPM scenario, power profiling has been done; the achieved energy savings are presented per IP (Figure 7).

[Chart: energy (mJ, axis from 0.0 to 3.0) per IP core (asip_24, rx_bit_23s, mep_23, mep_22, mep_21s, mep_10, trx_ofdm_20s, trx_ofdm_20) for the scenarios Low, High, On/Off, DFS, DVFS NI, DVFS Core and DVFS AEC.]

Fig. 7. Energy consumption per IP Core

The On/Off scenario exhibits substantial energy savings, thanks to the efficiency of IDLE mode (we recall that IDLE is done with IP clock gating at Vlow). When only using DFS, there is almost no gain since the computation is only spread over time, reducing peak power but not energy. When using DVFS, we observe that energy savings clearly depend on the core profile. For under-constrained cores (trx-ofdm cores) with low target frequency, DVFS enables high energy savings compared to the On/Off scenario. Synchronization with task loading is relevant as these units do not spend time waiting for data. These units have a steady number of computation cycles, so on-line optimization brings no benefit. Synchronization of DVFS with core computation brings benefits when the IP cores wait a long time for incoming data (mep_10, mep_21s). For more constrained cores (mep_22, mep_23) with high target frequency, local optimization is relevant when they require fewer cycles than the predicted WCEC to complete their task. For tasks with target frequency close to fh, up to 30% energy savings have been achieved compared to simple core synchronization. We observe 45% extra energy savings with DVFS AEC compared to the On/Off scenario.
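The observation that DFS reduces peak power but not energy, while DVFS reduces both, follows from the usual first-order model in which the dynamic energy per cycle is proportional to the square of Vdd. The Python sketch below illustrates this under loud assumptions: leakage and transition losses are ignored, and the effective capacitance, voltages and frequencies are hypothetical values chosen for the example, not figures from the paper.

```python
def task_energy(n_high, n_low, c_eff_farad, v_high, v_low):
    """Dynamic energy in joules, assuming E = C_eff * Vdd^2 per cycle."""
    return c_eff_farad * (n_high * v_high ** 2 + n_low * v_low ** 2)


def task_time(n_high, n_low, f_high_hz, f_low_hz):
    """Execution time in seconds for the given cycle split."""
    return n_high / f_high_hz + n_low / f_low_hz


# Hypothetical parameters, for illustration only.
C_EFF, V_HIGH, V_LOW = 1e-9, 1.2, 0.8
F_HIGH, F_LOW = 1e9, 500e6
N = 1000

# Full speed: every cycle at the high set point.
e_full = task_energy(N, 0, C_EFF, V_HIGH, V_LOW)
p_full = e_full / task_time(N, 0, F_HIGH, F_LOW)

# DFS: same voltage, clock slowed to two thirds of F_HIGH. The energy
# does not depend on the clock, so only the average power drops.
e_dfs = task_energy(N, 0, C_EFF, V_HIGH, V_LOW)
p_dfs = e_dfs / (N / (F_HIGH * 2 / 3))

# DVFS by hopping: half of the cycles at each set point.
e_dvfs = task_energy(N // 2, N // 2, C_EFF, V_HIGH, V_LOW)
```

In this simplified model the DFS energy equals the full-speed energy, whereas the hopping schedule spends part of the cycles at the lower voltage and therefore consumes less.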

[Chart: energy (mJ, axis from 0.0 to 5.0) split into NoC, total IP and total SME for the scenarios Low, High, On/Off, DFS, DVFS NI, DVFS Core and DVFS AEC.]

Fig. 8. Energy savings for NoC, SME and IP Cores

Figure 8 gives the power consumption for the whole SoC, considering HW IPs, SME IPs and the NoC. The NoC power consumption represents only 5% of the total power consumption, and is roughly equivalent for each scenario. The advanced IDLE mode from the On/Off scenario brings 35% power reduction on the whole chip. As a global result, the power reductions obtained on the IP cores (Figure 7) are mitigated by inefficient power reduction on the Smart Memory Engines. Because SMEs do not actually perform computation but must run fast enough to handle data traffic, a power control based on traffic arrival, as in [10], should be efficient. Finally, the total chip budget is reduced from 340 mW at full speed (High mode) to 160 mW using the DVFS scheme with on-line optimization.

6 Conclusions

In this paper, we presented a new Local Power Manager unit to reduce energy in a data-flow heterogeneous architecture by using the Vdd-Hopping technique. Vdd-Hopping is an efficient DVFS technique with only two set points and zero overhead, which can be easily integrated for per-core DVFS. In the proposed LPM, we use a hybrid local and global scheme to enforce timing constraints, an LPM synchronization scheme with core computation to benefit from all inactivity phases, and an on-line optimization technique to distribute dynamic slack time. Energy savings have been qualified on a real application, using a SystemC platform instrumented with power estimates. Results show that the advanced idle mode achieves significant energy savings (35%). As expected, DFS achieves little energy saving. DVFS reduces energy by 45% compared to IDLE mode. Finally, when the number of cycles per task varies, 30% additional energy savings are achieved by local on-line optimization. Future work will address the design of an efficient DVFS control for SMEs, the RTL design of the LPM, as well as automatic HW task profiling.

References

1. Bhunia, S., Datta, A., Banerjee, N., Roy, K.: GAARP: A Power-Aware GALS Architecture for Real-Time Algorithm-Specific Tasks. IEEE Transactions on Computer, Special Issue on low-Power Design (99), 752–766 (June 2005)
2. Sylvain, M., Vivet, P., Renaudin, M.: A Power Supply Selector for Energy- and Area-Efficient Local Dynamic Voltage Scaling. In: Azémard, N., Svensson, L. (eds.) PATMOS 2007. LNCS, vol. 4644, pp. 556–565. Springer, Heidelberg (2007)
3. Truonga, D., et al.: A 167-processor 65 nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling. In: Proc. Symposium on VLSI Circuits (June 2008)
4. Mishra, R., Rastogi, N., Zhu, D., Mosse, D., Melhem, R.: Energy aware scheduling for distributed real-time systems. In: Proc. of Parallel and Distributed Processing Symposium (April 2003)
5. Watanabe, R., Kondo, M., Imai, M., Nakamura, H., Nanya, T.: Task Scheduling under Performance Constraints for Reducing the Energy Consumption of the GALS Multi-Processor SoC Design. In: DATE 2007 (2007)
6. Xian, C., Lu, Y., Li, Z.: Energy-Aware Scheduling for Real-Time Multiprocessor Systems with Uncertain Task Execution Time. In: DAC 2007, pp. 664–669 (2007)
7. Grosse, P., Durand, Y., Feautrier, P.: Methods for Power Optimization in SoC-based Data Flow Systems. ACM Transactions On Design Automation of Electronic Systems (TODAES 2009) 14(3), Article No. 38 (2009)
8. Niyogi, K., Marculescu, D.: Speed and voltage selection for GALS systems based on voltage/frequency islands. In: Proceedings of ASP-DAC 2005 (2005)
9. Puschini, D., Clermidy, F., Benoit, P., Sassatelli, G., Torres, L.: Temperature-Aware Distributed Run-Time Optimization on MP-SoC Using Game Theory. In: Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2008, pp. 375–380 (2008)
10. Alimonda, A., Acquaviva, A., Carta, S., Pisano, A.: A Control Theoretic Approach to Run-Time Energy Optimization of Pipelined Processing in MPSoCs Design. In: Proceedings of Design Automation and Test in Europe, DATE 2006 (2006)
11. Maxiaguine, A., Chakraborty, S., Thiele, L.: DVS for buffer-constrained architectures with predictable QoS-energy tradeoffs. In: 3rd International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS 2005, pp. 111–116 (2005)
12. Beigné, E., Clermidy, F., Miermont, S., Vivet, P.: Dynamic Voltage and Frequency Scaling Architecture for Units Integration within a GALS NoC. In: Proceedings of NOCS 2008 (2008)
13. Beigné, E., et al.: An Asynchronous Power Aware and Adaptive NoC based Circuit. IEEE Journal Of Solid State Circuits 44, 1167–1177 (2009)
14. Clermidy, F., et al.: A 477mW NoC-Based Digital Baseband for MIMO 4G SDR. In: Proceedings of IEEE International Solid-State Circuits Conference, ISSCC 2010 (2010)
15. Lebreton, H., Vivet, P.: Power Modeling in SystemC at Transaction Level, Application to a DVFS Architecture. In: Proc. of Int. Symposium on VLSI, ISVLSI 2008, pp. 463–466 (2008)
16. Soongsoo, L., Sakurai, T.: Run-time Voltage Hopping for Low-Power Real-time Systems. In: Proc. of 37th Design Automation Conference, DAC 2000, pp. 806–809 (June 2000)
17. Yan, Z., Zhijian, L., Lach, J., Skadron, K., Stan, M.R.: Optimal procrastinating voltage scheduling for hard real-time systems. In: DAC 2005, pp. 905–909 (June 2005)

Self-Timed SRAM for Energy Harvesting Systems

Abdullah Baz, Delong Shang, Fei Xia, and Alex Yakovlev

Microelectronic System Design Group, School of EECE, Newcastle University Newcastle upon Tyne, NE1 7RU, England, United Kingdom {Abdullah.baz,delong.shang,fei.xia,alex.yakovlev}@ncl.ac.uk

Abstract. Portable digital systems tend to be not just low power but power efficient as they are powered by low batteries or energy harvesters. Energy harvesting systems tend to provide nondeterministic, rather than stable, power over time. Existing memory systems use delay elements to cope with the problems under different Vdds. However, this introduces huge penalties on performance, as the delay elements need to follow the worst case timing assumption under the worst environment. In this paper, the latency mismatch between memory cells and the corresponding controller using typical delay elements is investigated and found to be highly variable for different Vdd values. A Speed Independent (SI) SRAM memory is then developed which can help avoid such mismatch problems. It can also be used to replace typical delay lines for use in bundled-data memory banks. A 1Kb SI memory bank is implemented based on this method and analysed in terms of the latency and power consumption.

1 Introduction

With the wide advancement in such remote and mobile fields as wireless sensor based applications, microelectronic system design is becoming more energy conscious. This is mainly because of limited energy supply (scavenged energy or low battery) and excessive heat with associated thermal stress and device wear-out. At the same time, the high density of devices per die and the ability to operate with a high degree of parallelism, coupled with environmental variations, create almost permanent instability in voltage supply (cf. Vdd droop), making systems highly power variant. In the not so distant past, low power design was targeted merely at the reduction of capacitance, Vdd and switching activity, whilst maintaining the required system performance. In many current applications, the design objectives are changing to maximizing the performance within the dynamic power constraints from energy supply and consumption regimes. Such systems can no longer be simply regarded as low power systems, but rather as power adaptive or power resilient systems.

Normally, this kind of system has the following properties: 1) power efficient, not just low power; 2) non-deterministic supply voltage (probably with a known range, which tends to be low), variable over time. Recently, a possible solution was proposed for this kind of system. It is a power elastic system which takes power and energy as dynamic resources [13]. For example, when power is not enough, some of the subsystems could either be powered off or be executed under lower power supplies (Vdds). When power is enough, systems can provide high performance. This means that all tasks in a system are managed based on the power resources, performance requirements, and thermal constraints.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 105–115, 2011. © Springer-Verlag Berlin Heidelberg 2011

When systems are subjected to varying environmental conditions, with voltage and thermal fluctuations, timing tends to be the first issue affected. Most systems are still designed with global clocking and the design is often made overly pessimistic to avoid failures due to Vdd (timing) variations.

Along with the advent of the nanometre CMOS technology, the continuation of the scaling process is vital to the future development of the digital industries. The International Technology Roadmap for Semiconductors (ITRS) [1] predicts poorer scaling for wires than transistors in future technology nodes. This makes the above worst timing assumption even worse along with power supply voltage drooping [17].

Asynchronous techniques may provide solutions to all these problems. Unlike synchronous systems, asynchronous designs can completely remove global clocking. As a result, asynchronous designs may be more tolerant to timing variations.

The ITRS also predicts that asynchrony will increase with the complexity of on-chip systems. The power, design effort, and reliability cost of global clocks will also make increased asynchrony more attractive. Increasingly complex asynchronous systems or subsystems will thus become more prevalent in future VLSI systems.

In order to fully realize the potential of asynchrony in an environment of variable supply voltage and latencies, system memories may need to be asynchronous together with the computation parts. In this paper, we concentrate on asynchronous SRAM. Our main contributions include: analysing the behaviour of latency in SRAM memory systems under different Vdds, developing asynchronous SRAM memory, and proposing a new method to build delay elements for bundled SRAM memory. We develop a fully Speed Independent (SI) [16] SRAM cell and a bundled SRAM bank technology by using such SI SRAM cells as delay elements.

The remainder of the paper is organized as follows. Section 2 introduces existing asynchronous SRAM memory structures. Section 3 analyses the effects of different Vdds on the latency of the SRAM memory and its controller. Section 4 gives our asynchronous SRAM solutions and implementations, and proposes a new method to build SI delay elements for SRAM memory. Section 5 demonstrates a memory bank and the measurements in terms of latency and power consumption. Section 6 gives the conclusions and the future work.

2 Existing Asynchronous SRAM Memory

Several asynchronous SRAM methods have been reported [5,6,7,8,9]. [5] mainly develops a methodology for designing and verifying low power asynchronous SRAM. An SI SRAM cell was alluded to in [5]. This memory cell is different from the conventional six transistor cell [15] and provides the possibility of checking that the data has been stored in memory. The paper, however, does not explain how the cell needs to be controlled, nor does it include a controller design.

[6,7,8,9] focus on asynchronous SRAM memory designs. [6] presents a four-phase handshake asynchronous SRAM design for self-timed systems. It proposes an SI circuit to realize completion detection of reading operations. However, the paper claims that completion detection is not suitable for writing operations. Because the critical circuit is the memory cell, it is said to be impractical to add a monitoring sensor to each memory cell to generate completion detection signals. Instead the paper proposes a delay based solution, which uses several delay lines for different delay regions to take variation into account. [8] presents an asynchronous SRAM with an SI implementation of reading. Writing relies on the relative timing assumption that the control path takes more transitions than the data path. This is implemented with circuits which behave similarly to classical delay elements such as chains of inverters.

The other works [7,9] abandon SI altogether and adopt bundled data methods based on delays. Noting that the delay of the inverter chains commonly used in conventional SRAM to generate the required timings for the precharge and data access phases hardly matches the timing variations of the bit line activities across a wide range of supply voltages [11,12], the authors of [9] used a duplicated column of memory cells to replace inverter chains as delay elements.
Although in theory this offers potentially correct delay matching for memory under variable Vdd, so long as process variation [3] is kept under control, the method requires voltage references for precharge and sensing data. The voltage reference is assumed to be adjustable to accommodate the process, voltage, and temperature conditions.

In summary, most existing solutions work under worst case timing assumptions, and some of them also require adjustable and known reference voltages. However, in the energy harvesting environment, there may not be any stable reference voltage in a system at all, so anything based on comparators will not work. All voltages in the system may be non-deterministic. All delays may therefore be non-deterministic.

3 Latency Investigation on SRAM Cells under Different Vdds

SRAM memory is constructed from SRAM cells, address decoders, precharge drivers, write drivers, read drivers, and a controller. Although there exist different structures of SRAM cells, here we focus only on the simplest 6T cell [15], which offers the best prospect for use in energy harvesting systems.

Normally memory works based on timing assumptions. However, energy harvesting systems work under a wide range of non-deterministic power. It is necessary to know how timing assumptions are affected under different Vdds. Here we investigate the difference between the latency of the SRAM, including the bit line driver, and that of its corresponding controller, typically implemented with inverter-chain delay elements, under different Vdds.

This potential mismatch has already been pointed out in [11,12]. [11] concludes that the latency of inverter chains gets progressively worse as Vdd is reduced. [12] concludes that the bit line drive time takes a significantly growing share of the total access time as Vdd is reduced. But do both types of delay increase at the same rate under the same Vdd reduction?

To emphasize the mismatch, we directly show the difference between the reading/writing times of memory and the latency of delay elements under various Vdds in the right hand side of Figure 1. The experiment bundles an SRAM with one cell and an inverter chain, with both operating under the same variable Vdd as shown in the left hand side of Figure 1. A start signal triggers the reading/writing operation of the cell. This start signal is also connected to the inverter chain as its input signal. We measure the number of inverters the start signal has passed through when the reading/writing operation finishes.

Fig. 1. Investigation on delay elements in various Vdd: Block diagram (left) and Results (right)

In reading, under the lowest Vdd the memory is about 3 times slower than under the normal Vdd in terms of the number of inverters. In writing, under the lowest Vdd the memory is about 2 times slower than under the normal Vdd in terms of the number of inverters. Interestingly, this mismatch is quite small when Vdd is above 700mV, which coincidentally was the lowest voltage investigated in some of the previous work (e.g. [8]). In other words, both reading from and writing to memory become slower at a much higher rate than inverter chains when Vdd is reduced below 700mV, and inverter-chain delays do not track memory operation delays when both are under the same variable Vdd. This demonstrates that using standard inverter chains for memory delay bundling would require precise design-time delay characterization and conservative worst-case provisions which could be 2-3 times more wasteful in some cases. Other conventional methods such as schedulable or programmable delay chains will not be useful without knowledge of the Vdd in real time, which we do not assume.
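To make the cost of this mismatch concrete, the following minimal sketch estimates how much of a worst-case-sized inverter chain is idle margin at the normal Vdd. The slowdown ratios 3 (reading) and 2 (writing) are the approximate figures quoted above; the nominal chain length of 10 inverters and the function names are hypothetical, chosen only for illustration.

```python
import math

# Approximate slowdown of the memory, counted in inverter delays, between the
# normal and the lowest Vdd (figures quoted in the text above).
SLOWDOWN = {"read": 3.0, "write": 2.0}


def worst_case_chain(nominal_inverters: int, slowdown: float) -> int:
    """Chain length needed to cover the memory delay at the lowest Vdd."""
    return math.ceil(nominal_inverters * slowdown)


def wasted_fraction(nominal_inverters: int, slowdown: float) -> float:
    """Share of the worst-case chain delay that is idle margin at normal Vdd."""
    chain = worst_case_chain(nominal_inverters, slowdown)
    return (chain - nominal_inverters) / chain


if __name__ == "__main__":
    NOMINAL = 10  # hypothetical chain length matching the cell at normal Vdd
    for op, ratio in SLOWDOWN.items():
        chain = worst_case_chain(NOMINAL, ratio)
        margin = wasted_fraction(NOMINAL, ratio)
        print(f"{op}: {chain} inverters, {margin:.0%} idle margin at normal Vdd")
```

Under these assumptions, a chain sized for the lowest Vdd spends roughly half (writing) to two thirds (reading) of its delay as margin whenever the supply is at its normal level.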

4 Asynchronous SRAM Solutions

The characteristics of energy harvesting systems lead to non-deterministic Vdd and delays across the entire system. To deal with this it is possible to employ asynchrony in the form of memory bundling or completion detection. For bundling, the above discussion has established that normal delay elements built using inverter chains are unsuitable for memory. A natural extension that uses dummy SRAM cells as delay elements exists [9], but the method makes too many assumptions and has requirements, such as known and variable reference voltages, which may not be possible to meet in energy harvesting systems.

Fig. 2. Intuitive SI SRAM cell (a), write driver (b), and standard 6T cell (c)

In this section, two fully Speed Independent (SI) SRAM solutions are proposed. SI circuits are insensitive to gate delays, but wire delays are assumed to be zero or negligible. This is generally not a problem for circuits of small size such as an individual 6T SRAM cell. However, fully SI solutions for memory banks can be expensive in terms of power and circuit size, and can also reduce performance [16]. A new method in which an asynchronous SRAM memory is bundled with SI SRAM serving as delay elements is proposed as an alternative.

4.1 Intuitive Speed Independent SRAM

As discussed in [6], reading completion detection can be built by monitoring the bit lines. For a 6T cell (Figure 2 (c)), in reading, the precharge pulls the two bit lines high. Then the reading sets WL high to open the two pass transistors. After that, one bit line is discharged to low. This means that the data is ready for reading. The writing operation, however, writes each bit of data to its corresponding cell, and it is impractical to monitor all cells. Instead, we still monitor the bit lines.

Figure 2 (a) shows a straightforward SI SRAM cell which is based on the normal 6T cell. It duplicates the bit lines and uses six extra transistors to control the two discharge channels. Reading completion can be checked in the same way as for the normal 6T cell. To check writing completion, the writing operation is arranged as: 1) precharging the four bit lines to high; 2) enabling the writing data on BL and BLb; 3) setting WL high to write the data into the cell; 4) monitoring CD and CDb; 5) when one of them changes to low, the writing is done.

The write driver used is shown in Figure 2 (b). After the four bit lines are precharged to high, the write driver is enabled. One of BL and BLb is low and the other is floating. If the new data is the same as the data stored in the cell, for example D=1, CD will be discharged (Qb goes to CD). If the new data and the data stored in the cell are not the same, for example Q=1 and D=0, BL is low and the cell waits for Qb to go high to discharge CDb. In this situation, BL is low and is written to Q, but the discharging path is opened only after Q has propagated to Qb. CD or CDb being discharged means that the writing is finished. However, this SI SRAM is impractically large and power hungry. It may also cause complicated write fights.

4.2 More Practical Speed Independent SRAM

In fact, the SI SRAM proposed above introduces a reading into the writing operation, with the execution order “precharging, writing, reading”. However, unlike the normal reading operation, it uses the duplicated bit lines as a reading port to guarantee that the written data has been stored in the cell. The solution also has the problems discussed above. We optimize this completion detection method based on ideas borrowed from [14]. By changing the execution order of the writing operation to “precharging, reading, writing”, the duplicated bit lines in Figure 2 (a) can be removed. The normal 6T SRAM cell in Figure 2 (c) can be used instead with considerable savings, resulting in a new SI SRAM based on the standard 6T SRAM cell and an intelligent controller. SRAM cells depend on control signals. In existing asynchronous SRAMs, the control signals PreCharge, WL, and WE are issued based on timing assumptions.

Fig. 3. Block diagram of the proposed SI RAM

An intelligent controller is designed to manage these control signals based on the new execution order. To completely remove timing assumptions, Delay Insensitive (DI) circuits would be the best choice. However, DI circuits are limited in practice [2]. Instead, SI circuits suffice here. The block diagram of the controller is shown in Figure 3. Two handshake protocols ((Wr,Wa) and (Rr,Ra)) connect with the processing unit and three protocols ((Pre,Dn), (WL,Dn), and (WE,Dn)) connect with the memory system. The signals (Wr,Wa) are the writing request and acknowledgement. The (Rr,Ra) pair is the reading request and acknowledgement. The (Pre,Dn) handshake is the precharge request and done. “WL” and “WE” are defined in Figure 2. All “Dn” signals are hidden inside the SI controllers.

Fig. 4. STG specifications (panels: Reading, Writing)

The STG specifications of the reading and writing operations are shown in Figure 4. The bit lines are monitored to form a “Dn” signal. For example, after the precharging is triggered, when (BL,BLb) equals (1,1), the “Dn” signal is generated.

We combine the two STG specifications. The controller shown in Figure 5 is obtained by optimizing the Petrify solution of the combined specification. Initially, Wr, Rr, x2, and x3 are 0, 0, 1, 0. Consequently Wa, Ra, PreCharge, WL, WE, x1, x5, and x6 are 0, 0, 1, 0, 0, 0, 1, 0. The initial value of x4 is “don’t care”.

We use the writing operation as an example to show how the controller works. After the address and data are ready, the Wr signal is issued. Wr goes through gate 7 and then through to gate 10. As x2 is 1, x1 becomes 1, which drives PreCharge to 0. The low PreCharge signal opens the P-type transistors in the precharge drivers. PreCharge also goes to the SR latch formed by gates 6 and 8, resetting the latch when PreCharge is low. After the bit lines are 1 and the SR latch is reset, x1 changes to 0 and PreCharge is removed. After PreCharge is removed, WL is generated, which opens the pass transistors in the 6T cell. The data stored in the cell is then read onto the bit lines. This makes x4 equal to 1. As the SR latch has been reset, x6 becomes 1. WE then becomes 1, which opens the write driver. If the new data is the same as the data stored in the cell, that is, either (D,BL)=(1,1) or (Db,BLb)=(1,1), Wa is generated to notify the data processing unit that the data has been written into the cell. If, for example, the new data is 1 and the stored data is 0, after the write driver is opened, BLb is low, Qb is discharged to 0, and Q is charged to 1. That 1 is then transferred to BL, after which the writing is finished.
After Wa is generated, Wr is removed. Only after the controller has returned to its initial state is Wa withdrawn, ready for new reading/writing operations. Here data is assumed to be withdrawn only after Wa is removed. Clearly there is no need for duplicated bit lines in the memory cell in this method.
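The “precharging, reading, writing” order described above can be sketched as a small behavioural model. This is only an illustration of the event ordering and of completion detection on the bit lines; it is not the gate-level controller of Figure 5, and the function name and event labels are hypothetical.

```python
def si_write(stored: int, new: int):
    """Behavioural model of one SI write; returns (event trace, stored value)."""
    q = stored
    trace = []

    # Precharge (PreCharge is active low): done when both bit lines are high,
    # after which PreCharge is removed again.
    bl, blb = 1, 1
    trace.append(("PreCharge-", (bl, blb)))
    trace.append(("PreCharge+", (bl, blb)))

    # WL opens the pass transistors: the cell drives its old value onto the
    # bit lines, so exactly one of them goes low (the "reading" step).
    bl, blb = q, 1 - q
    trace.append(("WL+", (bl, blb)))

    # WE opens the write driver, which pulls low the bit line that must end
    # at 0; the cell flips if necessary and then drives the other line high.
    q = new
    bl, blb = q, 1 - q
    trace.append(("WE+", (bl, blb)))

    # Completion: the bit lines agree with (D, Db), so Wa can be raised.
    if (bl, blb) != (new, 1 - new):
        raise RuntimeError("write did not complete")
    trace.append(("Wa+", (bl, blb)))
    return trace, q


if __name__ == "__main__":
    events, value = si_write(stored=0, new=1)
    for name, bit_lines in events:
        print(name, bit_lines)
    print("stored value:", value)
```

Because the acknowledgement is derived from the bit lines rather than from a fixed delay, the same ordering holds however long each step takes at a given Vdd.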

Fig. 5. Possible implementation of the controller

Fig. 6. Waveforms under variable Vdd

For memory banks, gate 1 is duplicated. The number of duplicated gates equals the number of bits in the memory word. The inputs of each gate are the pair of bit lines corresponding to one bit of the memory word. All outputs of the duplicated gates are collected in a C element. The output of the C element is used to replace x4. Gate 5 is also duplicated. All outputs of the duplicated gates are collected in a C element, and the output of that C element is the new Wa signal.

Here an SI SRAM cell is investigated under variable Vdd. In this experiment, we use a sinusoidal Vdd starting at a low level as an example. The lowest Vdd level is 300mV, the highest is 1V, and the sinusoid’s frequency is 700kHz. Figure 6 shows the obtained waveforms. This experiment consists of a writing 0 operation followed by a reading operation and then a writing 1 operation followed by a reading operation. As Vdd is variable, each operation takes a different amount of time. For example, the first writing works under a lower Vdd. Precharging, writing data and then generating the Wa (WAck) signal took a long time. The second writing works under the highest Vdd; it completes very quickly and generates the WAck signal very quickly as well. This experiment also demonstrates that the SI SRAM structure works under continuously variable Vdd as expected.
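The C element that merges the per-bit completion signals in a memory bank can be modelled in a few lines. The sketch below uses the standard Muller C-element behaviour (the output rises only when all inputs are high, falls only when all are low, and otherwise holds its value); the class name and the 4-bit example are hypothetical.

```python
class CElement:
    """Muller C-element: follows the inputs only when they all agree."""

    def __init__(self, initial: int = 0):
        self.out = initial

    def update(self, inputs) -> int:
        values = list(inputs)
        if all(v == 1 for v in values):
            self.out = 1
        elif all(v == 0 for v in values):
            self.out = 0
        # Otherwise the inputs disagree and the previous output is held.
        return self.out


if __name__ == "__main__":
    done = CElement()
    # Per-bit completion signals of a hypothetical 4-bit word.
    print(done.update([1, 0, 1, 1]))  # one bit still pending, output held at 0
    print(done.update([1, 1, 1, 1]))  # every bit done, output rises
    print(done.update([0, 1, 1, 1]))  # signals returning to zero, output held
    print(done.update([0, 0, 0, 0]))  # all returned to zero, output falls
```

The holding behaviour is what makes the combined acknowledgement wait for the slowest bit, both when the signals rise and when they return to zero.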

4.3 A Possible Bundled SRAM Based on SI Delay Elements

However, a fully SI solution for large memory banks carries penalties in performance, area and power because it requires a large completion detection overhead. Here a new bundled method is proposed to overcome these problems. We can choose a worst column in a memory bank, usually the far end column [18], and fill it with SI SRAM cells for completion monitoring. This means that gate 1 and gate 5 are connected with the bit lines of this column in the SI controller. The memory cells of the other columns use the same control signals generated by the controller but do not provide feedback information. This means that the far end column is used as a delay element and the other columns are bundled with it. Compared to the existing method, which duplicates a column of SRAM cells, the new method employs neither duplicated cells nor reference voltages. The delay elements, being SI SRAM cells based on the same kind of cells used elsewhere in the bank, should provide correct delay tracking over a wide Vdd range. However, to actually employ such a bundling method, issues such as the dependency of delay on the data values stored and written need to be investigated in the future.

5 1Kb Memory Bank Design and Measurements

Using the proposed circuit, a 1k-bit (64x16) SI SRAM is implemented using the Cadence toolkit with the UMC 90nm CMOS technology. The design is verified with analogue simulations with SPECTRE provided in the toolkit. The chip is fully functional from as low as 190mV up to 1V. The SRAM chip was simulated by writing 16 bits to the chip, then reading them and latching the data into SI latches. Meanwhile the energy consumption and the worst case latency under different Vdds from 190mV to 1V are measured. Figure 7 shows the energy consumption of the chip during reading and writing when the data is 1 and 0. The four curves show that the minimum energy point of the chip is at 400mV-500mV. The SRAM consumes 5.8pJ at 1V when writing a 16-bit word to the SRAM memory and 1.9pJ at 400mV.

Fig. 7. Energy consumption of SRAM

Figure 8 shows the access time of the SRAM. The access time is the latency from the reading/writing request to the done signal. For example, under 1V, the worst access times for writing and reading are 5.4ns and 3.0ns respectively. Under 190mV, they are 1.6μs and 4.0μs respectively.
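The figures quoted above can be put side by side with a short calculation. The sketch below only reuses the numbers stated in the text (access times at 1V and 190mV, write energy at 1V and 400mV, 16-bit word); it assumes that "respectively" follows the order writing, reading, and the dictionary and function names are hypothetical.

```python
# Worst-case access times in seconds, as quoted in the text.
ACCESS_TIME_S = {
    "1V": {"write": 5.4e-9, "read": 3.0e-9},
    "190mV": {"write": 1.6e-6, "read": 4.0e-6},
}

# Energy in joules for writing one 16-bit word, as quoted in the text.
WRITE_ENERGY_J = {"1V": 5.8e-12, "400mV": 1.9e-12}
WORD_BITS = 16


def slowdown(op: str) -> float:
    """How many times slower the operation is at 190mV than at 1V."""
    return ACCESS_TIME_S["190mV"][op] / ACCESS_TIME_S["1V"][op]


def write_energy_per_bit(vdd: str) -> float:
    """Average write energy per bit of the 16-bit word, in joules."""
    return WRITE_ENERGY_J[vdd] / WORD_BITS


if __name__ == "__main__":
    for op in ("write", "read"):
        print(f"{op}: about {slowdown(op):.0f}x slower at 190mV than at 1V")
    for vdd in WRITE_ENERGY_J:
        print(f"write energy at {vdd}: {write_energy_per_bit(vdd) * 1e12:.3f} pJ/bit")
    ratio = WRITE_ENERGY_J["1V"] / WRITE_ENERGY_J["400mV"]
    print(f"write energy ratio 1V / 400mV: about {ratio:.1f}")
```

With these inputs, lowering Vdd from 1V to 400mV cuts the write energy by roughly a factor of three, while operating at 190mV costs two to three orders of magnitude in access time.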

Fig. 8. Access time of SRAM

6 Conclusions and Future Work

In this paper, we focus on SRAM memory design for energy harvesting systems. Normally, this kind of system works under a variable power supply and requires high power efficiency, not just low power. Under such a non-deterministic power supply assumption, existing asynchronous SRAMs based on bundled delays have huge penalties or are impractical because of a need for voltage references. The latency mismatch between SRAM memory and its controller under different Vdds is investigated. As Vdd goes down, the mismatch grows if traditional delays are used. At 190mV, the mismatch is more than twice that at the normal 1V Vdd in the UMC 90nm technology.

An SI SRAM is proposed and designed. The SRAM has a simple interface, which is similar to that of a normal SRAM and includes data, address, reading request, reading acknowledgement, writing request, and writing acknowledgement. The internal signals for memory control are fully triggered by the corresponding events of the memory system. This works by monitoring the bit lines of the memory.

A new method is proposed to implement SI writing based on the ideas from [14]. This solves the problem of completion detection for writing operations, previously considered impractical or impossible.

A 1Kb (64x16) SI SRAM is implemented using Cadence toolkits. The simulation results show the SRAM working as expected from 190mV to 1V. Meanwhile, the energy consumption and the worst case performance are measured. The measurements show the SRAM cell has acceptable characteristics.

However, the completion detection logic in SI SRAM is expensive in terms of area, performance, and power. A simplified SRAM is therefore possible based on the bundled delay principle. Unlike the existing asynchronous SRAM solutions, a column (the worst column, if it can be identified, or a dedicated column) of SI SRAM cells acts as a delay element. This column should in any case be slower than the other columns because of its completion detection overhead. The other columns of memory cells are bundled with this column. This bundled SI SRAM method requires more investigation, e.g. into the effect of data values.

In addition, we have only investigated basic asynchronous SRAM design. Other issues, such as static noise margin, readability, stability, failure rates, etc. need further study. These are the targets of our future research. We will also investigate multi-port asynchronous SRAM in the context of variable and nondeterministic Vdd.

Acknowledgement

This work is supported by the EPSRC project Holistic (EP/G066728/1) at Newcastle University. During the work, we had very helpful discussions with our colleagues, Dr Alex Bystrov and other members of the MSD research group. The authors would like to express their thanks to them.

References

[1] International Technology Roadmap for Semiconductors, http://public.itrs.net/
[2] Martin, A.J.: The limitations to delay-insensitivity in asynchronous circuits. In: Dally, W.J. (ed.) Advanced Research in VLSI, pp. 263–278. MIT Press, Cambridge (1990)
[3] Sylvester, D., Agarwal, K., Shah, S.: Variability in nanometer CMOS: Impact, analysis, and minimization. Integration, the VLSI Journal (41), 319–339 (2008)
[4] Saito, H., Kondratyev, A., Cortadella, J., Lavagno, L., Yakovlev, A.: What is the cost of delay insensitivity? In: Proc. ICCAD 1999, San Jose, CA, pp. 316–323 (November 1999)
[5] Nielsen, L.S., Staunstrup, J.: Design and verification of a self-timed RAM. In: Proc. of the IFIP International Conference on VLSI 1995 (1995)
[6] Sit, V.W.-Y., et al.: A four phase handshaking asynchronous static RAM design for self-timed systems. IEEE Journal of Solid-State Circuits 34(1), 90–96 (1999)
[7] Soon-Hwei, T., et al.: A 160Mhz 45mw asynchronous dual-port 1Mb CMOS SRAM. In: Proc. of IEEE Conference on Electron Devices and Solid-State Circuits (2005)
[8] Dama, J., Lines, A.: GHz asynchronous SRAM in 65nm. In: Proc. of 15th IEEE Symposium on Asynchronous Circuits and Systems (2009)
[9] Chang, M.F., Yang, S.M., Chen, K.T.: Wide Vdd embedded asynchronous SRAM with dual-mode self-timed technique for dynamic voltage systems. IEEE Trans. on Circuits and Systems I 56(8), 1657–1667 (2009)
[10] Wang, A., Chandrakasan, A.: A 180mv subthreshold FFT processor using a minimum energy design methodology. IEEE Journal of Solid-State Circuits 40(1), 310–319 (2005)
[11] Sekiyama, A., et al.: A 1-V operating 256 Kb full CMOS SRAM. IEEE Journal of Solid-State Circuits 27(5), 776–782 (1992)
[12] Amrutur, B.S., Horowitz, A.: A Replica technique for wordline and sense control in low power SRAM’s. IEEE Journal of Solid-State Circuits 33(8), 1208–1219 (1998)
[13] Mokhov, A., et al.: Power elastic systems: Discrete event control, concurrency reduction and hardware implementation, Tech. Report NCL-EECE-MSD-TR-2009-151, School of EECE, Newcastle University
[14] Varshavsky, V., et al.: CMOS-based SRAM Cell, USSR Patent Application 4049181/24/52011 (favourable decision made 10.10.86)
[15] Zhai, B., et al.: A Sub-200mV 6T SRAM in 0.13um CMOS. In: Proc. of ISSCC (2007)
[16] Sparsø, J., Furber, S.: Principles of asynchronous circuit design: a system perspective. Kluwer Academic Publishers, Boston (2001)
[17] Reddi, V., Gupta, M., Holloway, G., et al.: Voltage emergency prediction: a signature-based approach to reducing voltage emergencies. In: Proc. of International Symposium on High-Performance Computer Architecture, HPCA-15 (2009)
[18] Amelifard, B., Fallah, F.D., Pedram, M.: Leakage minimization of SRAM cells in a dual-Vt and dual Tox technology. IEEE Trans. on VLSI 16(7), 851–860 (2008)

L1 Data Cache Power Reduction Using a Forwarding Predictor

P. Carazo1, R. Apolloni2, F. Castro3, D. Chaver3, L. Pinuel3, and F. Tirado3

1 Universidad Politecnica de Madrid, Spain
2 Universidad Nacional de San Luis, Argentina
3 Universidad Complutense de Madrid, Spain

Abstract. In most modern processor designs the L1 data cache has become a major consumer of power due to its increasing size and high frequency access rate. In order to reduce this power consumption, we propose in this paper a straightforward filtering technique. The mechanism is based on a highly accurate forwarding predictor that determines whether a load instruction will take its data via forwarding from the load-store structure, thus avoiding the data cache access, or should fetch it from the data cache. Our simulation results show that 36% data cache power savings can be achieved on average, with a negligible performance penalty of 0.1%.

1 Introduction

Power dissipation in an out of order microprocessor is spread across different structures including caches, register files, the branch predictor, etc. Specifically, on-chip caches consume a significant part of the overall power by themselves.

In this paper we intend to reduce the L1 data cache (DL1) power consumption in an out of order processor. It can be argued that this research problem is no longer a major concern, given the industry trend towards multi-core architectures, in which the pipelines employed are in some cases simpler. However, homogeneous multi/many-core architectures with in-order pipelines will only provide substantial benefits for scalable applications/workloads, and some researchers have recently highlighted that future designs will benefit from asymmetric architectures that combine simple and power-efficient cores with a few complex and power-hungry cores [1]. The local inefficiencies of a complex core can translate into global performance-per-watt improvements, since a complex core could accelerate the serial phases of applications when the power-efficient cores are idle. This way, a single chip will be able to provide good scalability for parallel applications as well as ensure high serial performance. In summary, as promoted in [2], researchers should still investigate methods of improving sequential performance even though we have entered the multicore era. Furthermore, if several out-of-order cores are employed, either in an asymmetric or a homogeneous multi-core design, our technique can be applied to each private DL1 cache, leading to a higher benefit.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 116–125, 2011. © Springer-Verlag Berlin Heidelberg 2011

The mechanism that we propose in this paper for reducing the DL1 power consumption is based on an efficient usage of the LSQ (load-store queue), a structure responsible for keeping all in-flight memory instructions and for detecting and enforcing memory dependences in an out of order processor. One of the main LSQ tasks is to supply the correct data to load instructions via a forwarding process (store to load forwarding), overriding the cache data and therefore making the cache access unnecessary. Taking advantage of Nicolaescu’s CLSQ [3], in which the number of loads that receive their data from a previous store increases considerably, and using an accurate forwarding predictor that suggests whether a load instruction is likely to receive its data through forwarding, we manage to significantly reduce the number of accesses to the data cache in an x86 architecture. The small misprediction rate obtained translates into an IPC that remains largely unchanged.

The rest of the paper is organized as follows. Section 2 recaps related work. Section 3 reviews the conventional implementation and introduces our new mechanism. Section 4 details our experimental environment, while Section 5 outlines experimental results and analyses. Finally, Section 6 concludes.

2 Background

Many techniques for reducing cache energy consumption have been explored recently. Next, we recap some of the most notable ones. One alternative is to partition caches into several smaller caches [4], with the corresponding reduction in both access time and power cost per access. Another design, known as filter cache [5], trades performance for power consumption by filtering cache references through an unusually small L1 cache. An L2 cache, which is similar in size and structure to a typical L1 cache, is placed after the filter cache to minimize the performance loss. A different alternative, named selective cache ways [6], provides the ability to disable a subset of the ways in a set associative cache during periods of modest cache activity, whereas the full cache remains operational for more cache-intensive periods. Another approach takes advantage of the special behavior of memory references: the conventional unified data cache is replaced with multiple specialized caches, each of which handles different kinds of memory references according to their particular locality characteristics [7]. These alternatives make it possible to improve performance or power efficiency. Finally, Jin et al. [8] obtain power savings in the L1 cache by exploiting the spatial locality of loads. In their technique, loads always bring a macro data from the processor cache, allowing additional opportunities for load to load forwarding.

Nicolaescu et al. [3] propose to avoid the data cache access for those loads that receive their data through forwarding. To increase the number of such loads, they modify the LSQ design to retain load and store instructions after their commit. Thereby, a later load increases its chances of receiving its data from a previous instruction, either an in-flight store, a committed store, or a committed load. The mechanism, named cached load store queue (CLSQ), is based on the low observed rates of LSQ occupancy for some program phases, which make it possible to earmark unoccupied entries for already committed load or store instructions. Our work extends and improves on this work.

Finally, as we are using a forwarding predictor in our design, we should mention that there are many proposals relying on memory dependence prediction, which propose techniques to know in advance which pairs of store-load instructions will depend on each other and take appropriate actions [9] [10]. However, they are all overprovisioned for the goal of our work.

3 Filtering DL1 Accesses Using a Forwarding Predictor

3.1 Rationale

In most conventional microprocessors each load instruction consults the first level data cache (DL1) in order to move the required data into an available register. In parallel, the Store-Queue (SQ) is searched for a previous matching in-flight store. If one is found, the store forwards the corresponding data. Otherwise, the data is provided by the cache (Figure 1, Original Architecture).

The technique that we propose in this paper is based on the observation that if a load gets its data directly from an earlier store, the data cache access becomes completely unnecessary, and hence we could avoid it and save some power. Obviously, this is only useful if the percentage of loads that get their data from the SQ is high enough.

In a RISC processor, the number of architectural registers is commonly set to 32 and a register-register architecture is generally implemented. With such a configuration, the number of store to load forwardings is relatively small (for example, in [11], less than 15% on average), and the benefit of avoiding the DL1 access on so few occasions may be negligible. However, in a register-memory architecture with only 16 architectural registers, as in the case of x86-64, the architecture employed in this work, the number of store to load forwardings is higher as a result of the extra operations due to register spilling.

In a complementary way, we can use Nicolaescu’s CLSQ from [3], which significantly increases the number of loads that receive their data via forwarding, both due to store-load forwarding from the Cached-SQ and to load-load forwarding from the Cached-LQ.

In summary, on an x86-64 architecture using Nicolaescu’s Cached-LSQ, the number of forwardings can be relatively high (up to 40% of the loads), which makes our initial intuition appealing. However, in order to be able to filter out these accesses, we need to either serialize the LSQ and DL1 cache searches, or know in advance (i.e. make a prediction) whether the load will receive the data via forwarding or not. This is a key issue that has to be addressed.

3.2 Overall Structure

As we have just mentioned, an obvious implementation would be to serialize the accesses (as Nicolaescu does in [3]): the load first scans the SQ, and then, only when necessary, the cache is accessed (Figure 1, Nicolaescu’s Proposal). However, this design is not efficient: when a previous matching store is not found, the delay incurred in accessing the data cache results in a significant slowdown. In this paper we propose a much more convenient approach.

The design that we propose (Figure 1, Proposed Architecture) is based on a forwarding predictor: for each load, we predict whether it will receive its data through forwarding. For convenience of discussion, we loosely refer to these loads as predicted-dependent loads and to the remainder as predicted-independent loads. For predicted-dependent loads, only the SQ and the cached-LQ are accessed, omitting the DL1 access (of course, at the risk of being wrong, in which case the cache access is launched with a delay of 1 cycle). For the remaining loads, the SQ, the cached-LQ and the DL1 are all accessed in parallel (note that in this case, if the predictor is wrong, the data cache access is unnecessary). A predictor with high accuracy provides significant power savings at the cost of a tiny performance degradation. This idea has been explored in similar, yet different contexts [12].

There is a large body of research in the field of memory dependence prediction (Section 2). However, these proposals all employ sophisticated predictor structures, which are excessive for our goal of predicting in advance whether a load will receive its data through forwarding. For this reason, we have not considered them in our work. Instead, we have evaluated two kinds of simple predictors: Bloom Filter based [13] and Branch Predictor based [14].
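The access policy just described can be summarised as a small decision function. The sketch below is only an illustration of the four prediction/outcome cases: the function names are hypothetical, and the 1-cycle penalty is the one stated above for a wrongly predicted-dependent load.

```python
def load_access(predicted_dependent: bool, forwarded: bool):
    """Return (dl1_accesses, extra_cycles) for a single load."""
    if predicted_dependent:
        if forwarded:
            return 0, 0  # correct prediction: the DL1 access is filtered out
        return 1, 1      # misprediction: DL1 access launched one cycle late
    # Predicted-independent: SQ, cached-LQ and DL1 are accessed in parallel.
    # If the load is forwarded after all, the DL1 access was unnecessary,
    # but it costs energy only, not time.
    return 1, 0


def summarise(loads):
    """Total DL1 accesses and extra cycles over (predicted, forwarded) pairs."""
    accesses = cycles = 0
    for predicted_dependent, forwarded in loads:
        a, c = load_access(predicted_dependent, forwarded)
        accesses += a
        cycles += c
    return accesses, cycles


if __name__ == "__main__":
    # Hypothetical stream of four loads covering all four cases.
    stream = [(True, True), (True, False), (False, True), (False, False)]
    print(summarise(stream))
```

Only the correctly predicted-dependent load saves a DL1 access, and only the wrongly predicted-dependent load costs time, which is why a high prediction accuracy yields power savings with little performance loss.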

Bloom Filter based predictor. In this first kind of predictor, we implement a low-overhead hash table of counters: at issue time, every load and store hashes its memory address to a single entry and increments the corresponding counter. Then, at commit, the entry is decremented. In addition, at issue time, each load reads the counter before incrementing it to perform the prediction. If the value is greater than zero, there is a likely (but not certain) address match with another memory instruction, and the load is predicted to receive its data via forwarding. On the other hand, if the counter is zero, the load is predicted-independent (see footnote 1).
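The counter protocol above can be sketched as follows. This is a minimal model under stated assumptions: the class name, the method names and the hash function (drop three low-order address bits, then take the modulo) are hypothetical choices made for illustration; the paper does not specify the hash.

```python
class BloomForwardingPredictor:
    """Counting Bloom filter with a single hash function (illustrative)."""

    def __init__(self, entries: int = 64) -> None:
        self.entries = entries
        self.counters = [0] * entries

    def _index(self, address: int) -> int:
        # Hypothetical hash: ignore the 3 low-order bits, then wrap around.
        return (address >> 3) % self.entries

    def issue_load(self, address: int) -> bool:
        """Return True if the load is predicted-dependent."""
        idx = self._index(address)
        predicted_dependent = self.counters[idx] > 0  # read before increment
        self.counters[idx] += 1
        return predicted_dependent

    def issue_store(self, address: int) -> None:
        self.counters[self._index(address)] += 1

    def commit(self, address: int) -> None:
        """Called when a load or store to this address commits."""
        idx = self._index(address)
        if self.counters[idx] > 0:
            self.counters[idx] -= 1
```

A zero counter guarantees that no in-flight memory instruction hashes to the same entry, whereas a non-zero counter may be caused by an unrelated address that aliases to that entry.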

Branch Predictor based. The second kind of predictor is based on the well-known bimodal branch predictor. As with branch instructions, the majority of loads are strongly biased, so such a predictor works well. An advantage of this Bimodal Predictor over the Bloom Filter based one is that the prediction can be performed as soon as the load instruction is decoded, based on its PC. In contrast, a Bloom Filter is indexed with the memory address of the load, which must be calculated first, so in that case the prediction is delayed until the issue phase.
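A bimodal predictor adapted to forwarding can be sketched as a table of saturating counters indexed by the load PC. The class below is an assumption-laden illustration: the indexing (PC modulo table size), the initial counter value and the update point are hypothetical and are not taken from the paper.

```python
class BimodalForwardingPredictor:
    """Table of saturating counters indexed by load PC (illustrative)."""

    def __init__(self, entries: int = 256, bits: int = 2) -> None:
        self.entries = entries
        self.max_count = (1 << bits) - 1   # 3 for 2-bit counters
        self.threshold = 1 << (bits - 1)   # predict dependent at or above this
        # Assumption: every entry starts just below the threshold.
        self.table = [self.threshold - 1] * entries

    def _index(self, pc: int) -> int:
        # Hypothetical indexing: low-order PC bits.
        return pc % self.entries

    def predict(self, pc: int) -> bool:
        """Return True if the load at this PC is predicted-dependent."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, forwarded: bool) -> None:
        """Train with the actual outcome once the load has resolved."""
        idx = self._index(pc)
        if forwarded:
            self.table[idx] = min(self.table[idx] + 1, self.max_count)
        else:
            self.table[idx] = max(self.table[idx] - 1, 0)
```

With 2-bit counters a single atypical outcome does not flip the prediction of a strongly biased load, which is the usual argument for preferring them over 1-bit entries.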

Combined Predictor. Finally, we have also considered in our evaluation a combined predictor, merging a Bloom Filter with a Bimodal predictor. To extract the final decision, we predict that a load will receive its data through forwarding only when both structures predict the load to be dependent. Such a structure benefits from both the past forwarding information of loads and memory address information, giving the best results, as we show in the Evaluation section.

Footnote 1: As explained in [15], the SQ and LQ accesses could be avoided in this case. However, since a DL1 cache access consumes much more power than an LQ-SQ access, in this paper we do not consider such LQ or SQ filtering capability, which would require a deeper study.

[Figure 1 (block diagram), three panels. Original Architecture (with Nicolaescu’s Cached-LSQ): the load instruction performs an associative search of the Cached ST Queue, the In-Flight ST Queue and the Cached LD Queue in parallel with the DL1 access; a previous matching store forwards the data (FWD), otherwise the data comes from the DL1. Nicolaescu’s proposal for saving DL1 energy: the queues are searched first; (a) no previous matching ld/st: DL1 data with 1 cycle delay; (b) previous matching ld/st: FWD (DL1 filtered), 1 DL1 access saved. Proposed Architecture: a forwarding predictor classifies each load; predicted-dependent loads follow cases (a) and (b); predicted-independent loads search the queues and the DL1 in parallel, (c) same energy and delay as the original.]

Fig. 1. Original Architecture (with the Cached-LSQ), Nicolaescu’s Architecture, and our Proposed Architecture

3.3 Supporting Coherence and Consistency

The LSQ of the baseline architecture receives the invalidation requests from remote processors, so coherence and consistency can easily be supported in our technique. However, we should highlight a conflict that arises in our design when it is implemented in a system with a MESI coherence protocol: if a datum is replaced in the DL1 but remains in the Cached-LSQ, the Shared line will not be activated on a remote read request, potentially putting the remote copy in an erroneous Exclusive state (instead of the Shared state). A possible solution is to force the LSQ to activate the Shared line for every remote read to a load whose data was received via forwarding. As future work we intend to improve this mechanism since, although straightforward, it is relatively inefficient.

4 Experimental Framework

We have evaluated our proposed design using PTLsim [16], a performance-oriented simulation tool. The microarchitecture models the default PTLsim configuration, which merges features of an Intel Pentium 4 [17], an AMD K8 and an Intel Core 2 [18]. Some of the main simulation parameters are listed in Table 1.

Table 1. Simulation parameters for default PTLsim configuration

Parameter                          Value
Branch predictor                   Combined (Bim-2bits + Gshare), 2K BTAC
Instruction Fetch queue size       32
ROB size                           128
LSQ size                           80 (LQ: 48, SQ: 32)
LSAP size                          16
Physical Registers                 256
Functional Units (INT)             8: 4 ALU (2 INT, 2 FP), 2 Load, 2 Store
Fetch/Decode/Issue/Commit width    4/4/4
L1 Instruction Cache               32KB (4 way, 64B line)
L1 Data Cache                      16KB (4 way, 64B line, 2 cycles latency)
L2 Data Cache                      256KB (16 way, 64B line, 6 cycles latency)
L3 Data Cache                      4MB (32 way, 64B line, 16 cycles latency)
Main memory latency                140 cycles

The evaluation of our proposal has been performed using 24 benchmarks from the SPEC CPU2006 suite, compiled for the x86 instruction set. The technology parameters correspond to 45 nm, with a Vdd of 1.0V. We simulate regions of 100M instructions after reaching a triggering point [19], which marks the beginning of a code area in which the application behavior is representative of the overall execution.

To evaluate the impact of our data cache filtering on the power consumption of the DL1, we use CACTI 5.3 [20] to model the cache of Table 1. Specifically, to estimate the cache power consumption, we multiply the number of reads and writes to the DL1 by the power consumption of each kind of access to this cache. Furthermore, the simulator has been modified to incorporate our predictors in the microarchitectural simulation, although their power consumption is considered negligible compared with the power savings obtained in the data cache. In the following, we perform a quantitative analysis to further understand the effectiveness of the proposed design.
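The access-count estimate described above amounts to a weighted sum. The sketch below illustrates the arithmetic only: the function names are hypothetical, and the per-access energies used in the usage lines are placeholder numbers, not CACTI 5.3 outputs.

```python
def dl1_dynamic_energy(reads: int, writes: int,
                       energy_per_read: float,
                       energy_per_write: float) -> float:
    """Access counts multiplied by the per-access cost of each kind of access."""
    return reads * energy_per_read + writes * energy_per_write


def savings_percent(baseline: float, filtered: float) -> float:
    """Relative reduction of the filtered design with respect to the baseline."""
    return 100.0 * (baseline - filtered) / baseline


# Placeholder values chosen only to exercise the functions.
baseline = dl1_dynamic_energy(reads=1000, writes=500,
                              energy_per_read=0.02, energy_per_write=0.03)
filtered = dl1_dynamic_energy(reads=600, writes=500,
                              energy_per_read=0.02, energy_per_write=0.03)
print(f"DL1 savings: {savings_percent(baseline, filtered):.1f}%")
```

Because only loads are filtered, the write term is identical in both designs, so the achievable saving is bounded by the share of read energy in the total.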

5 Evaluation

5.1 Main Results

In this section we compare the data cache power and whole-system performance of the baseline and of our alternative. Figure 2 shows the power savings achieved in the data cache by our technique with respect to the Original Architecture. Figure 3 illustrates the performance impact of our proposal with respect to the Original Architecture. In these experiments we always employ the combined predictor, since it achieves the highest accuracy, as we report in the next subsection. We can draw the following conclusions.

First, with our proposed scheme, a significant fraction of loads are correctly predicted-dependent, and therefore the corresponding data cache accesses are avoided. This eliminates a significant fraction of the DL1 dynamic power consumption, as Figure 2 shows. On average, for a Bloom Filter with 64 entries and a Bimodal Predictor with 256 entries, the DL1 power savings of our approach are around 36%.

Second, and more important, in our architecture average performance remains almost untouched (around 0.1% slowdown), something that would not happen with Nicolaescu’s Proposal. The reason is that in his case, a load that finds no previous dependent store in the LSQ (i.e., has no forwarding) incurs a delay of 1 cycle in accessing the DL1, while in our case the forwarding predictor avoids this by predicting most of these loads as independent.

[Figure 2 (plot): y-axis DL1 Power Savings (%), 0 to 100; series BF=64+Bimodal=256, BF=64+Bimodal=512, BF=64+Bimodal=1024, BF=64+Bimodal=2048.]

Fig. 2. DL1 Power Savings

[Figure 3 (plot): y-axis Slowdown (%), 0.0 to 0.6; series Our Proposal (BF=64+Bimodal=256), Our Proposal (BF=64+Bimodal=2048).]

Fig. 3. Performance Impact

5.2 Forwarding Predictors

In order to compare the accuracy of the forwarding predictors evaluated (Bloom Filter, Bimodal with 1 and 2 bits per entry, and Bimodal (2 bits) plus Bloom Filter), we follow Grunwald et al. and employ the following metrics, used in confidence estimation for speculation control [21]:

– Predictive Value of a Positive test (PVP). It identifies the probability that the prediction of a load as dependent is correct. It is computed as the ratio between the number of correctly dependent-predicted loads and the total number of loads predicted as dependent.
– Predictive Value of a Negative test (PVN). It identifies the probability that the prediction of a load as independent is incorrect. It is computed as the ratio between the number of mispredicted independent loads and the total number of loads predicted as independent.
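Both metrics, as defined in the two items above, can be computed from a trace of predictions and actual outcomes. The following sketch is illustrative; the function name `pvp_pvn` and the trace are made up for the example and are not measured data.

```python
from typing import Sequence, Tuple


def pvp_pvn(predicted_dependent: Sequence[bool],
            forwarded: Sequence[bool]) -> Tuple[float, float]:
    """Return (PVP, PVN) as defined in the text.

    PVP = correctly dependent-predicted loads / loads predicted dependent
    PVN = mispredicted independent loads / loads predicted independent
    """
    pred_dep = correct_dep = pred_ind = wrong_ind = 0
    for predicted, actual in zip(predicted_dependent, forwarded):
        if predicted:
            pred_dep += 1
            correct_dep += actual      # forwarded as predicted
        else:
            pred_ind += 1
            wrong_ind += actual        # forwarded although predicted independent
    pvp = correct_dep / pred_dep if pred_dep else float("nan")
    pvn = wrong_ind / pred_ind if pred_ind else float("nan")
    return pvp, pvn


# Made-up trace of eight loads.
predictions = [True, True, True, False, False, False, False, True]
outcomes = [True, True, False, False, False, True, False, True]
print(pvp_pvn(predictions, outcomes))
```

Note that with this definition a good predictor has a high PVP and a low PVN.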

In our case, using predictors with a high PVP avoids degrading performance. On the other hand, if many loads are incorrectly independent-predicted (high PVN), many cache accesses are carried out unnecessarily, resulting in missed opportunities to reduce the DL1 power consumption. Therefore, in our design, only very high PVP values and very low PVN values are acceptable.

In Figure 4, we present the measurements of PVP and PVN for different sizes of all studied predictors. Intuitively, as we increase the size of any predictor, PVP increases and PVN decreases, leading to better predictor behavior. Note that PVN for the Bloom Filter is always zero, since no false negatives exist (when a load is independent-predicted, the predictor is never mistaken). From this figure we can conclude, in line with intuition, that combining the past forwarding information (Bimodal predictor) and memory addresses (Bloom Filter) results in the most accurate predictor (up to around 95% of hits for predicted-dependent loads and only around 6% of misses for predicted-independent loads).

[Figure 4 (scatter plot): x-axis PVP (%), 50 to 100; y-axis PVN (%), 0 to 15; series Bimodal_1-bit, Bimodal_2-bit, Bloom-Filter and Bimodal_2-bits+Bloom-Filter, with points labeled by predictor size; the corner with high PVP and low PVN is marked BEST.]

Fig. 4. PVP and PVN values for studied predictors. The results shown are the average values for all applications. For Bimodal Predictors (1 and 2 bits) the data points reflect sizes of 256, 512, 1K and 2K entries. For the Bloom Filter we show results for 64, 128 and 256 entries. Finally, the combined predictor uses a 64-entry Bloom Filter and a Bimodal Predictor (2 bits) with 256, 512, 1K and 2K entries.

6 Conclusions

The main contributions of this paper are:

– We implement and evaluate Nicolaescu’s CLSQ [3] in a different and more common microarchitectural model, the widespread x86-64.
– We propose to include a forwarding predictor to know in advance whether a load will receive its data through forwarding, in which case the DL1 access can be avoided.
– We study the effectiveness of different predictors, choosing the optimal one based on a tradeoff between accuracy and hardware needs.

Overall, the proposed filtering mechanism translates into DL1 power savings of around 36% on average for the studied predictor configuration (BF of 64 entries and Bimodal of 256 entries). Including this scheme leaves performance almost unchanged (less than 0.1% slowdown on average), with a minimal hardware cost of less than 100B.

References

1. Bower, F., Sorin, D., Cox, L.: The impact of dynamically heterogeneous multicore processors on thread scheduling. IEEE Micro 28(3), 17–25 (2008)
2. Hill, M.D., Marty, M.R.: Amdahl’s law in the multicore era. IEEE Computer 41(7), 33–38 (2008)
3. Nicolaescu, D., Veidenbaum, A., Nicolau, A.: Reducing Data Cache Energy Consumption via Cached Load/Store Queue. In: ISLPED 2003, pp. 252–257 (2003)
4. Racunas, P., Patt, Y.N.: Partitioned First-Level Cache Design for Clustered Microarchitectures. In: ICS 2003, pp. 22–31 (2003)
5. Kin, J., Gupta, M., Mangione-Smith, W.: The Filter Cache: An Energy Efficient Memory Structure. In: MICRO 1997, pp. 184–193 (1997)
6. Albonesi, D.: Selective Cache Ways: On-Demand Cache Resource Allocation. Journal of Instruction-Level Parallelism 2 (2000)
7. Lee, H., Smelyanskiy, M., Newburn, C., Tyson, G.: Stack Value File: Custom Microarchitecture for the Stack. In: HPCA 2001, pp. 5–14 (2001)
8. Jin, L., Cho, S.: Reducing Cache Traffic and Energy with Macro Data Load. In: ISLPED 2006, pp. 147–150 (2006)
9. Subramaniam, S., Loh, G.: Store Vectors for Scalable Memory Dependence Prediction and Scheduling. In: HPCA 2006, pp. 65–76 (2006)
10. Park, I., Ooi, C., Vijaykumar, T.: Reducing Design Complexity of the Load/Store Queue. In: MICRO 2003, pp. 411–422 (2003)
11. Castro, F., Chaver, D., Pinuel, L., Prieto, M., Huang, M., Tirado, F.: A Load-Store Queue Design based on Predictive State Filtering. Journal of Low Power Electronics 2(1), 27–36 (2006)
12. Sha, T., Martin, M., Roth, A.: Scalable Store-Load Forwarding via Store Queue Index Prediction. In: MICRO 2005, pp. 159–170 (2005)
13. Bloom, B.: Space/Time Trade-offs in Hash Coding with Allowable Errors. Communic. of the ACM 13(7), 422–426 (1970)
14. McFarling, S.: Combining Branch Predictors. Technical report tn-36, Western Research Laboratory, Digital Equipment Corporation (June 1993)
15. Sethumadhavan, S., Desikan, R., Burger, D., Moore, C., Keckler, S.: Scalable Hardware Memory Disambiguation for High ILP Procs. In: MICRO 2003, pp. 399–410 (2003)
16. Yourst, M.T.: PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator. In: ISPASS 2007, pp. 23–34 (2007)
17. Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., Roussel, P.: The Microarchitecture of the Pentium 4 Proc. Intel Technology Journal (Q1 2001)
18. Copenhagen Univ. College of Eng.: The Microarch. of Intel and AMD CPU’s: an Optimization Guide for Assembly Programmers and Compiler Makers (2009)
19. A hybrid timing-address oriented LSQ filtering for an x86 arch. Technical report
20. http://www.hpl.hp.com/research/cacti/
21. Grunwald, D., Klauser, A., Manne, S., Pleszkun, A.: Confidence Estimation for Speculation Control. In: ISCA 1998, pp. 122–131 (1998)

Statistical Leakage Power Optimization of Asynchronous Circuits Considering Process Variations

Mohsen Raji, Alireza Tajary, Behnam Ghavami, Hossein Pedram, and Hamid R. Zarandi

Department of Computer Engineering and Information Technology, Amirkabir University of Technology (Tehran Polytechnic), Tehran, I. R. Iran {raji,tajary,ghavamib,pedram,h_zarandi}@aut.ac.ir

Abstract. Increasing levels of process variability in the deep sub-micron era have become a critical concern for performance- and power-constrained designs. This paper introduces a framework for the statistical leakage power minimization of template-based asynchronous circuits considering process variation. We propose a statistical Dual-Vt assignment for asynchronous circuits that considers the variability in both the performance and the leakage power consumption of a circuit. The circuit model used is an extended Timed Petri-Net named Variant-Timed Petri-Net, which captures the dynamic behavior of the circuit with statistical delay and leakage power values. We apply a genetic algorithm that uses a 2-dimensional graph to calculate the fitness of each threshold voltage assignment. Experimental results show that with this statistically aware optimization, leakage power can be reduced by 40.5% and 54.4% for the mean and the variance values, respectively.

1 Introduction

In asynchronous circuits, local signalling eliminates the need for global synchronization, which offers some potential advantages in comparison with synchronous ones [1] [2] [3] [4] [5]. Asynchronous design allows reducing dynamic power consumption because activity is controlled by requests rather than by clock edges. On the other hand, the request reception and acknowledgment emission capabilities have a cost in the number of transistors. Moreover, in deep sub-micron technologies the leakage current is becoming more and more significant [6].

There are many techniques for designing dual threshold voltage (dual-Vth hereafter) synchronous circuits. However, dual-Vth cannot be applied to asynchronous circuits in the same way as to synchronous circuits. This is because it is difficult to define or identify a critical path in asynchronous circuits (where it starts and where it stops), at least with CAD tools that have been designed for synchronous circuits. In [7], a method to synthesize a dual-Vth asynchronous design has been proposed.

As process geometries continue to shrink, controlling critical device parameters is becoming increasingly difficult, and significant variations in device length, doping concentrations, and oxide thicknesses have resulted. This issue is called process variation. In deep sub-micron technologies, the variability of circuit features such as delay or leakage power due to process variations has become a significant concern. The tremendous impact of variability was demonstrated recently in [11], showing a 20X variation in leakage power for a 1.3X variation in delay between fast and slow dies. The wide spread in the leakage power distribution has emerged as an important cause of yield loss due to bounds on static power dissipation [12]. Statistical analysis is a practical approach in circuit design to tolerate process variation [21] [27] [28]. Many works have applied statistical analysis to synchronous circuits to mitigate the impact of variation [27] [28]. A statistical performance analysis of asynchronous circuits has also been proposed in [23]. However, to the best of our knowledge, no method has been proposed that considers process variation in the power consumption analysis of asynchronous circuits. In this paper, we present a process variation-aware leakage power optimization framework for asynchronous circuits.

The remainder of the paper is organized as follows: Section 2 provides the background necessary for reading the paper. Section 3 introduces the statistical threshold voltage assignment framework. The Vth assignment algorithm is described in detail in Section 4, while in Section 5 we give our experimental results on some related benchmarks. Finally, some conclusions are drawn in Section 6.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 126–136, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Background

2.1 Dual-Vth Circuit Design

The dual-Vth design technique uses two kinds of transistors in the same circuit: some transistors have a high threshold voltage, while others have a low threshold voltage. The high-threshold-voltage transistors have less sub-threshold leakage power dissipation but a larger delay than the low-threshold-voltage transistors. In a dual threshold voltage implementation of custom VLSI designs, the gates on non-critical paths are assigned high Vth, and the gates on the critical path are assigned low Vth. The objective is to maximize the number of transistors with high threshold voltage without sacrificing the performance of the circuit. The impact of this approach relies heavily on the efficiency of the threshold voltage assignment algorithm. Recently, researchers have proposed many design techniques for selecting and assigning threshold voltages to the gates of a circuit, which reduce leakage power under performance constraints [14].

However, the dual-threshold-voltage design techniques proposed in the literature for custom VLSI designs cannot be used for asynchronous ones. This is because the performance analysis of an asynchronous circuit is completely different from that of a synchronous one, owing to the dependencies between highly concurrent events. While synchronous performance estimation is based on a static critical path analysis affected only by the delay of components and interconnecting wires, it has been shown that the performance of an asynchronous circuit depends on dynamic factors such as the number of tokens in the circuit. In the clocked case, the critical path has a clear beginning and a clear end because all paths are broken by latches. No such clear separation is available in asynchronous circuits. Therefore, a special approach is necessary to analyze the performance of asynchronous circuits.

2.2 Timed Petri-Net

Petri-Nets are used as an elegant modelling formalism to model concurrency and synchronization in many applications, including asynchronous circuit modelling [20]. A Petri-Net is a four-tuple $(P, T, F, m_0)$, where $P$ is a finite set of places, $T$ is a finite set of transitions, $F \subseteq (P \times T) \cup (T \times P)$ is a flow relation, and $m_0$ is the initial marking. A marking is a token assignment for the places and it represents the state of the system. A Timed Petri-Net (TPN hereafter) is a Petri-Net in which some transitions or places can be annotated with delays. A Variant-Timed Petri-Net (VTPN) is a TPN in which the delays on the transitions or places are modelled statistically using probability density functions. In order to analyse asynchronous circuits statistically, we use a VTPN to model the circuit.
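The four-tuple definition above maps directly onto a small data structure. The sketch below implements the standard Petri-Net firing rule (a transition is enabled when every input place holds a token); the class and method names are hypothetical, and delays are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class PetriNet:
    places: Set[str]
    transitions: Set[str]
    flow: Set[Tuple[str, str]]   # arcs (place, transition) or (transition, place)
    marking: Dict[str, int]      # tokens per place, initially m0

    def preset(self, t: str) -> Set[str]:
        """Input places of transition t."""
        return {src for (src, dst) in self.flow if dst == t}

    def postset(self, t: str) -> Set[str]:
        """Output places of transition t."""
        return {dst for (src, dst) in self.flow if src == t}

    def enabled(self, t: str) -> bool:
        return all(self.marking.get(p, 0) > 0 for p in self.preset(t))

    def fire(self, t: str) -> None:
        if not self.enabled(t):
            raise ValueError(f"transition {t} is not enabled")
        for p in self.preset(t):
            self.marking[p] -= 1
        for p in self.postset(t):
            self.marking[p] = self.marking.get(p, 0) + 1


# A two-place ring holding a single token.
net = PetriNet(
    places={"p1", "p2"},
    transitions={"t1", "t2"},
    flow={("p1", "t1"), ("t1", "p2"), ("p2", "t2"), ("t2", "p1")},
    marking={"p1": 1, "p2": 0},
)
net.fire("t1")
print(net.marking)
```

In a TPN or VTPN each place or transition would additionally carry a delay, deterministic in the first case and a distribution in the second.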

3 Statistical Dual Vth Asynchronous Circuits Design Framework

Fig. 1 shows the general structure of the proposed statistical leakage power optimization scheme and its interface with the asynchronous synthesis flow. To model the dual-threshold design of asynchronous circuits as an optimization problem, a suitable circuit and performance model of the asynchronous circuit is required. In this work, the output of Decomposition is translated into a Variant-Timed Petri-Net model for performance analysis and for the assignment of low or high Vth to each template. Then, a VTPN simulator runs the circuit model and provides the dynamic information of the original circuit, such as the token assignment.

The proposed optimizer includes a statistical static performance analyser, which provides performance information, and a Vth-assignment engine, which assigns high or low Vth to the templates of the circuit. The assignment of Vth is done using a heuristic method. The optimized circuit is then given as input to the Template Synthesizer to generate a netlist of standard-cell elements.

4 A Vth-Assignment Algorithm

The power optimization flow uses a genetic algorithm (GA) and is shown in Fig. 2, which gives the basic configuration of the GA. The genetic algorithm maintains a population of m individuals at each generation g. Each individual is a candidate solution of the dual-Vth assignment problem and has n chromosomes, where n is the number of VTPN nodes. Each chromosome can take two values: ‘0’ means that low Vth has been assigned to the node and ‘1’ means that high Vth has been assigned to it. As there is a tradeoff between the performance and the power consumption of the circuit in the dual-Vth technique, the proposed Vth assignment algorithm has two optimization objectives. Once the performance and the leakage power have been analyzed, the fitness of the individuals is evaluated. We apply a 2-dimensional fitness graph to assign a total fitness value to each individual. Genetic operations are then applied to reproduce the population for the new generation. This process continues until a termination criterion is met.
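The encoding described above (one bit per VTPN node, 0 for low Vth and 1 for high Vth) can be sketched as follows. The paper does not state which genetic operators it uses, so the one-point crossover and bit-flip mutation shown here are assumptions, and all function names are hypothetical.

```python
import random
from typing import List, Tuple

LOW_VTH, HIGH_VTH = 0, 1


def random_individual(n_nodes: int, rng: random.Random) -> List[int]:
    """One candidate Vth assignment: a bit per VTPN node."""
    return [rng.choice((LOW_VTH, HIGH_VTH)) for _ in range(n_nodes)]


def one_point_crossover(a: List[int], b: List[int],
                        rng: random.Random) -> Tuple[List[int], List[int]]:
    """Swap the tails of two parents (assumed operator; needs len >= 2)."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(individual: List[int], rate: float,
           rng: random.Random) -> List[int]:
    """Flip each bit independently with probability `rate` (assumed operator)."""
    return [1 - gene if rng.random() < rate else gene for gene in individual]


rng = random.Random(0)
population = [random_individual(6, rng) for _ in range(4)]
child_a, child_b = one_point_crossover(population[0], population[1], rng)
child_a = mutate(child_a, rate=0.1, rng=rng)
```

The search space has 2^n assignments for n nodes, which is why a heuristic search is used instead of enumeration.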

Fig. 1. Statistical Dual-Vth Asynchronous Circuit Design Framework

Fig. 2. The Vth assignment flow

4.1 Statistical Mathematical Operations

The delay and leakage power of each node in the VTPN are modelled as random variables with normal distribution. So the delay and power of the nodes in the VTPN have a mean value, $\mu$, and a set of parameter variations. The linear model used to approximate delay in the analysis is as follows:

$$ d = \mu + \sum_{i=1}^{m} s_i \, \Delta p_i \qquad (1) $$

where $d$ is the delay of a gate, $\mu$ is the mean value of the delay, $s_i$ is the delay sensitivity to process parameter $p_i$, $\Delta p_i$ is the variation of parameter $p_i$ for this gate, and $m$ is the number of process parameters. As the computation is done statistically, the statistical operations are explained first. The three operations used in our method are SUM, DIV and MAX. First of all, suppose there are two random variables modelled as below:

$$ d_1 = \mu_1 + \sum_{i=1}^{m} s_{1,i} \, \Delta p_i \qquad (2) $$

$$ d_2 = \mu_2 + \sum_{i=1}^{m} s_{2,i} \, \Delta p_i \qquad (3) $$

In order to make the problem simpler, it is assumed that the parameters are uncorrelated. So the standard deviation of a random variable $d$ is calculated as:

$$ \sigma_d = \sqrt{\sum_{i=1}^{m} s_i^2 \, \sigma_{p_i}^2} \qquad (4) $$

where $\sigma_{p_i}$ is the standard deviation of $\Delta p_i$ (equal to 1 when the parameter variations are normalized). The covariance between paths (here between paths 1 and 2) can then be calculated easily through the equation below:

$$ \mathrm{cov}(d_1, d_2) = \sum_{i=1}^{m} s_{1,i} \, s_{2,i} \, \sigma_{p_i}^2 \qquad (5) $$
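The standard deviation and covariance of two delays in this linear form follow directly from their sensitivity vectors. The sketch below assumes, for simplicity, that each parameter variation is an independent, zero-mean, unit-variance normal variable; that normalization and the function names are assumptions made for the example.

```python
import math
from typing import Sequence


def sigma(sens: Sequence[float]) -> float:
    """Standard deviation of mu + sum(s_i * dp_i) for unit-variance dp_i."""
    return math.sqrt(sum(s * s for s in sens))


def covariance(sens1: Sequence[float], sens2: Sequence[float]) -> float:
    """Covariance of two delays that share the same process parameters."""
    return sum(a * b for a, b in zip(sens1, sens2))


def correlation(sens1: Sequence[float], sens2: Sequence[float]) -> float:
    """Correlation coefficient rho of the two delays."""
    return covariance(sens1, sens2) / (sigma(sens1) * sigma(sens2))


s1 = (3.0, 4.0)   # sensitivities of delay 1 to two parameters
s2 = (4.0, 3.0)   # sensitivities of delay 2 to the same parameters
print(sigma(s1), covariance(s1, s2), correlation(s1, s2))
```

Two delays are correlated only through the parameters to which both are sensitive, which is what makes the covariance a simple dot product.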

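4.1.1 SUM Operation

The sum of two random variables with normal distribution results in a random variable with normal distribution. The SUM operation along each cycle is computed as follows:

$$ d_{sum} = d_1 + d_2 = \mu_{sum} + \sum_{i=1}^{m} s_{sum,i} \, \Delta p_i \qquad (6) $$

$$ \mu_{sum} = \mu_1 + \mu_2, \qquad s_{sum,i} = s_{1,i} + s_{2,i} \qquad (7) $$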

4.1.2 DIV Operation

In calculating the SCM of a cycle, the sum of the delay values of the cycle is divided by the number of tokens in the cycle, $m_0$. As the sum of delays modelled by normal random variables is still a normal random variable, the parameters of the division are calculated as follows:

$$ \mu_{div} = \mu_{sum} / m_0 \qquad (8) $$

$$ s_{div,i} = s_{sum,i} / m_0 \qquad (9) $$
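The SUM and DIV operations act component-wise on the mean and on the sensitivities, so the statistical metric of a cycle can be built by folding SUM over its places and dividing once by the token count. The `Canonical` class and function names below are hypothetical, and the numbers are arbitrary illustrative values.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Canonical:
    """Delay in linear form: mean + sum(sens[i] * dp_i)."""
    mean: float
    sens: Tuple[float, ...]


def stat_sum(a: Canonical, b: Canonical) -> Canonical:
    """SUM: means add, and sensitivities add parameter by parameter."""
    return Canonical(a.mean + b.mean,
                     tuple(x + y for x, y in zip(a.sens, b.sens)))


def stat_div(a: Canonical, tokens: int) -> Canonical:
    """DIV: divide the mean and every sensitivity by the token count."""
    return Canonical(a.mean / tokens, tuple(s / tokens for s in a.sens))


def cycle_metric(place_delays: Sequence[Canonical], tokens: int) -> Canonical:
    """Statistical sum of the place delays of a cycle, divided by its tokens."""
    return stat_div(reduce(stat_sum, place_delays), tokens)


places = [Canonical(2.0, (0.5, 0.25)), Canonical(4.0, (0.25, 0.75))]
print(cycle_metric(places, tokens=2))
```

Both operations keep the result in the same linear form, so no approximation is introduced at this stage.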

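4.1.3 MAX Operation

The maximum of two normal random variables is not necessarily a normal random variable. The MAX of two random variables with normal distribution, $N_1$ and $N_2$, can be approximated by another normal random variable $N_{max}$ using the relationship proposed in [21], as follows:

$$ N_{max} = \max(N_1, N_2) \approx \mu_{max} + \sum_{i=1}^{m} s_{max,i} \, \Delta p_i $$

$$ \theta = \sqrt{\sigma_1^2 + \sigma_2^2 - 2 \rho \, \sigma_1 \sigma_2} \qquad (10) $$

$$ \alpha = (\mu_1 - \mu_2) / \theta \qquad (11) $$

$$ \mu_{max} = \mu_1 \Phi(\alpha) + \mu_2 \Phi(-\alpha) + \theta \, \varphi(\alpha) \qquad (12) $$

$$ s_{max,i} = s_{1,i} \, \Phi(\alpha) + s_{2,i} \, \Phi(-\alpha) \qquad (13) $$

Here, $\rho$ represents the correlation coefficient between $N_1$ and $N_2$, and $\Phi$ and $\varphi$ are the cumulative distribution function (CDF) and the probability density function (PDF) of a standard normal (i.e., mean 0, standard deviation 1) distribution, respectively.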
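A common way to approximate the maximum of two correlated normal variables is Clark's moment matching, with the sensitivities of the result weighted by the probability that each operand is the larger one. The sketch below uses that well-known scheme as a stand-in; whether it matches the exact relationship of [21] is an assumption, as is the unit-variance normalization of the parameters and every name used.

```python
import math
from typing import Sequence, Tuple


def std_normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def stat_max(mu1: float, s1: Sequence[float],
             mu2: float, s2: Sequence[float]) -> Tuple[float, Tuple[float, ...]]:
    """Approximate max of two delays given as (mean, sensitivities)."""
    var1 = sum(a * a for a in s1)
    var2 = sum(b * b for b in s2)
    cov = sum(a * b for a, b in zip(s1, s2))      # equals rho * sigma1 * sigma2
    theta = math.sqrt(max(var1 + var2 - 2.0 * cov, 0.0))
    if theta == 0.0:
        # Both delays vary identically: the larger mean always wins.
        return (mu1, tuple(s1)) if mu1 >= mu2 else (mu2, tuple(s2))
    alpha = (mu1 - mu2) / theta
    t = std_normal_cdf(alpha)                     # probability that delay 1 is larger
    mean = mu1 * t + mu2 * (1.0 - t) + theta * std_normal_pdf(alpha)
    sens = tuple(t * a + (1.0 - t) * b for a, b in zip(s1, s2))
    return mean, sens


print(stat_max(10.0, (1.0, 0.0), 9.5, (0.0, 1.0)))
```

For two independent standard normal variables the sketch returns a mean of 1/sqrt(pi), which is the exact expected value of their maximum and serves as a quick sanity check.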

4.2 Performance and Leakage Power Analysis

Performance of any computation modelled with a VTPN is dictated by the cycle time of the VTPN and thus by the largest cycle metric. A cycle $c$ in a VTPN is a sequence of places $p_1, p_2, p_3, \ldots, p_1$ connected by arcs and transitions, whose first and last places are the same. The statistical cycle metric, $SCM(c)$, is the statistical sum of the delays of all places along the cycle $c$, $d(c)$, divided by the number of tokens that reside in the cycle, $m_0(c)$:

$$ SCM(c) = d(c) / m_0(c) \qquad (14) $$

The cycle time of a VTPN is defined as the largest cycle metric among all cycles in the VTPN, which must be computed statistically, i.e. $\max_{c \in C} SCM(c)$, where $C$ is the set of all cycles in the VTPN.

As mentioned before, the delays and the power consumptions of the nodes in the VTPN are modelled statistically, so the algorithm has to use the statistical mathematical operations. Performance analysis of asynchronous circuits modelled by VTPNs is comprehensively discussed in [8] [9] [23]. Power analysis, on the other hand, needs one main calculation: finding the sum of the power consumptions of the nodes of the VTPN.
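As a simplified illustration of the cycle metric and cycle time, the sketch below works with mean delays only, so the maximum is an ordinary `max` rather than the statistical MAX. The cycles are given explicitly; the delay values, token counts and function names are invented for the example.

```python
from typing import Sequence, Tuple


def nominal_cycle_metric(place_delays: Sequence[float], tokens: int) -> float:
    """Sum of the place delays of a cycle divided by its token count."""
    if tokens <= 0:
        raise ValueError("a cycle must hold at least one token")
    return sum(place_delays) / tokens


def nominal_cycle_time(cycles: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Largest cycle metric over all cycles (mean values only)."""
    return max(nominal_cycle_metric(delays, tokens) for delays, tokens in cycles)


# Each entry: (delays of the places on the cycle, tokens in the cycle).
cycles = [
    ([2.0, 3.0, 1.0], 1),        # metric 6.0
    ([4.0, 4.0, 2.0, 2.0], 2),   # metric 6.0
    ([5.0, 2.0], 1),             # metric 7.0
]
print(nominal_cycle_time(cycles))
```

The second cycle has the largest total delay but also two tokens, so it is not the limiting one; this dependence on the token count is what distinguishes the analysis from a static critical path search.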

4.3 Fitness Function

The fitness of a chromosome should be related to both the leakage power consumption and the performance metric of that particular configuration, since improving one causes the other to degrade. So we apply a 2-dimensional fitness evaluation to the individuals. In each step, the fitness weight of each configuration is calculated as the number of configurations whose two parameters are both better than those of the current configuration. Fig. 3 shows an example of one step of fitness evaluation. In this figure, for example, an individual with weight 4 means that there are four individuals with both better leakage power and better delay metric than that individual. As the power and performance analysis is performed statistically, we need a deterministic measurement to find a position in the 2-dimensional graph. So we use the formulas below to find a deterministic value for each of the parameters:

[Equations (15) and (16) are not recoverable from the source text.]

where the quantities involved are the mean value and the standard deviation of the statistical cycle metric of each configuration, and the mean value and the standard deviation of the leakage power of that configuration, respectively.

In the last step, we have to choose a configuration as the result of the optimization. Depending on the application for which the optimization is done, the power and the performance of the desired configuration can be given specific weights in this last step.
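The fitness weight described in this section is a domination count over the 2-dimensional (delay, power) graph. The sketch below takes the deterministic delay and power values of each configuration as given inputs, since how they are derived from the statistical results is not reproduced here; the function name and the sample points are hypothetical.

```python
from typing import List, Sequence, Tuple


def fitness_weights(points: Sequence[Tuple[float, float]]) -> List[int]:
    """For each (delay, power) point, count the points better in both.

    "Better" means strictly smaller delay and strictly smaller power,
    so a lower weight indicates a fitter configuration.
    """
    weights = []
    for i, (delay_i, power_i) in enumerate(points):
        dominated_by = sum(
            1
            for j, (delay_j, power_j) in enumerate(points)
            if j != i and delay_j < delay_i and power_j < power_i
        )
        weights.append(dominated_by)
    return weights


# Invented deterministic (delay, power) values for five configurations.
configs = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (1.0, 4.0), (4.0, 1.5)]
print(fitness_weights(configs))
```

Configurations with weight 0 are those that no other configuration beats on both objectives, so they form the current trade-off front between delay and leakage power.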

Fig. 3. An Example for Fitness Evaluations Method

5 Experimental Results

To test our method, we constructed a multiple-Vth standard cell library using a 90 nm process. For NMOS (PMOS) transistors, the high threshold voltage and the low threshold voltage are 0.22V (-0.22V) and 0.12V (-0.12V), respectively. The library was characterized using the Berkeley 90 nm BSIM predictive model [26]. An asynchronous synthesis toolset (for the sake of blind review, we don’t cite its name here) was employed to synthesize the benchmarks. The circuits were optimized for maximum speed and lowest leakage power consumption simultaneously using the 2-dimensional fitness graph. We observe that, on average, leakage power in dual-Vth asynchronous circuits can be reduced by 86% in standby mode.

To verify the results of our statistical dual-Vth assignment method, we used Monte Carlo (MC) simulation for comparison. To balance accuracy against runtime, we chose to run 10,000 iterations for the MC simulation. The runtime of the MC simulation ranges from 30 minutes to 48 hours, depending on circuit size and complexity. A comparison of these results with those from the statistical approach is shown in Tables 1 and 2. For each test case, the mean and standard deviation (SD) values of the leakage power consumption and of the performance metric are listed for both methods. The mean values of the proposed method are close to the MC results, while the variance values differ more: the average error is 3.56% and 52.08% for the mean and the variance of the delays, respectively, and 5.23% and 48.39% for the mean and the variance of the power, respectively. Although there is some error between the proposed method and MC simulation, there is a considerable difference in the runtime of the methods, as shown in Table 3.

Table 1. Result Comparison of the Statistical Dual-Vth Assignment and MC-based Dual and Single Vth Assignment Simulation (Delay Values)

                                      Proposed Flow          Monte-Carlo Dual-Vth   Monte-Carlo Single-Vth
                                      Delay (ns)             Delay (ns)             Delay (ns)
Benchmark  # of Nodes  # of Cycles    Mu (μ)    Sigma (σ)    Mu (μ)    Sigma (σ)    Mu (μ)    Sigma (σ)
A          6           17             8.540     0.243        8.091     1.607        8.102     1.986
B          10          51             7.533     0.235        8.54      1.033        8.601     1.589
C          16          1389           14.711    0.251        14.54     1.105        14.729    2.307
D          26          1864           17.207    0.407        16.984    1.554        17.108    8.032
E          35          7369           15.909    0.198        15.317    0.998        15.399    3.671
F          20          276            13.724    0.247        14.79     2.193        14.84     1.903
G          56          812            16.932    0.341        16.428    1.817        16.609    2.108

Table 2. Result Comparison of the Statistical Dual-Vth Assignment and MC-based Dual and Single Vth Assignment Simulation (Leakage Power Values)

                                      Proposed Flow          Monte-Carlo Dual-Vth   Monte-Carlo Single-Vth
                                      Power (mW)             Power (mW)             Power (mW)
Benchmark  # of Nodes  # of Cycles    Mu (μ)    Sigma (σ)    Mu (μ)    Sigma (σ)    Mu (μ)    Sigma (σ)
A          6           17             32.00     1.400        34.27     2.716        54.56     3.021
B          10          51             81.00     2.5865       75.87     5.020        137.36    4.907
C          16          1389           108.00    2.7893       103.30    6.049        183.85    8.145
D          26          1864           186.00    3.6633       175.90    7.9210       318.28    6.0843
E          35          7369           159.11    2.8184       152.51    5.0319       276.05    8.7823
F          20          276            169.00    3.8458       157.85    10.675       263.91    8.134
G          56          812            339.36    4.6304       344.91    6.2742       609.35    8.6292


The results of dual-Vth are compared with the delay and power values of the single-Vth technique in Tables 1 and 2. As reported, the proposed method optimizes the leakage power consumption of the benchmarks at the expense of some performance overhead. The average reduction is 40.5% and 54.4% for the mean and the variance of the power, respectively. Table 3 shows the runtime of each method for our benchmarks. It varies across benchmarks depending on circuit size and timing constraints.

Table 3. The Runtime for the Statistical Dual-Vth Assignment in Comparison with MC-based Simulation

Benchmark  # of Nodes  # of Cycles  SDV Runtime (Minute)  MC Runtime (Hour)
A          6           17           2                     0.5
B          10          51           4                     3.2
C          16          1389         6                     11.7
D          26          1864         7                     17.3
E          35          7369         9                     37.3
F          20          276          7                     16.4
G          56          812          11                    47.8
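To make the runtime gap in Table 3 concrete, the following minimal sketch converts both columns to minutes and computes the MC-to-SDV runtime ratio per benchmark. The dictionary and function names are illustrative only; the numbers are transcribed from Table 3.

```python
# Runtimes transcribed from Table 3: (SDV in minutes, MC in hours).
RUNTIMES = {
    "A": (2, 0.5),
    "B": (4, 3.2),
    "C": (6, 11.7),
    "D": (7, 17.3),
    "E": (9, 37.3),
    "F": (7, 16.4),
    "G": (11, 47.8),
}


def runtime_ratio(sdv_minutes, mc_hours):
    """Return how many times longer the MC run is than the SDV run."""
    return mc_hours * 60.0 / sdv_minutes


ratios = {name: runtime_ratio(sdv, mc) for name, (sdv, mc) in RUNTIMES.items()}

for name, ratio in ratios.items():
    print(f"Benchmark {name}: MC takes about {ratio:.0f}x the SDV runtime")
```

On these figures the ratio grows from about one order of magnitude on the smallest benchmark to more than two on the largest ones.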

6 Conclusions

In this paper, we presented an efficient statistical dual-threshold voltage assignment method that reduces the leakage power of asynchronous circuits while maintaining their high performance. Process variation is taken into account through a statistical approach to the timing and power analysis of asynchronous circuits. The decomposed circuit is used to generate a Variant-Timed Petri Net model, and the assignment of high and low threshold voltages is based on a genetic algorithm. The experimental results show the efficiency of the proposed method.

We see many avenues for further investigation. To make the framework more accurate and reduce the error of the method, we will consider the correlation between delay and leakage power values in future work. In addition, applying our method to a broader class of concurrent systems, such as GALS and embedded systems, is a promising research topic for asynchronous circuit design, as it is for synchronous design.

References

[1] Tang, C.K., Lin, C.Y., Lu, Y.C.: An Asynchronous Circuit Design with Fast Forwarding Technique at Advanced Technology Node. In: Proceedings of ISQED 2008. IEEE Computer Society, Los Alamitos (2008)
[2] Beerel, P.A.: Asynchronous Circuits: An Increasingly Practical Design Solution. In: Proceedings of ISQED 2002. IEEE Computer Society, Los Alamitos (2002)
[3] Martin, A.J., et al.: The Lutonium: A Sub-Nanojoule Asynchronous 8051 Microcontroller. In: Proceedings of ASYNC 2003 (2003)
[4] Yun, K.Y., Beerel, P.A., Vakilotojar, V., Dooply, A.E., Arceo, J.: A low-control-overhead asynchronous differential equation solver. In: Proceedings of ASYNC 1997 (1997)
[5] Garnica, O., Lanchares, J., Hermida, R.: Fine-grain asynchronous circuits for low-power high performance DSP implementations. In: Proceedings of SiPS (2000)
[6] Narendra, S.G., Chandrakasan, A. (eds.): Leakage in Nanometer CMOS Technologies. Springer, Heidelberg (2005)
[7] Ghavami, B., Pedram, H.: Design of dual threshold voltages asynchronous circuits. In: Proceedings of ISLPED 2008 (2008)
[8] Raji, M., Ghavami, B., Pedram, H.: Statistical Static Performance Analysis of Asynchronous Circuits Considering Process Variation. In: Proceedings of ISQED 2009, pp. 291–296 (2009)
[9] Raji, M., Ghavami, B., Pedram, H., Zarandi, H.R.: Process Variation Aware Performance Analysis of Asynchronous Circuits Considering Spatial Correlation. In: Monteiro, J., van Leuken, R. (eds.) PATMOS 2009. LNCS, vol. 5953, pp. 5–15. Springer, Heidelberg (2010)
[10] Orshansky, M., Nassif, S.R., Boning, D.: Design for Manufacturability and Statistical Design, A Constructive Approach, pp. 11–15. Springer, Heidelberg
[11] Borkar, S., et al.: Parameter variation and Impact on Circuits and Microarchitecture. In: Proceedings of DAC 2003, pp. 338–342 (2003)
[12] Rao, R., et al.: Parametric yield estimation considering leakage variability. In: Proceedings of DAC 2004, pp. 442–447 (June 2004)
[13] Orshansky, M., Nassif, S.R., Boning, D.: Design for Manufacturability and Statistical Design, A Constructive Approach, pp. 11–15. Springer, Heidelberg (2008)
[14] Wei, L., Chen, Z., Roy, K., Johnson, M.C., Ye, Y., De, V.K.: Design optimization of dual-threshold circuits for low-voltage low-power applications. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 7(1), 16–24 (1999)
[15] Wong, C.G., Martin, A.J.: High-Level Synthesis of Asynchronous Systems by Data Driven Decomposition. In: Proceedings of DAC (2003)
[16] Dinh Duc, A.V., Rigaud, J.B., Rezzag, A., Sirianni, A., Fragoso, J., Fesquet, L., Renaudin, M.: TASTCAD Tools: Tutorial. In: Proceedings of ASYNC (2002)
[17] Prakash, P., Martin, A.J.: Slack Matching Quasi Delay-Insensitive Circuits. In: Proceedings of ASYNC, pp. 195–204 (2006)
[18] Wong, C.G., Martin, A.J.: High-Level Synthesis of Asynchronous Systems by Data Driven Decomposition. In: Proceedings of 40th DAC, Anaheim, CA, USA (2003)
[19] Beerel, P.A., Kim, N.-H., Lines, A., Davies, M.: Slack Matching Asynchronous Designs. In: Proceedings of ASYNC, Washington, DC, USA (2006)
[20] Peterson, J.L.: Petri Net Theory and the Modeling of Systems. Prentice-Hall, Englewood Cliffs (1981)
[21] Li, X., Le, J., Pileggi, L.T.: Statistical Performance Modeling and Optimization. In: Foundations and Trends in Electronic Design Automation, vol. 1(4), pp. 331–480 (2003)
[22] Kuo, J.T., Cheng, W.C., Chen, L.: Multiobjective water resources systems analysis using genetic algorithms - application to Chou-Shui River Basin, Taiwan. Water Science and Technology 48(10), 71–77 (2003)
[23] Raji, M., et al.: Process variation-aware performance analysis of asynchronous circuits. Microelectron. J. (2010) doi:10.1016/j.mejo.2009.12.013
[24] Lane, B.: SystemC Language Reference Manual, Copyright © Open SystemC Initiative, San Jose, CA (2003)
[25] Karp, R.M.: A characterization of the minimum cycle mean in a digraph. Discrete Mathematics 23, 309–311 (1978)
[26] Sheu, B.J., Scharfetter, D.L., Ko, P.K., Teng, M.C.: BSIM: Berkeley Short-Channel IGFET Model for MOS Transistors. IEEE Journal of Solid-State Circuits SC-22(4), 558–566 (1987)
[27] Chang, H., Sapatnekar, S.: Statistical timing analysis under spatial correlations. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24(9), 1467–1482 (2005)
[28] Agarwal, A., Blaauw, D., Zolotov, V.: Statistical timing analysis for intra-die process variations with spatial correlations. In: Proceedings of ICCAD, pp. 900–907 (2003)

Optimizing and Comparing CMOS Implementations of the C-Element in 65nm Technology: Self-Timed Ring Case

Oussama Elissati1,2, Eslam Yahya1,3, Sébastien Rieubon2, and Laurent Fesquet1

1 TIMA Laboratory, Grenoble, France {Oussama.Elissati,Eslam.Yahya,Laurent.Fesquet}@imag.fr 2 ST-Ericsson, Grenoble, France [email protected] 3 Banha High Institute of Technology, Banha, Egypt

Abstract. Self-timed rings are a promising approach for designing high-speed serial links or clock generators. This study focuses on the ring stage components, a C-element and an inverter, and compares the performance of different implementations of these components in terms of speed, power consumption and phase noise. We also propose a new self-timed ring stage, composed only of a C-element with complementary outputs, which allows us to increase the maximum speed by 25% and to reduce the power consumption by 60% at the maximum frequency. All electrical simulations have been performed using a 65 nm CMOS technology from STMicroelectronics.

1 Introduction

Oscillators, and especially voltage-controlled oscillators, are basic blocks in almost all designs. Indeed, they are employed for generating the clock synchronization signal, for modulating and demodulating signals, or for retrieving signals in noise. The oscillator features depend on the application; however, communication applications often embed their oscillators in Phase-Locked Loops (PLLs) with strong requirements on stability, phase noise and power consumption. Moreover, with advanced nanometric technologies, it is also necessary to deal with the process variability of the technology. Today many studies are oriented towards asynchronous ring oscillators, which present well-suited characteristics for managing process variability and offer an appropriate structure to limit the phase noise. Therefore self-timed rings are considered a promising solution for generating clocks.

In [1], self-timed rings are efficiently used to generate high-resolution timing signals. Their robustness against process variability in comparison to inverter rings is proven in [2]. They can be used as data-driven clocks [3]. Moreover, the frequency of a self-timed ring can easily be configured by controlling its initialization at reset time, whereas inverter rings are not programmable [4]. A fully programmable stoppable oscillator based on self-timed rings is presented in [5].

The goal of this paper is to give the designer some guidelines for using self-timed rings as oscillators depending on the design requirements. The paper is mainly oriented towards phase noise reduction, speed and power consumption. It is organized as follows. Section 2 provides the background and gives some definitions. In Section 3, we present the implementations of the C-element, which is the main component of the ring. In order to target an optimal design of the stage, we used the logical effort method introduced by I. Sutherland et al. [10] together with electrical simulations. We also propose a new self-timed ring stage composed only of a C-element with complementary outputs, and we compare the performance of the different C-element implementations in terms of speed and power consumption.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 137–149, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Self-Timed Rings

The C-element is the basic element in asynchronous circuit design, introduced by D. E. Muller. C-elements set their output to the input values if their inputs are equal and hold their output otherwise. Fig. 1 shows a possible CMOS implementation where the initialization circuit is omitted.

Fig. 1. Muller C-element
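The behavior described above (copy the inputs when they agree, hold otherwise) can be sketched as a small behavioral model. This is only an illustration of the logic function, not of any of the CMOS implementations; the function name and the input sequence are made up for the example.

```python
def c_element(a, b, previous_output):
    """Behavioral Muller C-element: follow the inputs when they are equal,
    otherwise hold the previous output."""
    return a if a == b else previous_output


# Drive the element with a short, arbitrary input sequence and record the output.
output = 0
history = []
for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
    output = c_element(a, b, output)
    history.append(output)

print(history)
```

The output only rises once both inputs are 1 and only falls once both are 0; mixed inputs leave it unchanged.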

Each stage of the self-timed ring (STR) is composed of a C-element and an inverter connected to the input B. The input connected to the previous stage is marked F (Forward), the input connected to the following stage is marked R (Reverse), and C denotes the output of the stage, as shown in Fig. 2.

Fig. 2. Self-Timed Ring

Tokens and bubbles: This subsection introduces the notions of tokens "T" and bubbles "B", which are essential for understanding the behavior of the STR. Stage $i$ contains a token if its output $C_i$ is not equal to the output $C_{i+1}$ of stage $i+1$. Conversely, stage $i$ contains a bubble if its output $C_i$ is equal to the output $C_{i+1}$ of stage $i+1$:

$$C_i = C_{i+1} \Rightarrow \text{Stage}_i = \{\text{Bubble}\} \quad \text{and} \quad C_i \neq C_{i+1} \Rightarrow \text{Stage}_i = \{\text{Token}\}$$

The numbers of tokens and bubbles are denoted NT and NB, respectively. To keep the ring oscillating, NT must be an even number; this is the dual of the requirement that an inverter ring have an odd number of stages. Each stage of the STR contains either a token or a bubble, so NT + NB = N, where N is the number of ring stages.

2.1 Propagation Rules

If a token is present in stage $i$, it propagates to stage $i+1$ if and only if stage $i+1$ contains a bubble. The bubble of stage $i+1$ then moves backward to stage $i$. This implies a transition on the output of stage $i+1$. For example, the token/bubble movements in a five-stage STR that contains four tokens and one bubble are:

TTBTT (01001) → TBTTT (01101) → BTTTT (00101) → TTTTB (10101) → TTTBT (10100) → TTBTT (10110)

After five transitions the token/bubble pattern is back to its initial configuration, with every stage output complemented; this corresponds to half an oscillation period.
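The propagation rule can be checked with a small unit-delay simulation. This is a minimal sketch under simplifying assumptions that are not part of the paper: every stage has the same delay, all enabled stages fire in the same step, and the stage is modeled as a C-element whose R input is inverted. The function names are illustrative.

```python
def token_pattern(outputs):
    """Return the token/bubble string: stage i holds 'T' if C[i] != C[i+1]."""
    n = len(outputs)
    return "".join(
        "T" if outputs[i] != outputs[(i + 1) % n] else "B" for i in range(n)
    )


def step(outputs):
    """Advance the ring by one stage delay (all enabled stages fire together)."""
    n = len(outputs)
    nxt = list(outputs)
    for j in range(n):
        forward = outputs[(j - 1) % n]  # F input: output of the previous stage
        reverse = outputs[(j + 1) % n]  # R input: output of the following stage
        # The C-element sees F and NOT R; they agree exactly when F != R.
        if forward != reverse:
            nxt[j] = forward
    return nxt


def run(outputs, steps):
    """Collect (pattern, outputs) for the initial state and each later step."""
    trace = [(token_pattern(outputs), list(outputs))]
    for _ in range(steps):
        outputs = step(outputs)
        trace.append((token_pattern(outputs), list(outputs)))
    return trace


trace = run([0, 1, 0, 0, 1], steps=5)
for pattern, outputs in trace:
    print(pattern, "".join(map(str, outputs)))
```

Starting from the five-stage state 01001, the model moves the single bubble backward by one stage per step while each token advances in turn.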

2.2 Configurability

The oscillation frequency of an STR depends on its initialization (the numbers of tokens and bubbles). It can be approximated as a function of the numbers of tokens and bubbles by the formula [5]:

$$F_{OSC} = \frac{1}{2 \cdot D \cdot (R + 1)}, \qquad (D, R) = \begin{cases} (D_{rr},\ N_T/N_B) & \text{if } D_{ff}/D_{rr} \le N_T/N_B \\ (D_{ff},\ N_B/N_T) & \text{if } D_{ff}/D_{rr} \ge N_T/N_B \end{cases} \tag{1}$$

where $D_{ff}$ is the static forward propagation delay from input F to the output C and $D_{rr}$ is the static reverse propagation delay from input R to the output C. The maximum frequency is achieved when $D_{ff}/D_{rr} = N_T/N_B$. This equality ensures the evenly spaced propagation mode.
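As a numerical illustration of the frequency approximation, the sketch below evaluates it for given delays and token/bubble counts. It assumes the piecewise form in which the reverse delay $D_{rr}$ governs when $N_T/N_B \ge D_{ff}/D_{rr}$ (few bubbles) and the forward delay $D_{ff}$ governs otherwise (few tokens), which is the reading consistent with the maximum being reached at $D_{ff}/D_{rr} = N_T/N_B$. The function name and the delay values are hypothetical.

```python
def str_frequency(d_ff, d_rr, n_tokens, n_bubbles):
    """Approximate STR oscillation frequency, F = 1 / (2 * D * (R + 1))."""
    if d_ff / d_rr <= n_tokens / n_bubbles:
        # Bubble-limited: stages wait for bubbles, the reverse delay governs.
        delay, ratio = d_rr, n_tokens / n_bubbles
    else:
        # Token-limited: stages wait for tokens, the forward delay governs.
        delay, ratio = d_ff, n_bubbles / n_tokens
    return 1.0 / (2.0 * delay * (ratio + 1.0))


# Hypothetical static delays in seconds (not taken from the paper).
D_FF, D_RR = 20e-12, 30e-12

for n_t, n_b in [(2, 1), (2, 2), (4, 1), (2, 3)]:
    f_osc = str_frequency(D_FF, D_RR, n_t, n_b)
    print(f"{n_t}T/{n_b}B: {f_osc / 1e9:.2f} GHz")
```

With equal unit delays the function reduces to $N_B/(2N)$ or $N_T/(2N)$ stage delays per period, which matches the unit-delay behavior of a five-stage ring with four tokens and one bubble (one full period every ten stage delays).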

2.3 Phase Noise

Noise in the MOS is divided into two main contributors: thermal noise and flicker noise. The thermal noise is responsible for the noise floor at high frequencies while the flicker noise is reflected by a rise in noise at low frequencies. The phenomenon of up-conversion of the amplitude noise in phase noise is complex and has different origins. However, beyond the offset frequency f0/2Qch, HF thermal noise imposes a noise floor. The phase noise is given by the semi-empirical Leeson formula [13]

$$L(f_m) = 10 \times \log\left( \frac{1}{2} \left[ 1 + \left( \frac{f_0}{2 Q_{ch} f_m} \right)^2 \right] \left( 1 + \frac{f_c}{f_m} \right) \left( \frac{F k T_0}{P_s} \right) \right) \tag{2}$$

Where:

- $Q_{ch}$: loaded Q-factor.
- $f_0$: carrier frequency.
- $f_m$: frequency offset.
- $f_c$: corner frequency.
- $F$: noise factor.
- $k$: Boltzmann's constant.
- $T_0$: temperature (290 K).
- $P_s$: signal power.
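The Leeson expression is easy to evaluate numerically. The following minimal sketch assumes the standard form of the formula, $L(f_m) = 10\log_{10}\{\tfrac{1}{2}[1 + (f_0/(2 Q_{ch} f_m))^2](1 + f_c/f_m)(F k T_0/P_s)\}$, with all quantities in SI units; the function name and the numerical parameters are hypothetical and only chosen to show the shape of the curve.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K


def leeson_phase_noise(f_m, f_0, q_ch, f_c, noise_factor, p_s, t_0=290.0):
    """Single-sideband phase noise in dBc/Hz at offset f_m (Leeson model)."""
    resonator = 1.0 + (f_0 / (2.0 * q_ch * f_m)) ** 2  # up-converted noise term
    flicker = 1.0 + f_c / f_m                          # 1/f corner contribution
    floor = noise_factor * K_BOLTZMANN * t_0 / p_s     # thermal noise floor
    return 10.0 * math.log10(0.5 * resonator * flicker * floor)


# Hypothetical oscillator: 5 GHz carrier, loaded Q of 2, 100 kHz corner, 1 mW.
for offset in (1e5, 1e6, 1e7, 1e8):
    noise = leeson_phase_noise(offset, f_0=5e9, q_ch=2.0, f_c=1e5,
                               noise_factor=2.0, p_s=1e-3)
    print(f"offset {offset:.0e} Hz: {noise:.1f} dBc/Hz")
```

Far from the carrier the first two factors tend to 1 and the result settles on the thermal floor $10\log_{10}(F k T_0 / 2 P_s)$.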

The Figure Of Merit (FOM) is a parameter that allows oscillators to be compared by normalizing the phase noise with respect to the oscillation frequency and the power consumption. It is calculated using the equation [14]:

$$FOM = L(f_m) - 20 \log\left( \frac{f_0}{f_m} \right) + 10 \log\left( \frac{P_s}{1\,\mathrm{mW}} \right) \tag{3}$$
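As a worked example of the FOM definition, the sketch below assumes the usual form $FOM = L(f_m) - 20\log_{10}(f_0/f_m) + 10\log_{10}(P_s / 1\,\mathrm{mW})$, with the power given in watts. The function name and the sample operating point are hypothetical.

```python
import math


def figure_of_merit(phase_noise_dbc_hz, f_0, f_m, power_watts):
    """Oscillator FOM in dBc/Hz; lower (more negative) values are better."""
    frequency_term = 20.0 * math.log10(f_0 / f_m)
    power_term = 10.0 * math.log10(power_watts / 1e-3)  # power relative to 1 mW
    return phase_noise_dbc_hz - frequency_term + power_term


# Hypothetical point: -100 dBc/Hz at 1 MHz offset from a 1 GHz carrier, 1 mW.
fom = figure_of_merit(-100.0, f_0=1e9, f_m=1e6, power_watts=1e-3)
print(f"FOM = {fom:.1f} dBc/Hz")
```

Doubling the power at the same phase noise and frequency degrades the FOM by about 3 dB, which is why the metric is useful for comparing implementations that trade power for noise.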

Fig. 3. Up-conversion of noise in oscillators

3 C-Element Implementations

As the C-element is the main component of the self-timed ring, it is essential to study it in order to find the most suitable implementation for a given application and set of specifications. This section presents different implementations of the C-element and compares them in terms of power consumption, frequency and phase noise. The C-elements are also studied in order to derive design rules that optimize these cells in terms of speed and phase noise, by applying the "logical effort" model introduced by I. Sutherland et al. [10] and by simulations using the 65 nm CMOS technology from STMicroelectronics.

In addition to the dynamic implementation [11], there are three different static implementations of the C-element in the literature: the weak-feedback implementation by Martin [7], the conventional implementation by Sutherland [8] and the symmetric implementation by Van Berkel [9].

The dynamic implementation (Fig. 4.a) is composed of the main transistor tree of the C-element and an output inverter. These transistors, called "switchers", contribute to the switching of the output. The static implementations add to the "switchers" a mechanism for memorizing the output value; these transistors are called "keepers". The "keepers" are not active during switching. They provide feedback that keeps the output state when the input values are different, so they are made as small as possible to reduce their load and limit the race problem [11].


Fig. 4. C-element implementations: Dynamic (a), weak feedback (b), conventional(c) and Symmetric (d)

The weak feedback implementation of the C-element is shown in Fig. 4.b. It is composed of the same "switchers" as the dynamic one, plus a weak feedback inverter (N4 and P4) that maintains the state of the output. This circuit suffers from a race problem at node C'.

In the conventional implementation (see Fig. 4.c), in addition to the weak feedback inverter, four additional transistors (N5, N6, P5 and P6) disconnect this inverter when the inputs are equal. N4, N5, N6, P4, P5 and P6 are sized at the minimal width allowed by the technology.

The C-element introduced by Van Berkel is illustrated in Fig. 4.d. This implementation is slightly different from the previous ones. The transistors are split in two parts. The "keepers" are N6 and P6, and the split transistors are also involved in holding the state.

4 Design of the Ring Stages

4.1 Designing with the Logical Effort Method

The first step is to find the best way to design the stage of the self-timed ring, which is composed of the C-element and an inverter. To do this, we applied the "logical effort" method introduced by I. Sutherland et al. [10]. This method allows us to optimize the stage speed. We expect that optimizing the speed will also optimize the phase noise.

Table 1. Key definitions of logical effort

Term               Stage expression                  Path expression
Logical effort     $g$                               $G = \prod g_i$
Electrical effort  $h = C_{out}/C_{in}$              $H = C_{out\text{-}path}/C_{in\text{-}path}$
Branching effort   $b_i = C_{Total}/C_{used}$        $B = \prod b_i$
Effort             $f = g h$                         $F = G B H$
Stage effort                                         $\hat{f} = F^{1/N}$

The logical effort $g$ captures the effect of the logic gate's topology on its ability to produce output current. The electrical effort $h$ describes how the electrical environment of the logic gate affects its performance and how the size of the transistors in the gate determines its load-driving capability. The branching effort $b$ describes the fan-out of the gate. The output of a self-timed ring stage is connected to the F input of the following stage and to the R input of the previous stage. Therefore the output capacitance of the stage is:

$$C_{out} = C_R + C_F \tag{1}$$

where

$$C_F = (1 + \gamma) \cdot W_n \tag{2}$$

and

$$C_R = U_2 \times (1 + \gamma) \cdot W_n \tag{3}$$

$C_F$, $C_R$ and $C_{out}$ are respectively the F input, R input and output capacitances of the stage. $W_n$ is the NMOS transistor width, $\gamma$ is the PMOS/NMOS width ratio, and $U_1$ and $U_2$ are the factors by which $W_n$ is scaled in the output inverter and in the input inverter, respectively. Hence:

$$C_{out} = (U_2 + 1) \times (1 + \gamma) \cdot W_n \tag{4}$$

We start with the path $R \to C$. This path is composed of three sub-stages: the input inverter, the main tree of the C-element and the output inverter.

The electrical effort of the path is:

$$H_{R \to C} = \frac{C_{out}}{C_{in}} = \frac{C_R + C_F}{C_R} = 1 + \frac{C_F}{C_R} = 1 + \frac{1}{U_2} \tag{5}$$

The branching effort is: $B = \prod b_i = 1 \times 1 \times 1 = 1$

The logical effort is: $G = \prod g_i = 1 \times 2 \times 1 = 2$

The effort of the path $R \to C$ ($D_{rr}$) is:

$$F = G \times B \times H = 2 \times \left( 1 + \frac{1}{U_2} \right) \tag{6}$$

Fig. 5. Self-Timed Ring Stage

The stage effort that gives the minimum delay is:

$$\hat{f} = F^{1/N} = \left( 2 + \frac{2}{U_2} \right)^{1/3} \tag{7}$$

To obtain the minimum delay, each sub-stage must respect the relation $C_{in} = g_i \times C_{out} / \hat{f}$, that is, $C_{in}/C_{out} = g_i / \hat{f}$. Applying this rule to our circuit, we find:

$$\frac{C_{in2}}{C_{out}} = \frac{1}{\left( 2 + \frac{2}{U_2} \right)^{1/3}} = \frac{(1 + \gamma) \cdot U_1 \cdot W_n}{(U_2 + 1) \times (1 + \gamma) \cdot W_n} = \frac{U_1}{1 + U_2} \tag{8}$$

$$\frac{C_{in1}}{C_{in2}} = \frac{2}{\left( 2 + \frac{2}{U_2} \right)^{1/3}} = \frac{(1 + \gamma) \cdot W_n}{(1 + \gamma) \cdot U_1 \cdot W_n} = \frac{1}{U_1} \tag{9}$$

$$\frac{C_{in}}{C_{in1}} = \frac{1}{\left( 2 + \frac{2}{U_2} \right)^{1/3}} = \frac{(1 + \gamma) \cdot U_2 \cdot W_n}{(1 + \gamma) \cdot W_n} = U_2 \tag{10}$$

$C_{in}$, $C_{in1}$ and $C_{in2}$ are respectively the input capacitances of the input inverter, of the main tree of the C-element and of the output inverter. From equation (10), we find that $U_2 = 0.56$, and from equations (8) and (9) we find that $U_1 = 0.89$.

The path $F \to C$ ($D_{ff}$) is composed of two sub-stages: the main tree of the C-element and the output inverter.

$$C_{out} = (U_2 + 1) \times (1 + \gamma) \cdot W_n \tag{11}$$

The electrical effort of the path is:

$$H_{F \to C} = \frac{C_{out}}{C_{in}} = \frac{C_R + C_F}{C_F} = 1 + \frac{C_R}{C_F} = 1 + U_2 \tag{12}$$

The branching effort is: $B = \prod b_i = 1 \times 1 = 1$

The logical effort is: $G = \prod g_i = 1 \times 2 = 2$

The effort of the path $F \to C$ is: $F = G \times B \times H = 2 \times (1 + U_2)$

The stage effort that gives the minimum delay is: $\hat{f} = F^{1/N} = (2 + 2 \cdot U_2)^{1/2}$

$$\frac{C_{in2}}{C_{out}} = \frac{1}{(2 + 2 \cdot U_2)^{1/2}} = \frac{(1 + \gamma) \cdot U_1}{(U_2 + 1) \times (1 + \gamma)} = \frac{U_1}{1 + U_2} \tag{13}$$

$$\frac{C_{in1}}{C_{in2}} = \frac{2}{(2 + 2 \cdot U_2)^{1/2}} = \frac{(1 + \gamma)}{(1 + \gamma) \cdot U_1} = \frac{1}{U_1} \tag{14}$$

We found that U1 = 0.89 and U2 = 0.56 are solutions of these two equations. So we have the same constraints on the two paths.
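The sizing factors can be recovered numerically. The sketch below is a reconstruction under stated assumptions, not the authors' procedure: it takes the stage effort of the three-stage path R → C as $(2 + 2/U_2)^{1/3}$, that of the two-stage path F → C as $(2 + 2U_2)^{1/2}$, assumes the input-inverter condition reads $U_2 = 1/\hat{f}$ and the output-inverter condition reads $U_1 = \hat{f}/2$, and solves for $U_2$ by bisection. All function names are illustrative.

```python
def stage_effort_reverse(u2):
    """Stage effort of the three-stage path R -> C: (2 + 2/U2)^(1/3)."""
    return (2.0 + 2.0 / u2) ** (1.0 / 3.0)


def stage_effort_forward(u2):
    """Stage effort of the two-stage path F -> C: (2 + 2*U2)^(1/2)."""
    return (2.0 + 2.0 * u2) ** 0.5


def solve_u2(lo=0.1, hi=1.0, tol=1e-12):
    """Bisection on U2 - 1/f(U2) = 0, the assumed input-inverter condition."""
    def residual(u):
        return u - 1.0 / stage_effort_reverse(u)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(lo) * residual(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


u2 = solve_u2()
u1_reverse = stage_effort_reverse(u2) / 2.0  # from the R -> C path
u1_forward = stage_effort_forward(u2) / 2.0  # from the F -> C path

print(f"U2 = {u2:.3f}")
print(f"U1 from R->C = {u1_reverse:.3f}, U1 from F->C = {u1_forward:.3f}")
```

Under these assumptions both paths yield the same $U_1$, and the values land within about 0.01 of the quoted $U_1 = 0.89$ and $U_2 = 0.56$, which is consistent with the observation that the two paths impose the same constraints.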

4.2 Designing with Electrical Simulations

To check the efficiency of the logical effort technique, we carried out simulations with the Eldo RF simulator in the 65 nm CMOS technology from STMicroelectronics. The goal is to find the design rules for sizing the ring stage in order to optimize its speed. We simulated a few examples of rings with different implementations of the C-element and compared the performance of the four implementations presented in Section 3. For a given current consumption and for each value of the pair (U1, U2), we extracted the frequency, the phase noise, the FOM and the area. We then performed simulations for various token/bubble configurations.

Fig. 6. The frequency (U1, U2)

Fig. 6 shows the simulated frequency as a function of the U1 and U2 parameters. We note that there is an optimal point for speed. The following table presents the optimal point for different token/bubble combinations.

Table 2. Optimal frequency

Ring               Optimum frequency
                   U1     U2
3 stages 1B/2T     1      0.9
4 stages 2B/2T     1      0.9
5 stages 1B/4T     1      0.9
5 stages 3B/2T     0.9    0.5

Note that the optimal point for the first three cases is $U_1 = 1$ and $U_2 = 0.9$. For the case 3B/2T, the optimal point is located at $U_1 = 0.9$ and $U_2 = 0.5$. It is the only ring that corresponds to the results obtained with the logical effort method.

In the cases 1B/2T, 2B/2T and 1B/4T, the optimization is done on a single path, $D_{rr}$. The ratio $N_T/N_B = D_{ff}/D_{rr}$ (which corresponds to the highest frequency) cannot be achieved because it requires $D_{ff}$ to be greater than or equal to $D_{rr}$, which is impossible with the proposed structure. In these cases, the algorithm seeks to optimize the path $R \to C$, treating the input F only as a capacitance, since $D_{ff}$ does not act on the oscillation frequency. This explains why the values of U1 and U2 differ from those obtained with the "logical effort" method. In the case 3B/2T, $N_T/N_B = D_{ff}/D_{rr}$ (maximal frequency) is easily reached, and the optimization is done on both paths $R \to C$ and $F \to C$.
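The argument above amounts to a simple feasibility check on the token-to-bubble ratio. The sketch below assumes, as stated in the text, that the forward delay is strictly smaller than the reverse delay in this stage structure, so the evenly spaced mode requires $N_T/N_B < 1$. The function name and the bound are illustrative.

```python
from fractions import Fraction


def can_reach_evenly_spaced(n_tokens, n_bubbles, max_delay_ratio=Fraction(1)):
    """True if N_T/N_B can equal a ratio D_ff/D_rr strictly below the bound."""
    return Fraction(n_tokens, n_bubbles) < max_delay_ratio


# (tokens, bubbles) for the four simulated rings.
CONFIGS = {
    "3 stages 1B/2T": (2, 1),
    "4 stages 2B/2T": (2, 2),
    "5 stages 1B/4T": (4, 1),
    "5 stages 3B/2T": (2, 3),
}

feasible = {name: can_reach_evenly_spaced(t, b) for name, (t, b) in CONFIGS.items()}
for name, ok in feasible.items():
    print(f"{name}: evenly spaced mode reachable = {ok}")
```

Only the 3B/2T ring passes the check, which matches the single configuration whose simulated optimum agrees with the logical effort result.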

Fig. 7 represents the frequency vs. power consumption diagram in the optimal case for the four C-element implementations, for a five-stage ring with two tokens and three bubbles. The power has been computed with values of wn between 0.12 μm and 3 μm. We performed this simulation with other rings and for different values of γ = wp/wn. The conclusions were the same. The symmetric implementation is a good compromise between low power consumption and robust circuit behavior. For high-speed applications, the dynamic implementation is a good choice, while the conventional and weak feedback implementations allow us to reach lower frequencies.

Fig. 7. Power Vs. Freq. in STR (3B/2T)

We also extract the phase noise and FOM as a function of the U1 and U2 parameters. The optimal frequency corresponds to the optimal FOM for four different rings. This confirms our initial hypothesis. Moreover, this optimal point always involves a very small area. The highest frequency that we can achieve with this structure of self-timed rings is around 6.6 GHz with the dynamic implementation in the 65 nm CMOS technology from STMicroelectronics.

In order to improve the performance of the self-timed ring, we propose a modified ring stage. The modified stage is simply a C-element, without the R input inverter. We just interconnect the ring structure with the complementary outputs C and C’.

4.3 Modified Self-Timed Ring Stage

Fig. 8 represents our modified self-timed ring. For each stage, the output C is connected to the input F of the following stage, and the complementary output C’ is connected to the input R of the previous stage. This modified self-timed ring stage allows us to improve the maximal speed by 25%, to reduce the power consumption by 55% at the maximum frequency, and to reduce the power consumption per bubble or token by 30%. With such a modified structure we can achieve a maximal frequency of 8.3 GHz with the symmetric implementation in 65 nm CMOS (see Table 3).


Fig. 8. Optimized self-timed ring stage

Table 3. Frequency and Power with various T/B configurations

              Modified   Classical   Modified
Config.       2T/1B      2T/3B       2T/3B
Freq. (GHz)   7.9        6.4         6.1
Power (μW)    398        892         698

Fig. 9. Power vs. Frequency (GHz) in modified STR

Fig. 9 represents the frequency vs. power consumption diagram with the modified STR stage. The behavior is the same as that of the classical STR, with one main difference: the performance of the symmetric implementation is very close to, or even better than, that of the dynamic one when wn is large enough. This improvement is due to the symmetric implementation being divided into two sub-trees. Indeed, with a dynamic implementation, the PMOS and NMOS transistors reach their saturation delay earlier than the transistors of the symmetric implementation, and for large wn the effect of the "keepers" on this delay becomes negligible. In addition, the symmetric implementation ensures better operating conditions for the C-element at lower speed.

4.4 Performances Comparison

Fig. 10 shows the Figure Of Merit (FOM) as a function of the wn value. This figure shows that the noise performance of the weak feedback implementation is worse than that of the other implementations. Notice that the conventional implementation is slightly better in most cases. We can also see in Fig. 11 that, for a given frequency, the phase noise is better in the conventional implementation than in the weak feedback implementation, even though the latter consumes more power.

Fig. 10. FOM Vs. wn in Power in STR

Fig. 11. PN vs. Frequency in Power in STR

Table 4. Comparison between the four implementations

Implementation    Speed    Power Consu.   Phase noise   FOM    Frequency range
Dynamic           High     Low            High          Low    Short
Symmetric         High     Low            High          Low    Short
Conventional      Medium   Medium         Low           Low    Medium
Weak feed-back    Low      High           Medium        High   Large


As we can see from Fig. 7 and Fig. 9, the weak feedback implementation has a large frequency range. In contrast, the symmetric and dynamic implementations have a short one. Moreover, the weak feedback implementation is able to reach low frequencies at a low area cost. Table 4 summarizes the comparison between the implementations. We note that this comparison holds for both the classical and the modified stages.

5 Conclusions

This paper addresses the difficult problem of designing self-timed ring oscillators targeting low-phase-noise applications. The self-timed ring is chosen as the oscillator core because of its known advantages in many respects: configurability, accuracy, robustness against process variability, etc. We compared the C-element implementations in terms of speed, power consumption and phase noise. We conclude that the symmetric implementation is a good trade-off between low power and robust behavior of the C-element. The conventional and weak feedback implementations allow us to reach lower frequencies at a low area cost. For low-phase-noise applications, we strongly recommend avoiding weak feedback implementations; for this goal, the conventional implementation seems to be the best choice. We also proposed a new self-timed ring stage, composed only of a simple C-element with its complementary output, which allows us to increase the maximum speed by 30% and to reduce the power consumption by 60% at the maximal frequency. Moreover, these implementations (classical and modified) take advantage of the STR programmability, which gives more flexibility to the designer. We also suggested design rules to reduce the phase noise in STRs. This work will be completed by a circuit fabrication and test chip measurements.

References

[1] Ebergen, J.C., Fairbanks, S., Sutherland, I.E.: Predicting performance of micro-pipelines using Charlie diagrams. In: ASYNC 1998, San Diego, CA, USA, pp. 238–246. IEEE, Los Alamitos (April 1998)
[2] Fairbanks, S., Moore, S.: Analog micropipeline rings for high precision timing. In: ASYNC 2004, Crete, Greece, pp. 41–50. IEEE, Los Alamitos (April 2004)
[3] Mullins, R., Moore, S.: Demystifying Data-Driven and Pausible Clocking Schemes. In: ASYNC 2007, Berkeley, California, USA, pp. 175–185. IEEE, Los Alamitos (March 2007)
[4] Hamon, J., Fesquet, L., Miscopein, B., Renaudin, M.: High-Level Time-Accurate Model for the Design of Self-Timed Ring Oscillators. In: ASYNC 2008, Newcastle, UK, pp. 29–38. IEEE, Los Alamitos (April 2008)
[5] Yahya, E., Elissati, O., Zakaria, H., Fesquet, L., Renaudin, M.: Programmable/Stoppable Oscillator Based on Self-Timed Rings. In: 15th IEEE Symposium on ASYNC 2009, Chapel Hill, USA, May 17-20, pp. 3–12 (2009)
[6] Winstanley, A., Greenstreet, M.R.: Temporal properties of self timed rings. In: CHARME 2001, London, UK, pp. 140–154. Springer, Heidelberg (April 2001)
[7] Martin, A.J.: Formal program transformations for VLSI circuit synthesis. In: Dijkstra, E.W. (ed.) Formal Development of Programs and Proofs, pp. 59–80. Addison-Wesley, Reading (1989)
[8] Sutherland, I.E.: Micropipelines. Commun. ACM 32, 720–738 (1989)
[9] Berkel, K.v., Burgess, R., Kessels, J., Peeters, A., Roncken, M., Schalij, F.: A fully-asynchronous low-power error corrector for the DCC player. IEEE J. Solid-State Circuits 29, 1429–1439 (1994)
[10] Sutherland, I., Sproull, B., Harris, D.: Logical Effort: Designing Fast CMOS Circuits. Morgan Kaufmann, San Francisco (1999)
[11] Shams, M., Ebergen, J.C., Elmasry, M.I.: Optimizing CMOS implementations of C-element. In: Proc. Int. Conf. Comput. Design (ICCD), pp. 700–705 (October 1997)
[12] Razavi, B.: A Study of Phase Noise in CMOS Oscillators. IEEE Journal of Solid-State Circuits 31(3) (March 1996)
[13] Leeson, D.B.: A simple model of feedback oscillator noise spectrum. Proc. IEEE 54, 329–330 (1966)
[14] Bunch, R.L.: A Fully Monolithic 2.5GHz LC Voltage Controlled Oscillator in 0.35μm CMOS Technology. Master of Science in Electrical Engineering, Virginia Polytechnic Institute and State University, pp. 1–7 & 53–72 (April 2001)
[15] Hajimiri, A., Limotyrakis, S., Lee, T.H.: Jitter and phase noise in ring oscillators. IEEE Journal of Solid-State Circuits 34(6), 790–804 (1999)

Hermes-A – An Asynchronous NoC Router with Distributed Routing

Julian Pontes, Matheus Moreira, Fernando Moraes, and Ney Calazans

Faculty of Informatics, PUCRS, Porto Alegre, Brazil {julian.pontes,matheus.moreira,fernando.moraes, ney.calazans}@pucrs.br

Abstract. This work presents the architecture and ASIC implementation of Hermes-A, an asynchronous network on chip router. Hermes-A is coupled to a network interface that enables communication between router and synchronous processing elements. The ASIC implementation of the router employed standard CAD tools and a specific library of components. Area and timing characteristics for 180nm technology attest the quality of the design, which displays a maximum throughput of 3.6 Gbits/s.

Keywords: asynchronous circuits, network on chip.

1 Introduction

Interest in asynchronous circuits has increased due to the growing limitations faced during the design of synchronous System on a Chip (SoC) circuits, which often result in over-constrained design and operation [1]. However, asynchronous computer aided design (CAD) tools still have to undergo a long evolutionary path before being accepted by most designers. The lack of such tools makes it difficult for traditional circuit designers to access the full capabilities of asynchronous circuits.

Globally Asynchronous Locally Synchronous (GALS) design techniques may help to overcome limitations of synchronous design while maintaining a mostly synchronous design flow [2]. GALS techniques simplify the task of reaching overall timing closure for SoCs, but typically require the addition of synchronization interfaces between each pair of communicating modules.

Synchronization interfaces bring a new set of design concerns, including metastability-free operation and keeping latency and throughput figures at acceptable levels when traversing several synchronization points. A good approach is to reduce the number of synchronization points as much as possible, to achieve better data transfer rates and improve overall robustness. One way to reduce this number in a complex GALS SoC is to employ fully asynchronous communication mechanisms.

Communication in current and future SoCs relies on the use of Networks on Chip (NoCs) [3]. When a fully asynchronous NoC is used as the communication architecture for a SoC composed of synchronous processing elements (PEs), the number of synchronizations involved in a single point-to-point data transfer is reduced to two: one at the sender-NoC interface and another at the NoC-receiver interface. This paper describes the design and implementation of an asynchronous NoC router that can support the implementation of fully asynchronous NoCs.

The rest of this paper is divided into five Sections. Section 2 describes related work and positions the new proposition with regard to it. Section 3 describes the architecture of the Hermes-A router, while Section 4 explores the characteristics of the router to PE interface. Section 5 discusses the ASIC implementation of Hermes-A and Section 6 presents conclusions and directions for further work.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 150–159, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Related Work

During this decade there has been a small, yet steady movement towards research and implementation of fully asynchronous routers and corresponding NoCs. An encompassing review of the state of the art revealed ten relevant propositions of fully asynchronous interconnect architectures. Table 1 summarizes the main features of each of these, with the last row of the Table presenting the features of the proposed Hermes-A router and NoC.

Table 1 is organized chronologically by the date of the first proposition of each interconnect architecture, although in some cases it cites later papers, where updated data about the NoC is present.

Chain and RasP belong to a first generation of asynchronous interconnect frameworks, based on the careful design of point-to-point links using repeaters, pipelining and wire length control. To support implementation, both offer a set of asynchronous components (the so-called routers, arbiters and multiplexers) that permit sharing the point-to-point links from multiple sources to one destination. Nexus is a very efficient industrial implementation of an asynchronous (16x16) crossbar. Strictly speaking, none of these three architectures really agrees with the most accepted definition of NoCs as a network of multi-port routers and wires organized in a topology that forwards packets of information among processing elements. Accordingly, all three should display scalability problems as the number of PEs grows without bounds, which is expected for future technologies.

Another group of works includes the propositions of Quartana et al. and the asynchronous version of the Proteo NoC. These are experiments in prototyping asynchronous NoCs in FPGAs, with the corresponding lack of performance and prohibitive cost in area. Implementations of asynchronous devices in FPGAs that are more efficient than those cited in these works exist, as described in [14]. These rely on the use of FPGA layout and timing control tools to create asynchronous devices as FPGA hard macros that are compact and respect tight timing constraints. However, so far these have not been used for NoCs.

The remaining five NoCs/routers in Table 1 (QoS, MANGO, asynchronous QNoC, ANoC, ASPIN) and Hermes-A propose ASIC implementations of routers and links for 2D mesh topologies, although in some cases there is mention of suitability for other topologies as well. This is not the case for ASPIN, because of the chosen router organization. In this NoC, the router ports are distributed around the periphery of the PE, making inter router links small compared to intra router links. This facilitates connection of PEs by abutment, but prevents easy use of topologies other than 2D mesh. Even a similar 2D torus would be problematic to build in this case.

Table 1. A comparison of fully asynchronous interconnection networks and/or routers for GALS SoCs. Legend: A2S, S2A – Async. to Sync./Sync. to Async., As. – Asynchronous, BE – Best Effort service, DI – delay insensitive, GS – guaranteed service, Irreg/Reg – Irregular/Regular, N.A. – Information Not Available, OCP – Open Core Protocol, VC – virtual channel.

NoC | Topology | Routing / Flow Control | Network Interface | Asynchronous Style | Links and encoding | Implementation
Chain [4] | Framework / point-to-point (Irreg/Reg) | Source / EOP | Ad hoc | QDI / pipelined | Point-to-point, 1-of-4 DI / 8-bit flits | 180nm, 1Gbits/s per link, ASIC
QoS [5] | 2D Mesh, 4 VCs (3GS/1BE) | XY / wormhole / credit-based | N.A. | QDI | 1-of-4 DI / 8-bit flits | Simulation only
Nexus [6] | Single 16x16 Crossbar | Source / BOP-EOP | A2S, S2A clock converters | QDI / 1- | 1-of-4 DI / 36-bit phits | 130nm, 780Gbits/s, ASIC
MANGO [7] | 2D Mesh (Irreg/Reg), 4GS/1BE VCs | Source | A2S, S2A, OCP | 4-phase bundled-data | Dual-rail, 2-ph. DI / 33-bit flits | 130nm, 650Mflits/s, ASIC
As. QNoC [8] | 2D Mesh (Irreg/Reg) with 8VCs | Source / wormhole / credit-based preemption | N.A. | 4-phase bundled-data | 10-bit flits | 180 nm, 200Mflits/s, ASIC
Quartana et al. [9] | Crossbar or Octagon | N.A. | Self-timed FIFOs | QDI | N.A. | FPGA, 56 Mflits/s
ANoC [10] | 2D Mesh (Irreg/Reg) / 2 VCs | Source / odd-even / wormhole | A2S, S2A FIFOs | QDI | 34-bit flits | 65nm, 550Mflits/s, ASIC
As. Proteo [11] | Bidirectional Ring | Oblivious | OCP | QDI / 4-phase dual-rail | 32-bit flits | FPGA, 202 Kbits/s
RasP [12] | Framework / point-to-point (Irreg/Reg) | Source / bit serial | Ad hoc | Dual-rail | Point-to-point pipelined serial links | 180nm, 700Mbits/s, Simulation
ASPIN [13] | 2D Mesh (Reg) | Distributed XY / wormhole / EOP | A2S, S2A FIFOs | Bundled-data | Dual-rail, 4-ph., 34-bit flits | 90nm, 714Mflits/s
Hermes-A | 2D Mesh (Reg) | Distributed XY / wormhole / BOP-EOP | Dual-Rail SCAFFI [14] | Dual-rail / bundled data | Dual-Rail | 180nm, 727Mbits/s, ASIC

Four of the NoCs (QoS, MANGO, asynchronous QNoC, ANoC) claim support for quality of service through the use of virtual channels and/or special circuits (GS routers). ANoC is the most developed of the proposals and presents the best overall performance. It has been successfully used to build at least two complete integrated circuits [15]. However, most of the characterization of ANoC (and of other asynchronous NoCs) derives from a detailed knowledge of the target application. If the application has unpredictable dynamic behavior, it is fundamental to employ a more flexible approach to topology choice and routing, and to incorporate the capacity to take decisions based on dynamic information about the network. These are some of the reasons behind the proposal of Hermes-A, described in the next Sections.

3 The Hermes-A Router Architecture

Unlike most other asynchronous routers, Hermes-A employs a distributed routing scheme, where the router itself decides which path incoming packets will follow. This enables the use of adaptive routing algorithms and, more importantly, the router may employ these algorithms to solve network congestion problems in real time.

Another characteristic of Hermes-A is that it uses independent arbitration at each router port. The reason for this design choice is to allow dynamic voltage level schemes to assign distinct voltage levels to distinct paths along a NoC. Such a fine grained voltage level resolution can be quite useful to fulfill the power-performance constraints that are so frequent in SoCs.

Distributed routing and scheduling are characteristics shared by Hermes-A and ASPIN. The differences between these NoCs are the lumped router design of Hermes-A, which facilitates the use of the router in topologies other than 2D meshes, and the concern for designing the router to support multiple voltage levels and adaptive routing algorithms.

A traditional 2D mesh topology NoC with wormhole packet switching is the test environment used to validate the Hermes-A router. Each router in the experimental setup comprises up to five ports: East, West, North, South and Local. As usual in direct NoCs, the Local port is responsible for the communication between the NoC and its local PE. All experiments described herein assume the use of 8-bit flits. The packet format is extremely simple: the first flit contains the XY address of the destination router and the subsequent flits contain the packet payload. Two sideband signals control the transfer of packets and support arbitrary-size packets: begin of packet (BOP), activated with the first flit of a packet, and end of packet (EOP), activated with the last flit. All intermediate flits display BOP=EOP=0. Most of the router architecture employs a delay insensitive, 4-phase, dual-rail encoding.
Note that each input port interface consists of 21 wires: 16 wires carry the 8-bit dual-rail flit value (DR-Data), four wires carry the dual-rail BOP and EOP information and the last is the single-rail acknowledge signal. The router detects data availability when every pair of wires that defines a bit value in the DR-Data signal is distinct from “00”. Thus, the all 0’s value in DR-Data is the spacer for the DI code.

A. Input Port

Figure 1 depicts the Hermes-A input port structure as a simplified asynchronous dataflow diagram [16]. There are three alternative paths in this module, one used for the first flit (1), one for intermediate flits (2) and one for the last flit (3). In Figure 1 two wires represent each bit. Thus, a 10-bit path is in fact a 20-wire bus.

When BOP is signaled at the input port, the first demux selects the path that feeds the module responsible for computing the path to use. This module outputs the ten information bits that are forwarded (8 data bits plus EOP and BOP), plus four destination bits using dual-rail one-hot encoding. Note that just the bit associated with the selected path is enabled in this 4-bit code. Since the routing decision must be kept for all flits in a packet, a loop was added to register the decision. The loop appears in Figure 1 as a chain of three asynchronous registers (4) in order to enable the data flow inside the 4-phase dual-rail loop. Every two successive asynchronous stages communicate using an individual handshake operation [16]. Thus, in this kind of circuit it is not possible for three successive stages to exchange two data items simultaneously. Exactly three stages are the minimum necessary to propagate information circularly. Fewer than three stages lead to a deadlock situation. This can be better understood by remembering that between every two valid data items there is always a spacer, and that before propagating a spacer the first data item must be copied to the next stage.

Fig. 1. Hermes-A router input port architecture. All paths employ dual-rail encoding.
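As a minimal sketch of the dual-rail input port interface described at the start of this Section, the following Python fragment encodes an 8-bit flit, checks validity and counts wires. The rail assignment (1 as the pair (1, 0), 0 as the pair (0, 1)) and all function names are assumptions made for illustration; the text only fixes the spacer value "00" and the total of 21 wires.

```python
# Sketch of 4-phase dual-rail flit encoding (illustrative, not the authors' code).
# Assumption: each bit is a (true_rail, false_rail) pair, 1 -> (1, 0), 0 -> (0, 1).
# The spacer of the DI code is (0, 0) on every pair, as stated in the text.

SPACER_BIT = (0, 0)


def encode_bit(bit):
    """Return the dual-rail pair for one bit (assumed rail order)."""
    return (1, 0) if bit else (0, 1)


def encode_flit(value, bop, eop, width=8):
    """Return (DR-Data, dual-rail BOP, dual-rail EOP) for one flit."""
    dr_data = [encode_bit((value >> i) & 1) for i in range(width)]
    return dr_data, encode_bit(bop), encode_bit(eop)


def is_valid(dr_data):
    """Data is available when every rail pair differs from the spacer."""
    return all(pair != SPACER_BIT for pair in dr_data)


def decode_flit(dr_data):
    """Recover the integer value from a valid dual-rail word."""
    return sum(pair[0] << i for i, pair in enumerate(dr_data))


def wire_count(width=8):
    """2 rails per data bit, 2 rails each for BOP and EOP, 1 acknowledge."""
    return 2 * width + 2 * 2 + 1
```

For instance, `encode_flit(0xA5, bop=1, eop=0)` would model the first flit of a packet, and a word made only of `(0, 0)` pairs models the spacer that separates two consecutive flits.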

After computing the output port to which the incoming flit must be sent, the rightmost module in Figure 1 (Output demux) sends the flit, based on the 4-bit routing information. Subsequent flits in a packet go through the lower output of the leftmost demux and are input to a second demux after the fork element. This demux looks at the EOP bit before choosing the right direction for each flit. If there is no EOP indication the flit follows path (2) to the first merge component. Otherwise, the S-Control module is used. The next Sections cover the behavior of the Path Calculation and S-Control modules.

a) Path Calculation

The basic route computation architecture is depicted in Figure 2. In direct 2D topologies like 2D mesh or 2D torus, each router is identified by two values, its X and Y coordinates. The first flit of a packet carries the destination X address in the four least significant bits and the destination Y address in the four most significant bits. When a flit is accompanied by an active BOP signal it feeds the Path Calculation module. This flit arrives at the input of a completion detector (CD). Detection of a valid dual-rail data token causes the propagation of the destination X and Y coordinates to two subtraction circuits. The outputs of these circuits determine the path the packet must follow.

Fig. 2. Hermes-A Path Calculation circuit

If both subtractions result in 0, then the packet has reached the target router and it proceeds to the Local port. For the XY routing algorithm, if the X axis subtraction is different from zero, the packet will follow either to East or West, depending only on the sign of the result (positive and negative, respectively). If the X subtraction result is 0 but the Y subtraction is not, the packet may follow to North or South, depending again only on the sign of the result (positive and negative, respectively). The Routing Logic module is purely combinational logic that produces the resulting one-hot dual-rail 4-bit packet destination code. It indicates the output port to use.

b) S-Control

When the last flit of a packet is received (EOP=1), it is directed to the S-Control module (see Figure 1). The S-Control protocol description appears in Figure 3.

Fig. 3. State machine for the S-control module

The function of this module is to send the last flit through the output marked A in Figure 1, and then send a kill token through the output marked B to indicate the end of a packet transmission. This has the effect of de-allocating the output currently reserved for this packet. To avoid defining a new dual-rail signal, the unused code BOP=EOP=1 is employed internally to the router to signal this situation. Two circuits interpret this code: the allocated output port and the one that controls the chain (4) of asynchronous registers (not explicit in Figure 1). The latter, upon receiving the code, empties the chain using spacers. Since asynchronous circuits rely on an explicit local handshake between every pair of communicating modules, the S-Control only generates an acknowledge signal to the previous demux after receiving the acknowledge signals for both the A and B outputs. Completeness detectors produce all request signals. The Petrify tool was used to synthesize the equations that implement a speed-independent controller operating as the state machine in Figure 3.
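To make the Path Calculation step of the input port concrete, the XY decision rule described in subsection a) can be sketched as follows. This is only an illustrative software model: the function names are hypothetical, and the choices that each subtraction is "destination minus current router" and that East and North correspond to increasing X and Y are assumptions not stated explicitly in the text.

```python
# Illustrative model of the XY path calculation (not the authors' circuit).
# Assumptions: subtraction is destination minus current router coordinate;
# East increases X and North increases Y.

STEP = {"East": (1, 0), "West": (-1, 0), "North": (0, 1), "South": (0, -1)}


def xy_route(header_flit, router_x, router_y):
    """Return the output port selected for a packet's first flit."""
    dest_x = header_flit & 0x0F          # four least significant bits
    dest_y = (header_flit >> 4) & 0x0F   # four most significant bits
    dx = dest_x - router_x
    dy = dest_y - router_y
    if dx > 0:
        return "East"
    if dx < 0:
        return "West"
    if dy > 0:
        return "North"
    if dy < 0:
        return "South"
    return "Local"


def trace(header_flit, x, y):
    """List the ports taken hop by hop until the packet reaches Local."""
    hops = []
    while True:
        port = xy_route(header_flit, x, y)
        hops.append(port)
        if port == "Local":
            return hops
        step_x, step_y = STEP[port]
        x, y = x + step_x, y + step_y
```

Under these assumptions, a header flit `0x12` (destination X=2, Y=1) injected at router (0, 0) first exhausts the X offset and only then moves along Y, which is the defining property of XY routing.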

B. Output Port

In the Hermes-A router each output port receives four data flows. For instance, Figure 4 shows the Local output port structure that receives data from input ports North, South, East and West.

[Figure 4 labels: input flows North, South, East and West (bits), each with DR Out and DR Control blocks; Output DR Mux; DataOut; Arbiter; C; AckNorth, AckSouth, AckEast, AckWest, AckIn]

Fig. 4. Local output port structure. Dashed lines represent actual wires. Solid lines represent dual-rail encoded lines.

[Figure 5 labels: Validity Detection; Register]

Fig. 5. Output control structure. All paths employ dual-rail encoding.

An arbiter circuit controls the behavior of each output port. This arbiter achieves fairness with a structure of six 2-input, 2-output arbiters connected in a shuffle-exchange topology. Each atomic arbiter decides which of two input requests to serve, using a first-come-first-served strategy. This allows the processing of up to four simultaneous input port requests. The bit used to produce the request to the output port is produced by the logic that computes routing on the input port. Since this bit uses a dual-rail representation, conversion to single-rail is necessary, because arbiters are the only single-rail modules in the output port. A 2-input C-element with one negated input executes the conversion.

Figure 5 details the structure of each output control circuit of an output port. This module receives data directly from some input port. Its role is to generate requests for the output port arbiter or to undo the internal connection between input and output ports after transmitting the last packet flit and receiving the kill token.
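The dual-rail to single-rail conversion mentioned above can be illustrated with a behavioural model of a Muller C-element, whose output changes only when both inputs agree and otherwise holds its value. The text does not say which rail drives the negated input; the sketch below assumes it is the false rail, and the class and function names are hypothetical.

```python
# Behavioural sketch of a 2-input C-element with one negated input.
# Assumption (not stated in the text): the false rail feeds the negated input.


class CElement:
    """Muller C-element: output follows the inputs when they agree, else holds."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


def dual_to_single(c_element, true_rail, false_rail):
    """Convert one dual-rail request bit to a single-rail level."""
    return c_element.update(true_rail, 1 - false_rail)


# Example sequence: spacer, valid 1, spacer, valid 0, spacer.
c = CElement()
sequence = [(0, 0), (1, 0), (0, 0), (0, 1), (0, 0)]
levels = [dual_to_single(c, t, f) for t, f in sequence]
```

Under this assumption the single-rail request rises on a valid "1", is held through the following spacer, and falls only when a valid "0" arrives, so the arbiter sees a stable level rather than a pulse per handshake.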

4 Network Interface

The synchronization mechanism is one of the crucial components of a GALS system. Traditional synchronizers like series-connected flip-flops do not guarantee elimination of metastability, and since synchronization latency is usually large in such synchronizers, these components often impose low throughput on the communication architecture. To overcome these limitations, this work chooses to employ clock stretching techniques, which do eliminate the risk of metastability. Also, this kind of synchronization can support higher throughput than traditional synchronizers.

The synchronization mechanism adopted here is based on the SCAFFI [14] asynchronous interface. SCAFFI is an asynchronous interface based on clock stretching that supports dual-rail communication schemes. The network interface between Local Ports and PEs appears in Figure 6. More details on this interface are available in reference [14].

Fig. 6. SCAFFI network interface between a Hermes-A router and a synchronous PE. The interface employs clock stretching techniques to avoid metastability. The stretcher circuits are not represented in the picture.

5 ASIC Implementation

Since traditional design kits do not usually contain asynchronous components, the Hermes-A ASIC implementation started with the implementation of an asynchronous digital cell library. The library includes several versions of C-elements, metastability filters and control circuits, like sequencers. The first version of the asynchronous library uses the XFab 180nm design rules and includes liberty timing files (.lib), abstract views (.lef) and Verilog models using UDP primitives to enable timing annotated simulations. The asynchronous library is the base to develop a set of data flow elements (fork, join, merge, mux, demux, half-buffer registers, validity detectors, etc.).

During the asynchronous router synthesis it is important to guarantee that the (synchronous) synthesis tool does not change asynchronous components. For instance, in the Cadence RTL Compiler synthesis tool it is possible to ensure that this will not happen by using the PRESERVE property, which can be assigned to each module instance. This property instructs the tool not to touch the cell instance characteristics.

The results presented in Table 2 refer to the XFab 180nm ASIC implementation of the Hermes-A router. The operating conditions are 25°C, 1.8 Volts. Also, the library build employs typical transistor models. Power results were obtained with all router input and output ports operating at their highest rate of 727 Mbits/s on each router link. The throughput presented in Table 2 is for a single link operation. The router can sustain, in the best possible case, operation at this performance level on all of its five ports, totaling approximately 3.6 Gbits/s of maximum throughput for the whole router.

Table 2. ASIC Implementation results for a 180nm XFab technology

Throughput (Mbits/s) | Area (mm2), Cell – Total | Total Power (mW)
727 | 0.21 – 0.33 | 11.14
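The aggregate throughput quoted above follows directly from Table 2, and the arithmetic can be checked with a few lines of Python. The energy-per-bit figure computed below is a back-of-the-envelope derivation that is not reported in the paper; it assumes that the 11.14 mW total power corresponds to all five ports running at the peak link rate, as the text states for the power measurement.

```python
# Arithmetic check of the Table 2 figures (function names are illustrative).
LINK_THROUGHPUT_MBPS = 727   # per-link throughput from Table 2
PORTS = 5                    # East, West, North, South, Local
TOTAL_POWER_MW = 11.14       # total power from Table 2


def router_throughput_gbps(link_mbps=LINK_THROUGHPUT_MBPS, ports=PORTS):
    """Best-case aggregate throughput with every port at the peak rate."""
    return link_mbps * ports / 1000.0


def energy_per_bit_pj(power_mw=TOTAL_POWER_MW,
                      link_mbps=LINK_THROUGHPUT_MBPS, ports=PORTS):
    """Derived estimate (assumption): total power divided by aggregate bit rate."""
    bits_per_second = link_mbps * 1e6 * ports
    return power_mw * 1e-3 / bits_per_second * 1e12


print(f"{router_throughput_gbps():.3f} Gbits/s")   # 5 x 727 Mbits/s
print(f"{energy_per_bit_pj():.2f} pJ/bit (derived estimate)")
```

The first value, 3.635 Gbits/s, rounds to the 3.6 Gbits/s stated in the abstract; the second should be read only as an order-of-magnitude indication.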

6 Conclusions and Future Work

The Hermes-A router demonstrates that asynchronous circuits are useful as a communication architecture for high performance, complex GALS SoCs. Ongoing work proceeds in several directions, including: (1) providing support for adaptive routing algorithms in Hermes-A; (2) enabling Hermes-A to work with multiple supply voltages and power shutoff features, in order to reduce power consumption, mainly in idle ports; (3) implementing complete NoC topologies and applications for testing router operation, such as 2D meshes and 2D tori. It is important to note that in the case of a 2D torus, the routing module has to be modified, since a pure XY routing algorithm is not deadlock-free for this network topology.

Acknowledgements

The Authors would like to acknowledge the support of the CNPq through research grants 551473/2010-0, 309255/2008-2, and 301599/2009-2. Also, they would like to acknowledge the National Science and Technology Institute on Embedded Critical Systems (INCT-SEC) for the support to this research.

References

[1] Ho, R., Mai, K., Horowitz, M.: The future of wires. Proceedings of the IEEE 89(4), 490–504 (2001)
[2] Chapiro, D.: Globally-Asynchronous Locally Synchronous Systems. PhD thesis, Stanford University, 134 p. (October 1984)
[3] Marculescu, R., Ogras, U., Peh, L.-S., Jerger, N., Hoskote, Y.: Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 28(1), 3–21 (2009)
[4] Bainbridge, J., Furber, S.: Chain: A Delay-Insensitive Chip Area Interconnect. IEEE Micro 22(5), 16–23 (2002)
[5] Felicijan, T., Furber, S.: An Asynchronous On-Chip Router with Quality-of-Service (QoS) Support. In: 17th IEEE Int. SoC Conf. (SOCC 2004), pp. 274–277 (2004)
[6] Lines, A.: Asynchronous Interconnect for Synchronous SoC Design. IEEE Micro 24(1), 32–41 (2004)
[7] Bjerregaard, T., Stensgaard, M., Sparsø, J.: A Scalable, Timing-Safe, Network-on-Chip Architecture with an Integrated Clock Distribution Method. In: Design, Automation, and Test Europe (DATE 2007), pp. 1–6 (April 2007)
[8] Dobkin, R., Ginosar, R., Kolodny, A.: QNoC asynchronous router. Integration the VLSI Journal 42(2), 103–115 (2009)
[9] Quartana, J., Renane, S., Baixas, A., Fesquet, L., Renaudin, M.: GALS systems prototyping using multiclock FPGAs and asynchronous network-on-chips. In: Int. Conf. on Field Programmable Logic and Applications (FPL 2005), pp. 299–304 (2005)
[10] Beigné, E., Clermidy, F., Vivet, P., Clouard, A., Renaudin, M.: An Asynchronous NoC Architecture Providing Low Latency Service and its Multi-level Design Framework. In: IEEE Int. Symp. on Asynchronous Circuits and Systems (ASYNC 2005), pp. 54–63 (2005)
[11] Wang, X., Ahonen, T., Nurmi, J.: Prototyping a Globally Asynchronous Locally Synchronous Network-On-Chip on a Conventional FPGA Device Using Synchronous Design Tools. In: Int. Conf. on Field Programmable Logic and Applications (FPL 2006), pp. 657–662 (2006)
[12] Hollis, S., Moore, S.: RasP: An Area-efficient, On-chip Network. In: Int. Conf. on Computer Design (ICCD 2006), pp. 63–69 (2006)
[13] Sheibanyrad, A., Greiner, A., Miro-Panades, I.: Multisynchronous and Fully Asynchronous NoCs for GALS Architectures. IEEE Design and Test of Computers 25(6), 572–580 (2008)
[14] Pontes, J., Soares, R., Carvalho, E., Moraes, F., Calazans, N.: SCAFFI: An intrachip FPGA asynchronous interface based on hard macros. In: Int. Conf. on Computer Design (ICCD 2007), pp. 541–546 (2007)
[15] Thonnart, Y., Vivet, P., Clermidy, F.: A Fully Asynchronous Low-Power Framework for GALS NoC Integration. In: Design, Automation, and Test Europe (DATE 2010), pp. 33–38 (2010)
[16] Sparsø, J., Furber, S.: Principles of Asynchronous Circuit Design – A Systems Perspective. 354 p. Kluwer Academic Publishers, Boston (2001)

Practical and Theoretical Considerations on Low-Power Probability-Codes for Networks-on-Chip

Alberto Garcia-Ortiz1 and Leandro S. Indrusiak2

1 Institute for Theoretical Electrical Eng. and Microelectronics (ITEM), University of Bremen, Otto-Hahn-Allee 1, NW1, 28359 Bremen, Germany [email protected]
2 Dept. of Computer Science - Real-Time Systems Group (RTS), University of York, YO10 5DD York, UK [email protected]

Abstract. Low-power coding represents an important technique to reduce consumption in modern interconnect architectures. In the case of Networks-on-Chip, and especially if they include virtual channels, the coding techniques are required to be effective (large reduction of transition activity) and extremely efficient (reduced hardware resources). This work proposes a coding template called PM with those characteristics. Moreover, it shows with a detailed theoretical analysis and a number of experiments the good characteristics of the approach. Some relevant theoretical results on Exact Probability Coding are also developed in the paper.

1 Introduction

The increasing miniaturisation capabilities of nanometric technologies allow the integration of hundreds of processing units in a single chip. However, such systems demand an optimised communication architecture. Networks-on-Chip are emerging as a promising approach to address that problem [2]. Stringent constraints such as power, performance and latency must be observed, and requirements such as reliability, fault tolerance, correctness (data ordering) and completion (no data loss) must be met.

The power consumption of NoC interconnects is not negligible. The internal structure of a NoC router can be quite complex, with arbitration, routing and switching logic, as well as temporary storage. The wires between routers also contribute significantly to the dynamic power consumption [6].

One alternative to reduce the dynamic power consumption of Networks-on-Chip is the application of coding techniques that minimise the signal transition activity [7,4,5]. Crosstalk Avoidance Codes and Error Correction Codes have also been proposed [3] to allow a reduction in the transmitted voltage swings (and thus, the power) without sacrificing reliability.

For the relevant case of NoCs with virtual channels, standard low-power coding approaches [1,8,7] are not applicable. The packet multiplexing which occurs on the virtual channels destroys the low transition characteristics introduced by the encoding. Novel approaches such as PMD [4] are required in this case.

A major challenge is to find coding architectures where the overhead of the coder/decoder does not eliminate the power savings in the interconnects achieved by the coding procedure. This work aims at analysing the suitability of coding strategies simpler than PMD for NoCs with Virtual Channels.

First, Sec. 2 investigates the possibility of removing the Correlator and Decorrelator from the switch. The resulting template (called PM Code) provides an interesting trade-off between coding complexity and activity reduction. Since its low-power coding efficiency is slightly smaller than that of PMD, we investigate in Sec. 3 the theoretical limits of Probability-Coding. The main focus is to understand, from a solid foundation, what the implications of the probabilistic characteristics of the signal to be coded are for the efficiency (activity reduction ratio) which can be obtained. Finally, we validate experimentally the results of the work. The data are reported in Section 4.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 160–169, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Probability-Multiplex Coding

Since the Probability-Multiplex (PM) coding template is based on the Probability-Multiplex-Decorrelator (PMD) strategy, let us describe PMD first. The interested reader is referred to [4] for a complete description.

The goal of a standard low-power Transition-Code is to minimise the number of transitions in the wires (or the number of transitions in opposite directions for neighbouring wires if coupling is considered). The goal of Probability-Coding is to minimise the number of ones at the output of the coder. A Transition-Code can be created by adding a XOR-Decorrelator to a Probability-Code [8]. As shown in [4], low-power transmission in NoCs with virtual channels cannot be achieved using a Transition-Code, but it can be obtained by using a Probability-Code and a distribution of XOR-Decorrelators and XOR-Correlators over the NoC links.

PMD is composed of three consecutive steps: first, a Probability Coder which minimises the number of ones; second, the time multiplexing intrinsic to the virtual channels of the NoC; and third, the XOR-Decorrelator which maps ones to transitions. The decoding applies a XOR-Correlator, a demultiplexing (intrinsic to the virtual channels) and a Probability Decoder. The Probability Coder and Decoder are located in the Network Interface of the NoC fabric, while the XOR-Correlator and XOR-Decorrelator are distributed over the Links.

Although different architectures can be used for the Probability Coder, the code “Corr-K0”, consisting of a XOR-Correlator followed by XORing the bus with the MSB, has been shown to provide a good compromise between hardware complexity and power reduction. The box P.Coder of Fig. 1 illustrates the circuit.

In this work we analyse the possibility of reducing the hardware complexity of PMD even further by removing the XOR-Correlators and XOR-Decorrelators located on the Links. Fig. 1 shows the proposed coding template, called Probability-Multiplex (PM). The main advantage of PM with respect to PMD is that it does not require any modification of the NoC Switch itself, but only of the Network Interface with the Processing Element. Thus, the critical timing path between the NoC Switches is not modified. The power and area overhead of the procedure is also reduced.
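The role of the XOR-Decorrelator and XOR-Correlator in PMD can be illustrated with a small software model. The sketch below assumes the usual definitions (decorrelator output equals its input XORed with its own previous output; correlator output equals its input XORed with its previous input) and an all-zero initial state; the function names are hypothetical and the Corr-K0 coder itself is not modelled.

```python
# Sketch of the XOR-Decorrelator / XOR-Correlator pair (illustrative model).
# Assumption: all registers start at zero.


def xor_decorrelate(words):
    """Map ones to transitions: out[n] = in[n] XOR out[n-1]."""
    out, prev = [], 0
    for word in words:
        prev ^= word
        out.append(prev)
    return out


def xor_correlate(words):
    """Map transitions back to ones: out[n] = in[n] XOR in[n-1]."""
    out, prev = [], 0
    for word in words:
        out.append(word ^ prev)
        prev = word
    return out


def count_ones(words):
    """Total number of ones in a sequence of bus words."""
    return sum(bin(word).count("1") for word in words)


def count_transitions(words, initial=0):
    """Total number of bit toggles on the bus, starting from `initial`."""
    total, prev = 0, initial
    for word in words:
        total += bin(word ^ prev).count("1")
        prev = word
    return total
```

With these definitions the number of bus transitions after the decorrelator equals the number of ones before it, which is why a Probability-Code that minimises ones becomes a Transition-Code once a decorrelator is appended.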

[Figure 1 labels: Low-power switch; Switch; Link; P.Coder; P.Decoder; FF; Processing Element]

Fig. 1. PM coding template with an example of a P.Coder (Corr-K0)

2.1 Dynamic Power Considerations

In order to analyse exactly the switching activity in the Links, we can consider two (temporally) uncorrelated signals $X_a$ and $X_b$, which are time-multiplexed to generate a resulting signal $X_m$. This model describes the transmission of data over virtual channels in a NoC. Let us denote by $p_{a,i}$ the probability of being 1 for the i-th bit of signal $X_a$, and by $p_{b,i}$ the corresponding probability for signal $X_b$. The probability of having a bit transition of the form $S_{a,i} = 0 \to S_{b,i} = 1$ is

$$\mathrm{Prob}[S_{a,i} = 0,\, S_{b,i} = 1] = (1 - p_{a,i})\,p_{b,i}$$

where we have used the fact that the signals $X_a$ and $X_b$ are statistically independent. Adding the opposite transition:

$$t_{m,i} = (1 - p_{a,i})\,p_{b,i} + p_{a,i}\,(1 - p_{b,i}) \qquad (1)$$

which is independent of the transition activity of $X_a$ and $X_b$, and depends only on the bit probabilities. If we assume that the bit probabilities of $X_a$ and $X_b$ are both equal to $p_i$, we obtain

$$t_{m,i} = 2\,p_i\,(1 - p_i) \qquad (2)$$

Let us note that the activity for a PMD code is simply $t_{m,i} = p_i$, while a “classical” low-power code in the context of virtual channels has $t_{m,i} = 1/2$. Thus, PM is less efficient than PMD by a factor $2(1 - p_i)$, but achieves a switching reduction of $p_i(1 - p_i)$. Fig. 2 shows the activity reduction factor of PM and PMD as a function of the entropy of each single wire. We observe that the reduction in coding efficiency of PM with respect to PMD can range from 0 to approximately 23%. Experimental results (see Section 4) confirm that a typical value of 10%-15% should be expected.
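Equations (1) and (2) are easy to evaluate numerically. The following sketch, with hypothetical function names, computes the per-bit activity of the multiplexed signal and compares PM, PMD and the "classical" value of 1/2 for a few bit probabilities; it only restates the formulas above and adds no new result.

```python
# Numerical illustration of Eqs. (1) and (2) (function names are illustrative).

CLASSICAL_ACTIVITY = 0.5  # classical low-power code with virtual channels


def activity_multiplexed(p_a, p_b):
    """Eq. (1): transition probability when two independent bits alternate."""
    return (1 - p_a) * p_b + p_a * (1 - p_b)


def activity_pm(p):
    """Eq. (2): PM activity when both signals have bit probability p."""
    return 2 * p * (1 - p)


def activity_pmd(p):
    """PMD activity: one transition per transmitted one."""
    return p


for p in (0.05, 0.125, 0.25, 0.5):
    ratio = activity_pm(p) / activity_pmd(p)   # equals 2 * (1 - p)
    print(f"p={p:<5} PM={activity_pm(p):.4f} PMD={activity_pmd(p):.4f} "
          f"PM/PMD={ratio:.3f} classical={CLASSICAL_ACTIVITY}")
```

At p = 1/2 both PM and PMD reach the classical activity of 1/2, so neither coder gives any benefit, while for small p the PM activity approaches twice the PMD activity but remains far below 1/2.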

2.2 Static Power Considerations

Leakage is a major concern for current technologies. In this subsection we analyse the implications in terms of leakage of using PM instead of PMD.

[Figure 2: plot of Activity Reduction [%] (0 to 100) versus Entropy [bit] (0 to 1); curves: uncoded, PMD code, PM code, Penalty of PM]

Fig. 2. Coding activity reduction for PMD and PM as a function of the signal entropy

Since the absence of the XOR-Correlator and XOR-Decorrelators does not change the signal characteristics inside the buffers of the NoC Switch, PM maintains the same savings in terms of static power inside the switch reported for PMD. The reductions are 21% in the average case and 32% for multimedia signals.

3 Analysis of Probability-Coding in NoCs with VC

Since PMD and PM use a Probability Coder, it is useful to analyse the exact (i.e., optimal) Probability-Coding.

In the context of Transition-Coding, Exact Transition-Coding has been proposed in [1]. The core of the technique is actually an Exact Probability-Coder, referred to as E. It provides the best possible Probability-Code for a coding scheme which employs only the current and previous value of the signal during the codification process. Although Exact Transition-Coding (and Exact Probability-Coding) are completely impractical for a real implementation (see [1]), they establish a theoretical limit on what is achievable by low-power coding.

Let us consider a B-bit Boolean stationary random variable X with a known Joint Probability Distribution (JPD) $P_{XY}(x,y) = P(X[n] = x,\, X[n-1] = y)$. The Exact Probability-Code can be viewed as a Boolean function $E(x,y) : \mathbb{B}^B \times \mathbb{B}^B \to \mathbb{B}^B$ which is decodable and minimises the expected number of ones at the output of the coder (for the given JPD). The authors of [1] provide an algorithm to obtain the specification of such a coder E. The algorithm requires sorting $4^B$ probability values, and then visiting that list while keeping a table with some “forbidden” values.

Another point of view on the problem is to consider that the coding procedure is composed of two consecutive steps. The first is a coding function Ep which minimises the digital numeric value of the output rather than the number of ones. The second step is the value-based mapping (vbm) described in [8]. The vbm is a Boolean function $\mathrm{vbm}(x) : \mathbb{B}^B \to \mathbb{B}^B$ which maps the inputs with smaller digital value to the output codes with fewer ones. The structure is shown in Fig. 3. It is straightforward to see that both approaches are equivalent.
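The value-based mapping can be illustrated with a short Python sketch. This is only an illustration of the idea described above, not the implementation of [8]: the function names are hypothetical, and the tie-breaking rule among codewords with the same number of ones (numeric order) is an assumption made here.

```python
def popcount(v: int) -> int:
    """Hamming weight (number of ones) of a non-negative integer."""
    return bin(v).count("1")


def build_vbm(bits: int) -> list[int]:
    """Value-based mapping table: index = input value, entry = output codeword.

    Codewords are ordered by increasing Hamming weight, so that small input
    values receive codewords with few ones. Ties are broken by numeric value
    (an arbitrary but fixed choice, assumed for this sketch).
    """
    return sorted(range(1 << bits), key=lambda c: (popcount(c), c))


def invert(table: list[int]) -> list[int]:
    """Inverse table, as a decoder would use it."""
    inverse = [0] * len(table)
    for value, code in enumerate(table):
        inverse[code] = value
    return inverse


vbm3 = build_vbm(3)
print(vbm3)  # [0, 1, 2, 4, 3, 5, 6, 7]
```

Because the table is a permutation of the codewords, the mapping is trivially decodable through the inverse table.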

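Fig. 3. Exact-Coding from a probabilistic point of view [block diagram with the blocks Ep, vbm and a flip-flop (FF) and the signals x[n], x[n−1], z[n], w[n]; side plots show the probability over x[n] and x[n−1] before and after sorting, and the vbmCost staircase with values from 0 to B over the range 0 to 2^B − 1]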

We can observe that the optimal coding values corresponding to the k-th row of Ep are the indexes obtained when the corresponding probabilities $P_{XY}(x,k)$ are sorted in decreasing order. Thus, Ep can be found just by $2^B$ sorts of a set of $2^B$ values (which is much simpler than the approach presented in [1]). The sort has to be done on a row-by-row basis to guarantee the decodability of the resulting low-power code. Fig. 3 shows graphically how Ep sorts the PDF of X.

Once Ep is performed, we can apply the vbm coder. After calculating the one-dimensional probability of each value, we can obtain the expected (average) number of ones, and thus the activity in the Links of the NoC after the XOR-Decorrelator. Using the Hamming weight function (number of ones), we can write:

$$E_{Link} = P_{ones} = \sum_i \mathrm{Prob}[Z = i]\;\mathrm{HammingWeight}(i)$$

However, it turns out to be easier to define an equivalent cost function before the application of the vbm coder:

$$\mathrm{vbmCost}(k) = \mathrm{HammingWeight}[\mathrm{vbm}(k)] \qquad (3)$$
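The row-by-row construction of Ep can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB script: the JPD is stored as `jpd[x][k]` with x the current value and k the previous value, ties between equal probabilities are broken by the value of x, and the small 2-bit JPD is invented purely for the example.

```python
def build_ep(jpd: list[list[float]]) -> list[list[int]]:
    """Return ep with ep[x][k] = code value for current x and previous k.

    For every previous value k, the current values x are ranked by
    decreasing joint probability jpd[x][k]; the rank is the code value.
    """
    n = len(jpd)
    ep = [[0] * n for _ in range(n)]
    for k in range(n):
        # Sort one slice of the JPD; ties broken by x (assumption).
        order = sorted(range(n), key=lambda x: (-jpd[x][k], x))
        for rank, x in enumerate(order):
            ep[x][k] = rank
    return ep


def decode(ep: list[list[int]], code: int, k: int) -> int:
    """Recover x from the code value and the known previous value k."""
    for x in range(len(ep)):
        if ep[x][k] == code:
            return x
    raise ValueError("code value not present in slice k")


# Hypothetical 2-bit JPD (entries sum to 1), for illustration only.
jpd = [
    [0.20, 0.05, 0.00, 0.00],
    [0.05, 0.20, 0.05, 0.00],
    [0.00, 0.05, 0.20, 0.05],
    [0.00, 0.00, 0.05, 0.10],
]
ep = build_ep(jpd)
print(ep[1][1], ep[0][2])  # 0 3
```

Since every slice is sorted independently, each slice of `ep` is a permutation of the code values, which is what makes the code decodable once the previous value is known.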

The vbmCost(k) is a monotonic function composed of B + 1 steps, or cost-regions, with values from 0 to B. The width of the k-th cost-region is $\binom{B}{k}$, corresponding to the $\binom{B}{k}$ words of B bits with exactly k ones. Using vbmCost(k) we can write:

$$E_{Link}(X) = P_{ones} = \sum_k \mathrm{Prob}[W = k]\;\mathrm{vbmCost}(k) \qquad (4)$$

where the key point is that now the random variable W is used instead of Z. The advantage is that the distribution of W is easier to obtain than that of Z.

Because of their relevance for DSP and multimedia applications, we focus on Normally distributed signals. Let us consider a B-bit Gaussian stationary signal with standard deviation σ and temporal correlation ρ. For the sake of simplicity, we assume a continuous signal instead of a discrete signal with $2^B$ levels. The PDF of the signal is:

$$f_{XY}(x,y) = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left(-\frac{x^2 + y^2 - 2\rho x y}{2\sigma^2(1-\rho^2)}\right)$$

The expression for one "slice" of the PDF at value Y = k is:

$$f_{XY}(x,k) = \frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left(-\frac{(x-\rho k)^2}{2\sigma^2(1-\rho^2)}\right) \exp\left(-\frac{k^2}{2\sigma^2}\right)$$
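The slice expression follows from the joint PDF by completing the square in x. The intermediate step, which is not spelled out in the text, can be written as:

```latex
\begin{aligned}
x^2 + k^2 - 2\rho x k &= (x - \rho k)^2 + (1 - \rho^2)\,k^2, \\
\frac{x^2 + k^2 - 2\rho x k}{2\sigma^2 (1 - \rho^2)}
  &= \frac{(x - \rho k)^2}{2\sigma^2 (1 - \rho^2)} + \frac{k^2}{2\sigma^2}.
\end{aligned}
```

The first term depends on x only through the shifted variable x − ρk, while the second term depends only on k, which is why the exponential factorises into two parts.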

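We observe that, with respect to X, the shape of $f_{XY}(x,k)$ is similar to a Gaussian bell with centre at $\mu = \rho k$ and standard deviation $\sigma_k = \sigma\sqrt{1-\rho^2}$. The next step is to sort the "slice". We can think of this step as first moving the shape of the slice to zero (which removes the mean), and then mirroring the negative side onto the positive side. Then, for $x \ge 0$:

$$\mathrm{sort}(f_{XY}(x,k)) = \frac{2}{2\pi\sigma^2\sqrt{1-\rho^2}} \exp\left(-\frac{x^2}{2\sigma^2(1-\rho^2)}\right) \exp\left(-\frac{k^2}{2\sigma^2}\right)$$

Finally, we have to add all the "slices". Since $\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{x^2}{2\sigma^2}\right) dx = 1$, we conclude that:

$$p_W(w) = \begin{cases} \dfrac{2}{\sqrt{2\pi}\,\sigma\sqrt{1-\rho^2}} \exp\left(-\dfrac{w^2}{2\sigma^2(1-\rho^2)}\right) & \text{if } w \ge 0 \\[2ex] 0 & \text{if } w < 0 \end{cases} \qquad (5)$$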

Thus, $p_W(w)$ is twice the positive side of a Gaussian PDF with zero mean and standard deviation $\sigma_p = \sigma\sqrt{1-\rho^2}$. In summary, using Eq. (4):

$$E_{Link}(X) = \sum_w \frac{2}{\sqrt{2\pi}\,\sigma\sqrt{1-\rho^2}} \exp\left(-\frac{w^2}{2\sigma^2(1-\rho^2)}\right) \mathrm{vbmCost}_B(w) \qquad (6)$$

A key point is that Gaussian random signals with different standard deviation and correlation, but equal $\sigma_p$, will have the same power cost after Exact-Coding. The remarkable fact is that the entropy of a Markov Gaussian random variable is given by:

$$H_G(\sigma,\rho) = \frac{1}{2}\log_2(2\pi e) + \log_2\left(\sigma\sqrt{1-\rho^2}\right) \qquad (7)$$

which is a function of the same parameter $\sigma_p = \sigma\sqrt{1-\rho^2}$. Thus, $\sigma\sqrt{1-\rho^2}$ can be obtained as a function of $H_G(\sigma,\rho)$. Moreover, since $\mathrm{vbmCost}_B(w)$ is a function of B, it is straightforward to define a function $\varphi_G$ such that:

$$E_{Link}(X) = \varphi_G(H_G, B) \qquad (8)$$

We have proved the following theorem, which characterises the maximum dynamic power reduction that can be obtained in the presence of temporal correlation.

Theorem 1. The efficiency of the low-power Exact-Code for temporally correlated Markov Gaussian Signals depends only on the bit-width and the entropy of that signal.

It is worth noting that the exact dependency on the entropy does not hold for other coding strategies such as Gray-Code, Bus-Invert, etc. However, it does hold for an ideal infinite code, as shown in [9].
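Equations (6) and (7) can be evaluated numerically to illustrate Theorem 1. The following Python sketch is an assumption-laden illustration: the function names are hypothetical, Eq. (6) is evaluated literally as a sum over the integer values w = 0 … 2^B − 1 of the continuous density (so it inherits the continuous approximation made in the derivation), and the example values of σ and ρ are chosen only so that both signals share the same σ√(1−ρ²).

```python
import math


def popcount(v: int) -> int:
    return bin(v).count("1")


def vbm_cost_table(bits: int) -> list[int]:
    """vbmCost(k): staircase of Hamming weights, one step per weight."""
    return sorted(popcount(c) for c in range(1 << bits))


def sigma_p(sigma: float, rho: float) -> float:
    return sigma * math.sqrt(1.0 - rho * rho)


def gauss_markov_entropy(sigma: float, rho: float) -> float:
    """Eq. (7), in bits."""
    return 0.5 * math.log2(2 * math.pi * math.e) + math.log2(sigma_p(sigma, rho))


def expected_ones(sigma: float, rho: float, bits: int) -> float:
    """Eq. (6), evaluated at the integer values of w."""
    sp = sigma_p(sigma, rho)
    cost = vbm_cost_table(bits)
    norm = 2.0 / (math.sqrt(2 * math.pi) * sp)
    return sum(
        norm * math.exp(-(w * w) / (2 * sp * sp)) * cost[w]
        for w in range(1 << bits)
    )


# Two different signals with the same sigma * sqrt(1 - rho^2) = 40.
a = (50.0, 0.6)
b = (40.0, 0.0)
print(gauss_markov_entropy(*a), gauss_markov_entropy(*b))
print(expected_ones(*a, bits=8), expected_ones(*b, bits=8))
```

Both printed pairs coincide up to rounding, since the entropy and the cost depend on σ and ρ only through σ√(1−ρ²).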

4 Experimental Results

In order to compare PM with PMD experimentally, we have used the same simulation environment as in [4]. It employs a simplified behavioural model of the NoC and emulates a 4 by 4 mesh topology. However, only three Processing Elements are actually active during the simulation. The flit bit-width is 8 bits. Four flits are used for the header, and 128 for the payload.

The main focus of this paper is the coding strategy for NoCs. To isolate the issues related to NoC traffic and congestion from the coding itself, we employ a quite idealised network. It uses two Processing Elements which work as idealised data producers, and an idealised data receiver. The internal buffers of the switch are assumed to be unlimited. Moreover, all Processing Elements are able to produce or consume one flit per clock cycle. For the analysis of the coding, we trace the signals in the switch connected to the receiver. The data from the two producers arrive at the switch through the same port, but through different virtual channels. The transmitted data correspond to the following signals:

Raw image: The red component of an 800x130x8b image. It corresponds to the welcome image from the PATMOS'08 web page.
Male voice: A male voice signal. It consists of 5000 samples with σ = 50 and ρ = 0.88.
Music: A short piece of classical music (Bach).
OFDM: FFT input in a HiperLan/2 OFDM receiver, using 64QAM modulation and a type C channel. It consists of 50000 samples. It has σ = 42 and ρ = 0.22.
gzip: The gzip executable in ELF 32-bit.

The experiment has been performed with and without the XOR-Correlator and Decorrelator in the Links to simulate PMD and PM respectively. Tab. 1 summarises the results. Since PMD and PM are templates which can be used with different codes, we have analysed different alternatives as shown in Tab. 1. Following the framework of [8], the difference-based coding (dbm), value-based coding (vbm), XOR-Correlation (corr), and XOR-Decorrelator (decor) are combined to produce different coding strategies. The K1 and K0 memoryless coders [4] are also used.

Table 1. Comparison of mean transition activity resulting from using PM and PMD coding templates with real signals in a virtual channel based 4x4 NoC

Code        Raw Image     Male voice    Music         GZIP exe      OFDM data     Mean
            PMD    PM     PMD    PM     PMD    PM     PMD    PM     PMD    PM     PMD    PM
K1          3.84   3.10   2.48   3.20   2.90   3.23   2.92   3.68   2.73   3.05   2.97   3.25
K0          4.03   3.05   2.65   3.33   3.16   3.29   2.81   3.60   2.95   3.11   3.12   3.27
vbm         5.69   3.15   3.60   3.96   4.06   4.00   3.02   3.74   4.01   4.00   4.08   3.77
corr+none   1.21   1.93   2.41   3.10   2.13   2.49   2.98   3.70   3.79   3.97   2.50   3.04
corr+K1     1.20   1.92   1.94   2.60   1.93   2.24   3.14   3.80   2.93   3.20   2.23   2.75
corr+K0     1.10   1.78   1.78   2.42   1.63   1.98   3.08   3.75   2.98   3.22   2.11   2.63
corr+vbm    0.87   1.53   1.92   2.89   1.45   2.32   2.93   3.71   3.67   3.97   2.17   2.88
dbm+none    1.02   1.67   2.17   2.90   1.66   2.13   3.24   3.83   3.86   3.98   2.39   2.90
dbm+K1      1.03   1.70   1.63   2.28   1.43   1.81   3.15   3.79   2.73   3.15   2.00   2.55
dbm+K0      1.15   1.84   1.81   2.41   1.72   1.87   2.96   3.68   2.96   3.22   2.12   2.60
dbm+vbm     0.80   1.42   1.82   2.78   1.27   2.09   3.13   3.81   3.75   3.98   2.15   2.82
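The "Mean" columns of Table 1 can be compared programmatically. The Python sketch below only re-reads the tabulated means; the dictionary layout and the function name are choices made for this illustration.

```python
# Mean transition activity from the "Mean" columns of Table 1: (PMD, PM).
MEAN_ACTIVITY = {
    "K1": (2.97, 3.25),
    "K0": (3.12, 3.27),
    "vbm": (4.08, 3.77),
    "corr+none": (2.50, 3.04),
    "corr+K1": (2.23, 2.75),
    "corr+K0": (2.11, 2.63),
    "corr+vbm": (2.17, 2.88),
    "dbm+none": (2.39, 2.90),
    "dbm+K1": (2.00, 2.55),
    "dbm+K0": (2.12, 2.60),
    "dbm+vbm": (2.15, 2.82),
}
TEMPLATE_COLUMN = {"PMD": 0, "PM": 1}


def codes_better_than(reference: str, template: str) -> list[str]:
    """Codes whose mean activity is strictly lower than the reference code."""
    col = TEMPLATE_COLUMN[template]
    limit = MEAN_ACTIVITY[reference][col]
    return [code for code, means in MEAN_ACTIVITY.items() if means[col] < limit]


print(codes_better_than("corr+K0", "PMD"))  # ['dbm+K1']
print(codes_better_than("corr+K0", "PM"))   # ['dbm+K1', 'dbm+K0']
```

For the PMD template only dbm+K1 has a lower mean than corr+K0. For the PM template the tabulated mean of dbm+K0 (2.60) is also marginally below that of corr+K0 (2.63), a difference of about 1%.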

It can be observed that the code corr+K0 is the most practical one not only for PMD, but also for PM. The only code alternative which improves the quality of corr+K0 is dbm+K1. Since the dbm requires three adders to be implemented, and K1 has a worst timing path proportional to the bit-width of the signal, the dbm+K1 code is much more expensive than corr+K0. As shown in Fig. 1, corr+K0 has a complexity of 2B Flip-Flops and 2B − 1 XOR gates, while the worst timing path is only 2 XOR gates (around 210ps in a 180nm technology).

We have compared PMD and PM with the Exact-Code in terms of activity reduction. The values have been obtained using a MATLAB script. The computation of Ep requires an estimation of the JPD, which has been calculated using a two-dimensional histogram. Once the JPD is known, the matrix Ep and the average cost are easily calculated employing the approach described in Section 3. The results are reported in Fig. 4. For the real signals used in the experiment, the maximum reduction in the activity that could be obtained is 70%. PMD provides a reduction of 47%, and thus it is close to the theoretical maximum. The PM code achieves about one half of the maximum possible reduction (34%). We observe that PM behaves quite well for multimedia signals. However, for random data, as in the case of the GZIP executable, the degradation is notable. As depicted in Fig. 2, when the entropy of the signal increases, the degradation of PM with respect to PMD becomes more relevant.

Fig. 4. Comparison of activity reduction for Exact-Code, PMD, and PM code for real signals [bar chart: Activity Reduction [%] for the signals Raw, Male, Music, GZIP, OFDM, and their mean; series: Exact, PMD, PM]

Finally, Tab. 2 compares the complexity of PMD and PM for the NoC Switch used in the current experimental setup (i.e., a NoC Switch with a bit-width of 8 bits, and 4 Links for constructing a Mesh). The results refer to a 180nm technology. To give a better insight into the characteristics of PM and PMD, Tab. 2 reports the results corresponding to the Network Interface, the Links, and the overall NoC Switch.

Table 2. Comparison of the complexity of PMD and PM in terms of area and delay

       Network Interface               Data Link                       Overall
       Area [eq. gates]  Delay [ps]    Area [eq. gates]  Delay [ps]    Area [eq. gates]  Delay [ps]
PMD    141               210           576               210           717               210
PM     141               210           0                 0             141               0

The overhead of the encoder and decoder in the Network Interface is equal for PM and PMD, since both techniques use the same Probability Coder and Decoder. However, the 4 XOR-Correlators and 4 XOR-Decorrelators used in the Links by the PMD technique are not required for PM. Thus, the area is reduced approximately by a factor of 5 (from 717 to 141 equivalent gates). Moreover, PM does not incur the 210ps timing degradation in the Link. Finally, it should be noted that the complexity and delay of Exact Coding are orders of magnitude larger than those of PM or PMD.
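The evaluation flow used for the Exact-Code in this section (a two-dimensional histogram as JPD estimate, followed by the row-by-row sort and the cost of Eq. (4)) can be sketched in Python. This is not the authors' MATLAB script: the function names are hypothetical, and the clipped random walk used as input is a synthetic stand-in, not one of the signals of the experiment.

```python
import random


def popcount(v: int) -> int:
    return bin(v).count("1")


def estimate_jpd(samples: list[int], bits: int) -> list[list[float]]:
    """2-D histogram estimate: jpd[cur][prev] = P(X[n]=cur, X[n-1]=prev)."""
    n = 1 << bits
    counts = [[0] * n for _ in range(n)]
    for prev, cur in zip(samples, samples[1:]):
        counts[cur][prev] += 1
    total = len(samples) - 1
    return [[c / total for c in row] for row in counts]


def exact_code_cost(jpd: list[list[float]], bits: int) -> float:
    """Expected number of ones after Ep and the value-based mapping."""
    n = 1 << bits
    cost = sorted(popcount(c) for c in range(n))  # vbmCost staircase
    expected = 0.0
    for k in range(n):
        # Sorting one slice in decreasing order assigns rank 0 (cost 0)
        # to the most likely current value for this previous value k.
        column = sorted((jpd[x][k] for x in range(n)), reverse=True)
        expected += sum(p * cost[rank] for rank, p in enumerate(column))
    return expected


def uncoded_transitions(samples: list[int]) -> float:
    """Average number of bit transitions per sample without coding."""
    pairs = list(zip(samples, samples[1:]))
    return sum(popcount(a ^ b) for a, b in pairs) / len(pairs)


# Synthetic, slowly varying 8-bit signal (illustrative stand-in only).
rng = random.Random(0)
signal, value = [], 128
for _ in range(5000):
    value = min(255, max(0, value + rng.randint(-3, 3)))
    signal.append(value)

jpd = estimate_jpd(signal, bits=8)
print(uncoded_transitions(signal), exact_code_cost(jpd, bits=8))
```

On the estimated JPD the exact cost can never exceed the uncoded transition count, because plain transition signalling (x[n] XOR x[n−1]) is itself one of the decodable codes over which the exact code is optimal.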

5 Conclusions

This work has presented a thorough study of some practical and theoretical aspects related to the incorporation of low-power coding techniques into NoC systems with virtual channels.

From a practical point of view, a major result of this work is a novel alternative for low-power coding called PM. The architecture is based on a Probability-Coder in the Network Interface. Although it can be customised for different coders, the work has focused on a "corr+K0" code, which requires a minimum number of gates while providing a good switching reduction. The approach provides an average reduction in transitions at the data links of 34%, and 45% for multimedia signals. The technique maintains the same savings as PMD in terms of static power in the switch buffers. It achieves reductions of 22% in the average case and 32% for multimedia signals. Although PM is less effective than PMD (around 13%), the hardware complexity is reduced approximately by a factor of five.

From the theoretical point of view, this paper provides an analysis of Exact-Coding in probabilistic terms. It proves that for Markov Gaussian random variables the entropy is the key parameter that determines the achievable reduction in switching activity. The results establish a link with the ideal case of entropic coding.

References

1. Benini, L., Macii, A., Macii, E., Poncino, M., Scarsi, R.: Architectures and synthesis algorithms for power-efficient bus interfaces. IEEE Trans. on CAD 19, 969–980 (2000)
2. de Micheli, G., Benini, L.: Networks on chip: A new paradigm for systems on chip design. In: DATE 2002, Washington, DC, USA, p. 418. IEEE Computer Society, Los Alamitos (2002)
3. Ganguly, A., Pande, P., Belzer, B.: Crosstalk-Aware Channel Coding Schemes for Energy Efficient and Reliable NOC Interconnects. IEEE Trans. on VLSI 17(11), 1626–1639 (2009)
4. García Ortiz, A., Indrusiak, L.S., Murgan, T., Glesner, M.: Low-Power Coding for Networks-on-Chip with Virtual Channels. Journal of Low Power Electronics (JOLPE) 1(4), 77–84 (2009)
5. Lee, K., Lee, S., Yoo, H.: Low-Power Network-on-Chip for High-Performance SoC Design. IEEE Trans. on VLSI 14(02), 148–160 (2006)
6. Mullins, R.: Minimising dynamic power consumption in on-chip networks. In: Procs of the Intl. Symp. on System-on-Chip, Tampere, Finland (2006)
7. Palma, J.-C., Indrusiak, L., Moraes, F., García Ortiz, A., Glesner, M., Reis, R.: Adaptive Coding in Networks-on-Chip: Transition Activity Reduction Versus Power Overhead of the Codec Circuitry. In: Vounckx, J., Azémard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, pp. 603–613. Springer, Heidelberg (2006)
8. Ramprasad, S., Shanbhag, N., Hajj, I.: A coding framework for low-power address and data buses. IEEE Trans. on VLSI Systems 7, 212–221 (1999)
9. Sotiriadis, P.P., Tarokh, V., Chandrakasan, A.P.: Energy reduction in VLSI computation modules: an information-theoretic approach. IEEE Transactions on Information Theory 49(4), 790–808 (2003)

Logic Architecture and VDD Selection for Reducing the Impact of Intra-die Random VT Variations on Timing

Bahman Kheradmand-Boroujeni1,2, Christian Piguet1, and Yusuf Leblebici2

1 Integrated and Wireless Systems, Centre Suisse d’Electronique et de Microtechnique (CSEM), Neuchâtel, Switzerland 2 Microelectronic Systems Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

Abstract. We show that in logic circuits working at a supply voltage (VDD) below the nominal value, proper selection of logic architecture and VDD together can reduce the impact of device-to-device random process variations (PV) on timing. First we show that σ/μ of transistor current and delay strongly depends on VDD. Then we compare the PV sensitivity of Low-Power Slow (LP-S) and High-Power Fast (HP-F) architectures. The results suggest that, for a given technology and an equal power budget and delay, LP-S circuits working at higher VDD are about 1.8X less PV sensitive compared to HP-F circuits working at lower VDD.

Keywords: Low-Voltage, Low-Power, Process Variation, Random Variations, Statistical Variability, Flip-Flop, Digital VLSI.

1 Introduction

The primary motivation for low-voltage operation is to reduce energy per operation [1]. Nominal VDD is around 3×VT, where VT is the threshold voltage. This work addresses the design of low-power logic systems which have VDD below the nominal value. This includes the subthreshold and moderate inversion regimes. PV can be categorized into inter-die and intra-die variations. Inter-die variations are modeled by slow and fast process corners (SS, FF…). Intra-die variations can be systematic (correlated) or random (uncorrelated). For short-channel narrow-width transistors, which are used in logic gates, intra-die random variations account for more than 50% of the total variability for sub-90nm nodes [2, 3] and are expected to have a significantly greater influence in future technology generations [3].

1.1 Intra-die Random Variability

Intra-die device-to-device random variations can be due to Random Dopant Fluctuation (RDF) in the channel and in the source/drain regions near the channel edge, channel length variations (line edge roughness), oxide thickness variations, poly gate granularity [4], Boron clustering, and stress variations. These result in variations of the device VT, COX, W, L, and mobility. For low-voltage operation, VT variation is more pronounced since the drain "on" current depends more strongly on (VDD-VT). In the subthreshold region this dependency is exponential, while in strong inversion it goes down to the α-power law. Table 1 shows the measured random variability in several technology nodes. With scaling, VDD decreases while σVT tends to increase, which results in higher performance variation. Here all of the transistors have a polysilicon gate and a doped channel, except the ultra-thin body FD-SOI (L=25 nm) device, which uses a new high-k metal-gate technology and has an undoped channel. While RDF in the channel is known to be the major contributor to device mismatch [5, 6], the σVT=25mV of this undoped device clearly shows the importance of the other variability sources as well.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 170–179, 2011. © Springer-Verlag Berlin Heidelberg 2011

Table 1. Intra-die random variability in small bulk NMOS transistors

Technology (L Drawn)   Data              VDD      W        TOXE      Mean VT   Sigma VT
340 nm                 Foundry           3.3 V    360 nm   7.2 nm    439 mV    18 mV
240 nm                 Foundry           2.5 V    360 nm   6.0 nm    397 mV    21 mV
180 nm                 Foundry           1.8 V    240 nm   3.90 nm   366 mV    18 mV
90 nm                  Foundry           1.2 V    160 nm   2.95 nm   409 mV    31 mV
80 nm                  Foundry           1.2 V    120 nm   2.25 nm   300 mV    27 mV
60 nm                  Measurement [5]   1.2 V    140 nm   2 nm      --        29 mV
45 nm                  Measurement [6]   1.1 V    --       --        --        45 mV
25 nm (UTB-SOI)        Measurement [2]   1.0 V    60 nm    1.65 nm   480 mV    25 mV
35 nm                  Simulation [3]    0.85 V   --       0.88 nm   226 mV    30 mV
13 nm                  Simulation [3]    0.85 V   --       0.44 nm   226 mV    82 mV

1.2 Conventional PV Compensation Techniques

Chip-to-chip variations can, to some extent, be compensated by using circuit techniques like Adaptive Body Biasing (ABB) and Adaptive Supply Voltage (ASV). In [7] we have proposed a novel technique for compensating inter-die and regional variations in FPGA fabrics which does not use the body effect, is scalable, controls subthreshold and gate leakage together, and can be applied to all kinds of planar and emerging multi-gate devices. Unfortunately, none of these techniques can be used for compensating intra-die random variations. This is simply because it is not possible to measure the variations of each single transistor on the chip and to generate and apply the appropriate body, VDD, or source voltage to it. Increasing the size of the transistors is the most well-known technique for reducing device-to-device random variations. However, in digital gates this results in power and area overheads. In this paper we deal solely with intra-die device-to-device random variations.

2 Performance Degradation Due to Random PV versus VDD

It is known that PV becomes more pronounced as VDD decreases [8]. Fig. 1(a) shows the ratio of the standard deviation (σ) over the average value (μ) of the on current (i.e. Ids @ Vgs=Vds=VDD) versus VDD in the 80nm CMOS technology node. Here we have done Monte Carlo simulations using device matching models provided by a well-known foundry. The model version is BSIM4.3. These simulations include all components of intra-die random variations. In new technologies PV in NMOS is larger than in PMOS. As we see in this figure, sensitivity to PV goes down as VDD increases. Fig. 1(a) agrees with the equation presented in [8] for calculating the sensitivity to VT variations:

$$\frac{\sigma_I}{\mu_I} = \frac{\alpha}{(V_{DD} - V_T)} \times \sigma_{VT} \qquad (1)$$

Here they have assumed that σVT is quite small and α does not depend on (VDD-VT). Both assumptions are inaccurate.
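Equation (1) can be evaluated for a few supply voltages to see the trend. In the Python sketch below, VT and σVT are taken from the 80 nm row of Table 1, whereas α = 1.5 is an assumed value chosen only for illustration; the function name is hypothetical. Since both assumptions behind Eq. (1) are stated to be inaccurate, the numbers are first-order estimates and not a substitute for the Monte Carlo results of Fig. 1.

```python
def on_current_variability(vdd: float, vt: float, sigma_vt: float, alpha: float) -> float:
    """First-order sigma/mu of the on current according to Eq. (1)."""
    if vdd <= vt:
        raise ValueError("Eq. (1) is only meaningful for VDD above VT")
    return alpha * sigma_vt / (vdd - vt)


VT = 0.300        # V, mean VT of the 80 nm node in Table 1
SIGMA_VT = 0.027  # V, sigma VT of the 80 nm node in Table 1
ALPHA = 1.5       # assumed alpha-power-law exponent (illustrative)

for vdd in (0.5, 0.6, 0.8, 1.0, 1.2):
    ratio = on_current_variability(vdd, VT, SIGMA_VT, ALPHA)
    print(f"VDD = {vdd:.1f} V -> sigma/mu = {100 * ratio:.1f} %")
```

The estimate decreases monotonically with VDD, which is the qualitative behaviour the text attributes to Fig. 1(a).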

Fig. 1. Intra-die random process variation effects in 80nm node versus VDD: (a) transistor on current; (b) minimum size inverter delay; (c) 19 Nand2 ring oscillator; (d) inverter leakage at Tj=65°C

Fig. 1(b) shows the σ/μ ratio of the inverter delay. These curves are quite similar to Fig. 1(a). Since PV in NMOS is larger than in PMOS, the variation in output fall time is higher than the variation in rise time. However, both decrease as VDD increases. In Fig. 1(c) we can see a similar trend for the period of a ring oscillator consisting of 19 Nand2 gates. Dynamic power is always proportional to the square of VDD. Leakage current increases with VDD due to DIBL. This is shown in Fig. 1(d).

To the best of our knowledge, nobody has studied how this fact can be used for selecting the optimum VDD and logic architecture to minimize PV effects.

3 Proposed Idea

In VLSI design, engineers usually design chips using an available design-kit (technology). Maximum power consumption and required performance (clock frequency) are given by the specification. So in most cases, logic architecture and supply voltage (VDD) are the only degrees of freedom for the designers.

Several architectures are available for each logic function. For example, the Ripple Carry Adder (RCA) and the Carry Select adder (CSL) do the same job, but CSL is much faster while RCA consumes less power. Usually there is a tradeoff between power and delay. We may think that at design level we can usually select between Low-Power Slow (LP-S) architectures and High-Power Fast (HP-F) architectures.

On the other hand, in the low-voltage domain, speed can be significantly improved by increasing VDD. It means that an RCA working at a higher VDD value can work as fast as a CSL adder working at a lower VDD. Clearly, increasing VDD for the RCA increases its leakage and dynamic power as well. In summary, we may think that using LP-S architectures at higher VDD can result in almost equal speed and power compared to using HP-F architectures at lower VDD. Now the question is: which one will be less sensitive to intra-die random PV? Fig. 1 suggests that the answer is LP-S architectures at higher VDD.

Clearly, if for a particular function structure A results in lower power and higher speed compared to structure B, the choice will always be A. On the other hand, for some simple gates like the Inverter or the Nand gate, different architectures do not exist. Fortunately, when we go from gate level to top level, e.g. Nand → flip-flop → State Machine → CPU design, the number of design choices and options increases rapidly.

4 Simulation Results

To verify this idea we selected three different logic blocks (16-bit equality comparator, flip-flop, and 16-bit adder) and one synthesis-level example, Finite State Machine (FSM) encoding. The HP-F architectures are the parallel comparator, the Sense Amplifier flip-flop (SA), the CSL, and one-hot encoding. The LP-S circuits are the Pre-Evaluation comparator, the Conditional Charge flip-flop (CC), the RCA, and binary encoding. For detailed information about these circuits see Section 6.

Monte-Carlo simulation results for the gates are shown in Fig. 2. Since gate delay σ/μ decreases with increasing critical path length, the adder delay σ/μ is smaller than the comparator delay σ/μ, and that of the comparator is smaller than the inverter delay σ/μ in Fig. 1(b). Similarly, LP-S gates have smaller σ/μ than HP-F gates at equal VDD because LP-S gates have a longer critical path. In Fig. 1 and 2, the y-axis is logarithmic.

Table 2 compares the dynamic energy per operation (Dynamic Eng.), leakage power, maximum delay, and random PV sensitivity (σ/μ of delay) of HP-F and LP-S architectures at 500mV and 600mV. Values are normalized to HP-F at 500mV. The dynamic power shown here for the flip-flop is the power of the flip-flop itself plus that of the clock tree. We have assumed that 10% of the logic area is occupied by the flip-flops and that the switching activity of the Din input is 10%. For the 16-bit comparator and adder we have applied a random input pattern.
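The statement that σ/μ of delay decreases with critical path length can be illustrated with a small Monte Carlo experiment in Python. The sketch rests on simplifying assumptions that are not taken from the paper: gate delays are modelled as independent, identically distributed Gaussian variables, whereas the real low-voltage delay distribution is skewed, and the function name and parameter values are hypothetical. Under these assumptions the σ/μ of a path of N gates scales as 1/√N.

```python
import random
import statistics


def path_sigma_over_mu(n_gates: int, gate_mu: float, gate_sigma: float,
                       trials: int, rng: random.Random) -> float:
    """sigma/mu of the total delay of a chain of n_gates independent gates."""
    delays = [
        sum(rng.gauss(gate_mu, gate_sigma) for _ in range(n_gates))
        for _ in range(trials)
    ]
    return statistics.pstdev(delays) / statistics.mean(delays)


rng = random.Random(1)
for n in (1, 4, 16):
    ratio = path_sigma_over_mu(n, gate_mu=1.0, gate_sigma=0.1, trials=4000, rng=rng)
    print(f"{n:2d} gates: sigma/mu = {100 * ratio:.2f} %")
```

This averaging effect only applies to uncorrelated (random intra-die) variations, because fully correlated variations shift all gate delays in the same direction.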

Fig. 2. Impact of Intra-die random process variations on various logic block delays in 80nm: (a) 16-bit Comparator; (b) Flip-Flop; (c) 16-bit Adder

As we see in Table 2, at 500mV LP-S architectures are about 2X slower than HP-F, but σ/μ of delay is about 25% smaller due to the longer critical path length. When we compare HP-F at 500mV and LP-S at 600mV, we see that LP-S architectures are about 10% faster, consume less power, and are 1.8X less sensitive to intra-die random process variations. By comparing LP-S at 500mV and LP-S at 600mV it is clear that 28% of this improvement is due to the higher VDD.

It is not clear which design results in the smaller area. While the CC and the pre-evaluation comparator are slightly bigger than the SA and the parallel comparator, respectively, the RCA is much smaller than the CSL. The static leakage currents of the LP-S circuits RCA and pre-evaluation comparator are lower than those of the HP-F circuits CSL and parallel comparator, respectively. But the static leakage current of the CC is higher than that of the SA. Roughly, we can say that using LP-S at 600mV results in equal power and area compared to using HP-F at 500mV, but PV sensitivity is reduced by 1.8X.

On the other hand, to compare transistor sizing with the proposed method, we can say that if we wanted to apply transistor sizing to reduce the PV sensitivity of HP-F architectures by 1.8X, we would have to increase the transistor width×length (W×L) by 3.24X, because σVT = AVT/(W×L)^0.5. Transistor sizing and gate sizing improve the performance and PV sensitivity of all logic circuits but increase the area, leakage, and dynamic power as well. This is orthogonal to our technique. Sizing can be applied to both LP-S and HP-F architectures to reduce σVT independently of VDD.
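The 3.24X figure follows directly from the area dependence of σVT. As a worked step, assuming to first order that the delay σ/μ scales proportionally with σVT (an assumption made here for the illustration):

```latex
\sigma_{VT} = \frac{A_{VT}}{\sqrt{W L}}
\quad\Longrightarrow\quad
\frac{(W L)_{\mathrm{new}}}{(W L)_{\mathrm{old}}}
  = \left( \frac{\sigma_{VT,\mathrm{old}}}{\sigma_{VT,\mathrm{new}}} \right)^{2}
  = 1.8^{2} = 3.24
```

The quadratic dependence is what makes sizing an expensive way to obtain the same reduction in sensitivity.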


Table 2. Comparing PV Sensitivity of HP-F @500mV and LP-S @600mV

Equality Comparator     Parallel @500mV: HP-F        Pre-Eval. @500mV: LP-S       Pre-Eval. @600mV: LP-S
Dynamic Eng.            1                            0.60                         0.87
Leakage Pow.            1                            0.40                         0.54
Area                    1                            1.06                         1.06
Delay                   1                            1.97                         0.90
σ/μ (Delay)             9.1%                         7.5%                         5.1%

Flip-Flop               SA @500mV: HP-F              CC @500mV: LP-S              CC @600mV: LP-S
Dynamic Eng.            1                            0.61                         0.88
Leakage Pow.            1                            1.17                         1.58
Area                    1                            1.25                         1.25
Delay                   1                            1.89                         0.82
σ/μ (Delay)             15.0%                        10.8%                        7.4%

Adder                   CSL @500mV: HP-F             RCA @500mV: LP-S             RCA @600mV: LP-S
Dynamic Eng.            1                            0.67                         0.94
Leakage Pow.            1                            0.65                         0.87
Area                    1                            0.68                         0.68
Delay                   1                            2.09                         0.96
σ/μ (Delay)             8.1%                         5.8%                         4.6%

Finite State Machine    One-hot & SA @500mV: HP-F    Binary & CC @500mV: LP-S     Binary & CC @600mV: LP-S
Encoding
Dynamic Eng.            1                            0.68                         0.98
Leakage Pow.            1                            0.75                         1.0
Area                    1                            ~1                           ~1
Delay                   1                            1.78                         0.84
σ/μ (Delay)             7.2%                         5.4%                         3.94%
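The 1.8X figure quoted in the text can be reproduced from the σ/μ rows of Table 2. The following Python sketch simply re-reads those rows; the dictionary layout and function name are choices made for this illustration.

```python
# sigma/mu of delay in percent, from Table 2:
# (HP-F @500mV, LP-S @500mV, LP-S @600mV)
SIGMA_OVER_MU = {
    "comparator": (9.1, 7.5, 5.1),
    "flip-flop": (15.0, 10.8, 7.4),
    "adder": (8.1, 5.8, 4.6),
    "fsm": (7.2, 5.4, 3.94),
}


def sensitivity_improvement(block: str) -> float:
    """Ratio of HP-F @500mV to LP-S @600mV delay sigma/mu."""
    hp_f_500, _, lp_s_600 = SIGMA_OVER_MU[block]
    return hp_f_500 / lp_s_600


ratios = {block: sensitivity_improvement(block) for block in SIGMA_OVER_MU}
for block, ratio in ratios.items():
    print(f"{block:10s} {ratio:.2f}X")
print(f"mean       {sum(ratios.values()) / len(ratios):.2f}X")
```

The individual ratios lie between roughly 1.76X and 2.03X, and their average is about 1.85X.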

5 Discussions

It is not possible to prove or guarantee this idea for all possible logic circuits, because there is no general algorithm which can generate LP-S and HP-F architectures for all logic functions and predict the power and delay of each one. However, since this idea is based on the intrinsic characteristic of the transistor as shown in Fig. 1(a), and transistors are the building blocks of all logic gates and blocks, this idea seems to be correct whenever the choice between LP-S and HP-F exists.

The idea that we proposed here is based on Fig. 1, in which we have Vgs=VDD in the on condition and Vgs=0 in the off condition. This is true for all logic styles except single-transistor-switch Pass Transistor Logic (PTL), in which the NMOS charges the internal nodes up to (VDD-VT) in source follower configuration. This kind of PTL has never been used because it does not provide full swing, needs level restoration, cannot be modeled in HDL easily, and is very PV sensitive. But Transmission-Gate (TG) PTL, which has one NMOS and one PMOS in each switch, is compatible with Fig. 1 because the NMOS discharges the internal nodes and the PMOS charges them, both with Vgs=VDD. TG-PTL is especially interesting for multiplexer design. Today all of the available standard cell libraries use complementary logic, which has separate pull-down and pull-up networks (PDN and PUN), and both have Vgs=VDD in the on and Vgs=0 in the off condition, so they are compatible with the proposed idea.

The proposed method reduces sensitivity to inter-die variations as well, but less than it reduces sensitivity to random variability. The longer critical path of the LP-S architecture does not result in less sensitivity to inter-die variations, because in this case all of the transistors change in the same way and the variations do not cancel each other. However, the improvement due to higher VDD works in this case as well. We have to note that inter-die variations can be reduced by better control of the fabrication process in future. But there is no theoretical solution for the random variability as long as we are doping the channel and S/D junctions and using sub-wavelength lithography.

6 Details of the LP-S and HP-F Circuits

For RCA and CSL adders please see [9]. In digital circuits with flip-flop based registers, the minimum clock period is:

$$\tau_{clk,min} = \tau_{cq} + \tau_{logic} + \tau_{su} \qquad (2)$$

where τcq is the flip-flop clock-to-output delay and τsu is the flip-flop setup time. Since τcq and τsu contribute to the total delay in the same way, the delay reported in Fig. 2(b) and Table 2 for the flip-flops is τcq+τsu of two successive flip-flops.

The SA flip-flop is shown in Fig. 3. The setup time of this flip-flop is very small (one inverter delay), but in every clock cycle XL and XR are charged and one of them will be discharged, so the power consumption is quite high. N0 turns off M0 at the start of the evaluation phase to stop the race current between the left and right branches. In some older publications N0 does not exist, M0 has a long channel length, and its gate terminal is tied to VDD. This certainly reduces the power consumption, but the flip-flop functionality then depends on sizing and on the transistors' on-resistance. This means that (without N0) SA fails to work at low voltage in the presence of intra-die PV. So we used the circuit shown in Fig. 3.

The Conditional Charge Flip-Flop (CC) is also shown in Fig. 3. We have designed this flip-flop by adding the conditional charge transistors (M3,4L/R) to the Race-Free NAND-based DFF which we had proposed in [10]. During the pre-charge phase (clk=0), the internal nodes will be charged only if the input data has changed. Since in a typical digital system the switching activity of internal signals is much lower than that of the clock, this simple idea can save a lot of power. However, the setup time of this flip-flop is quite long (inverter delay + charge time of XL/R nodes + NAND delay).

One may think that we could apply these conditional charge transistors to the SA flip-flop. If we do this and a short glitch pulse happens on the Din input during the pre-charge phase, then the XL/R node can be charged to an intermediate voltage level (e.g. VDD/2) and a static short circuit current flows through N0 for up to half a clock cycle. But no short circuit current can happen during the pre-charge cycle in N1L/R in the proposed CC flip-flop due to the Q/QB feedback. For example, during the pre-charge cycle, if XL=0 and XR=VDD, it means that in the previous cycle Din had been zero. So Q=0, QB=VDD, and M4R keeps XR high. Since one zero is enough to turn off the NMOS stack in the NAND gate, even if a glitch charges XL to an intermediate voltage level, no short circuit current happens in N1L. Since Q=clk=0 and XR=QB=VDD, no short circuit current can happen in N1R either.

Fig. 3. Sense Amplifier flip-flop (SA) (HP-F) on the left and proposed Conditional Charge flip-flop (CC) (LP-S) on the right side

When the clock goes high, depending on the Din value, XL or XR goes low and the positive feedback loop (through N2L/R, M5L/R, and M6L/R) stores the Din value and stops the XL/R nodes from changing further if Din changes again during the clk=VDD period.

The last example shown in Table 2 is about Finite State Machine (FSM) synthesis. Various FSM encoding techniques (e.g. one-hot, Gray, Johnson, Binary…) result in different performance and power consumption. One-hot is very power consuming because each state is represented by one flip-flop. Since there is no extra hardware for decoding the present state or encoding the next state signals in the one-hot combinational logic part, one-hot seems to be the fastest FSM style. On the other hand, highly-encoded techniques, e.g. binary (sequential), have the minimum number of flip-flops, so they are low-power. But they need wide functions in the combinational part, so they are slow. Here we assumed that flip-flops are the dominant source of power in the FSM.

In Table 2 we compared two generic FSMs with 14 states. The first one, which is HP-F, has one-hot encoding and uses SA flip-flops, and we assumed a series of 20 Nand2 gates for the longest signal path in the next state logic. Since FSM power strongly depends on the application, we assumed that 50% of the total leakage and dynamic power is related to the flip-flops and 50% is in the next state and output combinational logic. For LP-S we used binary encoding, CC flip-flops, and a series of 20 Nand2 gates for the longest signal path in the next state logic, the same as for HP-F. Since here 14 states are coded into 4 flip-flops, we added a 4:16 decoder at the output of the state flip-flops to model the extra hardware required for identifying the present state, and we added a 16:4 encoder at the flip-flop inputs to model the extra hardware required for encoding the next state. Since we simply added a decoder/encoder to the output/input of the state flip-flops, the combinational part of both FSMs will be the same.
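The flip-flop counts behind the two FSM variants can be reproduced with a few lines of Python. This is only a sketch of the state-register sizing; the function names are hypothetical, and the decoder/encoder and next-state logic discussed in the text are not modelled.

```python
def state_flip_flops(n_states: int, encoding: str) -> int:
    """Number of state flip-flops needed for an FSM with n_states states."""
    if n_states < 1:
        raise ValueError("an FSM needs at least one state")
    if encoding == "one-hot":
        return n_states  # one flip-flop per state
    if encoding == "binary":
        # Smallest register width able to hold codes 0 .. n_states - 1.
        return max(1, (n_states - 1).bit_length())
    raise ValueError(f"unknown encoding: {encoding}")


def state_code(state: int, encoding: str) -> int:
    """Register contents for a given state index."""
    return 1 << state if encoding == "one-hot" else state


print(state_flip_flops(14, "one-hot"))  # 14
print(state_flip_flops(14, "binary"))   # 4
```

For the 14-state example this gives 14 flip-flops for one-hot and 4 for binary encoding, which is why a 4:16 decoder and a 16:4 encoder are used to model the extra binary hardware.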
In reality, this decoder and encoder could be merged into the combinational logic to optimize the LP-S FSM. Therefore, Table 2 shows the worst case for LP-S. Since in a one-hot FSM each flip-flop represents a single state, that flip-flop can be placed near the combinational logic related to that state. This results in short interconnects. In a binary FSM each flip-flop is linked to many states and logic blocks, so we have longer interconnects and more capacitive load, which need buffer gates to drive them. We put one buffer on each present-state and next-state signal in the binary FSM. Each buffer has a delay equal to two Nand2 delays.

Fig. 4. Pre-evaluation comparator on the left. Submitted tape-out on the right side.

The pre-evaluation equality comparator is shown in Fig. 4. It saves power based on the simple idea that, when comparing two 16-bit numbers, if A15:12 and B15:12 are not equal we do not need to compare A11:0 and B11:0. In this situation M0 turns off X11:0 and N2:0, but N4 still works properly because Eq3=0. If A15:12 and B15:12 are equal, then all sixteen bits are compared. The parallel comparator (HP-F) has exactly the same architecture, except that there is no M0 and the VSS terminal of X11:0 and N2:0 is directly connected to ground, so all XOR gates work concurrently. The parasitic capacitance of the input A/B15:0 interconnects is an important contributor to the total dynamic power and is not under M0 control. This has been included in the Table 2 values.
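The pre-evaluation idea can be illustrated with a purely behavioral sketch. It is not a model of the transistor-level circuit of Fig. 4; the function name `pre_eval_equal` and its return convention are hypothetical and only mirror the two-step comparison described in the text.

```python
def pre_eval_equal(a: int, b: int):
    """Behavioral sketch of a 16-bit pre-evaluation equality comparator.

    Returns (equal, low_bits_evaluated). The second flag tells whether
    bits 11:0 had to be compared at all (hypothetical bookkeeping).
    """
    a &= 0xFFFF
    b &= 0xFFFF
    # Step 1: bits 15:12 are always compared.
    if (a >> 12) != (b >> 12):
        # The result is already known, so the low-order XOR gates stay off.
        return False, False
    # Step 2: only when the top four bits match are bits 11:0 compared.
    return (a & 0x0FFF) == (b & 0x0FFF), True


# Fraction of 4-bit pairs that force the low-order comparison,
# assuming independent, uniformly distributed inputs (an assumption).
matches = sum(1 for x in range(16) for y in range(16) if x == y)
low_bits_activity = matches / 256
```

Under the uniform-input assumption the low-order twelve bits are evaluated in only one case out of sixteen; real data streams can of course behave very differently.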

Conclusion. Random variations increase with scaling. Selecting VDD and the logic architecture together can reduce intra-die PV sensitivity by about 1.8X. Our results suggest that, to reduce the effect of intra-die statistical VT variation on timing, designers should first look for very low-power architectures and then raise VDD to reach the desired performance.

Acknowledgement. This research has been supported in part by the CCMX program of the Swiss Confederation; under the project title “MMNS: Materials, devices, and design technologies for nanoelectronic systems beyond ultimately scaled CMOS”.

References

1. Vittoz, E.: Weak Inversion for Ultimate Low-Power Logic. In: Piguet, C. (ed.) Low-Power Electronics Design, ch. 16. CRC Press, Boca Raton (2004)
2. Weber, O., Faynot, O., Andrieu, F., Buj-Dufournet, C., Allain, F., Scheiblin, P., Foucher, J., Daval, N., Lafond, D., Tosti, L., Brevard, L., Rozeau, O., Fenouillet-Beranger, C., Marin, M., Boeuf, F., Delprat, D., Bourdelle, K., Nguyen, B.-Y., Deleonibus, S.: High immunity to threshold voltage variability in undoped ultra-thin FDSOI MOSFETs and its physical understanding. In: IEEE International Electron Devices Meeting (IEDM), pp. 1–4 (2008)
3. Reid, D., Millar, C., Roy, G., Roy, S., Asenov, A.: Analysis of Threshold Voltage Distribution Due to Random Dopants: A 100 000-Sample 3-D Simulation Study. IEEE Transactions on Electron Devices 56(10), 2255–2263 (2009)
4. Cathignol, A., Cheng, B., Chanemougame, D., Brown, A.R., Rochereau, K., Asenov, A.: Quantitative Evaluation of Statistical Variability Sources in a 45-nm Technological Node LP N-MOSFET. IEEE Electron Device Letters 29(6), 609–611 (2008)
5. Tsunomura, T., Nishida, A., Yano, F., Putra, A.T., Takeuchi, K., Inaba, S., Kamohara, S., Terada, K., Hiramoto, T., Mogami, T.: Analyses of 5σ Vth fluctuation in 65nm-MOSFETs using Takeuchi plot. In: Symposium on VLSI Technology, pp. 156–157. IEEE Press, Los Alamitos (2008)
6. Kuhn, K.J.: Reducing Variation in Advanced Logic Technologies: Approaches to Process and Design for Manufacturability of Nanoscale CMOS. In: IEEE International Electron Devices Meeting (IEDM), pp. 471–474 (2007)
7. Kheradmand-Boroujeni, B., Piguet, C., Leblebici, Y.: AVGS-Mux style: A novel technology and device independent technique for reducing power and compensating process variations in FPGA fabrics. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 339–344 (2010)
8. Abu-rahma, M.H., Anis, M.: Variability in VLSI Circuits: Sources and Design Considerations. In: Proc. of IEEE International Symposium on Circuits and Systems (ISCAS), pp. 3215–3218 (2007)
9. Yeo, K.S., Roy, K.: Low-Voltage Low-Power Adders. In: Low-Voltage, Low-Power VLSI Subsystems, ch. 3, pp. 72–83. McGraw-Hill, New York (2005)
10. Piguet, C., Masgonty, J.M., Arm, C.: D-Type Master-Slave Flip-Flop. In: US Patent No. 6323710 B1, filed (November 1999)

Impact of Process Variations on Pulsed Flip-Flops: Yield Improving Circuit-Level Techniques and Comparative Analysis

Marco Lanuzza, Raffaele De Rose, Fabio Frustaci, Stefania Perri, and Pasquale Corsonello

Department of Electronics, Computer Science and Systems, University of Calabria, Arcavacata di Rende, 87036 Rende (CS) {lanuzza,derose,ffrustaci,perri}@deis.unical.it, [email protected]

Abstract. Process variations cause unpredictability in the speed and power characteristics of nanometer CMOS circuits, impacting the timing and energy yields. In this paper, transistor reordering and dual-Vth techniques are evaluated regarding their efficiency in mitigating the impact of process variations on a set of pulsed flip-flops. It is shown that the joint use of these techniques can improve delay, energy and EDP yields by more than 1.98X, 1.62X and 1.99X, respectively. The yield-optimized flip-flop circuits are also comparatively analyzed to identify the best topologies.

1 Introduction

The rapid scaling of silicon technology has enabled designers to integrate millions and even billions of transistors into a single chip. This ability to achieve very high integration density has contributed to the success of integrated circuit (IC) design during the past few decades. Unfortunately, technology scaling has led to a significant increase in process variability due to random doping effects, imperfections in lithographic patterning of small devices, and related effects [1]. Process variations can cause significant uncertainty in the speed and power characteristics of ICs. Due to the inverse relationship between power and delay, the fastest chips in a lot may present unacceptable power dissipation, whereas low-power chips can be too slow [2]. This significantly reduces the parametric yield in advanced process technologies (such as the 65-nm and 45-nm technology nodes) [3]. Moreover, the yield loss will become very critical in future technologies, where physical device parameters will approach the atomic scale and will hence be subject to atomic uncertainties [1].

In this paper, we address the influence of random process variations on the timing and energy yield of pulsed flip-flops (FFs). These were chosen as a case study since they are very critical elements in the design of high-speed microprocessors, due to their high impact on the delay and energy characteristics of the whole system [4], [5]. FFs targeted at high-speed applications in energy-constrained environments are conventionally sized to optimize the energy-delay product (EDP) [6]. However, due to random process variations, a large number of circuits might not meet the targeted EDP constraint. This is shown intuitively in Fig. 1. Under process variations, the EDP distribution of a given circuit can be modeled by a normal distribution with mean value μ and standard deviation σ [1]. Considering FFs conventionally optimized for minimum EDP, only 50% of the total number of circuits would meet the target constraint. In order to achieve a higher yield, statistical sizing approaches, which use statistical information to estimate the sensitivity to process variations, can be used. In [7] a gate sizing algorithm is proposed to improve the timing yield of clocked storage elements. The desired timing yield is achieved by iteratively increasing transistor sizes on the basis of statistical simulation results. As shown in [7], this approach can lead to non-negligible power and area overheads.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 180–189, 2011. © Springer-Verlag Berlin Heidelberg 2011

Fig. 1. The EDP probability density function (pdf) due to process variations

In this work, simple circuit-level techniques to mitigate the impact of process variations on pulsed FFs are evaluated, namely transistor reordering [8] and the use of dual threshold voltage (dual-Vth) transistors [9]. Both approaches can be applied at design time without requiring any extra devices or architectural modifications, so they can easily be used in conjunction with other techniques (such as that proposed in [7]). As demonstrated in the following, the timing and energy yields of FFs can be improved concurrently by jointly exploiting the transistor reordering and dual-Vth techniques, without any extra area requirement. Experiments have been performed on four state-of-the-art pulsed FF topologies, designed using the STMicroelectronics 45-nm 1V CMOS technology. Furthermore, a comparative analysis of the FF structures has been carried out. Unlike the study presented in [6], where the impact of process variability was analyzed on FF circuits conventionally optimized for minimum EDP, we performed a comparative analysis of yield-improved circuit structures.

This paper is organized as follows. In Section 2, the analyzed pulsed FF topologies are briefly reviewed and the adopted simulation setup is discussed. Section 3 deals with the circuit techniques implemented to improve robustness against process variability. A comparative analysis of the obtained results is provided in Section 4. Finally, conclusions are drawn in Section 5.

2 Pulsed Flip-Flop Topologies and Simulation Methodology

In this work, four representative pulsed FF topologies widely used in high-performance processors were selected as case studies. Fig. 2.a shows the Hybrid-Latch Flip-Flop (HLFF), used in the AMD K6 and K7 processors. This hybrid circuit is particularly fast. However, due to its pre-charged structure, this design is associated with considerable power consumption [4]. An improved design is the Conditional Precharge Flip-Flop (CPFF), depicted in Fig. 2.b. This circuit overcomes the problem of glitches at the output, thus reducing dynamic power consumption. This is accomplished by appropriately inserting keeper elements and introducing a conditional precharge technique to prevent unnecessary transitions [10]. Another interesting hybrid design is the Semi-Dynamic Flip-Flop (SDFF), shown in Fig. 2.c. This circuit achieves very high speed at the expense of considerable energy consumption, mainly due to the switching activity of the clock pulse generator and to the highly loaded dynamic internal node. A more advanced semi-dynamic flip-flop implementation is the UltraSPARC Semi-Dynamic Flip-Flop (USDFF), shown in Fig. 2.d. The improvement with respect to the SDFF topology mainly consists in using a conditional keeper on the dynamic internal node. It was demonstrated that this significantly reduces the energy consumption [11].

Fig. 2. Analyzed Flip-Flops: (a) HLFF [4] (b) CPFF [10] (c) SDFF [11] (d) USDFF [11]

In a first phase, all the FF circuits were deterministically sized for optimal EDP. Since the number of transistors in a single topology ranges from 22 to 26, appropriate circuit simplifications were introduced to make the transistor sizing optimization manageable. Transistors that do not affect the FF performance (shown as * in Fig. 2) were minimum sized to limit the energy consumption. The remaining devices were iteratively sized, imposing equal width for series-connected transistors [12]. The iterations were performed until the optimum EDP was obtained.

Fig. 3 shows the simulation setup used in this work. Input buffers are placed between the ideal voltage sources and the data and clock inputs to provide realistic input signals. The data input buffer is minimum sized, whereas the clock input buffer is symmetrically sized to keep a constant clock slope equal to FO2 [13], as adopted in real designs. The output of a given FF is loaded with a 12 fF capacitance. This value was chosen by analyzing the capacitive loads optimally driven by FFs of different strengths belonging to the commercial STM 45-nm standard cell library. We supposed that the generic FF circuit should act as a standard cell with X9 drive strength. For this reason we analyzed the behavior of the FFs with the adjacent strengths, X4 and X18. Then we found the load capacitance range for which the X9 flip-flop is optimized. From Fig. 4, it can be seen that 12 fF is the middle of the capacitive load range for which the X9 strength is preferable to the adjacent ones. This choice allows realistic operating conditions to be examined.

Fig. 3. The simulation setup

Fig. 4. Load capacitance analysis (100°C, 1V)

The impact of process variations (including the mismatch between transistors) was evaluated through Monte Carlo (MC) simulations performed on 1000 samples. In the MC simulations, the nominal 1V power supply voltage, a temperature of 100°C, a clock frequency of 1 GHz and pseudorandom input data with a 25% activity rate [4] were considered. In our tests, the data and clock buffers are not affected by random process variations. The flip-flop delay considered in this study is the data-to-output delay (TDQb) [14], which includes both the worst clock-to-output delay (TCQb) and the setup time (Tsetup). The latter is usually defined as the data-to-clock offset that corresponds to a 10% increase in the clock-to-output delay [14]. Since the setup time can be strongly influenced by process variations, particular attention was given to determining the data-to-clock offset to be used in the MC analysis. To this end, the mean value and the standard deviation of the setup time were evaluated through appropriate parametric MC simulations. The data-to-clock offset was then set to the 3-sigma setup value (i.e. (μ+3σ)s) in the subsequent MC simulations used for evaluating the FF delay. In this way a setup-time margin is introduced, which ensures that more than 99.7% of the performed MC runs satisfy the constraint of a less than 10% increase in the clock-to-output delay.
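As a minimal sketch of how the 3-sigma setup value could be derived from a set of MC setup-time samples, the snippet below computes μ, σ and μ+3σ, and the one-sided coverage of that bound under a normal assumption. The sample values and the helper name `three_sigma_setup` are hypothetical; they are not data or tooling from the paper.

```python
import math
import statistics


def three_sigma_setup(setup_samples_ps):
    """Return (mu, sigma, mu + 3*sigma) for setup times given in ps."""
    mu = statistics.mean(setup_samples_ps)
    sigma = statistics.stdev(setup_samples_ps)  # sample standard deviation
    return mu, sigma, mu + 3.0 * sigma


# Hypothetical setup-time samples in ps (illustration only).
samples = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, sigma, offset = three_sigma_setup(samples)

# One-sided coverage of mu + 3*sigma if setup times are normally distributed:
# standard normal CDF evaluated at z = 3.
coverage = 0.5 * (1.0 + math.erf(3.0 / math.sqrt(2.0)))
```

Under the normal assumption the one-sided coverage is roughly 99.87%, which is consistent with the "more than 99.7%" bound stated above.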

3 Circuit-Level Techniques to Improve Yield

In this section, two circuit-level techniques are evaluated that can help reach the desired yield in terms of delay and/or energy without sacrificing area.

Transistor reordering is a well-known technique that can be used to optimize circuit delay and power dissipation. Appropriate transistor ordering can minimize the switching activity at internal nodes, thus reducing the dynamic power consumption [1]. Moreover, transistor reordering can reduce the critical path delay. Placing the critical-path transistor (i.e. the transistor driven by the input signal that is the last to reach a stable value) closer to the output of the gate can result in reduced gate delay. It was demonstrated that this approach also improves the delay yield of basic logic gates [8]. As part of this work, transistor reordering was applied to the pull-down network (PDN) of both stages of the analyzed FF structures. Table 1 shows the six possible PDN transistor configurations. Each transistor combination is listed in ascending order from the ground to the output node.

Table 1. PDN transistor ordering (in brackets the transistors belonging to the PDN of the second stage)

PDN transistor ordering   SDFF-USDFF                  HLFF-CPFF
Configuration1 (C1)       MCLK-MD-MICLK (MCLK-MX)     MCLK-MD-MICLK (MCLK-MX-MICLK)
Configuration2 (C2)       MCLK-MICLK-MD (MCLK-MX)     MCLK-MICLK-MD (MCLK-MICLK-MX)
Configuration3 (C3)       MD-MCLK-MICLK (MX-MCLK)     MD-MCLK-MICLK (MX-MCLK-MICLK)
Configuration4 (C4)       MD-MICLK-MCLK (MX-MCLK)     MD-MICLK-MCLK (MX-MICLK-MCLK)
Configuration5 (C5)       MICLK-MCLK-MD (MCLK-MX)     MICLK-MCLK-MD (MICLK-MCLK-MX)
Configuration6 (C6)       MICLK-MD-MCLK (MX-MCLK)     MICLK-MD-MCLK (MICLK-MX-MCLK)
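The six first-stage orderings of Table 1 are simply the permutations of a three-transistor series stack. A short sketch, in which the helper name `pdn_orderings` is hypothetical, enumerates them and happens to reproduce the C1 to C6 labelling of the table:

```python
from itertools import permutations

# Transistor names follow Table 1; each ordering runs from ground to output.
FIRST_STAGE = ("MCLK", "MD", "MICLK")


def pdn_orderings(stack):
    """Return every ordering of a series stack, labelled C1, C2, ..."""
    return {
        f"C{i}": "-".join(order)
        for i, order in enumerate(permutations(stack), start=1)
    }


configs = pdn_orderings(FIRST_STAGE)
for name, order in configs.items():
    print(name, order)
```

The match with the table relies on `itertools.permutations` emitting tuples in the order of the input positions; a different input order would give the same six stacks under different labels.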

Table 2 presents the obtained results in terms of setup-time margin and of the mean and standard deviation of delay, energy and EDP. As expected, transistor reordering significantly influences the 3-sigma setup value. Comparing the worst and the best delay values of the analyzed configurations, it can be seen that the mean delay of the USDFF improves by up to 20%, whereas that of the CPFF improves by up to 28%. At the same time, an average variation of about 30% in mean energy can be observed, except for the SDFF, which shows a mean energy variation of about 18%.

Considering the mean TDQb values, it can be observed that the most favorable configurations are those in which the data-related signals (i.e. D for the first stage and X for the second stage) drive transistors closer to the output node. Those configurations also achieve the minimum standard deviation of the delay. From the results in Table 2, it can also be concluded that the best input ordering in terms of energy mean and standard deviation is that in which the input signals with the highest probability of being at logic one (in this case CLK and ICLK) are positioned far from the output node. This is due to the minimization of the switching activity of the internal nodes [15]. In contrast, for the SDFF and USDFF circuits, which are more susceptible to leakage power (due to the reduced stack effect in the PDN of the second stage), the design rule given in [15] is not fully respected.
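The quoted best-versus-worst delay improvements can be checked against the mean delays of Table 2. The sketch below assumes the improvement is expressed relative to the worst configuration, which is not stated explicitly in the text but reproduces the quoted figures; the helper name is hypothetical.

```python
# Mean data-to-output delay (ps) for configurations C1..C6, from Table 2.
MEAN_DELAY_PS = {
    "SDFF":  [44.89, 38.99, 49.01, 49.84, 39.43, 47.10],
    "USDFF": [46.64, 40.77, 50.36, 51.08, 41.11, 48.34],
    "HLFF":  [45.02, 37.42, 46.47, 50.10, 36.99, 44.65],
    "CPFF":  [46.52, 37.11, 47.34, 51.47, 36.95, 45.95],
}


def improvement_percent(values):
    """Gain of the best (smallest) value over the worst, in percent.

    Normalizing by the worst value is an assumption of this sketch.
    """
    worst, best = max(values), min(values)
    return 100.0 * (worst - best) / worst


gains = {ff: improvement_percent(v) for ff, v in MEAN_DELAY_PS.items()}
```

With these inputs the USDFF and CPFF gains come out at about 20% and 28%, in line with the values given in the text.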

Table 2. Transistor reordering results

SDFF
      (μ+3σ)s   μD      σD     μE      σE      μEDP     σEDP
      [ps]      [ps]    [ps]   [fJ]    [fJ]    [e-27]   [e-27]
C1     1.41     44.89   2.61   22.56   1.97    1012.7   85.1
C2    -3.23     38.99   2.32   23.19   1.99    904.2    72.75
C3     4.73     49.01   2.88   24.42   2       1196.8   98.4
C4    10.84     49.84   2.8    21.45   1.78    1069.1   85.25
C5    -1.52     39.43   2.35   21.01   1.95    828.4    72.1
C6     7.75     47.1    2.65   20.08   1.84    945.8    82.15

USDFF
      (μ+3σ)s   μD      σD     μE      σE      μEDP     σEDP
      [ps]      [ps]    [ps]   [fJ]    [fJ]    [e-27]   [e-27]
C1     3.71     46.64   2.19   17.42   0.865   812.5    35.94
C2    -0.5      40.77   2.14   17.72   0.927   722.4    34.76
C3     7.01     50.36   2.59   24.79   1.26    1248.4   75.45
C4    12.91     51.08   2.56   22.9    1.08    1169.7   57.95
C5     0.99     41.11   2.15   22.95   1.62    943.5    63.65
C6     9.7      48.34   2.43   22.53   1.5     1089.1   73.85

HLFF
      (μ+3σ)s   μD      σD     μE      σE      μEDP     σEDP
      [ps]      [ps]    [ps]   [fJ]    [fJ]    [e-27]   [e-27]
C1    12.47     45.02   2.34   34.46   2.17    1551.4   91.39
C2     3.57     37.42   2.09   25.36   1.65    949      28.56
C3     3.27     46.47   2.55   29.98   1.55    1393.2   68.26
C4     5.26     50.1    2.73   34.5    1.78    1728.5   59.56
C5     8.22     36.99   1.99   31.41   1.73    1161.9   34.04
C6    14.55     44.65   2.44   34.57   1.82    1543.6   50.12

CPFF
      (μ+3σ)s   μD      σD     μE      σE      μEDP     σEDP
      [ps]      [ps]    [ps]   [fJ]    [fJ]    [e-27]   [e-27]
C1    12.56     46.52   2.78   24.57   2.07    1143     64.8
C2     3.54     37.11   2.2    17.1    1.14    634.6    23.43
C3     2.76     47.34   2.91   24.46   1.45    1157.9   47.74
C4     6.27     51.47   3.13   23.43   1.38    1205.9   40.37
C5     9.26     36.95   2.06   20.64   1.58    762.6    40.36
C6    15.04     45.95   2.74   23.36   1.36    1073.4   38.63

Another interesting circuit-level strategy is the dual-Vth (DVT) technique, which consists of using transistors with two different threshold voltages: the lower-Vth devices are used in the critical paths to optimize performance, while the higher-Vth devices are used in non-critical paths to reduce leakage power [1]. This approach was applied to the analyzed circuits in conjunction with transistor reordering, exploiting the 45-nm STM General Purpose transistor library. The latter includes devices with standard (SVT) and high (HVT) threshold voltages. SVT transistors were used to implement the delay-critical PDNs, whereas HVT transistors were used where the device delay is not a concern.

The obtained results in terms of the mean and standard deviation of delay, energy and EDP are shown in Table 3. The setup-time margins are not significantly influenced by this technique, so their values are not reported in Table 3. A careful comparison between the results given in Table 2 and in Table 3 shows that the DVT technique has a minor impact on the delay mean and standard deviation, while it can lead to a significant decrease in the energy standard deviation, depending on the input ordering. More precisely, comparing the best and the worst PDN configurations in terms of energy consumption, the energy standard deviation improves by between 10.9% (for the SDFF) and 15.8% (for the CPFF). As highlighted in Table 3, for each flip-flop topology the best transistor arrangements in terms of performance or energy consumption are the same as those shown in Table 2.

Table 3. Transistor reordering + dual Vth results

SDFF
          μD      σD     μE      σE      μEDP     σEDP
          [ps]    [ps]   [fJ]    [fJ]    [e-27]   [e-27]
C1+DVT    45.12   2.74   22.18   1.78    1000.8   67
C2+DVT    39.1    2.49   22.67   1.72    886.4    65.75
C3+DVT    49.08   3      23.95   1.68    1175.5   80.15
C4+DVT    49.94   2.94   20.61   1.7     1029.3   79.3
C5+DVT    39.5    2.44   20.68   1.71    816.9    66.3
C6+DVT    47.21   2.79   19.72   1.64    931      77.25

USDFF
          μD      σD     μE      σE      μEDP     σEDP
          [ps]    [ps]   [fJ]    [fJ]    [e-27]   [e-27]
C1+DVT    46.68   2.14   16.8    0.742   784.2    33.37
C2+DVT    40.43   2.11   17.38   0.823   702.7    32.97
C3+DVT    50.47   2.5    23.91   0.94    1206.7   49.66
C4+DVT    51.41   2.44   21.08   0.705   1083.7   36.2
C5+DVT    41      2.06   21.34   1.2     874.9    44.18
C6+DVT    48.13   2.3    20.3    1.06    977      48.13

HLFF
          μD      σD     μE      σE      μEDP     σEDP
          [ps]    [ps]   [fJ]    [fJ]    [e-27]   [e-27]
C1+DVT    45.11   2.31   33.87   1.87    1527.9   78.57
C2+DVT    37.32   2.03   24.78   1.46    924.8    27.53
C3+DVT    46.41   2.45   28.75   1.41    1334.3   65.43
C4+DVT    50.12   2.7    33.8    1.69    1694.1   57.87
C5+DVT    36.94   1.97   30.92   1.64    1142.2   32.64
C6+DVT    44.71   2.4    33.82   1.73    1512.1   49.03

CPFF
          μD      σD     μE      σE      μEDP     σEDP
          [ps]    [ps]   [fJ]    [fJ]    [e-27]   [e-27]
C1+DVT    46.24   2.75   24.03   1.73    1111.1   54.4
C2+DVT    36.76   2.09   16.52   0.96    607.3    18.47
C3+DVT    47.12   2.87   23.95   1.28    1128.5   43.77
C4+DVT    51.14   3.03   22.41   1.14    1146     34.72
C5+DVT    36.59   2.03   20.06   1.5     734      37.36
C6+DVT    45.53   2.65   22.18   1.09    1009.9   31.93

Fig. 5. Yield improvement comparing C1 and C5 (dashed line) SDFF transistor arrangements (the yield data are referred to the μ value of the C1 configuration)

Figure 5 shows the effects of the analyzed techniques on the SDFF topology. The results demonstrate that the joint use of transistor reordering and the DVT technique considerably improves the timing and energy yields concurrently. More precisely, comparing the C5 with the C1 transistor stack arrangement, improvements of 1.98X, 1.62X and 1.99X are obtained in delay, energy and EDP yields, respectively.
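These yield gains can be approximated from the tabulated moments alone. The sketch below rests on two assumptions that the text does not state: each metric is normally distributed with the mean and standard deviation of Table 3, and the reference is the C1+DVT row, with the constraint placed at its mean. The paper's own figures come from MC data, so agreement is only approximate.

```python
import math


def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# (mean, std) pairs for the SDFF, taken from Table 3.
C1_DVT = {"delay": (45.12, 2.74), "energy": (22.18, 1.78), "edp": (1000.8, 67.0)}
C5_DVT = {"delay": (39.50, 2.44), "energy": (20.68, 1.71), "edp": (816.9, 66.3)}


def yield_gain(metric):
    """Yield of C5+DVT divided by yield of C1+DVT at the C1+DVT mean."""
    target = C1_DVT[metric][0]                     # constraint = mean of C1+DVT
    y_ref = normal_cdf(target, *C1_DVT[metric])    # 0.5 by construction
    y_new = normal_cdf(target, *C5_DVT[metric])
    return y_new / y_ref


gains = {metric: yield_gain(metric) for metric in C1_DVT}
```

Under these assumptions the three ratios land within about 0.01 of the 1.98X, 1.62X and 1.99X reported above.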

4 Comparative Analysis and Discussion

For each FF topology, the solution that gives the best trade-off between EDP and robustness to process variations was selected. To this end, the simple cost function defined in [16] was used:

CF(C) = μEDP(C) · σEDP(C)    (1)

The CF is a relevant metric that takes into account both the mean EDP and its spread caused by process variations. The optimal transistor configuration (Copt) is the one that minimizes the CF (i.e. Copt = C: min{μEDP(C) · σEDP(C)}). As shown in Table 3, the optimal transistor arrangement is configuration C2, except for the SDFF, whose best solution is configuration C5.

Comparative MC results are given in Table 4. The ratio between the maximum spread 3σ and the mean value µ is used in Table 4 as a measure of the variability of a particular parameter due to process variations. All the FF topologies show similar results in terms of delay variability. The USDFF circuit presents the lowest delay variability (about 15.66%), whereas the SDFF has the highest (18.53%). Although the CPFF has a delay variability of 17.07%, it shows the best mean delay (see Table 3). A more differentiated susceptibility to process variations can be observed in terms of energy dissipation. The SDFF has the highest energy variability (about 24.81%), while the USDFF is the circuit with the lowest (about 14.21%).
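The selection rule of Eq. (1) can be sketched directly on the EDP columns of Table 3. The function names are hypothetical; the data are the tabulated (μEDP, σEDP) pairs.

```python
# (mu_EDP, sigma_EDP) in units of 1e-27 J*s for C1+DVT..C6+DVT, from Table 3.
EDP_STATS = {
    "SDFF":  [(1000.8, 67.0), (886.4, 65.75), (1175.5, 80.15),
              (1029.3, 79.3), (816.9, 66.3), (931.0, 77.25)],
    "USDFF": [(784.2, 33.37), (702.7, 32.97), (1206.7, 49.66),
              (1083.7, 36.2), (874.9, 44.18), (977.0, 48.13)],
    "HLFF":  [(1527.9, 78.57), (924.8, 27.53), (1334.3, 65.43),
              (1694.1, 57.87), (1142.2, 32.64), (1512.1, 49.03)],
    "CPFF":  [(1111.1, 54.4), (607.3, 18.47), (1128.5, 43.77),
              (1146.0, 34.72), (734.0, 37.36), (1009.9, 31.93)],
}


def cost(mu_edp, sigma_edp):
    """Cost function of Eq. (1): product of EDP mean and standard deviation."""
    return mu_edp * sigma_edp


def optimal_configuration(stats):
    """Label (C1..C6) of the configuration with the smallest cost."""
    best_index = min(range(len(stats)), key=lambda i: cost(*stats[i]))
    return f"C{best_index + 1}"


c_opt = {ff: optimal_configuration(stats) for ff, stats in EDP_STATS.items()}
```

Running the selection on these values returns C5 for the SDFF and C2 for the other three topologies, matching the choice described above.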

Table 4. Comparative results

        (3σ/µ)D   (µ+3σ)D   (3σ/µ)E   (µ+3σ)E   (3σ/µ)EDP   (µ+3σ)EDP
        [%]       [ps]      [%]       [fJ]      [%]         [e-27]
SDFF    18.53     46.82     24.81     25.81     24.35       1015.8
USDFF   15.66     46.76     14.21     19.85     14.08       801.61
HLFF    16.32     43.41     17.67     29.16     8.93        1007.39
CPFF    17.07     43.03     17.43     19.4      9.12        662.71
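The entries of Table 4 follow from the Table 3 moments of the selected configurations (C5+DVT for the SDFF, C2+DVT for the others). A minimal sketch, with hypothetical helper names, recomputes both columns for every metric:

```python
# (mu, sigma) of the selected configuration of each flip-flop, from Table 3.
# D in ps, E in fJ, EDP in units of 1e-27 J*s.
SELECTED = {
    "SDFF":  {"D": (39.50, 2.44), "E": (20.68, 1.71),  "EDP": (816.9, 66.3)},
    "USDFF": {"D": (40.43, 2.11), "E": (17.38, 0.823), "EDP": (702.7, 32.97)},
    "HLFF":  {"D": (37.32, 2.03), "E": (24.78, 1.46),  "EDP": (924.8, 27.53)},
    "CPFF":  {"D": (36.76, 2.09), "E": (16.52, 0.96),  "EDP": (607.3, 18.47)},
}


def variability_percent(mu, sigma):
    """Maximum spread 3*sigma relative to the mean, in percent."""
    return 100.0 * 3.0 * sigma / mu


def three_sigma_value(mu, sigma):
    """Upper 3-sigma bound mu + 3*sigma."""
    return mu + 3.0 * sigma


table4 = {
    ff: {m: (variability_percent(*ms), three_sigma_value(*ms))
         for m, ms in metrics.items()}
    for ff, metrics in SELECTED.items()
}
```

Each `table4[ff][metric]` pair holds the variability in percent and the 3-sigma value, which can be compared row by row with the table.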

The 3-sigma value, defined as µ+3σ and provided in Table 4, gives practical information for evaluating the achievable yield. As illustrated in Fig. 6, 99.87% of fabricated circuits based on the CPFF topology would have a worst-case delay lower than 43.03 ps and an energy dissipation lower than 19.4 fJ. By comparison, 99.75%, 92.65% and 89.07% of fabricated HLFF, SDFF and USDFF circuits, respectively, would reach a speed performance similar to that obtained for the CPFF structure. At the 3-sigma energy value of the CPFF, the USDFF and the SDFF achieve energy yields of 99.29% and 23.52%, respectively, whereas the HLFF presents an energy yield almost equal to zero. As expected, the CPFF also shows the lowest 3-sigma value in terms of EDP, making it the best solution among the four analyzed circuits. At the CPFF 3-sigma value, the USDFF and the SDFF show EDP yields of 11.26% and 1%, respectively, whereas the HLFF presents an EDP yield equal to zero.

Fig. 6. Yield comparison

5 Conclusions

In this paper, the impact of process variations on the delay and energy performance of a set of high-speed FFs has been analyzed. Moreover, in order to reduce the unpredictability in speed and energy dissipation, transistor reordering and dual-Vth techniques have been applied and their effects have been studied. It was found that these techniques can significantly affect both the mean values and the standard deviations of the setup time, data-to-output delay and energy dissipation. The optimum transistor-reordered solution depends on the particular FF topology, the number of stacked transistors, and the relative position of the switching devices in the transistor network arrangement. The best mean and standard deviation delay results were found for PDN configurations in which the data signals drive devices closer to the output node. Moreover, for each kind of FF, better mean and standard deviation energy results are obtained using high-threshold transistors in the non-critical paths. The analyzed FF topologies were also compared to identify the best choice from the yield point of view. The comparative analysis clearly shows that the CPFF circuit ensures the highest delay, energy and EDP yields.

References

1. Wong, B.P., et al.: Nano-CMOS Design For Manufacturability. John Wiley & Sons, Chichester (2009)
2. Borkar, S., Karnik, T., Narendra, S., Tschanz, J., Keshavarzi, A., De, V.: Parameter variations and impact on circuits and microarchitecture. In: Proc. of the 40th Conference on Design automation, Anaheim, CA, USA, June 2-6 (2003)
3. Sylvester, D., Agarwal, K., Shah, S.: Variability in nanometer CMOS: Impact, analysis, and minimization. Integration, The VLSI Journal 41(3), 319–339 (2008)
4. Stojanovic, V., Oklobdzija, V.: Comparative Analysis of Master-Slave Latches and Flip-Flops for High-Performance and Low-Power Systems. IEEE J. Solid-State Circuits 34(4), 536–548 (1999)
5. Rebaud, B., Belleville, M., Bernard, C., Robert, M., Maurine, P., Azemard, N.: A comparative study of variability impact on static flip-flop timing characteristics. In: Proc. IEEE International Conference on Integrated Circuit Design and Technology (ICICDT), Austin, TX, June 2-4, pp. 167–170 (2008)
6. Hansson, M., Alvandpour, A.: Comparative Analysis of Process Variation Impact on Flip-Flop Power-Performance. In: Proceedings of the 2007 IEEE Symposiums on Circuits and Systems (ISCAS 2007), pp. 3744–3747 (2007)
7. Mostafa, H., Anis, M., Elmasry, M.: Comparative Analysis of Timing Yield Improvement under Process Variations of Flip-Flops Circuits. In: 2009 IEEE Computer Society Annual Symposium on VLSI (2009)
8. da Silva, D.N., et al.: CMOS Logic Gate Performance Variability Related to Transistor Network Arrangements. Microelectronics Reliability 49, 977–981 (2009)
9. Ashouei, M., Chatterjee, A., Singh, A.D., De, V.: A dual-Vt layout approach for statistical leakage variability minimization in nanometer CMOS. In: Proceedings of 2005 IEEE International Conference on Computer Design (ICCD), pp. 567–573 (October 2005)
10. Nedovic, N., Oklobdzija, V.G.: Hybrid Latch Flip-Flop with Improved Power Efficiency. In: Proceedings of the 13th Symposium on Integrated Circuits and Systems Design, pp. 211–215 (2000)
11. Giacomotto, C., Nedovic, N., Oklobdzija, V.G.: The Effect of the System Specification on the Optimal Selection of Clocked Storage Elements. IEEE J. Solid-State Circuits 42(6), 1392–1403 (2007)
12. Alioto, M., Consoli, E., Palumbo, G.: General Strategies to Design Nanometer Flip-Flops in the Energy-Delay Space. IEEE Transaction on Circuits and Systems (2009)
13. Alioto, M., Consoli, E., Palumbo, G.: Flip-Flop Energy/Performance Versus Clock Slope and Impact on the Clock Network Design. IEEE Transaction on Circuits and Systems (2009)
14. Markovic, D., Nikolic, B., Brodersen, R.: Analysis and Design of Low-Energy Flip-Flops. In: Proc. of the 2001 International Symposium on Low Power Electronics and Design, Huntington Beach, California, United States, pp. 52–55 (2001)
15. Hossain, R., et al.: Reducing Power Dissipation in CMOS Circuits by Signal Probability Based Transistor Reordering. IEEE Trans. Computer Aided Design Integrated Circuits Systems 15(3), 361–368 (1996)
16. Li, B., Peh, L., Patra, P.: Impact of Process and Temperature Variations on Network-on-Chip Design Exploration. In: Proc. of the Second ACM/IEEE International Symposium on Networks-on-Chip, NOCS, pp. 117–126 (2008)

Transistor-Level Gate Modeling for Nano CMOS Circuit Verification Considering Statistical Process Variations

Qin Tang, Amir Zjajo, Michel Berkelaar, and Nick van der Meijs

Circuits and Systems Group, Delft University of Technology [email protected]

Abstract. Equation- or table-based gate-level models (GLMs) have been applied in static timing analysis (STA) for decades. In order to evaluate the impact of statistical process variability, Monte Carlo (MC) simulations are used with GLMs for statistical static timing analysis (SSTA), which requires a massive amount of CPU time. Driven by the challenges associated with CMOS technology scaling to 45nm and below, intensive efforts have been devoted to optimizing GLMs for higher accuracy at the expense of increased complexity. In order to maintain both accuracy and efficiency at the 45nm node and below, in this paper we present a gate model built from a simplified transistor model. Considering the increasing statistical process variability, the model is embedded in our new statistical simulation engine, which can perform both implicit non-MC statistical and deterministic simulations. Results of timing, noise and power grid analysis are presented using a 45nm PTMLP technology.

Keywords: gate modeling, transistor-level, non-Monte Carlo, statistical timing analysis.

1 Introduction

Nowadays cell-based design flows are still dominant for circuit verification tasks such as timing, noise or power grid analysis. Usually, due to the challenges associated with gate modeling, a separate GLM, such as a noise model or a power droop model, is developed to handle each effect. However, building on the recent invention of a current source model [8], a unified GLM for timing, noise and power analysis is in sight. Since the analysis is carried out using cell models, the models must accurately represent the behavior of the circuit that makes up the cell for timing, crosstalk, variability calculation, etc. However, conventional GLMs model every element as a function of the input slew and a single output effective capacitance (Ceff), and rely on the single-input-switching (SIS) assumption. Instead of optimizing the GLMs for higher accuracy at the cost of increased complexity and characterization time, we make the case that transistor-level gate models can address most of the limitations of GLMs [5].

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 190–199, 2011. © Springer-Verlag Berlin Heidelberg 2011

With increasing process variations at 45nm and below, the major challenge in timing gate modeling becomes the efficient construction of a parameterized timing model of a design, representing the design characteristics as a function of process variations [6]. The major approaches are Monte Carlo (MC) simulation and the computation and propagation of statistical arrival times. The MC method suffers from excessive pessimism and poor scalability as the number of process parameters increases. On the other hand, generating statistical arrival time models for all standard cells of a library takes a huge amount of CPU time due to the necessary MC-based simulation.

In this paper, we present a statistical simplified transistor model (SSTM) for cell modeling which is capable of simultaneously handling most of the issues described in section 2. The new non-MC statistical simulation method is introduced in section 4.

2 GLM Limitations and Optimization Trends

By using conventional GLMs, (S)STA provides delay and slew much faster, without calculating accurate waveforms. In nanometer technology, however, conventional GLMs become less accurate due to the following intrinsic limitations.

1. Simple saturated ramps can no longer represent the input signals, especially if they arise from a complicated driving stage with noise or a multiple-input switching (MIS) scenario, or are influenced by process variations or other sources of variability [7].
2. GLMs fail to work with a multi-port coupled interconnect load since the load is only modeled as an effective capacitance (Ceff). Oversimplification of the interconnect coupling can lead to large errors during timing analysis [1]-[2].
3. GLMs are unable to capture MIS and internal charge effects for high-stack and complex cells. The SIS assumption is inherent in all timing tools. In reality, all multiple-input cells are subject to delay degradation (or delay improvement for the min-delay STA) due to MIS. Not modeling MIS for timing can result in as much as 100% error in delay and slew calculation [2].
4. Increasing modeling complexity is required to handle voltage droop effects. In order to account for power supply variations, GLMs must be characterized at different supply voltages.

There is a clear trend to optimize GLMs to deal with the limitations listed above. Croix and Wong introduced an input-waveform-independent current source model (CSM) [8], which is essentially a voltage-based, DC-transfer-derived current source with transient effects modeled by a linear capacitance at the output. Many optimized CSMs extend the Croix model to handle other limitations. The Miller capacitance is considered and voltage-based capacitance models are used in [1]-[3], while [9] focuses on waveform models. A non-linear Ceff model is described in [4], although its accuracy still needs to be evaluated further. The MIS issue is addressed by modeling every input and output port in the cell [1]-[2]. The internal nodes are also modeled in [1] to capture internal charge effects and obtain higher accuracy. However, these approaches only attempt to optimize GLMs to maintain acceptable accuracy for all types of gates. The root of all these issues is that GLMs are black-box models in which the internal structure of the gates is hidden. The increasing requirement for accuracy makes the trade-off between better accuracy and shorter runtime a real challenge [6].

At 45nm and below, the propagation of complex signals and accurate modeling of crosstalk effects require accurate cell models. A good cell model for SSTA should be independent of input waveform, output load and circuit structure; should not increase complexity, and should provide accuracy comparable to SPICE at much higher efficiency; should have a much shorter characterization time; and should be able to capture process variations and be easy to embed in a SPICE-like engine to propagate statistical signal information. By using an efficient transistor model and simulation algorithm, transistor-level gate modeling for timing analysis is gaining popularity [10]-[12].

3 Statistical Simplified Transistor Model (SSTM)

One extreme way of performing transistor-level timing analysis is simply to run Spice/Spectre. However, such an approach is computationally impractical due to the cost of transistor model (e.g. BSIM4 [13]) evaluation. Our target is to develop a simplified transistor model which captures sufficient second-order effects and statistical process variations to allow accurate and efficient waveform and delay calculation for (S)STA.

Fig. 1. a) current-source model; b) proposed SSTM

Recently, optimized GLMs typically model every gate by several capacitors and a current source as shown in Fig. 1a [3]. Although the CSM is less accurate for whole-gate representation in nanometer technology, the simple model is nevertheless appropriate for transistor modeling. The proposed SSTM shown in Fig. 1b represents every transistor by a statistical current source Ids and five parasitic capacitances, which also have statistical values as a function of the statistical process parameters of interest.

3.1 Current Source Modeling

Because it does not consider the second-order effects of deep-submicron MOSFETs, the Shichman-Hodges model has gradually been replaced by Deep Submicron MOSFET Models (DSMM) [14]. Although a DSMM substantially improves accuracy for submicron MOSFET behavior, our experiments in 45nm technology still show significant errors: i) due to channel length modulation (CLM), DIBL and the substrate-current-induced body effect, the CLM parameter λ is a complicated function of Vgs and Vds. As a consequence, modeling the saturation current as a linear function of Vds with constant slope starting from Ids(Vdsat) is not accurate enough; ii) in the linear region, Ids is no longer proportional to (Vgs − Vth − ½Vds). In fact the ½ should be replaced by a factor which depends on Vgs − Vth; iii) the cutoff current can no longer be ignored. Simulation results show that when Vgs is smaller than Vth by a small amount, the current still has a shape similar to that of the current when Vgs > Vth, so it cannot be modeled as zero if the input slew and load capacitance are both small.

Similarly, the α-power law MOSFET model [15] is also widely used in digital circuit simulation. This model assumes that near- and sub-threshold region modeling is not important in calculating the delay of digital circuits, so the linear region is approximated by straight lines and the saturation region current is constant. However, if the load capacitance and input slew are both quite small, the inaccuracy of the linear-region current significantly impacts the output waveform at the end of the transition, which introduces a large error in the output slew.

Taking these issues into consideration, the proposed BSIM4-based nominal current source Ids0 of SSTM is given in equation form as:

$$
I_{ds0} =
\begin{cases}
H\, e^{V_{gst}/(n V_t)} \left(1 - e^{-V_{ds}/V_t}\right), & V_{gs} \le V_{th} \\[6pt]
\dfrac{W}{L}\, J\, V_{gst} V_{dseff}\, \dfrac{1 - \dfrac{V_{ds}}{2V_b}}{1 + \dfrac{V_{ds}}{V_c}} \left[1 + \lambda \left(V_{ds} - V_{dseff}\right)\right], & V_{gs} > V_{th}
\end{cases}
\tag{1}
$$

where Vgst = Vgs − Vth, Vb = Vgst + 2Vt and Vt is the thermal voltage. The main components are:

$$
V_{dseff} = V_{dsat} - \frac{1}{2}\left( V_{dsat} - V_{ds} - \gamma + \sqrt{\left(V_{dsat} - V_{ds} - \gamma\right)^2 + 4\gamma V_{dsat}} \right)
\tag{2}
$$

$$
V_{dsat} = \frac{V_c \cdot \left(V_{gst} + 2V_t\right)}{V_c + V_{gst} + 2V_t}
\tag{3}
$$

To link the linear current continuously to the saturation current, the smoothing function (2), based on BSIM4, is used. Vdseff enables a unified expression for both linear and saturation currents. The threshold voltage Vth divides the I-V plane into two parts, so accurate Vth modeling is important. According to the BSIM4 model, a linear dependence of Vth on Vds is a good approximation. We simplify the Vth model as:

$$
V_{th} = V_{th0} - \alpha \cdot V_{ds} + K_1 \left( \sqrt{\Phi_s - V_{bs}} - \sqrt{\Phi_s} \right) - K_2 \cdot V_{bs}
\tag{4}
$$

where Vth0 is the zero-bias long-channel device Vth and α is a coefficient for the drain/source charge sharing and DIBL effects on Vth. The coefficients K1 and K2 and the surface potential Φs are obtained and derived from the technology file.
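The auxiliary quantities of Eqs. (2)-(4) are simple closed-form expressions, and a short numerical sketch may help to see how the smoothing function behaves. The following Python fragment is only an illustration under stated assumptions: the function names, the default thermal voltage and all numeric values (γ, Vc, Vth0, α, K1, K2, Φs) are hypothetical placeholders and are not characterized values from the paper.

```python
import math

def vdsat(vgst, vc, vt=0.0259):
    """Saturation voltage, Eq. (3): Vc*(Vgst + 2Vt) / (Vc + Vgst + 2Vt)."""
    vb = vgst + 2.0 * vt
    return vc * vb / (vc + vb)

def vdseff(vds, vdsat_value, gamma=0.01):
    """Smoothed drain-source voltage, Eq. (2).

    gamma is the smoothing parameter; 0.01 V is an assumed value.
    """
    a = vdsat_value - vds - gamma
    return vdsat_value - 0.5 * (a + math.sqrt(a * a + 4.0 * gamma * vdsat_value))

def vth(vds, vbs, vth0, alpha, k1, k2, phi_s):
    """Simplified threshold voltage, Eq. (4)."""
    body = k1 * (math.sqrt(phi_s - vbs) - math.sqrt(phi_s))
    return vth0 - alpha * vds + body - k2 * vbs

# Illustrative use with hypothetical numbers (volts):
v_sat = vdsat(vgst=0.6, vc=1.0)
low = vdseff(0.05, v_sat)   # close to Vds deep in the linear region
high = vdseff(1.2, v_sat)   # approaches Vdsat in saturation
```

With these definitions Vdseff is exactly zero at Vds = 0, follows Vds for small drain voltages and saturates just below Vdsat, which is the property that allows a single expression to cover both operating regions.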

The model simplification focuses on the following items: i) instead of using complicated expressions, the parameter J accounts for several effects, including mobility degradation; ii) the narrow channel effect is not considered in the Vth model; iii) the Vgsteff model in BSIM4 [13] is replaced by Vgst since the unified expression for the current from strong inversion to linear region is not used. As a result, the Ids0 model and its derivative are dramatically simplified. It should be noted that the cut-off current can simply be modeled as zero if sharp input ramps and extremely small load capacitances rarely occur at the same time. The proposed model is then simplified further to the second equation in (1), where only J and λ are obtained in the characterization stage. The statistical description of the I-V model is:

$$
I_{ds} = I_{ds0}(t) + \sum_{k=1}^{m} \left.\frac{\partial I_{ds}}{\partial p_k}\right|_{p_k = p_{k0}}(t) \cdot \xi_k = I_{ds0}(t) + \sum_{k=1}^{m} \chi_k(t) \cdot \xi_k
\tag{5}
$$

$$
p_k = p_{k0} + \xi_k \qquad (k = 1, \ldots, m)
\tag{6}
$$

where pk is the kth random process parameter, which is the sum of the nominal value pk0 and a random variable ξk with zero mean (μ) and the same standard deviation (σ) as pk. χk(t) is the derivative (sensitivity) of Ids with respect to pk, evaluated at the nominal point.
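The first-order expansion of Eqs. (5)-(6) can be illustrated with a few lines of Python. This is a minimal sketch, not the characterization flow of the paper: the current function `ids_toy` is a hypothetical stand-in for the second branch of Eq. (1), the sensitivities χk are approximated by central finite differences, and the standard deviation formula assumes that the ξk are uncorrelated.

```python
import math

def ids_toy(p, vgs=1.1, vds=1.1):
    """Hypothetical stand-in for the nominal current: (W/L)*J*Vgst*Vds.

    p = (W, L) in metres; J and Vth are fixed illustrative constants.
    """
    w, l = p
    j, vth = 2.0e-4, 0.4
    return (w / l) * j * (vgs - vth) * vds

def sensitivities(f, p0, rel_step=1e-4):
    """Central finite-difference estimate of chi_k = dIds/dp_k at p0."""
    chi = []
    for k in range(len(p0)):
        h = rel_step * abs(p0[k])
        up = list(p0)
        dn = list(p0)
        up[k] += h
        dn[k] -= h
        chi.append((f(up) - f(dn)) / (2.0 * h))
    return chi

def ids_linear(ids0, chi, xi):
    """Eq. (5): nominal current plus the sum of chi_k * xi_k."""
    return ids0 + sum(c * x for c, x in zip(chi, xi))

def ids_sigma(chi, sigma):
    """Standard deviation of Ids, valid only for uncorrelated xi_k."""
    return math.sqrt(sum((c * s) ** 2 for c, s in zip(chi, sigma)))

p0 = (90e-9, 45e-9)                    # assumed nominal W and L
chi = sensitivities(ids_toy, p0)       # [dIds/dW, dIds/dL]
ids0 = ids_toy(p0)
shifted = ids_linear(ids0, chi, [0.0, 2e-9])   # L longer by 2 nm
```

Because the model is linear in ξ, one nominal evaluation and m sensitivities per operating point replace repeated sampling of the current.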

3.2 Intrinsic Capacitance Modeling

The most accurate way to model non-linear capacitances is to represent them as voltage dependent terminal charge sources [13]. Characterization of such a model would involve generating charge tables for a range of terminal voltages. All capacitances are derived from the charge to ensure charge conservation. Each capacitance is computed by Cij = ∂Qi/∂Vj at every time step, where i and j denote the transistor terminals. Although this approach would be the most accurate, the massive amount of simulation time would be a problem for STA and SSTA.

[Figure: surface plot of Cgd (F, scale ×10^−17) versus Vds (V) and Vgs (V), with the cut-off, linear and saturation regions marked]

Fig. 2. Cgd variation for a minimum-sized NMOS

Using a single value for all capacitors promises fast simulation, but it results in an overly simple model which produces errors in (S)STA for nanometer technology. Fig. 2 shows the variation of Cgd for a minimum-sized NMOS. Clearly, at the 45nm node, the capacitances are too nonlinear to be accurately modeled as a constant value. In order to improve accuracy while maintaining good computational efficiency, SSTM treats the five capacitances differently. For the gate channel capacitances (GCC) Cgs, Cgd and Cgb, SSTM uses one constant value in the cut-off region and another in the saturation region, and approximates them as a linear function of Vgs and Vds in the linear region. For the junction depletion capacitances Csb and Cdb, SSTM uses a single-value model since they are 1-2 orders of magnitude smaller than the GCCs. In the statistical extension of the capacitance model (7), Cj0 is the nominal value of the jth capacitance in Fig. 1 and the sensitivity ζ is characterized by perturbing the process variables of interest.

$$
C_j(t, \xi) = C_{j0} + \sum_{k=1}^{m} \left.\frac{\partial C_j}{\partial p_k}\right|_{p_k = p_{k0}} \cdot \xi_k = C_{j0} + \sum_{k=1}^{m} \zeta_k \cdot \xi_k
\tag{7}
$$

The characterization time of GLMs for SSTA is quite long since standard cell libraries consist of hundreds of cells with different sizes and process corners. In contrast, by using transistor-based gate modeling like SSTM, the characterization time is significantly reduced, as only the unique transistors used in the cell library need to be characterized.
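The region-wise treatment of the gate channel capacitances and the statistical extension of Eq. (7) can be sketched as follows. This is an illustrative Python fragment under assumptions: the region boundaries (cut-off for Vgs ≤ Vth, saturation for Vds ≥ Vdsat) are a common convention that the text does not spell out, and every coefficient value is a hypothetical placeholder on the 10^−17 F scale of Fig. 2.

```python
def gcc_nominal(vgs, vds, vth, vdsat, c_cutoff, c_sat, lin):
    """Nominal gate channel capacitance (e.g. Cgd) by operating region.

    lin = (c0, a, b) gives the linear-region fit C = c0 + a*Vgs + b*Vds.
    The region tests below are an assumed convention.
    """
    if vgs <= vth:
        return c_cutoff            # constant in cut-off
    if vds >= vdsat:
        return c_sat               # constant in saturation
    c0, a, b = lin
    return c0 + a * vgs + b * vds  # linear function in the linear region

def cap_statistical(c_nominal, zeta, xi):
    """Eq. (7): Cj = Cj0 + sum over k of zeta_k * xi_k."""
    return c_nominal + sum(z * x for z, x in zip(zeta, xi))

# Hypothetical coefficients for one capacitance (farads):
lin_fit = (1.0e-17, 0.5e-17, -0.4e-17)
c_nom = gcc_nominal(1.0, 0.1, vth=0.4, vdsat=0.3,
                    c_cutoff=0.8e-17, c_sat=1.2e-17, lin=lin_fit)
c_stat = cap_statistical(c_nom, zeta=[2.0e-9, -1.0e-9], xi=[1e-9, 0.0])
```

The junction capacitances Csb and Cdb would simply use a single nominal value in place of `gcc_nominal`, with the same statistical extension.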

4 Non-MC Statistical Simulator

The proposed SSTM is embedded in our non-MC statistical simulator [16] for fast statistical timing analysis. In general, for deterministic time-domain analysis, the modified nodal analysis (MNA) equations for any circuit can be expressed in compact form as:

$$
F(x', x, t, p_0) = 0, \qquad x(t_0) = x_0
\tag{8}
$$

where x is the vector of the circuit state variables consisting of nodal voltages and branch currents, and p0 is the nominal process variable vector with elements pk0 introduced in (6). x' denotes the time derivative of x. Let xs be the solution to (8). Transient analysis in a conventional simulator solves for xs using numerical integration methods. However, the existence and importance of process variations at 45nm and below result in a random MNA which can be expressed as:

$$
F(x', x, t, p) = 0, \qquad x(t_0) = x_0 + \delta x_0
\tag{9}
$$

where p is the statistical process variable vector with elements pk introduced in (6), and δx0 denotes the initial variation caused by p. Solving (9) directly is computationally impracticable due to the large set of correlated random variables and the nonlinearity. Therefore, in order to make the problem manageable, we employ principal component analysis (PCA) to map the large set of m correlated parameters p in (6) onto an n-dimensional (n ≪ m) vector of uncorrelated random variables, and linearize (9) with a truncated Taylor expansion. To avoid notational clutter, the notation p is used in the rest of the paper for the uncorrelated process variables after PCA. The linear Taylor expansion is carried out at the point xs, x's and p0. Let us define y(t) = x(t) − xs(t) as the variation vector of x(t) due to the process variation ξ with zero μ and finite σ mentioned in (6). Re-organizing the first-order Taylor expansion of (9), we obtain the compact form:

$$
y'(t) = E(t)\, y(t) + F(t)\, \xi, \qquad y(t_0) = \delta x_0
\tag{10}
$$

The nonlinear random equation (9) is thus converted to a linear random differential equation (RDE) in y. According to the mean square (m.s.) integral theorem [17], there exists a unique solution. Assuming the initial condition x0 is set to a fixed value, the solution is found as y(t) = α(t) · ξ. By substituting y(t) = α(t) · ξ in (10), α(t) is easily calculated by solving the resulting ODE. Then the mean, variance and covariance of x(t) can be calculated as:

$$
E\{x(t)\} = x_s(t), \qquad \mathrm{Var}\{x_j(t)\} = \sum_{k=1}^{n} \alpha_{jk}^2(t)\, \mathrm{Var}\{\xi_k\}
\tag{11}
$$

$$
\mathrm{Cov}(x_a, x_b) = \alpha(t_a) \cdot \mathrm{diag}\left(\mathrm{Var}\{\xi_1\}, \cdots, \mathrm{Var}\{\xi_n\}\right) \cdot \alpha^T(t_b)
\tag{12}
$$

where xj(t) is the jth element of vector x(t). Once α(t) is calculated, y(t) is known, so the covariance matrix of the solution y(t) at two different time points ta and tb can be calculated by (12).

From the waveform modeling point of view, the waveform is modeled as a time-indexed voltage array for STA, while the mean, variance and covariance arrays are used for SSTA. Based on (11)-(12), the probability density function (pdf) of every crossing time for rising and falling transitions can be calculated in a straightforward way by (13) and (14) respectively, assuming the voltage at any time point is Gaussian distributed [16].
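A scalar toy problem may make the step from (10) to (11)-(12) more concrete. Substituting y(t) = α(t) · ξ into (10) and requiring the result to hold for every ξ gives α'(t) = E(t) α(t) + F(t) with α(t0) = 0 for a fixed initial condition. The Python sketch below applies this to a hypothetical one-node circuit x' = −p·x (a normalized RC discharge with p = 1/RC); the example circuit, the RK4 integrator and all names are assumptions made for illustration and are not the MATLAB implementation of [16].

```python
import math

def solve_alpha(p0, x0, t_end, n_steps=1000):
    """Integrate the nominal solution xs and the sensitivity alpha with RK4.

    For F(x', x, t, p) = x' + p*x the linearization gives
    E(t) = -p0 and F(t) = -xs(t), i.e. alpha' = -p0*alpha - xs, alpha(0) = 0.
    """
    def rhs(xs, alpha):
        return -p0 * xs, -p0 * alpha - xs

    h = t_end / n_steps
    xs, alpha = x0, 0.0
    for _ in range(n_steps):
        k1x, k1a = rhs(xs, alpha)
        k2x, k2a = rhs(xs + 0.5 * h * k1x, alpha + 0.5 * h * k1a)
        k3x, k3a = rhs(xs + 0.5 * h * k2x, alpha + 0.5 * h * k2a)
        k4x, k4a = rhs(xs + h * k3x, alpha + h * k3a)
        xs += h * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        alpha += h * (k1a + 2.0 * k2a + 2.0 * k3a + k4a) / 6.0
    return xs, alpha

def mean_and_variance(p0, x0, sigma_xi, t):
    """Eq. (11) for a single state and a single parameter."""
    xs, alpha = solve_alpha(p0, x0, t)
    return xs, (alpha * sigma_xi) ** 2

def covariance(p0, x0, sigma_xi, ta, tb):
    """Eq. (12) for a single state and a single parameter."""
    _, alpha_a = solve_alpha(p0, x0, ta)
    _, alpha_b = solve_alpha(p0, x0, tb)
    return alpha_a * sigma_xi ** 2 * alpha_b

mean, var = mean_and_variance(p0=1.0, x0=1.0, sigma_xi=0.05, t=1.0)
```

For this toy problem the sensitivity has the closed form α(t) = −x0 · t · exp(−p0 t), which gives a convenient check on the numerical integration. A single transient run therefore yields the mean and the variance of the waveform without any sampling.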

$$
\Pr(t_{r\eta} = t) = \Pr\left(V_o(t - \Delta t) \le V_\eta\right) - \Pr\left(V_o(t - \Delta t) \le V_\eta \cap V_o(t) \le V_\eta\right)
\tag{13}
$$

$$
\Pr(t_{f\eta} = t) = \Pr\left(V_o(t) \le V_\eta\right) - \Pr\left(V_o(t - \Delta t) \le V_\eta \cap V_o(t) \le V_\eta\right)
\tag{14}
$$

where the crossing time tη is the time when the node voltage crosses the corresponding voltage threshold Vη = η% · Vdd. Pr(Vo(t − Δt) ≤ Vη ∩ Vo(t) ≤ Vη) is the joint cdf of Vo at two time steps. Note that the proposed method calculates the pdf directly and considers the correlation of Vo at two time steps, in contrast to [18] and [19]. Given the mean and variance of the crossing times, the mean and variance of delay and slew can be calculated.
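Equations (13) and (14) only require a univariate and a bivariate Gaussian cdf. The sketch below evaluates them in plain Python; it is an assumption-laden illustration rather than the authors' implementation: the bivariate cdf is obtained by trapezoidal integration of the conditional form, the correlation ρ between Vo(t − Δt) and Vo(t) is taken as an input (it would follow from the covariance of Eq. (12) divided by the two standard deviations) and must satisfy |ρ| < 1, and all function names are hypothetical.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def joint_cdf(a, b, rho, n=4000):
    """P(Z1 <= a, Z2 <= b) for standard normals with correlation |rho| < 1.

    Uses P = integral of pdf(z) * cdf((b - rho*z)/sqrt(1 - rho^2)) over z <= a,
    truncated to [-8, 8] and evaluated with the trapezoidal rule.
    """
    lo, hi = -8.0, min(a, 8.0)
    if hi <= lo:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        z = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * norm_pdf(z) * norm_cdf((b - rho * z) / s)
    return total * h

def rising_prob(mu_prev, sd_prev, mu_now, sd_now, rho, v_eta):
    """Eq. (13): below V_eta at t - dt and above it at t."""
    a = (v_eta - mu_prev) / sd_prev
    b = (v_eta - mu_now) / sd_now
    return norm_cdf(a) - joint_cdf(a, b, rho)

def falling_prob(mu_prev, sd_prev, mu_now, sd_now, rho, v_eta):
    """Eq. (14): above V_eta at t - dt and below it at t."""
    a = (v_eta - mu_prev) / sd_prev
    b = (v_eta - mu_now) / sd_now
    return norm_cdf(b) - joint_cdf(a, b, rho)

# Hypothetical rising edge that crosses 0.5 V almost surely in this step:
p_rise = rising_prob(0.2, 0.02, 0.8, 0.02, rho=0.9, v_eta=0.5)
```

Evaluating `rising_prob` for every time step of a rising transition yields the discrete distribution of the crossing time, from which its mean and variance follow directly.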

5 Experimental Results

The proposed SSTM and non-MC statistical simulation method were evaluated using 45nm PTMLP technology [20] and implemented in MATLAB. For SSTM, the data for characterization were obtained from Spectre using a BSIM4 model and then imported to a characterization algorithm in MATLAB to acquire the required parameters described in section 3. We present the accuracy evaluation of SSTM for minimum-sized cells, arbitrary inputs and MIS, and the applicability of SSTM to power grid and signal integrity verification. Finally, the statistical simulation results are presented.

We evaluated the nominal SSTM (process variations not included) on minimum-sized inverter and NAND2 cells with different input slew (Sin) and capacitive load (Cload). Sin ranges from 1ps to 500ps and Cload spans from 0.5fF to 40fF. In comparison with Spectre using the BSIM4 model, it is clear from Fig. 3 (a)-(b) that the relative error for delay calculation is within 5%. 99.2% of the output rise delays and 93.9% of the output fall delays are within 1.6%. The average relative error of the output slew calculation is 1.2%. Although the maximum relative error is 3.3% with zero Cload, Fig. 3 (c)-(d) show that the absolute error is nearly zero.

Fig. 3. Delay and output slew evaluation: (a) relative error of rise delay and (b) relative error of fall delay, each in % versus input slew (ns) and load capacitance (fF); (c) rising output slew and (d) falling output slew, showing scaled SSTM and BSIM4 results versus capacitive load (fF)

In essence, SSTM is input waveform independent, so it can handle arbitrary input waveforms. Certain cells may experience simultaneous MIS and internal charge sharing during some specific input to output transitions. The transistor-based SSTM is able to handle these since every node is considered at the same time. Fig. 4 illustrates the accuracy of the nominal SSTM used in a minimum-sized inverter with irregular input and in a NAND2 cell in a simultaneous MIS scenario. The results show a very good match between the nominal SSTM and the BSIM4 model.

Fig. 4. left: irregular input; right: simultaneous MIS for a NAND2 cell (voltage (V) versus time (s, ×10^−9); input, output-BSIM4 and output-SSTM waveforms)

Power supply integrity verification is an essential step in current design flows due to the large currents drawn through an increasingly resistive power supply network. The models used in power grid analysis must capture the dynamic current characteristics of the cells. Fig. 5(a) shows the current drawn by a cell from the power supply at both rising and falling transitions. It is easy for transistor-based gate models to capture the dynamic currents since the desired current is calculated during the simulation.

The primary modeling challenge for on-chip signal integrity verification has been the simulation of a driver (the victim), subject to an input noise, whose interconnect load is capacitively coupled to the output of another driver (the aggressor). In Fig. 5(b) we see that the SSTM captures this scenario well. All waveforms in Fig. 5 show that SSTM can be applied to power grid and signal integrity verification flows.

Fig. 5. SSTM's application to power grid and signal integrity verification: (a) SSTM to power grid verification (scaled current versus scaled time; SSTM and Spectre results); (b) SSTM to signal integrity verification (voltage (V) versus scaled time; noisy input, aggressor, SSTM output and Spectre output)

We combined SSTM with the proposed non-MC statistical simulation method for a large number of standard cells in a 45nm technology. The uncorrelated process variations are length and width variations with zero μ. The 3σ values of length and width are 20% of the nominal length and 15% of the largest width of every cell, respectively. In comparison with 1000 Monte Carlo trials in Spectre, the proposed modeling and simulation method achieved a relative error within 1.4% for μ and within 6.8% for σ, with an average 40× speedup [16].

6 Conclusion

At 45nm and below, the gate models for circuit verification should account for increasing accuracy requirements and process variations. In this paper, a statistical simplified transistor model (SSTM) for transistor-level gate modeling, embedded in our non-MC statistical simulator, is presented. The SSTM-based gate model is independent of input waveform and output load, easy to characterize and suitable for SSTA, and accurate compared to Spice/Spectre for standard cells. We show that, in addition to addressing the accuracy limitations of conventional gate-level models for STA, such as arbitrary inputs and multi-input switching, it can be applied to power grid verification and noise verification flows. The statistical results show that our transistor-level timing analysis methodology achieves both high accuracy and efficiency.

References

1. Menezes, N., Kashyap, C., Amin, C.: A “true” electrical cell model for timing, noise, and power grid verification. In: Proc. of DAC, pp. 462–467 (2008)
2. Amin, C., Kashyap, C., Menezes, N., Killpack, K.: A multi-port current source model for multiple-input switching effects in CMOS library cells. In: Proc. of DAC, pp. 247–252 (2006)
3. Goel, A., Vrudhula, S.: Statistical waveform and current source based standard cell models for accurate timing analysis. In: Proc. of DAC, pp. 227–230 (2008)
4. Li, P., Acar, E.: Waveform independent gate models for accurate timing analysis. In: Proc. of ICCD, pp. 617–622 (1996)
5. Tang, Q., Zjajo, A., Berkelaar, M., van der Meijs, N.: A simplified transistor model for CMOS timing analysis. In: Proc. of ProRISC, pp. 1–6 (2009)
6. Keller, I., Tarn, K.H., Kariat, V.: Challenges in gate level modeling for delay and SI at 65nm and below. In: Proc. of DAC, pp. 468–473 (2008)
7. Nazarian, S., Pedram, M., Tuncer, E., Lin, T.: Sensitivity-based gate delay propagation in static timing analysis. In: Proc. of ISQED, pp. 536–541 (2005)
8. Croix, J.F., Wong, D.F.: Blade and Razor: cell and interconnect delay analysis using current-based models. In: Proc. of DAC, pp. 386–389 (2003)
9. Amin, C.S., Dartu, F., Ismail, Y.I.: Weibull based analytical waveform model. IEEE Trans. on CAD 24, 1156–1168 (2005)
10. Raja, S., Varadi, F., Becer, M., Geada, J.: Transistor level gate modeling for accurate and fast timing, noise, and power analysis. In: Proc. of DAC, pp. 456–461 (2008)
11. Kulshrehtha, P., Palermo, R., Mortazavi, M.: Transistor-level timing analysis using embedded simulation. In: Proc. of ICCAD, pp. 344–348 (2000)
12. Li, Z., Chen, S.: Transistor level timing analysis considering multiple inputs simultaneous switching. In: Proc. of CADCG, pp. 315–320 (2007)
13. BSIM4 Home Page, http://www-device.eecs.berkeley.edu/bsim3/bsim4.hml
14. Rabaey, J.M.: Digital integrated circuit: A design perspective, pp. 96–100. Prentice Hall, Upper Saddle River (1996)
15. Sakurai, T., Newton, A.R.: Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE JSSC 25(2), 584–594 (1990)
16. Tang, Q., Zjajo, A., Berkelaar, M., van der Meijs, N.: RDE-based transistor-level gate simulation for statistical static timing analysis. In: Proc. of DAC, pp. 787–792 (2010)
17. Soong, T.T.: Random differential equations in science and engineering. Academic Press, New York (1973)
18. Fatemi, H., Nazarian, S., Pedram, M.: Statistical logic cell delay analysis using a current-based model. In: Proc. of DAC, pp. 253–256 (2006)
19. Liu, B., Kahng, A.B.: Statistical gate level simulation via voltage controlled current models. In: IEEE Proc. of MBAS, pp. 23–27 (2006)
20. Predictive Technology Model for Low-power Applications (PTMLP) (November 2008), http://www.eas.asu.edu/~ptm/modelcard/LP/45nm_LP.pm

White-Box Current Source Modeling Including Parameter Variation and Its Application in Timing Simulation

Christoph Knoth1, Irina Eichwald1, Petra Nordholz2, and Ulf Schlichtmann1

1 Institute for Electronic Design Automation, Technische Universität München http://www.eda.ei.tum.de/
2 Infineon Technologies AG, Munich http://www.infineon.com

Abstract. This paper presents a novel method for generating current source models (CSMs) for logic cells that efficiently captures the influences of parameter variation and supply voltage drops. The characterization exploits topological information from the transistor netlist, resulting in typically 80x faster CSM library generation. The parametric CSMs have been integrated into a commercial FastSPICE simulator to further accelerate path-based timing analysis with transistor level accuracy. Without loss of accuracy, simulation times were reduced by 4x to 98x.

1 Introduction

Timing validation is a crucial step during the design closure of digital circuits. The huge number of cell instances in modern IC designs requires abstract signal and delay models. The industry standard delay model, the nonlinear delay model (NLDM), therefore approximates the cell input behavior by capacitances and logic signals by linear ramps with arrival and transition times. Nonetheless, these idealizations do not account for the increasing impact of analog effects introduced by interconnects. Signal transitions are non-monotonic due to coupling noise, and the wire resistance causes long transition tails and reduces the load capacitance seen by the driver. Effective capacitance and piecewise constant input capacitances emerged as patches for NLDM to better account for the analog effects [21], but delay and slew errors are still larger than 10% [10]. EDA vendors recognized the importance of precise waveform modeling for correct delay modeling and introduced the new driver and delay models ECSM and CCS [1, 2, 24]. These models use more voltage-time points to describe logic signals but still assume monotonic transitions. The authors of [14] proposed to use a larger set of "typical" waveforms, including noisy ones, for cell delay characterization.

In contrast to simulating every possible scenario of input signal and output load during library generation, waveform and load independent CSMs have been proposed. They are pin compatible models of logic cells and provide the port currents as functions of port voltages to calculate the output waveform using SPICE principles. CSMs are mainly used in dedicated timing or noise engines [6, 7, 9, 11] but can also be employed in SPICE simulators [17, 23].

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 200–210, 2011. © Springer-Verlag Berlin Heidelberg 2011

For today's and future technology nodes the impact of parameter variation is of major concern. It is therefore not sufficient to improve model accuracy for nominal conditions. All enhancements must support statistical analyses. This also holds for CSMs. In [9], [19] and [25] CSMs are used in special statistical timing simulators to propagate the nominal voltage waveform and the sensitivities of voltage crossing points w.r.t. parameters.

Despite their accuracy benefits and reported applications, generating CSM libraries is a significant effort. As will be shown in the next section, the problem arises from time consuming transient simulations for obtaining CSM components. Moreover, this leads to a prohibitively high simulation effort when the impact of parameter variation has to be considered.

This paper therefore presents a white-box modeling approach that allows much faster CSM library generation. To the best of our knowledge, it is the first method to build parametric CSMs that employs transistor netlist information from a topology analysis. Furthermore, the paper reports the first utilization of CSMs to accelerate the simulation performance of a commercial FastSPICE simulator. This allows simulation times for digital and mixed signal circuits to be reduced.

2 Current Source Modeling

Current source models imitate the nonlinear port currents of logic cells as functions of port voltages. Different CSMs have been proposed over the years [3, 6, 7, 9, 11–13, 15, 16, 18, 20, 22, 25]. All of them model the port current as a composite of a static current from a voltage controlled current source (VCCS) and an additional dynamic contribution realized by (non)linear charges or capacitors (see Fig. 1). These static and dynamic components are modeled as functions of the port voltages. Important internal nodes of complex cells might be treated as additional virtual ports [15].

Generating a CSM can cause a significant simulation effort. Only the authors of [6] propose a method to derive CSMs from already existing ECSM timing libraries. Unfortunately, the impact of parameter variation cannot be captured this way. In almost all other approaches, a set of time consuming simulations is performed. Obtaining the functions for static port currents of a logic cell is conveniently realized by attaching DC voltage sources to the ports, sweeping their values, and measuring the resulting port currents. These values are stored in lookup tables (LUTs) or are approximated by polynomials or splines.

The real challenge is in characterizing the dynamic components, for which different methods have been published. In [7, 20] the capacitor values or functions are found by error minimization to match the transient output current for a set of typical input stimuli. In other approaches step or ramp signals are applied and the differences between static and transient port currents are integrated to get equivalent port charges or capacitances [3, 12, 18]. This is done for all combinations of port voltages in the LUT. In [18] a second order lowpass filter at the input accounts for additional gate delay. The filter parameters and all other model components are "tuned" by step-wise error minimization with typical input waveforms. The authors of [3] pointed out the runtime problem of transient simulations for CSM characterization and reduced the number of data points in the LUTs. Therefore, in [15] AC simulations are used to obtain voltage controlled capacitors connecting the ports of a cell. Unfortunately this method leads to very complex CSMs.

It should be noted that, although it is a one-time effort, library characterization can be very expensive and time consuming. Several CSMs of a single cell have to be generated for different PVT corners. Inefficient methods block computational resources and software licenses and can delay the design process.

The problem is even more severe when parameter variation is considered. In [20] the CSM elements are determined by performing a number of Monte Carlo (MC) simulations with typical input waveforms and subsequent error minimization w.r.t. port voltages and parameters. In [9] many CSMs are generated for different parameter combinations of several MC runs. Subsequent linear fitting for every data point in the LUTs yields a first order sensitivity model. Similarly, the authors of [18] wrap parameter deflection and the calculation of finite differences for each model element around the whole characterization, which is based upon error minimization. In [13] the CSM capacitors are obtained from the difference of static and total port current for a sequence of transient simulations. This is repeated for every combination of parameters. The high-dimensional tables (port voltages and parameters) are approximated by the tensor product of polynomials which model the nominal values and the variation impact.

The proposed white-box approach avoids the plethora of transient simulations needed to match the port behavior of logic cells. Instead, physically motivated CSMs are generated based upon the original netlist elements. The additional information obtained from the transistor netlist enables very fast and accurate model generation. This efficiency is the key to capturing the influence of parameter variation within reasonable time. The model is applicable to stand-alone timing simulators. However, we implemented the parametric CSM for SPICE and FastSPICE simulators. This further improves the performance of existing and highly efficient tools. Moreover, CSMs can thus be utilized for simulating mixed signal circuits together with transistor models and behavioral descriptions in Verilog or VHDL. Each CSM can be adjusted to parameter variation and Vdd-drop during simulation. It is therefore compatible with MC methods and fits well into existing simulation, optimization, and verification methodologies.
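The conventional static characterization described in this section (sweeping DC port voltages and storing the measured port currents in a LUT) can be sketched in a few lines of Python. The fragment is a minimal illustration under assumptions: `toy_static_current` is a hypothetical closed-form stand-in for a DC simulation of an inverter, the grid spacing is arbitrary, and bilinear interpolation is only one of the approximation options mentioned above.

```python
from bisect import bisect_right

def toy_static_current(va, vz, vdd=1.1, g=1.0e-4):
    """Hypothetical stand-in for one DC simulation of an inverter.

    Returns a pull-up minus pull-down current for input va and output vz.
    """
    return g * ((vdd - va) ** 2 * (vdd - vz) - va ** 2 * vz)

def build_lut(measure, va_grid, vz_grid):
    """Sweep both port voltages and store the measured port current."""
    return [[measure(va, vz) for vz in vz_grid] for va in va_grid]

def _cell(grid, v):
    """Index of the grid interval containing v and the local coordinate."""
    i = bisect_right(grid, v) - 1
    i = max(0, min(i, len(grid) - 2))
    t = (v - grid[i]) / (grid[i + 1] - grid[i])
    return i, t

def lookup(lut, va_grid, vz_grid, va, vz):
    """Bilinear interpolation of the static current between grid points."""
    i, s = _cell(va_grid, va)
    j, t = _cell(vz_grid, vz)
    return ((1.0 - s) * (1.0 - t) * lut[i][j]
            + s * (1.0 - t) * lut[i + 1][j]
            + (1.0 - s) * t * lut[i][j + 1]
            + s * t * lut[i + 1][j + 1])

grid = [0.1 * k for k in range(12)]            # 0 V to 1.1 V, assumed spacing
lut = build_lut(toy_static_current, grid, grid)
i_z = lookup(lut, grid, grid, 0.55, 0.25)      # current between grid points
```

Each table entry costs one DC operating point, which is cheap. The expensive part of conventional characterization is the dynamic component, where every entry requires a transient simulation.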

3 White-Box CSM Characterization

3.1 Nominal Characterization

The aim is to replicate the nonlinear port behavior of the transistor level subcircuit description, such as in Fig. 2, by the much simpler circuit of Fig. 1. Hence, for any sequence of input voltages va and any arbitrary load attached to output port z, the model port currents ˆia and ˆiz must match the original currents ia and iz. Similar to other CSM approaches, the port current is modeled by the sum of a static current Iˆz(va,vz) and a dynamic current resulting from the time derivative of the associated port charge dQˆ(va,vz)/dt. For efficiency, a CSM is provided for every timing arc. Therefore, the model components are functions of two node potentials. In cells with multiple stages (e.g. buffer, and), internal node potentials affect the port behavior. Structure recognition is applied to partition these cells into channel connected blocks. These stages are then modeled individually by a CSM as in Fig. 1. In cells with significant parasitic input networks a lowpass filter accounts for the additional cell delay.

Fig. 1. Current Source Model with lowpass filter and nonlinear current source and charges

Fig. 2. Subcircuit definition of CMOS inverter with parasitic elements

While existing approaches treat the logic cell as a black box of which only the port currents are observable, the presented white-box approach uses the original netlist elements to derive the model components, namely the voltage controlled current source and the voltage controlled charges. The port charge is defined as the sum of all node charges of resistively connected internal nodes [17]. A topological search is performed on the transistor netlist to obtain a symbolic expression that collects all charges associated with one port. Similarly, all static current contributions of the transistors are found. For the example of Fig. 2, the model components are related to original currents and charges through

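\[ \hat{I}_z(v_a, v_z) = I_d^{M1}(v_{dd}, v_{z_0}, v_{a_1}) + I_d^{M2}(v_{ss}, v_{z_0}, v_{a_2}) \tag{1} \]
\[ \hat{Q}_z(v_a, v_z) = Q_d^{M1}(v_{dd}, v_{z_0}, v_{a_1}) + Q_d^{M2}(v_{ss}, v_{z_0}, v_{a_2}) + C_4 \cdot (v_{z_0} - v_{ss}) + C_0 \cdot (v_{z_0} - v_{a_0}) \tag{2} \]
\[ \hat{Q}_a(v_a, v_z) = Q_g^{M1}(v_{dd}, v_{z_0}, v_{a_1}) + Q_g^{M2}(v_{ss}, v_{z_0}, v_{a_2}) + C_2 \cdot (v_{a_0} - v_{ss}) + C_0 \cdot (v_{a_0} - v_{z_0}) \tag{3} \]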

$Q_g^{Mx}$ denotes the gate pin charge of transistor Mx and Cx are the parasitic capacitances. Dynamic coupling between input and output (Miller effect) is implicitly modeled in (2). Similarly, the dependency of the input capacitance on the output voltage is captured by the last term of (3).
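The partitioning into channel connected blocks mentioned above can be sketched as a union-find pass over the transistor netlist. This is only an illustration, not the authors' structure recognition: the netlist format (a dict of drain, gate, source net names), the function name `channel_connected_blocks`, and the supply net names are assumptions, and parasitic resistors are ignored.

```python
from collections import defaultdict

def channel_connected_blocks(transistors, supply_nets=("vdd", "vss")):
    """Group transistors that share a drain/source net other than a supply.

    transistors: dict name -> (drain, gate, source) net names
                 (hypothetical format chosen for this sketch).
    Returns a list of sets of transistor names, one set per block.
    """
    parent = {name: name for name in transistors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Collect, per non-supply net, the transistors whose channel touches it.
    by_net = defaultdict(list)
    for name, (drain, _gate, source) in transistors.items():
        for net in (drain, source):
            if net not in supply_nets:
                by_net[net].append(name)

    # Transistors on the same channel net belong to the same block.
    for names in by_net.values():
        for other in names[1:]:
            parent[find(names[0])] = find(other)

    blocks = defaultdict(set)
    for name in transistors:
        blocks[find(name)].add(name)
    return sorted(blocks.values(), key=lambda s: sorted(s))

# Two-stage buffer: M1/M2 drive internal net n1, M3/M4 drive output z.
buffer_cell = {
    "M1": ("n1", "a", "vdd"),
    "M2": ("n1", "a", "vss"),
    "M3": ("z", "n1", "vdd"),
    "M4": ("z", "n1", "vss"),
}
blocks = channel_connected_blocks(buffer_cell)
```

Gate connections deliberately do not merge blocks, so the buffer splits into two stages that could each be modeled by one CSM.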

While the nonlinear transistor quantities depend on internal node potentials, the model components shall be functions of port voltages only. It has been observed that all internal node voltages have very small time constants. Hence, any particular solution decays quickly, usually within one time step of a transient simulation. The node potentials therefore have the same values as in a DC simulation with fixed port voltages. Consequently, the node charge values will also be identical. This observation is used to implement a very efficient characterization without transient simulations. DC voltage sources are attached to the active pins of the stage and swept from Vss to Vdd. Based on the topological search, measurement statements of (1-3) are executed and the data for the port quantities is obtained. In contrast to existing methods, there is no interdependence among the model components. Hence, the complete model comprising static and dynamic components can be characterized simultaneously in a single DC simulation.

Having multiple parallel transistors to increase driving strength results in a rather large linear parasitic input network. This causes a notable signal delay which is accounted for by a lowpass filter. The model elements Rˆa and Cˆa are chosen to match the average cutoff frequency of the connected transistor gate pins. The filter is attached to a duplicate of the input voltage to preserve the receiver properties modeled by Qˆa. The delayed input voltage is used to control the nonlinear elements.
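The DC sweep described above can be pictured as filling one 2D table per model component from a grid of fixed port voltages. The sketch below assumes a callback `measure(va, vz)` standing in for the DC simulation with the measurement statements of (1-3); `toy_inverter` is a made-up linear stand-in with arbitrary constants, not a transistor model, and the grid size is an arbitrary choice.

```python
def sweep(vmin, vmax, steps):
    """Evenly spaced DC bias points from vmin to vmax (inclusive)."""
    return [vmin + (vmax - vmin) * k / (steps - 1) for k in range(steps)]

def characterize(measure, vss=0.0, vdd=1.2, steps=13):
    """Fill one 2D LUT per model component from a grid of DC operating points.

    measure(va, vz) must return the tuple (Iz, Qz, Qa) for fixed port
    voltages; tables are indexed as lut[index of va][index of vz].
    """
    grid = sweep(vss, vdd, steps)
    lut_i, lut_qz, lut_qa = [], [], []
    for va in grid:
        row = [measure(va, vz) for vz in grid]
        lut_i.append([r[0] for r in row])
        lut_qz.append([r[1] for r in row])
        lut_qa.append([r[2] for r in row])
    return grid, lut_i, lut_qz, lut_qa

def toy_inverter(va, vz, vdd=1.2, g=1e-4, c_out=2e-15, c_in=1e-15, c0=0.5e-15):
    """Hypothetical linear stand-in for the inverter stage (illustration only)."""
    i_z = g * ((vdd - va) - vz)        # static current pulling vz toward vdd - va
    q_z = c_out * vz + c0 * (vz - va)  # output charge with a coupling term
    q_a = c_in * va + c0 * (va - vz)   # input charge with a coupling term
    return i_z, q_z, q_a

grid, lut_i, lut_qz, lut_qa = characterize(toy_inverter)
```

All three tables come out of the same sweep, which mirrors the point that the static and dynamic components do not depend on each other.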

3.2 Handling Parameter Variations

Deviations of process or environmental parameters from their nominal values affect transistor quantities like saturation current or overlap capacitances, leading to altered cell delays. The CSM accounts for this by modeling the physical impact of variations on the model quantities port current and port charge. Consistent with existing simulation methods, each parameter is described as the superposition of nominal value $p_i^n$ and deviation $\Delta p_i$. The latter is composed of global, local, and random influences.

\[ p_i = p_i^n + \Delta p_i = p_i^n + p_i^g + p_i^l + p_i^r \tag{4} \]

This makes it possible to model correlation between local variations of parameters of closely placed cells. Consequently, every CSM instance faces an individual set of parameter deflections Δp. Intra cell variation is not considered but could be modeled in the same way. Supply voltage drops are treated similarly to parameters with expected deviations of up to 15%. An individual Vdd-drop can be assigned to each stage of a CSM.

Every parameter variation $\Delta p_i$ causes an additional static current and additional charges. If $\Delta p_i$ is sufficiently small, the first order approximation of the model components is given as

\[ \hat{I} = \hat{I}^n + \Delta\hat{I} = \hat{I}^n + \sum_i \frac{d\hat{I}}{dp_i} \cdot \Delta p_i \tag{5} \]
\[ \hat{Q} = \hat{Q}^n + \Delta\hat{Q} = \hat{Q}^n + \sum_i \frac{d\hat{Q}}{dp_i} \cdot \Delta p_i \tag{6} \]
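As a minimal sketch of the first order update in (5) and (6), the snippet below adds the sensitivity contributions to every entry of a nominal table. The function name `apply_variation`, the parameter names `vth` and `tox`, and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def apply_variation(nominal, sensitivities, deltas):
    """Return nominal + sum_i sensitivity_i * delta_i for every LUT entry.

    nominal:       2D list, the nominal table
    sensitivities: dict parameter name -> 2D list of d(entry)/d(parameter)
    deltas:        dict parameter name -> deviation from the nominal value
    """
    updated = [row[:] for row in nominal]  # leave the nominal table untouched
    for name, delta in deltas.items():
        for r, row in enumerate(sensitivities[name]):
            for c, s in enumerate(row):
                updated[r][c] += s * delta
    return updated

# Hypothetical 2x2 current table and two hypothetical parameters.
nominal_i = [[0.0, 1.0], [2.0, 3.0]]
sens = {
    "vth": [[-1.0, -1.0], [-2.0, -2.0]],
    "tox": [[0.5, 0.5], [0.5, 0.5]],
}
updated = apply_variation(nominal_i, sens, {"vth": 0.1, "tox": -0.2})
```

Because the update is linear, each instance can receive its own set of deviations while sharing the same nominal and sensitivity tables.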

The applicability of every CSM modeling method strongly depends on the costs for obtaining the linear sensitivity of a quantity w.r.t. a parameter, here $d\hat{I}/dp_i$ and $d\hat{Q}/dp_i$. All methods which excessively employ transient simulation for model characterization run into severe complexity problems. The proposed white-box method based upon netlist information is very efficient since a complete stage is characterized in a single, very fast, DC simulation. Since the relation of netlist elements and CSM components is known from the nominal characterization, the sensitivities to parameter variations are also immediately assigned to the parametric CSM components. By reusing the symbolic equations of (1-3), the linear sensitivities of the model components are given as

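\[ \frac{d\hat{I}_z}{dp_i} = \frac{dI_d^{M1}}{dp_i} + \frac{dI_d^{M2}}{dp_i} \tag{7} \]
\[ \frac{d\hat{Q}_z}{dp_i} = \frac{dQ_d^{M1}}{dp_i} + \frac{dQ_d^{M2}}{dp_i} + C_4 \cdot \frac{d(v_{z_0} - v_{ss})}{dp_i} + C_0 \cdot \frac{d(v_{z_0} - v_{a_0})}{dp_i} \tag{8} \]
\[ \frac{d\hat{Q}_a}{dp_i} = \frac{dQ_g^{M1}}{dp_i} + \frac{dQ_g^{M2}}{dp_i} + C_2 \cdot \frac{d(v_{a_0} - v_{ss})}{dp_i} + C_0 \cdot \frac{d(v_{a_0} - v_{z_0})}{dp_i} \tag{9} \]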

The numerical values of (7-9) are obtained through simulation with subsequent calculation of finite differences. Each parameter is positively and negatively deflected by one standard deviation while all other parameters are kept constant. If cross dependencies are significant, more DC simulations can be performed to cover additional points in the parameter space. However, we observed that second order effects can be neglected. Hence, for N parameters with significant influence, (2N + 1) DC simulations are required, which takes a few minutes on standard computers.

For illustration, generating the nominal CSMs for two timing arcs of a nand gate was done in 32 seconds on a desktop machine. For comparison, CSMs have been generated according to the method proposed in [3]. 46 minutes and 20 seconds were needed to generate the two nominal CSMs. Therefore, our proposed approach is faster by a factor of 86. Similar factors have been observed for other cell types. The full model generation including the sensitivities w.r.t. six parameters and Vdd required 9 minutes and 5 seconds using our method but would take about 12 hours for the other approach.
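The finite difference scheme with one deflection up and one down per parameter can be sketched as follows. Here `simulate` is a hypothetical stand-in for one DC simulation returning a single measured quantity, and the example function and parameter names are invented; the run counter simply confirms the (2N + 1) count.

```python
def finite_difference_sensitivities(simulate, nominal, sigma):
    """Central finite differences, deflecting one parameter at a time.

    simulate(params) -> float stands in for one DC simulation.
    Returns (nominal value, {name: d value / d parameter}, number of runs).
    """
    base = simulate(nominal)
    runs = 1
    sens = {}
    for name in nominal:
        hi = dict(nominal)
        lo = dict(nominal)
        hi[name] += sigma[name]   # deflect by +1 standard deviation
        lo[name] -= sigma[name]   # deflect by -1 standard deviation
        sens[name] = (simulate(hi) - simulate(lo)) / (2.0 * sigma[name])
        runs += 2
    return base, sens, runs

# Invented smooth response with two parameters a and b.
def toy_response(p):
    return 2.0 * p["a"] + 3.0 * p["b"] ** 2

base, sens, runs = finite_difference_sensitivities(
    toy_response, {"a": 1.0, "b": 1.0}, {"a": 0.1, "b": 0.1}
)
```

For this toy response the central difference recovers the slopes 2 and 6 at the nominal point, and two parameters cost five runs.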

3.3 Implementation

The characterization starts with the topology analysis of transistor netlist files. SPICE simulations are conducted for each timing arc and measured port values are stored in ASCII LUTs. Finally, the CSMs are generated either as Verilog-A modules or as subcircuits using compiled models for the nonlinear elements [5]. Verilog-A models are supported by many circuit simulators, but more speedup is gained with compiled models. The compiled model interface (CMI) makes it possible to use CSMs in simulators like Spectre or UltraSim. Similar interfaces exist for other simulators.

New circuit elements for voltage controlled current source and voltage controlled charge have been implemented. They support 2D LUTs of variable size provided as ASCII files. During an initialization phase, nominal and sensitivity tables are imported. In case of a parameter alteration, the simulator provides the numerical value of the deviation and the instance tables are updated according to (5). This is done prior to any transient analysis and for every entry in the LUTs. During the simulation, bilinear interpolation is applied to the final tables. It is preferred to multidimensional approximation functions, since it is flexible enough to support the modified tables for parameter variation and sufficiently accurate. Because only channel connected blocks are modeled, the functions are reasonably smooth. Hence, good convergence properties are obtained even for moderate discretization of the LUTs. It was also observed that the size of the 2D LUTs was not runtime critical.

In Verilog-A the variation is modeled by additional current and charge contributions. Hence, additional interpolations must be performed for each parameter in every iteration. Unfortunately, Verilog-A's interpolation function $table_model is rather slow. Almost no speedup was gained in the experiments.
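Bilinear interpolation in a 2D LUT, followed by the port current as static current plus dQ/dt, might look like the sketch below. The table layout `table[ia][iz]`, the helper names, and the backward difference used for dQ/dt are assumptions for illustration; a simulator would use its own integration scheme and sign conventions.

```python
from bisect import bisect_right

def bilinear(grid_a, grid_z, table, va, vz):
    """Bilinear interpolation in a 2D LUT indexed as table[ia][iz]."""
    def cell(grid, v):
        # Index of the lower grid point, clamped so that i + 1 stays valid.
        i = min(max(bisect_right(grid, v) - 1, 0), len(grid) - 2)
        t = (v - grid[i]) / (grid[i + 1] - grid[i])
        return i, t

    i, ta = cell(grid_a, va)
    j, tz = cell(grid_z, vz)
    low = table[i][j] * (1 - tz) + table[i][j + 1] * tz
    high = table[i + 1][j] * (1 - tz) + table[i + 1][j + 1] * tz
    return low * (1 - ta) + high * ta

def port_current(grid, lut_i, lut_q, prev, now, dt):
    """Static current plus a backward-difference estimate of dQ/dt."""
    (va0, vz0), (va1, vz1) = prev, now
    dq = bilinear(grid, grid, lut_q, va1, vz1) - bilinear(grid, grid, lut_q, va0, vz0)
    return bilinear(grid, grid, lut_i, va1, vz1) + dq / dt

# Tiny hypothetical tables: I = 2*va + vz, Q = 1e-15 * vz.
grid = [0.0, 1.0]
lut_i = [[0.0, 1.0], [2.0, 3.0]]
lut_q = [[0.0, 1e-15], [0.0, 1e-15]]
mid = bilinear(grid, grid, lut_i, 0.5, 0.5)
i_z = port_current(grid, lut_i, lut_q, (0.0, 0.0), (0.0, 1.0), 1e-12)
```

Since the lookup only reads the final tables, the same routine works unchanged after the tables have been shifted for parameter variation.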
The integration into commercial SPICE and FastSPICE simulators makes it possible to perform timing and noise analyses as well as MC simulations. This broadens the applicability of CSMs, since it is now possible to efficiently simulate circuits containing transistor models, behavioral models, and CSMs. The model can still be used in dedicated timing and noise simulators. In particular, the sensitivity tables are valuable data for statistical approaches such as [19] or [25].

4 Results

A CSM library has been automatically generated for 293 90nm CMOS gates with extracted parasitics. The cells have 1 to 10 input pins and consist of up to three stages. The influences of the six most dominant process parameters and static supply voltage drop have been considered. The complete structure recognition required less than one second. Generating the CSMs for each cell required 20 minutes on average using a 2 GHz Linux machine with 4GB RAM.

To evaluate model accuracy and performance, the CSMs have been compared against the transistor level implementations (BSIM) of logic cells using an in-house SPICE simulator. Models for every timing arc of every gate have been tested individually by performing 50 MC runs for different combinations of input waveforms, CRC Π-loads, and parameters. The histograms in Fig. 3 show relative delay and slew errors for these tests. For the majority of testcases the CSM delay prediction matches the BSIM reference. In 93.18% of the testcases the delay error was less than 2%, and 99.58% are within 5% of BSIM. The error of output slew was less than 2% for 96.54% and less than 5% for 99.86% of all tests. The CSM therefore provides significantly more accuracy than NLDM [10] while already supporting parameter variation and non-ideal input waveforms.

[Figure 3: two histograms of testcase counts over Slope Error [%] and Delay Error [%].]

Fig. 3. Relative delay and slew errors for all cells with different CRC Π-loads, input slews, and parameter variation (50 MC runs per timing arc)

Fig. 4 demonstrates this capability using an inverter and a noisy input signal. In plot A the input waveform is depicted together with the two output waveforms predicted by BSIM and the CSM, respectively. The same noisy input has been applied to the gate while different simulation modifications were made. Plot B shows the output waveforms if one parameter is altered. In plot C, all six parameters have been randomly deflected. In the scenario of plot D, arbitrary parameter variation and an additional supply voltage drop have been applied. For all cases the waveforms overlap almost completely. This also shows that first order sensitivities are suitable to capture parameter variations for the CSM components.

[Figure 4: four panels A to D, voltage [V] over time [ps], showing the Input, BSIM, and CSM waveforms; panels B to D also show the nominal waveform of panel A.]

Fig. 4. Accurate waveform prediction in the presence of noise for nominal conditions (A), one altered parameter (B), all altered parameters (C), additional Vdd-drop (D)

After studying each gate individually, critical paths of ISCAS85 circuits have been simulated with SPICE and FastSPICE using transistor models (BSIM) and the current source models (CSM). Table 1 compares the predicted path delays and simulation times for 50 MC runs in SPICE. Good accuracy is achieved, with most mean errors being less than 1%. The simulations could be accelerated by factors of 74 to 175. For the circuit c6288 this means a reduction from 3 days and 11 hours to about 30 minutes. The correlation plot of path delays for c1355 in Fig. 5 shows that most errors are within 5% while the maximum error is 8.9%. Similar results are obtained for the other circuits.

Table 1. Simulation time and path delay errors of 50 MC runs in SPICE using transistors and CSMs

          Delay Error [%]      CPU-Time
Circuit   mean      max        BSIM [s]     CSM [s]    Speedup
c17        0.044    -1.762        151.32       1.21    125.06
c1355      0.753     8.883      14332.32     107.18    133.72
c880       0.076     8.339      15343.50     121.27    126.52
c1908     -0.499     8.642      23176.59     202.32    114.55
c2670     -0.519     5.542      14838.57     180.74     82.10
c5315     -0.159     9.472      16739.05     226.45     73.92
c6288     -3.159    -9.215     299763.30    1715.30    174.76

[Figure 5: scatter plot of normalized delay (CSM) over normalized delay (BSIM) with guide lines at ±0%, ±5%, and ±10%.]

Fig. 5. Correlation plot of path delay variation for c1355

The above studies focused on verifying the CSM accuracy. It has been further investigated whether CSMs can improve existing tools used for timing analysis. FastSPICE simulators provide the necessary functions for timing verification with transistor level accuracy [4]. They apply circuit partitioning, use simpler device models and adaptively controlled explicit simulation [8]. CSMs further reduce the computational effort by combining several transistors of a logic cell into three LUTs. Table 2 compares the simulation times and speedup factors for different models and simulators. As expected, SPICE with BSIM models is prohibitively time consuming. Replacing the cells by CSMs yields a significant acceleration by factors of 56 to 88 for most circuits (220 for c6288). Simulation times are now of the same order as the FastSPICE simulator with transistor models. These times can be further reduced by factors of 4 to 98 by using CSMs as cell models in FastSPICE. Especially remarkable are simulation times and speedup for c6288. This circuit consists of many identical gates. Hence, in contrast to other circuits, only a few CSMs must be held in memory during simulation, resulting in fewer cache misses and higher speedup. This effect can be illustrated by reducing the circuit size. Truncating the path to 50% or 25% decreases the speedup to 62.99 and 38.14, respectively. Table 2 further lists the relative path delay errors compared to SPICE with BSIM models. Using a FastSPICE simulator has caused more error than using CSMs in SPICE. Furthermore, using CSMs in a FastSPICE simulator did not result in noticeable additional errors.

Table 2. Performance comparison for different simulators and models

          SPICE Runtime [s]               FastSPICE Runtime [s]          Relative delay error [%]
Circuit   BSIM        CSM      BSIM/CSM   BSIM       CSM     BSIM/CSM    SPICE CSM   FastSPICE BSIM   FastSPICE CSM
c17          41.05     0.70     58.6         3.06    0.32     9.6          0.00       −1.46            −1.46
c880       2180.24    27.01     80.7        42.37    5.22     8.1         −0.31       −2.02            −2.02
c1355      2008.78    22.96     87.5        40.95    4.62     8.9          0.26       −2.71            −2.71
c1908      3473.82    41.60     83.5        35.67    7.46     4.8         −1.49       −2.33            −2.33
c2670      2197.10    39.33     55.9        32.12    7.35     4.4         −0.84       −2.95            −3.01
c5315      2742.89    47.27     58.0        38.09    9.34     4.2         −1.08       −3.07            −2.90
c6288     30865.54   140.04    220.4      1725.58   17.57    98.0         −2.86       −2.56            −2.56

5 Conclusion

A current source modeling technique for logic gates has been presented. By obtaining additional information from the transistor netlist, very efficient model characterization based on DC simulations has been realized. This allows fast CSM library generation including the sensitivities to process parameters and supply voltage. The CSMs have been realized as compiled circuit components and used in SPICE and FastSPICE timing analysis of ISCAS85 circuits. At the cost of up to 3% delay error, SPICE simulation times could be reduced to those of FastSPICE simulators. Alternatively, an additional speedup of 4-98x was realized when using CSMs in a FastSPICE simulator without additional error penalty.

Acknowledgement

This work has been supported by the German Ministry of Education and Research (BMBF) within the project 'Sigma65' (Project ID 01M3080A). The content is the sole responsibility of the authors.

References

1. Composite current source (December 2006), http://www.synopsys.com/products/solutions/galaxy/ccs/cc_source.html
2. ECSM - effective current source model (2007), http://www.cadence.com/Alliances/languages/Pages/ecsm.aspx
3. Amin, C., Kashyap, C., Menezes, N., Killpack, K., Chiprout, E.: A multi-port current source model for multiple-input switching effects in CMOS library cells. In: ACM/IEEE Design Automation Conference (DAC), pp. 247–252 (2006)
4. Cadence. UltraSim User's Manual (June 2003)
5. Cadence. Compiled-Model Interface Reference (November 2004)
6. Chopra, K., Kashyap, C., Su, H., Blaauw, D.: Current source driver model synthesis and worst-case alignment for accurate timing and noise analysis. In: ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, pp. 45–50 (2006)
7. Croix, J., Wong, M.: Blade and razor: cell and interconnect delay analysis using current-based models. In: ACM/IEEE Design Automation Conference (DAC), pp. 386–389 (June 2003)
8. Devgan, A., Rohrer, R.A.: ACES: A transient simulation strategy for integrated circuits. In: IEEE International Conference on Computer Design (ICCD), pp. 357–360 (1993)
9. Fatemi, H., Nazarian, S., Pedram, M.: Statistical logic cell delay analysis using a current-based model. In: ACM/IEEE Design Automation Conference (DAC), pp. 253–256 (July 2006)
10. Feldmann, P., Abbaspour, S., Sinha, D., Schaeffer, G., Banerji, R., Gupta, H.: Driver waveform computation for timing analysis with multiple voltage threshold driver models. In: ACM/IEEE Design Automation Conference (DAC), pp. 425–428 (2008)
11. Gandikota, R., Chopra, K., Blaauw, D., Sylvester, D.: Victim alignment in crosstalk-aware timing analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 29(2), 261–274 (2010)
12. Goel, A., Vrudhula, S.: Current source based standard cell model for accurate signal integrity and timing analysis. In: Design, Automation and Test in Europe (DATE), pp. 574–579 (2008)
13. Goel, A., Vrudhula, S.: Statistical waveform and current source based standard cell models for accurate timing analysis. In: ACM/IEEE Design Automation Conference (DAC), pp. 227–230 (June 2008)
14. Jain, A., Blaauw, D., Zolotov, V.: Accurate delay computation for noisy waveform shapes. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 947–953 (2005)
15. Kashyap, C., Amin, C., Menezes, N., Chiprout, E.: A nonlinear cell macromodel for digital applications. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 678–685 (2007)
16. Keller, I., Tseng, K., Verghese, N.: A robust cell-level crosstalk delay change analysis. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 147–154 (2004)
17. Knoth, C., Kleeberger, V.B., Nordholz, P., Schlichtmann, U.: Fast and Waveform Independent Characterization of Current Source Models. In: IEEE/VIUF International Workshop on Behavioral Modeling and Simulation (BMAS), pp. 90–95 (September 2009)
18. Li, P., Feng, Z., Acar, E.: Characterizing multistage nonlinear drivers and variability for accurate timing and noise analysis. IEEE Transactions on VLSI Systems 15(11), 1205–1214 (2007)
19. Liu, B., Kahng, A.B.: Statistical gate level simulation via voltage controlled current source models. In: IEEE International Behavioral Modeling and Simulation Workshop (September 2006)
20. Mitev, A., Ganesan, D., Shanmugasundaram, D., Cao, Y., Wang, J.M.: A robust finite-point based gate model considering process variations. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 692–697 (2007)
21. Nassif, S., Li, Z.: A more effective Ceff. In: IEEE International Symposium on Quality Electronic Design (ISQED), pp. 648–653 (2005)
22. Raja, S., Varadi, F., Becer, M., Geada, J.: Transistor level gate modeling for accurate and fast timing, noise, and power analysis. In: ACM (ed.) ACM/IEEE Design Automation Conference (DAC), Anaheim, California, USA, pp. 456–461 (June 2008)
23. Venkataraman, G., Feng, Z., Hu, J., Li, P.: Combinatorial algorithms for fast clock mesh optimization. IEEE Transactions on VLSI Systems 18(1), 131–141 (2010)
24. Wang, X., Kasnavi, A., Levy, H.: An Efficient Method for Fast Delay and SI Calculation Using Current Source Models. In: IEEE International Symposium on Quality Electronic Design, Washington, DC, USA, pp. 57–61. IEEE Computer Society, Los Alamitos (2008)
25. Zolotov, V., Xiong, J., Abbaspour, S., Hathaway, D.J., Visweswariah, C.: Compact modeling of variational waveforms. In: IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Piscataway, NJ, USA, pp. 705–712. IEEE Press, Los Alamitos (2007)

Controlled-Precision Pure-Digital Square-Wave Frequency Synthesizer*

Abdelkrim Kamel Oudjida, Ahmed Liacha, Mohamed Lamine Berrandjia, and Rachid Tiar

Microelectronics and Nanotechnology Division, Centre de Développement des Technologies Avancées (CDTA), Baba-Hassen, BP. 17, 16303 Algiers, Algeria {a_oudjida,liacha,mberrandjia,rtiar}@cdta.dz

Abstract. In this paper, a new pure-digital frequency synthesizer Fout = (X/Y)•Fin for square-waves with controlled precision is described. Here, Fin is the input reference frequency provided by a stable crystal oscillator, Fout is the synthesized frequency, and X and Y are two co-prime integer numbers. The purpose is to demonstrate that with exclusively simple digital techniques, a frequency synthesizer with high precision, fast switching time and medium frequency bandwidth can be achieved. In conformity with design-reuse methodology, the frequency synthesizer is implemented as a technology-independent and generic IP-core, easily adaptable to suit any particular need.

Keywords: Precision, Frequency Bandwidth, Switching Time, Double-Edge-Triggered Flip-Flops (DETFF).

1 Introduction

High precision, wide bandwidth and fast switching time are the main required specifications for modern frequency synthesizers [1][2]. In the literature, there exists a plethora of solutions, but roughly speaking, all fall into one of two categories: analog solutions or digital ones. While analog solutions deliver better results, they remain very expensive as they are more difficult to design (requiring careful control of all active components), implement (especially in modern low-cost processes optimized for digital systems), and maintain (there is no possibility of "patching" the circuit). Compared to their analog counterparts, digital solutions are more stable, but suffer from a serious drawback: limited frequency bandwidth.

One of the most recent and effective digital solutions is described in [3]. While this solution is based on an interesting mathematical concept, its corresponding hardware implementation presents many weaknesses: an oversized solution (adaptive control) to handle the precision problem, varying switching time, an unoptimized solution for frequency bandwidth (use of time consuming parallel multiplier and divider), and unknown equations for error, jitter and duty-cycle. Based on the mathematical concept developed in [3], this paper introduces a new implementation alternative that overcomes all of the above-mentioned shortcomings.

The paper is organized as follows. In this section we outlined the main requirement specifications for modern frequency synthesizers. Section 2 introduces the functioning principle of our proposed architecture. Section 3 deals with the theoretical aspect of the solution. Implementation results are discussed in Section 4, followed by some concluding remarks.

* This work was supported by "Centre de Développement des Technologies Avancées" (CDTA), Algiers, Algeria.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 211–217, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Functioning Principle of the Solution

Our architecture (Figure 1) is essentially composed of two readable/writable registers to store the X and Y co-prime integer numbers, an Up (C1) and a Down (C2) counter, an adder and a subtractor, and a crystal oscillator that generates a stable standard frequency Fc. A host-side interface is also included to read/write the X & Y registers on the fly.

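[Figure 1 (block diagram): crystal oscillator (Xtal) providing Fc, input Fin, Y register, counter C1 (K•Y, loaded each K cycles), X register, counter C2 (K•Y/X), output Fout.]

Fig. 1. Block Diagram of the Frequency Synthesizer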

Fin is sampled during each Fc period, such that Fc = K•Fin and the accumulated result (K•Y) in C1 is loaded into C2. Then, at each clock cycle of Fc, the X value is subtracted from C2 until C2≤0, such that Fout = Fc / [K•(Y/X)]. When Fc is replaced by K•Fin, we obtain: Fout = (X/Y) • Fin.
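The principle can be checked with a small behavioral model. The sketch below is a simplification that assumes Fc is exactly K•Fin and that one output period lasts until C2 reaches zero or below; the function name and the example values are invented, and the double sampling refinements of Section 3 are not modeled.

```python
def synthesized_frequency(f_in, k, x, y):
    """Behavioral model of the basic scheme: accumulate Y in C1, count down by X in C2.

    Returns the output frequency obtained from the number of Fc cycles
    that C2 needs to reach zero or below.
    """
    f_c = k * f_in            # assumption: Fc is an exact multiple of Fin
    c1 = 0
    for _ in range(k):        # one Fin period spans K cycles of Fc
        c1 += y               # C1 accumulates Y on every Fc cycle
    c2 = c1                   # the accumulated result K*Y is loaded into C2
    cycles = 0
    while c2 > 0:             # subtract X on every Fc cycle until C2 <= 0
        c2 -= x
        cycles += 1
    return f_c / cycles

# K*Y divisible by X: the result equals (X/Y) * Fin.
exact = synthesized_frequency(1000.0, 1000, 2, 7)
# Small K: the countdown length is rounded up (70/3 becomes 24 cycles).
coarse = synthesized_frequency(1000.0, 10, 3, 7)
```

With K = 1000 the model returns about 285.71 Hz, which is (2/7)•1000 Hz, while K = 10 gives about 416.67 Hz instead of 428.57 Hz; this illustrates why the discarded fractional part matters and why a larger K improves precision.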

3 Theoretical Aspect of the Solution

3.1 Precision

The error in the digital frequency synthesizer is due to the missing fractional part after the cumulative arithmetic operation in C1 is terminated (K•Y rather than K•Y+ r, where 0≤ r

Error is calculated as |Ftout - Fsout| / Ftout, where Ftout and Fsout are the desired theoretical frequency and the synthesized frequency, respectively. To minimize error, a simple double sampling technique on rising (↑) and falling (↓) edges of Fc during N (N = 2^n for easy shift operations) periods of Fin is used, as depicted in Figure 2. This has the benefit of not only doubling the frequency bandwidth, as 2Fc is used instead of Fc, but also of considerably reducing the error, jitter and duty-cycle deviation.

[Figure 2 (timing diagram, special case N=1, K=3): Fin, Fin/2N, Fc, the C1 count (Y, 2Y, ..., 7Y in period P1 and Y, 2Y, ..., 8Y in period P2), and the resulting Fout(P1) and Fout(P2); jitter on Fout = (Tc/2) • Floor(Y/X).]

Fig. 2. Double Sampling Technique on N Cycles of Fin

The final value obtained in C1, which can be either [(2•N•K+1)•Y] or [(2•N•K+2)•Y], is loaded into C2, and decremented (-N•X) on both ↑ and ↓ of Fc until C2≤0. This technique yields a maximum error of:

\[ \frac{1}{1 - \dfrac{1}{(2 \cdot N \cdot K + 1) \cdot \frac{Y}{X}}} - 1 \]

The maximum jitter between successive Fout periods is equal to:

[Tc/2] • Floor[Y/(N•X)]

To ensure a duty-cycle as close to 50% as possible, another simple technique based on the use of an up and a down counter (C2 duplicated) is described in Figure 3. This technique guarantees a maximum duty-cycle of:

50% + [(X/Y)/(2K+1/N)]%

However, to guarantee a duty-cycle between 40% and 60% (which is the norm), the following condition must be met:

K ≥ Ceil[5 (X/Y) – 1/(2•N)]

As N & K are two generic parameters in RTL code, they can be individually set to reach any desired precision.
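The condition on K can be recovered from the duty-cycle expression if the bracketed term is read as a fraction of the output period rather than as a percentage; this reading is an assumption, made here because it is the one that reproduces the stated bound. A short worked derivation:

```latex
\[
\frac{1}{2} + \frac{X/Y}{2K + 1/N} \le 0.6
\;\Longrightarrow\;
2K + \frac{1}{N} \ge 10\,\frac{X}{Y}
\;\Longrightarrow\;
K \ge 5\,\frac{X}{Y} - \frac{1}{2N}
\]
% Example with N = 1, K = 2, X = 2, Y = 7 (the special case of Figure 3):
\[
\frac{1}{2} + \frac{2/7}{2 \cdot 2 + 1} \approx 0.557,
\qquad
5 \cdot \frac{2}{7} - \frac{1}{2} \approx 0.93 \le K = 2
\]
```

Under this reading, the example of Figure 3 has a duty-cycle of about 55.7% and satisfies the condition with margin.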

[Figure 3 (timing diagram, special case N=1, K=2, X=2, Y=7): Fc; down counter starting at (2K+1)•Y – N•X = 33 and counting 33, 31, ..., 1; up counter starting at 0 + N•X = 2 and counting 2, 4, ..., 34; toggle of Fout; duty cycle: 50% + [(X/Y)/(2K+1/N)]%.]

Fig. 3. 50% Duty-Cycle Technique

3.2 Switching Time

Switching time is the latency between any variation of the X or Y value and their corresponding Fout. The maximum switching time is: N•Tin + 2Tc , where Tin and Tc are the periods of Fin and Fc, respectively.

3.3 Frequency Bandwidth

The RTL code is technology-independent, highly reconfigurable, and written according to the rules and recommendations given in [5]. The frequency bandwidth (Fc_Max) as well as the area occupation depend exclusively on the precision factor N, the K ratio, and the bit size of the Y register. To prevent overflow, all bit sizes of internal counters and registers are set to:

Ceil[log2((2•N•K+2) • 2^Y_reg_bit_size)]
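As a quick numerical illustration of this sizing rule, the sketch below evaluates Ceil[log2((2•N•K+2) • 2^Y_reg_bit_size)] with exact integer arithmetic. The function name `counter_bits` and the sample parameter values are only for illustration.

```python
def counter_bits(n, k, y_reg_bits):
    """Ceil(log2((2*N*K + 2) * 2**Y_reg_bit_size)) using exact integers.

    For a positive integer v, (v - 1).bit_length() equals ceil(log2(v)),
    which avoids floating point rounding near powers of two.
    """
    largest = (2 * n * k + 2) * (1 << y_reg_bits)
    return (largest - 1).bit_length()

bits_small = counter_bits(1, 3, 8)        # (2*3 + 2) * 256 = 2048 = 2**11
bits_medium = counter_bits(1, 10, 8)      # 22 * 256 = 5632
bits_large = counter_bits(1, 10**6, 8)    # 2000002 * 256 = 512000512
```

For an 8-bit Y register and N = 1, this gives 11 bits at K = 3, 13 bits at K = 10, and 29 bits at K = 10^6, so the counter width grows only logarithmically with K.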

Our RTL coding style requires that: Fin & Fout ≤ Fc_Max , given that Fc is the master clock of the circuit (Figure 1). The actual maximal rate at which Fc can run (Fc_Max) depends on the physical characteristics of the chip, either ASIC or FPGA.

4 Implementation Results

To implement the double sampling technique, we needed Double-Edge-Triggered Flip-Flops (DETFF), which allow data to be registered on both rising and falling edges of the clock [6]. Unfortunately, these types of flip-flops are not integrated within Xilinx FPGAs [7], which was the sole implementation device available to us. Therefore, to circumvent this hurdle, we extracted all mathematical equations (Table 1) describing the main features of the proposed architecture depending on how the two counters C1 and C2 are triggered. It is important to note that whatever the implementation case (I, II, III or IV of Table 1), any precision can be attained, since the precision factor N is a generic parameter that can be set accordingly. However, compared to case I, the maximum frequency bandwidth is divided by two in the other cases.

We implemented the solution corresponding to case II, where only C1 is double-triggered. This was achieved by using two simple-edge-triggered counters (C1 duplicated), running respectively on opposite edges. When the Fin/2N signal toggles (Figure 1), the accumulated results of the two counters are delayed by one Tc cycle in order to stabilize before being summed and then loaded into the C2 counter. Such a trick simplifies the timing analysis of the architecture; otherwise it becomes more complicated, as two types of clock-to-setup paths exist in the architecture: rising-to-falling and falling-to-rising Fc edge paths. In this case, double timing constraints Ch and Cl are to be respectively observed on both the high-level and low-level of the Fc clock, such that: Tc ≥ 2•Max(Ch, Cl). Without the use of the trick just mentioned, Fc_Max will be significantly degraded.

The whole design code, either for synthesis or functional verification, was implemented in Verilog 2001 (IEEE 1364). The RTL code was simulated at both the RTL and gate level (post place & route netlist) with timing back-annotation using ModelSim SE 6.3f, and mapped onto Xilinx FPGAs using Foundation ISE 10.1. It is worth mentioning that all results, either for slice occupation or delays, are obtained using the default options of the implementation software (Foundation ISE 10.1) with the selection of the smallest FPGA device in each family (Virtex5 & Virtex2) with the fastest speed grade.

The design has undergone a thorough functional software verification procedure according to the IP development methodology summarized in [8]. As for the physical test, the synthesizer was integrated in a Microblaze SoC environment using a V2MB1000 demonstration board [4] with Xilinx's EDK 9.1i. The obtained errors (Table 2) are compared to those given in [3].
To characterize our synthesizer in terms of speed and area, RTL code with N=1, various K ratios and an 8-bit X & Y register size was mapped onto recent (Virtex-5) and old (Virtex-2) FPGAs. The results are summarized in Table 3. The slice utilization ratio between the two FPGA families is not only due to the number of slices included, which are 4800 for the Virtex-5 and 256 for the Virtex-2, but also because of the difference in the number of look-up-tables (LUTs) per slice, which is: 2 LUTs of 4 inputs each for Virtex-2 devices, and 4 LUTs of 6 inputs each for Virtex-5 devices. As for speed, there is almost no significant difference in terms of delay with regard to large variations of the K factor (Table 3). Delays were calculated for two types of paths: Clock-To-Setup and all paths together (Pad-To-Setup, Clock-To-Pad, and Pad-To-Pad). The Clock-To-Setup (Table 3) gives more precise information on the delays than the other remaining paths, which depend on the I/O Block (IOB) configuration (low/high fanout, CMOS, TTL, LVDS, …).


Table 1. Main Features of the Architecture

Triggering Error Case Jitter Duty-Cycle (%) Frequency Switching Bandwidth Time C1 C2 General Case SC 1 − 1 1 1 X ⎡ 1 ⎤ ⎡ Y⎤ + . ≤ I ↑↓ ↑↓ − ()⋅ ⋅ + ⋅ ⎢ ⎥ Fin & Fout Fc _ Max 1 1 ⎢ 2 N K 1 ⎥ 2 K 2 Y ⎣2K + ()1/ N ⎦ ⎣ X⎦ T Y c ⋅ Floor ⋅ 1 2 N X − 1 1 1 + X ⎡ 1 ⎤ II ↑↓ ↑ − ⎡()⋅ ⋅ + Y⎤ 1 . ⎢ ⎥ 1 2 ⎢ 2 N K 1 ⎥ K − 2 Y ⎣K + ()1/ 2N ⎦ ⎣ X⎦ 2 N.Tin + 2Tc 1 F _ − 1 1 1 X ⎡ 1 ⎤ F & F ≤ c Max Y + . in out III ↑ ↑ − ⎡()⋅ ⎤ K −1 ⎢ ⎥ 2 1 1 ⎢ N K ⎥ 2 Y ⎣K + ()1/ N ⎦ ⎣ X⎦ Y T ⋅ Floor c N⋅X 1 − 1 1 1 + X ⎡ 1 ⎤ IV ↑ ↑↓ ⎡ Y⎤ ⋅ − . ⎢ ⎥ 1 −1 ()2 ⋅ N ⋅K 2 K 1 2 Y 2K + ()1/ 2N ⎢ X⎥ ⎣ ⎦ ⎣ ⎦ SC : Special Case for N=1 X=Y=1; ↑↓ : Double-Edge-Triggering ; ↑ : Positive or Negative Simple- Edge-Triggering.

Table 2. Error Comparison

                                     Our Design (Case II)        Stork's Design [3]
Fin (Hz)    K       1/(K-½) (%)      Fout (Hz)      Error (%)    Fout (Hz)   Error (%)
1502        20713   0.0048           1502.04        0.0027       1502        0
2010        15478   0.0064           2010.03        0.0015       2014        0.1990
4008        7762    0.0128           4008.01        0.0002       4016        0.1996
6004        5181    0.0193           6003.84        0.0003       6026        0.3664
10008       3108    0.0321           10009.60       0.0160       10068       0.5995
20004       1555    0.0643           20006.40       0.0120       20258       1.2694
40000       777     0.1287           40012.80       0.0320       40980       2.45
100000      311     0.3220           100160.25      0.1602       106600      6.6
1000000     31      3.2786           1003610.99     0.3610       NA          NA
10000000    3       40.0000          10370646.92    3.7064       NA          NA

Special case: Fc = 31.111 MHz; N=1; X=Y=1.
1/(K-½) is the maximum theoretical error of case II for N=1 & X=Y=1.
A Tektronix TLA-714 logic analyser was used for the physical measurements.
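As a quick cross-check of Table 2, the bound 1/(K-½) can be evaluated for a few rows and compared with the measured error. This is a sketch only: the function names are hypothetical, and the measured error is assumed here to be |Fout − Fin| / Fin, which matches the tabulated values for the rows used.

```python
def max_error_percent(k: int) -> float:
    """Maximum theoretical error of case II for N=1 and X=Y=1, in percent."""
    return 100.0 / (k - 0.5)


def measured_error_percent(f_in: float, f_out: float) -> float:
    """Relative deviation of the synthesized frequency (assumed definition)."""
    return abs(f_out - f_in) / f_in * 100.0


# (Fin, K, measured Fout) copied from a few rows of Table 2.
rows = [
    (1502, 20713, 1502.04),
    (10008, 3108, 10009.60),
    (100000, 311, 100160.25),
    (10000000, 3, 10370646.92),
]

for f_in, k, f_out in rows:
    bound = max_error_percent(k)
    error = measured_error_percent(f_in, f_out)
    print(f"K={k:6d}  bound={bound:8.4f} %  measured={error:8.4f} %")
```

For each of these rows the measured error stays below the theoretical bound, as the comparison in the table suggests.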

Table 3. FPGA Mapping Results

        Virtex 5 (xc5vlx30-3ff324) *          Virtex 2 (xc2v40-6-cs144) +
K       Fc_Max (MHz)   Slice Utilization      Fc_Max (MHz)   Slice Utilization
10^6    20.35          2.02 %                 12.11          73 %
10^5    22.95          1.85 %                 12.72          67 %
10^4    23.20          1.77 %                 12.93          59 %
10^3    23.45          1.62 %                 13.03          50 %
10^2    28.39          1.43 %                 13.24          44 %
10^1    28.67          1.33 %                 14.85          36 %

Special case: N=1; X & Y register size = 8 bits.
* : Total number of slices: 4800.
+ : Total number of slices: 256.

5 Conclusion

We have demonstrated, both theoretically and experimentally on an FPGA, the design of an effective square-wave frequency synthesizer with controlled precision using simple digital techniques. Since the RTL code is technology-independent, mapping it onto a deep-submicron standard-cell library with DET flip-flops would deliver a much higher frequency bandwidth. Compared to the Xilinx DCM (Digital Clock Manager), besides offering a higher controlled precision, our IP is not tied to a particular process technology.

As for applications, our IP can advantageously be incorporated in any type of design requiring a high level of synchronization (greater clock resolution) between source and destination for frame data transfer, such as in serial communication protocols: UART, SPI, I2C, OneWire, …

References

1. Staszewski, R.B., Balsara, P.T.: All-Digital Frequency Synthesizer Design in Deep Submicron CMOS. John Wiley & Sons, Inc., Publishers, Chichester (2006) ISBN: 0-471-77255-0
2. Manassewitsch, V.: Frequency Synthesizers: Theory and Design. Wiley-Interscience Publisher, Hoboken (2005) ISBN: 0-471-77263-1
3. Stork, M.: Digital Fractional Frequency Synthesizer Based on Counters. Turkish Journal of Electrical Engineering and Computer Sciences 14(3) (2006) TÜBITAK
4. Xilinx Inc., Virtex-II™ V2MB1000 Development Board User's Guide
5. Keating, M., Bricaud, P.: Reuse Methodology Manual for System on a Chip Designs, 3rd edn. Kluwer Academic Publishers, Dordrecht (2002) ISBN: 1-4020-7141-8
6. Pedram, M., et al.: A New Design for Double Edge Triggered Flip-Flops. In: Proceedings of Asia and South Pacific Design Automation Conference, pp. 417–421 (February 1998)
7. Xilinx Inc., Doubling Counter/Timer Resolutions with CoolRunner-II, XAPP910 (V1.0) (October 27, 2005)
8. Oudjida, A.K., et al.: Front-End IP-Development: Basic Know How, Revue Internationale des Technologies Avancées, Algeria, vol. (20), pp. 23–30 (December 2008) ISSN 1111-0902

An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis

Oliver Schrape¹, Frank Winkler², Steffen Zeidler¹, Markus Petri¹, Eckhard Grass¹, and Ulrich Jagdhold¹

1 IHP GmbH, Frankfurt (Oder), Germany {schrape,grass,jagdhold,petri,zeidler}@ihp-microelectronics.com 2 Humboldt University Berlin, Berlin, Germany [email protected]

Abstract. In this paper an All-Digital Phase-Locked Loop (ADPLL) with a high resolution and a wide frequency range for local on-chip clock generation is described. The proposed ADPLL has an operating range from 250 MHz to 1.3 GHz and a resolution of 25 ps. In contrast to other designs, the Digitally Controlled Oscillator (DCO) combines three different development approaches to achieve the desired performance. The ADPLL provides four different algorithms to control the DCO. Depending on the selected algorithm and the desired frequency, the lock-in time varies from 54 to hundreds of reference cycles. The output of the synthesized clock is directly connected to a Low Voltage Differential Signaling (LVDS) interface to provide a high-frequency LVDS clock. Before their VHDL implementation, all components were simulated using an event-driven Matlab model. The proposed ADPLL uses standard cell library elements only and is implemented in an IHP 0.25 µm BiCMOS process. The overall power dissipation is less than 50 mW (@ 800 MHz) with a 2.5 V power supply. Due to its VHDL description, the design can be ported to other processes in a short development time.

Keywords: All-Digital Phase-Locked Loop, ADPLL, Clock Generator, Event driven Matlab Model, LVDS, PID Controller.

1 Introduction

For clocking digital synchronous integrated circuits, Phase-Locked Loops (PLLs) are most widely used for frequency synthesis. In [1], R.E. Best introduced linear PLL (LPLL), digital PLL (DPLL), all-digital PLL (ADPLL) and software PLL (SPLL) architectures. The fundamental structure of an ADPLL contains four components, which are shown in Figure 1. A Phase Frequency Detector (PFD) compares the phase of a reference clock with the phase of a divided clock and sends control signals to a Control Unit. This unit evaluates the generated control signals and provides the oscillator with a signal to control the frequency. The purpose of the frequency divider is to divide the generated clock by a programmable constant. In lock-in mode, the output of the frequency divider has the same phase and frequency as the reference clock. Traditionally, as published by F. Herzel et al. in [3], the Control Unit and the oscillator are implemented as analog IP blocks, which are sensitive to process variations. These components have to be redesigned for each new manufacturing process. Due to noise coupling and power supply noise effects, ADPLLs have gained attention in recent years, since they reduce integration problems in a noisy digital environment.

When designing an ADPLL, two problems have to be considered carefully. The first one is how to design a Digitally Controlled Oscillator (DCO) with a wide frequency range and a high resolution. A selectable inverter chain, as published by S. Moorthi et al. in [5], provides a wide operating range. In order to achieve a high resolution, one can use a ring oscillator with parallel-connected tri-state inverters, as presented by T. Olsson et al. in [8]. Using bus-keeper components, as published by D. Sheng et al. in [10], is an alternative approach. The second problem is how to accelerate the frequency convergence and phase convergence of the ADPLL. Simple digital clock generators are proposed in [7] and by P. Nilsson et al. in [6]. A time-to-digital converter (TDC), as published by D. Sheng et al. in [10], can be used to solve this problem. An alternative search step algorithm that allows control depending on the convergence mode is proposed by Ching-Che Chung and Chen-Yi Lee in [2]. Similarly, recursive filters are powerful, as published in [8] and presented by J. Zhuang et al. in [11].

In this paper, an implementation of an ADPLL with selectable control algorithms is presented. With the design of the proposed DCO, an operating range of 1050 MHz can be achieved. Depending on the chosen algorithm, a deterministic jitter of less than 25 ps is obtainable. All components are described in VHDL and use standard cell library elements only. The structure and the behavior of each component are described and illustrated. The simulation results are compared to earlier published designs.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 218–227, 2011. © Springer-Verlag Berlin Heidelberg 2011

Fig. 1. Abstract block structure of the proposed ADPLL core

2 Digitally Controlled Oscillator

In contrast to other published Digitally Controlled Oscillators (DCOs) that use one approach only, our proposed DCO merges different design approaches and needs few resources to achieve a wide frequency range with a high resolution. This component has three different tuning stages. The first one, the coarse tuning stage, is a slight modification of the selectable inverter chain proposed by S. Moorthi et al. in [5]. To achieve a wider frequency range, the standard cells used need to have a short gate delay. Therefore, the chain consists of nine serially connected multiplexer structures. They are of different drive types and can be initialized by a NAND gate, as shown in Figure 2. The different timing paths are selectable via a 9-bit one-hot-encoded input signal MUX[8:0] connected to the multiplexers' select inputs. Depending on the value of the select signal, either the shorter left timing path or the longer right timing path of the multiplexer is used. The transition time of each structure is about 320 ps. These time differences would cause a huge deterministic jitter if the DCO oscillated between two timing paths. As published by D. Sheng et al. in [10], bus-keeper components are used to solve this problem. These components are designed using inverters and tri-state inverters of different drive types. They are connected in parallel to the last net in the feedback only. If one or more bits of the 5-bit control word BusK[4:0] are set to logic 1, two effects influence the selected timing path. First, the output of the enabled tri-state inverter increases the load of the feedback net, so the last multiplexer structure needs more time to change its logical value. This effect influences the 4 other bus-keepers if the LSB of BusK for the leftmost tri-state inverter is set. Second, the steepness of each rising or falling edge is reduced due to the transition delay of the enabled bus-keeper component. Consequently, a finer resolution of less than 40 ps is achieved, whereby a smaller deterministic jitter is possible.

Fig. 2. Schematic of the DCO (signals: Start, Tri[8:0], BusK[4:0], MUX[8:0], clk_dco; multiplexer stages MUX0 to MUX8)

To eliminate the remaining timing gap, a third tuning stage is added. Nine parallel-connected tri-state inverters are bound to the initial NAND gate, similar to the solution proposed by T. Olsson et al. in [8]. If one or more tri-state inverters are enabled by the 9-bit control signal Tri[8:0], additional current drive is added to the multiplexer structures. The slight speedup results in a change of the propagation delay down to 1 ps. A CDL testbench of the DCO was simulated with Spectre MDL for each valid control word to verify the timing behavior. The measurement results were parsed by scripts to generate a digital VHDL behavior model automatically. The combination of the three approaches described above results in an operating range from 250 MHz to 1.3 GHz with an average resolution step of less than 5 ps. For this performance, our proposed DCO requires only 46 logic gates (see Table 2). For easier control, all valid control codes in the frequency range from 250 MHz to 1.3 GHz are sorted linearly in a look-up table with a maximum timing difference of less than 5 ps, if possible. Nevertheless, the largest timing difference between two neighboring control codes is about 25 ps.

3 Clock Divider

The clock divider divides the clock signal generated by the DCO by a programmable constant factor. The resulting clock frequency is compared with the reference clock to derive a control signal indicating whether to increase or decrease the clock frequency generated by the DCO. As proposed by Shenggao Li et al. in [4] and F. Herzel et al. in [3], the generated clock of the DCO is divided by a dual-modulus prescaler. This component divides the DCO clock by 4 or by 5 (clk_45), depending on the logic value of the selector ctrl, which is controlled by a 2-bit Swallow Counter S. The output clock of the prescaler is divided by a 7-bit Main Counter M, which starts at its programmed initial value and counts down to zero. When the value reaches zero, the signal start_s of the Swallow Counter is set to logic 1. Consequently, the Swallow Counter starts to count down from its initial value to zero. When zero is reached, ctrl is set to logic 0. The values of M and S are programmable over an SPI-like interface. With this structure, the number of adjustable frequencies is increased. Due to the maximum frequency of the 4/5 clock divider, an additional prescaler can be enabled to divide the generated oscillator clock by 2. For this case, the feedback clock is given by Equation 1.

$$ clk_{fb} = \frac{clk_{dco}}{(4 \cdot M + S) \cdot 2} \qquad (1) $$
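Equation 1 can be illustrated with a short calculation. The sketch below is not the authors' implementation; the function names are hypothetical, and the behavior with the extra divide-by-2 prescaler disabled (division by 4·M + S only) is an assumption. The counter widths (7-bit M, 2-bit S) and the 5 MHz reference clock are taken from the text.

```python
def feedback_clock_hz(clk_dco_hz: float, m: int, s: int, extra_div2: bool = True) -> float:
    """Feedback clock per Equation 1: clk_dco / ((4*M + S) * 2).

    With extra_div2=False the factor 2 is dropped (assumption, not stated
    explicitly in the paper).
    """
    if not 0 <= m < 2 ** 7:
        raise ValueError("M is a 7-bit counter value")
    if not 0 <= s < 2 ** 2:
        raise ValueError("S is a 2-bit counter value")
    ratio = 4 * m + s
    if ratio == 0:
        raise ValueError("division ratio must be non-zero")
    if extra_div2:
        ratio *= 2
    return clk_dco_hz / ratio


def divider_settings(clk_dco_hz: float, clk_ref_hz: float, extra_div2: bool = True):
    """Pick (M, S) so that the feedback clock equals the reference clock."""
    total = round(clk_dco_hz / clk_ref_hz)
    if extra_div2:
        if total % 2:
            raise ValueError("ratio must be even when the /2 prescaler is enabled")
        total //= 2
    m, s = divmod(total, 4)
    if m >= 2 ** 7:
        raise ValueError("ratio too large for a 7-bit main counter")
    return m, s


# 800 MHz DCO clock locked to the 5 MHz reference: total ratio 160 = (4*20 + 0) * 2.
m, s = divider_settings(800e6, 5e6)
print(m, s, feedback_clock_hz(800e6, m, s))
```

Because S has only two bits, every integer ratio 4·M + S is reachable by letting M carry the quotient and S the remainder of a division by 4.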

4 Phase Frequency Detector

As published by Ching-Che Chung et al. in [2], the PFD is composed of a self-resetting flip-flop structure. The phases of the 5 MHz reference clock and of the divided feedback clock are compared. If the feedback clock lags the reference clock, the flag_d output signal is set to logic 1. Conversely, if the feedback clock leads the reference clock, the output signal flag_u is set to logic 1. The simulated dead zone of the PFD is less than 210 ps, which is limited by the cell delays of the flip-flops used in the self-resetting structure. If the phase error between both clocks is smaller than the dead zone, each flag remains at logic 1. To ensure that no pulse width violation occurs, a digital pulse amplifier is implemented, which enlarges the pulse width of the reset signals.

5 Control Unit

In general, analog PLLs use a simple loop filter to regulate the control voltage of the Voltage Controlled Oscillator (VCO), which mostly results in a longer lock-in time. A shorter lock-in time can only be achieved using a higher-order filter structure, which is difficult to implement in an analog design flow. A digital control unit has much more potential to control the PLL behavior automatically. It calculates the binary code word (w) that influences the timing paths of the DCO. The controlled value is an index into a sorted array that holds the binary code words for the inputs of the DCO. Our proposed ADPLL has four different control algorithms, initially selectable using the SPI-like interface. For applications that do not need a fast lock-in time, a linear search algorithm and a binary search algorithm are implemented, which guarantee frequency acquisition only. The width of the phase error (pwidth) is not used for the calculation of the next index; in these algorithms only the sign of the phase error is interpreted. Due to the time constant of the ADPLL and the simple control algorithm, a larger jitter is produced. Therefore, a counter cntx is implemented that allows a frequency change, by changing the control word, only every x reference cycles. This simple solution greatly decreases the jitter while increasing the lock-in time. Both the linear and the non-linear algorithm require few logic resources but do not support phase locking. In addition to these algorithms, a recursive filter is selectable to control the DCO, which allows a shorter lock-in time.

5.1 Linear – Non-linear Controller

If the linear algorithm is selected, the DCO is initialized with the mean value of the number (#f) of valid frequency code words w. Depending on flag_u and flag_d of the PFD, the next higher or lower code word is chosen every x cycles. The worst case is a lock on the lowest or the highest frequency of the look-up table. Accordingly, more than x · #f reference cycles are necessary to reach the desired frequency. For a shorter lock-in time, a binary search algorithm can be used instead of the linear counter. The number of cycles is reduced to x · log2(#f). When the desired frequency is found, the control unit oscillates between two neighboring code words.
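The difference between the two search strategies can be illustrated with a small behavioral model. This is only a sketch under simplifying assumptions: the look-up table is modeled as a sorted list of frequencies in MHz with a uniform 5 MHz spacing (hypothetical values), the PFD flags are replaced by a direct comparison with the target frequency, and each index update costs x reference cycles.

```python
def linear_search(lut, target, x=1):
    """Step one look-up-table index per update, starting at the middle index."""
    idx = len(lut) // 2
    cycles = 0
    last_dir = 0
    while True:
        # +1: frequency too low, -1: frequency too high, 0: exact match
        direction = (target > lut[idx]) - (target < lut[idx])
        if direction == 0 or direction == -last_dir:
            return idx, cycles            # locked, or toggling between neighbors
        nxt = idx + direction
        if not 0 <= nxt < len(lut):
            return idx, cycles            # end of the table reached
        idx, last_dir = nxt, direction
        cycles += x


def binary_search(lut, target, x=1):
    """Halve the index interval on every update."""
    lo, hi = 0, len(lut) - 1
    cycles = 0
    while lo < hi:
        mid = (lo + hi) // 2
        cycles += x
        if lut[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo, cycles


lut = list(range(250, 1301, 5))           # hypothetical table, 211 entries
print(linear_search(lut, 1300, x=4))      # worst-case direction from the middle
print(binary_search(lut, 1300, x=4))
```

In this model the linear search needs one update per table entry between the start index and the target, while the binary search needs about log2(#f) updates regardless of the target.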

5.2 Recursive Filter

In order to achieve a shorter lock-in time, a recursive filter can be used as proposed by T. Olsson et al. in [8]. Alternatively, the use of time-to-digital converter (TDC) components gives a better performance, as published by D. Sheng et al. in [10]. In contrast to the two algorithms mentioned above, two additional PID control algorithms are developed to achieve a faster lock-in time. For this purpose, a local ring oscillator is implemented to clock two simple counters (cnt1, cnt2). These counters increase their values while the pulse width indication signal pwidth of the PFD (see Figure 1) is at logic 1. The sum of the two counter values represents the measured phase error (pe). The output of the ordinary PID controller, the new control index w(n) of clock cycle n, is the sum of a proportional term (K_P), an integral term (K_I) and a derivative term (K_D). The proportional term is the product of pe and a constant factor P, referred to as the proportional gain. The integration of the phase error is done by summing it up over time. This gives the accumulated offset that should have been corrected previously, which is multiplied by the constant integral gain I. The derivative term is the product of the constant derivative gain D and the difference between the actual phase error and the last measured phase error. The resulting sum of the three terms K_P, K_I and K_D represents the next valid index for the look-up table that contains the corresponding numerical control word.

$$ w(n) = \underbrace{P \cdot pe_n}_{K_P} + \underbrace{I \cdot \sum_{k=0}^{n} pe_k}_{K_I} + \underbrace{D \cdot \left(pe_{n-1} - pe_n\right)}_{K_D} \qquad (2) $$
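The PID index computation of Equation 2 can be sketched as follows. The class name, the integer gains in the example and the initial condition pe(−1) = 0 are assumptions made for illustration; the order of the two phase errors in the derivative term follows Equation 2 as printed.

```python
class PidIndexController:
    """Discrete PID computation of the look-up-table index w(n)."""

    def __init__(self, p: float, i: float, d: float):
        self.p, self.i, self.d = p, i, d
        self.accumulated = 0     # running sum of all phase errors
        self.previous = 0        # pe(n-1); assumed to be 0 before the first sample

    def update(self, cnt1: int, cnt2: int) -> float:
        pe = cnt1 + cnt2                      # measured phase error
        self.accumulated += pe
        k_p = self.p * pe                     # proportional term
        k_i = self.i * self.accumulated       # integral term
        k_d = self.d * (self.previous - pe)   # derivative term
        self.previous = pe
        return k_p + k_i + k_d


ctrl = PidIndexController(p=2, i=1, d=1)      # hypothetical gains
print(ctrl.update(1, 2))                      # first sample, pe = 3
print(ctrl.update(1, 0))                      # second sample, pe = 1
```

In hardware the result would additionally have to be limited to the valid index range of the look-up table; that step is omitted here.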

Figure 3 shows the result of a simulation of an event driven Matlab model as proposed by J. Zhuang et al. in [11]. The PLL with this PID Controller locks at the desired frequency of 800 MHz after 85 reference cycles. The DCO oscillates between 4 neighboring frequencies in lock-in mode. The calculated average deterministic jitter after lock-in is about 4.1 ps.

Fig. 3. Histogram and frequency variation of the general PID Controller (left panel: Frequency Histogram, Count versus f [GHz]; right panel: ADPLL Frequencies, f [GHz] versus Reference Periods)

5.3 Smoothing Recursive Filter

To achieve a better performance, the calculation in Equation 2 is slightly modified: the new control index w(n∗) is the difference between the last stored control word w(n − 1) and the currently calculated code word w(n), as given in the following equation.

$$ w(n^{*}) = w(n-1) - w(n) \qquad (3) $$

As illustrated in Figure 4, the modified PID Controller requires fewer reference cycles and has a smoother approximation. In addition, the smoothed PID Controller oscillates only between two frequencies. The average deterministic jitter after lock-in is about 1.6 ps.

Fig. 4. Histogram and frequency variation of the smoothing PID Controller (left panel: Frequency Histogram, Count versus f [GHz]; right panel: ADPLL Frequencies, f [GHz] versus Reference Periods)

Table 1 shows that, after logic synthesis, the estimated power consumption of the PID algorithms is about 64 times greater than that of the linear/binary search version. The reasons for this are the necessary local ring oscillator and additional registers for the calculation. Nevertheless, depending on the desired frequency and the chosen filter parameters P, I and D, lock-in after less than 55 reference cycles is possible.

Table 1. Properties of the implemented control algorithms

Algorithm         Area [mm²]   Power [µW]   Lock Time [cycles]
(non-)linear      0.024        0.5          500 to more than 1000
(smoothed) PID    0.108        32.25        < 70

6 Chip Layout

Figure 5 illustrates the complete prototype chip including IO pads. The layout was done as a block design. The core area of the ADPLL without IO pads is about 0.7 mm². The stand-alone DCO is placed in the upper right corner with a fully separated supply voltage. Additionally, two standard cells (converters), CMOS to ECL and ECL to CMOS, are placed on the right side of the core area. These are necessary to convert the 2.5 V CMOS voltage levels of the oscillator output to standard LVDS levels, and to use an external fast LVDS clock (dco_ext_p, dco_ext_n) as an input clock for the clock divider for separate testability. The differential LVDS IO pads are placed to the left of the two converters. Using these pads, clock rates up to 1.3 GHz are supported at the output, and the clock divider can be supplied with an external frequency of up to 1.7 GHz for separate testability. The LVDS converters have their own supply voltage to be fully decoupled from the CMOS components.

Fig. 5. Prototype chip, 1.6 mm × 1.6 mm: (a) layout view of the ADPLL, with the blocks PFD, SPI, LVDS interface, DCO, DIV and Control Unit; (b) fabricated ADPLL chip

7 Simulation and Experimental Results

The proposed ADPLL is designed using standard library cells only. It was synthesized with a bottom-up synthesis flow. Figure 6 shows a digital worst-case SDF timing simulation. The PLL locks at the desired frequency of 560 MHz after 54 reference cycles. In this case, the smoothed recursive filter was chosen to stabilize the system. For this kind of simulation, the generated VHDL behavior model was used. The deterministic jitter after lock-in is about 5 ps. In comparison to other ADPLL designs shown in Table 2, our proposed implementation has a frequency range of 1050 MHz with a lock-in performance of less than 70 reference cycles. Due to the four implemented control algorithms, the additional local ring oscillator for the recursive filters and the LVDS interface, the power dissipation is larger than that of other implementations using a similar kind of process. Nevertheless, we achieved a fine resolution of less than 25 ps. Furthermore, the maximum lock-in time of less than 70 reference periods for our implementation, compared to 46 cycles for the proposal in [2], is caused by the limited number of possible control codes in our look-up table. In contrast to the published ADPLLs in [2] and [8], our proposed DCO structure has a wider operating range and requires fewer logic resources.

Fig. 6. Digital worst case simulation of the ADPLL

Table 2. Properties Comparison

Performance Parameter   Proposed ADPLL          [2]                    [8]            [9]
Process                 0.25 µm BiCMOS          0.35 µm CMOS           0.35 µm CMOS   0.18 µm CMOS
Core Area               0.83 mm²                0.71 mm²               0.07 mm²       0.0025 mm²
DCO Gates               46                      > 100                  128            –
Power Dissipation       < 50 mW (@ 800 MHz)     100 mW (@ 500 MHz)     –              6.4 mW
Min Freq                250 MHz                 45 MHz                 170 MHz        0.1 MHz
Max Freq                1.3 GHz                 510 MHz                360 MHz        282 MHz
Lock-in Time            < 70 cycles             < 46 cycles            ∼ 60 cycles    < 5 cycles
Resolution              < 25 ps                 < 5 ps                 < 55 ps        –

8 Conclusion

In this paper, an ADPLL with high resolution and wide frequency range is proposed. All components are written in VHDL and use standard cell library elements. An event-driven Matlab model of the ADPLL was developed for simulation at design level. The prototype of the ADPLL was fabricated in an IHP 0.25 µm BiCMOS process. In contrast to other published DCO implementations, our design combines three different approaches, whereby a resolution of less than 1 ps and, simultaneously, a frequency range of 1050 MHz are possible. The implemented recursive filters allow lock-in after less than 70 cycles. For a smoother approximation that results in lower output jitter, a new, slightly modified PID Controller is introduced. Furthermore, an LVDS clock is provided. The developed ADPLL is suitable for high-performance applications requiring a wide frequency range with very low clock jitter.

References

1. Best, R.: Phase-Locked Loops: Design, Simulation, and Applications. McGraw-Hill, New York (February 2000)
2. Chung, C.C., Lee, C.Y.: An all-digital phase-locked loop for high-speed clock generation. IEEE Journal of Solid-State Circuits 38(2), 679–682 (2003)
3. Herzel, F., Osmany, S.A., Hu, K., Schmalz, K., Jagdhold, U., Scheytt, J.C., Schrape, O., Winkler, W., Follmann, R., Kohl, D.K.T., Kersten, O., Podrebersek, T., Heyer, H.V., Winkler, F.: An integrated 8-12 GHz fractional-N frequency synthesizer in SiGe BiCMOS for satellite communications. In: Analog Integrated Circuits and Signal Processing (January 2010)
4. Li, S., Ismail, M.: A 7 GHz 1.5-V dual-modulus prescaler in 0.18 µm copper-CMOS technology. Analog Integrated Circuits and Signal Processing 32, 89–95 (2002)
5. Moorthi, S., Meganathan, D., Janarthanan, D., Kumar, P.P., Perinbam, J.R.P.: Low jitter ADPLL based clock generator for high speed SoC applications. In: Proceedings of World Academy of Science, Engineering and Technology, vol. 32 (August 2008)
6. Nilsson, P., Torkelson, M.: A monolithic digital clock-generator for on-chip clocking of custom DSP's. IEEE Journal of Solid-State Circuits 31(5), 700–706 (1996)
7. Olsson, T., Nilsson, P., Meincke, T., Hemam, A., Torkelson, M.: A digitally controlled low-power clock multiplier for globally asynchronous locally synchronous designs. In: IEEE International Symposium on Circuits and Systems, vol. 3, pp. 13–16 (2000)
8. Olsson, T., Nilsson, P.: A Digital PLL made from Standard Cells. In: European Conference on Circuit Theory and Design, ECCTD 2001 (August 2001)
9. Reddy, B.S.P., Krishnaparsad, N., Moorthi, S., Perinbam, J.R.P.: An All Digital Phase Locked Loop for Ultra Fast Locking. In: Proceedings of National Conference on Engineering Trends in Engineering and Technology (2008)
10. Sheng, D., Chung, C.C., Lee, C.Y.: A Fast-Lock-In ADPLL with High-Resolution and Low-Power DCO for SoC Applications. In: IEEE Asia Pacific Conference on Circuits and Systems (2006)
11. Zhuang, J., Du, Q., Kwasniewski, T.: Event-driven modeling and simulation of an digital PLL. In: Proceedings of the IEEE International Behavioral Modeling and Simulation Workshop (2006)

Clock Network Synthesis with Concurrent Gate Insertion

Jingwei Lu, Wing-Kai Chow, and Chiu-Wing Sham

The Hong Kong Polytechnic University [email protected], [email protected], [email protected]

Abstract. In VLSI digital circuits, the clock network plays an important role in the total performance of the chip. Clock skew and power dissipation are two major concerns in clock network synthesis. During topology generation, the locations of buffer and gate insertion are usually not available. Despite local optimization, the global performance is therefore limited. In this paper, a novel approach to topology generation with concurrent gate insertion is proposed. In addition, a strict clock slew constraint is applied with comprehensive buffer insertion techniques. By clock gating, the switched capacitance of the clock tree is reduced, at an acceptable extra cost in the controller tree. Experimental results show that our approach performs well in reducing both clock skew and power dissipation.

1 Introduction

Clock signals are employed in VLSI digital systems to synchronize the active components of a design. Clock skew minimization has been a popular research topic during the past decades. Some early works [1,2] mainly concentrated on evenly distributing wirelength between the source and each terminal to equalize delay. Afterwards, delay balancing [3] using the Elmore delay model [4] became prevalent, as it provides more accurate timing delay information. The deferred-merging and embedding (DME) technique was proposed in [5]; it can achieve zero clock skew with minimal wirelength. For topology generation, algorithms were proposed for unbuffered and ungated clock trees in [6], and for buffered but ungated clock trees in [7]. In the ISPD 2009 clock network synthesis contest [8], a voltage-variation-related objective named Clock Latency Range (CLR) was formulated. Subsequent research work was proposed accordingly [9].

Twenty to fifty percent of the power usage is contributed by the clock network [10]. For power reduction, clock gating is an effective approach in sequential circuits. The principal idea is to turn off idle modules and tree sections in order to cut down unnecessary switching power. Clock gating can be applied at the logic level [11], the register-transfer level [12] and the architecture level [13]. Nevertheless, besides logical information, the physical location of the modules should also be taken into account to avoid wirelength overhead and the resulting waste of power. Some works address both logical and physical concerns. The algorithm in [14] constructs a clock tree topology that takes advantage of the activity patterns of modules. Moreover, activity similarity was considered in [15]. Besides, a gating method for microprocessor design was proposed in [16]. The algorithm constructed the topology in a bottom-up procedure, with the objective of switched capacitance minimization. Further on, in [17] a comprehensive technique with a recursive computation of effective switched capacitance and a solution sampling on the merging segment set was discussed.

In this paper, we propose a novel synthesizer that constructs a binary clock tree in a bottom-up manner. Simultaneous optimization of the clock skew and the power dissipation is applied. The topology generator is responsible for a buffered and gated clock tree, and the clock gates are inserted concurrently. The major advantage of our work is that the downstream masking information of subtrees is taken into account during each merging step. An algorithm named dual-MST [9] for topology generation is used in our work, and its cost function is improved for power awareness. Besides, we enforce a stricter slew constraint along the whole clock network, which further constrains the buffer and gate locations. The experimental results show that our method can greatly reduce the power consumption of the clock network with proper gate insertion. Meanwhile, the clock skew and PVT variation can still be maintained within an acceptable range.

The rest of the paper is organized as follows. Some preliminaries on tree construction and capacitance are discussed in section 2. The details of our approach are discussed in section 3, where the technique of power-aware topology generation with concurrent buffer and gate insertion is presented in detail. Experimental results are shown in section 4. Finally, we reach our conclusion in section 5.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 228–237, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Preliminaries

2.1 Clock Tree and Controller Tree

Let $T = \{V, E\}$ denote the clock tree. $V = \{v_i \mid i = 1, 2, \ldots, m_v\}$ is the set of nodes, and $E = \{e_j \mid j = 1, 2, \ldots, m_v - 1\}$ is the set of clock edges between the node $v_j$ and its corresponding parent. Let $|e_j|$ denote the length of the edge $e_j$. Apparently, for the root node there will be no edge assigned. Let $G = \{g_i \mid i = 1, 2, \ldots, m_v - 1\}$ denote the set of gates. The gate $g_j$ is assigned to be on the edge $e_j$, masking the node $v_j$ directly. We use $S = \{v_k \mid k = 1, 2, \ldots, m_s\}$ (where $m_s$

$$P(A_i) = \frac{AT_{no}(A_i)}{Len(A_i)}, \qquad P_{tr}(A_i) = \frac{TR_{no}(A_i)}{2 \times (Len(A_i) - 1)} \tag{1}$$

where ATno(Ai) is the number of active times (1s) in Ai, and TRno(Ai) is the number of transitions (1→0 or 0→1) in Ai. Len(Ai) denotes the stream length of Ai.
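As an illustration of Eq. (1), the following minimal sketch computes both probabilities for a bit stream. The function names and the sample pattern are hypothetical and are not taken from the authors' implementation.

```python
def activity_probability(pattern):
    """P(A): number of active cycles (1s) divided by the stream length."""
    return sum(pattern) / len(pattern)


def transition_probability(pattern):
    """P_tr(A): number of 0->1 or 1->0 transitions divided by 2 * (Len(A) - 1)."""
    transitions = sum(1 for prev, cur in zip(pattern, pattern[1:]) if prev != cur)
    return transitions / (2 * (len(pattern) - 1))


# Hypothetical 8-cycle activity stream (1 = active, 0 = idle).
pattern = [1, 1, 0, 0, 1, 0, 0, 0]
p = activity_probability(pattern)        # 3 active cycles out of 8
p_tr = transition_probability(pattern)   # 3 transitions over 2 * 7
```

For this pattern, P is 3/8 and Ptr is 3/14.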

[Figure: clock tree T with nodes v1–v7, edges e1–e7 and gates g1–g6; the enable signals EN1–EN6 are driven by the control logic through the controller tree CtrT.]

Fig. 1. A gated clock binary tree

2.2 Switched Capacitance

The power consumed by CMOS circuits consists of two components: static and dynamic power. The static power is mostly determined by the feature size and other technology parameters. Therefore, in this paper we only consider dynamic power minimization. The dynamic power is defined as

$$P = \frac{1}{2} \alpha C f V_{dd}^2$$

where C is the total load capacitance of the circuit, f is the frequency of the clock signal, Vdd is the supply voltage, and α is the number of switching events in each clock cycle. For the clock tree α = 2, because there is one rising and one falling edge in each clock period, whereas α = 1 in the controller tree. Since f and Vdd are constant parameters in digital circuits, we can use the switched capacitance as a measure of the power usage. Assume a subtree $T_i$ rooted at $v_i$ with a gate $g_i$ inserted, and let $T_i^{ctr}$ denote the corresponding controller tree. The unmasked load capacitances of $T_i$ and $T_i^{ctr}$ are $C^u_{v_i}$ and $C^u_{T_i^{ctr}} = C_{EN_i} + C_g$ respectively, where $C_g$ denotes the input capacitance of a gate. The downstream switched capacitance of $v_i$ is then $SC_{v_i} = C^u_{v_i} P(A_i)$. Similarly, the switched capacitance of the controller tree $T_i^{ctr}$ is $SC_{T_i^{ctr}} = (C_{EN_i} + C_g) P_{tr}(A_i)$.

[Figure: node vi with children va and vb; the activity pattern Ai at vi, with idle and active periods marked, is obtained from the child patterns Aa and Ab.]

Fig. 2. An example of activity pattern transmission

The power consumption of a clock network is directly proportional to the average switched capacitance per clock cycle. The total switched capacitance is contributed by a gated and buffered clock tree $T$ and a controller tree $T^{ctr}$. In order to reduce the switching activity, modules and clock tree sections can be disabled by clock gates during their inactive clock periods. From the above example, we can see that the original capacitance of node $v_i$ is $C^u_{v_i}$. With gate $g_i$ inserted at $v_i$, the resultant switched capacitance is $SC_{v_i} + SC_{T_i^{ctr}}$. If $C^u_{v_i}$

3 Methodology

We build our clock tree with the dual-MST construction method [9], which yields a clock tree that is close to fully symmetric. In this paper, the method is improved with a new cost function that takes both distance and power saving into account. As a result, the corresponding topology can achieve both low power usage and small clock skew. A recursive buffer/clock gate insertion method is developed for bottom-up merging. Blockage handling is also included, because buffers and gates cannot be placed inside blockage regions. The Elmore model [4] is applied for clock delay computation. The DME technique [18] is applied for wirelength minimization: a segment is used instead of a point to represent the set of merging locations, and deferred embedding is applied to reduce total wirelength.
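To make the Elmore delay computation concrete, here is a minimal sketch for one buffered stage, using the wire parameters and the size-1 buffer attributes reported in Section 4. The single pi-section wire model and the function name `stage_delay_ps` are assumptions made for illustration; the paper does not spell out its exact RC model.

```python
# Wire parameters quoted in Section 4 (per nanometre of wire).
R_WIRE_PER_NM = 0.0003    # ohm / nm
C_WIRE_PER_NM = 0.00016   # fF / nm


def stage_delay_ps(r_drv, d_int, length_nm, c_load):
    """Elmore delay (ps) of a driver, a uniform wire and a lumped load.

    Assumed model: the driver is its internal delay d_int plus an output
    resistance r_drv; the wire is represented by a single pi-section.
    """
    r_wire = R_WIRE_PER_NM * length_nm
    c_wire = C_WIRE_PER_NM * length_nm
    # ohm * fF gives femtoseconds
    rc_fs = r_drv * (c_wire + c_load) + r_wire * (c_wire / 2 + c_load)
    return d_int + rc_fs * 1e-3  # 1 fs = 1e-3 ps


# Size-1 buffer from Table 1 (Rb = 66.9 ohm, db = 4.92 ps) driving 100 um of
# wire into another size-1 buffer input (Cb = 35 fF).
delay = stage_delay_ps(r_drv=66.9, d_int=4.92, length_nm=100_000, c_load=35.0)
```

With these numbers the wire contributes 30 Ω and 16 fF, and the stage delay comes out at roughly 9.6 ps.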

3.1 Power Aware Topology Generation

In order to save power, nodes with more similar activity patterns should have a higher priority to be matched. Let va and vb be a pair of nodes, as shown in figure 2. If the corresponding activity patterns Aa and Ab are similar, the resulting activity Ai will have a shorter active period, and the power cost will be smaller. Besides the activity patterns, an estimate of the merging cost Pwr(va, vb) is also required. This can be determined in multiple ways. For instance, we can actually merge the two nodes to obtain the exact connection information. However, this requires exact buffer insertion and wire balancing, which takes longer. Instead, we develop a new method for estimating the potential switched capacitance. The Manhattan distance between the nodes va and vb is denoted by D(va, vb). The Elmore delay difference of these two nodes is denoted by DLY(va, vb). The delay and power consumption per unit wirelength are denoted by ρD and ρP respectively, and are computed in advance as a simulation reference. If DLY(va, vb) is smaller than ρD × D(va, vb), the two nodes can be merged without wire snaking, and the power cost is computed as

$$Pwr(v_a, v_b) = \rho_P \times D(v_a, v_b) \times P(A_i) \tag{2}$$

Otherwise, snaking will be included, as shown in the following equation

$$Pwr(v_a, v_b) = \rho_P \times \frac{DLY(v_a, v_b)}{\rho_D} \times P(A_i) \tag{3}$$

An improved power-aware dual-MST geometric matching technique is developed for topology construction; a specific definition of one iteration of geometric matching can be found in [2]. The detailed description is given in Procedure 1. It is a weighted perfect matching approach. Given a set of nodes V = {v1, v2, ..., vm}, we first construct a complete graph G = {V, E}. Let |V| and |E| denote the number of nodes and edges in the graph G, so |V| = m. Since G is a complete graph, every pair of nodes vi, vj is connected by an edge ei,j, so E = {e1,2, e1,3, ..., em−1,m} and |E| = m(m−1)/2. The cost of matching two nodes vi and vj is denoted as fc(ei,j). Let M denote the matching result of G. M is composed of a group of edges and is a subset of E. The maximal pairing cost of M, that is, the largest fc(ei,j) among the edges in M, is denoted as Cmax. We get close to a symmetric clock tree by reducing Cmax in each level. The merging cost fc(va, vb) is given below, where α and β are the weights of the Manhattan distance and the estimated power cost, respectively.

$$f_c(v_a, v_b) = \alpha \times D(v_a, v_b) + \beta \times Pwr(v_a, v_b) \tag{4}$$

By means of this weighted cost function, node pairs with more similar switching activity and a shorter distance have a higher priority to be matched. Our approach to topology generation is based on concurrent gate insertion, so the downstream information of the two merging nodes is accurate.
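The cost estimation of Eqs. (2) to (4) can be sketched as follows. This is only an illustration: the function names are hypothetical, and the numeric values of ρD, ρP, the distance and the delay difference are made up, in arbitrary but consistent units.

```python
def estimated_power_cost(dist, delay_diff, p_active, rho_d, rho_p):
    """Pwr(va, vb): Eq. (2) without snaking, Eq. (3) when snaking is needed."""
    if delay_diff < rho_d * dist:
        # The delay difference can be absorbed along the direct connection.
        return rho_p * dist * p_active
    # Snaking: the wire must be long enough to cover the delay difference.
    return rho_p * (delay_diff / rho_d) * p_active


def merging_cost(dist, delay_diff, p_active, rho_d, rho_p, alpha, beta):
    """fc(va, vb) of Eq. (4): weighted sum of distance and estimated power."""
    pwr = estimated_power_cost(dist, delay_diff, p_active, rho_d, rho_p)
    return alpha * dist + beta * pwr


# Hypothetical pair of nodes: P(Ai) = 0.5, rho_D = 0.5, rho_P = 0.25.
no_snaking = merging_cost(100, 20, 0.5, rho_d=0.5, rho_p=0.25, alpha=2, beta=1)
snaking = merging_cost(100, 80, 0.5, rho_d=0.5, rho_p=0.25, alpha=2, beta=1)
```

Setting β = 0 reduces the cost to the pure distance term, which corresponds to the (α = 1, β = 0) configuration reported in the experiments.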

Procedure 1. Partition(G)
Require: G = {V, E} is a complete graph, E is sorted in ascending order of fc(ei,j).
  if |V| ≤ 1 then
    return;
  else if |V| = 2 then
    merge(v1, v2);
    return;
  else
    Build the dual-MST with |V| − 2 edges inserted.
    Two subgraphs G′ = {V′, E′} and G″ = {V″, E″} are generated.
    Two minimum spanning trees st′ and st″ for V′ and V″ are generated.
    if |V′| is odd and |V″| is odd then
      em,n = arg min over ei,j of {fc(ei,j) | ∀vi ∈ V′, ∀vj ∈ V″};
      merge(vm, vn);
      remove vm from V′;
      remove vn from V″;
      remove em,x from E′, ∀x ∈ V′;
      remove en,y from E″, ∀y ∈ V″;
    end if
    partition(G′);
    partition(G″);
    return;
  end if

3.2 Concurrent Gate and Buffer Insertion

A recursive buffer and gate insertion technique is developed to serve three objectives: (1) slew rate constraint, (2) clock skew minimization, and (3) power usage reduction. Buffers are used to supply driving power and thereby restrict the signal transition time, and clock gate insertion can reduce the switched capacitance by disabling idle sections. Real-time simulation of the signal slew rate takes much more time and is impractical. Hence we build look-up tables in advance for slew reference, which estimate the driving ability under diverse circumstances. We model the buffer and gate with the corresponding attributes for Elmore delay computation. Some previous works [19] already proposed constructing a buffered clock tree with zero clock skew. In our work, we apply a similar approach to both buffer and clock gate insertion. The input/output capacitance and resistance of the buffers and clock gates are obtained first, so that the delay of wires, buffers and clock gates can be computed based on the Elmore RC model.

In our work, we try to keep the number of buffer and gate levels on every source-to-sink clock path exactly the same. During the bottom-up binary merging, we first examine the gate levels of the two downstream subtrees. If they differ by two or more, a penalty cost is applied, and such a matching result will probably be discarded due to the huge cost. Buffer levels are balanced accordingly. By means of this level balancing, the clock skew is reduced significantly, and the negative effect caused by signal variation is reduced.

Here we describe our gate insertion technique for a given matching result.
We first define three different kinds of gate insertion: virtual gate insertion at the upstream level, temporal gate insertion at the current level, and no gate insertion. Temporal insertion is controlled by the balancing of gate levels, and is further divided into two kinds of single gate insertion and one kind of back-to-back double gate insertion. A gate is assumed to be inserted as close as possible to the internal merging node in order to minimize switched capacitance. Since the DME technique is applied in our work, we assume the middle point of the merging segment to be the gate location. The three gate insertion assumptions are compared on the basis of their resulting switched capacitances, SCvir, SCtmp and SCnon, respectively. If the switched capacitance of the virtual insertion or of no insertion is the smallest, inserting no gate at the current level results in less switched capacitance than temporal gate insertion, and we therefore discard any insertion of gates at the current level. Otherwise, temporal gate insertion is expected to give the lowest switched capacitance, and we accept the insertion of gates. An example is shown in figure 2. The activity Ai equals Aa ∪ Ab. The edges connecting the two nodes to the merging node are denoted as ea and eb, and Cea and Ceb are their corresponding capacitances. The three resulting switched capacitances are computed as follows:

$$SC_{vir}(v_a, v_b) = (C^u_a + C_{e_a} + C^u_b + C_{e_b}) \times P(A_i) + C^u_{T_i^{ctr}} \times P_{tr}(A_i) \tag{5}$$

$$SC_{tmp}(v_a, v_b) = (C^u_a + C_{e_a}) \times P(A_a) + C^u_{T_a^{ctr}} \times P_{tr}(A_a) + C^u_b + C_{e_b} \tag{6}$$

$$SC_{non}(v_a, v_b) = C^u_a + C_{e_a} + C^u_b + C_{e_b} \tag{7}$$

Notice that here we only describe the equation of SCtmp for a single gate insertion at node va. The other two equations can be derived in a similar way.
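The comparison between the three options can be sketched as below, following Eqs. (5) to (7) as we read them, with SCtmp written for a single gate at va. All names and capacitance values are hypothetical, and the two controller capacitances are passed in directly instead of being derived from C_EN and C_g.

```python
def p_active(a):
    """P(A) of Eq. (1)."""
    return sum(a) / len(a)


def p_trans(a):
    """P_tr(A) of Eq. (1)."""
    return sum(1 for x, y in zip(a, a[1:]) if x != y) / (2 * (len(a) - 1))


def switched_caps(c_a, c_ea, act_a, c_b, c_eb, act_b, c_ctr_i, c_ctr_a):
    """Return (SC_vir, SC_tmp, SC_non) for merging va and vb."""
    act_i = [x | y for x, y in zip(act_a, act_b)]  # A_i = A_a U A_b
    sc_vir = (c_a + c_ea + c_b + c_eb) * p_active(act_i) + c_ctr_i * p_trans(act_i)
    sc_tmp = (c_a + c_ea) * p_active(act_a) + c_ctr_a * p_trans(act_a) + c_b + c_eb
    sc_non = c_a + c_ea + c_b + c_eb
    return sc_vir, sc_tmp, sc_non


def insert_gate_now(sc_vir, sc_tmp, sc_non):
    """Accept the temporal gate only if it gives the smallest switched capacitance."""
    return sc_tmp < min(sc_vir, sc_non)


# Hypothetical case: va is mostly idle, vb is always active (capacitances in fF).
act_a = [1, 0, 0, 0, 0, 0, 0, 0]
act_b = [1, 1, 1, 1, 1, 1, 1, 1]
sc_vir, sc_tmp, sc_non = switched_caps(100, 20, act_a, 100, 20, act_b, 40, 40)
decision = insert_gate_now(sc_vir, sc_tmp, sc_non)
```

In this example the merged activity Ai is always active, so an upstream gate saves nothing, while a gate at va masks the mostly idle subtree and is accepted.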

4 Experimental Results

In this section, our experimental results are presented. We implement our clock network synthesizer in the C programming language. The binary is executed on a Linux machine with a 2.4 GHz Intel Core2 Quad CPU and 4 GB of memory. The benchmark circuits used in the experiments were released for the ISPD 2009 CNS contest [8]. The detailed information of the benchmark circuits is shown in table 2. In our experiments, one type of wire and one type of buffer are used in our clock tree synthesizer. The unit resistance of the wire is 0.0003 Ω/nm, and the unit capacitance of the wire is 0.00016 fF/nm. The specific configuration of the buffer in different sizes is shown in table 1. In our synthesizer, the maximum buffer size is set to 6, hence we list the attributes of buffer sizes up to 6. Notice that the corresponding attributes of a gate are listed in the last row of table 1. This table is generated from our SPICE simulation statistics. Cb is the input capacitance, Rb the driver resistance and db the internal delay of a buffer. During the evaluation, the power supply is set to Vdd = 1.0 V. The PTM models applied in our simulation are of 45 nanometer scale.

Table 1. Buffer configuration

buffer size   Cb (fF)   Rb (Ω)   db (ps)
1             35        66.9     4.92
2             70        40.5     5.63
3             105       31.3     6.13
4             140       26.4     6.52
5             175       25.0     6.95
6             210       20.7     7.20
gate          35        52.45    17.03

Table 2. Circuit information of the benchmarks from ISPD 2009

Circuit      Chip size (mm x mm)   No. of sinks   No. of blocks (area %)   Limit CAP (fF)
ispd09f11    11.0 x 11.0           121            0 (0%)                   118000
ispd09f12    8.1 x 12.6            117            0 (0%)                   110000
ispd09f21    12.6 x 11.7           117            0 (0%)                   125000
ispd09f22    11.7 x 4.9            91             0 (0%)                   80000
ispd09f31    17.1 x 17.1           273            88 (24.38%)              250000
ispd09f32    17.0 x 17.0           190            99 (34.26%)              190000
ispd09fnb1   2.6 x 2.1             330            53 (37.69%)              42000
ispd09f33    15.3 x 15.3           209            80 (27.68%)              195000
ispd09f34    16.0 x 16.0           157            99 (38.67%)              160000
ispd09f35    15.3 x 15.3           193            96 (33.22%)              185000
ispd09fnb2   6.4 x 4.4             440            1346 (63.88%)            88000
avg.         12.1 x 11.6           203            169 (23.62%)             140273

A summary of the performance of our clock tree after insertion of clock gates is shown in table 3. We run our program with different values of α and β for topology tuning. The clock skew (SKEW), total capacitance (TC), optimal capacitance (OSC), switched capacitance (SC) and CPU time are listed. The respective units are picoseconds (ps) for SKEW, seconds for CPU and femtofarads (fF) for capacitance. TC denotes the original total capacitance cost of the clock tree without gate insertion. OSC denotes the resulting capacitance after disabling all the idle periods at each node. SC denotes the resulting switched capacitance of our gated clock tree. It can be seen that SC is mostly smaller than TC, which means an effective power reduction in our gated clock tree construction. The nominal skew of each clock tree is zero. Additionally, we use NGSPICE for further evaluation to obtain an accurate skew estimate, as listed in the table. The activity patterns of all the sinks are generated according to the instruction and RTL description used in [16]. The length of the activity pattern is 10000 for every benchmark.

Previous works were proposed with a loose constraint on slew or driving power supply, for instance, ≤ 20 × Cg for a buffer or gate insertion in [17,16]. The work in [14] did not involve clock routing and synthesis. In our program, however, the transition time (slew rate) is kept under 100 ps throughout the whole clock network, so more buffers are inserted to follow this rule. As a matter of fact, it is very difficult for us to include a direct comparison with previous works in this paper. It can be speculated that the power cost of our work should be larger than that of the previous ones, but the signal transition time is more consistent, hence the work is more practical in use.
Generally, in our work the switched capacitance can be reduced by around 10% with the insertion of clock gates. Meanwhile, the clock skew is only about 20 ps on average. The average runtime of our program is less than 3 seconds, which represents good efficiency.

Table 3. Clock skew and switched capacitance with gate insertion

             Our approach (α = 1, β = 0)              Our approach (α = 2, β = 1)
Circuit      SKEW   TC       OSC      SC       CPU    SKEW   TC       OSC      SC       CPU
ispd09f11    20.0   103973   61868    78939    0.37   16.7   103851   61422    78261    0.37
ispd09f12    17.2   104874   65539    78970    0.34   16.6   103998   65090    79603    0.35
ispd09f21    20.0   118028   68813    89140    0.35   25.7   108116   67586    81043    0.35
ispd09f22    15.6   69810    43786    53173    0.32   8.5    69552    43938    53597    0.32
ispd09f31    33.7   221639   136596   179336   3.83   19.3   220522   128744   174024   5.60
ispd09f32    33.4   175122   101850   138156   0.51   21.7   162525   103658   123151   0.50
ispd09f33    20.6   171747   107773   139467   5.44   18.8   155995   100329   128386   6.30
ispd09f34    22.2   144688   92341    118570   0.49   20.3   139518   88924    109183   0.46
ispd09f35    16.9   165546   104232   134708   8.11   21.6   163376   102231   128963   8.13
ispd09fnb1   18.6   32635    23452    32635    0.70   29.6   34370    24869    34370    0.63
ispd09fnb2   19.7   67041    46550    66280    2.40   27.5   70478    50113    69788    1.90
avg.         21.6   125009   77527    100852   2.08   20.6   121118   76082    96397    2.26

5 Conclusion

In conclusion, power saving and clock skew are two major concerns in clock network synthesis. A power-aware topology generation method with concurrent buffer/gate insertion is proposed in this paper, in order to optimize the clock skew and the power dissipation of a clock distribution network simultaneously. Experimental results show that our method can greatly reduce the switched capacitance, and hence the power consumption, of the clock network with proper clock gate insertion. Meanwhile, the clock skew can still be maintained within an acceptable range.

Acknowledgement

The work described in this article was partially supported by the RGC Direct Allocation Fund from The Hong Kong Polytechnic University (Project No. A-PC0W).

References

1. Jackson, M.A.B., Srinivasan, A., Kuh, E.S.: Clock Routing for High-Performance ICs. In: Proceedings of IEEE/ACM Design Automation Conference, pp. 573–579 (June 1990)
2. Kahng, A., Cong, J., Robins, G.: High-Performance Clock Routing Based on Recursive Geometric Matching. In: Proceedings of IEEE/ACM Design Automation Conference, pp. 322–327 (June 1991)
3. Tsay, R.S.: Exact Zero Skew. In: Proceedings of IEEE/ACM International Conference on Computer Aided Design, pp. 336–339 (November 1991)
4. Elmore, W.C.: The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers. Journal of Applied Physics 19(1), 55–63 (1948)
5. Boese, K.D., Kahng, A.B.: Zero-Skew Clock Routing Trees With Minimum Wirelength. In: Proceedings of the 5th Annual IEEE International ASIC Conference and Exhibit, pp. 17–21 (1992)
6. Edahiro, M.: A Clustering-Based Optimization Algorithm in Zero-Skew Routings. In: Proceedings of IEEE/ACM Design Automation Conference, pp. 612–616 (June 1993)
7. Chaturvedi, R., Hu, J.: Buffered Clock Tree for High Quality IC Design. In: Proceedings of the International Symposium on Quality Electronic Design, pp. 381–386 (2004)
8. Sze, C.N., Restle, P., Nam, G.-J., Alpert, C.: ISPD 2009 Clock Network Synthesis Contest. In: Proceedings of ACM International Symposium on Physical Design, pp. 149–150 (March 2009)
9. Lu, J., Chow, W.K., Sham, C.W., Young, E.F.Y.: A Dual-MST Approach for Clock Network Synthesis. In: Proceedings of Asia and South Pacific Design Automation Conference, pp. 467–473 (January 2010)
10. Kitahara, T., Minami, F., Ueda, T., Usami, K., Nishio, S., Murakata, M., Mitsuhashi, T.: A Clock-Gating Method for Low-Power LSI Design. In: Proceedings of Asia and South Pacific Design Automation Conference, pp. 307–312 (January 1998)
11. Chang, C.M., Huang, S.H., Ho, Y.K., Lin, J.Z., Wang, H.P., Lu, Y.S.: Type-Matching Clock Tree for Zero Skew Clock Gating. In: Proceedings of IEEE/ACM Design Automation Conference, pp. 714–719 (June 2008)
12. Donno, M., Ivaldi, A., Benini, L., Macii, E.: Clock-Tree Power Optimization based on RTL Clock-Gating. In: Proceedings of IEEE/ACM Design Automation Conference, pp. 622–627 (June 2003)
13. Luo, Y., Yu, J., Yang, J., Bhuyan, L.: Low Power Network Processor Design Using Clock Gating. In: Proceedings of IEEE/ACM Design Automation Conference, pp. 712–715 (June 2005)
14. Farrahi, A.H., Chen, C., Srivastava, A., Tellez, G., Sarrafzadeh, M.: Activity-Driven Clock Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 20(6), 705–714 (2001)
15. Chen, C., Kang, C., Sarrafzadeh, M.: Activity-Sensitive Clock Tree Construction for Low Power. In: International Symposium on Low Power Electronics and Design, pp. 279–282 (2002)
16. Oh, J., Pedram, M.: Gated Clock Routing for Low-Power Microprocessor Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 20(6), 715–722 (2001)
17. Chao, W.C., Mak, W.K.: Low-Power Gated and Buffered Clock Network Construction. ACM Transactions on Design Automation of Electronic Systems 13(1) (January 2008)
18. Chao, T.H., Hsu, Y.C., Ho, J.M.: Zero Skew Clock Net Routing. In: Proceedings of IEEE/ACM Design Automation Conference, pp. 518–523 (July 1992)
19. Chen, Y.P., Wong, D.F.: An Algorithm for Zero-Skew Clock Tree Routing with Buffer Insertion. In: European Design and Test Conference, pp. 230–236 (March 1996)

Modeling Time Domain Magnetic Emissions of ICs

Victor Lomné1, Philippe Maurine1, Lionel Torres1, Thomas Ordas1,2, Mathieu Lisart2, and Jérome Toublanc3

1 LIRMM, UMR 5506 University of Montpellier 2 / CNRS, 161 rue Ada, 34095 Montpellier, France, {firstname.lastname}@lirmm.fr
2 STMicroelectronics, 190 Avenue Celestin Coq, 13106 Rousset, France, {firstname.lastname}@st.com
3 Apache Design Solutions, 300 route des Cretes, 06902 Sophia-Antipolis, France, {firstname}@apache-da.com

Abstract. ElectroMagnetic (EM) radiations of Integrated Circuits (IC) have for many years been a main problem from an ElectroMagnetic Compatibility (EMC) point of view. But with the increasing use of secure embedded systems, and the emergence of new attacks based on the exploitation of physical leakages of such secure ICs, they are now also a critical problem for secure IC designers. Indeed, EM radiations of an IC, and more precisely the magnetic component, can be exploited to retrieve sensitive data such as the secret key of cryptographic algorithms. Within this context, this paper introduces a magnetic field simulation flow that predicts, with high spatial and time resolutions, the magnetic radiations of IC cores. Such a flow is mandatory to predict the robustness of secure ICs against EM attacks before fabrication.

1 Introduction

With the ever increasing speed and power consumption of ICs, EM interferences of chips are becoming a more and more challenging issue from an EMC point of view. To prevent these problems, designers need to simulate these EM radiations during the IC design flow. Different simulation methods and tools have been developed to ensure that EM radiations emitted by the different parts of an electronic system do not interfere with the others. Most of these tools model the circuit, and more precisely its pads, its internal Power/Ground network and its digital macro-blocks, using passive RLC elements and current sources.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, pp. 238–249, 2011. © Springer-Verlag Berlin Heidelberg 2011

If these models (IBIS [1], ICEM [2] or IMIC [3]) and the related tools have been shown to be efficient at predicting the EM radiations of a circuit as a whole (leads, bonding, package and IC core), they are too coarse-grained to address problems that are specific to the design of secure circuits.

Other tools, like CST studio [4], can compute a complete 3D EM simulation of any electronic device with very high spatial and time resolutions. But this kind of tool needs to solve Maxwell's equations at a lot of positions in the device, and the CPU time necessary to model complex ICs made of several hundred thousand gates is not reasonable for a designer.

From a hardware security point of view, with the ever increasing use of embedded systems to manage sensitive data, a new kind of threat appeared at the end of the 20th century. These threats are called Side-Channel Attacks (SCA), and exploit physical leakages like power consumption or EM radiations emanated by the IC while it computes a cryptographic operation. Among these threats, the major ones are the Simple ElectroMagnetic Analysis (SEMA) and the Differential ElectroMagnetic Analysis (DEMA) [5].

The SEMA consists in analysing a single EM trace of a cryptographic operation, measured with the Surface Scan method [6] using a small magnetic probe made of a coiled wire with a diameter varying between 50 μm and 500 μm. The measured trace is the evolution of the magnetic field radiated by the IC versus time. When applying a SEMA at different positions above the IC, it is thus possible to compute static and dynamic (time domain) EM cartographies [7]. Furthermore, advanced techniques based on signal processing have been proposed to localize the crypto module [8] [9] [10].

The DEMA exploits several EM traces corresponding to several cryptographic operations using the same key. It consists in a statistical processing of these traces in order to guess the key. More precisely, it exploits variations in the amplitude of EM traces, which are correlated to the processed data.

Note that, although these methods are called ElectroMagnetic Analyses, it is usually the magnetic field which is measured with the Surface Scan method [6] and a magnetic probe.

Considering these threats, the basic design guidelines to increase the robustness of an IC are:

– to reduce as far as possible the EM radiations of the cryptographic modules, or
– to hide them within the EM radiations of other blocks, or finally
– to design the circuit so as to obtain unintelligible EM radiations.

However, adopting these basic guidelines requires the development of a flow that can predict at the design step, with high time and spatial resolutions, the magnetic field generated by a circuit in the close vicinity of its surface. Within this context, the main contribution of this paper is the proposal of an industrial flow that predicts the time domain evolution of magnetic radiations with high accuracy and with high spatial and time resolutions.

The rest of this paper is organized as follows. Section 2 provides an overview of the simulation flow and then details its main features. Section 3 gives an experimental validation of our flow applied to two complex ICs. Finally, conclusions are drawn in Section 4.

2 Magnetic Field Simulation Flow

Due to the ever increasing demand for performance, industrial integrated products have moved from simple ICs to complex integrated systems, known as Systems-on-Chip (SoC), which consume a significant amount of power. To distribute this power efficiently to the basic elements of an SoC, more or less complex Power/Ground (P/G) networks are designed according to specific design guidelines addressing different signal integrity problems such as IR drops. As a result, the current consumed by a circuit typically flows from the top metal layers, characterized by a lower resistivity, down to the logic gates regularly and hierarchically in order to minimize static and dynamic IR drops.

2.1 Basic Concept

Consequently, the P/G networks of complex SoCs, especially the part routed on top-level metal wires, constitute the main sources of magnetic emissions, as experimentally observed in [7], since high-amplitude currents (several mA) flow within them. On the contrary, interconnect wires, which are much more resistive and controlled by simple logic gates, are weaker sources of magnetic emissions. From these considerations, supported by experimental results, it appears that modeling the magnetic radiations of a complex SoC amounts mainly to modeling the magnetic radiations of its P/G network. Since our magnetic field simulation flow aims at being as general as possible, the backbone of our modeling approach, which is represented in figure 1, follows these steps:

– cut the P/G network into small pieces of metal, considered as small electrical dipoles, and simulate the current within each of these pieces of wire
– compute the magnetic field generated by each dipole, at several positions on a plane parallel to the IC surface (like a grid), according to the Biot-Savart law
– obtain the magnetic emissions of the IC at each coordinate of this plane by summing the contributions of all these dipoles
– compute what can be seen by measurement, i.e. take into account the main characteristics of the measurement setup assumed to be used.

Although this approach is simple, it requires computing the time domain evolution of the current flowing in each dipole with a high time resolution.

Fig. 1. Overview of the proposed magnetic field simulation flow

2.2 Current Extraction Step

While it is quite standard to compute the current consumed by a small digital block, using for example a SPICE-like tool, it is much more difficult to compute the current flowing along the entire P/G network of a complex SoC made of many digital and analogue blocks and memories. Thus, to get this current, we use RedHawk (from the Apache tools suite) [11], an efficient IR drop tool that computes, with a high time resolution, the voltage evolution along the entire P/G network. RedHawk allows designers to verify that their P/G network does not suffer any significant static or dynamic voltage drops before launching production. Another key advantage of this tool is its ability to simulate, with a reduced CPU time and a high accuracy (see section 3), SoCs integrating many different elements such as digital blocks, co-processors, memories and analogue blocks. More precisely, another tool from the Apache tools suite, called Totem [11], characterizes the current evolution of analogue blocks and memories for use within RedHawk.

Once this characterization step is done, the simulation can be launched according to a scenario that specifies to the tool kernel which blocks are involved. This simulation provides different results, such as static and dynamic maps disclosing the IR drops along the P/G network. A map identifying the areas that have suffered the most important IR drops during a scenario is given in figure 2. In that case, it corresponds to a memory decoder power rail (red part in figure 2).

Among all its features, RedHawk offers a key advantage for the modeling of magnetic emissions. Indeed, by positioning virtual probes (a specific instance of this tool), it allows extracting the time domain evolution of the voltage anywhere along the P/G network, and hence computing the evolution of the current flowing in any piece of the P/G network considered as an electrical dipole in our magnetic field simulation flow. More precisely, for the magnetic field simulation of a given IC, the first step (figure 1) is to place virtual probes regularly (every X μm) along the power and ground rails. The placement policy is to place a virtual probe:

– every X μm along a unidirectional wire
– at each intersection of vertical or horizontal wires
– at each intersection of a wire and a via

in order to guarantee that two successive virtual probes are connected by a single, unidirectional wire. This point is important since it makes it easy to compute the current flowing between two probes, knowing the resistivity of the considered metal layer and the voltage at both wire ends. Once the currents flowing in all the dipoles have been computed, the results are stored in a file gathering, for each dipole, the sampled current waveform (the sampling rate fixes the time domain resolution and the simulation speed) and the coordinates of the dipole.
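The current computation between two successive probes can be sketched as a direct application of Ohm's law. The sheet-resistance wire model, the function names and all numeric values below are assumptions for illustration; they do not describe RedHawk's probe output format or any real metal stack.

```python
def wire_resistance(sheet_resistance, length_um, width_um):
    """Resistance (ohm) of a straight wire from an assumed sheet resistance (ohm/sq)."""
    return sheet_resistance * length_um / width_um


def segment_currents(v_a, v_b, resistance):
    """Sampled current (A) flowing from probe A to probe B: I(t) = (VA(t) - VB(t)) / R."""
    return [(va - vb) / resistance for va, vb in zip(v_a, v_b)]


# Hypothetical 10 um x 2 um piece of top-metal rail with 0.02 ohm/sq.
r_seg = wire_resistance(0.02, length_um=10.0, width_um=2.0)

# Hypothetical voltage samples (V) at the two probes, on the same time grid.
v_probe_a = [1.0000, 0.9995, 0.9990]
v_probe_b = [0.9999, 0.9990, 0.9989]
i_ab = segment_currents(v_probe_a, v_probe_b, r_seg)
```

Because the probe placement policy ensures that each pair of successive probes is joined by a single unidirectional wire, one resistance value per pair is enough in this sketch.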

Fig. 2. A static IR drop map obtained with RedHawk

2.3 Magnetic Field Calculation Step

The second step of our flow is based on the classical rules of EM wave theory [12]. As mentioned above, each piece of the P/G network in which a current flows radiates an EM field according to Maxwell's equations. In our case, considering the distance between the magnetic sensor and the IC surface, and the typical frequency bandwidth scanned by a magnetic measurement setup operating in time domain (from 1 MHz to 1 GHz), we may adopt the quasi-stationary regime approximation. This allows using the Biot-Savart law (1) for faster calculations rather than more complex expressions deduced from Maxwell's equations.

To get an idea of what can be seen on the scope at a point m of a plane parallel to the IC surface, we first compute the magnetic field at this point. More precisely, knowing the current $I_{AB}(t)$ that flows in each piece of the P/G network, represented by a finite wire of length AB, its contribution $\vec{B_i}(t)$ to the magnetic field $\vec{B}(t)$ at the position m is first evaluated according to the Biot-Savart law (1), where $\mu$ is the permeability of the considered space, and $r$ the distance between the wire AB and the point m. Then, the final value $\vec{B}(t)$ of the magnetic field at the position m is computed as the vector sum of the magnetic fields radiated by the N pieces of the P/G network (2).

$$\vec{B_i}(t) = \frac{\mu \, I_{AB}(t)}{4\pi} \, \frac{\overrightarrow{AB} \times \vec{r}}{r^3} \qquad (1)$$

$$\vec{B}(t) = \sum_{i=1}^{N} \vec{B_i}(t) \qquad (2)$$

Next, we compute the magnetic flux $\phi_B(t)$ through the coiled magnetic sensor, whose diameter defines a surface S (3). This is done assuming that the surface S is parallel to the IC surface. This assumption is important since it allows computing the magnetic flux by computing the magnetic field at several points inside the surface S and summing the results.

$$\phi_B(t) = \iint_S \vec{B}(t) \cdot d\vec{S} \qquad (3)$$

Finally, we compute the electromotive force emf(t), measured at the pins of the coiled sensor, by differentiating the magnetic flux with respect to time (4).

$$\mathrm{emf}(t) = -\frac{d\phi_B(t)}{dt} \qquad (4)$$
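As an illustration of equations (1) to (4), the following Python sketch evaluates the field of each dipole, sums the contributions, approximates the flux through a loop parallel to the IC surface, and differentiates it numerically. Several points are assumptions made for the sketch only: the IC surface is taken as the xy-plane (so only the z component of the field crosses the loop), the vector r is taken from the middle of each wire to the point m, the permeability is that of vacuum, the flux is approximated by the mean field over the sample points times the loop area, and the time derivative is a forward difference. All function names are hypothetical.

```python
import math

MU_0 = 4e-7 * math.pi  # vacuum permeability (H/m), assumed for the medium


def cross(u, v):
    """Cross product of two 3D vectors given as tuples."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dipole_field(a, b, current, m, mu=MU_0):
    """Eq. (1): field (T) at point m of a short wire from a to b (meters)."""
    ab = tuple(bj - aj for aj, bj in zip(a, b))
    mid = tuple((aj + bj) / 2.0 for aj, bj in zip(a, b))
    r = tuple(mj - cj for cj, mj in zip(mid, m))  # from wire middle to m
    dist = math.sqrt(sum(c * c for c in r))
    k = mu * current / (4.0 * math.pi * dist ** 3)
    return tuple(k * c for c in cross(ab, r))


def total_field(dipoles, currents, m):
    """Eq. (2): vector sum of the contributions of all dipoles at point m."""
    bx = by = bz = 0.0
    for (a, b), i_ab in zip(dipoles, currents):
        fx, fy, fz = dipole_field(a, b, i_ab, m)
        bx, by, bz = bx + fx, by + fy, bz + fz
    return (bx, by, bz)


def loop_flux(dipoles, currents, sample_points, loop_area):
    """Eq. (3), approximated: mean Bz over points inside the loop times its area."""
    bz = [total_field(dipoles, currents, p)[2] for p in sample_points]
    return loop_area * sum(bz) / len(bz)


def emf_from_flux(flux_samples, dt):
    """Eq. (4): emf = -d(phi)/dt, using a forward difference with time step dt."""
    return [-(f1 - f0) / dt for f0, f1 in zip(flux_samples, flux_samples[1:])]
```

In use, `loop_flux` would be called once per time sample with the currents of all dipoles at that instant, and the resulting list of flux values passed to `emf_from_flux`. Treating each wire as a point-like element at its middle is only reasonable when the wire is short compared with its distance to the sensor.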

2.4 Additional Mandatory Steps

While the calculation of the magnetic field at all points of a plane parallel to the IC surface is quite standard, it is not sufficient to get an accurate idea of what can be seen by measurement. Indeed, to obtain by simulation a more accurate representation of the results provided by a near-field scan of the IC, the characteristics of the setup assumed to be used for the measurements have to be considered. In our simulation flow, three main characteristics of the setup are considered, including the probe size (the probe being assumed to be a small loop) and the overall bandwidth of the acquisition chain. More precisely, to increase the accuracy of the results obtained:

– we take into account the change in direction of a wave due to the refraction caused by the passivation layer. Thus, at a given position m of the sensor, the magnetic field radiated by a piece of the P/G network far from the sensor is not taken into account in the resulting magnetic field measured by the sensor. This characteristic is only estimated, because it is hard to determine the distance between the passivation layer and the magnetic sensor with a precision < 5 μm.
– we filter (band-pass filter) the time domain evolution of the computed emf according to the acquisition chain bandwidth. More precisely, knowing the frequency bandwidth of the sensor, the low-noise amplifier and the oscilloscope, we can estimate the frequency bandwidth of the acquisition chain.
– we take into account the gain (in decibels) of the low-noise amplifier.
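The last two corrections (band-pass filtering and amplifier gain) can be sketched in Python as follows. This is only an illustration under stated assumptions: the acquisition chain is modeled as a first-order high-pass followed by a first-order low-pass filter, the amplifier gain is treated as a voltage gain (factor 10^(dB/20)), and the function names are hypothetical. The paper does not specify the filter actually used.

```python
import math


def db_to_linear(gain_db):
    """Convert a voltage gain in decibels to a linear factor."""
    return 10.0 ** (gain_db / 20.0)


def high_pass(samples, dt, cutoff_hz):
    """First-order RC high-pass filter; the output is assumed to start at rest."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + dt)
    out = []
    prev_x = samples[0] if samples else 0.0
    prev_y = 0.0
    for x in samples:
        prev_y = alpha * (prev_y + x - prev_x)
        prev_x = x
        out.append(prev_y)
    return out


def low_pass(samples, dt, cutoff_hz):
    """First-order RC low-pass filter, initialized on the first sample."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out = []
    y = samples[0] if samples else 0.0
    for x in samples:
        y = y + alpha * (x - y)
        out.append(y)
    return out


def emulate_acquisition_chain(emf, dt, f_low_hz, f_high_hz, gain_db):
    """Band-limit the simulated emf, then apply the amplifier gain."""
    band_limited = low_pass(high_pass(emf, dt, f_low_hz), dt, f_high_hz)
    gain = db_to_linear(gain_db)
    return [gain * v for v in band_limited]
```

For such a discrete filter to behave sensibly, the time step `dt` has to be much smaller than the period of the upper cutoff frequency.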

3 Validation

To validate the proposed magnetic field simulation flow, static and dynamic cartographies of the magnetic field generated by two circuits have been obtained using:

– a near-field scan setup operating in time domain, composed of a motorized X-Y stage with a minimal displacement step of 1 μm, a magnetic sensor made of a coiled metal wire with a diameter of 50 μm, a low-noise amplifier with a gain of 63 dB, an oscilloscope and a computer controlling the whole setup (figure 3).
– our magnetic field simulation flow, using the characteristics of the near-field scan setup, as described in section 2.

The two considered ICs are microcontrollers designed in a 130 nm CMOS technology. They integrate different macro-blocks such as ROM, RAM, EEPROM, CPU and small analogue blocks.

One processing scenario, previously stored in the RAM memory, is executed on each circuit. It consists in reading data in RAM and passing them to the CPU. During the execution of this scenario (several clock cycles), we measured the magnetic field radiated by the chips, using a 50 μm sensor and a 25 μm displacement step. The scenario was repeated 100 times for each position of the scanned surface in order to increase the signal to noise ratio.

Fig. 3. Near-field scan setup used for experimental validation

Figures 4 and 5 show the cartographies (revealing the peak to peak amplitude of the magnetic field) obtained respectively using the aforementioned near-field scan setup and the proposed magnetic field simulation flow. Note that data acquisition with the near-field scan setup takes 3 hours while simulation runs in 5 hours. Note also that these simulations have been launched to obtain an emf value every 25 μm. During these simulations, the probe diameter and the frequency bandwidth were fixed respectively to 50 μm and 1 GHz, according to the characteristics of our near-field scan setup. The simulation time step was chosen according to the sampling rate of our scope. The distance separating the sensor from the IC surface was estimated to be roughly 30 μm, using a small micro camera with a ×100 zoom.

[Figure 4 panels: measured peak to peak map of the magnetic field (IC1); simulated peak to peak map of the magnetic field (IC1); dimension labels 1.9 mm and 1.7 mm]

Fig. 4. Measured and simulated maps disclosing the peak to peak amplitude of the magnetic field in the close vicinity of IC1 surface

[Figure 5 panels: measured peak to peak map of the magnetic field (IC2); simulated peak to peak map of the magnetic field (IC2); dimension labels 0.8 mm and 2.4 mm]

Fig. 5. Measured and simulated maps disclosing the peak to peak amplitude of the magnetic field in the close vicinity of IC2 surface

As shown, considering IC1, the agreement between simulations and measurements is satisfactory even if some discrepancies still exist. These discrepancies may be due to several factors. Among them, some may be due to the modeling of the sensor. Indeed, it is assumed that:

– the sensor is perfectly horizontal
– the sensor has a perfect circular shape
– the distance between the sensor and the IC is perfectly known

This latter point is critical. It is extremely difficult in practice, even with a micro camera, to measure the distance separating the sensor from the IC with a high accuracy (< 5 μm) due to the package shape. Note also that a fabricated chip does not necessarily have typical characteristics, due to process variations.

Concerning IC2, one observes a significant difference (around the rectangles in figure 5) between the measured and the calculated maps. However, this difference was expected since the marked positions are above the clock generator, which was not considered during the simulation (our database related to this design being incomplete).

While these maps demonstrate the interest of the proposed magnetic field simulation flow for comparing, before fabrication, the efficiency of different P/G network routing strategies in terms of emissions, they do not provide any information on the accuracy of the simulator with respect to time. To fill this gap, figure 6 gives the measured and simulated time domain evolutions of the magnetic field at a position marked by dots in figure 5. As shown, the waveforms are quite similar (without application of any filtering solution to model the bandwidth of the near-field scan setup), demonstrating the interest of the magnetic simulation tool.

Fig. 6. Measured (continuous line) and simulated (dashed line) time domain waveforms of the electromotive force above a supply rail of the IC2 RAM (axes: emf (100mV) versus time (ns))

Fig. 7. Measured (continuous line) and simulated (dashed line) time domain waveforms (with filtering) of the electromotive force above a supply rail of the IC2 RAM (axes: emf (100mV) versus time (ns))

Figure 7 shows the same results as those represented in figure 6, except that the simulated average emf trace has been filtered according to the frequency bandwidth of the acquisition chain. The comparison of Fig. 6 and 7 demonstrates the interest of considering the impact of the acquisition chain.

4 Conclusion

In this paper, we have introduced an industrial flow allowing simulating the time domain evolutions of the magnetic emissions of an IC in the close vicinity of its surface. The main ideas on which this flow is based are:

– the use of a dynamic IR drop simulator, RedHawk, that quickly provides the current flowing in all parts of the Power/Ground network
– the use of the Biot-Savart law for fast calculations
– the modeling of the magnetic sensor and the consideration of the near-field scan setup bandwidth
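The post-processing behind the maps and traces of this section (averaging the repeated acquisitions at each position, then taking the peak to peak amplitude) can be sketched in Python as below. The data layout, a dictionary mapping each scan position to its list of repeated traces, and the function names are assumptions made for illustration; the paper does not describe the actual processing scripts.

```python
def average_traces(traces):
    """Sample-wise mean of repeated acquisitions of the same scenario."""
    count = len(traces)
    return [sum(column) / count for column in zip(*traces)]


def peak_to_peak(trace):
    """Peak to peak amplitude of a single time domain trace."""
    return max(trace) - min(trace)


def peak_to_peak_map(traces_by_position):
    """Map each (x, y) scan position to the peak to peak amplitude
    of its averaged trace."""
    return {position: peak_to_peak(average_traces(traces))
            for position, traces in traces_by_position.items()}
```

Averaging before taking the peak to peak value matters: the extrema of a single noisy trace are dominated by noise, whereas the averaged trace reflects the repeatable part of the emission.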

This flow has been validated by comparing the predicted emissions of two ICs designed in a 130 nm technology with measured emissions. This comparison has demonstrated the efficiency of the proposed flow even if there is room for further improvements.

References

1. Technical Specification IEC 62014-1 (2001)
2. Technical Specification IEC 62014-3 (2002)
3. Technical Specification IEC 62404 (2007)
4. CST Studio suite, http://www.cst.com
5. Gandolfi, K., Mourtel, C., Olivier, F.: Electromagnetic Analysis, Concrete Results. In: Koç, Ç.K., Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 251–261. Springer, Heidelberg (2001)
6. Technical Specification IEC 61967-3
7. Ordas, T., Lisart, M., Sicard, E., Maurine, P., Torres, L.: Near-Field Mapping System to Scan in Time Domain the Magnetic Emissions of Integrated Circuits. In: Svensson, L., Monteiro, J. (eds.) PATMOS 2008. LNCS, vol. 5349, pp. 229–236. Springer, Heidelberg (2009)
8. Sauvage, L., Guilley, S., Mathieu, Y.: Electromagnetic Radiations of FPGAs, High Spatial Resolution Cartography and Attack on a Cryptographic Module. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 2(1) (2009)
9. Real, D., Valette, F., Drissi, M.: Enhancing correlation electromagnetic attack using planar near-field cartography. In: International Conference on Design, Automation and Test in Europe (DATE), pp. 628–633 (2009)
10. Dehbaoui, A., Lomne, V., Maurine, P., Torres, L., Robert, M.: Enhancing Electromagnetic Attacks using Spectral Coherence based Cartography. In: International Conference on Very Large Scale Integration, VLSI-SoC (2009)
11. Apache Design Solutions, http://www.apache-da.com
12. Ben Dhia, S., Ramdani, M., Sicard, E.: Electromagnetic Compatibility of Integrated Circuits: Techniques for Low Emissions and Susceptibility. Springer Science, Heidelberg (2006)

Power Profiling of Embedded Analog/Mixed-Signal Systems

Jan Haase and Christoph Grimm

TU Vienna, Austria [email protected], [email protected]

Abstract. In order to optimize power consumption, it is important to know where and why power is consumed in a specific system. Power estimation gives a more or less accurate answer to the first question (where?). Knowing where power is consumed allows designers to optimize these specific components. However, the second question (why?), concerning the reason for power consumption, is more difficult to answer: activities that cause power consumption (e.g. addressing/routing in a WSN) are not located in a single component, but use a variety of components. However, knowing the cost of activities would pave the way to more holistic power optimization. The presentation will introduce methods for "power profiling" that assist the analysis of power consumption, assigning power consumption to both components and activities.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, p. 250, 2011. © Springer-Verlag Berlin Heidelberg 2011

Open-People: Open Power and Energy Optimization PLatform and Estimator

Daniel Chillet

ENSSAT/IRISA/CAIRN, France [email protected]

Abstract. The presentation will explain the objectives of the ANR Open-People project and will focus on energy estimation based on high level modelling. This project aims at developing a hardware platform for consumption measurement of complex SoCs. This platform will be accessible via the internet for industrial and academic users and will provide a library of power consumption models for several hardware boards. We are currently working on the description of the power models of components. These models are described in a high level language and make it possible to estimate and optimize the energy. The platform uses SystemC to ensure functional verification and validation in order to provide accurate estimations. The consumption models developed in the Open-People project can be defined at different levels of abstraction, and the SystemC simulation can use these different levels in order to facilitate the exploration step during the system design. In this presentation, we will show how the SystemC models can be used to extract the power consumption of a complex system.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, p. 251, 2011. © Springer-Verlag Berlin Heidelberg 2011

Early Power Estimation in Heterogeneous Designs Using SoCLib and SystemC-AMS

François Pêcheux, Khouloud Zine El Abidine, and Alain Greiner

UPMC/LIP6/SOC, France [email protected], [email protected], [email protected]

Abstract. The presentation will describe a use case that consists in the modeling and simulation of a genuine heterogeneous system composed of individually powered Wireless Sensor Network nodes. The models are written in SoCLib and SystemC-AMS, an open-source C++ extension to the OSCI SystemC Standard dedicated to the description of AMS designs containing digital, analog, RF hardware as well as other disciplines. SoCLib is a library of digital IP simulation models dedicated to the design of shared memory multiprocessor architectures. It is currently being extended to support power estimation at the bit-cycle-accurate level of abstraction. Concretely, a power-aware system of WSN nodes will be detailed that can monitor a physical seismic perturbation, transmit information on this perturbation to other nodes by means of 2.4 GHz RF communication links, and finally compute the epicenter of the perturbation by asking the 32-bit processor embedded in a node to solve the system of nonlinear equations relative to the triangulation algorithm. Each node is powered by an autonomous kinetic battery model.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, p. 252, 2011. © Springer-Verlag Berlin Heidelberg 2011

ASTEC: Asynchronous Technology for Low Power and Secured Embedded Systems

Pr. Marc Renaudin

CTO of TIEMPO SAS, 110 Rue Blaise Pascal, Bat. Viséo - Inovallée, 38330 Montbonnot St Martin - France

Abstract. The presentation highlights recent advances and results of the MINALOGIC ASTEC project in the domain of asynchronous microcontroller design and wireless sensor applications. The work carried out in the ASTEC project is focused on using the asynchronous technology industrialized by Tiempo to design low power and secured embedded systems. In collaboration with TIMA laboratory and CESTI/LETI, Tiempo fabricated and evaluated two versions of a fully asynchronous microcontroller, one without security features and one with security counter-measures against power and fault attacks. Sensaris and Tracedge are integrating the asynchronous microcontroller into their systems in order to take advantage of the technology and design competitive products for the low-power embedded systems market (wireless sensors, medical systems, RFIDs...).

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, p. 253, 2011. © Springer-Verlag Berlin Heidelberg 2011

OPENTLM and SOCKET: Creating an Open EcoSystem for Virtual Prototyping of Complex SOCs

Laurent Maillet-Contoz

STMicroelectronics, Grenoble, France

Abstract. The objective of the OpenTLM project is to offer embedded software developers a tool kit, available under an open source license and based on the SystemC/TLM standard. It enables them to develop and test the embedded software ahead of the availability of hardware platforms (silicon, but also hardware emulators). It gives the opportunity to promote a broader use of the TLM methodology, already adopted by hardware teams, as well as a better concurrent development of the hardware and software parts of the system. Indeed, if software is mature enough when silicon is available, the overall period for system integration is reduced, which accelerates the availability of the product and optimizes time-to-market.

The SoCKET project (SoC toolKit for critical Embedded sysTems) gathers industrial and academic partners to address the issue of design methodologies for critical embedded systems. The work targets the definition of a "seamless" design flow which integrates the equipment qualification/certification, from the system level to the Integrated Circuits (ICs) and the associated embedded software, compliant with the applicable norms (aeronautics: DO-178C, DO-254, ARP4754 - space: ECSS Q60-02, Q80, E40). This "seamless" flow requires some unification of formalisms (elimination of semantic holes in HW/SW interfaces), the availability of model transformation operators (skeleton generation, requirements traceability), and interoperability of models and tools. The main outcomes of the project will be:

* a design flow supporting critical embedded systems development
* a draft IDE implementing this flow and tested with partners' tools (adaptable to other tools and to other applications)
* some return of experience through 4 industrial case studies
* some Certification/Qualification kits for IPs and SoCs in each application domain
* some recommendations to certification and normalization bodies.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, p. 254, 2011. © Springer-Verlag Berlin Heidelberg 2011

Variability-Conscious Circuit Designs for Low-Voltage Memory-Rich Nano-Scale CMOS LSIs

Kiyoo Itoh

Fellow, Central Research Laboratory, Hitachi, Ltd. 1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, Japan Tel.: +81-42-323-1111 [email protected]

Abstract. Low-voltage scaling limitations of nanoscale CMOS LSIs are one of the major problems in the nanoscale era because they cause ever more serious power crises with device scaling. The problems stem from two unscalable device parameters. The first is the high value of the lowest necessary threshold voltage Vt (that is, Vt0) of MOSFETs needed to keep the subthreshold leakage low. The second is the variation in Vt (that is, ΔVt), which becomes more prominent in the nanoscale era. The ΔVt caused by the intrinsic random dopant fluctuation is the major source of various ΔVt components. It increases with device scaling and thus intensifies various detrimental effects such as variations in speed and/or the voltage margins of circuits. Due to such inherent features of Vt0 and ΔVt, the operating voltage VDD is facing a 1-V wall in the 65-nm generation, and is expected to rapidly increase with further scaling of bulk MOSFETs, thereby worsening the power crisis. To reduce VDD, the minimum operating voltage Vmin, as determined by Vt0 and ΔVt, must be reduced. In this talk the Vmin of memory-rich nanoscale CMOS LSIs is investigated in an effort to reduce it to below 0.5 V through variability-conscious device and circuit designs. First, Vmin, as a methodology to evaluate the low-voltage potential of MOSFETs, is proposed on the basis of a tolerable speed variation. Second, the Vmins of the logic, SRAM, and DRAM blocks are compared, and the SRAM block comprising the six-transistor (6-T) cell turns out to be particularly problematic because it has the highest Vmin. Third, new devices, such as a fully-depleted structure (FD-SOI) and a fin-type structure (FinFET) as ΔVt-immune MOSFETs, are investigated to further reduce the Vmins of the above-described blocks. Also investigated are new circuits to reduce the Vmin of each block. For example, for the logic block, new dual-Vt0 and dual-VDD dynamic circuits enable the power-delay product to be reduced to 0.09 at a 0.2-V supply owing to gate-source reverse biasing. For the SRAM block, repair techniques, shortening the data line, upsizing the MOSFETs, control of the common-source line or the word line of the cell, and even the 8-T cell reduce the Vmin. For the DRAM block, if combined with FinFET DRAM cells, a dynamic sense amplifier minimizes the Vt0 and thus Vmin. Finally, it is concluded that such variability-conscious circuit designs should lead to the achievement of 0.5-V nanoscale LSIs, if relevant devices and fabrication processes are successfully developed.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, p. 255, 2011. © Springer-Verlag Berlin Heidelberg 2011

3D Integration for Digital and Imagers Circuits: Opportunities and Challenges

Marc Belleville

CEA, LETI, MINATEC, France

Abstract. To cope with the market requirements of more functionality and performance, while keeping reasonable power consumption, the microelectronic industry has always extensively relied on 2D technology scaling. However, with the technical and economic challenges increasing dramatically at the very advanced nodes, 3D integration is now recognized as a very attractive alternative solution to sustain increased system integration. The key drivers towards 3D integration will first be introduced in this talk. Examples of the various 3D processes, with their associated technological challenges and limitations, will be given. At this stage, 3D design rules and 3D specific CAD tools (industrial or at the research level) will be presented and discussed. Then, examples of 3D IPs or circuits will be detailed. Finally, a perspective on another type of 3D integration (stacking transistors instead of dies or wafers) will conclude this talk.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, p. 256, 2011. © Springer-Verlag Berlin Heidelberg 2011

Signing Off Industrial Designs on Evolving Technologies

Sébastien Marchal

STMicroelectronics, France

Abstract. Many specific challenges need to be addressed in current SOC designs to offer a competitive product. Chip content becomes highly heterogeneous, performance has to be on the leading edge and robustness needs to be guaranteed. Besides the technical merits, the "date of availability" of the product plays a key role in its overall competitiveness. Therefore, as schedule pressure is increasing, moving to a new technology requires more parallelization of activities that used to be done serially. When the technology brick, which is the first link of the chain, moves through various maturity levels, the whole design process may be impacted. Traditionally, the impact of such evolutions was not anticipated. Layouts were updated as a consequence of design rule changes. "Brute force" timing margins were put in the models regardless of design specificities. Design For Variation (DFV) techniques operate at the design flow, SOC design and library/IP design levels to anticipate those variations. Different examples of Design Rule changes or Timing variations are discussed, and techniques to handle such changes are covered. Some aspects of Silicon process corner variations are also presented. Part of the talk covers clock network building techniques which are variation friendly. The impact of such techniques on final design analysis, also called SignOff, is demonstrated. How DFV can simplify SignOff is finally discussed.

R. van Leuken and G. Sicard (Eds.): PATMOS 2010, LNCS 6448, p. 257, 2011. © Springer-Verlag Berlin Heidelberg 2011

Author Index

Alioto, Massimo 62
Apolloni, R. 116

Bachmann, Christian 11
Baz, Abdullah 105
Beigne, Edith 94
Bekiaris, Dimitris 73
Belleville, Marc 256
Berkelaar, Michel 190
Berrandjia, Mohamed Lamine 211
Blanc, Guillaume 1
Boudouani, Nassima 1

Calazans, Ney 150
Carazo, P. 116
Castro, F. 116
Chaver, D. 116
Chillet, Daniel 251
Chow, Wing-Kai 228
Consoli, Elio 62
Corsonello, Pasquale 180
Crippa, Dennis 41

De Rose, Raffaele 180

Economakos, George 73
Eichwald, Irina 200
El Abidine, Khouloud Zine 252
Elissati, Oussama 137

Fernandes, Jorge 84
Fesquet, Laurent 137
Flores, Paulo 84
Frustaci, Fabio 180

Gag, Martin 21
Garcia-Ortiz, Alberto 160
Genser, Andreas 11
Ghavami, Behnam 126
Grass, Eckhard 218
Greiner, Alain 252
Grimm, Christoph 250

Haase, Jan 250
Haid, Josef 11

Indrusiak, Leandro S. 160
Itoh, Kiyoo 255

Jagdhold, Ulrich 218
Jain, Abhishek 41

Kheradmand-Boroujeni, Bahman 170
Knoth, Christoph 200
Kouretas, Ioannis 31

Lanuzza, Marco 180
Lazzari, Cristiano 84
Leblebici, Yusuf 170
Lebreton, Hugo 94
Liacha, Ahmed 211
Lisart, Mathieu 238
Lomné, Victor 238
Lu, Jingwei 228

Maillet-Contoz, Laurent 254
Marchal, Sébastien 257
Maurine, Philippe 238
Monteiro, José 84
Moraes, Fernando 150
Moreira, Matheus 150

Nordholz, Petra 200

Ordas, Thomas 238
Oudjida, Abdelkrim Kamel 211

Paliouras, Vassilis 31
Palumbo, Gaetano 62
Papameletis, Christos 73
Papanikolaou, Antonis 73
Pêcheux, François 252
Pedram, Hossein 126
Pekmestzi, Kiamal 73
Perri, Stefania 180
Petri, Markus 218
Piguet, Christian 170
Pinuel, L. 116
Pontes, Julian 150

Raji, Mohsen 126
Ramezani, Lida 51
Renaudin, Pr. Marc 253
Rieubon, Sébastien 137
Rolandi, Pierluigi 41

Sassolas, Tanguy 1
Schlichtmann, Ulf 200
Schrape, Oliver 218
Sham, Chiu-Wing 228
Shang, Delong 105
Soudris, Dimitrios 73
Steger, Christian 11

Tajary, Alireza 126
Tang, Qin 190
Tiar, Rachid 211
Timmermann, Dirk 21
Tirado, F. 116
Torres, Lionel 238
Toublanc, Jérome 238

van der Meijs, Nick 190
Veggetti, Andrea 41
Ventroux, Nicolas 1
Vivet, Pascal 94

Wegner, Tim 21
Weiß, Reinhold 11
Winkler, Frank 218

Xia, Fei 105

Yahya, Eslam 137
Yakovlev, Alex 105

Zarandi, Hamid R. 126
Zeidler, Steffen 218
Zergainoh, Nacer-Eddine 94
Zjajo, Amir 190