LOW POWER INTEGRATED CIRCUIT ARRAYS

Thesis submitted in partial fulfillment of the requirements for the degree of “DOCTOR OF PHILOSOPHY”

by

Adam S. Teman

Submitted to the Senate of Ben-Gurion University of the Negev

August 2013 Be’er Sheva, Israel

Low Power Integrated Circuit Arrays

Thesis submitted in partial fulfillment of the requirements for the degree of “DOCTOR OF PHILOSOPHY”

by

Adam S. Teman

Submitted to the Senate of Ben‐Gurion University of the Negev

Approved by the advisor, Prof. Alexander Fish

Approved by the Dean of the Kreitman School of Advanced Graduate Studies

January 2014 Be’er Sheva, Israel


This work was carried out under the supervision of

Prof. Alexander Fish

In the Department of Electrical and Computer Engineering

Faculty of Engineering Sciences


Affidavit

I, Adam Shmuel Teman, whose signature appears below, hereby declare that:

X I have written this Thesis by myself, except for the help and guidance offered by my Thesis Advisors.

X The scientific materials included in this Thesis are products of my own research, engendered from the period during which I was a research student.

Date: 11 July 2013 Student's name: Adam Teman Signature:______


Author's Declaration

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.


Acknowledgements and Dedication

First and foremost, I thank my advisor Prof. Alexander Fish for his invaluable guidance and support for the duration of my research. Prof. Fish has been my mentor for the past eight years, from my undergraduate studies through my MSc research to the conclusion of my PhD studies. Through an invaluable and rare combination of professional expertise, managerial guidance, and amiable friendship, Alex has impelled me to excellence, all the while providing me with the opportunities for enjoyment and for the self-fulfillment of achieving my goals. I truly could not ask for a better boss.

I give special thanks to several of my colleagues, who have helped me produce this work. Sagi Fisher has been responsible for convincing me to return to Academia, and I thank him for much of my initial work and his support. Prof. Orly Yadid-Pecht co-mentored my early work on Image Sensors, and helped provide financial support during these projects. Dr. A. Belenky and Dr. A. Spivak have shared with me not only office space but also many experiences and collaborations over the years, and R. Machluf-Zilberberg has been a second mother to us all. I have had the pleasure of mentoring four graduate students: H. Dagan, J. Mezhibovsky, R. Giterman, and L. Atias. Together we have produced much of the work described further on. My several hundred undergraduate students merit my thanks, as they have inspired me to master my knowledge and provided me with the enjoyment of teaching. Several of them are particularly acknowledged here for the excellent projects they have carried out, which often resulted in publications. These include O. Cohen, N. Adri, S. Fraiman, G. Shveky, and O. Bass. I give special thanks to L. Pargament and A. Mordakhay, for our intensive work on some of my most important projects that have achieved great accolades. They have continually impressed me. Finally, I thank two most special colleagues: firstly, Itamar Levy, who has been like a younger brother to me and amazes me every day with his combination of genius and humanity; and secondly, Pascal Meinerzhagen, who has closely collaborated with me over the past year, has taught me volumes, and has become a great friend.


I wish to acknowledge the support of several organizations and foundations that have financially supported my studies and the various projects included herein. First, I thank the Alpha Consortium, for supporting my work on low-voltage SRAM design, subthreshold logic design, and the RFID project. Second, I thank the Kreitman Foundation for providing me with my PhD fellowship, and the Wolf Foundation, Yizhak Ben Ya'akov HaCohen Foundation, Ben-Gurion University, and Intel for having awarded me prizes and scholarships over the course of my studies. Finally, I thank Prof. Andreas Burg of EPFL for providing the majority of the financial support for our collaboration in the research of Gain Cell embedded DRAMs for the past year, as well as for his invaluable mentorship over this period.

This document, which is the culmination of the work I have done for the past five years, is dedicated to my wife, Hadas, my parents, Rhisa and Nissan, and to my new love, Shalev Israela Teman, the daughter I have dreamt about for so many years.

I give thanks to my wife for her love, patience and support throughout my daily struggles with the various dilemmas that arise and the late hours that have been spent working on this as well as for her ear that is always open to hear my complaints and whimpers when times are hard. Hadas has always been there to remind me why I chose this track and to encourage and support my actions and decisions. Hadas is my partner, my love, and the missing piece that has completed me.

I give thanks to my parents for their love and support throughout my whole life, and for their steering me towards a higher education and the achievement of excellence in whatever I undertake without treading on others or using any insincere techniques. It would be hard to imagine better individuals to guide me through life, show me an example of how to live and behave, to provide for all my spiritual and material needs, never expecting anything in return but my love.

Finally to my newborn daughter, Shalev: I am writing these pages as my little Miracle adapts to life in the neonatal ward of Sheba hospital. Hadas and I have overcome a long, rigorous road, challenging everything we have ever known or experienced, far more than we could ever imagine. However, at the end of this road, you appeared and conquered our hearts. I could never dream of anything better. Thank you.


Table of Contents

Affidavit
Author's Declaration
Acknowledgements and Dedication
Table of Contents
List of Figures
List of Abbreviations
    Abbreviations of Frequently Used Terms
    Abbreviations of Organization Names and Specific Tools
List of Symbols
Abstract
Chapter 1 Introduction
    1.1 Background
    1.2 Conduction of Research
    1.3 Organization of the Work
    1.4 Constitutive Articles Review
Chapter 2 Introduction to Low Power Array Design
    2.1 Motivation for Low Power
    2.2 Power Dissipation
        2.2.1 Dynamic Power
        2.2.2 Static Power
        2.2.3 Leakage Currents in MOS Devices
    2.3 Low Power IC Arrays
        2.3.1 Low Power Memory Arrays
        2.3.2 Advanced Low Power CMOS Image Sensors
Chapter 3 Low Voltage SRAM Design and Stability
    3.1 Introduction to Low-Voltage SRAM Design
        3.1.1 Energy Efficient Circuit Design
        3.1.2 SRAM Noise Margins
        3.1.3 6T SRAMs in Sub/Near Threshold
        3.1.4 Previously Presented Low Voltage SRAMs
        3.1.5 Summary
    3.2 A 250mV 8kb 40nm Ultra-Low Power 9T Supply Feedback SRAM (SF-SRAM)
    3.3 A Minimum Leakage 400 mV Quasi-Static RAM (QSRAM) Bitcell
    3.4 A 40-nm Subthreshold 5T SRAM Cell with Improved Read and Write Stability
Chapter 4 Low Power Gain Cell Embedded DRAMs
    4.1 Introduction
    4.2 Review of Recent Gain Cell eDRAM Implementations
    4.3 Minimum Voltage Gain Cell Operation
    4.4 Extending the Retention Time of Gain Cell Arrays
Chapter 5 Low-Power Low-Cost NVM for RFID Tag
    5.1 A Low-Power DCVSL-Like GIDL-Free Voltage Driver for Low-Cost RFID Nonvolatile Memory
Chapter 6 Low Power Techniques for Image Sensors
    6.1 Leakage Reduction in Advanced Image Sensors Using an Improved AB2C Scheme
Chapter 7 Summary
Bibliography
    7.1 List of Publications
        7.1.1 Papers Published in Peer-Reviewed Journals
        7.1.2 Papers Published in Peer-Reviewed Conference Proceedings
    7.2 References
Appendix A: Large VLSI Arrays – Power and Architectural Perspectives
Appendix B: Review and Classification of Gain Cell eDRAM Implementations
Appendix C: Exploration of Sub-VT and Near-VT 2T Gain-Cell Memories for Ultra-Low Power Applications under Technology Scaling
Appendix D: A Low-Cost Low-Power Non-Volatile Memory for RFID Applications
Appendix E: Autonomous CMOS Image Sensor For Real Time Target Detection and Tracking
Appendix F: Retention Voltage Detection for Minimizing the Standby Power of SRAM Arrays


List of Figures

Figure 1: The well-known prediction of the "Power Density Disaster", from [4]. The plot shows the power density of Intel microprocessors from 1970 to 2000, predicting that, at the current pace, by 2013 the power density would be approaching that of the sun's surface.

Figure 2: Components of Total Chip Power Consumption [11]

Figure 3: Primary Leakage Currents in MOSFET Devices

Figure 4: General Memory Architecture

Figure 5: Generic Smart Image Sensor Component Block Diagram

Figure 6: Pareto-Optimal Energy-Delay curve showing the Minimum Energy Point (MEP) and the Minimum Delay Point (MDP). (reproduced from [89])

Figure 7: (a) Schematics of a standard 6T SRAM cell. (b) Butterfly curve of a standard 6T SRAM cell with square showing graphically calculated SNM.

Figure 8: (a) Butterfly curve of a standard 6T SRAM cell during read access. (b) Butterfly curve of a standard 6T SRAM cell during write access.

Figure 9: Standard 8T bitcell, as employed in [87] and [34]

Figure 10: Calhoun and Chandrakasan’s 10T bitcell proposed in [33].

Figure 11: 10T bitcell proposed by Roy’s group in [32]

Figure 12: Schematic of the all-PMOS 2T gain cell with I/O write transistor


List of Abbreviations

Abbreviations of Frequently Used Terms:

1T-1C 1-transistor-1-capacitor
5T Five-Transistor
6T Six-Transistor
A/D Analog/Digital
AB2C Adaptive Bulk Biasing Control
ADC Analog to Digital Converters
APS Active Pixel Sensor
BISR Built-in Self-Repair
BIST Built-in Self-Test
BL Bit Line
BTBT Band-to-Band Tunneling
CAD Computer Aided Design
CAM Content Addressable Memory
CCD Charge Coupled Device
CDS Correlated Double Sampling
CMOS Complementary Metal Oxide Semiconductor
DC Direct Current
DCVSL Differential Cascode Voltage Switch Logic
DIBL Drain Induced Barrier Lowering
DRAM Dynamic Random Access Memory
DRV Data Retention Voltage
eDRAM embedded DRAM
EKV Enz, Krummenacher, Vittoz
FFT Fast Fourier Transform
FIFO First In First Out
FPN Fixed Pattern Noise
GC Gain Cells
GC-eDRAM Gain-Cell embedded Dynamic Random Access Memory
GIDL Gate Induced Drain Leakage
HVT High Threshold Voltage Devices
IC Integrated Circuit
LIFO Last In First Out
LVT Low Threshold Voltage Devices
MC Monte Carlo
MEP Minimum Energy Point
MLC Multi-Level Cells
MOS Metal Oxide Semiconductor
MOSFET Metal Oxide Semiconductor Field Effect Transistor
MTCMOS Multi-Threshold CMOS
Near-VT Near Threshold Operating Region
nMOS n-type Metal Oxide Semiconductor
NVM Non-Volatile Memory
NWL Negative Word Line
pMOS p-type Metal Oxide Semiconductor
PVT Process-Voltage-Temperature
Q-SRAM Quasi-Static Random Access Memory
RBB Reverse Body Biasing
RBL Read Bit Line
RFID Radio Frequency Identification
ROM Read Only Memory
RSCE Reverse Short Channel Effect
RSNM Read Static Noise Margin
RWL Read Word Line
S/H Sample and Hold
SER Soft Errors
SF-SRAM Supply Feedback SRAM
SNM Static Noise Margin
SNR Signal-to-Noise Ratio
SOC System-on-a-Chip
SOI Silicon-on-Insulator
SOS Silicon-on-Sapphire
SPICE Simulation Program with Integrated Circuit Emphasis
SRAM Static Random Access Memory
SSD Solid-state Hard Drives
Sub-VT Subthreshold
ULP Ultra-Low Power
VLSI Very Large Scale Integration
VTC Voltage Transfer Characteristics
WBL Write Bit Line
WL Word Line
WSNM Write Static Noise Margin
WWL Write Word Line


Abbreviations of Organization Names and Specific Tools:

BGU Ben-Gurion University of the Negev
EPFL Ecole Polytechnique Federale De Lausanne
IEEE Institute of Electrical and Electronics Engineers
IEEEI IEEE Israel Conference
IET The Institution of Engineering and Technology
IJ ITK International Journal of Information Technologies and Knowledge
ISCAS IEEE International Symposium on Circuits and Systems
ITHEA Institute of Information Theories and Applications
ITRS International Technology Roadmap for Semiconductors
JLPEA MDPI Journal of Low Power Electronics and Applications
JoE Journal of Engineering
JSSC IEEE Journal of Solid-State Circuits
LPC&S Low Power Circuits and Systems Lab
MDPI Multidisciplinary Digital Publishing Institute
MEJ Elsevier Microelectronics Journal
MIT Massachusetts Institute of Technology
TAU Tel Aviv University
TCAS-II IEEE Transactions on Circuits and Systems-II
TCL Telecommunications Circuits Laboratory
TSMC Taiwan Semiconductor Manufacturing Company Ltd.


List of Symbols

µ0 Zero bias mobility

Ccircuit Total capacitance of the circuit

CDE Decoder output node capacitance

Ceff Effective capacitance

Cload Load capacitance

Cox Gate oxide capacitance

CPT Digital logic capacitance

CSN Storage capacitor

Eanalog Analog energy dissipation

Edigital Digital energy dissipation

Eox Electrical field over the gate oxide

Eread_out Readout energy

Erefresh Refresh energy

Ereset Reset energy

f Operating/switching frequency

FR Frame rate

iact Active current

IDCP Static current of the periphery

IDS Drain-to-source current

Igate Gate leakage

ihld Data retention current of inactive cells

Ileak Standby leakage current

Ioff Off-current

Ion On-current

Istatic Static current

J0 Gate tunneling coefficient

JG Gate tunneling current density

k Boltzmann constant or wave vector

L Transistor length

n Inverse subthreshold swing coefficient


Pdynamic Dynamic power consumption

Pleakage Leakage power

Pshort circuit Short circuit power consumption

Pstatic Static power consumption

ɸt Thermal voltage

q Electron charge

S Subthreshold swing coefficient

SiO2 Silicon dioxide

T Temperature

tret Retention time

VBS Body-to-source voltage

VDD High supply voltage

VDS Drain-to-source voltage

VGB Gate-to-body voltage

VGS Gate-to-source voltage

VGS-VT Overdrive

Vint Internal supply voltage

VNWL Underdrive voltage

VSS Low supply voltage

Vsupply Supply voltage

Vswing Voltage swing

VT Threshold voltage

VT0 Zero biasing threshold voltage

W Transistor width

α Activity factor

γ Body-effect coefficient

ε Dielectric permittivity

η DIBL coefficient

Φb Fermi potential


Abstract

As Moore’s Law continues to progress, low power has overtaken high performance as the primary focus of VLSI design. The continuous scaling of technology processes has degraded the Ion/Ioff current ratio of standard devices, which has resulted in unacceptable increases in leakage currents and has made static power the major component of power consumption in many of today’s systems. Due to their large area and number of devices, array-based sub-systems, such as memories and pixel arrays, are often the primary consumers of overall system power. A large percentage of these blocks generally reside in a static “hold” or “integration” state during the majority of the clock cycles, so their main power contribution is due to static consumption. Accordingly, many research groups around the world have focused on developing low power techniques and methodologies for all types of arrays, in particular for static power reduction.

In this thesis, power reduction research has been applied across several disciplines of array design for the first time, specifically several classes of memories and image sensors. In the process of this research, the characteristics, topologies, and architectures of each type of array have been studied and compared. For each discipline, novel techniques for low power operation have been developed, taking into consideration the particular challenges of each technology, as well as finding opportunities to adapt techniques common to one type of array for implementation on another.

Each of these design disciplines presented additional challenges that necessitated an in-depth understanding of their mechanisms in order to properly design robust solutions. In the realm of Static Random Access Memory (SRAM) design, the stability issues that arise when integrating power reduction techniques, such as voltage scaling, have been thoroughly studied.
In the field of Gain-Cell based embedded Dynamic Random Access Memory (GC-eDRAM) design, data retention has been considered and analyzed to ensure proper refresh frequencies and to calculate the overall power consumption. In order to properly operate an innovative low-cost Non-Volatile Memory (NVM) device, special analysis of non-standard leakage currents had to be carried out and a sizing methodology had to be developed in order to achieve low-power, robust control signal biasing. "Smart" image sensors were perfect candidates for the integration of several types of arrays together into a single system; in the design and implementation of such a low-power sensor, component sharing was applied to optimize the power, area, and performance of the system.

This thesis presents a cross-discipline, multi-level approach for power reduction. Four different design fields were considered: SRAM, GC-eDRAM, Logic-compatible NVM, and "Smart" Image Sensors. Each of these fields was analyzed for power reduction opportunities, starting from the technology level, considering the effects of scaling and device types for the implementation of each type of sub-system, through the circuit level, which comprises the core of this work, and up to the architecture and algorithm level, which enable innovative methods for operating the underlying circuits for power reduction. At several points, an opportunity for power reduction was observed, and subsequently, a study was carried out to analyze, design, and implement a sub-system based on this technique or integrating several of these techniques. This approach has initiated a large number of sub-projects, several of which have culminated in the fabrication of test-chips and the publication of the results. These publications comprise the body of this thesis.

The first subject presented is Low Voltage SRAM Design and Stability. As the major solution for embedded memories, SRAM often comprises over 50% of the silicon area of an integrated circuit chip, and its static power is often the major component in the system's power consumption. The most efficient method for static power reduction in SRAMs has been found to be supply voltage reduction. However, as the supply voltage is reduced, the functionality of the SRAM is compromised, which often manifests as a reduction in data stability.
In this research, three novel SRAM bitcells have been designed for low-voltage, and therefore ultra-low power, operation. In order to analyze and prove the data stability of these designs in light of the extreme process variations inherent to technology scaling, these designs have been examined under advanced static and dynamic stability methods and metrics, several of which were developed as part of this research. Three published journal articles are presented herein, describing the proposed solutions, analyses, simulations, and measurements; in addition, several papers that extend these works are provided in the appendices and references.

As an alternative low-power embedded memory solution, the GC-eDRAM design comprises the second part of this thesis. This dynamic memory circuit provides a reduced area solution for logic-compatible memory, as compared to standard SRAM arrays. In the framework of this research, the GC-eDRAM has been considered as a low-power solution, due to its inherently low leakage currents; however, the dynamic power required to periodically refresh these arrays must be taken into consideration when benchmarking the overall standby power consumption. This, in turn, has led to a deep analysis of the retention time of gain cell circuits and the development of methods for extending this retention time to minimize the refresh frequency. One of the outcomes of this research was the first attempt to operate a GC-eDRAM at subthreshold supply voltages, as well as an examination of the minimum feasible supply voltage for the operation of such circuits under technology scaling. The published journal article describing this analysis is provided in full. This research was carried out in collaboration with the Telecommunications Circuits Laboratory (TCL) at Ecole Polytechnique Federale De Lausanne (EPFL).

In the third part of this thesis, the core of a major multi-year research project that included the design and implementation of a low-cost passive Radio Frequency Identification (RFID) tag is summarized. The low-cost goal of this system is achieved through low power design and integration of the TowerJazz C-Flash NVM bitcell, a logic-compatible, single-poly floating gate memory. In the framework of this research, I led the group that designed the digital control and memory sub-system, ultimately responsible for full chip integration.
This included designing the architecture and circuits required to operate the C-Flash cells, and the many innovations required to achieve this with a low power figure. A number of these circuits are described and analyzed in the published article provided in the body of this thesis and in Appendix D.

The final field that is discussed in the framework of this dissertation is that of "smart" image sensors. In contrast with a standard image sensor that comprises a pixel array and some or all of the peripherals required for its operation, a smart image sensor provides additional circuits and components on-chip for advanced functionality. One common component of a smart imager is an on-chip memory for uses such as storing data regarding past pixel readouts. As such, this system was a good candidate for applying cross-discipline methods, and was thus chosen for the application of peripheral sharing techniques. In this chapter a published article is provided, which describes an adaptive bulk biasing control (AB2C) scheme for power reduction in image sensors, as well as the integration of this technique with a wide-dynamic range imager. More in-depth descriptions of this type of smart imager appear in [1, 2], and Appendix E describes a low-power autonomous tracking sensor as an additional example of a low-power smart image sensor.

All of the proposed circuits, systems, algorithms, and methods have been analyzed theoretically and simulated with CAD tools, such as Cadence Spectre. Several of these designs have been fabricated and tested in silicon. These included an 80nm test-chip for the smart imager, a 40nm test-chip with low-voltage SRAMs, a 0.18μm test-chip with GC-eDRAM arrays, and several 0.18μm test-chips with various parts of the NVM array, culminating in a test-chip which includes the complete RFID system. The various projects included collaborations with CSR (formerly Zoran), TowerJazz (formerly Tower Semiconductor), Altair, Tel Aviv University, and EPFL. A total of 10 articles were published in peer reviewed international journals and 13 papers were presented at IEEE conferences and published in the conference proceedings. The most important of these publications are included either in the body of the text or in the attached appendices.

Keywords: VLSI, Low Power, Arrays, Digital Memory, SRAM, eDRAM, gain cells, digital circuits, Image Sensors, Sub-threshold circuit design, Near-threshold circuit design, “Smart” Image Sensors, Systems-on-a-Chip


Chapter 1 Introduction

1.1 Background

As the number of components on a chip, the minimum device size, and the microprocessor frequency have continually progressed for more than four decades in accordance with Moore’s Law [3], so have two important parasitic features: power consumption and power density [4-7]. The exponential growth of power density has been accompanied by increasingly complex and expensive solutions for heat dissipation, while the high power consumption has hampered the expansion of mobile devices due to their limited battery life. By the turn of the millennium, dealing with these previously overlooked characteristics became the main focus of VLSI design [4, 5, 8]. The exponential increase in frequency was discontinued, and various novel techniques were introduced into mainstream applications, such as multiple low frequency cores, voltage islands and power gating [9].

This change in focus did not, however, preempt Moore’s original prediction of the exponential increase in the number of devices on a chip. Accordingly, new generations of process technology have been introduced every few years with shorter channel lengths and numerous side effects. Of these, one of the most problematic is the degradation of the Ion/Ioff ratio of a minimum length transistor that is caused by ever-growing leakage currents [10]. This has created a rather unexpected trend: the static (leakage) power consumption of various systems has surpassed the dynamic (operational) power [11]. This phenomenon is especially prevalent in components that incorporate large numbers of minimum length devices that are held in a static or standby state for a substantial percentage of the operating cycle. Two perfect examples of such components are memory blocks and image sensors. Both of these components comprise arrays of basic circuits that cover significant areas, often more than half of the total silicon die.
Memory arrays generally store data in a static state and are only accessed when a read or write operation is executed on a small portion of the array. Image sensors generally operate with a long integration period, during which the pixels wait for a predetermined duration while the incident light discharges a photodiode. In both cases, the period of static operation for each basic cell is much longer than the active transients. This, coupled with the ever-increasing size and density of the memory and pixel arrays, generally causes the static power to be the main factor in the component’s total consumption [12].

Low power research in both the fields of memories and image sensors has become increasingly popular over the past decade. A continuous trade-off persists for memory blocks, forfeiting density, i.e., bitcell size, for power consumption, while keeping the minimum performance or speed requirements for a given application. In the realm of Static Random Access Memory (SRAM), circuit level research has proposed several modifications to the standard 6-transistor (6T) bitcell aimed at lowering the power consumption [13-20]. This is usually achieved by adding transistors, control lines and bias voltages, and often includes methods for maintaining noise margins at lower operating voltages. For the more basic memory structures, such as Read Only Memory (ROM), Dynamic Random Access Memory (DRAM) and Flash memory, a single device is used to store a bit or more of data, limiting the flexibility of circuit level power reduction techniques. The addition of a single transistor generally doubles the silicon area of the memory, resulting in unacceptable costs. Therefore, the methods for power reduction in these components generally focus on process, architecture and/or algorithm level techniques, many of which are effective for SRAMs as well. An unexpected low-power alternative to SRAM has arisen in recent years, as several groups have realized that Gain-Cell based embedded DRAMs (GC-eDRAM) can be operated with potentially lower standby power than SRAM [21-25].
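The standby power comparison between SRAM and GC-eDRAM mentioned above can be illustrated with a rough first-order sketch. It assumes that SRAM standby power is leakage only, and that GC-eDRAM standby power is leakage plus the energy of refreshing every bit once per retention period. The function names and all numeric values below are hypothetical placeholders chosen for illustration; they are not measurements from this work or from [21-25].

```python
# First-order standby power model (illustrative assumption, not the thesis's model).

def sram_standby_power(n_bits, i_leak_per_bit, vdd):
    """Leakage-only standby power of an SRAM array, in watts."""
    return n_bits * i_leak_per_bit * vdd


def gc_edram_standby_power(n_bits, i_leak_per_bit, vdd, e_refresh_per_bit, t_ret):
    """Leakage power plus refresh power, assuming one refresh of every bit per t_ret."""
    leakage = n_bits * i_leak_per_bit * vdd
    refresh = n_bits * e_refresh_per_bit / t_ret
    return leakage + refresh


def break_even_retention_time(i_leak_sram, i_leak_gc, vdd, e_refresh_per_bit):
    """Retention time above which the gain cell array has the lower standby power."""
    return e_refresh_per_bit / ((i_leak_sram - i_leak_gc) * vdd)


# Hypothetical example values: an 8 kb array at a 1 V supply.
N_BITS = 8192
VDD = 1.0                # supply voltage [V]
I_LEAK_SRAM = 10e-12     # per-bit SRAM leakage [A] (placeholder)
I_LEAK_GC = 1e-12        # per-bit gain cell leakage [A] (placeholder)
E_REFRESH = 10e-15       # per-bit refresh energy [J] (placeholder)
T_RET = 10e-3            # retention time [s] (placeholder)

p_sram = sram_standby_power(N_BITS, I_LEAK_SRAM, VDD)
p_gc = gc_edram_standby_power(N_BITS, I_LEAK_GC, VDD, E_REFRESH, T_RET)
t_even = break_even_retention_time(I_LEAK_SRAM, I_LEAK_GC, VDD, E_REFRESH)

print(f"SRAM standby power:     {p_sram:.3e} W")
print(f"GC-eDRAM standby power: {p_gc:.3e} W")
print(f"Break-even retention:   {t_even:.3e} s")
```

Under this simplified model, the gain cell array only wins when its retention time exceeds the break-even value, which is why extending the retention time directly lowers the standby power.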
The biggest step in the low power operation of image sensors came with the introduction of the Active Pixel Sensor (APS) that enabled the fabrication of an image sensor in a standard CMOS process. Contrary to the traditional Charge Coupled Device (CCD) sensor, an APS based CMOS imager could operate at lower voltages and could be integrated on the same die with other embedded components, thus paving the way for the camera-on-a-chip and the “smart” image sensor [26-30]. As CMOS imagers have progressed, and especially with the advanced functionality of “smart” image sensors, various techniques for power reduction have been proposed and developed at all levels of design [31].


Despite the fact that both of these disciplines present similar structures and architectures and are characterized by similar power consumption problems, the research of the two has never previously been combined. The two fields are generally independent of one another, with research groups focusing either on memory design or on image sensor development. In fact, the majority of memory research groups focus primarily on one type of memory, seldom designing both SRAM and DRAM, for example. In this dissertation, a cross-discipline, low-power array design is presented, in which the knowledge and techniques for power reduction in several fields have been assembled under one roof. Within the scope of this research, I have studied, researched, and analyzed these fields, recognized the power reduction opportunities in each discipline, and when applicable, adapted a technique, developed for one type of circuit, for integration with another.

1.2 Conduction of Research

This research was based on two prominent trends in digital Integrated Circuit (IC) design that have characterized the field in recent years:

1. Low power has become the main focus of VLSI design.

2. Array based components, such as memories and image sensors, occupy large percentages of the total die area and often consume over 50% of the total system power.

Given these trends, it is clear that power reduction in arrays is a central goal in VLSI design, and, in fact, both low power memories [18, 32-41] and low power imagers [26, 27, 30, 31, 42-45] are popular fields of research.

Upon closer examination of the various types of arrays, it is hard not to notice the similarities among them. They all are composed of a basic unit cell that is optimized for density and functionality and laid out in a two-dimensional regular pattern. Most have a row and column addressing scheme to access the unit cells according to the application requirements. They include digital and analog peripheral components, which are used to improve the Signal-to-Noise Ratio (SNR), enhance their performance, optimize their timing and power profiles and create necessary biases. They are all accompanied by strict design methodologies to achieve high density, noise immunity and high throughput. Nevertheless, each array type is practically independent from the others: the majority of the research groups focus on one field and fine-tune only one specific design without taking the others into consideration.

In this doctoral research, low power array design has been targeted as a single unified discipline, without limiting the techniques and methodologies to a certain field of application. Included herein is an extensive analysis and comparison of SRAMs, GC-eDRAMs, Non-Volatile Memories (NVM), and CMOS Imagers, which recognizes the similarities among them, develops low power techniques for each, and attempts to adapt techniques developed for one field for implementation on the others. To the best of my knowledge, this doctoral research is the first time such a cross-discipline approach has been taken, especially in the realms of memories and image sensors.

The research summarized in this dissertation was conducted according to the following methodology. First, an extensive literature survey was carried out to learn the basic concepts and to review what has previously been tried and achieved. As a result of the literature survey, opportunities for power reduction in each of the disciplines were recognized, and ideas for exploiting these opportunities were conceived. This process continued throughout the research and was not bounded by the ideas conceived during the initial stages. The ideas were compiled into a schematic implementation of the solution: for example, a schematic circuit for circuit level designs or a block diagram for system level solutions. These designs were theoretically analyzed for their expected feasibility and efficiency and then implemented in Computer-Aided-Design (CAD) based tools for synthesis and simulation.
At the circuit and device levels, the ideas were implemented in the Cadence Virtuoso suite and tested for functionality with the Spectre SPICE level simulator. All the circuits were analyzed under both process corners to evaluate the impact of global variations, as well as under a Monte Carlo (MC) statistical analysis to evaluate the combined impact of global and local variations. After achieving a satisfactory circuit level design, the designs were laid out and parasitic capacitances and resistances were extracted using Mentor Graphics Calibre parasitic

extraction tools. Post-layout simulations were conducted on extracted schematics and the circuits and layouts were modified and improved as necessary. A higher level analysis was carried out with MATLAB, often invoking MATLAB along with Virtuoso for more sophisticated control of circuit level simulation. Control blocks at a system level were implemented in Verilog and VHDL, simulated with ModelSim, and synthesized with Synopsys Design Compiler or Cadence RTL Compiler for physical implementation. Ultimately, the circuit and logic level designs were integrated into a final layout using Cadence Encounter digital implementation tools along with Cadence Virtuoso custom design tools. The physical verification was implemented with Mentor Graphics Calibre tools. Four separate research projects were completed at the design level within the scope of this work, resulting in test-chips that were fabricated in various process technologies. The first test-chip was a fully custom 80nm chip, designed in TSMC's 80nm LP CMOS process, including the smart image sensor with an integrated AB2C scheme. The second test-chip was a 40nm chip, designed in TSMC's 40nm LP CMOS process, integrating digital control blocks with a pair of custom designed low-voltage SRAM arrays. This chip was the first 40nm test-chip fabricated by an academic research group in Israel. The third project to reach the fabrication stage was the RFID tag, designed along with TowerJazz and Tel Aviv University (TAU). This project included several test-chips that were integrated on periodic shuttles in the TowerJazz 0.18μm CMOS process, including various components of the full system. The complete design was fabricated in a final test-chip at the conclusion of this project. Finally, a gain-cell test-chip, including digital control blocks and two custom low-power gain-cell arrays, was fabricated in the UMC 0.18μm MM/RF CMOS process. The conclusion of the research projects was accomplished as follows.
For projects that resulted in the fabrication of a test-chip, post-silicon measurements were conducted to test the functionality of the circuits and systems and to evaluate the results. In these cases, the designs and results were summarized and published in field leading peer-reviewed journals, such as the IEEE Journal of Solid State Circuits (JSSC) [46-49]. For projects that did not reach the fabrication stage or were conducted at a theoretical level, the analysis and

simulation results were summarized, presented, and published at leading IEEE conferences, such as the IEEE International Symposium on Circuits and Systems (ISCAS) [21, 47, 50-59] and in peer-reviewed journals [1, 2, 25, 42, 60]. Two novel low-voltage SRAM designs (SF-SRAM and QSRAM) were filed as patent applications [61, 62].

1.3 Organization of the Work

This dissertation presents a multi-level, cross-disciplinary approach for low-power Integrated Circuit Array design and demonstrates its efficiency in a variety of designs. Ten peer-reviewed international journal papers and thirteen peer-reviewed IEEE conference papers were published covering the investigated areas, in addition to two international patent applications. The main body of this dissertation comprises seven of the journal articles published throughout this research, and is organized as follows:

• Chapter 3 introduces the reader to low voltage SRAM design and stability analysis. This chapter comprises four in-line journal articles [46, 49, 60, 63], describing three novel low-voltage SRAM solutions, and includes material described in five additional publications [46, 53, 55, 56, 58], including Appendix F.

• Chapter 4 introduces GC-eDRAM as an alternative to SRAM, and presents methods for low-power operation of GC-eDRAM arrays. This chapter comprises one in-line journal article [25] and includes material described in two additional publications that are attached in Appendix B [21] and Appendix C [25].

• Chapter 5 describes the development of a low-power low-cost Radio Frequency Identification (RFID) tag, based on an ultra-low power embedded non-volatile memory solution. This chapter comprises one in-line journal article [64] and includes material described in [47] and attached in Appendix D [59].

• Chapter 6 analyzes the integration of low-power methods, which were developed for image sensors, into additional arrays, and presents an improved Adaptive Bulk Biasing Control (AB2C) solution for power reduction in a "smart" image sensor, integrating a pixel array alongside an embedded memory. This chapter comprises one in-line journal article [48] and includes material described in [1, 2, 52] and attached in Appendix F [51].

• Finally, Chapter 7 summarizes the work by discussing the research results and conclusions and emphasizing their significance and novelty.

1.4 Constitutive Articles Review

Throughout the course of this research, ten journal and thirteen conference papers were published in international, electronically indexed publications. The core of this work comprises five of the journal articles [46, 48, 49, 60, 64], which have been published in leading IEEE journals (IEEE Journal of Solid State Circuits [46, 64], IEEE Sensors [48], and IEEE Transactions on Circuits and Systems II [60]) and a leading Elsevier publication (Microelectronics Journal [49]). Two of these articles were published in special issues of their respective journals [48, 60], and a brief preliminary version was published in the prestigious first edition of the MDPI Journal of Low Power Electronics and Applications [63]. These publications appear in their entireties further on in the body of the work. Another five journal articles were published within the framework of this research [1, 2, 25, 42, 63]. All conference papers and the remaining journal articles are cited in the body of the work [1, 2, 21, 42, 47, 50-59, 65] and/or attached as appendices, including the three papers that were presented as part of special sessions of international IEEE conferences [21, 52, 58]. Several other papers, which are currently in final stages of compilation or in the midst of a peer-review process, will be briefly discussed within the chapters of the manuscript. This section briefly reviews the articles comprising the body of the dissertation, emphasizing their significance, logical interconnection, and contributions to the research. In addition, papers attached as appendices are enumerated, and other publications in each field are noted below.

Low Voltage SRAM Design and Stability

1. A. Teman, L. Pergament, O. Cohen and A. Fish, "A 250mV 8kb 40nm Ultra-Low Power 9T Supply Feedback SRAM," IEEE JSSC, 46/11, pp.2713-2726, Nov. 2011.
This is the flagship paper in my research of low voltage SRAMs and describes the novel 9-transistor Supply Feedback SRAM (SF-SRAM), which was invented and patented within the framework of this research. The article describes the topology of this innovative bitcell that was targeted for robust subthreshold operation, and includes an in-depth analysis of the operating mechanisms of the bitcell, the memory array subsystems, and measurement results from a 40nm test chip.

2. A. Teman, A. Mordakhay, and A. Fish, “Functionality and Stability Analysis of a 400mV Quasi-Static RAM (QSRAM) Bitcell,” Elsevier MEJ, 44/3, pp. 236-247, 2013.

This full length journal article provides an in-depth stability analysis for the novel 9-transistor QSRAM bitcell, extending the brief presentation of Article 2, above. This article includes groundbreaking dynamic stability analysis methods, applied on a non-standard SRAM cell for the first time, further proving the feasibility of the concept. Finally, the measurement results from a 40nm test chip are presented.

3. A. Teman, A. Mordakhay, J. Mezhibovsky, A. Fish, “A 40 nm Sub-Threshold 5T SRAM Bit Cell with Improved Read and Write Stability,” IEEE TCAS-II, 59/12, pp. 873-877, 2012.

In this brief article, which was published as part of the prestigious special issue on "Ultra-Low Voltage VLSI Circuits and Systems for Green Computing", a small form-factor 5T SRAM bitcell is presented, targeted for subthreshold operation. This article provides an in-depth dynamic stability analysis in a 40nm process for proof of concept.

4. Appendix F: N. Edri, S. Fraiman, A. Teman, and A. Fish, “Data Retention Voltage Detection for Minimizing the Standby Power of SRAM Arrays,” IEEEI 2012, Nov. 2012.

Additional publications in this subject include [50, 53-56, 63, 65].

Low Power Gain-Cell Based Embedded DRAM

5. Appendix B: A. Teman, P. Meinerzhagen, A. Burg, and A. Fish, “Review and Classification of Gain Cell eDRAM Implementations,” IEEEI 2012, Nov. 2012.

6. Appendix C: P. Meinerzhagen, A. Teman, R. Giterman, A. Burg, A. Fish, “Exploration of Sub-VT and Near-VT 2T Gain-Cell Memories for Ultra-Low Power Applications Under Technology Scaling,” MDPI JLPEA, 3/2, pp. 54-72, 2013.

This article, co-authored with P. Meinerzhagen of EPFL, Switzerland, presents an analysis of gain-cell technology under low operating voltages and across process technologies. This article is an extension of our work presented at the 2012 IEEE Sub-threshold Microelectronics conference in Boston, MA. This analysis is the first time that subthreshold operation of gain-cell embedded DRAM circuits has been considered. It is the first published journal article in a series of planned submissions following the fruitful collaboration between our lab at BGU and Prof. Andreas Burg's lab at EPFL.

Additional publications in this subject include [57].

Low Power Non-Volatile Memory for RFID

7. H. Dagan, A. Teman, E. Pikhay, V. Dayan, A. Mordakhay, Y. Roizin and A. Fish, “A GIDL Free Tunneling Gate Driver for a Low Power NVM Array”, IEEE JSSC, PP. 2013.

This publication is the first of a series of articles resulting from our multi-year research project targeting the development of a low-power passive RFID tag. In this article, the design of complex voltage multiplexers for biasing a low-cost non-volatile memory solution is presented. An in-depth dynamic analysis of the proposed circuits is included, based on SRAM dynamic stability analysis (Articles 1-4). Test circuit measurement results are provided as proof of functionality.

8. Appendix D: H. Dagan, A. Teman, et al., "A Low-Cost Low-Power Non-Volatile Memory for RFID Applications," ISCAS 2012, pp. 1827-1830, May 2012.

Additional publications in this subject include [47].

Low Power Techniques for Smart Image Sensors

9. A. Teman, O. Yadid-Pecht, and A. Fish, “Leakage Reduction in Advanced Image Sensors Using an Improved AB2C Scheme,” IEEE Sensors, 12/4, pp. 773-784, 2012.

This article, listed in the list of top accessed articles in IEEE Sensors for February 2012, was published as part of the special issue on Low Power Arrays. The work describes an improvement of the previously proposed AB2C scheme for integration in smart image sensors, including test chip measurements and analytical analysis. This work is an extension of a paper presented at the IEEE Sensors Conference in New Zealand, 2009.

10. Appendix A: A. Teman, O. Yadid-Pecht, A. Fish, “Large VLSI Arrays – Power and Architectural Perspectives”, IJ ITK, vol.4, pp. 76, 2010.

11. Appendix E: A. Teman, S. Fisher, L. Sudakov, A. Fish, O. Yadid-Pecht, “Autonomous CMOS Image Sensor for Real Time Target Detection and Tracking,” ISCAS 2008, pp. 2138-2141, Seattle, WA, USA, May 2008.

Additional publications in this subject include [1, 2, 42, 52].


Chapter 2 Introduction to Low Power Array Design

2.1 Motivation for Low Power

During the first three decades of microelectronic development, the primary focus was on increasing the density and frequency of digital circuits. Power was generally overlooked, and, aside from the move from high powered bipolar and nMOS logic families to CMOS in the early 80’s, power considerations were not usually taken into account. However, the exponential advancement of most of the VLSI metrics, which Gordon Moore [3] had predicted as far back as 1965, resulted in the realization that expected power consumption and power density would soon become unacceptable. According to the predictions made by the International Technology Roadmap for Semiconductors (ITRS) [12], without the application of low power design, the power consumption of microprocessor IC’s would quickly rise into hundreds and then thousands of watts; and, even worse, the power density would reach thousands of W/cm². To emphasize this point, as far back as 1996, the power density of a microprocessor chip was equivalent to that of a hot plate, and predictions showed that over the following decade, it would surpass that of a nuclear reactor and then a rocket nozzle. This famous “doomsday prediction” is shown in Figure 1 [4, 5, 8, 66].

Accordingly, the early 90’s brought a new interest in low power design [7]. The rise in the popularity of mobile and handheld devices, together with the price and complexity of cooling requirements for high performance IC’s, paved the way to a change in focus for VLSI design. The continuous race for performance was stopped, and the primary goal shifted to a reduction of power consumption [9]. At the consumer level, this was noticeable because lower frequency dual-core processors replaced the power hungry 4GHz Pentium 4’s. Research in low power techniques was expanded to all levels of design, from the introduction of multiple threshold devices and low power processes to the implementation of low power algorithms in CAD design flows [7].

[Figure 1 plot: power density (W/cm²) versus year, 1970 to 2010, for Intel microprocessors (4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, P6), with reference levels for a hot plate, a nuclear reactor, a rocket nozzle and the sun's surface. Source: Intel.]

Figure 1: The well-known prediction of the "Power Density Disaster", from [4]. The plot shows the power density of Intel microprocessors from 1970 to 2000, predicting that, at the current pace, by 2013 the power density would be approaching that of the sun's surface.

2.2 Power Dissipation

It is important first to discuss the sources of power dissipation in a VLSI system in order to understand the various design techniques that will be referred to herein. The general equation for describing the total power consumption of a circuit is given by [6]:

PPPtotal dynamic static ( 1 ) where Pdynamic is the dynamic power consumption consumed during transients and Pstatic is the static power consumption consumed in steady state. The dynamic power parasitically includes an additional factor, Pshort circuit, which is the power attributed to short circuit currents that flow between the supply rails during transients and are not utilized to charge the output capacitances.

2.2.1 Dynamic Power

Traditionally, the dynamic power component has been the dominant factor of power consumption. This is considered a “positive” factor, as the dynamic power provides the functionality of the circuit. The logic levels are represented by the voltages on the capacitances of the inner nodes, and the dynamic power is that which charges or discharges these capacitances. For most modern digital circuits, the dynamic power can be described with Equation ( 2 ):

$$P_{dynamic} = f \cdot C_{eff} \cdot V_{supply} \cdot V_{swing} + P_{short\ circuit} \tag{2}$$

where f is the circuit’s switching frequency; Ceff is the effective capacitance that switches during a single transition, often expressed as αCcircuit, where α is the probability that the circuit will switch (or the fraction of the total capacitance that switches during a single transition) and Ccircuit is the total capacitance of the circuit; Vsupply is the supply voltage, generally (VDD-VSS); and Vswing is the voltage swing of the capacitance as the result of a transition. For most digital gates this can be simplified to $\alpha f C_{load} V_{DD}^2$, considering a full rail-to-rail output swing, with Cload the output capacitance of the gate. Equation ( 2 ) provides five straightforward approaches to dynamic power reduction [6]:

• Lowering the frequency (f) – even though this usually is considered a direct reduction of performance.

• Reducing the switching probability/activity factor (α) – this can be achieved with careful, but complex, design and planning.

• Reducing the circuit capacitance – this is an inherent feature of device scaling, but also requires both limiting the number of devices, as well as providing a careful layout.

• Reducing the supply voltage and swing (Vsupply and/or Vswing) – this is considered the most effective way to save power due to the quadratic dependence, if supply voltage equals swing, of dynamic power on VDD; however this usually comes at the expense of performance and robustness.

• Minimizing the short circuit power (Pshort_circuit) – this is generally achieved by keeping equivalent transition times on the inputs and outputs of circuits and by keeping threshold voltages high; however, it cannot be completely removed. Short circuit currents also result in peak power figures that have to be considered during power distribution design.
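The trade-offs listed above can be illustrated numerically. The following is a minimal Python sketch of Equation ( 2 ), not a model of any circuit in this work: the function name `dynamic_power` and all operating-point values (activity factor, capacitance, frequency, supply voltages) are hypothetical and were chosen only to show the linear dependence on frequency and the quadratic dependence on the supply voltage when the swing is rail-to-rail.

```python
def dynamic_power(f_hz, c_eff_f, v_supply, v_swing=None, p_short_circuit=0.0):
    """Equation (2): f * C_eff * V_supply * V_swing + P_short_circuit, in watts."""
    if v_swing is None:
        v_swing = v_supply  # full rail-to-rail swing, giving the V_DD^2 form
    return f_hz * c_eff_f * v_supply * v_swing + p_short_circuit


# Hypothetical operating point, chosen only for illustration.
alpha = 0.1        # activity factor
c_circuit = 1e-9   # total circuit capacitance [F]
f_clk = 100e6      # switching frequency [Hz]

p_nominal = dynamic_power(f_clk, alpha * c_circuit, v_supply=1.0)
p_half_vdd = dynamic_power(f_clk, alpha * c_circuit, v_supply=0.5)
p_half_f = dynamic_power(f_clk / 2, alpha * c_circuit, v_supply=1.0)

print(f"nominal dynamic power: {p_nominal * 1e3:.2f} mW")
print(f"halving V_DD reduces it by a factor of {p_nominal / p_half_vdd:.1f}")
print(f"halving f reduces it by a factor of {p_nominal / p_half_f:.1f}")
```

Passing an explicit `v_swing` smaller than `v_supply` models a reduced-swing node, in which case the dependence on the supply voltage is only linear.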


2.2.2 Static Power

The second factor of power dissipation is generally considered a parasitic component, which is unnecessary for the operation of most modern digital circuits. With the introduction of Static CMOS digital logic, characterized by a high resistance path between the supplies at steady state, the static power problem was temporarily solved. However, as technology scaling has delved deep into the nanometer region, the Ion/Ioff ratio of standard devices has degraded, leading to leakage currents that are no longer negligible [10]. This is in addition to the static currents that are necessary for the functionality of most analog blocks, as well as for certain application specific digital circuits. The static power dissipation is defined as:

$$P_{static} = V_{supply} \cdot I_{static} \tag{3}$$

where Vsupply is (VDD-VSS), as defined above, and Istatic is the static current at a steady state.

For an optimized circuit with all its components running exactly at the frequency required to finish their transitions, the static power would be negligible compared to the dynamic power and would not need to be considered. However, the nature of synchronous systems sets the clock frequency according to the longest path delay, leaving the majority of the system’s components in a steady state for a substantial percentage of the clock period [34]. In addition, many components or parts of components are kept in an idle state for the majority of the functional cycles. For example, memory blocks are only accessed during a certain percentage of functional cycles and then only a small number of cells are accessed. Accordingly, a large percentage of a system’s circuits are in a static state at any given time, and some are in a static state for the vast majority of the time. As a result, the static power component is a major factor in the total power consumption of most modern systems. In fact, due to technology scaling together with the increasing sizes of memory, static power has overtaken dynamic power as the main factor of power consumption in many of today’s systems, and this trend is expected to increase (see [11]). This trend is illustrated in Figure 2, as taken from the ITRS roadmap [12].
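To make the argument above concrete, the following Python sketch compares the two power components for a memory array that is accessed only during a fraction of the cycles. It is an illustration under stated assumptions only: the function `array_power` and every number in the example (array size, per-cell leakage, switched capacitance, access fraction) are hypothetical and do not describe any design in this work.

```python
def array_power(v_dd, f_hz, c_access_f, access_fraction, n_cells, i_leak_per_cell):
    """Return (dynamic, static, total) power in watts for a simple array model.

    Dynamic power follows the alpha*f*C*V_DD^2 form, with the access fraction
    playing the role of the activity factor; static power follows Equation (3),
    with every cell leaking continuously.
    """
    p_dynamic = access_fraction * f_hz * c_access_f * v_dd ** 2
    p_static = v_dd * n_cells * i_leak_per_cell
    return p_dynamic, p_static, p_dynamic + p_static


# Hypothetical values: 1 Mb array, 100 pA leakage per cell,
# 1 pF switched per access, array accessed in 10% of the cycles.
p_dyn, p_stat, p_tot = array_power(
    v_dd=1.0,
    f_hz=100e6,
    c_access_f=1e-12,
    access_fraction=0.1,
    n_cells=1_000_000,
    i_leak_per_cell=100e-12,
)

print(f"dynamic: {p_dyn * 1e6:.1f} uW, static: {p_stat * 1e6:.1f} uW")
print(f"static share of total: {100 * p_stat / p_tot:.0f}%")
```

With these assumed numbers the leakage of the idle cells outweighs the switching power of the few accessed cells, which is the situation the paragraph above describes.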


Figure 2: Components of Total Chip Power Consumption [11]

2.2.3 Leakage Currents in MOS Devices

As mentioned above, static power consumption of the majority of digital circuits is comprised of device leakage currents. The reduction of static power is often analogous to the minimization of leakage currents; therefore, it is important to discuss the main leakage components in modern VLSI CMOS processes. The traditional leakage component in the long channel devices of older technology nodes was the reverse-bias pn-junction currents of the source-to-body and drain-to-body diodes [9].

However, as technology has scaled and the threshold voltage (VT) of devices has been reduced, the main leakage factor has become the subthreshold (Sub-VT) leakage that occurs when the gate-to-source voltage (VGS) of a transistor is lower than the threshold voltage [11]. This leakage is caused by the weak-inversion state of the transistor, which brings enough charge carriers to the surface to create a significant current flow. Due to the exponential dependence of the subthreshold current on the overdrive (VGS-VT) of the device, the leakage resulting from the reduced threshold voltage in modern processes has become substantial. This growing factor is the main cause for the degradation of the critical Ion/Ioff ratio that enables the transistor to be used as a switch and to implement digital logic [67].

The subthreshold leakage of an MOS transistor is further enhanced by increasing the drain-to-source voltage (VDS) of a cut-off transistor. This increase in the reverse biasing of the drain diode causes the expansion of the depletion region of the pn-junction, resulting in a number of parasitic effects, the most important being the so-called Drain Induced Barrier Lowering (DIBL) [68]. This effect, which is almost negligible in long channel devices, exponentially increases the drain-to-source current (IDS) of a transistor in weak inversion or accumulation, substantially adding to the total subthreshold current. In all, taking the DIBL into account, the subthreshold current can be modeled with the following equation [65]:

VV VV GS T DS DS ( 4 ) nt ttn Isubthreshold  I0  e (1  e )  e with

W 2 I C n 1 ( 5 ) 00oxL t where ɸt=kT/q is the thermal voltage, η is the DIBL coefficient, n is the inverse subthreshold swing coefficient of the transistor, Cox is the gate oxide capacitance, µ0 is zero bias mobility, and W and L are respectively the width and length of the transistor,.

The threshold voltage, VT, is simply modeled according to Equation ( 6 ) [6]:

VVV 22     ( 6 ) T T0  b BS b where VT0 is the zero biasing threshold voltage, set during fabrication; VBS is the body-to- source voltage; γ is the body-effect coefficient; and Φb is the Fermi potential. Various parameters affect this, such as the body effect, that raises or lowers the threshold voltage when a non-zero VBS is applied [69]. The so called short channel effect or roll off occurs in short-channel devices, in which the threshold voltage shows a dependency on channel length, as the depletion regions of the diffusions lower the barriers at the channel edges [70]. The

Reverse Short Channel Effect (RSCE) causes a rise in VT at a certain short-channel length, due to technology dependent halo implants [71]. Another important factor that affects the threshold voltage is temperature. Whereas temperature degrades mobility and, therefore strong inversion current decreases with 35

temperature, VT decreases with temperature and the exponential dependence dominates the mobility degradation at subthreshold voltages. Therefore, the subthreshold leakage currents rise with the temperature.
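The body effect described by Equation ( 6 ) can be sketched numerically as follows. This is a minimal Python illustration that assumes an nMOS sign convention with a positive Φb, so that a negative VBS (reverse body bias) raises VT and a positive VBS (forward body bias) lowers it. The function name and the parameter values (VT0, γ, Φb) are hypothetical and chosen only for illustration.

```python
import math


def threshold_voltage(v_t0, v_bs, gamma, phi_b):
    """Body-effect model (assumed nMOS convention, phi_b > 0):
    V_T = V_T0 + gamma * (sqrt(2*phi_b - V_BS) - sqrt(2*phi_b)).
    """
    if 2.0 * phi_b - v_bs < 0.0:
        # Beyond this forward bias the simple square-root model is not defined.
        raise ValueError("V_BS exceeds 2*phi_b; model not valid")
    return v_t0 + gamma * (math.sqrt(2.0 * phi_b - v_bs) - math.sqrt(2.0 * phi_b))


# Hypothetical device parameters.
V_T0 = 0.4     # zero bias threshold voltage [V]
GAMMA = 0.4    # body-effect coefficient [V^0.5]
PHI_B = 0.3    # Fermi potential [V]

vt_zero_bias = threshold_voltage(V_T0, 0.0, GAMMA, PHI_B)
vt_reverse_bias = threshold_voltage(V_T0, -0.5, GAMMA, PHI_B)   # RBB raises V_T
vt_forward_bias = threshold_voltage(V_T0, 0.3, GAMMA, PHI_B)    # FBB lowers V_T

print(f"V_T at V_BS =  0.0 V: {vt_zero_bias:.3f} V")
print(f"V_T at V_BS = -0.5 V: {vt_reverse_bias:.3f} V")
print(f"V_T at V_BS = +0.3 V: {vt_forward_bias:.3f} V")
```

Because the subthreshold current depends exponentially on VT, even a shift of a few tens of millivolts obtained through body biasing translates into a substantial change in leakage.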

The subthreshold swing coefficient, $S = n \ln(10) \cdot \phi_t$, is the de facto definition of the ability of a transistor to act as a switch, because it shows the necessary voltage swing to reduce IDS by an order of magnitude. For room temperature the minimum value for S is about 60 mV/dec (n=1); however, for most modern CMOS technologies, S is in the range of 80-90 mV/dec [72]. In addition to the subthreshold and reverse-biased pn-junction leakages, a growing consideration in modern technologies is gate leakage (Igate), which is due to tunneling effects. As the supply voltage is lowered, the gate oxide thickness has been continuously reduced to achieve high on-currents, resulting in an increase in the probability of carrier tunneling through the gate oxide. This gate-to-bulk leakage is highly affected by the oxide thickness and the gate voltage (VGB), but is hardly affected by temperature, as shown in Equation ( 7 ) [9, 73]:

$$J_G = J_0 \cdot E_{ox}^2 \cdot e^{-k \cdot t_{ox}} \tag{7}$$

where JG is the gate tunneling current density, J0 is an empirical technology dependent parameter, Eox is the electrical field over the gate oxide (dependent on VG/tox), k is the wave vector inversely dependent on gate voltage, and tox is the oxide thickness. In order to achieve sufficient on-currents with thicker or lower leakage oxides, high-k materials, i.e., materials with a high dielectric coefficient (also known as ε), have replaced traditional SiO2 dielectrics in advanced nodes from 45nm and below. The use of these materials has reduced the gate leakages in such a way that subthreshold leakages still dominate; however, this is a temporary solution for a limited number of technology nodes. In addition to the subthreshold and gate leakages discussed above, under certain conditions a number of other parasitic leakages can be substantial and can affect functionality. The so-called Gate Induced Drain Leakage (GIDL) flows from the drain to the body as a result of the high gate voltages at the gate-to-drain overlap region. GIDL becomes the dominant leakage when a negative gate voltage is applied, and so limits the leakage reduction through negative biasing and Reverse Body Biasing (RBB). The primary factor of GIDL in modern CMOS technologies is known as Band-to-Band Tunneling (BTBT) [74, 75]. The final parasitic leakage current which is generally discussed is the so-called punchthrough current. This current is caused by the increased depletion regions in narrow channels with high supply voltages. As opposed to the DIBL and the short channel effect, which increase the channel current, punchthrough is a source-to-drain current through the body, which flows when the parasitic BJT formed by the source, body and drain turns on as its base area (the body) is depleted. Lowering the gate voltage of the device has no effect on this, so it is a constant current. However, it can be practically eliminated by raising the impurity concentration of the body region [76, 77]. A summary of leakage currents in MOSFET transistors is given in Figure 3, which shows a cross section diagram of the currents discussed above, as well as a schematic modeling of the resulting currents.

[Figure 3 diagram: cross section of a MOSFET with N+ source and drain regions, annotated with the gate tunneling leakage (Igate), Gate Induced Drain Leakage, subthreshold leakage (Ioff), reverse-biased diode leakage, and punchthrough current (Ipunchthrough).]

Figure 3: Primary Leakage Currents in MOSFET Devices

2.3 Low Power IC Arrays

This dissertation integrates two areas of VLSI design – Low Power and VLSI Arrays. The previous subsections introduced the power dissipation concepts and the leakage currents in

order to provide an understanding of the techniques focusing on power reduction. This section introduces VLSI arrays in order to provide an understanding of the design techniques for power reduction in these structures. The section is divided into memory arrays and image sensors, as these are the target array disciplines for this research, and many of the concepts of the various memory arrays can be combined into a unified discussion. An extension of this section can be found in the paper: "Large VLSI Arrays – Power and Architectural Perspectives" [42], as published in the International Journal of Information Technologies and Knowledge, and attached in Appendix A.

2.3.1 Low Power Memory Arrays

One of the central building blocks of any microelectronic system is its memory. The capacity of memory has grown exponentially over the past five decades in accordance with Moore’s Law, and it is expected to continue to grow [12, 78]. Typical desktop systems today are supplied with at least 4GB of DRAM. Miniature Flash drives have capacities of over 32GB and solid-state drives (SSD) have reached capacities of over 1TB and are expected to replace traditional magnetic drives as the industry standard in the near future. High end microprocessors are equipped with L2 and L3 caches that occupy well over 50% of the total die area and account for more than half of a chip’s power consumption. These are a few examples of the importance of memories in modern systems, and they make clear that power reduction in these components is a major factor in reducing overall system power consumption.

2.3.1.1 Memory Classification

When discussing memories, it is very important to begin with memory classification, because each memory type has different characteristics and power profiles [6, 20]. The first classification is the volatility of the memory, i.e., whether the memory loses its data after the power supply has been disconnected. Traditional Non-Volatile Memories include mass storage media, such as magnetic hard disks, CD-ROMs and tapes, as well as small microelectronic arrays, such as ROMs. However, the evolution of Flash technology has provided a low priced, high speed alternative, which is used in most modern embedded systems. Volatile memories include high density DRAMs, generally used for main memories; embedded SRAMs, used for unit memory blocks and high speed caches; and register files. This dissertation includes a discussion of three types of these memories: embedded SRAM; GC-eDRAM; and single-poly, logic-compatible NVM. The second classification that should be considered is the access pattern. Most memories belong to the Random Access class, which enables access to any desired bit/word at any given time. These include most SRAM, DRAM and Flash implementations. Serial memories, which enable access according to a predefined order, are used for certain functions. These include FIFO, LIFO and shift-register implementations used for buffers, stacks and video memories, among others. Content Addressable Memories (CAMs) are a type of associative memory, which are an important component of microprocessor caches. Non-random memories are generally implemented with modifications of SRAM or flip-flop topologies. This dissertation focuses on both the generic random access memories, as well as on the non-random access memories for specific target applications.

2.3.1.2 Memory Architectures and Building Blocks

A general organization of a memory block is given in Figure 4. The core of the memory consists of bitcells that store one or more bits of data. These can be written/programmed or read out individually or as part of a word consisting of a predefined number of bits. To minimize the wire length required to access the bitcells of a random access memory and thus optimize area, access time and power, the words are generally folded into a relatively square array. This comprises the memory core and requires horizontal (X) addressing and vertical (Y) addressing to access a selected bitcell/word. The busses used to access the rows are often referred to as word lines (WL) and the columns as bit lines (BL). To assert the word lines and bit lines, an addressing scheme is required, most often through a row decoder and a column decoder or multiplexer.

[Figure 4 block diagram: memory core (bitcell array) surrounded by the row decoder, precharge circuitry, column multiplexer, read column logic, write column logic, biasing circuitry, and internal timing and digital control.]

Figure 4: General Memory Architecture

The most popular addressing scheme, which is used for standard SRAM, DRAM, ROM and Flash arrays, asserts an entire row by selecting a word line through the row decoder and directs the selected bitline to the output through the column multiplexer. Often this is done dynamically by precharging all BLs to a predetermined state and using sense amplifiers to detect the selected cell’s level at a reduced swing. This provides an advantage in both speed and power. Writing is achieved in a similar fashion, where the BLs are pre-charged or pre-discharged prior to enabling a connection to the cells’ internal storage nodes. These column-wise read and write accesses are achieved with precharge circuits and write drivers. The partitioning of a memory is another important factor in its design. The size of the memory is generally determined by the system’s needs; however, performance, power and area requirements usually set limits on the array block sizes. Various techniques are used to optimize the size of each block for the system’s specification and block partitioning is implemented. A third level of decoding, often referred to as Z-addressing, is added to access the selected partition.
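The three-level addressing described above can be sketched as a simple address split. The following Python example is a hedged illustration only: the function `split_address`, the ordering of the fields (partition in the most significant position, column in the least significant) and the array geometry are assumptions made for this example, and real memories may order or encode the fields differently.

```python
def split_address(word_addr, n_partitions, n_rows, words_per_row):
    """Split a flat word address into (Z partition, X row, Y column) indices.

    Assumed field order, most significant first: partition, row, column.
    """
    total_words = n_partitions * n_rows * words_per_row
    if not 0 <= word_addr < total_words:
        raise ValueError("address out of range")
    z_partition, offset = divmod(word_addr, n_rows * words_per_row)
    x_row, y_column = divmod(offset, words_per_row)
    return z_partition, x_row, y_column


# Hypothetical geometry: 4 partitions, each with 128 rows of 8 words.
N_PARTITIONS, N_ROWS, WORDS_PER_ROW = 4, 128, 8

z, x, y = split_address(1035, N_PARTITIONS, N_ROWS, WORDS_PER_ROW)
print(f"partition (Z) = {z}, word line (X) = {x}, column mux select (Y) = {y}")
```

In such a scheme, the X index would drive the row decoder to assert a single word line, the Y index would select one word through the column multiplexer, and the Z index would enable only the addressed partition, so that the remaining blocks stay idle.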


2.3.1.3 Power Consumption in Memories

The power consumption in a memory chip can be attributed to three major sources [6]: the memory core, the decoders and the periphery. The power dissipation of a modern CMOS memory of m columns and n rows can be approximated by:

PVIIIDD array  decode  periphery   ( 8 ) VDD mi act  mn 1 i hld    nmCVf  DEint   CVfI PT int  DCP  where f is the operating frequency, VDD is the general supply voltage, Vint is the internal supply voltage, iact is the effective current of the selected cells, ihld is the data retention current of inactive cells, CDE is the output node capacitance of each decoder, CPT is the total capacitance of the digital logic and periphery circuits, and IDCP is the static current of the periphery. The dynamic power of a typical memory block is mainly consumed in the following areas: address decoding, bitline charging/discharging, and readout sensing. During address decoding, power is consumed both by the switching of the decoders themselves, as well as by charging and discharging the selected word lines, which can have high capacitances. During both read and write operations, the BLs are precharged and subsequently discharged. For SRAMs this is especially power consuming during writes, when the BL is fully discharged, or when a full discharge read scheme is chosen. For DRAMs, reads are destructive, requiring a subsequent write and always employing a full BL discharge. For Flash memories, the high voltages of program and erase cycles consume an especially large amount of power. For DRAMs, the periodic refresh cycles mean that the activity factor is relatively higher than other memories. As for the integration of sense amplifiers into a memory block, they typically depend on bias currents for operation, consuming constant power when they are activated. The static power of most memories is primarily consumed through leakage currents inside the bitcells themselves during standby or hold periods: i.e., when the particular cell, or the whole array, is not asserted. For SRAMs, this includes subthreshold and gate leakages in both the inner cross-coupled latch structure, as well as to/from the BLs through the access

41 transistors on unselected rows. The leakage currents of DRAMs are that which necessitates the power hungry refresh cycles. The non-volatility of Flash circuits means that leakage currents can actually be eliminated by disconnecting the supply, but as long as the array is powered, leakage currents persist. Another major contributor to static power is from the precharge circuitry when a constant charging scheme is used: i.e., a high-resistance supply or diode-connected transistor is placed on the BLs to replenish lost precharge voltage. Other contributors to the static power are the leakage currents in the decoders and in other blocks. An in-depth analysis of the power dissipation in memories components can be found in [20].
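To make the structure of Equation (8) concrete, the sketch below evaluates its three current components separately. It assumes the reading of Equation (8) in which the array current is m·iact + m(n−1)·ihld, the decoder current is (n+m)·CDE·Vint·f and the periphery current is CPT·Vint·f + IDCP. The function and argument names are hypothetical, and any numerical values supplied to it are the user's own rather than data from this work.

```python
def memory_power(m, n, f, v_dd, v_int, i_act, i_hld, c_de, c_pt, i_dcp):
    """Evaluate Equation (8) and return the power of each component in watts.

    m, n         : number of columns and rows
    f            : operating frequency [Hz]
    v_dd, v_int  : general and internal supply voltages [V]
    i_act, i_hld : active-cell and data-retention currents [A]
    c_de, c_pt   : decoder output and total periphery capacitances [F]
    i_dcp        : static periphery current [A]
    """
    i_array = m * i_act + m * (n - 1) * i_hld   # selected row plus held cells
    i_decode = (n + m) * c_de * v_int * f       # row and column decoder outputs
    i_periphery = c_pt * v_int * f + i_dcp      # switching plus static periphery
    return {
        "array": v_dd * i_array,
        "decode": v_dd * i_decode,
        "periphery": v_dd * i_periphery,
        "total": v_dd * (i_array + i_decode + i_periphery),
    }
```

Separating the terms in this way shows directly which of the three sources dominates for a given array size and operating frequency.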

2.3.1.4 Power Reduction in Memories

Several standard methods have been developed over the years to reduce the power consumption of memories. This section gives a very brief overview of common methods at all levels of design abstraction, starting from the device level and proceeding through to the system level.

At the device/technology level, technology scaling has brought an inherent power reduction due to decreased capacitances and supply voltages, but has raised issues, such as leakage power and degraded noise margins, that limit the effectiveness of these inherent savings. The first and foremost device technique for leakage reduction is the integration of multi-threshold devices (MTCMOS) into the design kit, thus allowing the utilization of high VT (HVT), low leakage devices where possible [79, 80]. This also paves the way for the implementation of techniques for power reduction at a minimal loss of density and performance, such as adding low VT (LVT) devices in series with the supplies to add leakage resistance and enhance the stack effect [81, 82]. In addition, specialized low power processes and Silicon-on-Insulator (SOI) processes are used for power reduction. In the case of DRAMs, deep trenched capacitors help save power by decreasing the refresh frequency and maintaining SNR while decreasing the output swing. Flash memories are increasingly presenting innovative device improvements aimed mainly at increasing the density, but these methods, such as multi-level cells (MLC), also enable power reduction as a byproduct by reducing the overall switched capacitance [83].


At the circuit level, many alternative bitcell implementations have been proposed for power reduction in all memory types; however, the loss of density with every additional transistor often prevents these implementations from penetrating industry. Recently, the rise in popularity of subthreshold and near-threshold design has created an increased interest in these designs, because the traditional cells lose functionality at low voltages [40, 41]. This issue is the motivation for the research described in Chapter 3. An additional interesting alternative that has recently become popular is the GC-eDRAM, which can be utilized for low-power applications as well as for high-speed caches. This design field is further discussed in Chapter 4.

Modifications at the circuit level are obviously a necessity for many system/algorithm/architectural level power reduction techniques. These often include bitcell modification, and generally include various changes in peripheral circuitry. Independent peripheral circuit optimization for low power operation is also a popular topic of research, and has been utilized in several of the works described in this dissertation, such as the low-power DCVSL-like GIDL-free voltage drivers presented in Section 5.1 [47, 64].

At the system/architectural level, the basic premise is that if only a small selected percentage of the entire memory is activated, less power will be wasted. This is carried out first and foremost through partitioning. Banked organization divides the memory both horizontally and vertically into sub-arrays. An external decoder raises the chip select of the selected bank, reducing the dynamic power consumption, as smaller decoders are needed and less wordline and bitline capacitance is charged/discharged. The Divided Word Line approach divides the array horizontally, propagating the decoder output on a global wordline. This subsequently raises the local wordline of a partition of columns, thus reducing the overall capacitance charged and requiring smaller wordline drivers [84]. Partitioning the columns using the Divided Bit-Line scheme, with partial multiplexing inside the array, reduces the bitline capacitance and, in certain sensing schemes, reduces the power consumption [39]. All of these solutions come at the expense of additional area overhead, but a good tradeoff can achieve a worthwhile reduction of power consumption in addition to an improvement in performance.


Turning off the power supply of non-operational blocks is another basic method for power reduction in memories. This has to be supported by software that decides whether a block will be accessed soon or whether a power-up delay can be tolerated. In the latter case, the power is gated through a low leakage transistor to minimize leakage power. For volatile memories, the stored values are lost in this case, so an alternative dynamic or adaptive voltage scaling [85] scheme can be used instead, which is aimed at setting the hold supply voltage as close to the Data Retention Voltage (DRV) as possible [36, 37, 58, 67]. One popular circuit for sensing and applying the DRV is the canary feedback scheme proposed by Calhoun et al. [38, 86]. Additional system level methods for reducing static leakage power are the application of RBB [18, 35] and/or Negative Word Line (NWL) voltages [17]. An ongoing research project that was initiated in the framework of this research is the on-chip Built-in Self-Test (BIST) with Built-in Self-Repair (BISR) for minimum DRV operation. The preliminary publication describing this project is attached in Appendix F [58].

Probably the most efficient way to reduce power in memories is to lower the supply voltage, thereby reducing both the dynamic and static power consumption. However, this comes at the expense of increased access times (reduced performance) and decreased noise margins. Various methods enable reducing the power supply to parts of a memory while maintaining functionality. Boosting the word line, for example, can help increase the read/write current and reduce the VT drop in DRAMs and SRAMs, thus improving the functionality even though the rest of the array is biased lower [87, 88]. Such power reductions require an in-depth analysis of memory stability and functionality, using techniques such as static noise margin metrics and dynamic stability metrics. These methods have been used extensively in my work on low-voltage SRAMs [46, 49, 53, 55, 58, 60, 63] and have been the basis of several projects devoted to this type of analysis, for example, the work published in [56].

Using advanced timing and sensing schemes is another standard method to achieve a substantial dynamic power reduction. Using pulsed word lines and/or reduced bitline voltage swings results in less discharge during read cycles, but is accompanied by complex design considerations and a higher sensitivity to process variations. Timing the activation of sense amplifiers limits biasing currents to be present only during the exact times that the sensing is carried out.

At the algorithmic level, different methods can manipulate the power consumption features of a given memory to optimize the system’s power use. For example, the realization that programming a Flash cell is much slower and more power consuming than reading it has led to the common implementation of read-before-write schemes in Flash technology. Furthermore, as erase is done on a complete block and is even more time consuming and power hungry, advanced data dependent distribution of the stored values helps reduce the frequency of erase operations. This also improves the life cycle of Flash memories, especially for applications such as solid state drives.

2.3.2 Advanced Low Power CMOS Image Sensors

Traditionally, digital image sensors were fabricated in Charge Coupled Device (CCD) technology, but the integration of image sensors into more and more products has made the Active Pixel Sensor (APS) an attractive solution [43]. This image sensor architecture is implemented in standard CMOS technology processes, and provides significant advantages over CCD imagers in terms of power consumption, low voltage operation, and monolithic integration [31]. With the rising popularity of portable, battery operated devices that require high-density, ultra-low power image sensors, the CMOS alternative has become very widespread. In addition, CMOS technology allows for the fabrication of so-called “smart” image sensors that integrate analog and/or digital signal processing onto the same substrate as the imager and its digital interface. Low-power, smart image sensors are very useful in a variety of applications, such as space, automotive, medical, security, industrial and others [26].

2.3.2.1 Imager Architecture and Building Blocks

CMOS image sensors generally operate in one of two modes: the rolling shutter mode or the global shutter/snapshot mode. When the rolling shutter mode is used, each row of pixels is initiated for image capture separately in a serial fashion. This creates a slight delay between adjacent rows, which results in image distortion in cases of relative motion between the imager and the scene. With the global shutter technique, the image is captured simultaneously by all the pixels, after which the exposure is stopped, and the data is stored in-pixel while the image is read out.

The operation of both techniques can be divided into three stages: Reset, Phototransduction and Readout. During the Reset stage, an initial voltage is set on the photodiode capacitance, which constitutes most of the pixel area. Subsequently, the pixel enters the Phototransduction stage, during which the incident illumination causes the capacitance to discharge throughout a constant integration time. Readout commences at the end of the integration time, when the final value of the pixel is read out and converted to a digital value.

Figure 5 shows a component block diagram of a generic smart CMOS APS based image sensor. The core of the image sensor is a pixel array, in which each pixel generally consists of a photodiode, in-pixel amplification, a selection scheme and a reset scheme. A full description of the operation of this pixel is given by Yadid-Pecht et al. [44]. Some smart imagers employ more complex pixels, enabling them to perform analog image processing at the pixel level, such as A/D conversion.

[Figure 5 block diagram: Column-wise Biasing Circuits; Row Selection; Pixel Array; Column-wise Processing (S/H and CDS); A/D Conversion; Digital Timing; Digital Control; Biasing Circuitry]

Figure 5: Generic Smart Image Sensor Component Block Diagram
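The Reset, Phototransduction and Readout stages described above can be sketched with an idealized pixel model. The linear discharge V = Vreset − Iph·Tint/Cpd, the clamping at ground and the ideal uniform quantizer are simplifying assumptions made for illustration only. All names and parameter values are hypothetical and do not describe a pixel from this work.

```python
def pixel_voltage(v_reset, i_photo, c_pd, t_int):
    """Photodiode voltage at the end of integration (idealized linear discharge).

    v_reset : voltage set on the photodiode during the Reset stage [V]
    i_photo : photocurrent caused by the incident illumination [A]
    c_pd    : photodiode capacitance [F]
    t_int   : integration time [s]
    """
    v = v_reset - i_photo * t_int / c_pd
    return max(v, 0.0)  # a saturated pixel cannot discharge below ground


def quantize(v, v_ref, bits):
    """Ideal ADC model for the Readout stage: map 0..v_ref onto 0..2**bits - 1."""
    full_scale = 2 ** bits - 1
    code = int(v / v_ref * full_scale + 0.5)
    return min(max(code, 0), full_scale)
```

In this model a brighter pixel (larger photocurrent) ends the integration at a lower voltage, and a sufficiently bright pixel saturates at ground before the integration time has elapsed.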


Access to the pixels is carried out through the row selection block. This is usually made up of a shift register, as serial access is commonly employed, although in certain applications a digital decoder is preferred. An entire row is generally accessed simultaneously for both reset and readout operations, except in applications where random access is required, such as tracking window systems [51].

Several blocks are required at every column for the parallel operation of an entire pixel row. These include Sample and Hold (S/H) circuits, Correlated Double Sampling (CDS) circuits and Analog to Digital Converters (ADC). The S/H circuitry generally measures the reset level of the pixels to enable the CDS to remove Fixed Pattern Noise (FPN). Column-wise ADCs are only one option; the others are an in-pixel ADC or a single ADC per imager. The final scheme is chosen according to the tradeoffs between area, power, speed and precision.

Additional blocks that are required in the periphery of the imager include the general Biasing Circuitry and Band-gap References for creating biasing currents for the in-pixel signal amplifiers, which are usually implemented through a Source-Follower scheme; and the ADCs, Digital Timing, and Digital Control blocks for producing the proper sequencing of the addresses, ADC timing, etc.

2.3.2.2 Power Consumption of CMOS Imagers

The sensitivity of a digital image sensor is usually proportional to the area of the photodiode, and the resolution is set by the number of pixels. This results in a relatively large area covered by the image sensor, as compared to other on-chip circuits, and, accordingly, a large percentage of the overall power consumption. The contribution of different image sensor components to the overall power dissipation may vary significantly from system to system. For example, pixel array power dissipation can vary from a number of µW for a small array employing a 3-transistor APS architecture to hundreds of mW for large format “smart” imagers employing in-pixel analog or digital processing. The power dissipation of the pixel array of a generic “smart” image sensor can be given by Equation ( 9 ) [26]:

$$P_{Array} = F_R \cdot N \cdot M \cdot \left(E_{reset} + E_{read\_out} + E_{analog} + E_{digital}\right) + N \cdot M \cdot P_{leakage} \qquad (9)$$


where FR is the imager’s frame rate, N and M are the number of rows and columns, respectively, Ereset is the energy required for pixel reset, Eread_out is the energy dissipated during signal readout during one frame, Eanalog and Edigital are the energy components dissipated by in-pixel analog and/or digital processing during one frame, and Pleakage is the in-pixel leakage power.

The dynamic power in the above equation is proportional to the frame rate and is composed of the energy required to refill the photodiode capacitance during reset, the power dissipated through the column-wise biasing currents during readout, and the additional energy consumed by (optional) in-pixel functionality. The static power is due to leakage through the reset and row selection switches during the integration and standby periods. These leakages also degrade the performance and the precision of the imager.

The row selection block can be another major source of power dissipation, depending on the size of the array and the method of operation. In both global and rolling shutter modes, the row reset and row selection capacitances are periodically charged, proportionally to the frame rate. On the other hand, in window tracking applications, the power of the row- and column-selection blocks can be dominated by leakage power, because the majority of the rows/columns may not be asserted for long periods.

The other primary source of power dissipation is the analog circuitry, including the ADCs, S/H, CDS and biasing circuitry. Optimally, these are timed to consume power only during their precise periods of operation, but they generally have a high power profile. The analog peripheral blocks present a constant tradeoff between speed, noise immunity, and precision versus power consumption and area, and the choice of these blocks needs to be made cautiously for low power systems.

Additional power is dissipated in the digital timing and control blocks; however, the complexity and frequency of these tend to be lower than those of standard digital circuits, so most common power reduction techniques can be implemented on these blocks. An in-depth description of all the contributions to power dissipation in a smart image sensor is given by Fish et al. [26].
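The split between the frame-rate-dependent and the static terms of Equation (9) can be illustrated with a short sketch. It assumes that the energies are per-pixel, per-frame quantities and that Pleakage is a per-pixel power, as the N·M factors suggest. The function name and arguments are hypothetical.

```python
def pixel_array_power(frame_rate, n_rows, m_cols, e_reset, e_read_out,
                      e_analog=0.0, e_digital=0.0, p_leakage=0.0):
    """Evaluate Equation (9); returns (dynamic, static) power in watts.

    frame_rate : frames per second
    e_*        : energy per pixel per frame [J] (assumed interpretation)
    p_leakage  : leakage power per pixel [W]
    """
    pixels = n_rows * m_cols
    dynamic = frame_rate * pixels * (e_reset + e_read_out + e_analog + e_digital)
    static = pixels * p_leakage
    return dynamic, static
```

Because only the first term scales with the frame rate, lowering the frame rate shifts the balance toward the leakage term, in line with the window tracking case discussed above.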


2.3.2.3 Power Reduction in CMOS Imagers

Image sensors provide power reduction opportunities at all the design levels, starting with the technology and device levels, through the circuit level, all the way to the architecture and algorithm level.

Standard power reduction techniques, such as supply voltage reduction and technology scaling, are not always applicable to CMOS image sensors, because they are frequently accompanied by unacceptable tradeoffs. Supply voltage reduction degrades both the precision and the noise immunity of the image sensor, while technology scaling generally brings side effects, such as increased leakage current and dark current, as well as reduced photoresponsivity. However, at the technology level, the processes can be modified for low power image sensor fabrication, albeit at an increased cost. An example of such a process is the Silicon-on-Sapphire (SOS) process, which provides a very low power figure and enables backside illumination [45].

The device and circuit levels provide several opportunities for limiting power dissipation, depending on the options and layers provided by the chosen technology. The presence of separate wells for both nMOS and pMOS transistors enables the application of body biasing on inactive rows for leakage reduction. This technique loses its effect with scaling, because the effect on a device’s threshold voltage is reduced, but image sensors are generally fabricated in technologies up to 90nm, where it is still efficient. This effect has been exploited in the implementation of an AB2C scheme for low-power operation of a smart image sensor, as described in Section 6.1 [48, 52]. Additional devices, such as HVT transistors and thick oxide transistors, can also be used for leakage reduction on slow busses. Another technique commonly used for leakage reduction is a serial connection of “off” transistors for “stack effect” utilization [82].

Advanced image sensors provide many interesting opportunities for power reduction at the architectural and algorithm levels. Depending on the functionality of the sensor, these systems can be equipped with designated blocks for eliminating unnecessary power consumption. An example of this is the tracking sensor I proposed [51], which used row and column shift registers for window definition and an analog winner-take-all circuit for motion tracking. In this system, the pixels outside the window of interest were deactivated and ADCs were used only for initial detection. In addition, the switching activity of the shift registers was very low, further reducing the system power consumption. This paper is provided in Appendix E.


Chapter 3 Low Voltage SRAM Design and Stability

3.1 Introduction to Low-Voltage SRAM Design

An extended version of the following overview was presented at the IEEE Israel Conference in 2010 [53].

3.1.1 Energy Efficient Circuit Design

It has long been established that the Minimum Energy Point (MEP) occurs in the subthreshold region for most logic families [89, 90], as shown on the Pareto-optimal delay-energy curve of Figure 6. Over the past few years, several groups have shown that it is possible to operate complete systems very close to the MEP [91, 92]; however, this comes at the price of about three orders of magnitude loss of performance. Because the Pareto curve is relatively flat around the MEP, operating at a slightly lower delay without a significant increase in energy has become an attractive alternative. This is achieved by raising the supply voltage close to the device’s threshold voltage: an operating region known as the near-threshold (Near-VT) regime.

The development of digital logic for operation in the sub/near-threshold regimes has raised the need for embedded memories, primarily SRAMs, operating at the same supply voltages. This presents various difficulties, such as the loss of noise margin and sensitivity to process variations, as will be shown later. Optionally, the memory blocks could continue to be operated at a nominal supply voltage; however, SRAMs are one of the main components of today’s digital systems, and SRAM leakage power often dominates the overall leakage power of a chip. In past research, methods have been developed for reducing the standby voltage of SRAMs, as well as accompanying algorithms for entering sleep modes, although these were mainly developed for power reduction in the SRAM block considering standard VDD digital logic.

More importantly, the Pareto-optimal curve of Figure 6 [89] assumes that the circuit can be operated at exactly the optimal speed and then shut off after completion of the operation to eliminate the leakage power. SRAMs generally have to store data for some length of time unrelated to their own access period and cannot be shut off during this time. In this case, minimizing leakage during the active periods is essential, and dynamic voltage scaling during system standby periods is insufficient [34].

[Figure 6 plot: Energy/Operation versus Time/Operation; labels: Minimum Delay Point (MDP), Minimum Delay, Traditional Operation Region, Ultra-low Energy Region, Minimum Energy Point (MEP), Minimum Energy, Sub-optimal and Infeasible regions; annotations ~1,000X and ~10X]

Figure 6: Pareto-Optimal Energy-Delay curve showing the Minimum Energy Point (MEP) and the Minimum Delay Point (MDP). (reproduced from [89])

The above arguments, together with the development of complete systems biased in the sub/near threshold region, have led to the design of several SRAMs designated for low-voltage and ultra-low power operation.
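The existence of a minimum energy point can be illustrated with a simple first-order model, sketched below. This is not the model used in [89]; it is a common textbook-style approximation that is only meaningful below the threshold voltage. It assumes that the dynamic energy is α·C·VDD², that the delay is proportional to C·VDD/Ion, and that Ion/Ioff = exp(VDD/(n·vT)), so that the leakage energy per operation becomes depth·C·VDD²·exp(−VDD/(n·vT)). All parameter values (activity factor, logic depth, slope factor) are hypothetical.

```python
import math


def energy_per_operation(v_dd, c_eff=1.0, alpha=0.1, depth=30.0,
                         n_slope=1.5, v_thermal=0.026):
    """First-order subthreshold energy per operation (arbitrary units)."""
    dynamic = alpha * c_eff * v_dd ** 2
    # Leakage energy = I_off * V_DD * T_op, with T_op growing exponentially
    # as V_DD is lowered below threshold.
    leakage = depth * c_eff * v_dd ** 2 * math.exp(-v_dd / (n_slope * v_thermal))
    return dynamic + leakage


def find_mep(v_min=0.15, v_max=0.5, steps=351):
    """Sweep the supply voltage and return (V_DD, energy) at the minimum."""
    best_v, best_e = v_min, energy_per_operation(v_min)
    for k in range(steps):
        v = v_min + (v_max - v_min) * k / (steps - 1)
        e = energy_per_operation(v)
        if e < best_e:
            best_v, best_e = v, e
    return best_v, best_e
```

Below the minimum, the exponentially growing delay lets leakage energy dominate; above it, the quadratic dynamic term dominates, which produces the characteristic shape around the MEP in Figure 6.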

3.1.2 SRAM Noise Margins

The most common SRAM cell found in digital systems is the standard 6T cell shown in Figure 7a. The cell comprises a cross coupled inverter latch (M1 & M3, M4 & M6) plus a pair of access transistors that enable differential read and write operations. Due to the positive feedback loop of the cross-coupled inverters, this structure is very robust and non-ratioed during hold cycles, when the access transistors are off. In this case, the strong-inversion on-currents under standard VDD operation are much larger than the leakage currents through the cutoff devices. This provides a bi-stable circuit that is generally described with the “butterfly curve” of Figure 7b, which plots the Voltage Transfer Characteristics (VTC) of the two inverters on a single graph. The robustness of this cell is commonly quantified by the Static Noise Margin (SNM), generally calculated as the side of the largest square that fits inside the smaller of the two lobes of the butterfly curve (Figure 7b). Any DC noise larger than the SNM will cause the cell to flip.
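The graphical SNM definition above can be evaluated numerically. The sketch below uses the observation that, for a square with sides parallel to the axes, the two opposite corners touching the curves lie on a 45-degree line Q − QB = u, and that the side of the square equals half the difference in Q + QB between the two curves along that line. The tanh-shaped inverter VTC is a hypothetical stand-in for a simulated characteristic, and the method assumes monotonically decreasing VTCs that form a bi-stable butterfly curve.

```python
import math


def inverter_vtc(v_in, v_dd=1.0, gain=5.0):
    """Hypothetical smooth inverter VTC with the given mid-point gain."""
    v_m = v_dd / 2.0
    return 0.5 * v_dd * (1.0 - math.tanh(2.0 * gain * (v_in - v_m) / v_dd))


def _solve_increasing(func, target, lo, hi, iters=60):
    """Bisection for func(x) == target, with func increasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def static_noise_margin(vtc1, vtc2, v_dd=1.0, samples=2001):
    """Side of the largest square that fits in the smaller butterfly lobe.

    Curve 1 is QB = vtc1(Q); curve 2 is Q = vtc2(QB).
    """
    lobe_a = lobe_b = 0.0
    for k in range(samples):
        u = -v_dd + 2.0 * v_dd * k / (samples - 1)   # u = Q - QB
        # Point of each curve on the line Q - QB = u.
        q1 = _solve_increasing(lambda q: q - vtc1(q), u, 0.0, v_dd)
        qb2 = _solve_increasing(lambda qb: qb - vtc2(qb), -u, 0.0, v_dd)
        w1 = q1 + vtc1(q1)      # Q + QB on curve 1
        w2 = vtc2(qb2) + qb2    # Q + QB on curve 2
        side = (w1 - w2) / 2.0  # signed side of the square between the curves
        lobe_a = max(lobe_a, side)
        lobe_b = max(lobe_b, -side)
    return min(lobe_a, lobe_b)
```

For two identical inverters the lobes are symmetric; passing two different VTCs, for example one degraded by an access transistor during a read, yields the smaller lobe as the margin.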

[Figure 7 panels: (a) 6T cell schematic with transistors M1-M6, storage nodes Q and QB, bitlines BL and BLB, wordline WL and supply VDD; (b) butterfly curve, QB [V] versus Q [V], from 0 to VDD]

Figure 7: (a) Schematics of a standard 6T SRAM cell. (b) Butterfly curve of a standard 6T SRAM cell with square showing graphically calculated SNM.

The SNM of a standard 6T cell under nominal supply voltages is affected by the ratio between the pull up and pull down networks, but is generally sufficient regardless of the device sizing. Nevertheless, during a read operation, the cross-coupled structure is disrupted, modifying the butterfly curve and resulting in a degradation of the SNM, as shown in Figure 8a. To ensure a non-destructive read, the voltage of the node holding ‘0’ should not cross the trip point of the opposite inverter, which would result in the immediate flipping of the cell. This is achieved by keeping the pull down devices (M1 and M4) stronger than the access transistors (M2 and M5), thereby maintaining a positive Read SNM (RSNM).

[Figure 8 panels: (a) butterfly curve during read access with RSNM marked; (b) butterfly curve during write access with WSNM marked; axes Q and QB]

Figure 8: (a) Butterfly curve of a standard 6T SRAM cell during read access. (b) Butterfly curve of a standard 6T SRAM cell during write access.

For write operations, the access transistors are likewise asserted; however, the goal now is to overcome the positive feedback loop and create a mono-stable circuit, i.e., to achieve a negative SNM. A successful write is shown in Figure 8b, and the Write SNM (WSNM) is defined by the noise that would cause the two VTCs to intersect and retain bi-stability. To ensure proper cell writeability, the node storing a ‘1’ must this time be pulled down below the opposite inverter’s trip point. This is achieved by keeping the access transistors (M2 and M5) much stronger than the pull up transistors (M3 and M6).

As can be understood from the above explanations, the pull up devices are required to be as weak as possible. This is inherent in most CMOS technologies, where the mobility of a pMOS transistor is 2 to 4 times lower than that of an nMOS transistor. Therefore, a minimum sized pMOS is usually used in the pull up network. nMOS sizes are then set according to minimal dimension and SNM requirements.

3.1.3 6T SRAMs in Sub/Near Threshold

The previous section described the sizing of an SRAM’s components under a standard, or strong-inversion, supply voltage. In this case, the ratioed operation of the SRAM was optimized according to two factors: device mobility differences and device sizing. These factors both have a linear effect on device currents, which is sufficient to ensure proper operation under strong inversion conditions. However, once the gate voltage of a conducting device falls beneath the threshold voltage, the current becomes exponentially dependent on the overdrive voltage (VGS-VT). This is shown by Equation ( 4 ), which describes subthreshold conduction [65]. A similar conclusion can be reached for near-threshold conduction, according to the EKV model as described in [89].

The consequence of this exponential dependence is that the sizing described in Section 3.1.2 no longer holds. On the one hand, the threshold voltages of pMOS and nMOS transistors are unequal in most technologies, resulting in a different overdrive voltage for nMOS and pMOS transistors. This factor outweighs the difference in mobility between the two device types and can make a conducting pMOS (with VGS=VDD) much stronger than an equivalent nMOS. On the other hand, the sensitivity to process variations increases substantially, as small changes in VT cause large changes in currents. This becomes even more complicated across a spectrum of temperatures, because the thermal voltage has a strong influence on the device current and disrupts the nMOS/pMOS ratio. An additional consideration is the short channel effects, which can have a much higher degree of influence at low supply voltages.

The direct result of these new conditions is that the sizing considerations for sub/near-threshold circuits are different from those of their strong inversion counterparts. Nevertheless, sizing alone will not solve the problems. In considering a write operation, it must be noted that the requirement of a much stronger nMOS access transistor than the adjacent pull up pMOS is almost impossible to meet under process variations at low supply voltages. Calhoun and Chandrakasan [33] showed that the minimum voltage that enables writeability in a certain 65nm technology was 600mV. This was without allowing for any noise margin and without taking local mismatch into consideration.

Read SNM presents an additional challenge at low voltages. Due to the changes in pull up vs. pull down currents under low voltages and process variations, the hold butterfly curve is already degraded. In addition to this, the RSNM is considerably lower than the hold SNM due to the disruption of the cross-coupled latch via the access transistor on the side storing a ‘0’. Together, this results in a limitation on supply voltage due to the RSNM. A 6T cell without an RSNM limitation can operate at approximately half the supply voltage of a standard 6T cell [33].

The read problem of a standard 6T bitcell under low voltages also limits the number of cells on a bitline. Due to the degraded Ion/Ioff ratio at low supply voltages, a worst case read cycle should be considered, in which all of the other bitcells in the same column store the opposite value to that of the accessed cell. While the accessed cell is trying to discharge the bitline of the ‘0’ side, the leakage currents of all the other cells are trying to charge this bitline and discharge the opposite bitline, degrading the voltage difference between the two bitlines. This imposes a strict limit on the number of bitcells that can be connected to each bitline: approximately 16 cells per bitline, according to the approximations given in [10].
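The exponential sensitivity discussed in this section can be quantified with a short sketch. It assumes the usual first-order subthreshold form, in which the current is proportional to exp((VGS − VT)/(n·vT)); the slope factor n = 1.5 and the thermal voltage of 26mV are hypothetical example values, and the notation may differ from that of Equation (4).

```python
import math


def subthreshold_current_ratio(delta_vt, n_slope=1.5, v_thermal=0.026):
    """Factor by which the subthreshold current changes for a V_T shift.

    delta_vt : threshold voltage shift [V]; a positive shift weakens the device.
    """
    return math.exp(-delta_vt / (n_slope * v_thermal))


def width_factor_to_compensate(delta_vt, n_slope=1.5, v_thermal=0.026):
    """Width scaling needed to restore the current, since current is linear in W."""
    return 1.0 / subthreshold_current_ratio(delta_vt, n_slope, v_thermal)
```

Under these assumed values, a 50mV increase in VT reduces the current to roughly 28% of its nominal value, so the device would have to be widened by a factor of about 3.6 to compensate. This illustrates why linear sizing alone cannot restore the required ratios.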

3.1.4 Previously Presented Low Voltage SRAMs

Early discussions of low voltage SRAMs started to appear about a decade ago [93], but the first discussions of subthreshold memories were presented in 2004 [40, 92, 94, 95]. The group at Purdue showed that the operation of a standard 6T SRAM under process variations is problematic. The group at MIT used a memory in their FFT that resembles a register file (a latch with a tristate driver for writing and static muxed outputs) and has an equivalent bitcell size of 18T. A number of options that have been proposed since then are reviewed below.

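[Figure 9 schematic: 8T bitcell with transistors M1-M8, storage nodes Q and QB, write wordline WWL, write bitlines BL and BLB, read wordline RWL and read bitline RBL]

Figure 9: Standard 8T bitcell, as employed in [87] and [34]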

3.1.4.1 8T Subthreshold SRAM Utilizing RSCE [87]

In 2007, Kim’s group from Minnesota introduced a standard 8T SRAM cell that functions at voltages as low as 200mV by utilizing the Reverse Short Channel Effect (RSCE). This effect arises in scaled processes from the halo doping that is used to counter the short channel effect (VT roll-off). As a result, in most modern processes, increasing the length of a transistor actually lowers VT until a minimum point is reached. By using access transistors with a channel length at this minimum VT point, the write current is increased, which results in a write margin equivalent to that achieved with a boosted wordline. In addition, the standard 8T topology, shown in Figure 9, decouples the cell node from the bitline by using additional read path transistors. As a result, the RSNM becomes equal to the hold SNM.

The proposed cell, implemented in 0.13μm CMOS, achieved improved write and read margins, which allowed functionality at 200mV; a 52% speedup in read bitline discharge time; improved immunity to process variations and an improved Ion/Ioff ratio, due to the long channel lengths; and required no additional periphery. The area overhead was 20%.

3.1.4.2 10T Cell with Gated Write Supply [33]

A 256kb SRAM test chip in 65nm was fabricated by Calhoun and Chandrakasan in 2007, comprising a 10T bitcell able to operate below 400mV. This 10T bitcell, shown in Figure 10, consists of a cross coupled inverter core with a virtual supply (M1-M6) and a decoupled read buffer (M7-M10). During hold cycles, the virtual supply is connected through a pMOS gating transistor, creating a standard cross coupled latch as described for the standard 6T cell. The readout is non-penetrating, which provides a sufficient read SNM, similar to the 8T cell. The additional pMOS transistor (M9) helps keep the bitline at VDD throughout the long access time, while the additional nMOS (M10) reduces leakage through the stacking effect. This enables connecting up to 256 bitcells to one bitline. A sufficient write margin at low voltages is achieved by gating the supply voltage to the cell during a write. This weakens the charge current of the pMOS in contention with the discharged bitline, enabling writes at a supply voltage of 300mV.

Figure 10: Calhoun and Chandrakasan’s 10T bitcell proposed in [33]

The proposed bitcell shows a leakage power reduction of over 60X at 0.3V, as compared to a 6T cell at 1.2V. The test chip operated at 475kHz at 400mV, consuming 3.28μW.

3.1.4.3 Standard 8T Cell with Peripheral Assistance [34]

In 2008, Chandrakasan’s group proposed using a standard 8T cell for increased density, achieving low voltage operation through peripheral modifications instead. Using a cell similar to the one shown in Figure 9, a 30% reduction in area as compared to their 10T cell (Figure 10) was achieved. By applying a “zero leakage” readout scheme and differential sense amplifier redundancy during reads, and “internal cell feedback control” during writes, their SRAM operated successfully at 350mV with a 20X leakage reduction, as compared to a 1V supply.

The “zero leakage” readout scheme raises the source of the readout transistor (M7) to VDD when the row is deselected, minimizing its DIBL leakage and almost totally eliminating the bitline leakage. This enables attaching 128 bitcells to a single bitline and improves the read performance. The reads are further improved by using a differential sensing scheme that eliminates global variations. To improve the performance under local variations, a number of sense amps are placed on each column and tested at system initialization, and one that functions correctly is selected. To achieve a sufficient write margin, the row’s supply voltage is collapsed during writes and a small wordline boost (50mV) is applied. The virtual supply node is actually driven to ground, but the contention by the cell itself limits the drop and retains the data. A 256kb memory was fabricated with 32kb blocks, operating at 350mV with a frequency of 25kHz.
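The sense amplifier redundancy described above can be illustrated with a small sketch. It assumes, which [34] as summarized here does not state, that each sense amplifier fails independently with the same probability, so that a column fails only if all of its redundant amplifiers fail. The selection routine simply picks the first amplifier that passed a start-up test. All names and example numbers are hypothetical.

```python
def column_failure_probability(p_fail_single, n_redundant):
    """Probability that every redundant sense amp on a column fails,
    assuming independent, identically distributed failures."""
    if not 0.0 <= p_fail_single <= 1.0 or n_redundant < 1:
        raise ValueError("invalid probability or amplifier count")
    return p_fail_single ** n_redundant


def select_sense_amp(test_results):
    """Return the index of the first sense amp that passed the
    initialization test, or None if none passed."""
    for index, passed in enumerate(test_results):
        if passed:
            return index
    return None


# Hypothetical example: 5% single-amp failure rate, two amps per column.
print(column_failure_probability(0.05, 2))
print(select_sense_amp([False, True]))
```

Under the independence assumption, each added amplifier multiplies the column failure probability by the single-amplifier failure rate, at the cost of extra periphery area.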


3.1.4.4 10T Cell with Differential Read Scheme [32]

In contrast to many of the previous designs, which use a single ended readout, Roy’s group at Purdue proposed a cell with a differential readout. This 10T cell employs a pair of readout buffers, made up of transistors AL2, NL and AR2, NR, that enable the differential readout. To ensure a high read SNM, the standard access transistors (AL1, AR1) are connected in series with the readout access transistors (AL2, AR2) and are closed during reads. Both access transistors are opened during writes, but this requires a boosted wordline due to the reduced write current caused by the stacked access transistors. To eliminate additional leakage, the source of the evaluation transistors (NL, NR) is raised during non-read cycles. The system also incorporates bit interleaving for immunity to multi-bit soft errors (SER) and a dynamic differential cascode voltage switch logic (DCVSL) read scheme to ensure a large bitline swing for differential sensing.

Figure 11: 10T bitcell proposed by Roy’s group in [32]

The proposed bitcell was fabricated in a 90nm CMOS process as the core of a 32kb array. At 300mV with a 33% wordline boost, the SRAM operates at 581kHz, dissipating 1.81μW for reads and 1.07μW for writes. The minimum achievable operating voltage was 180mV. Leakage was comparable to that of a 6T cell at the same voltage.
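To compare the operating points reported for the two 10T designs on an energy basis, the quoted power can be divided by the quoted operating frequency. The sketch below assumes one access per clock cycle, which the summaries above do not state explicitly, so the results are rough indications only. The helper name is hypothetical.

```python
def energy_per_cycle_pj(power_uw, freq_khz):
    """Energy per clock cycle in picojoules: E = P / f."""
    if freq_khz <= 0:
        raise ValueError("frequency must be positive")
    # uW / kHz = 1e-6 W / 1e3 Hz = 1e-9 J = 1000 pJ
    return power_uw / freq_khz * 1000.0


# Operating points quoted in the text above.
print(energy_per_cycle_pj(3.28, 475))  # 10T cell of [33] at 400mV
print(energy_per_cycle_pj(1.81, 581))  # 10T cell of [32], reads at 300mV
print(energy_per_cycle_pj(1.07, 581))  # 10T cell of [32], writes at 300mV
```

Because the arrays differ in size, process, and supply voltage, these figures are not a like-for-like comparison of the bitcells themselves.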

3.1.5 Summary

Digital subthreshold and near-threshold design are fast becoming a popular choice for ultra-low power systems. As SRAM is a major component in digital systems, the development of low-voltage SRAMs has become a popular focus in recent years. Operation of standard 6T or 8T SRAMs at sub or near-threshold voltages is unachievable, primarily due to degraded static noise margins and extreme fluctuations in the device currents under process variations at low voltages. Several designs have been proposed and fabricated that show full functionality at voltages down to around 200mV, despite an extreme loss of performance. These designs employ various techniques to improve the read and write margins, as well as to enable a large number of bitcells on a single bitline. Most of these cells resemble an 8T cell with a decoupled readout path to solve the read margin problem. Various additional techniques have been used to improve the write margins and to minimize bitline leakage. Altogether, there appears to be substantial room for additional research on improving the operation of SRAMs at low voltages for systems operating close to the minimum energy point.

The following subsections include three published journal articles that describe novel low-voltage SRAM cells. Four additional publications [53, 55, 56, 63] include published designs, analyses, and reviews of low-voltage SRAMs and stability issues in SRAM design.

3.2 A 250mV 8kb 40nm Ultra-Low Power 9T Supply Feedback SRAM (SF-SRAM)

The following paper, published in the IEEE Journal of Solid State Circuits, describes a subthreshold SRAM, which was designed, fabricated, and measured as part of the 40nm "RAMBO" test-chip:

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 46, NO. 11, NOVEMBER 2011

A 250 mV 8 kb 40 nm Ultra-Low Power 9T Supply Feedback SRAM (SF-SRAM)

Adam Teman, Student Member, IEEE, Lidor Pergament, Omer Cohen, and Alexander Fish, Member, IEEE

Abstract: Low voltage operation of digital circuits continues to be an attractive option for aggressive power reduction. As standard SRAM bitcells are limited to operation in the strong-inversion regimes due to process variations and local mismatch, the development of specially designed SRAMs for low voltage operation has become popular in recent years. In this paper, we present a novel 9T bitcell, implementing a Supply Feedback concept to internally weaken the pull-up current during write cycles and thus enable low-voltage write operations. As opposed to the majority of existing solutions, this is achieved without the need for additional peripheral circuits and techniques. The proposed bitcell is fully functional under global and local variations at voltages from 250 mV to 1.1 V. In addition, the proposed cell presents a low-leakage state reducing power up to 60%, as compared to an identically supplied 8T bitcell. An 8 kbit SF-SRAM array was implemented and fabricated in a low-power 40 nm process, showing full functionality and ultra-low power.

Index Terms: CMOS memory integrated circuits, leakage suppression, SRAM, ultra low power.

Fig. 1. Schematics of standard SRAM bitcells. (a) 6T bitcell. (b) 8T bitcell.

I. INTRODUCTION

SUB-THRESHOLD and near-threshold operation have become popular alternatives for digital VLSI design in recent years [1]–[18]. These approaches utilize very low supply voltages for digital circuit operation, decreasing the dynamic power quadratically, and sufficiently reducing leakage currents. As static power is often the primary factor in a system’s power consumption, especially for low to medium performance systems, supply voltage scaling for minimization of leakage currents is essential. Optimal power-delay studies show that the Minimum Energy Point (MEP) is found in the sub-threshold region, where ultra-low power figures are achieved; albeit, at the expense of orders-of-magnitude loss in performance [1], [19], [20]. Recent studies have shown that an attractive tradeoff between power and delay can be found in the near-threshold region with a slightly higher power figure, but with a significant increase in performance [1].

Low voltage operation of Static CMOS logic is quite straightforward, as its non-ratioed structure generally achieves robust operation under process variations and device mismatch [21]. However, extreme global and local variations that are intensified at deep-nanoscale technology processes can easily cause traditional ratioed topologies to lose functionality, especially at low voltages, where sizing and mobility are not always the dominant factors in device drive strength.

One of the major blocks that is highly ratioed and therefore affected by the aforementioned process fluctuations, is the Static Random Access Memory (SRAM). The standard 6T bitcell, shown in Fig. 1(a), provides a robust, non-ratioed hold state, fully operational under process variations at very low supply voltages [22]. However, when the data is accessed (read and write operations), maintaining drive strength ratios is essential for maintaining functionality. Device sizing is sufficient to ensure these ratios under nominal strong inversion operation [23]; however, at low voltages, process variations and mismatch can cause a loss of functionality [24]. Both theoretical and measured analyses show that standard SRAM blocks are limited to operating voltages of no lower than 700 mV [8], [25], [26]. The read margin problem is easily solved with a small area penalty by decoupling the read out path, as in the two-port 8T cell, shown in Fig. 1(b). This structure features read margins equivalent to its hold margins, however its write margins maintain the 700 mV supply limitation [7], [9]. Note that throughout the majority of this paper the 8T implementation will be used as a reference, as it has similar hold margins to the 6T cell, but improved read margins. Therefore, from a low voltage stability perspective, the 8T is superior to the 6T; however, this design introduces additional constraints due to single-ended readout and half-select sensitivity.

Manuscript received December 07, 2010; revised July 24, 2011; accepted July 25, 2011. Date of publication September 01, 2011; date of current version October 26, 2011. This paper was approved by Associate Editor Peter Gillingham. The authors are with the Low Power Circuits and Systems Lab (LPC&S), VLSI Systems Center, Electrical and Computers Engineering Department, Ben-Gurion University, Be’er Sheva, Israel (e-mail: [email protected]). Digital Object Identifier 10.1109/JSSC.2011.2164009

0018-9200/$26.00 © 2011 IEEE

The requirement to develop low voltage SRAMs is derived not only from the need to integrate these blocks within sub/near threshold digital systems, but as an independent design focus in itself. SRAMs comprise very large portions of the total die area of most modern digital ICs, and this is only expected to grow [27]. Accordingly, the SRAM blocks are often the major factor in a given system’s power consumption. They are even a more substantial factor in the static power consumption, as the majority of SRAM cells are in a standby hold state throughout the clock period. Standby power reduction in SRAMs is one of the best opportunities for power reduction in digital ICs.

Low voltage SRAM design has become increasingly popular over the past few years. Various bitcell designs and architectural techniques have been proposed to enable operation deep into the sub-threshold region [3], [7]–[9], [13], [28], [29]. These designs generally incorporate the addition of a number of transistors into the bitcell topology, trading off density with robust functionality. Decoupling the readout path is a common technique [7], [8], [28], as is write margin improvement by word line boosting and supply gating. One of the first examples of a fully operational sub-threshold bitcell is the 10T circuit of [8], while peripheral techniques enabled 350 mV operation of a standard 8T bitcell in [7]. Fully differential implementations are proposed in [3] and [30], enabling half-select stability. Other recent approaches for aggressive power reduction in SRAMs include Adiabatic operation [31], FinFET based SRAM design [32] and bitline charge recycling [33], [34].

In this paper, we propose a novel 9T bitcell with a new Supply Feedback approach to low voltage operation. This Supply Feedback SRAM (SF-SRAM) internally weakens the pull up network of the bitcell during a write and thus assists in discharging the high data node. By applying this concept, the cell achieves highly improved write margins and is fully functional under global and local process variations at voltages as low as 250 mV. This low voltage operation is accomplished without the need for any additional peripheral circuits or techniques. Cell functionality and stability are presented, including Monte Carlo (MC) statistical distributions for proof of concept.

In addition to low voltage operation, the cell topology presents a low-leakage state, at which the bitcell’s static power is lower than an equivalently supplied 8T cell. In this state, the static power of the bitcell has a 15%–60% lower leakage factor than a standard 8T bitcell, depending on the cell implementation and supply voltage. This results in an 83× reduction in leakage power for an SF-SRAM cell operating at 300 mV, as compared to an 8T bitcell at 1.1 V.

An 8 kb array of SF-SRAM bitcells was fabricated as part of a 40 nm test chip and successfully measured in the laboratory. A deep analysis of the operating concepts, internal mechanisms, stability, power and performance are presented in this manuscript.

The rest of the paper is constructed as follows. Section II describes the problems that arise at low operating voltages and thus limit standard SRAM operation. The proposed 9T SF-SRAM bitcell is presented in Section III and its data retention is discussed; Section IV describes the read and write operations of the proposed cell. Test chip implementation and measurements are presented in Section V, including a discussion of design considerations according to the results. Section VI summarizes the results and concludes the paper.

II. DRIVE STRENGTH RATIOS AT LOW OPERATING VOLTAGES

Similar to other low voltage bitcells, the proposed 9T bitcell is based on the two-port 8T topology, shown in Fig. 1(b). This structure presents two major advantages. First, the Static CMOS cross coupled inverter structure of the cell core provides a non-ratioed robust positive feedback structure for bi-stability with high data retention (“hold”) noise margins. Second, the readout buffer, comprised of transistors M7-M8, decouples the read path from the cell core, enabling a non-penetrative read operation. This makes the worst case read margin equivalent to the hold margin, as described in [35]. To ensure writeability, the write bit lines (WBL/WBLB) must be able to pull down the internal node (Q or QB) storing a ’1’ past the cross coupled inverters’ trip point. This is achieved by maintaining stronger access transistors (M2/M5) than their respective pull up pMOS (M3/M6). At strong inversion voltages, mobility ($\mu$), device dimensions ($W/L$), and threshold voltage ($V_T$) have a linear (or square) influence on device currents, so it is usually sufficient to maintain a higher pull-down transconductance, i.e.,

$$ g_{m,n} = \mu_n C_{ox} \left(\frac{W}{L}\right)_n \left(V_{GS,n} - V_{T,n}\right) > g_{m,p} = \mu_p C_{ox} \left(\frac{W}{L}\right)_p \left(V_{SG,p} - |V_{T,p}|\right) \qquad (1) $$

with $g_m$ the device’s transconductance, and $C_{ox}$ the oxide capacitance coefficient.

As the supply voltage is reduced and approaches the sub-threshold region, the device’s current dependence on the overdrive voltage ($V_{GS} - V_T$, where $V_{GS}$ is the gate-to-source voltage) grows from a linear dependence to an exponential one. This is shown by the combination of two models: the sub-threshold conduction model, and the EKV model [36]. For voltages well below $V_T$, the sub-threshold conduction model shows the exponential dependence on overdrive [6]:

$$ I_{DS} = I_0 \, e^{\frac{V_{GS} - V_T + \eta V_{DS}}{n V_{th}}} \left(1 - e^{-\frac{V_{DS}}{V_{th}}}\right) \qquad (2) $$

with

$$ I_0 = \mu_0 C_{ox} \frac{W}{L} (n - 1) V_{th}^2 \qquad (3) $$

where $V_{DS}$ is the transistor’s drain-to-source voltage, $V_{th}$ is the thermal voltage, $\eta$ is the Drain Induced Barrier Leakage (DIBL) coefficient, $n$ is the subthreshold swing coefficient, $\mu_0$ is zero bias mobility, $C_{ox}$ is gate oxide capacitance, and $W$ and $L$ are the width and length of the transistor, respectively. Note that this model also emphasizes the exponential increase of DIBL as $V_{DS}$ increases, as will be discussed later.

When the device approaches inversion, as $V_{GS}$ gets close to $V_T$, the sub-threshold model of (2) loses accuracy and the EKV model should be considered [1]:

$$ I_{DS} = I_S \cdot IC \cdot k \qquad (4) $$

with $I_S$ representing the current at the onset of inversion, $k$ a fitting parameter, and $IC$ representing the inversion coefficient, given by

$$ IC = \left[\ln\left(1 + e^{\frac{V_{GS} - V_T}{2 n V_{th}}}\right)\right]^2 \qquad (5) $$

Based on these models, it can easily be concluded that threshold voltage has a strong effect on device current. Therefore, differences between the threshold voltages of nMOS and pMOS devices, especially under extreme process variations, often overtake mobility as the primary factor in the device strength ratio. The problem is accentuated at the Slow nMOS/Fast pMOS (SF) process corner, where the pMOS devices become stronger than equivalently sized nMOS devices at voltages below 900 mV. This variation, which causes 8T SRAMs to fail at low voltages, is shown in Fig. 2, plotting the current ratios of minimum sized nMOS devices as compared to pMOS devices at different supply voltages. The figure shows that for a typical process corner at room temperature, the nMOS is always stronger, but for the SF corner at 0 °C, the pMOS becomes stronger at 900 mV. In the near-threshold and sub-threshold regions, the pMOS current is as much as 35× stronger, such that meeting the writeability constraints through sizing would require impractical device sizes.

Fig. 2. Simulated drive current ratio of minimum sized nMOS/pMOS devices at the typical (TT) and SF corners in a standard 40 nm LP process. Whereas the nMOS devices are stronger than the pMOS devices at all supply voltages for the typical corner, the pMOS devices are stronger than their nMOS equivalents at the SF corner under 900 mV.

Two relatively simple methods to improve this ratio are wordline boosting and increasing the channel length, as incorporated by the authors of [7]. Adding approximately 150 mV to the gate voltage ($V_{GS}$) of an nMOS will ensure stronger drive than an equivalent pMOS at sub and near-threshold voltages, but this comes at the expense of additional periphery and power to create and propagate this boosted voltage. Enlarging the channel length utilizes the Reverse Short Channel Effect (RSCE) [9] and improves the susceptibility to process variations, therefore maintaining a better ratio. However, this alone cannot solve the problem for most processes, and the increase in device size and/or loss of drive current must be taken into account. In addition, layout implications at deep-nanoscale processes are non-trivial. Therefore, a standard 8T two-port SRAM cell is limited to a minimum voltage supply of approximately 700 mV for most nanoscale CMOS processes [8].

III. THE PROPOSED 9T CELL

The previous section presented the writeability limits of standard SRAM cells due to drive strength ratios between contending devices under process variations. The proposed 9T Supply Feedback SRAM bitcell, shown in Fig. 3, was developed to overcome this problem by utilizing a novel Supply Feedback concept. This concept is implemented by adding a supply gating transistor (M9) that is connected in a feedback loop to the data storage node (Q). The feedback weakens the pull up path of the cell during a write operation, ensuring that the cell flips, even when the pMOS devices are much stronger than the nMOS ones. In addition, the internal gating creates a slight voltage drop at the drain of the supply feedback transistor (M9) during one of the hold states. This voltage drop results in lower leakage currents at the expense of a slight reduction of hold noise margins. Hold states and stability of the 9T SF-SRAM cell are described in detail below, whereas read and write operations are described in Section IV.

Fig. 3. Schematics of the 9T SF-SRAM bitcell employing a Supply Feedback device (M9) connected to the internal node (Q).

A. 9T Cell Hold States

In contrast with the symmetric 6T and 8T bitcells, the 9T SF-SRAM bitcell, shown in Fig. 3, presents a pair of asymmetric stable states for data storage. For the case of storing a ’0’ (i.e., node Q is discharged), the feedback loop turns on M9 ($V_{SG,M9} = V_{DD}$), propagating $V_{DD}$ to the virtual supply node. In this “trivial” case, the internal transistors (M1, M3, M4 and M6) create a standard pair of cross-coupled Static CMOS inverters, with similar noise margins to an equivalent 8T cell. QB is therefore charged to $V_{DD}$, providing a strong gate bias for M7 ($V_{GS,M7} = V_{DD}$) and enabling fast single-ended readout (RBL discharge) when RWL is asserted.

The opposite state is initiated when Q is charged and QB is discharged. Now M9 is cut off as its $V_{SG}$ drops below the device’s threshold voltage. QB is strongly discharged, as the low resistance of a conducting M4 only has to overcome the high serial resistance of disconnected M6 and M9. M1 is therefore strongly cut off, with a very low gate voltage, and M3 is conducting, such that the final state of Q is equivalent to the virtual supply voltage. This voltage is set according to the contention between the primarily DIBL current through M1 and the sub-threshold current through M9, which is much stronger, providing a high level. Ultimately, the steady state voltage at Q is approximately 10% lower than $V_{DD}$, and fluctuates with the implant of the supply feedback transistor.

Fig. 4 shows the Monte Carlo statistical distribution of the steady state voltages at the Q and QB nodes with a supply voltage of 300 mV, under both global and local mismatch. The figure includes the distributions for the two nodes at both the hold ’1’ and hold ’0’ states. It is clearly shown that for the hold ’0’ state, both nodes are clamped to strong levels, while at the hold ’1’ state, the Q node shows a voltage drop with a mean value of 264 mV, which is 12% less than $V_{DD}$. This voltage drop results in a slight loss of static noise margin (SNM), but enables low voltage writes and provides internal leakage suppression. This voltage drop doesn’t affect the readability of the cell, even at extreme variations, as can be seen in the QB state depictions in Fig. 4. All the distribution points fell within a very small range near ground for the hold ’1’ state, and had less than a 1% voltage drop for the hold ’0’ state. This ensures a strong overdrive voltage on M7 during read operations for all states and under all operating conditions.

Fig. 4. Simulated statistical distribution of steady state voltages at nodes Q and QB (left and right side graphs, respectively) for both the hold ’0’ and hold ’1’ states (top and bottom graphs, respectively) with $V_{DD} = 300$ mV. The results are for a 2500 point simulation under both global and local mismatch variations. Note that the graphs for Q at hold ’0’ and QB at hold ’1’ are given in different units.

Fig. 4 shows the steady state voltages of an SF-SRAM bitcell implemented exclusively with standard (SVT) threshold implants at a discrete sub-threshold voltage. Fig. 5 extends the discussion by showing the distribution over the full range of supply voltages and with three different implementation options. These three implementations, which will be compared throughout the text, are achieved by replacing the pMOS transistors with low-threshold (LVT) or high-threshold (HVT) devices. In general, using an LVT feedback device results in a higher stable value for Q in the hold ’1’ state, and consequently increased hold margins, but it achieves less of an improvement in write margins and leakage currents. Using an HVT feedback device provides the opposite trade-offs. The steady state node voltages at Q are shown for all three implementations in Fig. 5 at the full range of operating voltages, from 200 mV to 1.1 V. In addition, the figure shows the voltage drop under local and global variations. The remaining distributions are not shown, as they are very close to the rails with a very small deviation (<1 mV) for all implementations and supply voltages. Note that as shown in Fig. 5, whereas the SVT and LVT implementations retain hold stability under 200 mV, the HVT implementation loses functionality at a slightly higher supply voltage.

With respect to global variations, the most problematic corner is the Fast nMOS/Slow pMOS (FS) corner, as the leakage of M1 is strengthened as compared to the current through M9 and M3. But at this corner M4 is also strengthened, providing a robust ’0’ level at QB despite the degraded level at Q. This ensures correct readout through these extreme situations.
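The contention that sets the hold ’1’ level at Q can be illustrated numerically with the sub-threshold conduction model of (2). The sketch below is only a rough illustration, not the authors’ simulation: it assumes identical, normalized nMOS and pMOS device parameters (threshold voltage, DIBL coefficient, swing coefficient), neglects the drop across M3, and solves for the voltage at which the leakage of M1 balances the current of M9 by bisection. All function names and parameter values are hypothetical.

```python
import math

V_TH = 0.026  # thermal voltage at room temperature [V]


def subthreshold_current(v_gs, v_ds, v_t=0.4, eta=0.1, n=1.5, i0=1.0):
    """Sub-threshold drain current, normalized to i0:
    I = i0 * exp((v_gs - v_t + eta * v_ds) / (n * V_TH))
           * (1 - exp(-v_ds / V_TH))
    """
    return (i0 * math.exp((v_gs - v_t + eta * v_ds) / (n * V_TH))
            * (1.0 - math.exp(-v_ds / V_TH)))


def hold_one_level(v_dd, **device):
    """Bisection for the node voltage v_q at which the leakage of the
    pull-down (v_gs = 0, v_ds = v_q) balances the feedback pMOS current
    (v_sg = v_sd = v_dd - v_q). Assumes identical device parameters."""
    def imbalance(v_q):
        pull_up = subthreshold_current(v_dd - v_q, v_dd - v_q, **device)
        pull_down = subthreshold_current(0.0, v_q, **device)
        return pull_up - pull_down

    lo, hi = 0.0, v_dd  # imbalance(lo) > 0 and imbalance(hi) < 0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


v_q = hold_one_level(0.3)
print(f"V_Q = {v_q * 1000:.0f} mV, drop = {100 * (1 - v_q / 0.3):.1f}%")
```

With these assumed values the computed drop is roughly a tenth of the supply, which is only qualitatively in line with the simulated distributions discussed above; real devices differ in threshold, DIBL, and drive strength, and those differences shift the balance point.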

Fig. 5. Simulated mean and 6σ boundaries of the steady state voltage at node Q compared to $V_{DD}$ for the hold ’1’ state. Three implementation options are shown, marked by the threshold implant of the feedback transistor (M9): SVT, LVT and HVT. The distributions for the QB node and for the hold ’0’ state are very close to the rails for all supply voltages and therefore are not shown.

B. Static Noise Margin

As described above, the threshold implant of the supply feedback device (M9) has a large effect on the voltage drop and therefore on the noise margins of the hold ’1’ state. Fig. 6 emphasizes this fluctuation in noise margins, showing the butterfly curves of the SF-SRAM cell with different threshold implants applied to M9. All curves in Fig. 6 are shown for a 300 mV supply voltage. Note that the 8T cell SNM is shown for comparison and due to the fact that various other low voltage SRAMs are based on this structure; however, the reference 8T cell is non-functional at this voltage. As expected, the 9T butterfly curves are asymmetric, as the QB node always holds strong levels, whereas the Q node’s steady state level is a function of the leakage through M1. As the gate voltage of M1 increases, the voltage level at Q decreases, leading to a “deflated” lobe. When M9 is implemented with an LVT device, the loss of static level is limited, in contrast with the HVT device that presents a severely deflated lobe. However, if ultra-low power has a higher priority than large static noise margins, the HVT or SVT implementations are attractive alternatives, since the hold ’1’ state provides a low leakage factor, as will be shown later.

Fig. 6. Butterfly curves of an SF-SRAM cell with three threshold implant options for the supply feedback transistor (M9): SVT, LVT and HVT, as well as the reference 8T butterfly curve. The simulation data was plotted for a 300 mV supply voltage.

A comparison of the SNM of the 9T SF-SRAM cell with the standard 8T SRAM cell at the full range of supply voltages (0.2 V to 1.1 V) is given in Fig. 7. Due to the asymmetric characteristic of the proposed cell, the SNM is partitioned into hold ’0’ SNM and hold ’1’ SNM graphs. These are identical for symmetric cells, such as the 8T SRAM. Obviously the worst case (hold ’1’) should be considered; however, we choose to show both cases, as SNM is an over-pessimistic metric for an asymmetric cell. A better comparison could be according to Dynamic Noise Margin (DNM), as described in [37], [38].

Fig. 7. SNM ratio of 9T SF-SRAM cell as compared to standard 8T cell at full range of supply voltages. (a) Hold ’1’ state with depleted noise margins for three possible implementations of the 9T cell. (b) Hold ’0’ state with equivalent noise margins to the 8T cell. All graphs are shown according to simulation at the TT process corner.

As can be seen in Fig. 7(a), which shows the SNM ratio for the hold ’1’ state, the LVT implementation has a 10–20% lower SNM for the ultra-low voltages (200 mV–450 mV). The HVT option is unstable below 300 mV, with a quick fall in SNM. At the median to nominal voltages, the SVT option should be considered, as its SNM is comparable to the LVT implementation, and it presents a lower power figure, as shown in the next subsection. The HVT option has static noise margins that are approximately 40% lower than the reference 8T cell, and so should only be considered for ultra-low power applications in a low noise environment or with the integration of strong error correction codes. Note that for the hold ’0’ state, shown in Fig. 7(b), the SNM is hardly degraded for the LVT and SVT implementations throughout the full range of supply voltages. The HVT option has degraded margins for this state as well; however they are substantially higher than the hold ’1’ state.

Fig. 8. Simulated leakage current ratio of the 9T SF-SRAM cell, as compared to the reference 8T bitcell for the Hold ’1’ state.

C. Static Power

The major advantage of the proposed 9T SF-SRAM cell is its functionality at ultra-low voltages without the need for additional periphery due to its extended write margins. However, a by-product of its topology is internal leakage suppression in the hold ’1’ state.

As shown above, the steady state voltage of the data storage node (Q) is slightly lower than VDD (see Fig. 5). On the one hand, this presents a decrease in SNM, but on the other hand it reduces the internal leakage of the bitcell. For the standard 8T cell, the static leakage currents are set (assuming Q = ’1’ and precharged bitlines) by the cut-off inverter transistors, M1 and M6 (see Fig. 1), along with the right access transistor (M5). These leakages are dominated by DIBL. In the proposed SF-SRAM cell, assuming the same hold ’1’ state, the left leakage path is reduced due to the degraded VDS of M1, while the right leakage path is suppressed by the stack effect of M9 in series with M6. M9 has a sub-threshold gate bias with low DIBL, while M6 is not as severely cut off as in the standard 8T case, but has lower DIBL. Leakage through the access transistors is bitline dependent but hardly changed, and gate leakages are slightly decreased on M4 and increased on M6. M9 provides an additional, albeit very small, gate leakage current.

The final result of these effects is a drop in total leakage, as depicted in Fig. 8 for the three threshold implementations of the supply feedback transistor across the full range of supply voltages. The figure depicts the ratio of leakage currents as compared to the reference 8T bitcell. It is clear that the higher the threshold of the SF device, the lower the leakage, achieving as much as a 63% reduction in static power for a nominally biased (VDD = 1.1 V) HVT implementation. For the more robust LVT implementation, a 20%–25% reduction is achieved for all applicable voltages. The graphs for the hold ’0’ state are not included, as the leakage power is almost identical to the 8T reference cell. Note that the reference 8T cell is non-functional under 700 mV, due to lack of writeability. The SVT implementation of the proposed cell in the hold ’1’ state presents a 91% (11×) and 99% (83×) leakage power reduction at 300 mV as compared to the 8T cell at 700 mV and 1.1 V, respectively.

The leakage reduction described above is, as far as we know, the first time a sub-threshold bitcell has been proposed with an advantage over the standard 8T cell in static power dissipation. The majority of the proposed low-voltage cells have slightly larger static power dissipation due to the additional transistors, and therefore additional leakage paths. The primary focus of all of these implementations is to aggressively reduce the system’s power dissipation at the expense of performance and area.

Even though the proposed cell’s additional power reduction is only present at one stable state, data processing algorithms can maximize the number of cells storing a ’1’ (for example, unused storage could be pre-written to this state), maximizing the leakage reduction of the array. It should be noted that the hold ’1’ state is also beneficial for read power and bitline leakage, as M7 is cut off in this state.

IV. READ AND WRITE OPERATIONS

The previous section discussed in detail the static operation of the proposed 9T SF-SRAM cell when holding a ’1’ or a ’0’. However, the major benefits of this topology are its readability and writeability at low operating voltages, without the need to implement additional peripheral circuitry and techniques (compared to a standard 8T cell).

A. Read Operation

The read operation of the proposed cell is equivalent to that of the reference 8T cell. Transistors M7 and M8 comprise a read buffer that decouples the readout path from the internal cell storage. M7 is gated by node QB, such that when a ’1’ is stored (at Q), M7 is cut off and when a ’0’ is stored, it is conducting. A read operation is initiated by precharging the read bit line (RBL in Fig. 3) and asserting the read word line (RWL). If a ’0’ is stored, RBL is discharged through M7 and M8. If a ’1’ is stored, M7 blocks the discharge path and RBL remains at its precharged value. A single-ended sensing scheme is used to recognize whether RBL has been discharged or not. Despite the deflated level of Q when holding a ’1’, QB is always clamped to VDD or GND, resulting in strong conductance through M7 and equivalent read performance to the 8T reference cell.

This decoupling of the readout path results in a read margin equivalent to the hold margin described in the previous section, which is sufficient at the full range of supply voltages under global and local process variations. From a performance perspective, the read access time depends on a number of design factors, primarily bitline capacitance, sense amplifier sensitivity and drive strength of the read buffer transistors (M7, M8). Therefore, the read performance is application/architecture specific and very controllable according to required specifications. However, it is clear that it degrades severely with reduction of the supply voltage.
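The read sequence described above can be sketched at the logic level. The following Python snippet is only an illustrative model, not the authors' simulation setup: the class name `ReadPort` and its fields are hypothetical, QB is modeled as the ideal complement of Q, and all analog effects (RBL capacitance, sensing threshold, leakage) are ignored.

```python
from dataclasses import dataclass


@dataclass
class ReadPort:
    """Logic-level model of the decoupled read buffer (M7, M8). Hypothetical helper."""

    q: int  # stored data bit at node Q (0 or 1)

    def read(self) -> int:
        """Return the RBL level after a read access."""
        rbl = 1          # step 1: precharge the read bitline
        rwl = 1          # step 2: assert the read wordline, so M8 conducts
        qb = 1 - self.q  # M7 is gated by QB, assumed to be the complement of Q
        if rwl and qb:   # M7 and M8 both conduct only when a '0' is stored
            rbl = 0      # RBL is discharged
        return rbl       # single-ended sensing: RBL stays high for a stored '1'


if __name__ == "__main__":
    for bit in (0, 1):
        print(f"stored {bit} -> RBL {ReadPort(bit).read()}")
```

Because the storage nodes never drive RBL directly in this model, the read cannot disturb the stored value, which mirrors the decoupling argument made for the read margin.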

B. Write Operation

The most interesting aspect of the proposed SF-SRAM cell is its performance during a write operation. Similar to an 8T cell, the write operation is initiated by driving the differential write bit lines (WBL and WBLB of Fig. 3) to the level of the data to be written and asserting the write word line (WWL). To ensure the success of this operation in a standard 8T cell, the pull-down path on the side to be written to ’0’ must overcome the pull-up pMOS that was previously holding the ’1’ state. As previously mentioned, at strong inversion voltages, this is solved by transconductance ratios, and is easily achieved by sizing the pull-up pMOS devices equivalent to the nMOS access transistors, as hole mobility is lower than electron mobility. However, as the gate voltages approach the device’s threshold, the large current fluctuation due to process variations often disrupts this ratio. Therefore, even a downsized pMOS can overcome the access transistor that is weakened due to higher threshold voltage, longer channel length, degraded gate widths, etc., resulting in a failed write.

In the proposed cell, the feedback loop from Q to the gate of M9 assists in the write operation by weakening the pull-up path. Again, as the cell is asymmetric, the operation is quite different for the case of writing a ’1’ to a cell holding a ’0’ and vice versa. Therefore, these two cases will be described separately.

1) The Write ’1’ Operation: Let us assume that the cell is in the hold ’0’ state, i.e., Q is discharged to GND and QB is charged to VDD. In order to write a ’1’, WBL is driven to VDD and WBLB is discharged to GND. WWL is asserted and the write operation commences, as illustrated in Fig. 9. The read buffer transistors are omitted from the figure, as they are irrelevant to this operation. Initially, M9 is strongly conducting, enabling full contention (along with M6) to the pull-down path through M5. Provided this situation persisted, QB would be pulled down towards a steady state voltage between VDD and GND. Under standard conditions, this voltage would be low enough to turn on M3 and cut off M1, initiating the positive feedback of the cross-coupled inverters and resulting in a successful write. However, under certain conditions, such as the SF corner, the steady state voltage is high enough not to initiate this feedback and the write would ultimately fail. In this case, the feedback of M9 comes into play, as is clearly depicted in Fig. 9. As Q is charged, M9 is weakened, lowering the contention to M5. This enables an “easy” write, further enhanced by the weakening of M1 as QB is discharged.

Fig. 9. Write ’1’ operation.

The Write ’1’ margins achieved for the SVT implementation of the proposed bitcell are shown in Fig. 10, as compared to the reference 8T cell. The great advantage of the proposed cell over the reference cell at the SF corner is accentuated by this graph, showing a vast rise in the write margin ratio as the supply voltage is lowered. Below 500 mV, the 8T write margin becomes negative, while the proposed cell maintains a positive margin down to below 200 mV under global variations. At the typical corner, the proposed cell provides an advantage over the 8T cell at voltages under 700 mV. At higher voltages and in the FS corner, the proposed cell’s write margins during a Write ’1’ operation are approximately 10% lower than the reference cell; however, they are still sufficient, as their absolute value is high at these voltages.

Fig. 10. Write margin ratio of a Write ’1’ operation as compared to the reference 8T cell as simulated at various process corners and at all operating voltages.

2) The Write ’0’ Operation: The Write ’0’ operation includes a similar feedback process that provides improved write margins. We will now assume that Q is charged to its steady state hold ’1’ level (slightly below VDD) and that QB is discharged to GND. To write a ’0’, WBL is discharged to GND and WBLB is driven to VDD. Subsequently, WWL is asserted, providing us with the initial state illustrated in Fig. 11 (again, omitting M7 and M8, which are irrelevant to the process). In this case, the advantage of the write operation is straightforward, as not only is Q initially residing at a lower voltage, but M9 is initially cut off, providing a pull-up current that is no match for the strong discharge current of M2. The positive feedback of the cross-coupled inverters is quickly initiated, and the storage nodes are pulled to their respective rails.

Fig. 11. Write ’0’ operation.

The ratios of the Write ’0’ write margins as compared to the 8T cell are shown in Fig. 12 at various process corners throughout the full range of supply voltages. The behavior is similar to that described for the Write ’1’ operation, with the proposed cell showing an even higher advantage for this operation. For example, at the typical corner with a supply voltage of 700 mV, the 9T cell’s write margin for the Write ’0’ operation is 30% higher than the 8T cell, whereas for the Write ’1’ operation it is 5% higher.

Fig. 12. Write margin ratio of a Write ’0’ operation as compared to the reference 8T cell according to simulation at various process corners and at all operating voltages.

It is important to look at the dynamic stability of the cell’s write operation in addition to its static write margins. This stability takes into account the trajectory of the internal cell voltages during write access according to different write access times. A successful write occurs when T_WWL > T_crit, where T_WWL is the pulse width of the WWL signal and T_crit is the critical write access time that brings the cell state past its separatrix [37], [38].

Fig. 13 shows the trajectories of Write ’1’ (Fig. 13(a)) and Write ’0’ (Fig. 13(b)) operations at 300 mV under the SF corner. The figures show four cases: a typical (successful) write; a successful write with T_WWL slightly above T_crit; a failed write with T_WWL slightly below T_crit; and the trajectory of a (non-time limited) failed write into an 8T bitcell under these conditions.

Fig. 13. Dynamic trajectories of write operations to the SF-SRAM cell at the SF corner with a 300 mV power supply. (a) Write ’1’ operation. (b) Write ’0’ operation. Both figures include the trajectories of: (1) a successful write operation with a very long write pulse; (2) a successful write operation with a write pulse close to the minimum width (T_crit); (3) a failed write pulse just below the minimum width; and (4) the trajectory of a (failed) write attempt into an 8T reference cell under these conditions (with a very long write pulse).

C. Write Access Time

As previously explained, the major feature of the SF-SRAM cell, as compared to the standard 8T cell, is its immunity to process variations at low voltages, specifically around the SF process corner. When writing to a standard 8T cell at the SF corner, the pMOS based pull-up path wins the ratioed fight against the nMOS based pull-down path, such that the write fails. In the SF-SRAM cell, the internal feedback loop weakens the pull-up path either at the beginning or at the end of the discharge process, enabling low-voltage writes, in spite of the strong pMOS. Ultimately, this also strongly affects the access time, providing a substantial advantage to the SF-SRAM at the SF corner, which presents the slowest write time for the 8T cell.

Fig. 14 plots the access time ratio between the SF-SRAM cell and the reference 8T cell across the supply voltage range. The figure shows the ratio of the worst case access time due to global variations. For the nominal voltages (above 800 mV), the proposed cell has slightly longer access times for the Write ’1’ operation, as the supply feedback slows down the positive feedback of the cross-coupled inverters that is initiated once the trip point is crossed. However, as the voltage is lowered and the drive strength ratio between the nMOS access transistors (M2 and M5 in Fig. 1(b)) and the pMOS pull-up devices is degraded, the 8T cell’s access time becomes slower. The SF-SRAM cell is less affected in these cases, and therefore, the worst case access time is faster for the proposed cell. The access time of the 8T cell degrades rapidly as the voltage is lowered, resulting in write failure below 500 mV. At this point the SF-SRAM’s worst case write access is approximately 10× faster than the reference 8T, and successful writes are achieved under global variations at voltages lower than 200 mV. The access time advantage is not unique to the worst case comparison. At the TT corner, for example, the proposed cell’s access time is shorter than the 8T cell at all voltages below 500 mV.

Fig. 14. Worst case write access comparison of the proposed SF-SRAM cell with the reference 8T cell throughout the range of supply voltages. Under 500 mV, the 8T cell is non-functional at the SF corner, so this corner is excluded from the comparison at these voltages (i.e., the plot shows the simulated ratio to the worst case, not including the SF corner).

It should be noted that the access times measured above take into account the delay from the rise of WWL until the discharging storage node passes the cell’s trip point. This definition is almost non-conditional when discussing an 8T cell, as at this point the cell’s positive feedback has been initiated and the storage nodes (Q and QB) will quickly be pulled to their respective rails. The same is true for the Write ’0’ operation on the SF-SRAM cell, but not for the Write ’1’ operation. When a ’1’ is written to the SF-SRAM cell, node QB is quickly discharged, ensuring that the final state will be written once WWL is lowered. However, the negative feedback on Q causes it to charge at a much slower pace. Therefore, if WWL is lowered before Q charges to its final value, there will be a finite period during which the voltage at Q is lower than the final state voltage presented in Section III. During this period the noise margins will be slightly lower than those simulated for the cell’s final DC state.

V. IMPLEMENTATION AND MEASUREMENTS

A. Test Chip Implementation and Architecture

The 9T SF-SRAM bitcell was implemented in a low-power 40 nm TSMC technology, using only standard process steps and multiple threshold (VT) implants. An 8 kbit array consisting of 256 rows and 32 columns was integrated into a test chip and fabricated on a TSMC shuttle. The micrograph of the test chip is shown in Fig. 15. Layout of the 9T cell is shown in Fig. 16. This layout was implemented according to the process design rules, using minimal width devices with slightly larger than minimal lengths to dampen the variability. Poly strips were kept equivalently sized and oriented for advanced node fabrication considerations, such as double patterning. The implemented layout leaves the option of adding threshold implant masks to the entire n-well area, as well as adding an HVT implant to M1, M2 and M5, further enhancing the cell’s hold ’1’ margin and reducing leakage, at the expense of increased write time and slightly reduced write margins. For area comparison, Fig. 16 also includes the layout of a standard two-port 8T cell with full adherence to the same design rules for a fair comparison, rather than comparing the area to previously reported “pushed rules” implementations. The 9T layout shows an increased area of approximately 20% over the 8T cell.

Fig. 15. Test chip micrograph and evaluation board.

Fig. 16. Layout of 9T SF-SRAM bitcell and area comparison to standard 2-port 8T bitcell.

Post layout simulations confirmed functionality of the array. Several tests were carried out to ensure that parasitic effects wouldn’t impair its operability, such as coupling into the cell during RWL assertion, especially in the hold ’1’ state. This type of operation has a negligible effect (<1%) on the internal data node voltage, partially due to the horizontal routing of the power rails creating a shielding between the WWL and RWL wires. Bitline cross-capacitance also has a negligible effect on the internal data levels in the final array.

In order to maintain focus on functionality of the cell and eliminate risks, the cell periphery was synthesized with standard cells and read sensing was implemented using a skewed inverter. This sensing scheme results in an obvious loss in performance at nominal voltages, but enables low voltage testing without the risk of dysfunctional sense amplifiers at these supplies. To decouple the power consumption of the non-optimal peripheries from that of the array core, the array and internal peripherals (precharge circuits, write drivers, sense inverters) were powered by a separate supply. Level shifters were inserted at the interface between the standard digital circuitry and the array to enable standard operation of the digital circuits at nominal voltages, while testing the novel array at low voltages. A comprehensive built-in-self-test (BIST) block was integrated with the array to test functionality.

In order to test the performance of the array core without taking into account the non-optimal, high power peripherals, the array was operated on a falling-edge initiated half cycle. Therefore, all performance was tested assuming the decoding, write driving and pre-charged signal were ready prior to the operation. The timing period started with the falling edge of the clock and included down-shifting, word line charging, bitcell read/write, read sensing and up-shifting, finishing with the rising edge of the next clock. An 8T reference array was fabricated with the same operating principle for accurate performance and power comparisons. Typical waveforms, showing subsequent operation of writing and then reading a ’1’ and then a ’0’, are shown in Fig. 17.

Fig. 17. Timing diagram of subsequent writes and reads (a single bit is illustrated). (1) Write a ’1’; (2) Read the stored ’1’; (3) Write a ’0’; (4) Read the stored ’0’.

The clock signal (CLK) is positive-edge synchronized with the Write-Read Select (WR), Data-In, and Data-Out signals. The digital periphery drives the write bitlines (WBL, WBLB) or precharges the read bitline (RBL) according to the WR and Data-In signals. The write and read wordlines (WWL and RWL, respectively) are raised at the negative edge of the clock, following which the internal nodes are written, or RBL is conditionally discharged. When writing a ’1’, as in cycle (1), Q is charged to a level slightly lower than the array supply; however, QB is fully discharged. When writing a ’0’, as in cycle (3), both internal nodes are clamped to their respective rails. The read out value of the cell (Data Out) is stable prior to the next positive clock edge and is held stable until the next negative clock edge. When reading a ’0’, as in cycle (4), the Data Out signal falls once RBL crosses the switching threshold of the skewed sensing inverter. It should be noted that the signals CLK, WR, Data In and Data Out are biased at a nominal supply voltage (1.1 V), whereas WBL, WBLB, RBL, WWL, RWL, and the internal nodes are connected to the lower supply.

Each column was divided into four 64 row partitions in a divided bitline (DBL) structure [39], to reduce RBL capacitance and off-row leakage. Each partition was separately precharged and pre-sensed, with the inverted bitline section signal propagated to a NOR gate at the bottom of each row. Write bitlines were common for all 256 rows. The complete array architecture, with the addition of the chip select (CS) signal, is shown in Fig. 18. The reference 8T array was compiled in the same fashion.

B. Test Chip Measurements

The test chips were packaged and connected to a Xilinx evaluation board (see Fig. 15) to enable test control with a standard FPGA. The arrays were tested with both complete array tests using an on-chip BIST, as well as specific tests programmed into the FPGA and propagated through the chip’s I/O pads. All packaged test chips functioned successfully at the full range of supply voltages from 400 mV to 1.1 V. As mentioned above, the test setup didn’t enable operation at voltages below 400 mV. An example set of waveforms, measured at one of the test chip’s interfaces, is shown in Fig. 19. The waveforms show writing a ’1’ and a ’0’ to two separate addresses and subsequently reading them out. Synchronization of the array to the negative clock edge is pointed out in the figure.

Power measurements were taken using the Agilent B1500A Semiconductor Device Analyzer. Fig. 20 shows the static (Hold) power of the array when loaded with all zeros and all ones. At the minimum measured voltage, 400 mV, the array consumed 40 nW and 60 nW for the array loaded with ones and with zeros, respectively. As shown in the figure, the measured static power of the 8T reference array (entirely loaded with zeros) was very similar to that of the SF-SRAM when loaded with zeros.

Dynamic power during single Read and Write cycles is shown in Fig. 21. The write power is shown for writing a 32 bit word equally composed of zeros and ones, whereas the read power is shown only for reading a 32 bit vector of zeros, as the dynamic power consumption for reading a ’1’ is negligible. The read power was measured for a row in the top quarter of the array, presenting a worst case power figure in the DBL structure. At 400 mV, a write operation consumed 360 fW and a worst case read operation consumed 210 fW. At this voltage the array was operated at 1 MHz; however, this could be improved significantly with the integration of an enhanced sensing scheme. An overall comparison of figures of merit with the reference 8T array is given in Table I.

C. Discussion

Several design factors were presented and compared for the proposed 9T SF-SRAM, as an alternative to standard SRAM implementations. The first and foremost advantage of the SF-SRAM cell is its robust functionality at low operating voltages, much lower than those achievable with standard 6T or 8T cells. In addition, the SF-SRAM bitcell provides in-cell leakage suppression for one of its stable states, sufficiently reducing the array’s static power dissipation. These advantages come without the need for additional peripheral circuits and techniques, such as wordline boosting and supply gating, that present various drawbacks.

The improved characteristics of the SF-SRAM technique come at the expense of an increase in area and a reduction of traditional static noise margin metrics. Three implementation alternatives were discussed, trading off leakage suppression with static noise margins in the hold ’1’ state. For ultra-low power applications, operating at sub-threshold supply voltages and at very low frequencies, the LVT implementation should be considered. Using a low-threshold transistor as the supply feedback device reduces the voltage drop at the data (Q) node during hold ’1’ cycles, resulting in a minimal loss of static noise margins. The leakage reduction is less significant using this implementation due to the low resistance of such a device; however, the power reduction at such low operating voltages is already extremely advantageous.

For applications with higher supply voltages for higher performance, but with an emphasis on static power minimization, the SVT or HVT implementations should be considered. The loss of static noise margin is more tolerated at higher voltages, and the power saving is substantial. Integrating an SF-SRAM array with a data processing algorithm to maximize cells in the low-leakage state further enhances this advantage.
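One possible form of such a data processing algorithm is sketched below in Python. It is not taken from the paper: it assumes a simple invert-flag encoding (similar in spirit to bus-invert coding) with one extra flag bit stored per 32-bit word, and it assumes that hold power scales linearly between the measured all-ones (40 nW) and all-zeros (60 nW) values at 400 mV. The function names are hypothetical.

```python
from typing import Tuple

WORD_BITS = 32               # word width of the fabricated array
MASK = (1 << WORD_BITS) - 1  # keeps values within one word


def encode_max_ones(word: int) -> Tuple[int, int]:
    """Return (stored_word, invert_flag) so that the stored word holds
    at least as many '1' bits as '0' bits (the low-leakage state)."""
    word &= MASK
    ones = bin(word).count("1")
    if 2 * ones < WORD_BITS:      # more zeros than ones: store the complement
        return (~word) & MASK, 1
    return word, 0


def decode(stored: int, invert_flag: int) -> int:
    """Recover the original word from the stored word and its flag bit."""
    return (~stored) & MASK if invert_flag else stored & MASK


def hold_power_nw(fraction_ones: float,
                  p_all_ones_nw: float = 40.0,
                  p_all_zeros_nw: float = 60.0) -> float:
    """Estimate array hold power, ASSUMING linear scaling between the
    measured all-ones and all-zeros values at 400 mV."""
    return fraction_ones * p_all_ones_nw + (1.0 - fraction_ones) * p_all_zeros_nw


if __name__ == "__main__":
    stored, flag = encode_max_ones(0x0000000F)
    print(hex(stored), flag, hex(decode(stored, flag)))
    print(hold_power_nw(0.5), "nW at 50% ones (linear estimate)")
```

With this encoding every stored word contains at least 16 ones, so under the linear assumption the array hold power would not exceed the 50% estimate; the extra flag bit per word and the encode/decode logic are overheads that this sketch does not account for.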

Fig. 18. Architecture of the 256 × 32 bitcell array with divided wordline and level shifting.

Fig. 20. Measured static power of the fabricated SF-SRAM array when all zeros and all ones are loaded. The reference 8T array was measured with all zeros loaded, for comparison.

Fig. 19. Measured waveforms at test-chip interface. The figure shows a single bit written to two different addresses and subsequently read out.

Fig. 21. Measured dynamic power consumption of single read and write operations. The write operation included writing a 32-bit word with an equal number of ’1’ and ’0’ bits. The read operation included reading a 32-bit vector of ’0’ bits.

TABLE I. 9T SF-SRAM FIGURES OF MERIT

VI. CONCLUSION

This paper presented a novel 9T Supply Feedback SRAM bitcell. The operational concepts and mechanisms inside the bitcell were discussed. Static and dynamic metrics were presented and compared to those of a standard 8T bitcell. Various implementations were proposed and tradeoffs were explained. The proposed bitcell provides robust functionality under global and local process variations throughout the full range of supply voltages, as low as 250 mV. This is achieved without the need for additional peripheral circuitry. In addition, in one of its stable states, the cell provides internal leakage suppression, resulting in a 15%–60% reduction of static power as compared to an 8T cell at the same supply voltage (depending on the implementation).

An 8 kbit array of SF-SRAM bitcells was implemented, fabricated and tested in a Low Power 40 nm CMOS process. Measurement results show full functionality at all voltages between 1.1 V and 400 mV (the limit of the test chip).

ACKNOWLEDGMENT

The authors would like to thank Mr. N. Sever, the Zoran Corporation, and the Alpha Consortium for their help and support in the completion of this work.

REFERENCES

[1] D. Markovic, C. C. Wang, L. P. Alarcon, T.-T. Liu, and J. M. Rabaey, “Ultralow-power design in near-threshold region,” Proc. IEEE, vol. 98, pp. 237–252, 2010.
[2] B. Zhai, S. Pant, L. Nazhandali, S. Hanson, J. Olson, A. Reeves, M. Minuth, R. Helfand, T. Austin, D. Sylvester, and D. Blaauw, “Energy-efficient subthreshold processor design,” IEEE Trans. Very Large Scale Integration (VLSI) Syst., vol. 17, pp. 1127–1137, 2009.
[3] I. J. Chang, J. J. Kim, S. P. Park, and K. Roy, “A 32 kb 10T sub-threshold SRAM array with bit-interleaving and differential read scheme in 90 nm CMOS,” IEEE J. Solid-State Circuits, vol. 44, pp. 650–658, 2009.
[4] E. A. Vittoz, “Weak inversion for ultra low-power and very low-voltage circuits,” in IEEE Asian Solid-State Circuits Conf. (A-SSCC 2009), 2009, pp. 129–132.
[5] L. Chang, R. K. Montoye, Y. Nakamura, K. A. Batson, R. J. Eickemeyer, R. H. Dennard, W. Haensch, and D. Jamsek, “An 8T-SRAM for variability tolerance and low-voltage operation in high-performance caches,” IEEE J. Solid-State Circuits, vol. 43, pp. 956–963, 2008.
[6] S. Fisher, A. Teman, D. Vaysman, A. Gertsman, O. Yadid-Pecht, and A. Fish, “Digital subthreshold logic design - motivation and challenges,” in Proc. IEEE 25th Convention of Electrical and Electronics Engineers in Israel (IEEEI 2008), 2008, pp. 702–706.
[7] N. Verma and A. P. Chandrakasan, “A 256 kb 65 nm 8T subthreshold SRAM employing sense-amplifier redundancy,” IEEE J. Solid-State Circuits, vol. 43, pp. 141–149, 2008.
[8] B. H. Calhoun and A. P. Chandrakasan, “A 256-kb 65-nm sub-threshold SRAM design for ultra-low-voltage operation,” IEEE J. Solid-State Circuits, vol. 42, pp. 680–688, 2007.
[9] T. Kim, J. Liu, and C. H. Kim, “An 8T subthreshold SRAM cell utilizing reverse short channel effect for write margin and read performance improvement,” in IEEE Custom Integrated Circuits Conf. (CICC’07), 2007, pp. 241–244.
[10] A. Wang, B. H. Calhoun, and A. P. Chandrakasan, Sub-Threshold Design for Ultra Low-Power Systems. New York: Springer-Verlag, 2006.
[11] B. H. Calhoun and A. Chandrakasan, “Analyzing static noise margin for sub-threshold SRAM in 65 nm CMOS,” in Proc. 31st European Solid-State Circuits Conf. (ESSCIRC 2005), 2005, pp. 363–366.
[12] A. Raychowdhury, S. Mukhopadhyay, and K. Roy, “A feasibility study of subthreshold SRAM across technology generations,” in Proc. 2005 IEEE Int. Conf. Computer Design: VLSI in Computers and Processors (ICCD 2005), 2005, pp. 417–422.
[13] A. Wang and A. Chandrakasan, “A 180-mV subthreshold FFT processor using a minimum energy design methodology,” IEEE J. Solid-State Circuits, vol. 40, pp. 310–319, 2005.
[14] A. Vladimirescu, Y. Cao, O. Thomas, H. Qin, D. Markovic, A. Valentian, R. Ionita, J. Rabaey, and A. Amara, “Ultra-low-voltage robust design issues in deep-submicron CMOS,” in Proc. 2nd Annu. IEEE Northeast Workshop on Circuits and Systems (NEWCAS 2004), 2004, pp. 49–52.
[15] A. Wang and A. Chandrakasan, “A 180 mV FFT processor using subthreshold circuit techniques,” in 2004 IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2004, vol. 1, pp. 292–529.
[16] D. Bol, R. Ambroise, D. Flandre, and J. Legat, “Interests and limitations of technology scaling for subthreshold logic,” IEEE Trans. Very Large Scale Integration (VLSI) Syst., vol. 17, pp. 1508–1519, 2009.
[17] D. Bol, R. Ambroise, D. Flandre, and J. Legat, “Impact of technology scaling on digital subthreshold circuits,” in Proc. IEEE Computer Society Annu. Symp. VLSI (ISVLSI ’08), 2008, pp. 179–184.
[18] D. Bol, C. Hocquet, D. Flandre, and J. Legat, “Robustness-aware sleep transistor engineering for power-gated nanometer subthreshold circuits,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS 2010), 2010, pp. 1484–1487.
[19] B. H. Calhoun, A. Wang, and A. Chandrakasan, “Modeling and sizing for minimum energy operation in subthreshold circuits,” IEEE J. Solid-State Circuits, vol. 40, pp. 1778–1786, 2005.
[20] D. Markovic, V. Stojanovic, B. Nikolic, M. A. Horowitz, and R. W. Brodersen, “Methods for true energy-performance optimization,” IEEE J. Solid-State Circuits, vol. 39, pp. 1282–1293, 2004.
[21] A. Wang, B. H. Calhoun, and A. P. Chandrakasan, Sub-Threshold Design for Ultra Low-Power Systems. Secaucus, NJ: Springer-Verlag New York, Inc., 2006, Series on Integrated Circuits and Systems.
[22] H. Qin, Y. Cao, D. Markovic, A. Vladimirescu, and J. Rabaey, “SRAM leakage suppression by minimizing standby supply voltage,” in Proc. 5th Int. Symp. Quality Electronic Design, 2004, pp. 55–60.
[23] J. M. Rabaey, A. Chandrakasan, and B. Nikolić, Digital Integrated Circuits: A Design Perspective, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2003, p. 761.
[24] J. Wang, S. Nalam, and B. H. Calhoun, “Analyzing static and dynamic write margin for nanometer SRAMs,” in Proc. ACM/IEEE Int. Symp. Low Power Electronics and Design (ISLPED 2008), 2008, pp. 129–134.
[25] M. Yamaoka, N. Maeda, Y. Shinozaki, Y. Shimazaki, K. Nii, S. Shimada, K. Yanagisawa, and T. Kawahara, “Low-power embedded SRAM modules with expanded margins for writing,” in 2005 IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2005, vol. 1, pp. 480–611.
[26] K. Zhang, U. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N. Vallepalli, Y. Wang, B. Zheng, and M. Bohr, “SRAM design on 65-nm CMOS technology with dynamic sleep transistor for leakage reduction,” IEEE J. Solid-State Circuits, vol. 40, pp. 895–901, 2005.
[27] International Roadmap for Semiconductors, ITRS, 2009 [Online]. Available: http://www.itrs.net/
[28] I. J. Chang, J. J. Kim, S. P. Park, and K. Roy, “A 32 kb 10T subthreshold SRAM array with bit-interleaving and differential read scheme in 90 nm CMOS,” in 2008 IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2008, pp. 388–622.
[29] J. P. Kulkarni, K. Kim, and K. Roy, “A 160 mV, fully differential, robust schmitt trigger based sub-threshold SRAM,” in Proc. ACM/IEEE Int. Symp. Low Power Electronics and Design (ISLPED), 2007, pp. 171–176.
[30] M. F. Chang, J. J. Wu, K. T. Chen, Y. C. Chen, Y. H. Chen, R. Lee, H. J. Liao, and H. Yamauchi, “A differential data-aware power-supplied (D²AP) 8T SRAM cell with expanded write/read stabilities for lower VDDmin applications,” IEEE J. Solid-State Circuits, vol. 45, pp. 1234–1245, 2010.
[31] S. Nakata, T. Kusumoto, M. Miyama, and Y. Matsuda, “Adiabatic SRAM with a large margin of VT variation by controlling the cell-power-line and word-line voltage,” in IEEE Int. Symp. Circuits and Systems (ISCAS 2009), 2009, pp. 393–396.
[32] A. Carlson, Z. Guo, S. Balasubramanian, R. Zlatanovici, T. J. K. Liu, and B. Nikolic, “SRAM read/write margin enhancements using FinFETs,” IEEE Trans. Very Large Scale Integration (VLSI) Syst., vol. 18, pp. 887–900, 2010.
[33] B. D. Yang, “A low-power SRAM using bit-line charge-recycling for read and write operations,” IEEE J. Solid-State Circuits, vol. 45, pp. 2173–2183, 2010.
[34] K. Kim, H. Mahmoodi, and K. Roy, “A low-power SRAM using bit-line charge-recycling,” IEEE J. Solid-State Circuits, vol. 43, pp. 446–459, 2008.
[35] T. H. Kim, J. Liu, and C. H. Kim, “A voltage scalable 0.26 V, 64 kb 8T SRAM with Vmin lowering techniques and deep sleep mode,” IEEE J. Solid-State Circuits, vol. 44, pp. 1785–1795, 2009.
[36] C. C. Enz, F. Krummenacher, and E. A. Vittoz, “An analytical MOS transistor model valid in all regions of operation and dedicated to low-voltage and low-current applications,” Analog Integr. Circuits Signal Process., vol. 8, pp. 83–114, Jul. 1995.
[37] M. Sharifkhani and M. Sachdev, “SRAM cell stability: A dynamic perspective,” IEEE J. Solid-State Circuits, vol. 44, pp. 609–619, 2009.
[38] J. Wang, S. Nalam, and B. H. Calhoun, “Analyzing static and dynamic write margin for nanometer SRAMs,” in Proc. 13th Int. Symp. Low Power Electronics and Design (ISLPED 2008), 2008, pp. 129–134.
[39] A. Karandikar and K. K. Parhi, “Low power SRAM design using hierarchical divided bit-line approach,” in Proc. Int. Conf. Computer Design: VLSI in Computers and Processors (ICCD ’98), 1998, pp. 82–88.

Adam Teman (S’10) received the B.Sc. degree in electrical engineering from Ben-Gurion University, Be’er Sheva, Israel, in 2006. He worked as a Design Engineer at Marvell Semiconductors from 2006 to 2007, with an emphasis on physical implementation. He completed the M.Sc. degree at Ben-Gurion University in 2011. He is currently pursuing the Ph.D. degree under Dr. Alexander Fish as part of the Low Power Circuits and Systems (LPC&S) Lab in Ben-Gurion University’s VLSI Systems Center. Mr. Teman’s research interests include low-voltage digital design, energy-efficient SRAM and Flash memory arrays, low-power CMOS image sensors and low-power design techniques for digital and analog VLSI chips. He has authored 11 scientific papers and two patent applications, and has presented excerpts from his research at a number of international conferences.
In 2010, Mr. Teman was honored with the Electrical Engineering Department’s “Teaching Excellence” recognition at Ben-Gurion University. He is a recipient of the Kreitman Foundation Fellowship for Doctoral Studies and received the Yizhak Ben-Ya’akov HaCohen Prize in 2010.

Lidor Pergament received the B.Sc. degree in electrical engineering from Ben-Gurion University, Be’er Sheva, Israel, in 2011. As part of his senior project, he worked on low-voltage/low-power SRAM design and culminated his work with the fabrication of the first 40 nm test chip by an academic group in Israel, two scientific papers, and two patent applications. Currently he is a Design and Verification Engineer with Mellanox Technologies, Tel Aviv, Israel.
Mr. Pergament’s senior project earned him an award of merit for outstanding projects in the BGU Department of Electrical Engineering for the 2009–2010 academic year.

Omer Cohen received the B.Sc. degree in electrical engineering from Ben-Gurion University, Be’er Sheva, Israel, in 2011. As part of his senior project, he worked on low-voltage/low-power SRAM design and culminated his work with the fabrication of the first 40 nm test chip by an academic group in Israel, two scientific papers, and two patent applications. Currently he is working as a Design and Verification Engineer at Marvell Semiconductors, Petah Tikva, Israel.
Mr. Cohen’s senior project earned him an award of merit for outstanding projects in the BGU Department of Electrical Engineering for the 2009–2010 academic year.

Alexander Fish (S’04–M’06) received the B.Sc. degree in electrical engineering from the Technion, Israel Institute of Technology, Haifa, Israel, in 1999. He completed the M.Sc. degree in 2002 and the Ph.D. degree (summa cum laude) in 2006, respectively, at Ben-Gurion University in Israel.
He was a postdoctoral fellow in the ATIPS Laboratory at the University of Calgary, Canada, from 2006 to 2008. In 2008, he joined Ben-Gurion University, Israel, as a faculty member in the Electrical and Computer Engineering Department. There he founded the LPCAS Laboratory, specializing in low-power circuits and systems. His research interests include low-voltage digital design, energy-efficient SRAM and Flash memory arrays, low-power CMOS image sensors and low-power design techniques for digital and analog VLSI chips. He has authored over 60 scientific papers and patent applications. He has also published two book chapters.
Dr. Fish was a co-author of two papers that won the Best Paper Finalist awards at ICECS’04 and ISCAS’05 conferences. He also received the Young Innovator Award for Outstanding Achievements in the field of Information Theories and Applications by ITHEA in 2005. In 2006, he was honored with the Engineering Faculty Dean “Teaching Excellence” recognition at Ben-Gurion University. He serves as Editor in Chief for the MDPI Journal of Low Power Electronics and Applications (JLPEA) and as an Associate Editor for the IEEE SENSORS JOURNAL. He was a co-organizer of special sessions on “smart” CMOS Image Sensors at IEEE Sensors Conference 2007, on low-power “Smart” Image Sensors and Beyond at the IEEE ISCAS 2008 and on Design Methodologies for Advanced Ultra Low Power Sensor and Memory Arrays at the IEEE Sensors Conference 2009.

3.3 A Minimum Leakage 400 mV Quasi-Static RAM (QSRAM) Bitcell

The following paper was recently published in the Microelectronics Journal (MEJ), published by Elsevier [49]. A brief introduction to this design was previously published in the inaugural issue of the open-access MDPI Journal of Low Power Electronics and Applications (JLPEA) [63]. The paper provided here is the complete extended version, including an in-depth stability analysis of the bitcell and test-chip measurement results.

Microelectronics Journal 44 (2013) 236–247


Functionality and stability analysis of a 400 mV quasi-static RAM (QSRAM) bitcell

Adam Teman a,*, Anatoli Mordakhay b, Alexander Fish a,b

a Ben-Gurion University of the Negev, Electrical and Computer Engineering, PO Box 653, 84105 Be’er Sheva, Israel
b Bar-Ilan University, Faculty of Engineering, Ramat Gan, Israel 52900

Article info

Article history: Received 20 September 2012; Received in revised form 10 December 2012; Accepted 11 December 2012; Available online 19 January 2013

Keywords: CMOS memory integrated circuits; SRAM; Leakage suppression; Ultra low power; Sub-threshold SRAM

Abstract

The development of low-voltage SRAM bitcells with ultra-low static power consumption has become a primary focus of memory design in recent years. The analysis of these bitcells requires the evaluation of dynamic noise margin metrics in addition to the traditional static noise margins. In this paper, we extend the presentation of our recently proposed quasi-static RAM (QSRAM) cell that employs an aggressive internal feedback technique for leakage suppression. In addition to the presentation of the QSRAM circuit topology and operation, a broad stability analysis of the cell is introduced, proving the functionality and bi-stability of the bitcell. Many of the recently proposed dynamic stability metrics used in this analysis have been demonstrated on standard SRAM bitcells; however, this is one of the first times these metrics have been used to analyze the functionality of an alternative implementation. Functionality of the proposed bitcell is shown for a sub-threshold 400 mV supply voltage, providing a typical leakage reduction of 21X–45X as compared to a standard two-port bitcell operating at its nominal voltage. An 8 kb QSRAM array was implemented and fabricated in a commercial low-power 40 nm process demonstrating full functionality and ultra-low power consumption under a sub-threshold 400 mV supply.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

The continuous rise in leakage power of on-chip SRAM arrays is the main factor behind recent efforts to reduce the operating voltage of these memories. Lowering the supply voltage results in an aggressive reduction of both sub-threshold and gate leakage currents. However, lower supply voltages also cause the noise margins to decrease, leading to an inevitable degradation of robustness. Designing robust, low voltage circuits is further complicated by the large variations in circuit behavior that are caused by high variability in the fabrication process. Ratioed circuits, such as the standard six-transistor (6T) SRAM cell (Fig. 1a), are even more susceptible to these fluctuations, as device drive strengths can vary by as much as three orders of magnitude at sub-threshold voltages in modern nano-scaled processes [1]. Both theoretical and empirical analyses have shown that 6T SRAM cells fabricated in sub-90 nm technologies are limited to super-threshold minimum operating voltages (VDDmin) [2,3]. These limitations occur during read and write operations, when the drive strength ratios determine circuit functionality. Accordingly, many research groups have shifted their focus to the development of alternative bitcell topologies and peripheral circuit techniques to enable the operation of on-chip SRAM arrays at ultra-low operating voltages, deep into the sub-threshold region [3–15,36].

The traditional method for measuring the stability of an SRAM cell is with the well-known static noise margin (SNM) metric [16]. This metric ensures data retention in the presence of a pair of serial voltage noise sources at the bitcell’s internal data nodes. These sources are applied with opposite polarities to represent a worst case situation, as shown in Fig. 1b. Write and read stability are tested in a similar fashion, measuring the ability of the bitcell to change or retain its state, respectively, following an infinitely long access pulse. In older process technologies, obtaining sufficient margins was achieved solely based on device sizing, as Ion/Ioff ratios were large and process variations were limited. However, it is increasingly more difficult to meet stability requirements at scaled down processes, due to the increased device fluctuations that are inherent to these technologies. In addition, the static write margin (WM) criterion has been found to be overly optimistic, due to the finite duration of write access operations, as defined by the system’s operating frequency. Device variations are further emphasized at low operating voltages, since the drive strengths of the devices are more severely affected by threshold voltage (VT) fluctuations [3]. Accordingly, recent years have seen increased research into the analysis of the dynamic operation and stability of SRAM cells [17–23,36].

* Corresponding author. Tel.: +972 8 647 7155. E-mail address: [email protected] (A. Teman).

0026-2692/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.mejo.2012.12.005

Fig. 1. (a) Schematic of standard 6T bitcell, including biases and leakage currents in the hold ‘1’ state. Isub and Igate refer to sub-threshold and gate leakage, respectively. (b) Setup for traditional SNM measurement.

Contrary to a circuit’s static stability, which was relatively simple to define and measure, dynamic stability requires a more comprehensive analysis, based on several metrics and visualizations. Recent publications discuss several of these techniques, such as: state-space analysis and visualization [17,21,23,37]; separatrix extraction [22–24]; loop gain analysis [21]; various definitions of dynamic noise margins (DNM) [20,22,25]; and N-curve analysis [26], which is actually a re-evaluation of SNM criteria. These metrics provide the designer with a toolbox to accurately evaluate and optimize the operation of a bitcell, according to the design specifications and frequency requirements. This helps prevent the costly over-design caused by non-realistic hold and read SNM metrics and the potential failures due to the optimistic write SNM metrics [20]. Accordingly, it is clear that the development of novel, alternative bitcell topologies for ultra-low power consumption requires a thorough stability analysis.

We recently proposed the 9T quasi-static RAM (QSRAM) bitcell [11] as a minimum leakage topology for very aggressive power reduction. This topology utilizes internal feedback from the data stored in the bitcell to cut off the power supply and reduce the leakage current through the cell. Similar to all of the other low-voltage alternative topologies, this cell trades off area and noise margins to achieve ultra-low power consumption. However, as opposed to most of the other options, this topology modifies the internal cross-coupled latch structure of the standard bitcell, which leads to speculation about its stability. In this paper, for the first time, we use advanced dynamic and static stability analysis to show that the QSRAM bitcell is both functional and robust, despite its non-conservative approach. This analysis was performed in a TSMC 40 nm LP process with a sub-threshold supply voltage of 400 mV. An 8 kbit QSRAM array was fabricated as part of a 40 nm test chip and showed full functionality at 400 mV, operating at 1 MHz with an ultra-low leakage figure between 21 nW and 45 nW, depending on the stored data.

The rest of this paper is constructed as follows: an extensive description of the QSRAM bitcell topology, operation, and leakage power is given in Section 2; Section 3 provides an in-depth static and dynamic stability analysis of the 40 nm LP implementation; test chip architecture and post-silicon measurements are presented in Section 4; and Section 5 concludes the paper.

2. The 9T quasi-static RAM bitcell

The strong positive feedback, inherent to the cross-coupled inverter structure of a standard latch, is widely manipulated to provide the robust bi-stability required by an SRAM bitcell. This structure has been the basis for almost all static memory circuits proposed in the past three decades. Most importantly, this structure comprises the core of the standard 6T SRAM bitcell, shown in Fig. 1a. Traditionally, the static power dissipation of this structure was considered to be very low, or even negligible, as in both stable states, the path between the supply rails is separated by a cut-off device. However, with the ever-increasing leakage currents associated with process (and threshold voltage) scaling, the static power consumption of this structure has become substantial. The most common method for reducing these leakage currents is to lower the supply voltage; but in most cases, the best case power reduction of alternative, low-voltage bitcells is equivalent to that of the 6T bitcell operated at a similar supply.

The novel QSRAM bitcell challenges the convention described above by changing the internal structure of the bitcell. In this way, the leakage paths are significantly disrupted and the static power consumption of the cell is reduced. Interestingly, the topology proposed below actually displays higher relative noise margins as the supply voltage is lowered; this makes it an attractive candidate for a minimal leakage, low-voltage SRAM alternative.

2.1. QSRAM topology

Fig. 2. Schematic of the 9T QSRAM bitcell.

At first glance, the 9T QSRAM bitcell, shown in Fig. 2, resembles the well-known dual-port 8T bitcell, often used as a base structure for low-voltage SRAM operation [3,7,27]. However, the additional transistor (M9) gates the power supply to provide additional serial resistance in the cell’s leakage path, resulting in reduced static power consumption and robust low-voltage write-ability. The gate of this device is connected to the inverted data node (QB) in a feedback loop, changing the characteristics of this device according to the value of the stored data. In this way, the QSRAM cell looks similar to our previously proposed supply feedback SRAM (SF-SRAM) cell [7], but its functionality is very different, as will become apparent in the following sections.

In addition to the low-threshold (LVT) feedback device (M9), the cell core comprises a pair of high-threshold (HVT) nMOS pull-down devices (M1, M4); a pair of standard-threshold (SVT) pMOS pull-up devices (M3, M6); and a pair of SVT nMOS access transistors (M2, M5). An additional two nMOS devices (M7, M8) create a single-ended read buffer, similar to that employed by the dual-port 8T bitcell. The gate of M7 is driven by the inverted data node (QB) and is completely independent of the non-inverted data node (Q); this is an essential observation to understand the functionality of the QSRAM bitcell. Since the read-out is done exclusively according to the level stored at QB, the level at Q has no influence on read functionality.

It should be noted here that the QSRAM cell, as presented in this manuscript, operates with standard peripheral circuits as required for the operation of a dual-port 8T bitcell. These include a pair of word lines for write and read operations (WWL and RWL); a pair of bitlines for write operations (BL and BLB); and a single-ended bitline for read operations (RBL). The simulations and measurements presented hereafter were achieved without the employment of non-standard techniques, such as overdrive or negative biasing. However, these methods, as well as other previously proposed techniques, could easily be incorporated with this topology for improved performance.

2.2. The hold ‘1’ state

Fig. 3. (a) Currents and biases under a hold ‘1’ state. (b) Monte Carlo distribution of the steady state voltages at node QB in the hold ‘1’ state for a 400 mV supply. Note that the steady state voltages are given in mV.

Following a write ‘1’ operation (as will be described later), Q is charged to a high level, while QB is fully discharged to ground. The biases applied to the bitcell at this state and the resulting leakage currents are illustrated in Fig. 3a. The gate bias (VGS) of M4 is initially much higher than that of M1; M4 is conducting, while M1 is firmly cut-off, ensuring that QB stays discharged. The discharged gate bias of M3 (VG3 = VQB = 0) causes a charge share between Q and VVDD, resulting in a high level at VVDD. At this point, the feedback of QB back to M9 creates a negative VGS (VGS9 = VQB − VVDD < 0), completely cutting off the power supply to the cell, and essentially floating the high data node (Q). The stable state of Q is set according to the leakage currents through the left side cut-off devices (M2, M9, and M1; the leakage through M6 is negligible); the aforementioned device threshold implants ensure that a discharged voltage at QB is maintained, providing the required gate voltage at M7 for a read ‘1’ operation. This is shown through Monte Carlo simulation of the steady state voltage at node QB, as plotted in Fig. 3b. This figure clearly shows that QB is fully discharged throughout the distribution, providing a robust readout under process variations and device mismatch.

2.3. The hold ‘0’ state

Fig. 4. (a) Currents and biases under a hold ‘0’ state. (b) Monte Carlo distribution of the steady state voltages at node QB in the hold ‘0’ state for a 400 mV supply. Note that the steady state voltages are given in mV.

Contrary to many of the other SRAM bitcell implementations, the QSRAM is an asymmetric circuit, requiring a separate discussion for each of its stable states. An initial observation may lead to the assumption that the hold ‘0’ state is quite similar to the corresponding state of the 6T cell, as M9 is biased with a high voltage and therefore conducting. A secondary observation could make one believe that the cell is completely non-functional; assuming that passing the supply through M9 leads to a threshold drop, then this lower voltage is subsequently passed through M6 to QB. This would lead to an additional threshold drop, and so on, until QB would eventually fully discharge. However, an even closer look shows that neither of these observations is correct.

Starting with a write ‘0’ operation, Q is completely discharged, while QB is charged to a high level¹. At this point, assuming VVDD is at some median level, M3 is cut-off (VGS3 = VQB − VVDD > 0) and the source of M6 is QB, such that QB charges VVDD through M6. Ultimately, VVDD = VQB, and therefore VGS9 = 0, cutting off M9. The voltage of QB following a write ‘0’ operation is, therefore, slightly lower than VDD. With the access transistors (M2 and M5) turned off, the leakage ratio (I9 + I5)/I4 sets the final state of this node. Sizing and VT implants are utilized to ensure that this level is as close as possible to VDD to ensure a correct readout according to the QB voltage sampled by M7 (VGS7 = VQB). The Monte Carlo distribution of the steady state voltage at node QB is plotted for this state in Fig. 4b. The figure clearly shows that QB retains a high level throughout the distribution, providing a robust readout under process variations and device mismatch. Note that corners that cause a drop in the stable state voltage of QB will incur a slightly slower readout, as the overdrive voltage of M7 will be lower. However, this penalty is dampened at low voltages, as the current through M7 and M8 must be equalized, trading off the values of overdrive and VDS as a function of the transient voltage at the node connecting the two.

¹ For super-threshold operation, we would claim that this would be a threshold drop under VDD; however, the main operating region of this cell is in the sub-threshold region, where the drop off is not well defined, but is not negligible.
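The statement that a leakage ratio sets the steady-state level of QB in the hold ‘0’ state can be illustrated with a small numerical sketch. This is not the simulation flow used in the paper: it lumps the charging paths (I9 + I5) into one cut-off device tied to VDD and the discharging path (I4) into one cut-off device tied to ground, and all model parameters (`i0`, `eta`, `n`, the thermal voltage) are hypothetical values chosen only for illustration.

```python
import math

V_THERMAL = 0.0259  # thermal voltage near 300 K, in volts (assumed)
VDD = 0.4           # supply voltage discussed in the text, in volts


def off_current(i0, vds, eta=0.1, n=1.5):
    """Leakage of a cut-off device (VGS = 0) with drain-source voltage vds.

    i0, eta and n are hypothetical fitting parameters, not extracted values.
    """
    return i0 * math.exp(eta * vds / (n * V_THERMAL)) * (1.0 - math.exp(-vds / V_THERMAL))


def steady_state_vqb(i0_up, i0_down, vdd=VDD, tol=1e-9):
    """Bisect for the QB voltage at which charging and discharging leakage balance."""
    lo, hi = 0.0, vdd
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        net = off_current(i0_up, vdd - mid) - off_current(i0_down, mid)
        if net > 0:
            lo = mid  # net current still charges QB, so the balance point is higher
        else:
            hi = mid  # net current discharges QB, so the balance point is lower
    return 0.5 * (lo + hi)


# Sweep the (hypothetical) strength ratio of the charging path to the discharging path.
for ratio in (1, 10, 100):
    print(f"I0_up/I0_down = {ratio:>3}: VQB = {1e3 * steady_state_vqb(ratio, 1.0):.1f} mV")
```

In this toy model, equal leakage strengths settle QB at mid-supply, and strengthening the charging path relative to the discharging path moves the balance point toward VDD, which mirrors the role the text assigns to sizing and VT implants.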

2.4. Write operations

The QSRAM bitcell was designed to minimize leakage; however, as a positive side-effect, it also demonstrates a very robust write operation. Considering that write margin is the voltage limiting factor for the majority of the single-ended read SRAM cells [3,4,7,27], this presents quite an advantage during low-voltage operation. Writes are usually carried out by discharging the high internal data node, requiring that the pull-down nMOS devices overcome the feedback of the pull-up pMOS devices. Ensuring this is often a great challenge, as under process variations, the drive strengths of nano-scaled devices can fluctuate by as much as three orders of magnitude [1] at sub-threshold voltages. For the 40 nm technology we used, nMOS devices at 400 mV can be as much as 52X stronger than pMOS devices and as much as 16X weaker than their pMOS counterparts. Therefore, in order to lower VDDmin, many of the low voltage SRAMs incorporate specialized peripheral write-assist techniques that either strengthen the pull-down devices or weaken the pull-up ones. Weakening the pull-up is an inherent property of the QSRAM design, as the feedback device (M9) adds additional serial resistance to the supply path. In fact, M9 is cut-off during both hold states, presenting virtually no contention to the bitline that is trying to pull down its adjacent node. Accordingly, the QSRAM retains write-ability at sub-threshold voltages across process corners and under local mismatch without the need for specialized write-assist circuits.

A closer look into the QSRAM write operations will show a slight difference between the write ‘0’ and write ‘1’ actions. Starting with the write ‘0’ operation, the non-inverted bitline (BL) is discharged to ground, while the inverted bitline (BLB) is charged to VDD. Subsequently, the write word line (WWL) is asserted, initiating the write operation. Assuming that the accessed bitcell was holding a ‘1’, the voltages at node Q and VVDD are at a median level, VQ, and M9 is strongly cut-off with a negative overdrive voltage (VGS9 = VQB − VQ = −VQ < 0). At this point, M2 is turned on with a strong overdrive (VGS2 = VDD), easily discharging Q, as it needs only to overcome the very low leakage through the serial connection of M9 and M3. Even at the problematic slow nMOS–fast pMOS (SF) corner, the operation is easily achieved, as both drive networks include nMOS devices.

Whereas for most SRAM designs, observing a successful discharge of one of the nodes is enough to ensure a write operation, the QSRAM doesn’t incur the same positive feedback; therefore, the behavior of the other node must be considered, as well. In this case, QB is charged through M5 with a weak overdrive voltage (VGS5 = VDD − VQB). This is achieved without the contention of a pull-down node, as once Q is discharged, M4 is cut-off. At this point, M6 is conducting (VSG6 = VQB − VQ ≈ VQB), so VVDD is charged up to VQB, as defined by the uncontended charging operation through M5. At the end of the write operation, the final state of VQB is slightly lower than VDD and can fluctuate according to the leakage ratios of the cell’s pull-up and pull-down networks. This fluctuation can be seen in the distributions of Fig. 4b, above.

Continuing on to the write ‘1’ operation, the biases of the bitlines are inverted (BL = VDD, BLB = 0) and M5 is strongly overdriven to discharge QB. Assuming Q was discharged and QB was high, VVDD = VQB, as previously described, then M9 is cut-off with VGS9 = 0. Therefore, as in the write ‘0’ operation, the access transistor meets very little contention from the pull-up network (made up of M9 and M6). Moreover, as VQB drops, the overdrive of M9 becomes (at least temporarily) negative, further weakening the pull-up current. For this operation, observing the QB side is almost sufficient, as this is the side that controls the cell’s readout value; all that is required from the Q node is that it is charged higher than QB, which is easily accomplished through M2.

Fig. 5. RBL discharge time for the two cell states, i.e., holding a ‘1’ and holding a ‘0’. The plot covers 5 k Monte Carlo samples at VDD = 400 mV.
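The two strength figures quoted above for the 40 nm process (nMOS up to 52X stronger and up to 16X weaker than pMOS at 400 mV) can be related to the "three orders of magnitude" claim with a short calculation. The sketch below is only one way to read those numbers: it assumes that both figures describe the same nMOS/pMOS strength ratio at opposite extremes, and the constant and function names are illustrative.

```python
import math

# Figures quoted in the text for the 40 nm process at 400 mV.
NMOS_STRONGEST_VS_PMOS = 52.0       # nMOS up to 52X stronger than pMOS
NMOS_WEAKEST_VS_PMOS = 1.0 / 16.0   # nMOS down to 16X weaker than pMOS


def ratio_spread(strongest, weakest):
    """Return the total spread of the nMOS/pMOS strength ratio and its size in decades."""
    spread = strongest / weakest
    return spread, math.log10(spread)


spread, decades = ratio_spread(NMOS_STRONGEST_VS_PMOS, NMOS_WEAKEST_VS_PMOS)
print(f"ratio spread: {spread:.0f}X ({decades:.2f} decades)")
```

Under that reading, the ratio spans 832X from one extreme to the other, a little under three decades, which is why a ratioed write that relies on the pull-down beating the pull-up is hard to guarantee without write assist.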

The final voltage at Q is lower than VDD in accordance with the leakage ratios; this voltage is generally between 40 and 60% of VDD, depending on the process corner and mismatch parameters.

2.5. Read operation

As described above, read operation of the QSRAM cell is achieved in a similar fashion to the 8T read. Fig. 3b and Fig. 4b show the steady state voltage distributions at the QB node, through which the read operation is sensed. In our implementation, we chose a 256 row array with a divided bit-line, such that every 64 bits in a column are connected to the same RBL. In order to correctly read out of the cell, the frequency must be set between the worst case time it takes to read out a ‘0’ (i.e., QB is high with all other cells on the same RBL holding a ‘1’) and the shortest time it takes for leakage currents to discharge the RBL while reading a ‘1’ (i.e., QB is low with all other cells on the same RBL holding a ‘0’). Fig. 5 shows the distribution of the RBL discharge time for 5k Monte Carlo samples of cells holding a ‘1’ and a ‘0’. The plot shows that the time it takes to discharge the RBL due to leakage is substantially higher than the time it takes to discharge the bitline through M7 and M8. Note that the time axis is logarithmic, such that the gap between the two distributions is sufficient for read sensing.

2.6. Leakage power

The hold states of the QSRAM cell display a very non-conventional approach to SRAM design. Rather than having a fully static cross-coupled feedback structure, the stable state is set according to initial conditions and held by leakage currents. While this does make the bitcell more susceptible to injected noise, it retains functionality and provides enhanced leakage suppression over the standard topologies.

The majority of the leakage in a standard 6T SRAM cell is set by sub-threshold leakage through cut-off devices with VGS = 0 and VDS = VDD. Assuming the bitlines are precharged to VDD, the leakage current of the 6T cell in the hold ‘1’ state (as depicted in Fig. 1a) can be estimated as the sum of the currents through M1, M5, and M6. Neglecting gate leakages, which will be referred to separately, the leakage can be written according to the EKV based sub-threshold current model [28] as²:

\[ I_{6T} = I_1 + I_5 + I_6 = 2 I_{0n}\, e^{(\eta_n V_{DD} - V_{Tn})/(n_n v_T)} + I_{0p}\, e^{(\eta_p V_{DD} - V_{Tp})/(n_p v_T)} \qquad (1) \]

where n and p indices indicate nMOS and pMOS devices, respectively; I0 is the sub-threshold current coefficient; VT is the device threshold voltage; η is the drain-induced barrier lowering (DIBL) coefficient; n is the sub-threshold swing coefficient; and vT is the thermal voltage.

For the QSRAM in the hold ‘1’ state, the sub-threshold leakage can be estimated as the sum of the currents through M9 and the access transistors, M2 and M5³. Assuming that the steady state voltage of Q is a median voltage VQ, and that VVDD equates to VQ through M3, we can write:

\[ I_{9T\_hold1} = I_9 + I_2 + I_5 = I_{0lvt}\, e^{(\eta_{lvt}(V_{DD} - V_Q) - V_Q - V_{Tlvt})/(n_{lvt} v_T)} + I_{0n}\, e^{(\eta_n (V_{DD} - V_Q) - V_Q - V_{Tn})/(n_n v_T)} + I_{0n}\, e^{(\eta_n V_{DD} - V_{Tn})/(n_n v_T)} \qquad (2) \]

where the lvt index indicates a low-threshold nMOS device. In comparison to (1), the exponentials in the currents of M9 and M2 have additional negative values (−VQ and −ηVQ) that make the QSRAM leakage substantially lower than that of the 6T bitcell. This is due to the negative VGS of the feedback device and the lower DIBL current through both devices. If gate leakages are taken into account, a small additional reduction is achieved. The 6T presents two primary gate leakages (through M3 and M4 in the hold ‘1’ state), while the QSRAM in the hold ‘1’ state has three contributors (M3, M4 and M6); however, the gate leakage is exponentially dependent on the potential that falls across the gate oxide, such that the sum is lower for the QSRAM cell.

A similar observation can be applied to the QSRAM in the hold ‘0’ state as shown in Fig. 4a. In this case, the current through M5 is negligible, so the total sub-threshold leakage can be estimated as the sum of the currents through M9 and M2. Assuming that Q is fully discharged and the steady state voltage of QB is VQB (which is also the bias at VVDD) we can write:

\[ I_{9T\_hold0} = I_9 + I_2 = I_{0lvt}\, e^{(\eta_{lvt}(V_{DD} - V_{QB}) - V_{Tlvt})/(n_{lvt} v_T)} \left(1 - e^{-(V_{DD} - V_{QB})/v_T}\right) + I_{0n}\, e^{(\eta_n V_{DD} - V_{Tn})/(n_n v_T)} \qquad (3) \]

The current described by (3) is usually slightly larger than that of (2), as the exponent of I9 in (2) is more negative; but this depends on the ratio between VQ and VQB. In fact, the higher the steady state of VQB is, the lower the leakage in the hold ‘0’ state. However, this increases the gate leakage through M9 and M6 in the hold ‘0’ state, and ultimately, these factors can trade off. In particular, this is the case for low-K processes, like the 40 nm LP technology that we used.

To summarize the leakage current dissipation, Fig. 6 shows the distribution of the leakages of the two states of the QSRAM as compared to a standard 8T bitcell at 400 mV. The figure clearly shows the significant leakage suppression of the QSRAM cell in the hold ‘1’ state, with an average reduction of 3.7X over the standard 8T cell operated at the same voltage. The hold ‘0’ state also provides a vast improvement with an average reduction of 1.8X. It should be noted that this reduction is on top of the exponential reduction achieved by operating the array at such a low voltage, under which the standard implementations are non-functional. When compared to an 8T bitcell at its nominal operating voltage (1.1 V), the average leakage reduction is 45X and 21X for the hold ‘1’ and hold ‘0’ states, respectively.

² Note that this estimation assumes that VDD > 6vT.
³ This is neglecting the leakage through the read buffer (M7, M8) for generality in comparison to the 8T standard bitcell; however, this leakage could easily be added to the estimation.

Fig. 6. Monte Carlo statistical distribution of the ratio of leakage current between a standard 8T SRAM bitcell and the proposed 9T QSRAM bitcell. The figure includes ratioed plots for both the hold ‘1’ and hold ‘0’ stable states (with VDD = 400 mV).
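To make the comparison between the leakage estimates (1), (2) and (3) concrete, the following sketch evaluates them numerically. It is an illustration only: every device parameter below (`i0`, `vt`, `eta`, `n`, the thermal voltage) is a hypothetical placeholder rather than a value extracted from the 40 nm process, pMOS quantities are treated as magnitudes, and the drain factor (1 − e^(−VDS/vT)) that appears explicitly in (3) is kept for all devices, so the expressions reduce to (1) and (2) when VDS is several times vT.

```python
import math
from dataclasses import dataclass

V_THERMAL = 0.0259  # thermal voltage near 300 K, in volts (assumed)


@dataclass(frozen=True)
class Device:
    """Sub-threshold model parameters; all values used below are hypothetical."""
    i0: float          # sub-threshold current coefficient, in amperes
    vt: float          # threshold voltage magnitude, in volts
    eta: float = 0.1   # DIBL coefficient
    n: float = 1.5     # sub-threshold swing coefficient


NMOS = Device(i0=1e-7, vt=0.45)
PMOS = Device(i0=1e-7, vt=0.45)
LVT = Device(i0=1e-7, vt=0.40)   # low-threshold nMOS feedback device (M9)


def i_sub(dev, vgs, vds):
    """Sub-threshold current for the given gate-source and drain-source biases."""
    exponent = (vgs - dev.vt + dev.eta * vds) / (dev.n * V_THERMAL)
    return dev.i0 * math.exp(exponent) * (1.0 - math.exp(-vds / V_THERMAL))


def leak_6t(vdd):
    """Estimate (1): two cut-off nMOS devices and one cut-off pMOS device at VDS = VDD."""
    return 2.0 * i_sub(NMOS, 0.0, vdd) + i_sub(PMOS, 0.0, vdd)


def leak_9t_hold1(vdd, vq):
    """Estimate (2): M9 and M2 see VGS = -VQ and VDS = VDD - VQ; M5 sees VDS = VDD."""
    return i_sub(LVT, -vq, vdd - vq) + i_sub(NMOS, -vq, vdd - vq) + i_sub(NMOS, 0.0, vdd)


def leak_9t_hold0(vdd, vqb):
    """Estimate (3): M9 sees VGS = 0 and VDS = VDD - VQB; M2 sees VDS = VDD."""
    return i_sub(LVT, 0.0, vdd - vqb) + i_sub(NMOS, 0.0, vdd)


vdd = 0.4
print(f"6T / 9T hold '1': {leak_6t(vdd) / leak_9t_hold1(vdd, vq=0.2):.2f}X")
print(f"6T / 9T hold '0': {leak_6t(vdd) / leak_9t_hold0(vdd, vqb=0.38):.2f}X")
```

With these placeholder parameters the ordering described in the text is reproduced (hold ‘1’ leaks least, hold ‘0’ somewhat more, the 6T estimate most), and raising `vqb` toward VDD lowers the hold ‘0’ estimate. The absolute ratios depend entirely on the assumed parameters and should not be compared with the measured 3.7X and 1.8X figures.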

3. Stability analysis of the QSRAM bitcell

Section 2 presented the QSRAM bitcell and its advantages; it provides low leakage and retains write-ability at low voltages with standard peripheral circuitry. However, any experienced SRAM designer will immediately contemplate the stability of this non-conventional approach. In this section, we will provide a broad stability analysis of the QSRAM bitcell, through a large array of techniques and metrics.

3.1. Standard static noise margin metrics

The de-facto standard for measuring stability is the static noise margin, originally proposed by Hill [29]. This method plots the voltage transfer characteristics (VTC) of the inverters that comprise the bitcell core, finding the largest static disturbance that the cell can withstand without losing its state. Seevinck et al. [16] proposed a method for efficiently measuring this metric using a single DC sweep simulation. For the hold state, the SNM metric assumes an infinite, serial voltage noise source (Fig. 1b), while for read and write operations, it assumes an infinitely long access pulse. These assumptions lead to a pessimistic evaluation of read and hold operations, as the duration of the actual noise is finite. In addition, a serial voltage noise source is non-physical and ignores the temporal pattern of injected noise and circuit dynamics [21,30]. For older technologies, this pessimism was reasonable, as process variations were not as severe, and the resulting bitcell sizes scaled with Moore’s Law [31]. However, for nano-scaled technologies with extreme process variations and local device mismatch, meeting large SNM requirements can lead to overdesign, and therefore, impede array size scaling. On the other hand, when discussing write operations, the SNM metric is an overly optimistic evaluation. Since the actual access pulse is finite, the cell won’t reach its intended state if the pulse is too short. Therefore, by exclusively measuring the SNM metric for write operations, potential write failures could be overlooked. To conclude, it has been shown that for read and hold operations, the SNM metric is a sufficient, but not a necessary condition for stability, whereas for write operations, it is insufficient [21].

Measuring the SNM metrics of the QSRAM cell presents an additional aspect. This cell is quasi-static, rather than fully static, as its steady state is set by leakage ratios, rather than strong positive feedback. This leads to rather depleted SNMs for hold (and read, which is equivalent to hold for bitcells employing a non-penetrative readout buffer [3]). However, as we will show in the following subsections, the dynamic stability of the QSRAM cell is higher than would be expected based on its SNM. But first, as the initial stage of our stability analysis, we will present the SNM of the QSRAM cell.

Fig. 7. (a) Butterfly curves of QSRAM cell at 1.1 V and 400 mV. Note that the graph has been reduced from the full 0–1.1 V range for emphasis. (b) Hold SNM distribution of QSRAM cell for 10 k samples at 400 mV.

Fig. 7a shows the “butterfly curves” [16] of the QSRAM cell for hold/read under typical conditions with both a 1.1 V and a 400 mV supply. The DC transfer functions from both Q to QB and QB to Q provide the expected inverting function; albeit, with a relatively low switching threshold, as the conducting nMOS has less contention when pulling down against the gated supply. A large difference is displayed between the SNM of the hold ‘0’ vs. hold ‘1’ states, as is expected due to the asymmetric nature of the circuit. One of the interesting aspects of these plots is that despite the extreme reduction of supply voltage (from 1.1 V to 400 mV), the SNM reduction is relatively small. In fact, for a 63% decrease in supply voltage, the SNM is only about 25% lower. This is another justification for designating this topology for low-voltage operation.

Fig. 7b shows the Monte Carlo statistical distribution of the SNM for 5k points at VDD = 400 mV. The median value of the hold ‘0’ SNM is on the order of 100 mV, whereas for hold ‘1’ it is on the order of 70 mV. These values are sufficient, yet the standard deviations are quite large (23 mV for hold ‘0’ and 27 mV for hold ‘1’), such that the 6σ distribution would seem to fail. However, the implications of this are much less significant under a time limited noise source. To ensure a correct read, it is essential that QB remains low, and since M9 gates the supply and dampens the positive feedback that causes the cell to flip, a noise source with a limited duration would usually be discharged rather than flip the cell. This will be shown in the following subsections.

Fig. 8. Write margin for QSRAM vs. 8T SRAM at standard process corners. Note that the 8T cell loses write-ability (i.e., WSNM < 0) at the SF corner.

The SNM for write operations is very robust, as expected, considering the descriptions given in Section 2. Fig. 8 shows the write margin for the write ‘0’ and write ‘1’ operations as compared to a standard 8T cell with a 400 mV supply. The margins were measured using a modified version of the bitline sweep method [25]. In the standard method, a voltage noise source is only applied to the ‘0’ bitline, as the standard bitcell is designed to be insensitive to the ‘1’ bitline (in order to maintain the read constraint). However, for an asymmetric cell, and for the QSRAM in particular, this argument does not hold up, and so a noise source was added to both bitlines. Note that this methodology should also be used when measuring an 8T cell sized without consideration of half-select scenarios, such as the reference cell used here, as the influence of the ‘1’ bitline should be taken into consideration. Fig. 8 shows the margins across standard process corners. The QSRAM cell displays substantially higher margins at all corners, other than the best-case fast nMOS–slow pMOS (FS) corner, where the margins are comparable. At this corner, the access transistors of the 8T cell are much stronger than the pull-up pMOS, ensuring a successful write. On the other hand, the 8T loses write-ability at the SF corner, whereas the QSRAM maintains robust write operations under these conditions.

Fig. 9. QSRAM non-linear system representation. Note that gate currents are neglected in this representation.
Here, the internal feedback weakens the pull-up, which for the 8T cell is stronger than the pull-down through the access transistors, impeding the write margins.
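The 6σ concern raised for the hold SNM distribution of Fig. 7b, and the 63% supply reduction quoted for Fig. 7a, can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes the SNM distribution is Gaussian and uses the reported median as its mean, neither of which is stated in the text. The function name `six_sigma_margin` is a hypothetical helper.

```python
# Values quoted in Section 3.1 (all in mV unless noted).
VDD_NOMINAL_V = 1.1
VDD_LOW_V = 0.4

# Fractional supply reduction when moving from 1.1 V to 400 mV.
supply_drop = (VDD_NOMINAL_V - VDD_LOW_V) / VDD_NOMINAL_V


def six_sigma_margin(median_mv, sigma_mv, n_sigma=6):
    """SNM left at the n-sigma tail, assuming a Gaussian centered on the median."""
    return median_mv - n_sigma * sigma_mv


hold0_tail_mv = six_sigma_margin(100, 23)  # hold '0': median ~100 mV, sigma 23 mV
hold1_tail_mv = six_sigma_margin(70, 27)   # hold '1': median ~70 mV, sigma 27 mV

print(f"supply reduction: {100 * supply_drop:.1f}%")
print(f"hold '0' SNM at 6 sigma: {hold0_tail_mv} mV")
print(f"hold '1' SNM at 6 sigma: {hold1_tail_mv} mV")
```

Under these assumptions both tails come out negative, which is the sense in which the static 6σ distribution "would seem to fail"; the dynamic analysis of the following subsections is what relaxes this conclusion.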

3.2. State space analysis

Several recent works have shown that the standard 6T SRAM bitcell is a non-linear time-variant system, and have developed models to describe the system behavior using state-space analysis [17,18,21,22,32,37]. A similar model can be developed for the QSRAM cell; however, this topology displays a third internal node (VVDD), in addition to the standard data nodes (Q and QB). The cell’s transient behavior can be modeled as a system of seven voltage controlled-current sources and three capacitors (the total lumped capacitance to ground at the three data nodes), as shown in Fig. 9 (see footnote 4). This model provides the following set of equations:

\begin{equation}
\begin{cases}
C_{Q}\,\dfrac{dV_{Q}}{dt} = I_1(V_{Q},V_{QB}) + I_2(V_{Q}) + I_3(V_{Q},V_{QB},V_{VVDD}) \\[6pt]
C_{QB}\,\dfrac{dV_{QB}}{dt} = I_4(V_{QB},V_{Q}) + I_5(V_{QB}) + I_6(V_{QB},V_{Q},V_{VVDD}) \\[6pt]
C_{VVDD}\,\dfrac{dV_{VVDD}}{dt} = I_9(V_{VVDD}) - I_3(V_{Q},V_{QB},V_{VVDD}) - I_6(V_{QB},V_{Q},V_{VVDD})
\end{cases}
\tag{4}
\end{equation}

where In indicates the voltage dependent current through device n of Fig. 2; Cnode indicates the lumped capacitance to ground at nodes Q, QB and VVDD; and Vnode indicates the transient voltages at nodes Q, QB and VVDD. This system comprises a three-dimensional state vector, V, made up of the node voltages of the three internal nodes (VQ, VQB, VVDD). Interestingly, the circuit settles at two stable points, similar to a standard SRAM cell, with VVDD approximately equal to the higher of the two data nodes. This simplifies the dynamic analysis of the QSRAM topology, as the cell state can be represented on two-dimensional state plots of VQ and VQB, as shown in previous works [21,22,30,32,37] for the standard 6T cell.

Analytically solving the dynamic system of (4) is out of the scope of this manuscript, especially in light of the non-linear multiple parameter dependence of each of the device currents. However, it is clear that state space analysis based on simulation data is necessary to understand the dynamics and behavior of the cell. In the following subsections, we will present the state space map, separatrix, phase portrait, and internal loop gain of the QSRAM bitcell. Note that as far as we know, this is one of the first times such an analysis has been presented for a non-standard (6T or 8T) bitcell.

4 Note that this model excludes gate currents for simplicity; however, these currents may have a significant effect in low-K processes.

Fig. 10. State space of QSRAM cell showing variations of equilibriums and separatrices. The lines represent the separatrices at standard process corners; large symbols indicate equilibriums at standard 3σ process corners; gray dots indicate Monte Carlo samples under global and local variations.

3.3. Separatrix

To consider dynamic stability, it is essential to consider how the state of a bi-stable circuit transitions from one stable equilibrium to the other (i.e., a state flip) under a single event upset (SEU), a read, or a write event [22]. The state space of such a circuit is divided into two stability regions by a stable manifold, better known as the circuit’s “separatrix” [33]. Following an event (e.g., SEU, read, or write access), the circuit will stabilize at one of its two stable states, depending on which side of the separatrix it is on once the event is over. The separatrices of the QSRAM cell at standard process corners are mapped in Fig. 10, providing an

interesting insight into the QSRAM cell (see footnote 5). The figure shows that the separatrix is quite different than that of a standard cross-coupled latch structure, where the manifold is roughly set by the line where Q = QB. For the QSRAM, the region of attraction (ROA) of the hold ‘1’ state is larger than that of the hold ‘0’ state, as the separatrix is slightly above the Q = QB line; this enables the voltage level of Q to vary as long as QB is low. The ‘1’ equilibrium is at the state vector of V1 (VQ, VQB, VVDD) = (0.185 V, 3 mV, 0.186 V), providing a dynamic margin from the separatrix during either a noise event that discharges Q or one that charges QB. The ‘0’ equilibrium, on the other hand, is at V0 = (0 V, 0.382 V, 0.383 V), close to the standard SRAM equilibrium. This equilibrium provides a dynamic margin from the separatrix during either a discharge of QB or a charge of Q.

In addition to the separatrices of the QSRAM cell, Fig. 10 also includes the cell’s stable states at standard process corners, as well as a scatter of 5k Monte Carlo samples of these states. As the separatrix is relatively constant under process variations, the scatter plot provides additional insight to the stability of the cell. All samples distinctively fall onto one side or the other of the separatrix, providing bi-stable ROAs for the circuit. The plot also shows the worst case SF corner with a depleted QB level, as described in Section 2; however, the hold ‘0’ state at this corner maintains a distinct gap away from the separatrix, showing a distinct bi-stable ROA for this corner (see footnote 6).

5 Note that due to the abundance of data on the figure, and since the separatrices for all corners are very close to each other, the key attaching each separatrix to its relevant corner was omitted.

6 Note that this cell includes three internal nodes, rendering a three-dimensional state space analysis. However, since VVDD is connected to either Q or QB through a conducting pMOS device at almost every state, the addition of a third axis to these plots provides very little additional insight at the cost of visual complexity and therefore the third state variable was omitted from these plots.

3.4. Phase portrait

To further comprehend the dynamic behavior of the QSRAM cell it is worthwhile to observe the circuit’s phase portrait plots [30]. These plots show the magnitude and direction of the state space for any given starting point. The hold phase portrait for the QSRAM cell under typical conditions with VDD = 400 mV is shown in Fig. 11. The cell’s two ROAs and the separatrix are easy to observe from within the plot. Any temporal state above the separatrix has a high QB voltage, causing Q to quickly discharge, as shown by the horizontal left-pointing vectors. The state will subsequently progress to the V0 stable point, as the leakage currents replenish QB, while keeping Q low. The opposite attraction occurs under the separatrix; these states have a high Q value, providing a strong QB pull-down through M4. The result is the down-pointing vertical vectors showing a quick discharge of QB. Subsequently, Q will slowly discharge to its equilibrium state of V1. For additional clarity, the circuit’s two equilibriums (V0 and V1) are clearly marked on the figure.

In addition to the hold phase portrait, similar plots can be extracted for the asymmetrical write ‘0’ and write ‘1’ operations, as shown in Fig. 12a and b, respectively. Here we can see that each operation is attracted to a single stable state. The behavior of the write operations can also be observed, corresponding with the descriptions given in Section 2. For the write ‘0’ operation, the Q node is strongly discharged, with the approach to the high QB state much less aggressive. For the write ‘1’ operation, QB is discharged much more abruptly than Q is charged. Both plots show a single stable equilibrium (clearly marked on the figures), as required by a write operation. Note that the stable points are slightly different than those shown in Fig. 11, as the plots are extracted with the access transistors asserted. The cell will eventually stabilize at its equilibriums following the discharge of WWL.

Fig. 11. QSRAM phase portrait in hold. The cell’s equilibriums for the hold ‘1’ (V1) and hold ‘0’ (V0) states are marked. The plot was extracted for typical conditions at VDD = 400 mV.

3.5. Loop gain

An additional observation of a circuit’s stability should be verified using a bi-stable circuit’s small signal loop gain [21]. A loop gain plot can be derived by performing small signal analysis at every arbitrary DC operating point across the state space. For stability, the loop gain plot must show gain lower than unity in the areas of the stable states. This corresponds to the circuit’s ability to recover from a transient interference. If the gain is larger than 0 dB, the interference will amplify, resulting in loss of state (i.e., cell flip). If the loop gain is less than unity, the state will return to the minimum gain point. The QSRAM cell’s loop gain, shown in Fig. 13, is very different than the saddle shape of the standard 6T bit cell’s gain. The unity gain level is emphasized in this figure by the gray plane that dissects the plot and the local minima are marked. For clarity, the topographical contours of the 3-D plot are also shown. It is clear that the two minima are well below unity and reside in the immediate area of the cell’s equilibriums, verifying the cell’s stability. The maximum gain of the cell is achieved in the area of the meta-stable point, corresponding with the high gain of both the feed-forward and feedback networks, as can be seen in Fig. 7a. Additional areas with less than unity gain correspond to the relatively large recovery times under the occurrence of a disturbance that would skew the space to such a region.

3.6. Dynamic noise margin

Combining the observations of the preceding sub-sections, we can now discuss the dynamic noise margins of the QSRAM bitcell. A concrete DNM metric was first proposed by Dong et al. [22] as the difference between the access time in read and/or write (TR and TW, respectively) and the time it takes the cell to cross the separatrix during these events (Tacross,R and Tacross,W, respectively). Specifically, the DNM for read (RDNM) and write (WDNM) are given as:

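\begin{equation}
\begin{aligned}
\mathrm{RDNM} &= T_{across,R} - T_{R} \\
\mathrm{WDNM} &= T_{W} - T_{across,W}
\end{aligned}
\tag{5}
\end{equation}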
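The write margin of (5) can be illustrated on a toy bi-stable system. The sketch below is not the QSRAM model of (4): it assumes a generic symmetric latch built from two idealized tanh inverters, whose separatrix is simply the line Q = QB, and every parameter (`TAU`, `GAIN`, `K_WRITE`) as well as the function names are hypothetical choices made for illustration. A write ‘0’ pulse of width TW is applied to a cell holding ‘1’; the crossing time Tacross,W is read from the simulation, and the cell flips only when WDNM = TW − Tacross,W is positive.

```python
import math

VDD = 0.4      # supply voltage in volts (the 400 mV operating point)
TAU = 1.0      # node time constant in ns (illustrative assumption)
GAIN = 8.0     # steepness of the idealized inverters (illustrative assumption)
K_WRITE = 5.0  # relative strength of the write access path (illustrative assumption)


def inverter(v):
    """Idealized inverting transfer curve swinging between VDD and 0."""
    return 0.5 * VDD * (1.0 - math.tanh(GAIN * (v - 0.5 * VDD) / VDD))


def simulate_write0(t_w, t_end=40.0, dt=0.001):
    """Apply a write-'0' pulse of width t_w (ns) to a toy latch holding '1'.

    Returns (q, qb, t_across): the final node voltages and the first time the
    state crosses the separatrix q == qb (math.inf if it never crosses).
    Integration uses the forward Euler method with step dt (ns).
    """
    q, qb = VDD, 0.0
    t_across = math.inf
    for n in range(int(round(t_end / dt))):
        t = n * dt
        dq = (inverter(qb) - q) / TAU
        dqb = (inverter(q) - qb) / TAU
        if t < t_w:
            # Access devices pull Q toward ground and QB toward VDD.
            dq += K_WRITE * (0.0 - q) / TAU
            dqb += K_WRITE * (VDD - qb) / TAU
        q += dt * dq
        qb += dt * dqb
        if t_across == math.inf and q < qb:
            t_across = t + dt
    return q, qb, t_across


def wdnm(t_w, t_across_w):
    """Write dynamic noise margin of (5): positive means the write succeeds."""
    return t_w - t_across_w


# Measure the crossing time with a pulse that is certainly long enough.
_, _, t_across_w0 = simulate_write0(t_w=5.0)

for t_w in (2.0 * t_across_w0, 0.5 * t_across_w0):
    q, qb, _ = simulate_write0(t_w)
    state = "'0' (write succeeded)" if q < qb else "'1' (write failed)"
    print(f"TW = {t_w:.3f} ns, WDNM = {wdnm(t_w, t_across_w0):+.3f} ns, final state {state}")
```

For a cell with a non-penetrative read, the same bookkeeping gives RDNM = `math.inf`, since the state never approaches the separatrix during a read.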

For cells displaying a non-penetrative read operation, such as the QSRAM cell, RDNM is infinite, as the core of the bitcell is not disturbed. Dynamic stability during a read operation should therefore be considered according to the cell’s dynamic hold margin, as discussed below. WDNM is separate for the write ‘1’ and write ‘0’ operations, and due to its dependence on cell access time (TW), which is controllable, it is more informative to measure the write operations’ Tacross,W characteristics. At 400 mV, both Tacross,W0 and Tacross,W1 were found to be 0.6 ns at nominal conditions. This can be compared to the Tacross,W time of a standard 6T cell, which is 1.4 ns at 400 mV. For the worst case SF process corner, Tacross,W1 is 0.9 ns and Tacross,W0 is a much slower 3.2 ns, but this is compared to the infinite Tacross,W of the 6T cell, as the cell is non-writeable (WSNM < 0) at this corner. This concurs with the discussion of Section 2.

Fig. 13. Loop gain plot of the QSRAM cell from Q to QB. The black circles mark the local minima that correspond to the cell’s stable states. The 0 dB plane is also shown to emphasize convergence where the loop gain is smaller than unity.

Fig. 14. Tcrit as a function of In for the four options of a destructive noise current to the QSRAM cell.

The DNM for hold was only briefly introduced in [22], but was later extended by Zhang et al. [24], [34]. During hold, two transient noise models are discussed: the square pulse model, and the exponential model. In the square pulse model, a current noise source is injected into the low data node of a 6T SRAM cell. For a current amplitude of In, the required pulse width to drive the state across the separatrix is measured (Tcrit). Since the square pulse model presents a very simplified representation of an actual noise event, the exponential model is proposed to more accurately model an SEU caused by radiation [35]. A third model is achieved by applying the noise current infinitely, and thus measuring the static current noise margin (ISNM), which is an alternative to the standard SNM metric. Due to the symmetry of the 6T bitcell, it is sufficient to solely discuss the current noise injection at the low data node to develop an analytical model, as presented in [17]. However, the QSRAM cell is extremely asymmetric, such that hold DNM must be measured for all cases. Fig. 14 plots Tcrit as a function of In for the four cases of a square noise pulse. These cases are: a charging current to QB or a discharging current from Q for the hold ‘1’ state; and a charging current to Q or a discharging current from QB for the hold ‘0’ state. The ISNM is extremely low for this cell

(approximately 10 pA, shown by the vertical asymptotes of Fig. 14), as the cell is quasi-static and held at its stable state by leakage currents. However, as shown in the figure, the dynamic hold stability of the QSRAM cell is much more robust than the ISNM predictions. The cell is most sensitive in the hold ‘1’ state with noise charging the QB node; however, these are still worst case predictions, as generally a charging pulse to one node will also similarly affect the inverted node, maintaining some (or all) of the difference between them. In order to further test the state stability under a typical cross-talk disturbance, all four scenarios were simulated with a 0 to VDD pulse through the extracted coupling capacitances, and the cell returned to its initial state following each of these interferences.

A further observation of the cell’s behavior during a square pulse noise event is shown in Fig. 15a. This figure plots the trajectory of two noise events in the hold ‘0’ state. The first noise pulse is with Tnoise > Tcrit, resulting in a destructive state flip, whereas the second is with Tnoise < Tcrit, resulting in full state recovery. A similar plot is shown for a write ‘0’ event in Fig. 15b. In this case, when TW > Tacross,W0, the write operation is successful and the cell reaches the ‘0’ state; however, when TW < Tacross,W0, the cell state never crosses the separatrix and eventually returns to the ‘1’ state.

Fig. 12. (a) Phase portrait of the write ‘0’ operation. (b) Phase portrait of the write ‘1’ operation. Both plots are at typical conditions with VDD = 400 mV.

Fig. 15. (a) Trajectories of destructive (dotted line) and non-destructive (solid line) square pulse noise events in the hold ‘0’ state. (b) Trajectories of successful (dotted line) and failed (solid line) write operations.

4. Implementation and measurements

4.1. Bitcell implementation and layout

The proposed 9T QSRAM bitcell was implemented in a commercial 40 nm low-power TSMC technology. Layout of the bitcell was done exclusively with standard process layers, including multiple VT implants. The chosen layout, shown in Fig. 16, maintains unidirectional polysilicon gates, while reducing area overhead by implementing vertical wells and sharing bus contacts between adjacent bitcells. This implementation includes LVT implants for M9 and M7, as well as HVT implants for M1 and M4. Longer than minimal lengths were used for several of the devices in order to utilize the reverse short channel effect (RSCE) [27] and improve immunity to process variations. The layout was carried out according to standard logic design rules, resulting in a total area of 1.044 μm² per bitcell. This is approximately 10% larger than an 8T 2-port cell abiding to the same rule set, but about 2.8X larger than the pushed rule single-port 6T bitcell provided by the foundry. This ratio could be significantly reduced by employment of pushed rules for this circuit, as well.

Fig. 16. QSRAM layout (cell dimensions 1.8 μm × 0.58 μm).

Fig. 17. Array architecture (64 × 32 block of QSRAM bitcells; down shifters; digital periphery; readout, up-shift, and latch).

4.2. Test chip implementation and architecture

An 8 kbit array of QSRAM bitcells was assembled and fabricated as part of a 40 nm test chip along with other test structures. The array utilized a divided bit-line architecture, separating the readout path into four 64-row sub-sections to enable the integration of a 256-row block. In order to isolate the power consumption of the SRAM core, separate power supplies were used for the periphery and the array. The peripheral circuits were powered with a nominal supply voltage and level shifters were employed to propagate the 400 mV signals to and from the array core. The operation cycle was divided into two phases, such that the low-voltage core signals were only produced at the falling edge of the clock. The data out signals were held by a negative latch during the positive clock phase to eliminate hold violations when passing the data to the digital circuitry. The general architecture of the array is shown in Fig. 17 and the test chip micrograph is shown in Fig. 18. An extensive description of the array architecture and operation cycle can be found in [7], for which the same test setup was used.

4.3. Test chip measurements

The test chips were fabricated with a dual interface enabling internal testing with an on-chip built-in-self-test (BIST) unit, as well as external testing with a standard FPGA for better debugging. Three power supplies were used: 2.5 V for I/O circuits, 1.1 V for digital logic, and 400 mV for the array core. The 400 mV supply was provided by the Agilent B1500a Semiconductor Device Analyzer in order to accurately measure power consumption. This also enabled testing the functionality and power consumption of the chips while sweeping the array core supply voltage. All packaged test chips functioned successfully at the specified voltage (400 mV) and were further tested at voltages up to 1.1 V.

Fig. 19 shows representative waveforms of the chip testing. In Fig. 19a, signals from the FPGA’s digital chip scope show writing and reading a 32-bit word, while the analog behavior of a single bit being written and subsequently read is shown in Fig. 19b. A write operation is carried out when both the write–read (WR) and chip select (CS) signals are high, whereas a read operation occurs at the falling edge of the clock (CLK) after WR goes low and CS goes high.

At the specified voltage of 400 mV, the test chips were operated at 1 MHz. The frequency limit is dominated by the peripheral circuits, which were not optimized for performance. Note that read access time is highly dependent on the array architecture (i.e., the number of cells on a bitline) and the sensing scheme. Various techniques can be implemented to improve upon these, as shown in other publications. The static power consumption of the array was measured to be 21 nW for an array filled with ‘1’s, and 45 nW for an array filled with ‘0’s. The power consumption for continuous reads and writes of an equivalent number of ‘0’s and ‘1’s distributed amongst the array was 234 nW for operation at 1 MHz.

Fig. 18. Chip micrograph.

Fig. 19. (a) Chip Scope waveform of a pair of write and read operations; first, FFFFFFFF is written and read out, and then 00000000 is written and read out. (b) Analog scope waveform showing a ‘0’ written and subsequently read out.
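The figures quoted in Section 4 can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text and the dimension labels of Fig. 16; the derived quantities (leakage per bit, energy per clock cycle, total bitcell area) are not reported in the paper, and the energy figure assumes one array access per clock cycle, which is an assumption made here for illustration.

```python
# Values quoted in Section 4 (cell dimensions are the labels of Fig. 16).
CELL_WIDTH_UM = 1.8
CELL_HEIGHT_UM = 0.58
ROWS_PER_SUBSECTION = 64
SUBSECTIONS = 4
WORD_BITS = 32
STATIC_POWER_NW = {"all ones": 21.0, "all zeros": 45.0}
ACTIVE_POWER_NW = 234.0
CLOCK_HZ = 1e6

# Bitcell area: should reproduce the 1.044 um^2 stated in Section 4.1.
cell_area_um2 = CELL_WIDTH_UM * CELL_HEIGHT_UM

# Array organization: four 64-row sub-sections of 32-bit words.
rows = ROWS_PER_SUBSECTION * SUBSECTIONS
bits = rows * WORD_BITS

# Derived quantities (not reported in the paper).
bitcell_area_total_um2 = bits * cell_area_um2  # bitcells only, no periphery
leak_pw_per_bit = {k: 1e3 * p / bits for k, p in STATIC_POWER_NW.items()}
energy_per_cycle_fj = ACTIVE_POWER_NW * 1e-9 / CLOCK_HZ * 1e15  # assumes 1 access/cycle

print(f"cell area: {cell_area_um2:.3f} um^2")
print(f"array: {rows} rows x {WORD_BITS} bits = {bits} bits")
print(f"total bitcell area: {bitcell_area_total_um2:.0f} um^2")
for pattern, pw in leak_pw_per_bit.items():
    print(f"leakage per bit ({pattern}): {pw:.2f} pW")
print(f"energy per clock cycle: {energy_per_cycle_fj:.0f} fJ")
```

The product of the two layout dimensions matches the stated bitcell area, and four 64-row sub-sections of 32-bit words give the 8 kbit capacity.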

Table 1
Comparison to other sub-threshold bitcells.

Year | 2007 | 2007 | 2007 | 2008 | 2009 | 2011 | 2012
Publication | ISSCC [8] | ISLPED [6] | JSSC [3] | JSSC [4] | JSSC [13] | JSSC [7] | This work
Node | 0.13 μm | 0.13 μm | 65 nm | 65 nm | 90 nm | 40 nm LP | 40 nm LP
Capacity | 2 kb | 4 kb | 256 kb | 256 kb | 32 kb | 8 kb | 8 kb
Transistor count | 6T | 10T | 10T | 8T | 10T | 9T | 9T
VDDmin | 193 mV | 160 mV | 380 mV | 350 mV | 180 mV | 250 mV | 400 mV
Word lines | 2 | 1 | 1 | 2 | 2 | 2 | 2
Bit lines | 1 | 2 | 3 | 3 | 2 | 3 | 3
Floated supplies | VVDD, VGND | N/A | VVDD | N/A | VGND | N/A | N/A
WL boost | None | None | 100 mV | 50 mV | 33% | None | None
Frequency | 21.5 kHz @210 mV | 620 kHz @400 mV | 475 kHz @400 mV | 25 kHz @350 mV | 581 kHz @300 mV | 1 MHz @400 mV | 1 MHz @400 mV
Dynamic power | 1.8 μW @400 mV | 0.146 μW @400 mV | 3.28 μW @400 mV | 3.39 μW @350 mV | 1.44 μW @300 mV | 285 nW @400 mV | 234 nW @400 mV

Leakage power (listed in source order; one of the seven columns has no entry): 120 nW, 2.5 μW, 2.2 μW, 363 nW, 37–60 nW, 21–45 nW

A comparison of several recent sub-threshold SRAM designs is presented in Table 1. This table displays various relevant figures of merit, including technology node and transistor count; minimum operating voltage (VDDmin); number of peripheral buses and access schemes; measured operating frequency; and power metrics.

5. Conclusions

In this paper, we presented the ultra-low leakage 9T quasi-static RAM bitcell and provided an extensive analysis of this topology. Due to its quasi-static nature, the QSRAM bitcell displays somewhat depleted static noise margins; however, through dynamic stability analysis, the functionality of the cell was verified. We showed that the QSRAM bitcell is designated for low-voltage operation due to its robust write-ability and its non-linear reduction of SNM with voltage lowering. Simulations of the QSRAM cell were shown for a specified supply voltage of 400 mV, under which it retained functionality under global and local variations for the chosen 40 nm technology.

The QSRAM bitcell was compiled into an 8 kbit array and fabricated as part of a test chip in a commercial TSMC 40 nm LP process. Measurement results show full functionality for the specified 400 mV operating voltage at 1 MHz consuming 234 nW of power. Its static power figure was shown to typically be 1.5 to 4 times lower than a standard 8T bitcell held at 400 mV.

We would like to thank Mr. Nir Sever, the Zoran Corporation and the Alpha Consortium for their help and support in the completion of this work.

References

[1] N. Verma, J. Kwong, A.P. Chandrakasan, Nanometer MOSFET variation in minimum energy subthreshold circuits, IEEE Trans. Electron Devices 55 (2008) 163–174.
[2] B.H. Calhoun, A.P. Chandrakasan, Static noise margin variation for sub-threshold SRAM in 65-nm CMOS, IEEE J. Solid-State Circuits 41 (2006) 1673–1679.
[3] B.H. Calhoun, A.P. Chandrakasan, A 256-kb 65-nm sub-threshold SRAM design for ultra-low-voltage operation, IEEE J. Solid-State Circuits 42 (2007) 680–688.
[4] N. Verma, A.P. Chandrakasan, A 256 kb 65 nm 8T subthreshold SRAM employing sense-amplifier redundancy, IEEE J. Solid-State Circuits 43 (2008) 141–149.
[5] T.H. Kim, J. Liu, C.H. Kim, A voltage scalable 0.26 V, 64 kb 8T SRAM with Vmin lowering techniques and deep sleep mode, IEEE J. Solid-State Circuits 44 (2009) 1785–1795.
[6] J.P. Kulkarni, K. Kim, K. Roy, A 160 mV robust Schmitt trigger based subthreshold SRAM, IEEE J. Solid-State Circuits 42 (2007) 2303–2313.
[7] A. Teman, L. Pergament, O. Cohen, A. Fish, A 250 mV 8 kb 40 nm ultra-low power 9T supply feedback SRAM (SF-SRAM), IEEE J. Solid-State Circuits 46 (2011) 2713–2726.
[8] B. Zhai, D. Blaauw, D. Sylvester, S. Hanson, A sub-200 mV 6T SRAM in 0.13 μm CMOS, in: IEEE Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers, pp. 332–606.
[9] J.P. Kulkarni, K. Roy, Ultralow-voltage process-variation-tolerant Schmitt-trigger-based SRAM design, IEEE Trans. Very Large Scale Integr. VLSI Syst. (2011) 1.
[10] T.H. Kim, J. Liu, J. Keane, C.H. Kim, A 0.2 V, 480 kb subthreshold SRAM with 1 k cells per bitline for ultra-low-voltage computing, IEEE J. Solid-State Circuits 43 (2008) 518–529.
[11] A. Teman, L. Pergament, O. Cohen, A. Fish, A minimum leakage quasi-static RAM bitcell, J. Low Power Electron. Appl. 1 (2011) 204–218.
[12] B. Zhai, S. Hanson, D. Blaauw, D. Sylvester, A variation-tolerant sub-200 mV 6T subthreshold SRAM, IEEE J. Solid-State Circuits 43 (2008) 2338–2348.
[13] I.J. Chang, J.J. Kim, S.P. Park, K. Roy, A 32 kb 10T sub-threshold SRAM array with bit-interleaving and differential read scheme in 90 nm CMOS, IEEE J. Solid-State Circuits 44 (2009) 650–658.
[14] D. Kim, G. Chen, M. Fojtik, M. Seok, D. Blaauw, D. Sylvester, A 1.85 fW/bit ultra low leakage 10T SRAM with speed compensation scheme, in: Circuits and Systems (ISCAS), IEEE International Symposium on, 2011, pp. 69–72.
[15] M. Qazi, M.E. Sinangil, A.P. Chandrakasan, Challenges and directions for low-voltage SRAM, IEEE Des. Test Comput. 28 (2011) 32–43.
[16] E. Seevinck, F.J. List, J. Lohstroh, Static-noise margin analysis of MOS SRAM cells, IEEE J. Solid-State Circuits 22 (1987) 748–754.
[17] B. Zhang, A. Arapostathis, S. Nassif, M. Orshansky, Analytical modeling of SRAM dynamic stability, in: Computer-Aided Design, 2006. ICCAD ’06. IEEE/ACM International Conference on, pp. 315–322.
[18] D.E. Khalil, M. Khellah, N.S. Kim, Y. Ismail, T. Karnik, V.K. De, Accurate estimation of SRAM dynamic stability, IEEE Trans. Very Large Scale Integr. VLSI Syst. 16 (2008) 1639–1647.
[19] S. Nalam, V. Chandra, R.C. Aitken, B.H. Calhoun, Dynamic write limited minimum operating voltage for nanoscale SRAMs, in: Design, Automation and Test in Europe Conference and Exhibition (DATE), 2011, pp. 1–6.
[20] S.O. Toh, Z. Guo, T.J.K. Liu, B. Nikolic, Characterization of dynamic SRAM stability in 45 nm CMOS, IEEE J. Solid-State Circuits 46 (2011) 2702–2712.
[21] M. Sharifkhani, M. Sachdev, SRAM cell stability: a dynamic perspective, IEEE J. Solid-State Circuits 44 (2009) 609–619.
[22] Wei Dong, G.M. Huang, SRAM dynamic stability: Theory, variability and analysis, in: Computer-Aided Design, 2008. ICCAD 2008. IEEE/ACM International Conference on, 2008, pp. 378–385.
[23] M. Wieckowski, D. Sylvester, D. Blaauw, V. Chandra, S. Idgunji, C. Pietrzyk, R. Aitken, A black box method for stability analysis of arbitrary SRAM cell structures, in: Design, Automation and Test in Europe Conference and Exhibition (DATE), 2010, pp. 795–800.
[24] Yong Zhang, G.M. Huang, Separatrices in high-dimensional state space: system-theoretical tangent computation and application to SRAM dynamic stability analysis, in: Design Automation Conference (DAC), 2010 47th ACM/IEEE, 2010, pp. 567–572.
[25] J. Wang, S. Nalam, B.H. Calhoun, Analyzing static and dynamic write margin for nanometer SRAMs, in: Low Power Electronics and Design (ISLPED), 2008 ACM/IEEE International Symposium on, pp. 129–134.
[26] E. Grossar, M. Stucchi, K. Maex, W. Dehaene, Read stability and write-ability analysis of SRAM cells for nanometer technologies, IEEE J. Solid-State Circuits 41 (2006) 2577–2588.
[27] T.H. Kim, J. Liu, C.H. Kim, An 8T subthreshold SRAM cell utilizing reverse short channel effect for write margin and read performance improvement, in: Custom Integrated Circuits Conference, 2007. CICC ’07. IEEE, pp. 241–244.
[28] C.C. Enz, F. Krummenacher, E.A. Vittoz, An analytical MOS transistor model valid in all regions of operation and dedicated to low-voltage and low-current applications, Analog Integr. Circuits Signal Process. 8 (1995) 83–114.
[29] C.F. Hill, Noise margin and noise immunity in logic circuits, Microelectronics 1 (1968) 16–22, Apr.
[30] G.M. Huang, W. Dong, Y. Ho, P. Li, Tracing SRAM separatrix for dynamic noise margin analysis under device mismatch, in: Behavioral Modeling and Simulation Workshop, 2007. BMAS 2007. IEEE International, 2007, pp. 6–10.
[31] G.E. Moore, Cramming more components onto integrated circuits, Proc. IEEE 86 (1998) 82–85.
[32] E.I. Vatajelu, G. Panagopoulos, K. Roy, J. Figueras, Parametric failure analysis of embedded SRAMs using fast and accurate dynamic analysis, in: Test Symposium (ETS), 2010 15th IEEE European, 2010, pp. 69–74.
[33] H. Khalil, Nonlinear Systems, Upper Saddle River, N.J: Prentice Hall, 2002.
[34] Y. Zhang, P. Li, G. Huang, Quantifying dynamic stability of genetic memory circuits, IEEE/ACM Trans. Comput. Biol. Bioinf. 9 (3) (2012) 871–884.
[35] H. Mostafa, M.H. Anis, M. Elmasry, Analytical models accounting for die-to-die and within-die variations in sub-threshold SRAM cells, IEEE Trans. Very Large Scale Integr. VLSI Syst. 19 (2011) 182–195.
[36] A. Teman, A. Mordakhay, J. Mezhibovsky, A. Fish, A 40-nm sub-threshold 5T SRAM bit cell with improved read and write stability, in: IEEE TCAS-II: Express Briefs, 2013.
[37] J. Mezhibovsky, A. Teman, A. Fish, State space modeling for sub-threshold SRAM stability analysis, Proc. IEEE ISCAS (2012), pp. 1823–1826, 20–23 May 2012.

3.4 A 40-nm Subthreshold 5T SRAM Bit Cell with Improved Read and Write Stability

The following paper was recently published in IEEE Transactions on Circuits and Systems II: Express Briefs. It was included in the prestigious special issue on Ultra-Low-Voltage VLSI Circuits and Systems for Green Computing [96].

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 59, NO. 12, DECEMBER 2012 873

A 40-nm Sub-Threshold 5T SRAM Bit Cell With Improved Read and Write Stability

Adam Teman, Student Member, IEEE, Anatoli Mordakhay, Janna Mezhibovsky, and Alexander Fish, Member, IEEE

Abstract—The need for power-efficient memories that are capable of operating at low supply voltages has led to the development of several alternative bit cell topologies. The majority of the proposed designs are based on the 6T bit cell with the addition of devices and/or peripheral techniques aimed at reducing leakage and enabling read and write functionality at lower operating voltages. In this brief, we propose a reduced transistor count bit cell that is fully functional in the sub-threshold (ST) region of operation. This asymmetric 5T bit cell is operated through a single-ended read and differential write scheme, with an option for operation as a two-port cell with single-ended write. The bit cell’s operating scheme provides a nonintrusive read operation and improved write margins for robust functionality. In addition, the circuit’s asymmetric characteristic provides a low-leakage state with an additional 5X static power improvement over the reduction inherently achieved through voltage lowering. The proposed bit cell was designed and simulated in a 40-nm commercial CMOS process and is shown to be fully operational at ST voltages as low as 400 mV under global and local process variations. At this supply voltage, a 21X static power reduction is achieved, as compared to the industry-standard 6T bit cell, operated at its minimum supply voltage.

Index Terms—CMOS memory integrated circuits, leakage suppression, SRAM, sub-threshold (ST) static random access memory (SRAM), ultra-low power.

I. INTRODUCTION

THE CONTINUOUSLY growing demand for low-power embedded memory has been a driving force in the development of new static random access memory (SRAM) designs and techniques. Aggressive power reduction can be achieved by operating at sub-threshold (ST) voltages; however, operation at these reduced voltages degrades robustness, due to depleted noise margins and higher susceptibility to process variations and device mismatch. The severe variations present in nano-scaled process technologies cause standard SRAM implementations, such as the single-port 6T bit cell, to fail at voltages below 600–700 mV [1]. Previous work in this field has introduced modifications to the 6T cell by using additional devices and techniques aimed at lowering the minimum operating voltage (Vmin). Examples of previously proposed ST bit cells include 8T cells that utilize reverse short channel effect for increased write-ability at lower voltages [2], [3]; a 10T bit cell with reduced leakage in the read path [1]; a 9T bit cell with internal feedback for leakage reduction [4], [14]; and Schmitt-Trigger-based bit cells [5]. Externally applied methods, such as increased word line (WL) voltage and detachable supply rails, have also been proposed [1], [2], [6]. The majority of these circuits add transistors to the standard 6T bit cell, resulting in a significant increase in the die area.

This paper presents a robust, low-voltage SRAM bit cell with a reduced transistor count, as compared to the standard 6T circuit. The proposed 5T bit cell is based on the circuit introduced in [7] with a number of significant modifications to enable low-voltage operation and dense implementation in nano-scale processes. The result is a bit cell of comparable size to the 6T cell, which is shown to be fully operational at voltages as low as 400 mV in a commercial 40-nm CMOS process. At this supply voltage, the proposed bit cell provides 6σ stability and an average static power reduction of 21X, as compared to a 6T cell operating at its voltage limit.

The rest of this paper is constructed as follows. Section II describes the circuit design and operation of the proposed cell and an extensive discussion of circuit stability, including simulation results, is presented in Section III. Section IV describes circuit implementation and performance, and Section V summarizes the paper.

Manuscript received June 26, 2012; revised September 13, 2012; accepted October 21, 2012. Date of publication January 9, 2013; date of current version February 1, 2013. This brief was recommended by Associate Editor M. Alioto.
A. Teman and J. Mezhibovsky are with the Low Power Circuits and Systems Lab (LPC&S), VLSI Systems Center, Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Be’er Sheva, Israel 84105 (e-mail: [email protected]).
A. Mordakhay is with the Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel 52900.
A. Fish is with the Faculty of Engineering, Bar-Ilan University, Ramat Gan,

II. THE 5T SRAM BIT CELL

Operation of a standard 6T bit cell at low voltages is limited by both its read and write margins, due to process and mismatch variations. The drive strength of MOSFET devices becomes an exponential function of the device’s threshold voltage (VT), as the supply voltage nears the ST region, causing variation to increase dramatically. As described in [1], a 65-nm 6T cell fails due to read margin deterioration at approximately 800 mV, and due to depleted write margins at approximately 700 mV. Device mismatch, and therefore cell failure probability, only gets worse with technology scaling.

In order to address these problems, we propose a novel 5T bit cell, employing a single-ended read and differential write with an option for single-ended write. The proposed circuit, shown in Fig.
1, resembles the circuit described in [7], with Israel 52900, and also with the Low Power Circuits and Systems Laboratory a few major modifications to enable dense layout, as well as (LPC&S), VLSI Systems Center, Department of Electrical and Computer functionality at low supply voltages in nano-scale technologies. Engineering, Ben-Gurion University of the Negev, Be’er Sheva, Israel 84105. These modifications include removing the back gate control Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org. of the right access transistor (M4) and manipulating multi- Digital Object Identifier 10.1109/TCSII.2012.2231020 threshold devices, as will be explained.

1549-7747/$31.00 © 2013 IEEE

Fig. 1. Schematic of the proposed 5T bit cell.

In general, the proposed cell structure is similar to a standard 6T bit cell, enabling a dense, regularly structured layout. However, as opposed to a cross-coupled inverter circuit, the 5T cell is lacking an NMOS pull-down under the right data node (QB). To compensate for this, M4 is implemented as a low-threshold (LVT) device and the right bit line (BLb) is held low during standby and read operations. Accordingly, reads are achieved single-endedly through the left access transistor (M2) according to an independent Read Word Line (RWL) signal. Writes are achieved differentially by asserting both RWL and the Write Word Line (WWL). To further improve cell functionality, the pull-up PMOS devices are implemented with high-threshold (HVT) devices. The following sub-sections will describe the cell’s hold states and its read and write access operations.

Fig. 2. (a) QB steady-state voltage in the SF corner under various supply voltages. (b) Monte Carlo distribution of QB steady-state voltages under a 400-mV supply.

A. Hold States

For the trivial state of holding a logical “0”, the operation of the proposed cell is similar to a standard cross-coupled latch. By discharging node Q, M5 is turned on (VSG5 = VDD), allowing node QB to be fully charged to VDD. Accordingly, M1 is turned on (VGS1 = VDD), ensuring that Q remains discharged.

The lack of a pull-down device under QB results in a very robust hold “0” state; however, it severely impedes the opposite (hold “1”) state. In this state, Q is charged to VDD and QB is discharged to ground. M3 is turned on (VSG3 = VDD), holding Q high, but nothing would seem to be holding QB low. By maintaining a stronger leakage current from QB to ground than from VDD to QB, a stable state is ensured. This is achieved through three mechanisms: implementing M4 with a double-width LVT device; implementing M5 with an HVT device; and biasing BLb at ground at all times other than when writing a “0”. Fig. 2 shows the steady-state voltage of node QB at the worst-case Slow NMOS-Fast PMOS (SF) corner under various supply voltages [Fig. 2(a)], as well as the statistical distribution of the steady state under a 400-mV supply [Fig. 2(b)]. In all cases, the node voltage is kept low, providing the necessary stability, as will be further shown in Section III.

B. Read Operation

Under a standard differential read-scheme, such as that commonly used with a 6T bit cell, both bit lines are precharged and the access transistors are turned on. This invasive state causes a rise in the level of the low internal data node (i.e., the node holding a “0”). If the voltage rise crosses the trip voltage of the cross-coupled inverter structure, the positive feedback loop is triggered, resulting in a cell flip. Under mismatch variations at low voltages, this is often the case, limiting the operating voltage to strong-inversion operation. To address this issue, many alternative bit cells have employed a nonintrusive readout through a read-buffer [1], [2], [4]; however, this results in a significant increase in cell area. The proposed 5T cell solves this hazard by simply removing the destructive positive feedback during a read operation. Lacking a pull-down network at node QB, a rise in the level of Q cannot cause a cell flip. Therefore, read operations are implemented single-endedly on the Q node, thus providing a significant improvement in read stability.

The read access of the 5T cell is initiated by precharging the BL signal, while holding BLb discharged (its standby state). Subsequently, RWL is asserted, resulting in a single-ended readout of node Q. If Q is high (the hold “1” state), there is no voltage drop over M2 and all voltage levels remain unchanged. If Q is low (the hold “0” state), charge sharing is initiated between BL and Q, discharging BL and resulting in a “0” readout. As with a 6T readout, the voltage level at Q rises, lowering the overdrive voltage of M5, potentially cutting off the pull-up of QB. However, QB is left at a high state, as there is no active pull-down network to discharge it (the leakage pull-down to BLb takes much longer than the read access time). Therefore, once the read access is completed and RWL is lowered, M1 (with VGS = QB ≈ VDD) will quickly discharge Q back to its original state. In fact, the feedthrough to Q on the falling edge of RWL actually improves this effect. As a result, the read stability of the 5T cell is much higher than that of its 6T counterpart, providing robust ST readability, as shown in Section III.

C. Write Operation

The proposed 5T cell employs a typical differential write operation, during which the bit lines are driven to opposite levels and, subsequently, the WLs (both WWL and RWL) are asserted. A similar differential write scheme is applied to standard 6T cells, in an attempt to break the positive feedback of the circuit and bring it into a mono-stable state that represents the data to be stored. To ensure readability when the access transistors are enabled, the circuit is sized to retain bi-stability when the bit lines are precharged. This requirement essentially disables the pull-up path during a write, leaving the majority of the write operation to the pull-down side. Under global variations combined with device mismatch, this can lead to write failures, as the pull-up PMOS is strengthened, as compared to the pull-down access transistor. The increased variation at reduced voltages limits the write-ability of a 6T cell to approximately 700 mV [1].

The single-ended read operation of the proposed 5T cell essentially removes the read-sizing constraint of the right access transistor (M4). In fact, the 5T cell enhances the efficiency of the pull-up operation through M4, as node QB has no pull-down network to contend with. Therefore, by charging BLb and asserting WWL, QB is easily pulled up past the threshold voltage of M1, enabling the pull-down network of node Q. This write “0” operation can be achieved single-endedly; however, by discharging BL and asserting RWL, a faster and more robust write operation is achieved.

Writing a “1” is very similar to a 6T write operation. BL is charged, BLb is discharged, and both WLs are asserted. To successfully flip the cell state, QB must be discharged past the switching threshold of the left inverter (made up of M1 and M3), while Q must be charged high enough to cut off M5. From a topological standpoint, this operation is less robust than that of a 6T cell, as the lack of positive feedback necessitates a higher voltage rise on Q before the write operation is successful. However, as described above, the right side of the cell is implemented to ensure a positive pull-down leakage ratio during the hold “1” state, by using an LVT implant and double width for M4, and an HVT implant on M3. Therefore, the pull-down path to BLb is much stronger than the pull-up path through M5, ensuring a successful discharge of QB, even under extreme variations. In fact, the write “1” operation can also be achieved single-endedly, albeit at the expense of an increased access time and slightly degraded write margins. Indeed, implementation of a single-ended write access scheme requires no structural changes in the cell. Therefore, the proposed 5T bit cell can be used as a very dense two-port SRAM, simply by adjusting the access control. However, with an emphasis on low-voltage operation, we chose to present differential write access, as shown in Section III. It is important to note that the 5T bit cell does not natively support column multiplexing, due to its single-ended write-ability. However, by applying a write-back control scheme, this can be implemented, at the expense of a performance and power penalty.

III. 5T CELL STABILITY

One of the primary concerns in nano-scale SRAM design is stability. This has been a major focus of recent research, as traditional static noise margin (SNM) criteria have been found to be over-aggressive (for hold and read operations) and optimistic for write operations [8]. When considering a nonstandard bit cell, such as the proposed 5T circuit, the existing metrics have to be carefully analyzed, as they may not be suitable for its characteristics. The following sub-sections will present the issues in stability analysis of the proposed cell, as well as a comparison to the 6T bit cell, according to various metrics.

A. Hold Stability

The SNM metric, first defined by Seevinck et al. [9], is the most commonly used measure of SRAM stability. This method measures the largest serial dc voltage that can be oppositely applied to a bit cell’s internal data nodes without flipping the cell. This metric is overly pessimistic, as it models an infinite nonphysical noise source. However, measuring hold SNM across a large distribution provides a good basis for yield estimation. Generally, a 6σ non-negative hold SNM is required for consideration of a high-density SRAM bit cell.

Whereas a standard cross-coupled inverter structure presents an almost symmetric SNM calculation, when discussing asymmetric bit cells, such as the proposed 5T, it is important to address each state separately. For the hold “0” state, the lack of a pull-down device provides an increased margin, as there is no positive feedback to flip the cell. Alternatively, the hold “1” state is only maintained by leakage current ratios and a rise in the low voltage can trigger destructive positive feedback. Therefore, the SNM of this state is inherently lower and it is more sensitive to process variations. However, a choice of sizing and threshold implants can provide the required stability, as illustrated in Fig. 3. This figure plots the Monte Carlo (MC) distribution of both hold states of the proposed cell, as well as the 6T SNM distribution under a 400-mV supply. Clearly, the hold “0” state provides improved robustness, while the stability of the hold “1” state is impeded; however, the required 6σ positive margin is achieved, nonetheless. Note that throughout this paper, a 6T cell comprising standard threshold (SVT) transistors and minimal lengths was used for comparison.

Fig. 3. Hold SNM distribution of the proposed 5T bit cell, as compared to a standard 6T bit cell.

B. Read Stability

Measurement of the read margin is frequently carried out in a similar fashion to hold with the bit lines and WL tied to VDD. This metric is also overly pessimistic, as it assumes an infinitely long read access pulse and neglects bit line discharge. When applying the Read SNM (RSNM) metric to the proposed 5T bit cell, the result is actually an improved margin, as M2 assists in holding Q high. RSNM measurement for the hold “0” state also shows robustness; however, this is limited to just above 4σ under a 400-mV supply (whereas at this voltage, the 6T fails at under 3σ). However, a closer look shows that these infrequent failures are actually due to the nature of the metric and would only occur if, in fact, RWL would be asserted for an infinite duration. Under extreme mismatch, the voltage at Q could rise to a level high enough to weaken M5, such that the leakage through M4 would pull down QB and flip the bit. This ratioed contest between leakage currents would only culminate after a very long duration; much longer than a low-frequency read pulse.
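The access scheme of Section II can be summarized as a small behavioral model. The following is only a logic-level sketch: it ignores timing, margins, and leakage, and the names `Access`, `READ`, `WRITE_0`, `WRITE_1`, `next_q`, and `read_bl` are hypothetical labels introduced here for illustration, not part of the original design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """Line levels for one access; True means driven high / asserted."""
    bl: bool    # left bit line (BL), connected to Q through M2
    blb: bool   # right bit line (BLb), connected to QB through M4
    rwl: bool   # read word line, gates M2
    wwl: bool   # write word line, gates M4


# Levels as described in Section II (differential write variant).
READ = Access(bl=True, blb=False, rwl=True, wwl=False)    # BL precharged, BLb held low
WRITE_0 = Access(bl=False, blb=True, rwl=True, wwl=True)  # BLb charged pulls QB up
WRITE_1 = Access(bl=True, blb=False, rwl=True, wwl=True)  # BLb low discharges QB


def next_q(q: bool, op: Access) -> bool:
    """Idealized value stored on node Q after the access completes."""
    if op.wwl:
        # QB follows BLb through M4; Q settles to the complement of QB.
        return not op.blb
    # Read (RWL only) or standby: the stored state is preserved.
    return q


def read_bl(q: bool) -> bool:
    """BL level after a read: stays high for a stored 1, discharges for a stored 0."""
    return q
```

Because `next_q` depends only on WWL and BLb, the same sketch also covers the single-ended write option, in which RWL is left deasserted.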

Fig. 5. Dual BL Sweep write margin comparison of the 5T cell with the 6T cell across process corners.

Fig. 4. Cell state distribution during a read operation. The figure shows the voltages of nodes Q and QB following a read operation applied upon a 5T cell holding a “1” and a “0”, as well as a 6T cell holding a “0”.

Fig. 6. Proposed bit cell layout.

Instead of presenting the RSNM distribution, Fig. 4 shows the distribution of the cell state following a 100-μs read pulse, assuming a fully charged bit line for 5000 MC samples with a 400-mV supply. This is still a very extreme situation, as the frequency is very low, and the disturbance would severely deteriorate, as the bit line discharges. However, as expected, the internal state of the 5T cell is hardly disrupted. On the other hand, for the 6T cell, which was read in a hold “0” state, the figure shows several instances of cell flips (the gray circles at the bottom right part of the graph).
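To relate the sigma margins quoted above to a 5000-sample experiment, a rough sketch follows. It assumes that the margin is Gaussian-distributed, so that a margin of k sigma corresponds to a one-sided tail probability; the text does not state this assumption, and the function names are hypothetical.

```python
import math


def tail_probability(k_sigma: float) -> float:
    """One-sided Gaussian tail: P(margin < 0) when the mean margin lies
    k_sigma standard deviations above zero (Gaussian margin is an assumption)."""
    return 0.5 * math.erfc(k_sigma / math.sqrt(2.0))


def expected_failures(k_sigma: float, samples: int) -> float:
    """Expected number of failing samples among independent Monte Carlo runs."""
    return samples * tail_probability(k_sigma)


if __name__ == "__main__":
    for k in (3.0, 4.0, 6.0):
        print(f"{k:.0f} sigma: p_fail = {tail_probability(k):.2e}, "
              f"expected failures in 5000 samples = {expected_failures(k, 5000):.3g}")
```

Under this assumption, a margin below 3σ implies a failure probability above roughly 1.3e-3, or several failures in 5000 samples, whereas a margin just above 4σ (roughly 3e-5) would typically produce none. This is only an order-of-magnitude check against the static metric, which the text itself describes as pessimistic.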

C. Write Stability

Static write margin (WSNM), as defined by Seevinck et al. [9], is the minimum serial voltage necessary to drive the bit cell into a mono-stable state during a write operation. This metric is overly optimistic, as it assumes an infinitely long write pulse. In addition, various problems in the actual measurement of WSNM, as well as a dispute as to its accuracy, have led to several alternative metrics [10], [11]. To measure the write stability of the proposed 5T cell, we applied a modified version of the BL Sweep method [12], applying a depleted write voltage to both bit lines, to address the asymmetric nature of the cell. This method provides a write margin comparison to the 6T cell, as presented in Fig. 5. The 5T cell shows a significant advantage at all but the Fast NMOS-Slow PMOS (FS) corner, including the SF corner, at which the 6T cell fails to write (negative margin).

The Dual BL Sweep method of Fig. 5 has two drawbacks. First, it is a static measurement, assuming an overly optimistic infinitely long write pulse. Second, to measure the write margin, it is necessary to know the trip point of the cell [13], which changes at each statistical corner. To address these issues, we applied 5000 MC transient writes with a long enough write pulse (100 μs), and measured the final state. As expected, all runs were successful, due to the enhanced write-ability of the proposed topology. For the 6T cell, 2.74% of the write operations failed, again showing the superior write-stability of the 5T topology over its 6T counterpart.

Fig. 7. Write access time comparison of the 5T and 6T cells across process corners.

IV. IMPLEMENTATION AND PERFORMANCE

The proposed 5T bit cell was designed and simulated in a bulk 40-nm LP CMOS process for preliminary proof of concept, prior to silicon measurements. Simulations were carried out with Cadence Spectre, based on full device models that have previously been validated at the relevant voltages [4], [14]. Device sizes are shown in Fig. 1. Circuit layout was carried out according to standard design rules. The resulting layout, shown in Fig. 6, fit into the same 0.572-μm² footprint as a reference 6T cell that was laid out according to the same design rules.

The write access time of the implemented bit cell was compared to a reference 6T cell at major process corners, as presented in Fig. 7. As expected, the write times of the 5T cell are better than the 6T at all corners, including the SF corner, at which the 6T cell is non-writeable under a 400-mV supply. In addition, write access was measured for a single-ended operation, albeit at a significantly increased delay. This operation, however, could be improved by widening the right access transistor (M4).

A final important metric for a high-density, low-voltage SRAM bit cell is its static power dissipation. For the hold “0” state, M2 and M4 are the main ST leakage sources, as the bit lines are biased at the opposite level to that stored in the adjacent data nodes. On the other hand, in the hold “1” state, the voltage drop over both access transistors is close to zero, significantly reducing the leakage power. Fig. 8 plots the MC statistical distribution of the 5T cell’s leakage current in both stable states. The leakage power of the hold “1” state is on average 5X lower than that of the reference 6T cell at 400 mV, providing an ultra-low-power stable state that can be exploited on the architectural level. Dynamic power can also be significantly reduced, as compared to standard implementations, not only due to the reduced supply voltage, but also due to the asymmetric single-ended readout. However, dynamic power is highly dependent on complete array architecture, as will be presented in a future work.

The proposed cell’s primary figures of merit are summarized in Table I.

Fig. 8. Leakage current distribution of the 5T cell’s states, as compared to a standard 6T cell.

TABLE I. FIGURES OF MERIT

V. CONCLUSION

In this paper, we presented a novel 5T bit cell with low-voltage, ST functionality. The asymmetric operation of the 5T bit cell was demonstrated, providing a low-leakage state with an additional 5X improvement over a standard implementation at the same operating voltage. The 5T bit cell was implemented in a 40-nm LP bulk CMOS process and shown to be fully functional at ST voltages, as low as 400 mV under global and local variations. The layout of the cell was achieved without any area overhead, as compared to the industry standard 6T bit cell. The circuit was also shown to operate as a two-port SRAM bit cell without any modifications to the basic topology. Future work includes fabrication of a test chip including a functional array composed of the proposed bit cell.

REFERENCES

[1] B. H. Calhoun and A. P. Chandrakasan, “A 256-kb 65-nm sub-threshold SRAM design for ultra-low-voltage operation,” IEEE J. Solid-State Circuits, vol. 42, no. 3, pp. 680–688, Mar. 2007.
[2] N. Verma and A. P. Chandrakasan, “A 256 kb 65 nm 8T subthreshold SRAM employing sense-amplifier redundancy,” IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 141–149, Jan. 2008.
[3] T. H. Kim, J. Liu, and C. H. Kim, “An 8T subthreshold SRAM cell utilizing reverse short channel effect for write margin and read performance improvement,” in Proc. IEEE CICC, 2007, pp. 241–244.
[4] A. Teman, L. Pergament, O. Cohen, and A. Fish, “A 250 mV 8 kb 40 nm ultra-low power 9T Supply Feedback SRAM (SF-SRAM),” IEEE J. Solid-State Circuits, vol. 46, no. 11, pp. 2713–2726, Nov. 2011.
[5] J. P. Kulkarni and K. Roy, “Ultralow-voltage process-variation-tolerant Schmitt-trigger-based SRAM design,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 2, pp. 319–332, Feb. 2012.
[6] H. Pilo, C. Barwin, G. Braceras, C. Browning, S. Lamphier, and F. Towler, “An SRAM design in 65-nm technology node featuring read and write-assist circuits to expand operating voltage,” IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 813–819, Apr. 2007.
[7] C. Chuang, J. J. Kim, and K. Kim, “Back-gate controlled asymmetrical and memory using the cell,” Patent 7313012, Dec. 25, 2007.
[8] S. Toh, G. Zheng, T.-J. K. Liu, and B. Nikolic, “Characterization of dynamic SRAM stability in 45 nm CMOS,” IEEE J. Solid-State Circuits, vol. 46, no. 11, pp. 2702–2712, Nov. 2011.
[9] E. Seevinck, F. J. List, and J. Lohstroh, “Static-noise margin analysis of MOS SRAM cells,” IEEE J. Solid-State Circuits, vol. SSC-22, no. 5, pp. 748–754, Oct. 1987.
[10] J. Wang, S. Nalam, and B. H. Calhoun, “Analyzing static and dynamic write margin for nanometer SRAMs,” in Proc. IEEE ISLPED, 2008, pp. 129–134.
[11] H. Makino, S. Nakata, H. Suzuki, S. Mutoh, M. Miyama, T. Yoshimura, S. Iwade, and Y. Matsuda, “Reexamination of SRAM cell write margin definitions in view of predicting the distribution,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 4, pp. 230–234, Apr. 2011.
[12] K. Zhang, U. Bhattacharya, C. Zhanping, F. Hamzaoglu, D. Murray, N. Vallepalli, Y. Wang, B. Zheng, and M. Bohr, “A 3-GHz 70-mb SRAM in 65-nm CMOS technology with integrated column-based dynamic power supply,” IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 146–151, Jan. 2006.
[13] Y. Zhang, P. Li, and G. M. Huang, “Separatrices in high-dimensional state space: System-theoretical tangent computation and application to SRAM dynamic stability analysis,” in Proc. ACM/IEEE DAC, 2010, pp. 567–572.
[14] A. Teman, L. Pergament, O. Cohen, and A. Fish, “A minimum leakage quasi-static RAM bitcell,” J. Low Power Electron. Appl., vol. 1, no. 1, pp. 204–218, May 2011.

Chapter 4 Low Power Gain Cell Embedded DRAMS

4.1 Introduction

Many ultra-low power (ULP) systems, such as biomedical sensor nodes and implants, are expected to run on a single cubic-millimeter battery charge for days or even for years, and therefore are required to operate with extremely low power budgets. Aggressive supply voltage scaling, leading to near-VT or even to sub-VT circuit operation, is widely used in this context to lower both active energy dissipation and leakage power consumption; albeit, at the price of severely degraded on/off current ratios (Ion/Ioff) and increased sensitivity to process variations [97]. The majority of these biomedical systems require a considerable amount of embedded memory for data and instruction storage, often amounting to a dominant share of the overall silicon area and power. Typical storage capacity requirements range from several kb for low-complexity systems [98] to several tens of kb for more sophisticated systems [99].

Over the last decade, robust, low-leakage, low-power sub-VT memories have been heavily researched [33, 34, 46]. In order to guarantee reliable operation in the sub-VT domain, many new SRAM bitcells consisting of 8 [87, 100], 9 [46, 49, 101], 10 [32, 102], and up to 14 [98] transistors have been proposed. These bitcells utilize the additional devices to solve the predominant problems of write contention and bit-flips during read, and, in addition, some of the designs reduce leakage by using transistor stacks. All these state-of-the-art sub-VT memories are based on static bitcells, while the advantages and drawbacks of dynamic bitcells for operation in the sub-VT regime have not yet been studied. Conventional 1-transistor-1-capacitor (1T-1C) embedded DRAM (eDRAM) is incompatible with standard digital CMOS technologies due to the need for high-density stacked or trench capacitors. Therefore, it cannot easily be integrated into a ULP system-on-chip (SoC) at low cost. Moreover, low-voltage operation is inhibited by the offset voltage of the required sense amplifier, unless special offset cancellation techniques are used [103]. Gain-cells are a promising alternative to SRAM and to conventional 1T-1C eDRAM, as they are both smaller than any SRAM bitcell and fully logic-compatible. Much of the previous work on gain-cell eDRAMs focuses on high-speed operation, in order to use gain-cells as a dense alternative to SRAM in on-chip processor caches [104, 105], while only a few publications deal with the design of low-power near-VT gain-cell arrays [106-108].

4.2 Review of Recent Gain Cell eDRAM Implementations

Gain cells (GC) are dynamic memory bitcells comprised of 2–3 standard logic transistors and, optionally, an additional MOSCAP or diode. The additional devices (as compared to their 1T counterparts) are used both to increase the in-cell storage capacitance and to amplify the readout charge flow as compared to the stored charge level, thus providing the name “gain” cells [109]. The reduced device count results in a much higher bitcell density, as compared to a standard SRAM, while the decoupled read port provides both a non-destructive read operation and two-ported functionality. Neither read nor write operations suffer from the ratioed contention between devices in a 6T SRAM, resulting in increased margins and enabling voltage scaling [57, 108]. Finally, leakage power is highly reduced, as fewer devices suffer from DIBL, and scaled supply voltages reduce other leakage components. Despite these favorable features, gain cells suffer from a number of drawbacks. The primary concern is the small internal storage capacitor that results in short retention times, requiring power-hungry refresh operations. In addition, the depleted storage voltages following a long retention period result in poor read performance. These characteristics are highly dependent on process-voltage-temperature (PVT) variations, thereby requiring careful margin distribution, cell tracking, and reference voltage control [105]. Appendix B includes a paper that overviews the various gain cell implementation options and analyzes the resulting trade-offs. Methods for contending with the drawbacks and improving the performance of the circuits are presented, and the compatibility of the existing designs with various target applications is discussed, according to the energy-efficiency aspects of these implementations.


4.3 Minimum Voltage Gain Cell Operation

The possibility of operating gain-cell arrays in the sub-VT regime for high-density, low-leakage, and voltage-compatible data storage in ULP sub-VT systems has not yet been exploited. One of the main objections to sub-VT gain-cells is the degraded Ion/Ioff current ratio, which leads to rather short data retention times compared to the achievable data access times. However, the studies carried out within the framework of this dissertation show that these current ratios are still high enough in the sub-VT regime to achieve short access and refresh cycles and high memory availability, at least down to 0.18 µm CMOS nodes. While gain-cells are considerably smaller than robust sub-VT SRAM bitcells, they also exhibit lower leakage currents, especially in mature CMOS nodes where sub-VT conduction is the dominant leakage mechanism. Recent studies for above-VT, high-speed caches show that gain-cell arrays can even have lower retention power (leakage power plus refresh power) than SRAM (leakage power only) [109]. The sub-VT study is summarized in Appendix C, in a paper that was recently published in the up-and-coming, open-access Journal of Low-Power Electronics and Applications, published by MDPI. This article was published as part of the special issue "Selected Papers from SubVt 2011 Conference" (http://www.mdpi.com/journal/jlpea/special_issues/subvt_2011). It is an extended version of the paper presented at the IEEE Subthreshold Microelectronics Conference in Needham, MA, USA [57].

4.4 Extending the Retention Time of Gain Cell Arrays

The all-PMOS 2T GC circuit (Figure 12) is comprised of a write transistor (MW), a read transistor (MR), and a storage capacitor (CSN), which is made up of the parasitic capacitances. Data is written to the cell by applying an underdrive voltage (VNWL) to the write wordline (WWL), which transfers the biasing level of WBL to the storage node (SN). This level can be read out by pre-discharging the read bitline (RBL) and subsequently raising the read wordline (RWL), which conditionally charges the RBL if the voltage level stored on the SN is low. The circuit’s leakage power, which is shown to be dominated by subthreshold conduction at sub-micron process technologies [25], is extremely low, since, during standby and write, the drain-to-source voltage of MR is zero, and the subthreshold leakage through MW is limited to (dis)charging the storage capacity of SN. The obvious issue is that any leakage to or from the SN results in a degradation of the stored data level, requiring periodic refresh cycles. Therefore, the standby, or retention power of a GC-eDRAM is given by (10):

$$P_{retention} = P_{leakage} + P_{refresh} = V_{DD} I_{leak} + \frac{E_{refresh}}{t_{ret}} \tag{10}$$

where $I_{leak}$ is the standby leakage current, $t_{ret}$ is the retention time, and $E_{refresh}$ is the energy required to refresh the entire array. The immediate conclusion from (10) is that it is essential to maximize the retention time for low-power operation. Various metrics have been used for simulating the retention time of a bitcell [21, 57, 108], but the unequivocal definition of this important parameter is the time at which the voltage written to $C_{SN}$ degrades to the point where it results in an incorrect readout. This time is set by four primary factors: the initial level stored on $C_{SN}$ following a write, the size of $C_{SN}$, the leakage currents to and from SN, and the readout mechanism. All of these factors are significantly affected by both environmental and manufacturing variations, as demonstrated in measurements by [108]. This results in a large spread of the retention time distribution [110], and necessitates designing for the worst cell, as with any memory array.

However, in addition to the effects of the PVT variations, SN leakage currents are highly sensitive to the biasing level of the WBL. For a stored ‘1’, the highest discharge leakage occurs when the WBL is low, while the worst case for a stored ‘0’ occurs when the WBL is high. As shown in [21, 57, 108], the worst-case biasing of a stored ‘0’ exhibits an orders-of-magnitude lower retention time as compared to that of a ‘1’ for an all-PMOS cell. Consequently, retention time is calculated assuming that the WBL is constantly held low. However, this situation would only occur if a write ‘0’ operation was executed on a given column during every clock cycle, leading to early, power-consuming refresh operations in any typical scenario.

In the framework of this research, in collaboration with the TCL group at EPFL, I fabricated a 0.18 μm test chip, a.k.a. "GREENBELT", including two 2 kb GC-eDRAM arrays and several other test circuits. The test arrays contain two novel methods for retention time extension:


1) The application of bulk biasing to increase the threshold voltage of the write transistor, thereby reducing the data degradation through sub-VT leakage.

2) The implementation of a novel replica cell concept to track the data level degradation in real time and to dynamically apply refresh cycles according to global process variations and array access statistics.

Figure 12: Schematic of the all-PMOS 2T gain cell with I/O write transistor (MW), including biasing levels for access operations.

The fabricated test chip was received in January 2013, and in-depth testing and measurements have been applied to both arrays and the additional test circuits. A first manuscript has been accepted for publication in the IET Journal of Engineering (JoE) and an additional manuscript is currently under review.
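As a numerical illustration of the retention power relation (10), the following minimal Python sketch evaluates the leakage and refresh terms together. The function name and all numeric values are hypothetical placeholders chosen only to show the trend; they are not measurements from the test chip.

```python
def retention_power(v_dd, i_leak, e_refresh, t_ret):
    """Retention power per (10): V_DD * I_leak + E_refresh / t_ret.

    Units: volts, amperes, joules, seconds -> watts.
    """
    if t_ret <= 0:
        raise ValueError("retention time must be positive")
    p_leakage = v_dd * i_leak
    p_refresh = e_refresh / t_ret
    return p_leakage + p_refresh


# Hypothetical values, for illustration only.
V_DD = 1.0        # supply voltage [V]
I_LEAK = 1e-9     # array standby leakage current [A]
E_REFRESH = 1e-9  # energy to refresh the entire array [J]

for t_ret in (1e-4, 1e-3, 1e-2):  # retention time [s]
    p = retention_power(V_DD, I_LEAK, E_REFRESH, t_ret)
    print(f"t_ret = {t_ret:.0e} s -> P_retention = {p:.3e} W")
```

With these placeholder numbers the refresh term dominates, so each tenfold extension of the retention time reduces the retention power by almost a factor of ten, until the leakage term becomes the floor.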


Chapter 5 Low-Power Low-Cost NVM for RFID Tag

5.1 A Low-Power DCVSL-Like GIDL-Free Voltage Driver for Low-Cost RFID Nonvolatile Memory

The following paper, published in the industry-leading IEEE Journal of Solid-State Circuits, describes a novel circuit technique for voltage multiplexing, designed, fabricated, and measured as part of the Low-Cost, Passive RFID demonstrator research project, carried out with support of the Alpha Consortium. This paper is the first journal publication to come out of this project, following two conference papers, presented at IEEE ISCAS 2012 in Seoul, Korea, that included a pre-fabrication brief description of the GIDL-free voltage drivers [47] and the NVM memory architecture [59]. This paper presents one of the main implementation challenges in the operation of the TowerJazz C-Flash memory cell and the various techniques used to overcome it. The paper also includes an in-depth theoretical analysis of design considerations for standard level shifters that was developed in the framework of this project.

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 48, NO. 6, JUNE 2013, P. 1497

A Low-Power DCVSL-Like GIDL-Free Voltage Driver for Low-Cost RFID Nonvolatile Memory

Hadar Dagan, Adam Teman, Student Member, IEEE, Evgeny Pikhay, Vladislav Dayan, Anatoli Mordakhay, Yakov Roizin, and Alexander Fish, Member, IEEE

Abstract: The realization of a low-cost passive radio frequency identification (RFID) tag requires the ability to fabricate the system in a bulk CMOS process without any additional process steps. A recently presented single-poly C-Flash memory bitcell provides an ultralow-power option for implementation of a nonvolatile memory array for use in an RFID system, using only core masks. This cell requires the application of a 10-V potential difference between the cell’s control lines for program and erase operations. Providing the required voltages while using only standard devices results in several design challenges for the voltage drivers, such as the elimination of gate-induced drain leakage (GIDL) currents. In this paper, we present a pair of voltage driver architectures that utilize novel techniques to overcome these challenges. In addition, for the first time, we present an in-depth analysis of the dynamic behavior of standard level shifters. This analysis is applied to our proposed GIDL-free level shifters to provide a sizing methodology for optimization of the area, energy-per-operation, and delay of these circuits. The drivers were designed and fabricated in a TowerJazz 0.18-μm bulk CMOS technology, providing the required functionality with a low static-power figure of 47–49 pW and 0.03–0.36 pJ energy-per-operation.

Index Terms: C-Flash, differential cascode voltage switch logic (DCVSL), gate-induced drain leakage (GIDL), level shifter, low cost, low power, nonvolatile memory (NVM), optimization, phase portrait, radio frequency identification (RFID), voltage driver.

Manuscript received December 13, 2012; revised February 17, 2013; accepted March 08, 2013. Date of publication April 02, 2013; date of current version May 22, 2013. This paper was approved by Associate Editor Hideto Hidaka. This work was supported by the Alpha Consortium of the office of the Chief Scientist of Israel.
H. Dagan and A. Teman are with the Low Power Circuits and Systems Lab (LPC&S), VLSI Systems Center, Ben-Gurion University of the Negev, Be’er Sheva 84105, Israel.
E. Pikhay, V. Dayan, and Y. Roizin are with Tower Semiconductor Ltd., Migdal Haemek 23105, Israel.
A. Mordakhay is with the Faculty of Engineering, Bar-Ilan University, Ramat Gan 52900, Israel.
A. Fish is with the LPC&S of the VLSI Systems Center, Ben-Gurion University of the Negev, Be’er Sheva 84105, Israel, and also with the Faculty of Engineering, Bar-Ilan University, Ramat Gan 52900, Israel.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2013.2252524

I. INTRODUCTION

The key factors in widespread adoption of radio frequency identification (RFID) tags remain cost minimization and low-power operation [1]–[3]. The incorporation of read-write memories into RFID tags provides the opportunity to realize many advanced applications [2]; however, integration of an embedded nonvolatile memory (NVM) array into the integrated circuit (IC) is one of the major obstacles to cost reduction. In general, NVM arrays are fabricated as stand-alone blocks in dedicated processes, requiring multiple nonstandard masks and process steps that substantially increase the manufacturing cost [4]. In addition, these memories usually require high voltages (10 V) to initiate the tunneling currents necessary for programming and erasing the memory. Delivering high voltages to the memory cells often requires special devices to eliminate high leakage currents, such as those caused by gate-induced drain leakage (GIDL) [5], as well as reliability problems. Manufacturing these devices further increases the chip fabrication costs. In order to manufacture a minimum-cost RFID tag, it is essential to integrate an embedded NVM array, fabricated exclusively with core CMOS masks [6].

Recently, TowerJazz presented an ultralow-power single-poly C-Flash bitcell that complies with the aforementioned requirements [7]. By applying opposite-polarity 5-V signals to isolated P-wells (IPWs), the 10-V potential difference necessary for Fowler–Nordheim (F-N) injection is achieved. In addition, this cell provides a fully digital readout through an integrated CMOS inverter, thus eliminating the need for power-consuming analog readout circuitry. The C-Flash bitcell is fabricated using a standard 0.18-μm CMOS process and is therefore a perfect candidate for integration in a low-cost, low-power, passive RFID tag. However, the cell operation requires a comprehensive control scheme, using several voltages (from −5 V to 5 V) that are applied upon a pair of shared buses. Standard analog voltage multiplexing implementations require large, power-hungry circuits, such as digital-to-analog converters (DACs), operational amplifiers, and switched capacitors [8] that are infeasible for integration in a row-wise manner in these low-power, low-cost devices. Therefore, the required voltage multiplexing is carried out by a pair of drivers that are solely comprised of standard devices.

A. Contribution

In this paper, we present the circuit implementation of novel low-power voltage drivers for delivering the required voltages for programming, erasing, and reading from a C-Flash-based NVM array. These drivers are implemented with standard devices, thus enabling low-cost integration of an NVM array into a passive RFID chip. In order to overcome the inherent challenges in designing these drivers, a number of circuit techniques are proposed, and a novel sizing methodology was developed. This methodology is based on an in-depth, dynamic analysis of standard level shifters. This analysis is presented here for the first time and is shown to be applicable to other level-shifter topologies, such as the GIDL-free drivers that we propose here. Finally, the drivers were fabricated and tested, showing full

0018-9200/$31.00 © 2013 IEEE

Fig. 1. C-flash memory cell design and schematic.

functionality according to the NVM requirements, while providing a very low power figure.

The remainder of this paper is constructed as follows. The NVM array architecture and resulting driver requirements, including the inherent design challenges, are presented in Section II. Section III presents the memory architecture and the detailed driver circuit designs. An in-depth dynamic analysis of level shifters and the resulting sizing methodology are presented in Section IV. Section V presents the measurement results, and Section VI concludes the paper.

TABLE I. DRIVER OPERATING MODES.

II. MEMORY ARCHITECTURE AND DRIVER REQUIREMENTS

A. C-Flash Memory Cell Overview

The previously presented C-Flash memory cell [7] is a low-cost NVM bit cell, compatible with single-poly, standard logic twin-well CMOS processes, as illustrated in Fig. 1. Similar to other NVM bit cells, the C-Flash cell provides a programmed and an erased state that are toggled through peripheral biasing, as summarized in Table I. Data are read out of the cell through a standard CMOS inverter (N1 and P1) that is gated by a pass gate (N2 and P2). When the wordlines (WL and its complement) are asserted, the data stored in the cell are driven onto the bit line (BL), for a static digital readout. Differentiating between the “0” and “1” states of the cell is achieved by essentially modifying the switching threshold of the readout inverter. In its initial, erased state, the switching threshold is lower than the digital supply voltage ($V_{DD}$), such that, upon application of read biases (Table I), BL is discharged.1 Programming the cell causes the inverter’s voltage transfer characteristic (VTC) to shift, raising the switching threshold above $V_{DD}$. Any subsequent read biasing charges the BL to represent a logic “1.”

The internal state of the cell is stored upon a floating gate (FG) that is formed between two capacitors: the tunneling gate (TG) capacitor and the control gate (CG) capacitor. These capacitors are formed by a poly-oxide–isolated p-well structure, as illustrated in Fig. 1(a). Changing the state of the cell (i.e., the switching threshold of the inverter) is achieved by tunneling charge onto or off of the FG via the F-N tunneling mechanism. Contrary to standard NVM implementations that require special high-voltage-tolerant devices for tunneling, the C-Flash cell uses standard I/O devices by applying opposite-polarity 5-V biases to the capacitors’ IPW terminals (CG and TG in Fig. 1). Due to the fact that the CG capacitor is approximately 10× larger than the TG capacitor, this biasing scheme causes an approximately 9.5 V potential to fall on the TG capacitor. A positive polarity tunnels charge onto the FG, resulting in a program operation, while a negative polarity removes the charge for an erase operation. For any nonselected cell in write mode, the maximum voltage difference that falls on the TG capacitor is 6.5 V, which is not sufficient for initiating F-N tunneling. For additional details about the cell operation, the reader is encouraged to turn to [7].

1 The digital supply voltage ($V_{DD}$) for this technology is 1.8 V; however, depending on the proximity from the RFID reader, this voltage can drop as low as 1.2 V. The drivers are designed to function at this entire range of voltages, as we will present later.

B. Memory Array Architecture and Operation Modes

The low-cost fabrication requirements of the C-Flash cell make it a viable candidate for implementation of the NVM memory core of a passive RFID tag. A passive RFID system architecture, including a 256-b C-Flash-based NVM array, is illustrated in Fig. 2(a). In addition to the NVM array, the system comprises an energy harvesting and voltage rectifying unit to supply the operating power; an ASK modem and digital control unit for protocol realization; an oscillator for synchronization; and switched-capacitor-based DC/DC converters to supply the NVM operating voltages [9]. Operation of the C-Flash cell requires a complex biasing scheme, requiring the propagation of various voltages to each cell according to the operating mode (see Table I). This is achieved through the detailed array architecture illustrated in Fig. 2(b). This architecture employs a low-power row decoder for row selection and two designated drivers for driving the biasing voltages. The tunneling gate driver drives the horizontally routed TG signals independently to each row, while the control gate driver drives the vertically routed CG signals independently to each column. Data are fed in serially to a serial-input parallel-output (SIPO) write register, thus

enabling random access for programming, by independently applying the appropriate CG signals to the selected row. In this architecture, an entire row is read out simultaneously and the bit-line data is sampled by a parallel-input serial-output (PISO) read register. The TG driver and CG driver transfer the appropriate voltages to the output according to the input data, address, and mode. A detailed description of the operation of the TG driver and CG driver is given in Section III.

Fig. 2. Architectural block diagrams of (a) a passive RFID system and (b) a C-Flash-based NVM macro.

C. Design Challenges

The TG and CG drivers can be considered analog multiplexers, as they are required to transfer various voltages to bias lines according to a number of digital input signals, as defined in Table I. These voltages include 5-V and −5-V biases that are considerably higher than the standard supply voltage (1.8 V). Multiplexing such voltages in a bulk CMOS process without mask adders faces several design challenges. The target TowerJazz 0.18-μm process includes two main categories of MOSFET devices: standard logic (1.8 V) and I/O (3.3 V) transistors. The I/O devices meet the high-voltage requirements of this design, as they can withstand gate voltages of up to 5 V for multiple programming durations. However, application of a large drain-to-gate potential results in high GIDL currents due to band-to-band tunneling (BTBT). These currents can cause the drivers to consume a significant amount of power for extended durations, which is an intolerable requirement in passive RFID tag applications. Therefore, the design must not include a drain-to-gate voltage drop of 5.5 V for NMOS devices, or the equivalent 5.5-V gate-to-drain drop for PMOS devices, during any operating mode or while switching between modes.

The second challenge originates from the passive nature of an RFID tag with an embedded NVM array. A passive RFID tag utilizes energy harvesting from electromagnetic waves to build up the supply voltage. The power of these waves is quadratically inverse-proportional to the distance from the RFID reader, as described in the Friis transmission equation [10]

$$P_r = P_t G_r G_t \left(\frac{\lambda}{4\pi d}\right)^2 \tag{1}$$

where $P_r$ and $P_t$ are the power received by the RFID tag and the power transmitted by the reader, respectively, $G_r$ and $G_t$ are the gain of the RFID tag antenna and the reader antenna, respectively, $\lambda$ is the wavelength of the transmitted wave, and $d$ is the distance between the reader and the RFID tag. Note that the available energy of the tag’s source is less than $P_r$ due to the energy losses in the energy harvesting unit (ac–dc converter).

The quadratic inverse-proportional nature of (1) requires short distances between the tag and the reader, in order to build up the high 5-V biases. This requirement is sufficient for infrequent program and erase operations; however, larger distance operability is required for read operations. Therefore, during read operations, the on-chip 5-V and −5-V dc–dc converters are shut down, and these voltages are not generated. Consequently, the outputs of the ±5-V voltage sources are driven to a high-impedance state. In addition, the level of the standard 1.8-V supply voltage is also affected by the distance from the reader. Therefore, the drivers are required to operate with a supply voltage ranging from 1.2 to 1.8 V, both when the ±5-V biases are available, as well as when they are not.

III. DRIVER IMPLEMENTATION

The previous section described the NVM array architecture and operating modes, defined the basic requirements for the voltage drivers, and presented the challenges that are inherent to the design of these drivers. Based on the NVM architecture and operating modes, here, we describe the detailed implementations of the TG and CG drivers that overcome the design challenges.

Fig. 3. (a) Schematics of TG driver. (b) Eight rows of the TG driver (layout).
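As a numerical aside on the quadratic distance dependence of the Friis relation (1) from Section II-C, the sketch below computes the received power in free space using the standard form of the equation. The carrier frequency, transmit power, and antenna gains are hypothetical values that do not come from the paper, and the losses of the energy harvesting unit are ignored.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # [m/s]


def friis_received_power(p_tx, g_tx, g_rx, wavelength, distance):
    """Free-space received power: P_tx * G_tx * G_rx * (lambda / (4*pi*d))**2."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return p_tx * g_tx * g_rx * (wavelength / (4.0 * math.pi * distance)) ** 2


# Hypothetical link parameters (assumptions, for illustration only).
FREQ_HZ = 900e6                        # carrier frequency [Hz]
WAVELENGTH = SPEED_OF_LIGHT / FREQ_HZ  # [m]
P_TX = 1.0                             # reader transmit power [W]
G_READER = 4.0                         # reader antenna gain (linear)
G_TAG = 1.5                            # tag antenna gain (linear)

for d in (0.5, 1.0, 2.0, 4.0):  # reader-to-tag distance [m]
    p_rx = friis_received_power(P_TX, G_READER, G_TAG, WAVELENGTH, d)
    print(f"d = {d:3.1f} m -> received power = {p_rx * 1e3:.3f} mW")
```

Each doubling of the distance cuts the received power by a factor of four, which is consistent with the argument that the high-voltage biases can only be built up at short range while read operations must work with a reduced supply.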

A. TG Driver

The horizontally routed TG signal is driven by the TG driver analog multiplexer, as shown in Fig. 2(b). A close study of this driver’s output biases (Table I) shows that, for all of the defined operating modes, the driver multiplexes between a positive voltage (5 or 1.8 V) and a negative voltage (−5 V), depending on whether or not the row is selected. This requirement led to implementation of the driver at two levels: global and local. The resulting architecture of the driver is illustrated in Fig. 3(a). The global level comprises a Positive Driver that selects between the system’s positive bias voltages (1.8 and 5 V) according to the operating mode. The selected positive voltage is globally driven to all of the row peripherals as an input to the local levels of the driver. This level is made up of a row-specific Logic Block and End Driver pair. The Logic Block receives the WL signal for row selection and the mode signal that defines the system operating mode. Accordingly, it selects the positive or negative voltage to be driven from the End Driver onto the adjacent row’s TG signal. In read mode, when the 5-V bias is not available and is considered a high-impedance line, the Positive Driver drives 1.8 V and the End Driver selects the positive voltage (1.8 V), whether the cell is selected or not. The layout of eight rows of the TG driver is shown in Fig. 3(b).

Fig. 4. Positive driver/pre-multiplexer.

Delivering the required output during read mode is a nontrivial design challenge, as the high-voltage (5 V) input to the block is floating. To deal with this state, the schematic shown in Fig. 4 is proposed. The block receives a digital input signal (ERS, denoting an erase operation) and two bias voltages: the standard 1.8-V supply voltage, and the 5-V bias that is floated during read and standby modes. The upper block of the circuit, comprising M5 and M6, essentially selects the maximum voltage between the two bias inputs and passes it to an internal node. This can be explained as follows: if the 5-V bias is present, M6 is cut off, and M5 passes the high voltage to this node. When the 5-V bias is floating, it will discharge until M5 is cut off and M6 passes the 1.8-V supply to the node. The central block of the Positive Driver (M1–M4) is a standard differential cascode voltage switch logic (DCVSL) level shifter [11] with a pair of complementary output nodes. DCVSL circuits are appropriate for low-power design, as they enable level shifting with ideally zero static current consumption (neglecting leakage currents, such as GIDL). The result of this setup is a standard level shifter with the selected maximum voltage as its high voltage; therefore, its output nodes are shifted versions of the ERS signal and its complement, respectively. The output of the Positive Driver is driven through a pair of PMOS devices (M7 and M8) that function as an output buffer, driving the selected positive bias with minimum output resistance. Note that in program mode, despite the fact that the 5-V bias is present, the output is driven to the 1.8-V supply, as required.

Using the concept presented for the Positive Driver, one could assume that a standard DCVSL level shifter would be suitable for second-level multiplexing (i.e., implementation of the End Driver). Let us assume, for the sake of simplicity, that the End Driver has to select between the 1.8-V supply and −5 V. This situation is illustrated in Fig. 5(a), showing that the output ports would indeed be driven to −5 V and 1.8 V (or vice versa, when the input port voltages are reversed). However, in its static state, this circuit conducts high GIDL current, since the drain-to-gate potential of the NMOS devices and the gate-to-drain potential of the PMOS devices exceed 5.5 V. In order to solve this problem, an additional bias level was added, forming a GIDL-free level shifter, as shown in Fig. 5(b). The gates of the additional devices (M3–M6) are tied to ground to create a stacked buffer between the positive and negative voltages and the output ports. This forces the drain-to-gate potentials to be lower than 5.5 V. This is possible due to the fact that M3 cannot charge its source node to a potential higher than one threshold voltage below its grounded gate, and M6 cannot discharge its source node to a potential lower than one threshold voltage above its grounded gate (the NMOS and PMOS threshold voltages, respectively). Therefore, this circuit does not suffer from the high GIDL currents present in the standard DCVSL implementation. An output buffer, similar to that shown in Fig. 5(a) for the Positive Driver, is added to the End Driver for minimizing output resistance. A detailed analysis of transistor sizing considerations is given in Section IV.

Fig. 5. (a) Standard DCVSL level shifter. (b) GIDL-free level shifter.

B. CG Driver

Following the detailed overview of the challenges and solutions in the design of the TG driver from the previous subsection, it would seem that the design of the CG driver would be trivial. However, a closer examination of Table I shows that the CG driver actually has a more complex biasing scheme, as each column requires an independent positive voltage, thus rendering a global positive voltage inapplicable. Employing a full, two-stage driver for each column would be area and power consuming; therefore, a different approach was used, as illustrated in Fig. 6.

In this case, a global Negative Driver was employed, providing the CG signal’s −5-V bias during erase mode. However, in order to ensure correct operation during other modes and, in particular, when high voltages are not available, this block drives its alternate output level during all modes other than the erase mode. Therefore, its implementation was realized with the same GIDL-free level-shifter circuit that was described for the TG driver [Fig. 5(b)].

The CG driver’s positive voltages, on the other hand, are produced independently for each column. Each local column driver comprises a Logic Block, a Positive Driver, and a P-type pass transistor. An additional global component selects the maximum available positive voltage and drives it as an input to the local column drivers. The Positive Driver is implemented with the same concepts used in the design of the TG Positive Driver, driving either 0 V or this maximum voltage to the gate of the pass transistor, according to the mode and data (output from the Write Register) signals. The drain terminal of the pass transistor is connected to the positive bias to be delivered, such that in program, read, and standby modes, this bias is driven to the output, and during erase mode, the pass transistor is cut off, presenting a high-impedance output.

The outputs of the local positive blocks and the global Negative Driver are multiplexed through an anti-GIDL block, similar to the mechanism implemented inside the GIDL-free level shifter, as shown in Fig. 6. This prevents the drain-to-gate potentials of the NMOS devices and the gate-to-drain potentials of the PMOS devices from exceeding 5.5 V, while driving the final CG signals to their respective columns.

IV. DESIGN CONSIDERATION AND SIZING METHODOLOGY

The previous section introduced the architectures and circuit implementations of the TG and CG voltage drivers. In this section, a more comprehensive analysis of the GIDL-free driver is presented, along with a novel optimization approach for DCVSL-like circuits. The similarity between this DCVSL-like circuit and SRAM bit cells is first discussed. In this context, optimization of the driver circuits using phase portrait plots is presented. This analysis is the first time standard DCVSL-based level shifters have been discussed in terms of dynamic analysis, providing a sizing methodology for these common circuits.

Fig. 6. Schematic of the CG driver.

Fig. 7. (a) Standard DCVSL level-shifter transient illustration. (b) Equivalent circuit for level shifters.

A. Phase Portrait as an Efficient Optimization Tool

In order to understand the design considerations of the GIDL-free level shifter, we will first discuss a standard DCVSL level shifter. Unlike the nonratioed CMOS logic family, DCVSL gates require proper transistor sizing for robust functionality, and, when this circuit style is used to implement a level shifter, careful sizing is mandatory.

Fig. 7(a) illustrates the transient behavior of a DCVSL level shifter, with standard digital (1.8 V, 0 V) input signals and shifted (1.8 V, −1.8 V) output signals on the two output nodes. In order to avoid GIDL currents, let us examine the interval [−1.8 V, 1.8 V] for standard level shifting. When the gate voltages of M3 and M4 are flipped to 1.8 V and 0 V, respectively, M4 starts charging its output node, but M2 also drives current (in triode mode), discharging this node due to the initial condition of the opposite output node. If M4 drives more current than M2 does, the voltage of this node will start increasing. With the increase of the voltage on this node, M1 starts conducting and discharging the opposite node, causing the discharge current of M2 to weaken. As a result, the voltage of the charged node will increase further, until reaching its final value of 1.8 V.

Clearly, transistor sizing plays an important role in proper operation of the cell. If the transistors are not sized well, the final values of the two output nodes will converge to a metastable point, rather than the target stable point with one node at 1.8 V and the other at −1.8 V. This state is reached due to the feedback nature of the level shifter, which is demonstrated in the small-signal circuit model of Fig. 7(b). Each transistor is modeled as a voltage-dependent current source, charging or discharging the parasitic capacitor of one of the two output nodes. The currents sourced by the transistors depend on the voltages of both output nodes. Metastability is reached when, at an intermediate point, the sum of all currents at each of the two nodes is zero. The transient region around the metastable point will hold for a substantial time period, resulting in increased power dissipation and long propagation delays, and thus should be avoided.

Sizing methodologies for DCVSL circuits have previously been shown [12], [13] with the intent of optimizing

the propagation delay of these cells, using an approximation for the delay affected by the input signal slope. Unlike these methodologies, the goals of the proposed sizing methodology are reduction of the energy consumption and chip area. The proposed methodology is based on the similarity between the transient state of a DCVSL level shifter and a write operation of a standard SRAM cell. These two circuits are illustrated in Fig. 8(a) and (b), respectively.

Fig. 8. (a) 6T-SRAM cell. (b) Standard DCVSL level shifter. (c) GIDL-free level shifter.

A close look at the level shifter of Fig. 8(b) shows that this structure is actually a memory circuit comprising M1 and M2. Realizing that toggling a DCVSL level shifter is very similar to a single-ended SRAM write operation allows us to adopt SRAM design methodologies for optimization of the DCVSL level shifter. The traditional metric of robustness for SRAM cells is static noise margin (SNM), as described extensively in literature [14], [15]. However, a better and more accurate understanding of the dynamic nature of an SRAM cell is achieved through the analysis of dynamic noise margins (DNMs) with control theory tools [16], [17]. The basic equations for this analysis, (2), are derived from the small-signal model of Fig. 7(b): for each of the two internal nodes, the lumped node capacitance multiplied by the time derivative of the node voltage equals the net current flowing into that node, where the lumped capacitances of the two nodes are assumed to be equivalent. Equation (2) suggests that knowing the current differences at each node leads to the voltage derivatives of the two nodes. Therefore, by taking a linear approximation of the derivative, the voltage difference of each node (ΔV) for a sufficiently small time step (Δt) can be written as (3), i.e., the net node current multiplied by the time step and divided by the node capacitance. As a result, the system’s stable points can be derived from its phase portrait by finding the points at which the voltage gradient is substantially small relative to the maximum voltage gradient (4).

In order to demonstrate proper and improper sizing of a DCVSL level shifter, a phase portrait plot can be used in a similar fashion to that applied on a 6T SRAM cell in [18] and [19]. The plots were constructed by simulating a toggling bias with initial conditions in the range of [−1.8 V, 1.8 V]. Voltage differences (ΔV) were measured after a sufficiently small time step (Δt), during which the node voltages were approximately linear. Fig. 9(a) plots the phase portrait of a DCVSL level shifter with minimum sizes for all devices. The resulting state space shows three stable points, with the bottom right stable point representing the desired final value. It is evident from the phase and magnitude of the arrows that, starting from the initial condition (the two nodes at opposite rails, −1.8 V and 1.8 V), the system would converge to one of the metastable points, resulting in large static currents, increased delay, and possible erroneous digital readout. Proper sizing, on the other hand, results in a single stable point, as shown in Fig. 9(b) for appropriately sized PMOS and NMOS devices. Indeed, this is the expected result, as increasing the pull-up drive (as compared to the pull-down drive) assists the circuit to overcome the positive feedback of its initial state.

While the phase portrait is an efficient tool for cell sizing, extracting this plot with an accurate circuit simulator is extremely time consuming. Therefore, a more suitable approach would be to use an approximate model for the transistors’ conductance, and to calculate the phase portrait with a numerical solver, such as MATLAB. An appropriate model for this approximation is the

Fig. 9. Simulated phase portrait, for DCVSL level shifter. (a) An improperly sized cell. (b) A properly sized cell.

well-known EKV model for MOS transistor current [20], since it is a continuous, single-equation model for all the modes of operation (i.e., subthreshold, triode, and saturation). According to this model, the transistor current of an NMOS device is given by (5), whose terms are the thermal voltage; the gate, source, and drain potentials; the drive strength of the transistor; the body effect (slope) factor; and a term described by (6) in terms of the transistor’s threshold voltage. It is worth mentioning that, when the gate voltage is higher than the threshold voltage, (5) reduces into the Shockley model [21]. In the same way, for subthreshold operation, this expression coincides with the expression for subthreshold current [22].

Fig. 10. Calculated phase portrait, for DCVSL level shifter. (a) Improperly sized cell. (b) Properly sized cell.

Estimation of the three model parameters can be established using the nonlinear least squares (NLLS) method [23]. Table II summarizes the parameters for the curve-fitting process and the calculation times of the phase portrait using the different approaches. Fig. 11 shows the curve-fitting results for NMOS transistor currents using 400 total measurement points of the gate and drain voltages with the source at 0 V. The same process was used to extract the parameters of a PMOS device. Once accurate current models were extracted, phase portraits were plotted, as shown in Fig. 10 for the same sizing choices as in Fig. 9. The differences between the simulated and the calculated portraits are mainly due to the nonideal linear approximation [see (3)], which can be corrected by using smaller time steps (Δt).
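The calculated phase portrait described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: it assumes the commonly used EKV interpolation form (a difference of two ln²(1 + e^{x/2}) terms with pinch-off voltage (V_G − V_T)/n), the four-transistor topology described for M1–M4, and hypothetical parameter values; the names `ekv_current` and `phase_step` are illustrative only.

```python
import math

PHI_T = 0.02585  # thermal voltage near 300 K [V]


def _ln2_term(x):
    """ln^2(1 + exp(x / 2)), evaluated without overflow."""
    h = x / 2.0
    softplus = h + math.log1p(math.exp(-h)) if h > 0 else math.log1p(math.exp(h))
    return softplus * softplus


def ekv_current(v_g, v_s, v_d, i_s, n, v_t):
    """Assumed EKV-style NMOS drain current; voltages are referenced to the bulk."""
    v_p = (v_g - v_t) / n  # pinch-off voltage
    return i_s * (_ln2_term((v_p - v_s) / PHI_T) - _ln2_term((v_p - v_d) / PHI_T))


def phase_step(v_a, v_b, in_a, in_b, nmos, pmos,
               v_dd=1.8, v_ss=-1.8, c_node=1e-15, dt=1e-13):
    """One linearized step, dV = I_net * dt / C, for both output nodes.

    Assumed topology: PMOS M3/M4 (gates in_a/in_b) pull nodes a/b up to v_dd;
    cross-coupled NMOS M1/M2 (gates b/a) pull nodes a/b down to v_ss.
    nmos and pmos are (i_s, n, v_t) tuples.
    """
    i_m1 = ekv_current(v_b - v_ss, 0.0, v_a - v_ss, *nmos)
    i_m2 = ekv_current(v_a - v_ss, 0.0, v_b - v_ss, *nmos)
    # PMOS devices: mirror the voltages around v_dd and reuse the NMOS expression.
    i_m3 = ekv_current(v_dd - in_a, 0.0, v_dd - v_a, *pmos)
    i_m4 = ekv_current(v_dd - in_b, 0.0, v_dd - v_b, *pmos)
    return (i_m3 - i_m1) * dt / c_node, (i_m4 - i_m2) * dt / c_node


# Hypothetical (i_s [A], n, v_t [V]) parameters, for illustration only.
NMOS = (1e-7, 1.3, 0.5)
PMOS_WEAK = (1e-7, 1.3, 0.5)
PMOS_STRONG = (4e-7, 1.3, 0.5)

# State shortly after the inputs toggle: node a still high, node b partly charged.
for label, pmos in (("weak pull-up", PMOS_WEAK), ("strong pull-up", PMOS_STRONG)):
    dv_a, dv_b = phase_step(1.8, -1.0, in_a=1.8, in_b=0.0, nmos=NMOS, pmos=pmos)
    print(f"{label}: dV_a = {dv_a:+.2e} V, dV_b = {dv_b:+.2e} V")
```

Evaluating `phase_step` over a grid of node voltages gives the arrows of a phase portrait. With the placeholder parameters, the weak pull-up lets the cross-coupled NMOS pull node b back down at this state, while the stronger pull-up keeps it rising, in line with the sizing argument in the text.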

Fig. 11. NMOS current curve fitting.
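The curve-fitting step behind Fig. 11 can be illustrated with a small nonlinear least-squares routine. This is a minimal sketch, not the authors' procedure: it assumes the common EKV interpolation with the source at 0 V, fits ln(I_S), the slope factor, and the threshold voltage, and uses a plain Levenberg-Marquardt loop (the algorithm of [23]) with a forward-difference Jacobian. All names and starting values are hypothetical.

```python
import math

PHI_T = 0.0259  # assumed thermal voltage [V]

def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))

def model(vg, vd, log_is, n, vt):
    """EKV-style NMOS current with the source at 0 V; log_is = ln(I_S)."""
    vp = (vg - vt) / n
    return math.exp(log_is) * (softplus(vp / (2 * PHI_T)) ** 2
                               - softplus((vp - vd) / (2 * PHI_T)) ** 2)

def residuals(params, data):
    """Model minus measurement for each (vg, vd, i_meas) sample."""
    return [model(vg, vd, *params) - i_meas for vg, vd, i_meas in data]

def solve(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * m
    for r in reversed(range(m)):
        x[r] = (aug[r][m] - sum(aug[r][c] * x[c] for c in range(r + 1, m))) / aug[r][r]
    return x

def fit_lm(data, p0, iterations=60):
    """Levenberg-Marquardt: accept a step only if it lowers the squared error."""
    p, lam = list(p0), 1e-3
    r = residuals(p, data)
    cost = sum(e * e for e in r)
    m = len(p)
    for _ in range(iterations):
        cols = []  # Jacobian columns by forward differences
        for k in range(m):
            h = 1e-6 * max(1.0, abs(p[k]))
            q = p[:]
            q[k] += h
            cols.append([(rq - r0) / h for rq, r0 in zip(residuals(q, data), r)])
        jtj = [[sum(x * y for x, y in zip(cols[i], cols[j])) for j in range(m)]
               for i in range(m)]
        jtr = [sum(x * e for x, e in zip(cols[i], r)) for i in range(m)]
        # Marquardt damping scales the diagonal of J^T J
        damped = [[jtj[i][j] * (1 + lam if i == j else 1) for j in range(m)]
                  for i in range(m)]
        try:
            step = solve(damped, [-g for g in jtr])
            trial = [pi + si for pi, si in zip(p, step)]
            r_trial = residuals(trial, data)
            cost_trial = sum(e * e for e in r_trial)
        except (ZeroDivisionError, OverflowError, ValueError):
            cost_trial = float("inf")  # reject numerically invalid steps
        if cost_trial < cost:
            p, r, cost, lam = trial, r_trial, cost_trial, lam / 10
        else:
            lam *= 10
    return p, cost
```

A typical use would pass a list of `(vg, vd, i_meas)` tuples, such as a 20 × 20 sweep for 400 points, together with a rough initial guess. Fitting ln(I_S) rather than I_S is a design choice made here to keep the three parameters on comparable scales.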

TABLE II
TRANSISTOR-CURRENT CURVE-FITTING PARAMETERS

Fig. 12. Standard DCVSL level-shifter energy/operation and cell area as a function of transistor sizing.

Table II emphasizes the benefit of phase portrait calculation over full simulation, showing a speedup of approximately 250×. The minimal differences between the two methods, evident from a comparison of Figs. 9 and 10, make the calculated phase portrait an efficient tool for sizing evaluation, especially for optimization of cell area and energy-per-operation. Different choices of sizing should be evaluated until a proper phase portrait is achieved, i.e., one displaying a single stable point at the bottom-right (top-left) corner with large-magnitude right (left)-pointing arrows on the left (right) side of the plot. The sizing used in Figs. 9(b) and 10(b), for instance, produces a proper phase portrait. This sizing also provides a satisfactory tradeoff between cell area and energy-per-operation, as illustrated in Fig. 12. This figure presents the area and energy-per-operation of the cell, as a function of transistor sizing. It can be deduced from this figure that increasing the two sizing parameters substantially decreases the energy-per-operation, until a point where the energy consumption starts to rise. The reason for this is that increasing the sizing below certain values reduces the wasted energy consumption caused by the previous state of the cell, but at the same time increases the node capacitances. Therefore, the "desirable" energy consumption, which grows with the node capacitance, increases.

To conclude, the sizing of the cell should be as minimal as possible, while still producing a proper phase portrait. Doing so will lead to a small overall area and low energy consumption, as well as sufficiently low propagation delay. In this example, with the sizing chosen above, the propagation delay is around 0.9 ns, which is a reasonable delay for RFID applications.

Fig. 13. Transistor chain current for 0 V; 1.8 V.

B. Using Phase Portrait for GIDL-Free Driver Optimization

The previous subsection argued that the phase portrait is an efficient optimization tool for a standard DCVSL level shifter in terms of energy, area, and delay. However, the proposed GIDL-free level shifter [see Fig. 5(b)] is a more complex circuit than the standard level shifter, as it comprises six internal nodes instead of two. One reasonable approach for handling this circuit would be the construction of a phase portrait with more than two dimensions. However, in this approach, the calculated voltage differences cannot assume equal node capacitances, as in the previous analysis.

For the standard DCVSL level shifter (Fig. 8), each node determines the operating mode of all of the transistors in the circuit. The situation is different for the GIDL-free level shifter,

Fig. 14. GIDL-free level shifter: calculated phase portrait for (a) improper sizing and (b) proper sizing.

where the internal nodes of one branch do not directly affect the operating modes of the opposite branch transistors (M4, M6, and M8). The same holds true for the internal nodes of the other branch and transistors M3, M5, and M7. In other words, the feedback mechanism of this circuit primarily depends on the voltages of two nodes only. Additionally, taking into account the fact that phase portraits with three or more dimensions are nonintuitive (and impossible to visualize), a more acceptable approach would be lumping the devices in the pull-up chains into simplified equivalent circuits. Using this approach, each pull-up chain (comprising two PMOS devices and one NMOS device) is considered as a single three-terminal device. After expressing the current of these "new" three-terminal devices, construction of the phase portrait is achieved exactly as described before.

To derive the current expressions of the equivalent devices, it is important to realize that the currents that charge or discharge the capacitances of these two nodes can be described as the differences between the pull-up chain currents and the pull-down transistor currents, as given in (7); as before, (3) and (4) hold true.

Fig. 13 shows the I–V curves of a transistor chain for several sizing cases, under fixed bias voltages of 1.8 V and 0 V. The transistor chain and its equivalent circuit are shown in the inset of Fig. 13. Interestingly but not surprisingly, the transistor chain produces the I–V curve of a serially connected PMOS and diode. The "diode" behavior between 1 V and 0 V is due to the fact that above a certain voltage (related to the 0.7 V threshold of an I/O transistor) the NMOS transistor can only drive a subthreshold or near-threshold current, thereby limiting the current of the equivalent device. Furthermore, for voltages below −3 V, the current curve is approximately linear, since the NMOS transistor is in the triode region. The reason for this is that in order to attain a current equality between the NMOS and PMOS devices, the voltage of the intermediate node has to be relatively low, since the overdrive voltage and mobility of NMOS transistors are higher than those of PMOS transistors. Consequently, the voltage drop across the NMOS is small. Moreover, for voltages below −3 V, this drop is very low and relatively constant; hence, the NMOS current can be approximated by (8), which involves the transconductance of an NMOS device. Since the drop is relatively constant when the applied voltage is lowered below −3 V, the NMOS overdrive increases linearly as the applied voltage decreases. As a result, lowering the applied voltage below −3 V linearly increases the equivalent device current. However, below −4.5 V, slight current saturation occurs. This is due to a very high overdrive voltage, combined with a very low drain-to-source voltage, resulting in high vertical fields and considerably low horizontal fields that cause the NMOS transistor current to saturate.

One of the most important conclusions derived from Fig. 13 is that the equivalent transistor chain current is linearly dependent on the PMOS transistor widths and is hardly affected by the width of the NMOS, in particular for voltages below 1.3 V. Since the 0 V bias provides the two serially connected PMOS devices with the same gate voltage, we can size them with equal widths and consider them as a single equivalent PMOS. As before, a low (highly negative) applied voltage pulls down the intermediate node, biasing the equivalent PMOS in saturation mode. As a result, neglecting channel length modulation, the current of the equivalent PMOS can be expressed by (9), which involves the transconductance of a PMOS device. This current also equals the NMOS current given in (8), and the fact that the expression in (9) is independent of the NMOS parameters suggests that increasing the NMOS transconductance decreases the voltage drop across the NMOS at the same rate, thereby keeping the current of the transistor chain constant. However, it

Fig. 15. GIDL-free level-shifter performance as a function of the two sizing parameters. (a) Energy/operation and cell area. (b) Propagation delay.

can be noticed that increasing the NMOS width slightly lowers the point where the NMOS starts to drive current.

The above results substantially simplify the GIDL-free level shifter optimization, as only two sizing parameters need to be determined, exactly as in the standard DCVSL level-shifter optimization. In order to minimize area, energy-per-operation, and delay, the proposed approach targets the minimum sizing that produces a proper phase portrait. Fig. 14(a) and (b) presents the GIDL-free level-shifter phase portraits for cases of improper and proper sizing, respectively. In Fig. 14(b), the PMOS widths were set to a chosen value, the width of the NMOS of the three-transistor branch was kept minimal, and the length of the bottom NMOS transistors was set accordingly. It is evident that there is only one stable point in Fig. 14(b), which lies close to the 5 V point. This result coincides with the presented theory, as well as the analysis illustrated in Fig. 5(b), where the voltages of the two nodes varied starting from 5 V. The total cell area, as a function of the two sizing parameters, is described by expression (10).

Fig. 15(a) and (b) shows that the chosen sizing provides a good tradeoff between the cell area, energy-per-operation, and propagation delay. This result is well understood by observing the phase portrait in Fig. 14(b). It is also worth mentioning the exponential decay of both energy-per-operation and cell delay as a function of the sizing. This result stems from the fact that proper sizing substantially decreases the time period during which the cell struggles to change its mode, while spending an abundant amount of energy for this purpose. Moreover, sizing beyond a given value leads to a negligible benefit, as shown in Fig. 15(b). Choosing the larger of the two sizing options compared there leads to a delay reduction of only 1.58 ns, at the cost of approximately doubling the total cell area.

Fig. 16. CG driver layout.

V. MEASUREMENT RESULTS

The TG and CG drivers were implemented in a bulk CMOS TowerJazz 0.18-μm technology, using a standard twin-well process without any additional process steps. Fig. 16 shows the top part of the CG driver layout and a photograph of the same chip area. This displayed block includes the global positive driver above the local driver of the top row. The layout and photograph of the left part of the TG driver are displayed in Fig. 17. The large metal pads (approximately 90 μm × 90 μm) were used to connect the test circuits to the measurement scheme, illustrated in Fig. 18. The cascade prober comprises the microscope and a 32-channel probe card. The probe card is located below the chip, connecting to selected metal pads on the surface. The cascade prober was connected to the low-leakage switching matrix that switches between delivering the output signals to the HP-415x signal analyzer, and delivering the signals created in the NI-6133 generator card to the probes connected to the chip. Furthermore, the switching matrix

Fig. 17. TG driver layout.

Fig. 20. Output voltage and current dissipation of the CG driver for full range of required supply voltages.

Fig. 18. TowerJazz laboratory measurement setup.

Fig. 19. Measured GIDL currents of a PMOS I/O transistor in the TowerJazz 0.18-μm technology.

Fig. 21. Measured CG driver wave diagrams.

and the analyzer were controlled with the TowerJazz internal Chameleon program, created in the Lab-View environment.

The problematic GIDL currents due to a high potential drop between gate and drain are measured in Fig. 19. This figure plots the leakage currents of a PMOS I/O transistor as a function of an applied negative drain voltage with all other terminals grounded. The exponential nature of the bulk-to-drain current as a function of this voltage emphasizes the need to limit these biases, as elaborated in Section III.

Functionality of the CG driver for the full range of read voltages is shown in Fig. 20. The driver successfully outputs the full required voltage for supplies of 1.2 V to 1.8 V, according to the requirements of the energy harvesting unit during read operations lacking a high (5 V) and low (−5 V) supply. The low static currents, ranging from 15 to 20 pA per driver, coincide with the driver requirements. Indeed, the maximum energy dissipated for the worst case operating mode is lower than 0.36 pJ, and this only occurs when higher energy is available due to close proximity to the reader (during program/erase cycles). Measured waveforms of the CG driver are shown in Fig. 21. The hashed lines denote the initiation of a program operation. This is achieved by asserting PRG and SEL to "1" with ERS (not shown here) constantly held low. Accordingly, the internal node is pulled up to 5 V, and, subsequently, the CG output signal is pulled up close to 5 V. Together with the appropriate TG output signal, this voltage is sufficient for programming the selected C-Flash cell.

TABLE III
FIGURES OF MERIT

Table III summarizes the main features of the TG and CG drivers according to post-silicon measurements. Both drivers have a very low static power figure, as expected through the employment of GIDL-free circuits. The dynamic energy-per-operation shown in Table III represents the switching operation that led to the highest energy consumption, both in the presence of 5-V/−5-V supplies, as well as when only the 1.8-V supply is generated. As expected, the CG driver consumes slightly more energy-per-operation than the TG driver, due to its larger complexity. However, in both cases the dynamic energy consumption is low.

VI. CONCLUSION

This paper presented a pair of low-power voltage drivers for a C-Flash-based NVM array to be integrated in a passive RFID system. The driver requirements include multiplexing a wide range of voltages to their output nodes, while maintaining low static and dynamic power consumption. Designing these drivers required the development of several novel techniques to solve challenges that arose from the necessity to retain the ultra-low manufacturing costs. These challenges included floating bias voltages during read and standby modes and high GIDL currents that result from large potential drops over standard process devices. The resulting architectures and circuits developed were presented. An in-depth analysis of the dynamic behavior of DCVSL level shifters and the proposed GIDL-free level shifters was presented for the first time. This analysis resulted in a methodology for sizing these level shifters to optimize the tradeoffs between area, energy-per-operation, and delay. This methodology can also serve as a useful tool for other ratioed logic families and subthreshold logic design. The drivers were implemented and fabricated in a bulk CMOS TowerJazz 0.18-μm process without any additional masks or process steps. The drivers were tested for functionality and resulting waveforms were presented. Future work includes the fabrication of the full NVM array and the complete RFID tag.

ACKNOWLEDGMENT

The authors would like to thank M. O. Naor for his contributions in designing and testing the presented circuits.

REFERENCES

[1] S. Sarma, "Towards the five-cent tag," MIT Auto ID Center, Tech. Rep. MIT-AUTOID-WH-006, 2001.
[2] R. Want, "An introduction to RFID technology," IEEE Pervasive Comput., vol. 5, no. 1, pp. 25–33, 2006.
[3] N. Wu, M. Nystrom, T. Lin, and H. Yu, "Challenges to global RFID adoption," Technovat., vol. 26, no. 12, pp. 1317–1323, 2006.
[4] A. Atrash et al., "Zero-cost MTP high density NVM modules in a CMOS process flow," in Proc. IEEE IMW, 2010, pp. 1–4.
[5] J. Chen, T. Chan, I. Chen, P. Ko, and C. Hu, "Subbreakdown drain leakage current in MOSFET," IEEE Electron Device Lett., vol. 8, no. 11, pp. 515–517, Nov. 1987.
[6] A. Strum, T. Mahlen, and Y. Roizin, "Non-volatile memories in the foundry business," in Proc. IEEE IMW, 2010, pp. 1–5.
[7] Y. Roizin, E. Aloni, A. Birman, V. Dayan, A. Fenigstein, D. Nahmad, E. Pikhay, and D. Zfira, "C-flash: An ultra-low power single poly logic NVM," in Proc. NVSMW/ICMTD NVM Wkshp and Int. Conf. on Memory Technol. and Design, 2008, pp. 90–92.
[8] S. Ben-Yaakov and M. Evzelman, "Generic and unified model of switched capacitor converters," in Proc. IEEE ECCE, 2009, pp. 3501–3508.
[9] H. Dagan, A. Teman, A. Fish, E. Pikhay, V. Dayan, and Y. Roizin, "A low-cost low-power non-volatile memory for RFID applications," in IEEE ISCAS Tech. Dig., May 2012, pp. 1827–1830.
[10] K. Chang, RF and Microwave Wireless Systems. New York, NY, USA: Wiley, 2002.
[11] L. Heller, W. Griffin, J. Davis, and N. Thoma, "Cascode voltage switch logic: A differential CMOS logic family," in IEEE ISSCC Tech. Dig., Feb. 1984, vol. XXVII, pp. 16–17.
[12] M. Shams, Modeling and Optimization of CMOS Logic Circuits with Application to Asynchronous Design. Waterloo, ON, Canada: Univ. of Waterloo, 1999.
[13] N. Masoumi, J. Ghasemi, M. Ahmadian, F. Raissi, and M. Masoumi, "Enhancing performance and saving energy in CMOS DCVSL gates by using a new transistor sizing algorithm," in Proc. Int. Wkshp SoC for Real-Time Applicat., 2005, pp. 283–288.
[14] J. Lohstroh, "Static and dynamic noise margins of logic circuits," IEEE J. Solid-State Circuits, vol. SSC-14, no. 3, pp. 591–598, Jun. 1979.
[15] E. Seevinck, F. J. List, and J. Lohstroh, "Static-noise margin analysis of MOS SRAM cells," IEEE J. Solid-State Circuits, vol. 22, no. 5, pp. 748–754, Oct. 1987.
[16] M. Sharifkhani and M. Sachdev, "SRAM cell stability: A dynamic perspective," IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 609–619, Feb. 2009.
[17] W. Dong, P. Li, and G. Huang, "SRAM dynamic stability: Theory, variability and analysis," in Proc. IEEE/ACM ICCAD, Nov. 2008, pp. 378–385.
[18] J. Mezhibovsky, A. Teman, and A. Fish, "Low voltage SRAMs and the scalability of the 9T supply feedback SRAM," in Proc. IEEE SOCC, 2011, pp. 136–141.
[19] A. Teman, A. Mordakhay, J. Mezhibovsky, and A. Fish, "A 40-nm sub-threshold 5T SRAM bit cell with improved read and write stability," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 12, pp. 873–877, Dec. 2012.
[20] C. C. Enz, F. Krummenacher, and E. A. Vittoz, "An analytical MOS transistor model valid in all regions of operation and dedicated to low-voltage and low-current applications," Analog Integr. Circuits Signal Process., vol. 8, no. 1, pp. 83–114, 1995.
[21] W. Shockley, "A unipolar field-effect transistor," Proc. IRE, vol. 40, no. 11, pp. 1365–1376, Nov. 1952.

[22] S. Fisher, A. Teman, D. Vaysman, A. Gertsman, O. Yadid-Pecht, and A. Fish, "Digital subthreshold logic design – motivation and challenges," in Proc. IEEE, Dec. 2008, pp. 702–706.
[23] D. W. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," J. Soc. Ind. Appl. Math., vol. 11, no. 2, pp. 431–441, 1963.

Hadar Dagan received the B.Sc. degree in electrical engineering (summa cum laude) from Ben-Gurion University, Be'er Sheva, Israel, in 2010, where he is currently working toward the M.Sc. degree in electrical engineering. During the course of his education, he has authored and coauthored four scientific papers and has conducted tapeouts of ten custom test-chips. His research interests include digital and analog circuit design for low-power applications, radio-frequency identification (RFID) devices, countermeasures against side-channel attacks for secured cryptographic systems, and signal processing. Mr. Dagan was the recipient of an Award of Merit for outstanding projects in the BGU Department of Electrical Engineering for the 2009–10 academic year for his senior project.

Adam Teman (S'10) received the B.Sc. degree in electrical engineering and M.Sc. degree from Ben-Gurion University, Be'er Sheva, Israel, in 2006 and 2011, respectively, where he is currently working toward the Ph.D. degree. He was a Design Engineer with Marvell Semiconductors from 2006 to 2007, with an emphasis on physical implementation. His research interests include low-voltage digital design, energy-efficient SRAM, NVM, and eDRAM memory arrays, low-power CMOS image sensors, and low-power design techniques for digital and analog VLSI chips. He has authored 23 scientific papers and two patent applications, and has presented excerpts from his research at a number of international conferences. Mr. Teman was the recipient of the Electrical Engineering Department's Teaching Excellence recognition at Ben-Gurion University in 2010–2012, and in 2011 the BGU's Outstanding Project award. He received the Yizhak Ben-Ya'akov HaCohen Prize in 2010, the BGU Rector's Prize for Outstanding Academic Achievement in 2012, and the Wolf Foundation Scholarship for excellence of 2012. He is conducting his doctoral studies under a Kreitman Foundation Fellowship.

Evgeny Pikhay received the B.Sc. degree from Technion, Israel Institute of Technology, Haifa, Israel, in 2005, and the M.Sc. degree from Tel Aviv University, Tel Aviv, Israel, in 2010, both in electrical engineering. He is currently working toward the Ph.D. degree in electrical engineering at Technion, Israel Institute of Technology. He is currently the Staff Device Engineer in the R&D Department of TowerJazz, Migdal HaEmek, Israel. His research interests include nonvolatile memories and sensors embedded in CMOS process flows. He has authored and coauthored more than 30 papers and patents.

Vladislav Dayan received the B.Sc. degree in physics and M.Sc. degree in theoretical physics from the National University of Uzbekistan, Tashkent, Uzbekistan, in 2003 and 2005, respectively. He is currently the Expert Device Engineer in the R&D Department, TowerJazz, Migdal HaEmek, Israel. His research interests include the full flow of nonvolatile memory design, from the development of single memory devices to end-user IP blocks. He has authored and coauthored several patents and papers in the NVM design field.

Anatoli Mordakhay received the B.Sc. degree in electrical engineering from Ben-Gurion University, Be'er Sheva, Israel, in 2012. He is currently working toward the M.Sc. degree at Bar-Ilan University, Ramat Gan, Israel. His research interests include analog circuit design for low-power applications, image sensors, and memory arrays, as well as low-power digital circuit, SRAM, and eDRAM design. He has authored and coauthored three scientific papers and participated in the tapeout of several integrated circuit test chips.

Yakov Roizin received the Ph.D. degree from the Institute of Semiconductor Physics, USSR Academy of Sciences, Novosibirsk, Russia, and the D.Sc. Habilitation degree from the Moscow Institute of Electronic Technology, Moscow, Russia. He has more than 30 years of semiconductor device and technology development experience. For the last 16 years, he has been with TowerJazz (formerly Tower Semiconductor), Migdal HaEmek, Israel, developing specialty CMOS technologies. He currently holds the position of TowerJazz Fellow and Director of Emerging Technologies. He has authored and coauthored more than 200 papers and 30 patents in the field of semiconductor devices and materials.

Alexander Fish (M'06) received the B.Sc. degree in electrical engineering from the Technion, Israel Institute of Technology, Haifa, Israel, in 1999, and the M.Sc. and Ph.D. (summa cum laude) degrees from Ben-Gurion University, Be'er Sheva, Israel, in 2002 and 2006, respectively. He was a Postdoctoral Fellow with the ATIPS Laboratory, University of Calgary, Calgary, AB, Canada, from 2006 to 2008. From 2008 to 2013, he headed the Low Power Circuits and Systems Lab (LPC&S), VLSI Systems Center, Department of Electrical and Computer Engineering, Ben-Gurion University, Be'er Sheva, Israel. In 2012, he joined the faculty of Bar-Ilan University, Ramat Gan, Israel, as an Associate Professor and the head of the Faculty of Engineering's Nano-Electronics Track and the Energy Efficient Electronics and Applications Labs. His research interests include low-voltage digital design, energy-efficient SRAM and NVM memory arrays, low-power CMOS image sensors, and low-power design techniques for digital and analog VLSI chips. He has authored and coauthored over 90 scientific papers and patent applications as well as two book chapters. He serves as an Editor-in-Chief for the Journal of Low Power Electronics and Applications. Dr. Fish was a coauthor of two papers that won the Best Paper Finalist awards at ICECS'04 and ISCAS'05 conferences. He was also awarded the Young Innovator Award for Outstanding Achievements in the field of Information Theories and Applications by ITHEA in 2005. In 2006 and 2012, he was the recipient of the Engineering Faculty Dean Teaching Excellence recognition at Ben-Gurion University. He served as an associate editor for the IEEE SENSORS JOURNAL and the IEEE ACCESS JOURNAL. He has also been a co-organizer of many special issues and sessions for IEEE journals and conferences.

Chapter 6 Low Power Techniques for Image Sensors

6.1 Leakage Reduction in Advanced Image Sensors Using an Improved AB2C Scheme

The following paper, published as part of the special issue on Low Power Arrays in the prestigious IEEE Sensors Journal, describes a novel circuit technique for leakage reduction in "smart" image sensors. This design included both an improvement of the Adaptive Bulk Biasing Control (AB2C) scheme, which was developed in the VLSI Systems Center at BGU, and the implementation of the scheme on an embedded SRAM that operates in sync with a CMOS imager, as part of a smart image sensor system. The project was originally presented as part of the special session on Low Power Arrays at the IEEE Sensors Conference in Christchurch, New Zealand, in 2009 [52].

IEEE SENSORS JOURNAL, VOL. 12, NO. 4, APRIL 2012

Leakage Reduction in Advanced Image Sensors Using an Improved AB2C Scheme

Adam Teman, Student Member, IEEE, Orly Yadid-Pecht, Fellow, IEEE, and Alexander Fish, Member, IEEE

Abstract—Static leakage currents in advanced CMOS processes have become the main source of power consumption in many of today's systems. This is especially true for systems with a large number of devices that are in a stable state for most of their operation time, such as image sensors and memory arrays. This paper introduces an improved adaptive bulk biasing control (AB2C) scheme for reduction of leakage currents during these "standby" periods in serially accessed arrays, while enabling device acceleration during active cycles. We provide a theoretical analysis of the AB2C operation, showing its advantages and limitations. The proposed scheme has been integrated with a test-case advanced wide dynamic range (WDR) image sensor with an on-chip memory. The scheme was applied to both the pixel and bitcell arrays, providing configurable leakage reduction and performance enhancement. An 80 nm test chip was fabricated with a 10 k pixel/bitcell test-case system and successfully tested, showing a 21% power reduction compared to a standard system and up to 44% compared to an accelerated system.

Index Terms—Adaptive bulk biasing control, advanced image sensors, forward body biasing, leakage reduction, low power image sensors, low power SRAM, reverse body biasing.

I. INTRODUCTION

The introduction of the imager based on the active pixel sensor (APS) paved the way for the development of advanced image sensors [1]. These systems-on-a-chip (SOC), fabricated in standard CMOS processes, integrate several components into the sensor pixel and periphery, enabling complex functionality, as opposed to traditional CCD sensor arrays, generally used for imaging alone. Advanced image sensors can perform various functions, such as image processing [2], target tracking [3], [4], and dynamic range expansion [5] with embedded devices and components, providing advantages in cost, speed, and power. One of the most attractive characteristics of the CMOS imager in general, and specifically the "smart" image sensor, is its power saving capability. Low power systems and components are becoming more attractive, as the market for portable, battery operated devices, many of which incorporate image sensors, continues to increase. Extensive research has been conducted in recent years to develop power reduction techniques and methodologies for advanced image sensors [6]–[9]. As with digital logic and other VLSI arrays, such as embedded memories, image sensors present more and more substantial leakage currents with technology scaling. While many leakage components can be identified in advanced sub-micron technologies, the subthreshold current is still the most significant leakage component.

A variety of techniques have been proposed for subthreshold leakage reduction in VLSI circuits. These include aggressive supply voltage reduction [10], adaptive voltage scaling [11], "stacking" cutoff transistors [12], utilization of high threshold voltage (HVT) devices, and application of negative voltages to the wells or substrate. The latter technique can be applied to dynamically raise the threshold voltage by applying a reverse body bias (RBB) or lower it by applying a forward body bias (FBB). Varying the threshold voltage can enable leakage reduction during idle periods on the one hand, and acceleration of the device speed during operational cycles on the other. It can also be used to control device parameters in the presence of process variations. This technique has been incorporated in many systems, applications, and studies, such as those described in [13]–[24].

The employment of body biasing for leakage reduction in image sensors is the basis for the adaptive bulk biasing control (AB2C) technique, originally proposed by Fish et al. [25]. According to this technique, an RBB is dynamically applied using a network of resistors to an entire row of pixels during integration periods, when the row is not accessed. Due to the large bulk capacitances, dynamic charging and discharging has several side effects, such as additional power consumption, bulk charging delay, and noise or interference. The AB2C scheme deals with these problems by applying the voltages gradually, as will be described in Section II.

The AB2C scheme was originally intended for integration with image sensors operating in the rolling shutter operation mode. In this paper, we show that it can also be applied to a serially accessed memory array, and as a test case, we integrated it into a wide dynamic range (WDR) advanced image sensor employing an on-chip SRAM. In addition, we propose a modified AB2C method with improved functionality, performance, robustness, and flexibility over the original. The system, including the improved AB2C circuit, was successfully fabricated in an 80 nm CMOS process. Measurement results from a test chip show that the proposed method provides a 21% reduction in static power dissipation at 1.1 V without any degradation in image quality or SRAM performance. Moreover, utilization of FBB for the active rows of the imager and SRAM arrays along with RBB for the non-accessed rows enables improvement of the performance of the discussed WDR imager.

Manuscript received February 19, 2011; accepted May 09, 2011. Date of publication May 19, 2011; date of current version February 08, 2012. The associate editor coordinating the review of this paper and approving it for publication was Prof. Kaushik Roy.
A. Teman and A. Fish are with the Low Power Circuits and Systems Lab, VLSI Systems Center, Ben-Gurion University, Be'er Sheva 84105, Israel.
O. Yadid-Pecht is with the Electrical Engineering Department, University of Calgary, Calgary, AB T2N1N4, Canada (e-mail: orly.yadid.pecht@ucalgary.ca).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSEN.2011.2157123

1530-437X/$26.00 © 2011 IEEE

The paper is organized as follows: Section II describes the concept of leakage reduction and performance enhancement through body biasing. The improved AB2C scheme and power analysis are presented in Section III. Section IV presents the system architecture of the test case WDR imager employing the improved AB2C scheme. Test chip measurements and system performance are shown in Section V. Section VI concludes the paper.

II. EFFECTS OF BODY BIASING
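As orientation for this section, the two relations it relies on (the body effect on the threshold voltage, and the exponential dependence of subthreshold leakage on that threshold) can be tried out numerically. The sketch below is illustrative only: it assumes the textbook body-effect expression and keeps just the exp(−V_T/(n·φt)) factor of the subthreshold current. The function names and all device values are hypothetical and are not taken from the paper.

```python
import math

def threshold_voltage(v_bs, vt0=0.45, gamma=0.4, phi_f=0.4):
    """Textbook body-effect relation for an nMOS device (assumed form).

    v_bs < 0 is a reverse body bias (RBB); v_bs > 0 is a forward body bias (FBB).
    Voltages are in volts and gamma is in V^0.5. Valid while v_bs < 2 * phi_f.
    """
    return vt0 + gamma * (math.sqrt(2 * phi_f - v_bs) - math.sqrt(2 * phi_f))

def leakage_ratio(v_bs, n=1.4, phi_t=0.0259, **device):
    """Subthreshold leakage at body bias v_bs relative to zero bias.

    Only the exp(-V_T / (n * phi_t)) dependence is modeled; DIBL and
    band-to-band tunneling are ignored.
    """
    delta_vt = threshold_voltage(v_bs, **device) - threshold_voltage(0.0, **device)
    return math.exp(-delta_vt / (n * phi_t))
```

Calling `leakage_ratio(-0.5)` and `leakage_ratio(0.3)` shows the two directions discussed in this section: a reverse bias raises the threshold and cuts leakage, while a forward bias lowers the threshold and raises both leakage and drive current.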

A. Power Consumption

The power consumption of a standard CMOS circuit is given by the well-known equation:

$$P_{total} = P_{dynamic} + P_{static} \qquad (1)$$

where $P_{dynamic}$ is the power consumed during transient switching activities and $P_{static}$ is the constant power consumed due to biasing and/or leakage currents. For a CMOS imager's pixel array, this can be expanded to (2) [6], whose terms are: the imager's frame rate; the numbers of rows and columns; the energy required for pixel reset; the energy dissipated due to signal readout during a single frame; the energy dissipated by in-pixel analog and/or digital processing during a single frame; the supply voltage; and the in-pixel leakage current. Similarly, the power dissipation of an SRAM array is given by (3) [26], in terms of the current driving an active bitcell (read or write current) and a bitcell's leakage current.

As technology scales and channel lengths decrease, the leakage currents in the above equations tend to increase, such that in modern processes the leakage power component is often as large as, or even larger than, the dynamic power component. This leakage consists of many components [27], but is largely dominated by the subthreshold current of cutoff transistors, given by (4) [28], whose parameters are the transistor's transconductance coefficient, gate-to-source voltage, drain-to-source voltage, and threshold voltage; the thermal voltage; the DIBL coefficient; and the subthreshold slope coefficient of the transistor.

Equation (4) shows that the subthreshold current is exponentially dependent on the transistor's threshold voltage. Accordingly, raising the threshold voltage is an efficient way of reducing the leakage power. This can be achieved through doping during fabrication, but many applications cannot tolerate the loss of performance that accompanies it. An alternative way is to dynamically change the threshold voltage through body biasing.

Fig. 1. Body biasing impact on threshold voltage at various process nodes. The figure is plotted for minimum sized nMOS devices and shows the percentage of change in threshold voltage for commercially available 0.18 μm, 80 nm, and 40 nm CMOS processes.

B. Body Biasing

The threshold voltage of a transistor can be calculated according to the well-known equation (5) [29], in terms of the zero-bias threshold voltage, set during fabrication; the body-to-source voltage; the body-effect coefficient; and the Fermi potential. Accordingly, the threshold can be changed by creating a potential between the source and body terminals of the transistor. This happens inherently in many circuit topologies, and is commonly known as the body effect; however, it is generally a parasitic phenomenon. Using a standard triple-well fabrication process, the body-to-source voltage can be controlled dynamically, enabling leakage reduction through reverse body biasing (RBB) or performance enhancement through forward body biasing (FBB).

The body biasing technique was a popular method for leakage reduction in digital circuits for several process nodes up until 65 nm. As technology scales, the effectiveness of the technique degrades, due to several factors, such as the higher influence of drain induced barrier lowering (DIBL) and other parameters on subthreshold leakage [23] and the increase of band-to-band tunneling current (BTBT) as a result of applying RBB [30]. However, this technique is still quite attractive both for reducing variation sensitivity and for performance enhancement in advanced processes [21], [24]. In addition, image sensors are sensitive to technology scaling and so are frequently fabricated in larger node technologies [31], [32]. Therefore, utilization of body biasing for leakage reduction is still an attractive technique for advanced image sensors. Fig. 1 shows the effect of body biasing on the threshold voltage of a minimum sized transistor at various commercial process nodes. It is clear from this figure that the impact of RBB on threshold voltage degrades as technology advances.

As mentioned above, body biasing can be used for performance enhancement by lowering the threshold voltage to increase a device's ON current. This is done through application of a forward body bias, and has the positive side-effect of reducing the standard deviation of the threshold voltage distribution [22], [23]. This has led to increased popularity of FBB application in modern processes, especially for high performance circuits with sensitivity to process variations. Fig. 2 shows the effect of forward body biasing on the saturation current (with $V_{GS} = V_{DS} = V_{DD}$) of a minimum sized nMOS transistor at various (low power) process nodes. The figure shows the drain current increase, as compared to a zero-biased equivalent transistor. For the 0.18 μm and 80 nm processes, an increase of approximately 10% can be achieved, whereas at the 40 nm node, this figure is closer to 30%. It should be noted that the forward biasing voltage is limited by the source-to-body diode of the transistor, which turns on and draws a body-to-source current as the FBB voltage approaches 700 mV. The plotted current is the transistor's drain current, excluding this parasitic body current from the calculation; however, it should be added for power considerations.

TEMAN et al.: LEAKAGE REDUCTION IN ADVANCED IMAGE SENSORS USING AN IMPROVED AB2C SCHEME 775

Fig. 2. Body biasing effectiveness for performance enhancement at different process nodes. The figure is plotted for minimum sized devices with $V_{GS} = V_{DS} = V_{DD}$.

Fig. 3. Step charging the bulk of a pixel/bitcell row. (a) RC modeling of a step charge operation. (b) Possible disturbances that can be caused to a pixel or bitcell during readout.

C. Application of Body Biasing

In the previous subsection, it was shown that dynamic control of body biasing can be used to reduce leakage on the one hand (RBB), and to enhance performance and reduce sensitivity to process variations on the other (FBB). In addition, this technique has been used for other purposes, such as adapting circuits for performance under process variations [33]. However, in the previous discussion, application of such a bias was considered without regard to trade-offs, such as power requirements, delay, coupling noise, and area.

From an area perspective, body biasing comes at a high price; a separate well is required for each transistor or island with a different body potential. This generally makes the implementation of interesting techniques, such as DTCMOS [34], impractical in most cases, as the size requirements for each transistor are very large. Consequently, the application of a body bias can usually only be considered when it is applied to a specific circuit with a limited number of devices, or to an entire block of devices with a common well potential. In the case of a pixel or bitcell array with thousands or millions of devices, only the latter can be considered, suggesting application of a common bias to an entire row, column, or complete array. In fact, the area penalty for biasing an entire row is very small and only requires the propagation of the bias signal and some additional well contacts.

For arrays, such as those found in image sensors or random access memories, row-wise addressing is commonly used for cell access. Accordingly, the application of an RBB on inactive rows would seem an attractive solution for leakage reduction, and an FBB on the active row could be utilized for performance enhancement. The pixel/bitcell rows would be laid out with well sharing between rows, such that the n-wells and p-wells of all cells in a certain row would be common. These wells could be modeled as a lumped RC network with a large capacitance. Charging the active row's capacitance to FBB (from its previous RBB potential) could be modeled as a simple step charging procedure, as shown in Fig. 3(a).

Given the total capacitance of the bulk of a row and the potential difference between the forward and reverse body biases [i.e., the voltage step shown in Fig. 3(a)], the energy consumed during each operation (e.g., forward biasing the active row and reverse biasing the previous active row) is simply given by (6). In a serially accessed array, where a different row is asserted every operation, the energy described by (6) would be consumed every clock cycle. In addition to this, large drivers would be required to apply the step charge, trading off additional power against the delay of the assertion.

As mentioned above, the well of an array's row is a large capacitance and, due to its proximity to other conductive surfaces, substantial coupling capacitances can result. This is especially relevant when signals are routed horizontally over the array, such as the word lines of a memory or the reset and select lines of an imager. Therefore, the assertion of an abrupt transition to the well bias can result in coupling noise or interference that could result in many unexpected phenomena, such as those illustrated in Fig. 3(b). The coupling to a select line could cause an overshoot, resulting in an unintentional reset. The coupling to the internal nodes of a memory cell could cause a temporary degradation of noise margins, especially during sensitive operations such as bitcell readout. The capacitive coupling between wells and to local supply rails could cause unintended biasing fluctuations and even result in latch-up. Finally, the spectral distribution of a step charge could cause unwanted noise in analog signals.

776 IEEE SENSORS JOURNAL, VOL. 12, NO. 4, APRIL 2012

Because these phenomena are hard to fully predict and prevent, dynamic step charging of large wells is rare and is generally avoided. A slow transition could be applied to the wells; however, this would result in a generally unacceptable delay penalty. Therefore, alternative methods for gradually charging the wells need to be considered, such as the AB2C scheme, described hereafter.
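The dependence of leakage on body bias discussed in this section can be illustrated numerically. The following is a minimal sketch that assumes common textbook forms of the body-effect and subthreshold relations, which are not necessarily the exact forms of (4), (5), and (6) in this paper. All parameter values (`vt0`, `gamma`, `phi_f`, `n`, `v_thermal`) are hypothetical and are not taken from any of the processes measured here.

```python
import math


def threshold_voltage(v_sb, vt0=0.45, gamma=0.3, phi_f=0.4):
    """Textbook body-effect model (assumed form, illustrative parameters).

    VT = VT0 + gamma * (sqrt(2*phi_f + V_SB) - sqrt(2*phi_f))
    v_sb > 0 models a reverse body bias (RBB) on an nMOS device,
    v_sb < 0 models a forward body bias (FBB).
    """
    return vt0 + gamma * (math.sqrt(2 * phi_f + v_sb) - math.sqrt(2 * phi_f))


def leakage_ratio(v_sb, n=1.5, v_thermal=0.0259):
    """Subthreshold leakage relative to the zero-bias case.

    Assumes I_sub is proportional to exp(-VT / (n * v_thermal)), so only
    the threshold shift caused by the body bias matters for the ratio.
    """
    delta_vt = threshold_voltage(v_sb) - threshold_voltage(0.0)
    return math.exp(-delta_vt / (n * v_thermal))


def step_charge_dissipation(c_bulk, delta_v):
    """Energy dissipated when one bulk capacitance is step charged by delta_v
    and another is step discharged by the same amount (2 * 0.5 * C * dV^2)."""
    return c_bulk * delta_v ** 2


if __name__ == "__main__":
    for bias in (-0.1, 0.0, 0.35):
        print(f"V_SB = {bias:+.2f} V -> leakage ratio {leakage_ratio(bias):.2f}")
```

With these assumed parameters, a reverse bias lowers the leakage ratio below 1 and a forward bias raises it above 1, which is the qualitative trade-off the section describes; the actual percentages depend on the process and must come from simulation or measurement.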

III. BODY BIASING WITH AB2C SCHEME

A. Gradual Bulk Biasing

When taking into consideration a large number of equally sized capacitors that require charging, an alternative approach to the traditional step charging discussed above can be used. Consider the circuit shown in Fig. 4(a), with a number of capacitors, N (each representing the bulk capacitance of a single row), connected in parallel with approximately equal resistances between them. By applying a potential to one of the middle capacitors, a constant voltage drop will be symmetrically applied to the remaining capacitors, as they progress towards the opposite voltage supply. This setup can be further extended by connecting the two end nodes and grounding them, as shown in Fig. 4(b). Doing this creates a ring of RC circuits symmetrically biased around the virtual line connecting the biased node with the grounded one. Now, the biasing point can easily be moved to a different bulk without upsetting the symmetry, by changing the connection node of both the bias voltage and the ground supply. This is analogous to changing the bias point of the active row, as will be shown below.

Fig. 4. RC modeling of gradual step charging setup. (a) Application of bias voltage to the middle row of resistive connected parallel bulk capacitances. (b) Extension of the bulk biasing setup to a symmetric ring.

Application of such a scheme results in a gradual voltage drop between the two biasing points. A transition to an adjacent row causes only a slight voltage change at each node. This small change has a very small disturbance effect as compared to the full step charge, especially since the change is gradual at the non-directly biased nodes. Therefore, a forward biased node can be set at one end of the circuit and a reverse bias can be set at the opposite end, without the need for large drivers or the danger of applying a large abrupt voltage step.

With the setup shown in Fig. 4(b), the power consumed during a bias point transition can be easily calculated. Choosing two nodes on opposite sides of the symmetry line, with a voltage drop between them (node 1 at the higher potential and node 2 at the lower), the energy accumulated on the two capacitances is given by (7). Now the bias point is moved towards the lower voltage (node 2). The final voltages of both nodes (after the transition) result in the energy figure of (8). Therefore, the energy dissipated by such a transition is given by (9).

An example transition is illustrated in Fig. 4(b), by moving the two bias points clockwise according to the dotted "transition" arrows. Two arbitrary nodes with a voltage drop between them were chosen and marked as node 1 and node 2 according to the definitions above, and the voltage at each of the two nodes changes by one voltage step. A closer look at the entire ring circuit reveals that the capacitors form pairs that make such a transition (neglecting the two end capacitors). Substituting the calculated value of the voltage step, we find that the power dissipated by the entire network during a single transition is given by (10).

Considering a large N (large number of rows), the energy consumed during such an operation is much lower than that of a step charge, as calculated in (6). This circuit can also be realized in a relatively low cost and simple way, as described later in this section. The result is a gradual voltage drop on the bulk capacitances between the two bias points, providing FBB and/or RBB on the active/inactive rows, as required for circuit operation.

The circuit described above comes with a small static power penalty, due to the static current running between the two supplies (the bias voltage and ground), given by (11) in terms of the resistance of the resistors that separate the bulk capacitances. Assuming a large N, and keeping this resistance large, the resulting power can be small compared to the power saved by the application of RBB.

B. Improved AB2C Scheme

The theoretical basis presented in the previous sections shows that by integration of an AB2C scheme with a given circuit, a substantial reduction in leakage and/or a performance enhancement can be achieved. Equations (10) and (11) show that as long as the number of rows (capacitors) and the resistance between the capacitors remain large, the dynamic and static power consumed by the AB2C circuitry is quite low. Accordingly, it can be concluded that the AB2C concept can be efficiently integrated with large components comprising many leaking devices, such that the leakage reduction will be higher than the power required to operate the additional circuitry. In addition, this concept can only be applied to a component that operates in a serial access scheme, such that the order of access is preset.

As shown by Fish et al. [25], a CMOS image sensor is a perfect candidate for integration of an AB2C circuit. The improved AB2C circuit, presented hereafter, provides three major advantages over the original circuit. First, the resistor network is now connected in a ring structure, providing improved performance, especially at the ends of the arrays, where the original circuit lost its correct biasing structure. Second, the resistors are replaced with constant biased nMOS transistors. This creates adaptable, high resistance, small area resistors, and enables further flexibility, as will be shown herein. Third, the improved AB2C concept is suitable for utilization in SRAM arrays. In this paper we show an example of its application to a serially accessed SRAM array that is operated in conjunction with a pixel array, as part of a wide dynamic range advanced image sensor.

The schematic for the improved AB2C circuit is shown in Fig. 5. The circuit is made up of sub-circuits comprising three nMOS transistors each, as follows. During operation, only one transistor is connected as a pass gate to the forward biasing voltage and another is connected to the reverse biasing voltage. The gates of these transistors are controlled by digital outputs from an adjacent shift register. The third device is the resistive device, connected between the bulk voltages of two adjacent sub-circuits and biased by a gate voltage. Each sub-circuit is implemented in a separate p-well that is shared with the adjacent, pitch-fitted row, such that all transistors are body biased along with the array's rows.

Fig. 5. Schematic of improved AB2C circuit.

The circuit of Fig. 5 is controlled by a pair of shift registers, connected to the forward-bias and reverse-bias select signals, used to set the active row and the opposite row. The standard setting would be to load each of these registers with a single "1", with the two "1"s separated so that they select rows on opposite sides of the array. This applies the forward bias potential at the active row (where the FBB shift register's "1" is set) and the reverse bias potential at the opposite side of the array. The pair of shift registers can be replaced by a single one, with each output hard-wired to the forward-bias select of one row and the reverse-bias select of the row on the opposite side of the array. This would save the area and power required by the additional shift register; however, the pair of shift registers provides additional flexibility for "window definition", as will be explained later.

The gate voltage of the resistive devices provides an additional level of flexibility for circuit operation. Changing the gate voltage changes the resistance of the devices. This enables setting a post-production trade-off between power consumption and performance, as a higher resistance (lower biasing voltage) results in a lower static current, at the cost of a longer transient. These resistors are nonlinear, such that the biasing voltage changes the shape of the bulk potential map, as shown in Section IV. This additional level of flexibility provides an efficient mechanism for dealing with process variations, as the biasing voltage can be tuned according to performance and power consumption.

A simulated bulk potential map of a 600 row AB2C circuit is shown in Fig. 6. In this example, one row is the active row, with fixed forward and reverse bias voltages. The result is a gradual potential drop around the active row, with a "window" of reverse biased rows around the opposite row. Modifying the gate voltage of the resistive devices changes the shape of this map, creating a more gradual or a more abrupt voltage drop, achieving a higher leakage reduction or a lower transient noise figure. As mentioned above, this voltage map can be further modified if two shift registers are used to control the biasing scheme. In this case, a limited active window can be produced by applying two RBB signals, symmetrically placed around the FBB signal. This causes the voltage drop to occur around a smaller number of rows, with all the additional rows (outside the window) biased at the maximum RBB, providing better leakage reduction.

Fig. 6. Simulated bulk potential map for a 600 row AB2C circuit.
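The gradual biasing idea of this section can be explored with a small numerical model. The sketch below is an idealization and not the derivation behind (7) to (11): it assumes a ring of equal, linear resistors (the fabricated circuit uses nonlinear nMOS devices), an even number of rows, and ideal bias sources. The function names, the sum-of-squares energy proxy, and the example voltages are all hypothetical choices made for illustration.

```python
def ring_potentials(n_rows, active_row, v_fbb, v_rbb):
    """Steady-state bulk voltage of each row in an ideal resistor ring.

    The active row is held at v_fbb and the row half a ring away at v_rbb;
    with equal linear resistors the voltage falls linearly with ring distance.
    Assumes n_rows is even.
    """
    half = n_rows // 2
    volts = []
    for row in range(n_rows):
        d = abs(row - active_row)
        dist = min(d, n_rows - d)  # shortest distance around the ring
        volts.append(v_fbb + (v_rbb - v_fbb) * dist / half)
    return volts


def step_potentials(n_rows, active_row, v_fbb, v_rbb):
    """Bulk voltages for plain step charging: one row at FBB, all others at RBB."""
    return [v_fbb if row == active_row else v_rbb for row in range(n_rows)]


def switching_energy_proxy(v_before, v_after, c_bulk):
    """Crude proxy for transition energy: sum of C * (voltage change)^2 per row."""
    return sum(c_bulk * (b - a) ** 2 for a, b in zip(v_before, v_after))


def static_power(n_rows, r_link, v_fbb, v_rbb):
    """Static power of the ideal ring: two parallel branches of n_rows/2 resistors."""
    r_eq = (n_rows // 2) * r_link / 2
    return (v_fbb - v_rbb) ** 2 / r_eq


if __name__ == "__main__":
    n, c = 600, 1.0  # 600 rows as in Fig. 6; unit capacitance for a ratio only
    ring = switching_energy_proxy(ring_potentials(n, 0, 0.2, -0.28),
                                  ring_potentials(n, 1, 0.2, -0.28), c)
    step = switching_energy_proxy(step_potentials(n, 0, 0.2, -0.28),
                                  step_potentials(n, 1, 0.2, -0.28), c)
    print(f"ring / step energy proxy: {ring / step:.5f}")
```

Under these assumptions every row moves by the same small voltage step when the bias point advances by one row, so the proxy scales inversely with the number of rows, in line with the qualitative conclusion drawn from (10). The static term likewise shrinks as the number of rows or the link resistance grows, matching the discussion of (11).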

IV. APPLICATION EXAMPLE

The AB2C circuitry shown in the previous section is a general topology that can be integrated with various components for power reduction and/or performance enhancement. In the scope of this work, we integrated the AB2C circuit with a wide dynamic range advanced image sensor, proposed by Belenky et al. [35]. This image sensor uses both an in-pixel memory bit and an on-chip SRAM array to store the number of resets applied to each pixel during a single frame. To achieve this, the sensor requires associative memory bits for each pixel. Coupled with the serial access scheme of the image sensor, when operated in a rolling shutter mode, this system is a perfect candidate for leakage reduction by employment of the improved AB2C circuit, on both the image sensor and memory array.

Fig. 7. (a) Schematic of WDR pixel. (b) Schematic of on-chip SRAM bitcell.

A. Pixel and SRAM Bitcell

The schematics of the WDR pixel and SRAM bitcell are given in Fig. 7(a) and (b), respectively. The memory cell is a standard 6T SRAM bitcell with minimum sized pMOS and access transistors (M1, M3, M5, and M6) and slightly larger pull-down nMOS (M2 and M4). The same sizes are used for the 1-bit in-pixel memory implemented in the WDR pixel. The WDR pixel consists of a photodiode connected to a memory driven reset scheme. The photodiode voltage is reset only when the in-pixel memory has been set to "1", according to the dynamic range extension scheme. For additional information about the operating principle of the WDR image sensor and its components, see [35].

In both figures, the bulk terminal connects to the AB2C circuit at the corresponding bulk node, creating a body bias for all relevant transistors. During integration periods, transistors M2, M4, M5, N1, N2, N3, and N4 of the pixel [Fig. 7(a)] all receive the RBB voltage, cutting off leakage paths. When a certain pixel row is selected, the bulk is already forward biased. Forward biasing of the selected row reduces the threshold voltages of transistors N1-N4 [see (5)], improving the shutter efficiency and increasing the swing of the signal. For the SRAM bitcell in Fig. 7(b), the leakage paths to ground through M2 and M4 are reduced during hold cycles, as well as bitline leakage through M5 and M6, providing further static power reduction and improving noise margins. A comparison of the various techniques discussed is given in Table I. The table shows the change in percentage of several metrics when each technique is applied as compared to a zero-biased system.

TABLE I COMPARISON OF VARIOUS METHODS TO A NON-BIASED BULK

An interesting discussion can also be introduced regarding the stability of the SRAM bitcell under body biasing. On the one hand, application of a reverse body bias increases the variability of the bitcell parameters [30], but it has been shown that this reduces read and hold failures due to the rise in threshold voltage [19], [20]. In contrast, FBB decreases the variability of the threshold voltage [22], reducing write and access failures. Therefore, application of FBB during cell access and RBB during hold cycles provides a good overall tradeoff for stability. Detailed discussions of SRAM stability under body biasing are given by many researchers, such as [19], [20], and [30].

The layout of the arrays requires implementation of nMOS devices inside isolated P-wells. This generally comes at the expense of a substantial increase in area due to the design rule spacing constraints of wells at different potentials. However, by implementing interlaced horizontal N-wells and P-wells, this overhead can be minimized or even negated. Depending on the pitch of each well, additional gaps may be required to adhere to the spacing constraints. These gaps can be filled with body contacts for shielding; however, the more efficient solution is to flip adjacent rows and share a single bulk potential between them. In addition, use of an n-type-to-p-well diode should be considered to minimize the need for separate deep n-wells that are characterized by a large spacing requirement. In our implementation, a single deep n-well was implanted beneath the entire array, and row biases from the AB2C circuitry were shared by flipping adjacent rows. Considering that the WDR pixel requires fewer pMOS than nMOS devices, the pixels were fit together as shown in Fig. 8 to minimize wasted space. The final pixel layout provided a 54% fill factor on a pixel sized 4.375 μm × 3.85 μm.

Fig. 8. Layout methodology of pixel array. A deep N-well was implanted underneath the entire array, and adjacent rows were flipped to share N-wells and P-wells, sharing a mutual body bias as well.

It should be noted that the analysis given in Section III assumed that the wells were disconnected capacitors, whereas in fact the entire array could be better modeled as a distributed network of RC elements. However, there is a reverse-biased diode between every N-well and P-well, as shown in Fig. 8. This results in a very high resistance between the wells, essentially disconnecting them, such that the original assumption was appropriate. The capacitive coupling between adjacent P-wells is shielded by the constantly biased N-well in between them; however, this coupling capacitance should be taken into consideration as well. In the case of the gradual biasing of the AB2C scheme, this coupling has hardly any impact on the adjacent bulk, as the voltage drop between them is small and the transitions are gradual.

B. Access Scheme

As mentioned previously, each pixel in the array is designated a specific set of memory bits for storage of the number of resets performed during the current cycle. Distribution of the bitcells according to the physical location of their relevant pixels ensures a serial row-by-row access scheme for the memory array, rather than a general random access scheme. Accordingly, the SRAM array's row decoder can be replaced with a shift register, and an AB2C circuit can be integrated into the memory. In addition to the benefits of the AB2C scheme, discussed in Section III, the shift register provides further benefits over the standard row decoder, such as reduced area, power consumption, and access time. Furthermore, the shift register itself should be biased by the AB2C circuit, reducing its static power consumption at virtually no additional cost.

Realizing that the access schemes of the image sensor and the memory array are now identical and synchronized, an attractive option would be to line up the two arrays in a horizontal configuration using a single row select shift register, providing additional savings in area and power. However, in this particular case, we chose to use a vertical peripheral sharing scheme that enables parallel readout of an entire row into or from its matching row in the other array. In this way, during integration, the intermediate pixel level is read out to column-wise circuitry and converted into a write command for the selected set of bitcells, and a complete row of bitcells is written at once. During readout, the pixel value is read out and, subsequently, the associated bitcells are read out to create the final readout value; again, for the entire row in parallel. Two shift registers and AB2C circuits are required for this configuration, as shown in Fig. 9, but this is highly advantageous over the row-wise setup with independent pixel/bitcell access. It should be noted that, depending on the configuration of the WDR image sensor, more than one bitcell may be required for reset storage. This should not affect the column pitch fitting (or row pitch fitting if a horizontal configuration was chosen), as a bitcell is much smaller than a pixel, allowing a number of bits to be laid out beneath a single column. If this is still insufficient, multiple memory rows can be activated in sync with a specific pixel row.

Fig. 9. Architecture of the column-wise WDR image sensor with AB2C circuits.
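The serial, row-by-row access described in this subsection can be sketched as a one-hot shift register that replaces the row decoder, paired with a second one-hot register that marks the reverse-bias point of the AB2C ring. This is a behavioral illustration only. The function names are hypothetical, and placing the reverse-bias pointer exactly half an array away from the active row is an assumption based on the "opposite side of the array" description in Section III.

```python
def make_registers(n_rows, active_row=0):
    """Build the two one-hot registers: FBB select and RBB select.

    Assumption: the RBB pointer sits n_rows // 2 rows away from the active row.
    """
    opposite = (active_row + n_rows // 2) % n_rows
    fbb = [1 if row == active_row else 0 for row in range(n_rows)]
    rbb = [1 if row == opposite else 0 for row in range(n_rows)]
    return fbb, rbb


def advance(one_hot):
    """Shift a one-hot register by one row, wrapping the last bit to the front."""
    return one_hot[-1:] + one_hot[:-1]


if __name__ == "__main__":
    fbb, rbb = make_registers(8)
    for cycle in range(3):
        print(f"cycle {cycle}: active row {fbb.index(1)}, reverse-biased row {rbb.index(1)}")
        fbb, rbb = advance(fbb), advance(rbb)
```

Because both registers advance together, the two bias points keep a fixed separation as the active row moves, which is the condition under which the gradual potential map of Section III simply rotates around the ring from one access to the next.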

V. IMPLEMENTATION, PERFORMANCE, AND MEASUREMENTS

A. Test Chip and Measurement Setup The proposed WDR advanced image sensor employing an improved scheme was implemented and fabricated in an 80 nm low-power twin-well CMOS process. This test system included a 100 100 (10 k) pixel array with a single SRAM bitcell for each pixel. The arrays were laid out in a vertical con- figuration with column-wise peripheral sharing, as described in Section IV. Fig. 10. Post silicon testing setup, using a custom designed board and mea- Post silicon testing was performed with a 1.1 V supply at suring power consumption with the Agilent B1500a semiconductor device ana- lyzer. (a) Block diagram of test setup. (b) Photograph of test chip and evaluation 27 , employing a custom designed board (see Fig. 10) and board (camera removed). assuming sensor operation at 30 FPS. All control signals and biases were generated externally using a Pulse Instruments 4000 Series Test System. Static and dynamic power consump- Section II to be unacceptable. Integration of the scheme tion were measured using the Agilent B1500a Semiconductor does not achieve the full potential reduction, as the non-active Device Analyzer and compared with similar circuits without rows are not fully reverse biased, but the gradual biasing makes body biasing. Various biasing voltages were tested to find the it an applicable alternative. Simulation results of the integra- optimal setup for minimum power dissipation of the complete tion of the proposed scheme in an 800 600 pixel sensor system. These included reverse bulk biasing up to , for- showed a 26% static power reduction, as we presented in [36]. ward bulk biasing up to 700 mV, and resistor biasing (biasing For a smaller sensor, the power of the circuit is more sig- voltages of bulk separation transistors) of 0 V to 1.1 V. Similar nificant, resulting in a 21% static power reduction for the fabri- tests were done at simulation level using Cadence Spectre for cated 100 100 pixel sensor, as shown in Fig. 12. 
In this figure, an 800 600 (480 k) pixel sensor. the increase/reduction in static power is shown as measured at various values of (as defined in Fig. 5 with )as B. Static Performance compared with not applying any bulk bias (i.e., zero-bias state). Fig. 11 shows the simulation results of the independent effect For this setup, we would choose 280 mV as the optimal RBB of body biasing on the proposed pixel and bitcell, i.e., without level, as it provides the minimal power consumption in this con- considering the power consumption of the peripheral circuitry figuration. A larger reverse bias results in higher static current in needed for implementation. Applying a 350 mV RBB achieves the circuit than the power saved by the additional leakage a 25% leakage reduction for the bitcell and a 28% reduction reduction. for the pixel. Alternatively, applying a 100 mV FBB for per- To accelerate the performance of the system, a forward bias formance enhancement comes at the expense of a 20% increase could be applied; for example, the image sensor’s swing can be in leakage power for the pixel and a 17% increase for the bit- increased by as much as 14%. However, forward bulk biasing cell. Dynamic application of these biases can only be achieved significantly increases the leakage currents of the off transistors. by introducing a step charging scheme, which was shown in In this case, the application of the scheme is much more TEMAN et al.: LEAKAGE REDUCTION IN ADVANCED IMAGE SENSORS USING AN IMPROVED SCHEME 781

Fig. 11. Effect of body biasing on leakage power of bitcell and pixel used in WDR image sensor. Figure shows the simulated ratio of leakage power at var- Fig. 13. Measured static power consumption with application of forward bulk ef g ious biasing voltages as compared to zero-biasing of the same circuit. biasing as compared to the original system without the integration of an scheme.

Fig. 12. Effect of body biasing on static power of column-wise WDR image ef g sensor with on-chip memory employing a pair of circuits. Graph shows Fig. 14. Effectiveness of ef g integration at various temperatures. The graph the measured percentage of increase or decrease in leakage currents of the shows the percentage of static power reduction with the integration of an ef g system under various biases as compared to application of a forward bias (FBB) scheme (with ‚ff a PVH m†, pff a H) as compared to the original ef g without employing the scheme. The minimum point of each graph system (without ef g integration) according to simulation results. shows the tradeoff between leakage power reduction and ef g circuitry static power consumption as a function of biasing voltage. substantial performance improvement while keeping specified effective, showing a 44% static power reduction for a 200 mV power limitations. FBB. In this case, as shown in Fig. 12, the optimal would be 240 mV. It should be noted that these optimal bias levels C. Process Variations are calculated for the chosen configuration and array size and Leakage currents are highly affected by temperature, as should be measured for any specific system. For example, for the shown by (4). As such, integration of an scheme be- simulated 800 600 pixel sensor, the optimal bias was 350 mV. comes more effective as the temperature rises. This is clearly Fig. 13 shows the efficiency of integrating an scheme shown in Fig. 14. The figure shows the simulated leakage when applying a forward bulk bias. The figure compares the reduction of the WDR imager with integration of the power consumption of the test case application without the as compared to the original imager at the same temperature. scheme as compared to application of a forward bulk The leakage reduction reaches as high as 42% at the maximum bias when the scheme is integrated. A forward bias of simulated temperature (125 ). 
600 mV can be applied through the scheme without The same conclusions can be reached when testing ef- resulting in any static power increase. This can lead to a ficiency under process corners, as shown in Fig. 15. The scheme 782 IEEE SENSORS JOURNAL, VOL. 12, NO. 4, APRIL 2012

Fig. 15. Static power consumption under process variations. The Fast (FF, RH g), Slow (SS, 125 g), and Nominal (TT, 27 g) corners are shown under various RBB voltages of the ef g without a forward bias. Results are according to simulation.

Fig. 17. (a) Layout of the test case WDR image sensor with on-chip memory employing ef g circuits as part of an 80 nm test chip. (b) Example image from fabricated sensor.

Fig. 16. Total power consumption of the system under various reverse bulk biases (measured with [...]).

forward biasing has a strong effect at slow corners and less effect at fast corners. Integrating the AB2C with the advanced image sensor system enabled a static power reduction of 49% for the SS corner at 125 °C. Note that the FS and SF corners were not included, as the AB2C was implemented solely on nMOS transistors and so the results are similar to the FF and SS corners, respectively.

D. Dynamic Performance

One of the major advantages of the integration of the improved AB2C circuit into a given system is that it comes at a negligible dynamic performance/power penalty. As shown above, the performance of a system can be enhanced by the application of a forward bias, whereas the zero-bias state provides identical performance to the original system as long as the bulk of the active row has charged to the [...] voltage. Due to the gradual charging property, the voltage swing between cycles is very small, resulting in a minimal charge time and negligible performance loss. If a given system requires a faster charge time, this can be achieved by upsizing the switches of Fig. 5.

The dynamic power consumed by the bulk charge has been shown to be inversely proportional to the number of rows [(10)]. For the fabricated sensor, operating at 30 FPS, with a theoretical 8-bit WDR extension, the dynamic power of the AB2C scheme is a number of orders of magnitude lower than the static power.

Fig. 16 shows the measured power consumption of the fabricated sensor at different levels of [...] with [...]. As expected, the optimal bias voltage was found to be [...], resulting in a 21% power reduction.

The layout of the fabricated sensor, as part of an 80 nm test chip, is shown in Fig. 17(a), and an example image is given in Fig. 17(b). The figures of merit for the test chip are given in Table II. It should be noted that the fabricated test case featured a small (100 × 100) sensor with a 1-bit WDR extension. For a larger sensor with a larger bit extension, the power consumption and area of the AB2C circuitry become much less significant, resulting in improved efficiency.

TEMAN et al.: LEAKAGE REDUCTION IN ADVANCED IMAGE SENSORS USING AN IMPROVED AB2C SCHEME 783

TABLE II. TEST CHIP FIGURES OF MERIT

VI. CONCLUSIONS

An improved AB2C scheme was presented, offering adaptable leakage reduction and/or performance enhancement when integrated with existing serially accessed systems. A theoretical analysis of the dynamic and static power consumption of the AB2C circuitry was given. The improved AB2C circuit was integrated with an advanced WDR image sensor, by applying the scheme on both the pixel array and the on-chip SRAM. A 100 × 100 pixel imager was implemented and fabricated, and measurements were presented. A 21% leakage reduction was achieved for the fabricated sensor as compared to an equivalent non-biased sensor with a negligible increase in dynamic power or performance reduction. A 44% leakage reduction was achieved for a performance enhanced system employing forward body biasing.

REFERENCES

[1] O. Yadid-Pecht and R. Etienne-Cummings, CMOS Imagers: From Phototransduction to Image Processing. Norwell, MA: Springer, 2004.
[2] K. Ito, B. Tongprasit, and T. Shibata, “A computational digital pixel sensor featuring block-readout architecture for on-chip image processing,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, pp. 114–123, 2009.
[3] A. Teman, S. Fisher, L. Sudakov, A. Fish, and O. Yadid-Pecht, “Autonomous CMOS image sensor for real time target detection and tracking,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS 2008), 2008, pp. 2138–2141.
[4] Q. Lin, W. Miao, and N. Wu, “A high-speed target tracking CMOS image sensor,” in Proc. IEEE Asian Solid-State Circuits Conf. (ASSCC 2006), 2006, pp. 139–142.
[5] A. Spivak, A. Belenky, A. Fish, and O. Yadid-Pecht, “Wide-dynamic-range CMOS image sensors-Comparative performance analysis,” IEEE Trans. Electron Devices, vol. 56, pp. 2446–2461, 2009.
[6] A. Fish and O. Yadid-Pecht, “Low-power “Smart” CMOS image sensors,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS 2008), 2008, pp. 1408–1411.
[7] K. Cho, D. Lee, J. Lee, and G. Han, “Sub-1-V CMOS image sensor using time-based readout circuit,” IEEE Trans. Electron Devices, vol. 57, pp. 222–227, 2010.
[8] F. Tang and A. Bermak, “A 4T low-power linear-output current-mediated CMOS image sensor,” IEEE Trans. Very Large Scale (VLSI) Integr. Syst., to be published.
[9] V. Gruev, Y. Zheng, and J. Van der Spiegel, “Low power linear current mode imager with 1.5 transistors per pixel,” in IEEE Int. Symp. Circuits and Systems (ISCAS 2008), 2008, pp. 2142–2145.
[10] B. H. Calhoun, A. Wang, and A. Chandrakasan, “Modeling and sizing for minimum energy operation in subthreshold circuits,” IEEE J. Solid-State Circuits, vol. 40, pp. 1778–1786, 2005.
[11] M. Elgebaly and M. Sachdev, “Variation-aware adaptive voltage scaling system,” IEEE Trans. Very Large Scale (VLSI) Integr. Syst., vol. 15, pp. 560–571, 2007.
[12] Y. Xu, Z. Luo, Z. Chen, and X. Li, “Minimum leakage pattern generation using stack effect,” in Proc. 5th Int. Conf. ASIC, 2003, vol. 2, pp. 1239–1242.
[13] S. Narendra, D. Antoniadis, and V. De, “Impact of using adaptive body bias to compensate die-to-die Vt variation on within-die Vt variation,” in Proc. Int. Symp. Low Power Electronics and Design, 1999, pp. 229–232.
[14] A. Hokazono, S. Balasubramanian, K. Ishimaru, H. Ishiuchi, C. Hu, and T.-K. Liu, “Forward body biasing as a bulk-Si CMOS technology scaling strategy,” IEEE Trans. Electron Devices, vol. 55, pp. 2657–2664, 2008.
[15] C. H. Kim, J. J. Kim, S. Mukhopadhyay, and K. Roy, “A forward body-biased low-leakage SRAM cache: Device, circuit and architecture considerations,” IEEE Trans. Very Large Scale (VLSI) Integr. Syst., vol. 13, pp. 349–357, 2005.
[16] K. von Arnim, E. Borinski, P. Seegebrecht, H. Fiedler, R. Brederlow, R. Thewes, J. Berthold, and C. Pacha, “Efficiency of body biasing in 90 nm CMOS for low power digital circuits,” in Proc. 30th Eur. Solid-State Circuits Conf. (ESSCIRC 2004), 2004, pp. 175–178.
[17] H. Jeon, Y. Kim, and M. Choi, “Standby leakage power reduction technique for nanoscale CMOS VLSI systems,” IEEE Trans. Instrum. Meas., vol. 59, pp. 1127–1133, 2010.
[18] A. Bonnoit and L. Pileggi, “Reducing variability in chip-multiprocessors with adaptive body biasing,” in Proc. ACM/IEEE Int. Symp. Low-Power Electronics and Design (ISLPED), 2010, pp. 73–78.
[19] S. Mukhopadhyay, K. Kang, H. Mahmoodi, and K. Roy, “Reliable and self-repairing SRAM in nano-scale technologies using leakage and delay monitoring,” in Proc. IEEE Int. Test Conf. (ITC 2005), 2005, pp. 10–1135.
[20] S. Mukhopadhyay, Q. Chen, and K. Roy, “Memories in scaled technologies: A review of process induced failures, test methodologies, and fault tolerance,” in Proc. IEEE Design and Diagnostics of Electronic Circuits and Systems (DDECS ’07), 2007, pp. 1–6.
[21] G. Gammie, A. Wang, M. Chau, S. Gururajarao, R. Pitts, F. Jumel, S. Engel, P. Royannez, R. Lagerquist, H. Mair, J. Vaccani, G. Baldwin, K. Heragu, R. Mandal, M. Clinton, D. Arden, and U. Ko, “A 45 nm 3.5 G baseband-and-multimedia application processor using adaptive body-bias and ultra-low-power techniques,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC 2008), Digest of Technical Papers, 2008, pp. 258–611.
[22] J. W. Tschanz, J. T. Kao, S. G. Narendra, R. Nair, D. A. Antoniadis, A. P. Chandrakasan, and V. De, “Adaptive body bias for reducing impacts of die-to-die and within-die parameter variations on microprocessor frequency and leakage,” IEEE J. Solid-State Circuits, vol. 37, pp. 1396–1402, 2002.
[23] C. Neau and K. Roy, “Optimal body bias selection for leakage improvement and process compensation over different technology generations,” in Proc. 2003 Int. Symp. Low Power Electronics and Design (ISLPED ’03), 2003, pp. 116–121.
[24] K. Nii, M. Yabuuchi, Y. Tsukamoto, Y. Hirano, T. Iwamatsu, and Y. Kihara, “A 0.5 V 100 MHz PD-SOI SRAM with enhanced read stability and write margin by asymmetric MOSFET and forward body bias,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC 2010), Digest of Technical Papers, 2010, pp. 356–357.
[25] A. Fish, T. Rothschild, A. Hodes, Y. Shoshan, and O. Yadid-Pecht, “Low power CMOS image sensors employing adaptive bulk biasing control (AB2C) approach,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS 2007), 2007, pp. 2834–2837.
[26] J. M. Rabaey, A. Chandrakasan, and B. Nikolić, Digital Integrated Circuits: A Design Perspective, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2003, p. 761.
[27] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, “Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits,” Proc. IEEE, vol. 91, pp. 305–327, 2003.

784 IEEE SENSORS JOURNAL, VOL. 12, NO. 4, APRIL 2012

[28] A. Wang, B. H. Calhoun, and A. P. Chandrakasan, Sub-Threshold Design for Ultra Low-Power Systems, ser. Integrated Circuits and Systems. Secaucus, NJ: Springer-Verlag, 2006.
[29] K. von Arnim, E. Borinski, P. Seegebrecht, H. Fiedler, R. Brederlow, R. Thewes, J. Berthold, and C. Pacha, “Efficiency of body biasing in 90-nm CMOS for low-power digital circuits,” IEEE J. Solid-State Circuits, vol. 40, pp. 1549–1556, 2005.
[30] A. Khajeh, A. M. Eltawil, and F. J. Kurdahi, “Effect of body biasing on embedded SRAM failure,” in Proc. 2010 IEEE Int. Symp. Circuits and Systems (ISCAS), 2010, pp. 2350–2353.
[31] A. Wang, P. R. Gill, and A. Molnar, “An angle-sensitive CMOS imager for single-sensor 3D photography,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC 2011), Digest of Technical Papers, 2011, pp. 412–414.
[32] C. Veerappan, J. Richardson, R. Walker, D.-U. Li, M. W. Fishburn, Y. Maruyama, D. Stoppa, F. Borghetti, M. Gersbach, R. K. Henderson, and E. Charbon, “A 160 × 128 single-photon image sensor with on-pixel 55 ps 10 b time-to-digital converter,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC 2011), Digest of Technical Papers, 2011, pp. 312–314.
[33] T. Chen and S. Naffziger, “Comparison of adaptive body bias (ABB) and adaptive supply voltage (ASV) for improving delay and leakage under the presence of process variation,” IEEE Trans. Very Large Scale (VLSI) Integr. Syst., vol. 11, pp. 888–899, 2003.
[34] L. Wei, Z. Chen, M. Johnson, K. Roy, and V. De, “Design and optimization of low voltage high performance dual threshold CMOS circuits,” in Proc. Design Automation Conf., 1998, pp. 489–494.
[35] A. Belenky, A. Fish, A. Spivak, and O. Yadid-Pecht, “Global shutter CMOS image sensor with wide dynamic range,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, pp. 1032–1036, 2007.
[36] A. Teman, O. Yadid-Pecht, and A. Fish, “An improved AB2C scheme for leakage power reduction in image sensors with on-chip memory,” in Proc. IEEE Sensors, 2009, pp. 193–196.

Orly Yadid-Pecht (S’90–M’95–SM’01–F’07) received the B.Sc. degree in electrical engineering and the M.Sc. and D.Sc. degrees from Technion—Israel Institute of Technology, Haifa, Israel, in 1984, 1990, and 1995, respectively.

She was a National Research Council (USA) Research Fellow from 1995–1997 in the areas of advanced image sensors at the Jet Propulsion Laboratory (JPL), California Institute of Technology (Caltech). In 1997, she joined the Ben-Gurion University, Be’er Sheva, Israel, as a member in the Electrical and Electro-Optical Engineering Departments. There she founded the VLSI Systems Center, specializing in CMOS Image Sensors. From 2003–2005, she was affiliated with the ATIPS laboratory at the University of Calgary, Calgary, AB, Canada, promoting the area of integrated sensors. Since September 2009, she is the iCORE Professor of Integrated Sensors, Intelligent Systems (ISIS) at the University of Calgary. Her main subjects of interest are integrated CMOS sensors, smart sensors, image processing hardware, micro and biomedical system implementations. She has published over 100 papers and patents and has led over a dozen research projects supported by government and industry. Her work has over 600 external citations. In addition, she has co-authored and co-edited the first book on CMOS Image Sensors: CMOS Imaging: From Photo-Transduction to Image Processing (Norwell, MA: Kluwer, 2004). She also serves as a director on the board of two companies.

Dr. Yadid-Pecht has served as an Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS (2001–2003) and an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—PART I (2004–2005). She has also been on the CAS BoG (2004). She was the CAS representative for the IEEE Sensors Council (2006–2010), a member of the Neural Networks, Biomedical Circuits, Nanoelectronics and Giga scale systems committees and the Sensory Systems committee which she chaired during 2003–2004. Currently, she serves as VP Publications of the IEEE Sensors Council, and an Associate Editor for the IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS. She was an IEEE Distinguished Lecturer of the Circuits and Systems Society in 2005. She was also the general chair of the IEEE International Conference on Electronic Circuits and Systems (ICECS) and a current member of the steering committee of this conference.

Adam Teman (S’10) received the B.Sc. degree in electrical engineering and the M.Sc. degree from Ben-Gurion University, Be’er Sheva, Israel, in 2006 and 2011, respectively. He is pursuing the Ph.D. degree under Dr. A. Fish as part of the Low Power Circuits and Systems (LPC&S) lab in Ben-Gurion University’s VLSI Systems Center.

He worked as a Design Engineer at Marvell Semiconductors from 2006 to 2007, with an emphasis on physical implementation. His research interests include low voltage digital design, energy efficient SRAM and Flash memory arrays, low power CMOS image sensors, and low power design techniques for digital and analog VLSI chips. He has authored ten scientific papers and two patent applications, and has presented excerpts from his research at a number of international conferences.

Mr. Teman was honored with the Electrical Engineering Department’s “Teaching Excellence” recognition at Ben-Gurion University in 2010. He is a recipient of the Kreitman Foundation Fellowship for Doctoral Studies and received the Yizhak Ben-Ya’akov HaCohen Prize in 2010.

Alexander Fish (M’06) received the B.Sc. degree in electrical engineering from the Technion—Israel Institute of Technology, Haifa, Israel, in 1999 and the M.Sc. degree and the Ph.D. (summa cum laude) degree from Ben-Gurion University, Be’er Sheva, Israel, in 2002 and 2006, respectively.

He was a postdoctoral fellow in the ATIPS laboratory at the University of Calgary, Calgary, AB, Canada, from 2006–2008. In 2008, he joined the Ben-Gurion University as a faculty member in the Electrical and Computer Engineering Department. There he founded the LPCAS laboratory, specializing in low power circuits and systems. His research interests include low voltage digital design, energy efficient SRAM and Flash memory arrays, low power CMOS image sensors, and low power design techniques for digital and analog VLSI chips. He has authored over 60 scientific papers and patent applications. He has also published two book chapters.

Dr. Fish was a co-author of two papers that won the Best Paper Finalist awards at ICECS’04 and ISCAS’05 conferences. He was also awarded the Young Innovator Award for Outstanding Achievements in the field of Information Theories and Applications by ITHEA in 2005. In 2006, he was honored with the Engineering Faculty Dean “Teaching Excellence” recognition at Ben-Gurion University. He serves as an Editor-in-Chief for the MDPI Journal of Low Power Electronics and Applications (JLPEA) and as an Associate Editor for the IEEE SENSORS JOURNAL. He was also a co-organizer of special sessions on “smart” CMOS Image Sensors at IEEE Sensors Conference 2007, on low power “Smart” Image Sensors and Beyond at the IEEE ISCAS 2008, and on Design Methodologies for Advanced Ultra Low Power Sensor and Memory Arrays at the IEEE Sensors conference 2009.

Chapter 7 Summary

This dissertation presents the core achievements of the research that I carried out during my combined graduate studies. In this research, a cross-field study of power reduction in VLSI arrays was conducted at various design levels, from the circuit-technology level through to the system-algorithm level. The research methodology included an extensive literature survey in each design field, followed by the recognition of power saving opportunities in each and the development of circuit-level to system-level techniques to exploit these opportunities. The study focused on four major array types:

1. SRAM – the primary embedded memory and one of the major power consuming components of modern VLSI systems.
2. Gain Cell Embedded DRAM – a logic-compatible candidate for the replacement of the SRAM in low-power systems.
3. C-Flash NVM – a low-power logic-compatible non-volatile read-write memory, applicable for low-cost systems, such as passive RFID tags.
4. Smart Imagers – image sensors integrated with in-pixel and out-of-pixel sub-components to achieve advanced functionality.

In each of these fields, my goal was to understand the features and characteristics of the array, with an emphasis on its power consumption mechanisms, and, accordingly, to develop techniques to lower this consumption. Although many of the research projects were carried out in parallel, the overall research can be broken down into several phases:

1. In the first phase of the project, integrated circuit arrays were considered as a single entity, in order to recognize the similarities between the array types and attempt to exploit them. This work resulted in a review article [42], which describes the similarities between image sensors and SRAM arrays, and which introduces the concept of peripheral sharing between several types of arrays for power and area efficiency. Peripheral sharing was later exploited in the smart imager of [48].


2. The second phase of the project focused on low power techniques for smart image sensors. In this phase three major concepts were developed. First, an autonomous star-tracking imager with ultra-low power consumption was developed through Window-of-Interest and Winner-Take-All integration [51]. This was followed by the improvement of the AB2C concept for the low power operation of serially accessed arrays, which was integrated with a smart imager and on-chip SRAM, including the application of peripheral sharing [48, 52]. Finally, low power concepts, including voltage scaling, were applied to Wide-Dynamic-Range imagers, as described in [1] and [2].

3. Following the implementation of a low power SRAM in the framework of smart image sensor design, I changed my focus to the study of subthreshold SRAM for ultra-low power operation. The study of subthreshold circuit operation led to three early publications [50, 53, 65], and continued into the development of a pair of novel SRAM bitcells for sub- to near-threshold operation. These circuits were integrated into a 40nm test chip ("RAMBO"), which was fabricated, measured, and published in four publications [46, 49, 54, 55, 63] and two patents [61, 62]. This work led to the study of the dynamic stability of bi-stable circuits (Phase 4 of this research), and the further development of low-voltage SRAM solutions, including [60].

4. During the exploration of low-voltage SRAM, I realized that the standard metrics used for the analysis of bi-stable circuits are insufficient when considering deeply scaled process technologies and low operating voltages. This led to the fourth phase of my research, which focuses on SRAM stability and data retention voltage (DRV). The techniques developed for evaluating the dynamic stability of SRAM circuits were applied to my low-voltage SRAM cells [46, 49, 56, 60, 63], and ported to non-SRAM circuits with similar features [54, 64]. In addition, the concept of DRV was studied and exploited for minimum voltage standby of SRAM arrays through runtime measurement [58].

5. A parallel phase of my research involved leading a research team with the goal of designing the core of a low-cost low-power passive RFID tag, including an NVM array based on the TowerJazz C-Flash bitcell. This research included the development of novel circuit techniques to produce and drive the non-standard voltages required to control the array, while maintaining a low-power figure. In order to achieve these goals, the dynamic stability methods that were developed in Phase 4 were applied to the developed circuits in order to ensure robust functionality. The initial results, describing this multi-year research project, were published in [47, 59, 64].

6. The sixth and final phase of this project was initiated as an alternative to low-voltage SRAM for the realization of embedded memory. In this research, a GC-eDRAM was considered for low-power, high-density memory in ultra-low power applications. The primary method for achieving such low-power operation lies in the extension of the data retention time of these memories, which thereby lowers the frequency of power hungry refresh operations. Carried out in collaboration with the TCL group at EPFL in Switzerland, this research started with an extensive study of the previous work in the field [21], and the analysis of operating voltage limitations of GC-eDRAM in light of technology scaling [25, 57], including a subthreshold GC-eDRAM array. This research led to the development of several methods for the extension of the array's data retention time that were designed and fabricated in a 0.18μm test chip intended for ULP applications.

The research has so far resulted in ten journal and thirteen conference paper publications, with several other papers in the final stages of preparation or under review. Three of the journal papers have been included in special issues of international journals and three of the conference papers were presented as part of special sessions at IEEE conferences. Two patents were submitted based on the novel low-voltage SRAM designs. A total of four test chips were fabricated in 0.18μm, 80nm, and 40nm technologies. Finally, my research has provided the basis for the theses of five MSc students (two have already graduated) and 15 undergraduate senior projects. In addition, based on this research, I have personally received four prestigious student achievement awards: the Yitzhak Ben-Ya'akov HaCohen Prize, the BGU Rector's Award for Academic Excellence, the Wolf Foundation Scholarship for Research Students, and the Intel Prize.


Bibliography

7.1 List of Publications

7.1.1 Papers Published in Peer-Reviewed Journals

1) H. Dagan, A. Teman, E. Pikhay, V. Dayan, A. Mordakhay, Y. Roizin and A. Fish, “A GIDL Free Tunneling Gate Driver for a Low Power Non-Volatile Memory Array”, IEEE JSSC, vol. 40, no. 6, pp. 1497-1510, 2013.
2) P. Meinerzhagen, A. Teman, R. Giterman, A. Burg, A. Fish, “Exploration of Sub-VT and Near-VT 2T Gain-Cell Memories for Ultra-Low Power Applications under Technology Scaling,” MDPI JLPEA, vol. 3, no. 2, pp. 54-72, 2013.
3) A. Teman, A. Mordakhay, and A. Fish, “Functionality and Stability Analysis of a 400mV Quasi-Static RAM (QSRAM) Bitcell,” Elsevier Microelectronics Journal, vol. 44, no. 3, pp. 236-247, 2013.
4) A. Teman, A. Mordakhay, J. Mezhibovsky, A. Fish, “A 40 nm Sub-Threshold 5T SRAM Bit Cell with Improved Read and Write Stability,” IEEE TCAS-II (Special issue on Ultra-Low Voltage VLSI Circuits and Systems for Green Computing), vol. 59, no. 12, pp. 873-877, Dec. 2012.
5) A. Spivak, A. Teman, A. Belenky, O. Yadid-Pecht, and A. Fish, “Low-Voltage 96 dB Snapshot CMOS Image Sensor with 4.5 nW Power Dissipation per Pixel,” MDPI Sensors, vol. 12, no. 8, pp. 10067-10085, 2012.
6) A. Teman, O. Yadid-Pecht, and A. Fish, “Leakage Reduction in Advanced Image Sensors using an Improved AB2C Scheme,” IEEE Sensors Journal, vol. 12, no. 4, pp. 773-784, 2012. (In the list of Top Accessed Articles, February 2012)
7) A. Teman, L. Pergament, O. Cohen and A. Fish, "A 250mV 8kb 40nm Ultra-Low Power 9T Supply Feedback SRAM (SF-SRAM)," J. Solid State Circuits, vol. 46, no. 11, pp. 2713-2726, Nov. 2011.
8) A. Teman, L. Pergament, O. Cohen and A. Fish, "A Minimum Leakage Quasi-Static RAM Bitcell," J. Low Power Electron. Appl., vol. 1, pp. 204-218, 2011.


9) A. Spivak, A. Teman, A. Belenky, O. Yadid-Pecht and A. Fish, "Power-Performance Tradeoffs in Wide Dynamic Range Image Sensors with Multiple Reset Approach," J. Low Power Electron. Appl., vol. 1, pp. 59-76, 2011.
10) A. Teman, O. Yadid-Pecht, A. Fish, “Large VLSI Arrays – Power and Architectural Perspectives”, IJ ITK, vol. 4, pp. 76, 2010.

7.1.2 Papers Published in Peer-Reviewed Conference Proceedings:

1) A. Teman, P. Meinerzhagen, A. Burg, and A. Fish, “Review and Classification of Gain Cell eDRAM Implementations,” IEEEI 2012, Nov. 2012.
2) N. Edri, S. Fraiman, A. Teman, and A. Fish, “Data Retention Voltage Detection for Minimizing the Standby Power of SRAM Arrays,” IEEEI 2012, Nov. 2012.
3) P. Meinerzhagen, A. Teman, A. Burg, and A. Fish, “A Sub-VT 2T Gain-Cell Memory for Biomedical Applications,” IEEE Sub-Threshold Microelectronics Conference, Oct. 2012.
4) J. Mezhibovsky, A. Teman, and A. Fish, "State Space Modeling for Sub-Threshold SRAM Stability Analysis," ISCAS 2012, pp. 1823-1826, May 2012.
5) H. Dagan, A. Teman, et al., "A GIDL Free Tunneling Gate Driver for a Low Power Non-Volatile Memory Array," ISCAS 2012, pp. 452-455, May 2012.
6) H. Dagan, A. Teman, et al., "A Low-Cost Low-Power Non-Volatile Memory for RFID Applications," ISCAS 2012, pp. 1827-1830, May 2012.
7) J. Mezhibovsky, A. Teman, A. Fish, “Low Voltage SRAMs and the Scalability of the 9T Supply Feedback SRAM,” SOC Conference (SOCC), 2011 IEEE International, pp. 136-141, 26-28 Sept. 2011.
8) I. Schwartz, A. Teman, L. Pergament, R. Dobkin, A. Fish, “Near-threshold 40nm Supply Feedback C-element,” Quality Electronic Design (ASQED), 2011 3rd Asia Symposium on, pp. 74-78, 19-20 July 2011.
9) A. Teman, A. Fish, “Sub-threshold and Near-threshold SRAM Design,” 2010 IEEE 26th Convention of Electrical and Electronics Engineers in Israel, IEEEI 2010, pp. 608-612, 2010.
10) A. Teman, O. Yadid-Pecht, A. Fish, “An Improved AB2C Scheme for Leakage Power Reduction in Image Sensors with On-Chip Memory,” IEEE Sensors 2009, pp. 193-196, Christchurch, New Zealand, Oct. 2009.


11) S. Fisher, A. Teman, D. Vaysman, A. Gertsman, O. Yadid-Pecht, A. Fish, “Ultra-Low Power Subthreshold Flip-Flop Design,” ISCAS 2009, pp. 1573-1576, Taipei, Taiwan, May 2009.
12) S. Fisher, A. Teman, D. Vaysman, A. Gertsman, O. Yadid-Pecht, A. Fish, “Digital Subthreshold Design – Motivation and Challenges,” IEEEI 2008, pp. 702-706, Eilat, Israel, November 2008.
13) A. Teman, S. Fisher, L. Sudakov, A. Fish, O. Yadid-Pecht, “Autonomous CMOS image sensor for real time target detection and tracking,” ISCAS 2008, pp. 2138-2141, Seattle, Washington, USA, May 2008.

7.2 References

[1] A. Spivak, A. Teman, A. Belenky, O. Yadid-Pecht and A. Fish, "Power-Performance Tradeoffs in Wide Dynamic Range Image Sensors with Multiple Reset Approach," JLPEA, vol. 1, pp. 59-76, 2011.
[2] A. Spivak, A. Teman, A. Belenky, O. Yadid-Pecht and A. Fish, "Low-Voltage 96 dB Snapshot CMOS Image Sensor with 4.5 nW Power Dissipation per Pixel," Sensors, vol. 12, pp. 10067-10085, 2012.
[3] G.E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, pp. 114-117, April 1965.
[4] S. Borkar, "Obeying Moore's Law Beyond 0.18 Micron [Microprocessor Design]," in ASIC/SOC Conference, 2000 Proceedings. 13th Annual IEEE International, pp. 26-31, 2000.
[5] P.P. Gelsinger, "Microprocessors for the New Millennium: Challenges, Opportunities, and New Frontiers," in Solid-State Circuits Conference, 2001. Digest of Technical Papers. ISSCC. 2001 IEEE International, pp. 22-25, 2001.
[6] J.M. Rabaey, A. Chandrakasan and B. Nikolić, Digital Integrated Circuits: A Design Perspective, 2nd Edition, Prentice Hall, 2003.
[7] M. Horowitz, D. Stark and E. Alon, "Digital Circuit Design Trends," Solid-State Circuits, IEEE Journal of, vol. 43, pp. 757-761, 2008.
[8] S. Borkar, "Design Challenges of Technology Scaling," Micro, IEEE, vol. 19, pp. 23-29, 1999.
[9] C. Piguet, Low-Power Electronics Design, CRC, 2005.
[10] A. Wang, B.H. Calhoun and A.P. Chandrakasan, Sub-threshold Design for Ultra Low-Power Systems (Series on Integrated Circuits and Systems), Secaucus, NJ, USA: Springer-Verlag New York, Inc, 2006.


[11] N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J.S. Hu, M.J. Irwin, M. Kandemir and V. Narayanan, "Leakage Current: Moore's Law Meets Static Power," Computer, vol. 36, pp. 68-75, Dec. 2003.
[12] ITRS, "International Technology Roadmap for Semiconductors, http://www.itrs.net/," 2011 Edition.
[13] M. Sharifkhani and M. Sachdev, "A Low Power SRAM Architecture Based on Segmented Virtual Grounding," in Low Power Electronics and Design, 2006. ISLPED '06. Proceedings of the 2006 International Symposium on, pp. 256-261, 2006.
[14] B.-D. Yang, "A Low-Power SRAM Using Bit-Line Charge-Recycling for Read and Write Operations," Solid-State Circuits, IEEE Journal of, vol. 45, pp. 2173-2183, 2010.
[15] Byung-Do Yang and Lee-Sup Kim, "A Low-Power SRAM Using Hierarchical Bit line and Local Sense Amplifiers," Solid-State Circuits, IEEE Journal of, vol. 40, pp. 1366-1376, 2005.
[16] Keejong Kim, H. Mahmoodi and K. Roy, "A Low-Power SRAM Using Bit-Line Charge-Recycling," Solid-State Circuits, IEEE Journal of, vol. 43, pp. 446-459, 2008.
[17] C.C. Wang, C.L. Lee and W.J. Lin, "A 4-kb Low-Power SRAM Design with Negative Word-line Scheme," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 54, pp. 1069-1076, 2007.
[18] K. Nii, H. Makino, Y. Tujihashi, C. Morishima, Y. Hayakawa, H. Nunogami, T. Arakawa and H. Hamano, "A Low Power SRAM Using Auto-Backgate-Controlled MT-CMOS," in Low Power Electronics and Design, 1998. Proceedings. 1998 International Symposium on, pp. 293-298, 1998.
[19] M. Powell, Se-Hyun Yang, B. Falsafi, K. Roy and T.N. Vijaykumar, "Gated-Vdd: a Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories," in Low Power Electronics and Design, 2000. ISLPED '00. Proceedings of the 2000 International Symposium on, pp. 90-95, 2000.
[20] K. Itoh, VLSI Memory Chip Design, Springer Series in Advanced Microelectronics, Springer-Verlag Berlin, Germany, 2001.
[21] A. Teman, P.A. Meinerzhagen, A.P. Burg and A. Fish, "Review and Classification of Gain Cell eDRAM Implementations," 2012.
[22] Ki Chul Chun, P. Jain and C.H. Kim, "Logic-Compatible Embedded DRAM Design for Memory Intensive Low Power Systems," in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pp. 277-280, 2010.
[23] Yoonmyung Lee, Mao-Ter Chen, Junsun Park, D. Sylvester and D. Blaauw, "A 5.42nW/kB Retention Power Logic-Compatible Embedded DRAM with 2T Dual-Vt Gain Cell for Low Power Sensing Applications," in Solid State Circuits Conference (A-SSCC), 2010 IEEE Asian, pp. 1-4, 2010.


[24] Ki Chul Chun, P. Jain, Jung-Hwa Lee and C.H. Kim, "A 3T Gain Cell Embedded DRAM Utilizing Preferential Boosting for High Density and Low Power On-Die Caches," Solid-State Circuits, IEEE Journal of, vol. 46, pp. 1495-1505, 2011.
[25] P. Meinerzhagen, A. Teman, R. Giterman, A. Burg and A. Fish, "Exploration of Sub-VT and Near-VT 2T Gain-Cell Memories for Ultra-Low Power Applications Under Technology Scaling," Journal of Low Power Electronics and Applications, vol. 3, pp. 54-72, 2013.
[26] A. Fish and O. Yadid-Pecht, "Low Power CMOS Imager Circuits," in Circuits at the Nanoscale, K. Iniewski, CRC Press, 2008, ch. 24, pp. 455-484.
[27] A. El Gamal and H. Eltoukhy, "CMOS Image Sensors," Circuits and Devices Magazine, IEEE, vol. 21, pp. 6-20, 2005.
[28] M. Bigas, E. Cabruja, J. Forest and J. Salvi, "Review of CMOS Image Sensors," Microelectron. J., vol. 20, pp. 1-19, 2005.
[29] E.R. Fossum, "CMOS Image Sensors: Electronic Camera-on-a-Chip," Electron Devices, IEEE Transactions on, vol. 44, pp. 1689-1698, 1997.
[30] E.R. Fossum, "Active Pixel Sensors: Are CCD's Dinosaurs?" 1993.
[31] A. Fish, “Smart Active Pixel Sensors for Ultra Low-Power Applications," Ph.D. Thesis, 2006.
[32] I.J. Chang, J.J. Kim, S.P. Park and K. Roy, "A 32 kb 10T Sub-Threshold SRAM Array With Bit-Interleaving and Differential Read Scheme in 90 nm CMOS," Solid-State Circuits, IEEE Journal of, vol. 44, pp. 650-658, Feb. 2009.
[33] B.H. Calhoun and A.P. Chandrakasan, "A 256-kb 65-nm Sub-threshold SRAM Design for Ultra-Low-Voltage Operation," IEEE J Solid State Circuits, vol. 42, pp. 680-688, March 2007.
[34] N. Verma and A.P. Chandrakasan, "A 256 kb 65 nm 8T Subthreshold SRAM Employing Sense-Amplifier Redundancy," Solid-State Circuits, IEEE Journal of, vol. 43, pp. 141-149, Jan. 2008.
[35] Kyung Ki Kim and Yong-Bin Kim, "Optimal Body Biasing for Minimum Leakage Power in Standby Mode," in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, pp. 1161-1164, 2007.
[36] A. Kumar, H. Qin, P. Ishwar, J. Rabaey and K. Ramchandran, "Fundamental Bounds on Power Reduction during Data-Retention in Standby SRAM," in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, pp. 1867-1870, 2007.
[37] J. Wang, A. Singhee, R.A. Rutenbar and B.H. Calhoun, "Statistical Modeling for the Minimum Standby Supply Voltage of a Full SRAM Array," in Solid State Circuits Conference, 2007. ESSCIRC 2007. 3rd European, pp. 400-403, 2007.


[38] Jiajing Wang and B.H. Calhoun, "Techniques to Extend Canary-Based Standby Scaling for SRAMs to 45 nm and Beyond," Solid-State Circuits, IEEE Journal of, vol. 43, pp. 2514-2523, 2008.
[39] A. Karandikar and K.K. Parhi, "Low Power SRAM Design Using Hierarchical Divided Bit-line Approach," in Computer Design: VLSI in Computers and Processors, 1998. ICCD '98. Proceedings. International Conference on, pp. 82-88, 1998.
[40] B.H. Calhoun and A. Chandrakasan, "Analyzing Static Noise Margin for Sub-threshold SRAM in 65nm CMOS," in Solid-State Circuits Conference, 2005. ESSCIRC 2005. Proceedings of the 31st European, pp. 363-366, 2005.
[41] B.H. Calhoun and A.P. Chandrakasan, "Static Noise Margin Variation for Sub-threshold SRAM in 65-nm CMOS," Solid-State Circuits, IEEE Journal of, vol. 41, pp. 1673-1679, 2006.
[42] A. Teman, O. Yadid-Pecht and A. Fish, "Large VLSI Arrays – Power and Architectural Perspectives," IJ ITK, vol. 4, pp. 76-88, 2010.
[43] A. Fish and O. Yadid-Pecht, "Low-Power “Smart” CMOS Image Sensors," in Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on, pp. 1408-1411, 2008.
[44] O. Yadid-Pecht and R. Etienne-Cummings, CMOS Imagers: From Phototransduction to Image Processing, Norwell, MA: Springer, 2004.
[45] E. Culurciello and A.G. Andreou, "16x16 Pixel Silicon on Sapphire CMOS Digital Pixel Photosensor Array," Electronics Letters, vol. 40, pp. 66-68, 2004.
[46] A. Teman, L. Pergament, O. Cohen and A. Fish, "A 250mV 8kb 40nm Ultra-Low Power 9T Supply Feedback SRAM (SF-SRAM)," IEEE Journal of Solid State Circuits (JSSC), vol. 46, pp. 2713-2726, 2011.
[47] H. Dagan, A. Teman, A. Fish, E. Pikhay, V. Dayan and Y. Roizin, "A GIDL Free Tunneling Gate Driver for a Low Power Non-volatile Memory Array," in Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pp. 452-455, 2012.
[48] A. Teman, O. Yadid-Pecht and A. Fish, "Leakage Reduction in Advanced Image Sensors Using an Improved AB2C Scheme," Sensors Journal, IEEE, vol. 12, pp. 773-784, 2012.
[49] A. Teman, A. Mordakhay and A. Fish, "Functionality and Stability Analysis of a 400 mV Quasi-Static RAM (QSRAM) Bitcell," Microelectron. J., vol. 44, no. 3, pp. 236-247, 2013.
[50] S. Fisher, A. Teman, D. Vaysman, A. Gertsman, O. Yadid-Pecht and A. Fish, "Ultra-Low Power Subthreshold Flip-flop Design," in Circuits and Systems, 2009. ISCAS 2009. IEEE International Symposium on, pp. 1573-1576, 2009.
[51] A. Teman, S. Fisher, L. Sudakov, A. Fish and O. Yadid-Pecht, "Autonomous CMOS Image Sensor for Real Time Target Detection and Tracking," in Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on, pp. 2138-2141, 2008.

[52] A. Teman, O. Yadid-Pecht and A. Fish, "An Improved AB2C Scheme for Leakage Power Reduction in Image Sensors with On-Chip Memory," in Sensors, 2009 IEEE, pp. 193-196, 2009.
[53] A. Teman and A. Fish, "Sub-threshold and Near-threshold SRAM Design," in Electrical and Electronics Engineers in Israel (IEEEI), 2010 IEEE 26th Convention of, pp. 608-612, 2010.
[54] I. Schwartz, A. Teman, R. Dobkin and A. Fish, "Near-threshold 40nm Supply Feedback C-element," in Quality Electronic Design (ASQED), 2011 3rd Asia Symposium on, pp. 74-78, 2011.
[55] J. Mezhibovsky, A. Teman and A. Fish, "Low Voltage SRAMs and the Scalability of the 9T Supply Feedback SRAM," in SOC Conference (SOCC), 2011 IEEE International, pp. 136-141, 2011.
[56] J. Mezhibovsky, A. Teman, and A. Fish, "State Space Modeling for Sub-threshold SRAM Stability Analysis," in Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pp. 1823-1826, 2012.
[57] P. Meinerzhagen, A. Teman, A. Mordakhay, A. Burg and A. Fish, "A Sub-VT 2T Gain-Cell Memory for Biomedical Applications," in Subthreshold Microelectronics Conference (SubVT), 2012 IEEE, pp. 1-3, 2012.
[58] N. Edri, S. Fraiman, A. Teman and A. Fish, "Data Retention Voltage Detection for Minimizing the Standby Power of SRAM Arrays," in Electrical & Electronics Engineers in Israel (IEEEI), 2012 IEEE 27th Convention of, pp. 1-5, 2012.
[59] H. Dagan, A. Teman, A. Fish, E. Pikhay, V. Dayan and Y. Roizin, "A Low-Cost Low-Power Non-Volatile Memory for RFID Applications," in Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pp. 1827-1830, 2012.
[60] A. Teman, A. Mordakhay, J. Mezhibovsky and A. Fish, "A 40-nm Sub-Threshold 5T SRAM Bit Cell With Improved Read and Write Stability," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 59, pp. 873-877, 2012.
[61] A. Teman, L. Pergament, O. Cohen and A. Fish, "ULTRA LOW POWER SRAM CELL CIRCUIT WITH A SUPPLY FEEDBACK LOOP FOR NEAR AND SUB THRESHOLD OPERATION," US Patent 20,120,281,458, 2012.
[62] A. Teman, L. Pergament, O. Cohen and A. Fish, "ULTRA LOW POWER MEMORY CELL WITH A SUPPLY FEEDBACK LOOP CONFIGURED FOR MINIMAL LEAKAGE OPERATION," US Patent 20,120,281,459, 2012.
[63] A. Teman, L. Pergament, O. Cohen and A. Fish, "A Minimum Leakage Quasi-Static RAM Bitcell," J. Low Power Electron. Appl., vol. 1, pp. 204-218, 2011.
[64] H. Dagan, A. Teman, E. Pikhay, V. Dayan, A. Mordakhay, Y. Roizin and A. Fish, "A Low-Power DCVSL-Like GIDL-Free Voltage Driver for Low-Cost RFID Nonvolatile Memory," Solid-State Circuits, IEEE Journal of, vol. PP, pp. 1-14, 2013.


[65] S. Fisher, A. Teman, D. Vaysman, A. Gertsman, O. Yadid-Pecht and A. Fish, "Digital Subthreshold Logic Design - Motivation and Challenges," in Electrical and Electronics Engineers in Israel, 2008. IEEEI 2008. IEEE 25th Convention of, pp. 702-706, 2008.
[66] V. De and S. Borkar, "Technology and Design Challenges for Low Power and High Performance [Microprocessors]," in Low Power Electronics and Design, 1999. Proceedings. 1999 International Symposium on, pp. 163-168, 1999.
[67] J. Kao, S. Narendra and A. Chandrakasan, "Subthreshold Leakage Modeling and Reduction Techniques [IC CAD Tools]," in Computer Aided Design, 2002. ICCAD 2002. IEEE/ACM International Conference on, pp. 141-148, 2002.
[68] M. J. Deen and Z. X. Yan, "DIBL in Short-Channel NMOS Devices at 77 K," Electron Devices, IEEE Transactions on, vol. 39, pp. 908-915, 1992.
[69] C. Neau and K. Roy, "Optimal Body Bias Selection for Leakage Improvement and Process Compensation over Different Technology Generations," in Low Power Electronics and Design, 2003. ISLPED '03. Proceedings of the 2003 International Symposium on, pp. 116-121, 2003.
[70] N. Balley and B. Baylac, "Analytical Modelling of Depletion-Mode MOSFET with Short- and Narrow-Channel Effects," Solid-State and Electron Devices, IEE Proceedings I, vol. 128, pp. 225-238, 1981.
[71] C.-. Lu and J.M. Sung, "Reverse Short-Channel Effects on Threshold Voltage in Submicrometer Salicide Devices," Electron Device Letters, IEEE, vol. 10, pp. 446-448, 1989.
[72] E.P. Vandamme, P. Jansen and L. Deferm, "Modeling the Subthreshold Swing in MOSFET's," Electron Device Letters, IEEE, vol. 18, pp. 369-371, 1997.
[73] R.M. Rao, J.L. Burns, A. Devgan and R.B. Brown, "Efficient Techniques for Gate Leakage Estimation," in Low Power Electronics and Design, 2003. ISLPED '03. Proceedings of the 2003 International Symposium on, pp. 100-103, 2003.
[74] S. H. Voldman, J. A. Brachitta and D. J. Fitzgerald, "Band-to-Band Tunneling and Thermal Generation Gate-Induced Drain Leakage," Electron Devices, IEEE Transactions on, vol. 35, pp. 2433, 1988.
[75] Xiaobin Yuan, Jae-Eun Park, Jing Wang, Enhai Zhao, D.C. Ahlgren, T. Hook, Jun Yuan, V. Chan, Huiling Shang, Chu-Hsin Liang, R. Lindsay, Sungjoon Park and Hyotae Choo, "Gate-Induced-Drain-Leakage Current in 45-nm CMOS Technology," Device and Materials Reliability, IEEE Transactions on, vol. 8, pp. 501-508, 2008.
[76] J. Zhu, R.A. Martin and J.Y. Chen, "Punchthrough Current for Submicrometer MOSFETs in CMOS VLSI," Electron Devices, IEEE Transactions on, vol. 35, pp. 145-151, 1988.
[77] Kuan-Yu Fu and Y.L. Tsang, "On the Punchthrough Phenomenon in Submicron MOS Transistors," Electron Devices, IEEE Transactions on, vol. 44, pp. 847-855, 1997.


[78] R.R. Schaller, "Moore's Law: Past, Present and Future," Spectrum, IEEE, vol. 34, pp. 52-59, 1997.
[79] A. Dancy and A. Chandrakasan, "Techniques for Aggressive Supply Voltage Scaling and Efficient Regulation," in Custom Integrated Circuits Conference, 1997., Proceedings of the IEEE 1997, pp. 579-586, 1997.
[80] H. Won, K. Kim, K. Jeong, K. Park, K. Choi and J. Kong, "An MTCMOS Design Methodology and its Application to Mobile Computing," in Low Power Electronics and Design, 2003. ISLPED '03. Proceedings of the 2003 International Symposium on, pp. 110-115, 2003.
[81] Zhigang Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson and P. Bose, "Microarchitectural Techniques for Power Gating of Execution Units," in Low Power Electronics and Design, 2004. ISLPED '04. Proceedings of the 2004 International Symposium on, pp. 32-37, 2004.
[82] S. Narendra, S. Borkar, V. De, D. Antoniadis and A. Chandrakasan, "Scaling of Stack Effect and its Application for Leakage Reduction," in Low Power Electronics and Design, International Symposium on, 2001. pp. 195-200, 2001.
[83] M. Bauer, R. Alexis, G. Atwood, B. Baltar, A. Fazio, K. Frary, M. Hensel, M. Ishac, J. Javanifard, M. Landgraf, D. Leak, K. Loe, D. Mills, P. Ruby, R. Rozman, S. Sweha, S. Talreja and K. Wojciechowski, "A Multilevel-Cell 32 Mb Flash Memory," in Solid-State Circuits Conference, 1995. Digest of Technical Papers. 42nd ISSCC, 1995 IEEE International, pp. 132-133, 351, 1995.
[84] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S. Nagao, S. Kayano and T. Nakano, "A 64Kb Full CMOS RAM with Divided Word Line Structure," in Solid-State Circuits Conference. Digest of Technical Papers. 1983 IEEE International, pp. 58-59, 1983.
[85] G. Fukano, K. Kushida, A. Tohata, Y. Takeyama, K. Imai, A. Suzuki, T. Yabe and N. Otsuka, "A 65nm 1Mb SRAM Macro with Dynamic Voltage Scaling in Dual Power Supply Scheme for Low Power SoCs," in Non-Volatile Workshop, 2008 and 2008 International Conference on Memory Technology and Design. NVSMW/ICMTD 2008. Joint, pp. 97-98, 2008.
[86] Jiajing Wang and B.H. Calhoun, "Canary Replica Feedback for Near-DRV Standby VDD Scaling in a 90nm SRAM," in Custom Integrated Circuits Conference, 2007. CICC '07. IEEE, pp. 29-32, 2007.
[87] T.H. Kim, J. Liu and C.H. Kim, "An 8T Subthreshold SRAM Cell Utilizing Reverse Short Channel Effect for Write Margin and Read Performance Improvement," in IEEE Custom Integrated Circuits Conference, 2007 (CICC '07). pp. 241-244, 2007.
[88] C. H. Holder Jr, H. C. Kirsch and J. H. Stefany, "Semiconductor Memory with Boosted Word Line," Semiconductor Memory with Boosted Word Line, 1987.


[89] D. Markovic, C.C. Wang, L.P. Alarcon, T.T. Liu and J.M. Rabaey, "Ultralow-Power Design in Near-Threshold Region," Proceedings of the IEEE, vol. 98, pp. 237-252, Feb. 2010.
[90] E.A. Vittoz, "Weak Inversion for Ultra Low-Power and Very Low-Voltage Circuits," in Solid-State Circuits Conference, 2009. A-SSCC 2009. IEEE Asian, pp. 129-132, 2009.
[91] Bo Zhai, S. Pant, L. Nazhandali, S. Hanson, J. Olson, A. Reeves, M. Minuth, R. Helfand, T. Austin, D. Sylvester and D. Blaauw, "Energy-Efficient Subthreshold Processor Design," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 17, pp. 1127-1137, 2009.
[92] A. Wang and A. Chandrakasan, "A 180-mV Subthreshold FFT Processor Using a Minimum Energy Design Methodology," Solid-State Circuits, IEEE Journal of, vol. 40, pp. 310-319, 2005.
[93] K. Itoh, "Low-Voltage Memories for Power-Aware Systems," in Low Power Electronics and Design, 2002. ISLPED '02. Proceedings of the 2002 International Symposium on, pp. 1-6, 2002.
[94] A. Raychowdhury, S. Mukhopadhyay and K. Roy, "A Feasibility Study of Subthreshold SRAM Across Technology Generations," in Computer Design: VLSI in Computers and Processors, 2005. ICCD 2005. Proceedings. 2005 IEEE International Conference on, pp. 417-422, 2005.
[95] A. Wang and A. Chandrakasan, "A 180mV FFT Processor Using Subthreshold Circuit Techniques," in Solid-State Circuits Conference, 2004. Digest of Technical Papers. ISSCC. 2004 IEEE International, pp. 292-529 Vol.1, 2004.
[96] M. Alioto, "Guest Editorial for the Special Issue on Ultra-Low-Voltage VLSI Circuits and Systems for Green Computing," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 59, pp. 849-852, 2012.
[97] M.E. Sinangil, N. Verma and A.P. Chandrakasan, "A Reconfigurable 65nm SRAM Achieving Voltage Scalability from 0.25–1.2 V and Performance Scalability from 20kHz–200MHz," in Solid-State Circuits Conference, 2008. ESSCIRC 2008. 34th European, pp. 282-285, 2008.
[98] S. Hanson, Mingoo Seok, Yu-Shiang Lin, Zhi Yoong Foo, Daeyeon Kim, Yoonmyung Lee, N. Liu, D. Sylvester and D. Blaauw, "A Low-Voltage Processor for Sensing Applications With Picowatt Standby Mode," Solid-State Circuits, IEEE Journal of, vol. 44, pp. 1145-1155, 2009.
[99] J. Constantin, A. Dogan, O. Andersson, P. Meinerzhagen, J.N. Rodrigues, D. Atienza and A. Burg, "TamaRISC-CS: An Ultra-Low-Power Application-Specific Processor for Compressed Sensing," in VLSI and System-on-Chip (VLSI-SoC), 2012 IEEE/IFIP 20th International Conference on, pp. 159-164, 2012.
[100] L. Chang, R. K. Montoye, Y. Nakamura, K. A. Batson, R. J. Eickemeyer, R. H. Dennard, W. Haensch and D. Jamsek, "An 8T-SRAM for Variability Tolerance and Low-Voltage Operation in High-Performance Caches," Solid-State Circuits, IEEE Journal of, vol. 43, pp. 956-963, 2008.
[101] Zhiyu Liu and V. Kursun, "Characterization of a Novel Nine-Transistor SRAM Cell," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, pp. 488-492, 2008.
[102] B. H. Calhoun and A. Chandrakasan, "A 256kb Sub-threshold SRAM in 65nm CMOS," in Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International, pp. 2592-2601, 2006.
[103] S. Hong, S. Kim, J. Wee and S. Lee, "Low-Voltage DRAM Sensing Scheme with Offset-Cancellation Sense Amplifier," Solid-State Circuits, IEEE Journal of, vol. 37, pp. 1356-1360, 2002.
[104] D. Somasekhar, Y. Ye, P. Aseron, S. Lu, M. Khellah, J. Howard, G. Ruhl, T. Karnik, S.Y. Borkar and V. De, "2GHz 2Mb 2T Gain Cell Memory Macro with 128GB/s Bandwidth in a 65nm Logic Process," in Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, pp. 274-613, 2008.
[105] K.C. Chun, P. Jain, J.H. Lee and C.H. Kim, "A 3T Gain Cell Embedded DRAM Utilizing Preferential Boosting for High Density and Low Power On-Die Caches," Solid-State Circuits, IEEE Journal of, vol. 46, pp. 1495-1505, 2011.
[106] R. Iqbal, P. Meinerzhagen and A. Burg, "Two-Port Low-Power Gain-Cell Storage Array: Voltage Scaling and Retention Time," in Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pp. 2469-2472, 2012.
[107] K.C. Chun, P. Jain and C. H. Kim, "Logic-Compatible Embedded DRAM Design for Memory Intensive Low Power Systems," in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pp. 277-280, 2010.
[108] Y. Lee, M. Chen, J. Park, D. Sylvester and D. Blaauw, "A 5.42 nW/kB Retention Power Logic-Compatible Embedded DRAM with 2T Dual-Vt Gain Cell for Low Power Sensing Applications," in Solid State Circuits Conference (A-SSCC), 2010 IEEE Asian, pp. 1-4, 2010.
[109] K. C. Chun, P. Jain, T. Kim and C.H. Kim, "A 667 MHz Logic-Compatible Embedded DRAM Featuring an Asymmetric 2T Gain Cell for High Speed On-Die Caches," Solid-State Circuits, IEEE Journal of, vol. 47, pp. 547-559, 2012.
[110] K.C. Chun, W. Zhang, P. Jain and C.H. Kim, "A 2T1C Embedded DRAM Macro with No Boosted Supplies Featuring a 7T SRAM Based Repair and a Cell Storage Monitor," Solid-State Circuits, IEEE Journal of, vol. 47, pp. 2517-2526, 2012.


Appendix A: Large VLSI Arrays – Power and Architectural Perspectives

As it appears in the International Journal “Information Technologies and Knowledge” (IJ ITK), vol. 4, issue 1, pages 76-88, published in October 2010 [42]. Also presented at the iTech2010 Conference of the ITHEA, Varna, Bulgaria, June 2010.

Appendix 76 International Journal “Information Technologies and Knowledge”, Vol. 4, Number 1, 2010

LARGE VLSI ARRAYS – POWER AND ARCHITECTURAL PERSPECTIVES

Adam Teman, Orly Yadid-Pecht and Alexander Fish

Abstract: A novel approach to power reduction in VLSI arrays is proposed. This approach includes recognizing the similarities in the architectures and power profiles of different types of arrays, adapting methods developed for one type of array to others, and sharing components when several arrays are embedded in the same system and operated together. Two types of arrays are discussed: image sensor pixel arrays and SRAM bitcell arrays. For both types of arrays, the architectures and major sources of power consumption are presented, and several examples of power reduction techniques are discussed. Similarities between the architectures and power components of the two types of arrays are shown. A number of peripheral sharing techniques for systems employing both image sensors and SRAM arrays are proposed and discussed. Finally, a practical example of a smart image sensor with an embedded memory is given, using an Adaptive Bulk Biasing Control scheme. The peripheral sharing and power saving techniques used in this system are discussed. This example was implemented in a standard 90nm CMOS process and showed a 26% leakage reduction as compared to standard systems.

Keywords: VLSI Arrays, SRAM, Smart Image Sensors, Low Power, AB2C.

ACM Classification Keywords: B.3.1 Semiconductor Memories - SRAM, B.6 Logic Design – Memory Control and Access, B.7 Integrated Circuits – VLSI, E.1 Data Structures – Arrays, I.4.1 Digitization and Image Capture

Introduction

The continued validity of Moore’s Law [Moore65] in recent years has created great opportunities for embedding complex systems and extended functionality on a single die. The primary example of this trend is the modern, high-performance, multi-core microprocessor, which employs large memory caches in order to achieve high bandwidth. Another popular example is the smart image sensor, which integrates additional analog and digital signal processing capabilities into a conventional CMOS sensor array. Both microprocessors and image sensors are frequent components of various Systems-on-Chip (SOCs) that also embed several additional SRAM arrays for various functions. As a result of these trends, large VLSI arrays frequently cover a large area of various microelectronic systems, sometimes well over half of the total silicon die.

One of the side effects of the integration of large VLSI arrays is, of course, power consumption. In the last decade, low-power design has ousted high performance as the main focus of the VLSI industry. This is a result of the constant exponential rise in power density over the past three decades, coupled with the rise in popularity of mobile, battery-powered devices. This power increase proved to be unacceptable both in stationary, high-performance systems, where the cost and complexity of heat dissipation became too high, and in mobile devices, where increased performance and functionality are required alongside long intervals between recharges. In today’s systems, the large memory arrays are very commonly the main source of power consumption. In digital camera systems, the pixel array, along with its periphery, is the main consumer of power, and likewise, in other SOCs comprising smart imagers, it tends to be close to the top of the list. These facts lead us to the realization that low-power solutions for embedded arrays are a necessity in modern VLSI design.
In this paper, we have chosen the two types of arrays mentioned above, embedded SRAM bitcell arrays and image sensor pixel arrays, for discussion. Through these examples, we will show that there are several similarities in the architectures and power profiles of different types of arrays. Many techniques and solutions have been developed for power reduction in each type of array, but rarely has one technique been adapted to fit another type of array. Through our discussion, we will show that such possibilities exist and provide an important direction for low power research. The discussion will start with a review of the architectures of both types of subsystems (i.e. bitcell and pixel arrays), describing the components that compose each. We will then discuss the sources of power consumption and the related problems for each subsystem, as well as a number of existing low power solutions for each case. We will continue with a comparison of the two types of subsystems, highlighting similarities and discussing peripheral sharing opportunities. Finally, we will give an example of a system, recently developed by our group, that utilizes these similarities to achieve power reduction in a smart image sensor system with embedded memory.

SRAM Architecture and Power Considerations

Modern digital systems require the capability of storing and accessing large amounts of information at high speeds. Of the different types of memories, the Static Random Access Memory (SRAM) is the most common embedded memory, due to its high speeds and relatively high density in standard fabrication processes. SRAMs are widely used in microprocessors as caches, tag arrays, register files, branch table predictors, instruction windows, etc. and occupy a significant portion of the die area. In high-performance processors, L1 and L2 caches alone occupy over half of the die area [Mamidipaka, 2004]. Accordingly, SRAMs are one of the main sources of power dissipation in modern VLSI chips, especially high-end microprocessors and SOCs. Figure 1 shows a typical block diagram of an SRAM, with emphasis on the main components and sub-blocks. The core of the SRAM is an array of identical bitcells, laid out in a very regular and repetitive structure, each bitcell storing either a ‘1’ or ‘0’ on a cross-coupled latch, and enabling read and write access. The bitcells are divided into rows and columns, allowing complete random access, through the use of X and Y addressing circuitry consisting of a row decoder and a column multiplexer. The addressing is propagated to the individual bitcell through a grid of horizontally wired wordlines and vertically wired bitlines. A particular bitcell is accessed (either read or written) when its row’s wordline and its column’s bitline are asserted simultaneously.
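The X and Y addressing described above can be illustrated with a short sketch. It assumes, purely for illustration, that a flat address is split so that its upper part selects the row (wordline) and its lower part selects the column (bitline). The function names `split_address` and `one_hot` and the array dimensions are hypothetical and do not come from the paper.

```python
def split_address(address, num_rows, num_cols):
    """Split a flat bitcell address into (row, column) indices.

    Assumption: the upper part of the address feeds the row decoder and
    the lower part feeds the column multiplexer.
    """
    if not 0 <= address < num_rows * num_cols:
        raise ValueError("address out of range")
    return divmod(address, num_cols)


def one_hot(index, width):
    """Model a decoder output: exactly one line out of `width` is asserted."""
    return [1 if i == index else 0 for i in range(width)]


# Hypothetical 4-row by 8-column array.
row, col = split_address(address=13, num_rows=4, num_cols=8)
wordlines = one_hot(row, 4)  # the asserted wordline
columns = one_hot(col, 8)    # the column selected by the multiplexer
print(row, col, wordlines, columns)
```

A bitcell is accessed only where the asserted wordline and the selected column intersect, which is what gives the array its random access property.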

Figure 1: Typical SRAM Component Block Diagram

In order to initiate either a read or a write, the read column logic and write column logic blocks are required. The read column logic block typically consists of a low-swing sense amplifier that enhances performance and reads out a digital signal from the asserted column. The write column logic block consists of a write driver that asserts the data to be written onto the relevant column. A read/write enable control signal selects which of the two blocks is activated, and the asserted wordline selects the bitcell on the selected column to be read from or written to. Additional blocks needed for SRAM operation include column precharge, internal timing, digital control blocks and biasing circuitry. The column precharge block prepares the read/write operation by setting the columns to a known state. The internal timing blocks sense various transitions in internal and external signals to initiate or terminate operation phases. The digital control blocks enable the application of advanced error correction, row/column redundancy, etc. The biasing circuitry is generally required for sense amplifier operation.

The power profile of SRAMs includes both dynamic power, consumed during read and write operations, and static power, consumed during standby (“hold”) periods. The dynamic power, as in standard logic, is a function of the supply voltage and frequency, giving the standard tradeoff between power and performance/reliability. Static power in SRAMs, on the other hand, is mainly due to unwanted parasitic leakage currents. As technology scales, leakage currents become a more dominant factor, causing the static power of SRAMs to become a major issue and one of the primary static power components of many systems. A unified active power equation is given in Equation 1 [Rabaey2003] [Itoh2001]:

PVDD I array I decode I periphery  (1) Vmimni1  nmCVfCVfI   DD act hld DEint PT int DCP where m and n are the number of columns and rows, respectively, f is the operating frequency, VDD is the general supply voltage, Vint is the internal supply voltage, iact is the effective current of the selected cells, ihld is the data retention current of inactive cells, CDE is the output node capacitance of each decoder, CPT is the total capacitance of the digital logic and periphery circuits, and IDCP is the static current of the periphery. The dynamic power of an SRAM is mainly consumed in the following areas: address decoding, bitline charging/discharging, and readout sensing. During address decoding, power is consumed both by the switching of the decoders themselves, as well as by charging and discharging the selected wordlines, which can have high capacitances. During both read and write operations, the bitlines are precharged and subsequently discharged. This is especially power consuming during writes, when the bitline is fully discharged, or when a full discharge read scheme is chosen. Sense amplifiers typically depend on bias currents for operation, consuming constant power when they are activated. The static power of an SRAM is primarily consumed through leakage currents inside the bitcells themselves during standby (hold) periods, i.e. when the particular cell (or the whole array) is not asserted. This includes subthreshold and gate leakages in both the inner cross coupled latch structure, as well as to/from the bitlines through the access transistors on unselected rows. Another large contributor to static power is from the precharge circuitry, when a constant charging scheme is used, i.e. a high-resistance supply or diode-connected transistor is placed on the bitlines to replenish lost precharge voltage. Other contributors to the static power are the leakage currents in the decoders and other blocks. 
An in-depth analysis of the power dissipation by all SRAM components can be found in [Itoh2001].

Several standard methods have been developed over the years to reduce the power consumption of SRAMs. The standard methods are based on physical partitioning of the array in each of the axes. Banked organization of SRAMs divides the array both horizontally and vertically into sub-arrays. An external decoder raises the chip select of the selected bank, reducing the dynamic power consumption, as smaller decoders are needed and less wordline and bitline capacitance is charged/discharged. The Divided Word Line (DWL) approach divides the array horizontally, propagating the decoder output on a global wordline, and subsequently raising the local wordline of a partition of columns, reducing the overall capacitance charged, and requiring smaller wordline drivers. Partitioning the columns using the Divided Bitline scheme, with partial multiplexing inside the array, reduces the bitline capacitance and, in certain sensing schemes, will reduce the power consumption. All of these solutions come at the expense of additional area overhead, but a good tradeoff can achieve a worthwhile reduction of power consumption as well as an improvement in performance.

Using advanced timing and sensing schemes is another standard method to achieve a substantial dynamic power reduction. Using pulsed word lines and/or reduced bitline voltage swings results in less discharge during read cycles, but is accompanied by complex design considerations and higher sensitivity to process variations. Timing the activation of sense amplifiers limits biasing currents to be present only during the exact times that the sensing is carried out. Additional low static power sense amplifiers, such as Differential Charge Amplifiers and Self Latching Sense Amplifiers, also achieve static power reduction.

Many schemes have been proposed to reduce the bitcell leakage power, such as Supply Voltage Gating [Powell2000] [Flautner2002], Reversed Body Biasing (RBB) [Nii1998] [Hanson2003], Dynamic Voltage Scaling [Kim2002] and Negative Word Line (NWL) application [Wang 2007]. Recently, many proposals have shown minimum energy point operation of SRAMs in the subthreshold or near-subthreshold region. Examples of these include various works by Chandrakasan and Calhoun et al. [Chandrakasan2007] [Chandrakasan2008] [Calhoun2007].
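The effect of the Divided Word Line approach described earlier in this section can be sketched with a first-order capacitance model. This model is an assumption made here for illustration, not the paper's analysis: each column is taken to contribute a cell (access transistor gate) capacitance and a wire capacitance to a wordline, the global wordline is taken to see only wire capacitance plus one local driver input per partition, and the energy drawn from the supply per wordline charge is taken as C·V². All names and values are hypothetical.

```python
def flat_wordline_energy(columns, c_cell, c_wire, v_dd):
    """Energy to charge one undivided wordline spanning all columns."""
    c_total = columns * (c_cell + c_wire)
    return c_total * v_dd ** 2


def dwl_wordline_energy(columns, partitions, c_cell, c_wire, c_driver, v_dd):
    """Energy per access with a Divided Word Line (first-order model).

    Assumes `columns` is divisible by `partitions`.
    """
    # Global wordline: wire across the array plus one driver input per partition.
    c_global = columns * c_wire + partitions * c_driver
    # Only the local wordline of the selected partition is raised.
    c_local = (columns // partitions) * (c_cell + c_wire)
    return (c_global + c_local) * v_dd ** 2


# Hypothetical capacitances in arbitrary units, 1 V supply.
flat = flat_wordline_energy(columns=128, c_cell=4, c_wire=1, v_dd=1)
dwl = dwl_wordline_energy(columns=128, partitions=4, c_cell=4, c_wire=1,
                          c_driver=2, v_dd=1)
print(flat, dwl, dwl / flat)
```

In this model the saving comes from the cell capacitance of the unselected partitions, which is no longer switched. The local drivers behind the `c_driver` term also occupy silicon, consistent with the area overhead noted above.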

CMOS Image Sensor Architecture and Power Considerations

Traditionally, digital image sensors were fabricated in Charge Coupled Device (CCD) technology, but the integration of image sensors into more and more products made the Active Pixel Sensor (APS) an attractive solution. This image sensor architecture is implemented in standard CMOS technology processes, and provides significant advantages over CCD imagers in terms of power consumption, low voltage operation, and monolithic integration. With the rising popularity of portable, battery-operated devices that require high-density, ultra-low-power image sensors, the CMOS alternative has become very widespread. In addition, CMOS technology allows for the fabrication of so-called “smart” image sensors that integrate analog and/or digital signal processing onto the same substrate as the imager and its digital interface. Low power smart image sensors are very useful in a variety of applications, such as space, automotive, medical, security, industrial and others [Fish2007].

CMOS image sensors generally operate in one of two modes: rolling shutter or global shutter (snapshot) mode. When rolling shutter mode is used, each row of pixels is initiated for image capture separately in a serial fashion. This creates a slight delay between adjacent rows, resulting in image distortion in cases of relative motion between the imager and the scene. With the global shutter technique, the image is captured simultaneously by all pixels, after which the exposure is stopped, and the data is stored in-pixel while the image is read out. The operation of both techniques can be divided into three stages: Reset, Phototransduction and Readout. During the Reset stage, an initial voltage is set on the photodiode capacitance that constitutes most of the pixel area. Subsequently, the pixel enters the Phototransduction stage, during which the incident illumination causes the capacitance to discharge throughout a constant integration time.
Readout commences at the end of the integration time, when the final value of the pixel is read out and converted to a digital value. Figure 2 shows a component block diagram of a generic smart CMOS APS based image sensor. The core of the image sensor is a pixel array, with each pixel generally consisting of a photodiode, in-pixel amplification, a selection scheme and a reset scheme. A full description of the operation of this pixel is given by Yadid-Pecht et al. [Yadid-Pecht2004].

Some smart imagers employ more complex pixels, enabling them to perform analog image processing at the pixel level, such as A/D conversion.
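The row-to-row delay of the rolling shutter mode described above, and the distortion it causes for a moving scene, can be sketched as follows. The model is a deliberate simplification assumed here for illustration: each row starts its exposure a fixed `row_time` after the previous one, and the scene moves horizontally at a constant `speed`. The function names and values are hypothetical.

```python
def rolling_shutter_start_times(num_rows, row_time):
    """Exposure start time of each row when rows are initiated serially."""
    return [row * row_time for row in range(num_rows)]


def global_shutter_start_times(num_rows):
    """With a global shutter, all rows start their exposure together."""
    return [0.0] * num_rows


def apparent_shift(start_times, speed):
    """Displacement of a moving object as seen by each row,
    relative to the first row (constant-speed assumption)."""
    return [speed * (t - start_times[0]) for t in start_times]


rolling = rolling_shutter_start_times(num_rows=4, row_time=2.0)
print(apparent_shift(rolling, speed=3.0))                        # grows row by row
print(apparent_shift(global_shutter_start_times(4), speed=3.0))  # no skew
```

The growing shift across rows is the skew that appears when there is relative motion between the imager and the scene. With a global shutter every row sees the same instant, at the cost of storing the data in-pixel during readout.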

Figure 2: Generic Smart Image Sensor Component Block Diagram

Access to the pixels is carried out through the row selection block. This is usually made up of a shift register, as serial access is commonly employed, although in certain applications, a digital decoder is preferred. An entire row is generally accessed simultaneously for both reset and readout operations, except in applications where random access is required, such as tracking window systems. Several blocks are required at every column for the parallel operation of an entire pixel row. These include Sample and Hold (S/H) circuits, Correlated Double Sampling (CDS) circuits and Analog to Digital Converters (ADC). The S/H circuitry generally measures the reset level of the pixels to enable the CDS to remove Fixed Pattern Noise (FPN). Column-wise ADCs are only one option; the others are an in-pixel ADC or a single ADC per imager. The scheme is selected according to the tradeoffs among area, power, speed and precision. Additional blocks that are required in the periphery of the imager include the general Biasing Circuitry and Bandgap References, which create biasing currents for the in-pixel signal amplifiers (usually implemented through a Source Follower (SF) scheme) and for the ADCs, and the Digital Timing and Digital Control blocks, which produce the proper sequencing of the addresses, ADC timing, etc.

The sensitivity of a digital image sensor is usually proportional to the area of the photodiode and the resolution is set by the number of pixels. This results in a relatively large area covered by the image sensor, compared to other on-chip circuits, and accordingly, a large percentage of the overall power consumption. The contribution of different image sensor components to overall power dissipation may vary significantly from system to system.
For example, pixel array power dissipation can vary from a number of µWatts for a small array employing 3 transistor APS architecture to hundreds of mWatts for large format "smart" imagers employing in-pixel analog or digital processing. The power dissipation of the pixel array of a generic “smart” image sensor can be given by Equation 2:

$$P_{Array} = F_R \cdot N \cdot M \cdot \left(E_{reset} + E_{read\_out} + E_{analog} + E_{digital}\right) + N \cdot M \cdot P_{leakage} \qquad (2)$$

where F_R is the imager’s frame rate, N and M are the number of rows and columns, respectively, E_reset is the energy required for pixel reset, E_read_out is the energy dissipated during signal readout during one frame, E_analog and E_digital are the energy components dissipated by in-pixel analog and/or digital processing during one frame, and P_leakage is the in-pixel leakage power. The dynamic power in the above equation is proportional to the frame rate and is composed of the energy required to refill the photodiode capacitance during reset, the power dissipated through column-wise biasing currents during readout, and additional energy consumed by (optional) in-pixel functionality. The static power is due to leakage through the reset and row selection switches during integration and standby periods. These leakages also degrade the performance and precision of the imager.

The row selection block can be another major source of power dissipation, depending on the size of the array and the method of operation. In both global and rolling shutter modes, the row reset and row selection capacitances are periodically charged at a rate proportional to the frame rate. In window tracking applications, on the other hand, the power of the row (and column) selection blocks can be dominated by leakage power, as the majority of the rows/columns may not be asserted for long periods.

The other primary source of power dissipation is the analog circuitry, including the ADCs, S/H, CDS and biasing circuitry. Optimally, these are timed to consume power only during their precise periods of operation, but they generally have a high power profile. The analog peripheral blocks present a constant tradeoff between speed, noise immunity and precision on the one hand and power consumption and area on the other, so for low power systems these blocks need to be chosen cautiously. Additional power is dissipated in the digital timing and control blocks; however, the complexity and frequency of these tend to be lower than in standard digital circuits, and so most common power reduction techniques can be implemented on these blocks.
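As a rough illustration of how Equation 2 splits into a frame-rate-dependent dynamic term and a static leakage term, the following minimal Python sketch evaluates both. The function name, the argument names and all numeric values are hypothetical and chosen only to exercise the formula; the energies are assumed to be expressed per pixel per frame and the leakage power per pixel.

```python
def pixel_array_power(frame_rate_hz, n_rows, m_cols,
                      e_reset, e_read_out, e_analog=0.0, e_digital=0.0,
                      p_leakage=0.0):
    """Return (dynamic, static) pixel-array power in watts, following Equation 2.

    Assumption: energies are per pixel per frame [J], p_leakage is per pixel [W].
    """
    pixels = n_rows * m_cols
    # Dynamic term: scales with the frame rate and with every per-frame energy.
    dynamic = frame_rate_hz * pixels * (e_reset + e_read_out + e_analog + e_digital)
    # Static term: leakage is paid regardless of the frame rate.
    static = pixels * p_leakage
    return dynamic, static


# Hypothetical example: 480 x 640 array at 30 frames/s, no in-pixel processing.
dynamic_w, static_w = pixel_array_power(
    30.0, 480, 640, e_reset=1e-12, e_read_out=2e-12, p_leakage=1e-12)
total_w = dynamic_w + static_w
```

Halving the frame rate in this sketch halves only the dynamic term, which is why leakage can dominate in low-activity applications such as window tracking.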
An in-depth description of all the contributions to power dissipation in a smart image sensor is given by Fish et al. [Fish2008].

Image sensors provide power reduction opportunities at all the design levels, starting with the technology and device levels, through the circuit level and all the way to the architecture and algorithm level. Standard power reduction techniques, such as supply voltage reduction and technology scaling, are not always applicable to CMOS image sensors, as they are frequently accompanied by unacceptable tradeoffs. Supply voltage reduction reduces both the precision and the noise immunity of image sensors, while technology scaling generally includes side effects, such as increased leakage current and dark current, as well as reduced photoresponsivity. However, at the technology level, processes can be modified for low power image sensor fabrication, albeit at an increased cost. An example of such a process is the Silicon-on-Sapphire (SOS) process, which provides a very low power figure and enables backside illumination [Culuriciello2004].

The device and circuit levels provide several opportunities for limiting power dissipation, depending on the options and layers provided by the chosen technology. The presence of separate wells for both nMOS and pMOS transistors enables the application of body biasing on inactive rows for leakage reduction. This technique loses its effect with scaling, as the effect on a device's threshold voltage is reduced, but image sensors are generally fabricated in 90nm or older technologies, where it is still efficient. Additional devices, such as high-VT transistors and thick oxide transistors, can also be used for leakage reduction on slow buses. Another technique commonly used for leakage reduction is the serial connection of “off” transistors to utilize the “stack effect” [Narendra2001].

Smart image sensors provide many interesting opportunities for power reduction at the architectural and algorithm levels.
Depending on the functionality of the sensor, these systems can be equipped with designated blocks for eliminating unnecessary power consumption. An example of this is the tracking sensor we proposed [Teman2008], which used row and column shift registers for window definition and an analog winner-take-all circuit for motion tracking. In this system, the pixels outside the window of interest were deactivated and ADCs were used only for initial detection. The switching activity of the shift registers was also very low, further reducing the system power consumption.

Similarities between SRAMs and Image Sensors

In the previous two sections, the architectures and power profiles of two types of VLSI arrays, SRAMs and Image Sensors, were presented along with a number of examples of methods for reducing the power consumption of each. This section will deal with the similarities between these two architectures and their sources of power dissipation, arguing that low power approaches and methods developed for one type of array should be researched and adapted for the other. Clearly, the first similarity between the two architectures is the two-dimensional array based structure of m rows and n columns of identical unit cells. SRAM bitcells have been optimized over the years to produce a dense layout to fit as much memory as possible onto a given area. This is possible due to the regular patterning of the cells, allowing many exceptions in design rules. The dense layout results in reduced capacitances, provides benefits in power and performance, as well as smaller peripheral circuits for a given memory size. In the case of pixel arrays, dense layouts provide similar benefits; however, the reduction in pixel size has a negative effect on pixel sensitivity, quantum efficiency, noise figures, etc. Various approaches for an optimal pixel layout have been proposed, such as the hexagonal shaped pixel [Staples2009]. For both SRAM bitcell and imager pixel design, leakage current during idle cycles (“hold” cycles for bitcells and “integration” cycles for pixels) has become a major focus. In some cases, smart image sensors contain memory circuits in-pixel, which further deepens the similarity. Utilization of leakage reduction methods, such as multiple threshold transistors and body biasing have been presented for both types of arrays. Modern CMOS processes include designated transistors for use in SRAM bitcells, optimized for leakage reduction. Several groups are researching low voltage operation of SRAMs in the subthreshold or near-subthreshold regions of operation. 
This approach could be used for image sensors, especially for operation of in-pixel or peripheral logic, due to their reduced frequency requirements.

Random or pseudo-random access to the unit cells in both types of arrays is achieved through row and column addressing. In SRAM design, the row addressing for wordline assertion is generally achieved through a row decoder, while standard image sensors, operating in the global or rolling shutter modes, use shift registers for reset and row selection. Both types of circuits are fitted to the pitch of the rows for layout and have been extensively researched for optimal operation in terms of power, area and performance, especially because they drive large capacitances. Certain image sensors employ decoders for row addressing (such as the tracking window example, given above), while serially accessed SRAMs benefit from using a shift register. Other architectures have also been proposed, such as daisy chaining bitcells for robust digital column-wise readout [Chandrakasan2006]. SRAMs often save power and improve performance by sub-dividing the arrays into banks, local wordlines, etc. Image sensors could partially adopt similar techniques at opposing sides of the array or fully adopt them at the expense of losing several pixels, which could be compensated for through signal processing.

Column addressing, which is inherent to most SRAM designs, is used in some image sensors when random access is necessary. Column-wise operations are performed on the data of both types of arrays: SRAMs perform write-driving, precharging and readout in this fashion, while imagers perform column-wise CDS and readout. The primary noise cancellation mechanism for imagers is the CDS function, while SRAMs often employ dummy columns for timing and level comparison. Both concepts could be adapted to the other field.

Both fields employ analog blocks, necessary for performance, accuracy and functionality. SRAMs use sense amplifiers to speed up readout and reduce bitline swing, while imagers use ADCs to create a digital readout from the analog signal measured by the pixel. Both are done either column-wise or one-per array (or bank). Both require biasing currents for proper operation. Both are major power consumers and should be timed carefully to operate only when necessary. Smart image sensors sometimes use alternative readout blocks, such as Winner Take All (WTA) circuits, when binary decisions are required rather than precise level readout. Similar uses could be applied to SRAMs used by specific applications. Finally, both architectures employ digital control and timing blocks to administer their respective operating modes. Image sensors require precise timing of their reset and integration signals, as well as for CDS and ADC operation. SRAMs are often asynchronously self-timed, employing Address Transition Detectors (ATD) and other circuits to initiate precharge, read and write phases. Digital control logic maneuvers the components between operation modes, and often registers are used to latch read out signals. Careful design of these timing and control blocks can provide substantial power savings.

Table 1 summarizes the architectural similarities between SRAMs and Image Sensors:

| Designation | SRAM Component | Imager Component |
|---|---|---|
| m×n Array | Bitcells | Pixels, in-pixel memory, in-pixel ADC |
| Row Addressing | Decoder, Row Drivers, wordline | Shift Register, Row Drivers, Row Selection lines, Reset lines |
| Column Addressing | Column Multiplexer, Bitlines | Readout columns, optional column decoder/multiplexer |
| Column-wise Operation | Precharge circuits, Write Drivers, Bitlines, Column Sense Amplifier | Sample and Hold, CDS, Column ADC |
| Analog circuitry | Sense Amplifiers, Biasing Circuitry | ADC, Biasing Circuitry, Bandgap Reference |
| Timing and Control | Digital Control, Self Timing logic, ATD, Dummy Column, Error Correction | Digital Control, Digital Timing |

Table 1: Summary of Architectural Similarities between SRAM and Image Sensors

Peripheral Sharing

In the previous section, we discussed the similarities between SRAM arrays and Image Sensors. The correlation between the two types of arrays is even stronger in systems that employ both units, working in concert to achieve certain functionalities. This is often the case in smart image sensor systems that use SRAM arrays to temporarily store previously read out data or results of image processing. In such cases, the similarities provide several architectural opportunities for sharing peripherals, resulting in a reduction of both power and area, and often a performance improvement due to the inherent synchronization between the units.

Figure 3 shows two examples of peripheral sharing. Figure 3(a) shows a column-wise shared architecture. In this case, the readout columns of the pixel array are directly connected to the vertical writing and/or reading logic of the SRAM. A possible application is a smart image sensor that periodically stores spatial data in an embedded SRAM for further use or processing. In this case, the parallel readout of the image is directly routed (through the column-wise processing, such as CDS, S/H and possibly ADC circuits) to the SRAM write drivers. SRAM operation is simplified to a one-dimensional (row) access scheme, as an entire row of data is read out from the image sensor and written in parallel. This architecture saves power and area by simplifying or even eliminating the column addressing circuitry, integrating the timing and control signals of the two arrays, and even providing the opportunity to replace the SRAM’s row decoder with a much smaller and less power hungry shift register. Careful design can further reduce the digital and analog blocks by creating control signals and biases appropriate for both arrays.
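The column-wise sharing idea can be illustrated with a small behavioral model. The sketch below is only a simplified assumption of how such a system might behave, not the circuit described in the paper: a single serial row pointer (standing in for the shared shift register) selects the same row index in both arrays, and each pixel row is reduced to one bit per column by a hypothetical threshold before being written in parallel to the SRAM row. The names `SharedRowSelect` and `capture_frame` are illustrative.

```python
class SharedRowSelect:
    """Serial row pointer modeling a shift register shared by both arrays."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.active = 0  # index of the currently asserted row

    def step(self):
        # Advance the single asserted row by one, wrapping at the array edge.
        self.active = (self.active + 1) % self.n_rows
        return self.active


def capture_frame(pixel_rows, threshold):
    """Read the imager row by row and write each row into the SRAM in parallel."""
    n_rows = len(pixel_rows)
    sram = [[0] * len(pixel_rows[0]) for _ in range(n_rows)]
    selector = SharedRowSelect(n_rows)
    row = selector.active
    for _ in range(n_rows):
        # Column-wise path: every column of the selected row is converted
        # (here to a single bit) and written at once, so no column decoder
        # is needed on the SRAM side.
        sram[row] = [1 if level >= threshold else 0 for level in pixel_rows[row]]
        row = selector.step()
    return sram


# Hypothetical 2 x 2 pixel levels, normalized to the range 0..1.
stored = capture_frame([[0.1, 0.9], [0.7, 0.2]], threshold=0.5)
```

Because the same pointer addresses both arrays, the SRAM in this model never needs a row decoder or column multiplexer, which is the source of the area and power savings argued above.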

Figure 3. Two examples of peripheral sharing between SRAM and Imager arrays. (a) Column-wise peripheral sharing. (b) Row-wise peripheral sharing.

Figure 3(b) shows a possible row-wise approach to peripheral sharing. In this architecture, the two arrays are placed on a horizontal axis, enabling the distribution of row addressing signals via a mutual row selection block. One example would place a shift register in between the two arrays, asserting the reset and row selection lines of the imager in coordination with the wordlines of the SRAM, according to a predefined timing scheme. This method of SRAM addressing could be used in a serially accessed memory working in coordination with the adjacent imager. Certain applications would allow a further reduction in peripherals (saving both power and area) by integrating the column addressing blocks of the two arrays. Digital timing and control blocks could again produce common signals, and analog blocks could be designed to use similar biasing levels, further integrating the two systems.

Several other peripheral sharing architectures and techniques could be proposed, depending on the application, the relationship between the smart imager and the SRAM, and the operating profile of the system. Such peripheral sharing does not necessarily have to include complete integration between the two arrays. For example, a significant reduction in area could be achieved by using a single bandgap reference block for a standard imager and an SRAM array on the same die, even if they are independent of each other. These architectural opportunities should be taken into consideration when developing a system that uses both types of blocks, as the savings in power and area, as well as the prospect of performance enhancement, can be considerable. The following section gives a practical implementation example of such a system.

Implementation Example: An Improved Adaptive Bulk Biasing Control (AB2C) System

In the previous sections, we argued that image sensors and memory arrays have many common features and that power reduction techniques, developed for one field, could be adapted for the other. In addition, we proposed opportunities for peripheral sharing in systems, such as smart image sensors, that include both pixel arrays and embedded SRAM arrays. In this section, we will present an example of a system, developed by our group, that utilizes both approaches for power and area reduction, as well as performance optimization.

Figure 4. Architecture and basic circuits for Improved AB2C System. (a) Schematic of Smart Pixel (b) Schematic of SRAM Bitcell (c) Full system architecture

The Adaptive Bulk Biasing Control (AB2C) approach to leakage reduction in image sensors was originally proposed by Fish et al. [Fish2007]. This system took advantage of the serial row access scheme, inherent to the majority of image sensors, to apply a gradually changing body bias that reduces leakage during the long integration periods. It can be shown that the slow voltage gradient applied to the bulk of a given row requires less power and causes less spatial noise than a standard pulsed approach. The system applies the full Reverse Body Bias (RBB) to the rows farthest away from the selected (i.e. reset or readout) row, and no RBB (or potentially a performance enhancing Forward Body Bias) to the selected row.

An improved AB2C system [Teman2009] implements the original concept on a smart image sensor employing an embedded memory. Figure 4(a) shows the pixel circuit implemented in the smart image sensor, which employs an in-pixel memory bit. The serial access scheme of the smart imager includes a periodic partial readout of the pixel level, and according to the illumination level, data is written to both the in-pixel memory bit and the embedded SRAM array. After the full integration time, the final pixel level is read out along with the data stored at the associated SRAM address. This system provides opportunities for both row-wise and column-wise peripheral sharing, due to the synchronized serial operation of the image sensor with its associated SRAM addresses. A column-wise approach was chosen, as the parallel propagation of the column data to and from the SRAM proved to be denser.

Implementation of the adaptive bulk biasing approach for leakage reduction in the SRAM was enabled by the serial access operation inherent to the system. A twin-well was used to separately bias the bulks of each row of nMOS transistors, as pMOS body biasing in deep submicron technologies (a standard 90nm TSMC process was used) is inefficient. The SRAM bitcell schematic is shown in Figure 4(b). The body nodes of the nMOS transistors were connected to the AB2C circuit, driven by the row addressing shift register used to serially access the array.

The full architecture of the Improved AB2C system is shown in Figure 4(c). The column-wise setup enabled parallel writing of the image sensor readout values directly into the selected SRAM word below it, and subsequent readout of the SRAM value along with the final pixel value after integration. Similar row addressing blocks were used for both arrays, comprising shift registers for horizontally wired signals (reset, row select, wordlines) and AB2C circuits. These circuits include a network of resistors with connections to the bulks of rows between them. The resistor network is biased with a voltage applied between the active row and the opposite row (i.e. the row farthest away from the active row), thus creating a gradual voltage drop across the bulks of adjacent rows. The bias point is switched along with the row selection shift register, causing a small charge/discharge of the row bulk capacitance, with a minimal energy penalty. The row selection blocks (including AB2C circuitry) could be shared between the two arrays, pending routing options, further saving area and power.
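The gradual bias profile produced by the resistor network can be sketched numerically. The following Python model is an idealization under stated assumptions rather than the published circuit: the resistors are taken to be equal, so the bias rises linearly with distance from the active row, and the row order is assumed to wrap around so that the farthest row lies half an array away. The function name `ab2c_bulk_bias` and its parameters are hypothetical.

```python
def ab2c_bulk_bias(n_rows, active_row, v_rbb_max):
    """Return the reverse body bias magnitude [V] applied to each row.

    Assumptions: equal resistors (linear gradient) and wrap-around row order,
    so the row opposite the active one receives the full bias v_rbb_max.
    """
    if n_rows < 2:
        raise ValueError("at least two rows are needed to form a gradient")
    half = n_rows // 2  # distance from the active row to the farthest row
    biases = []
    for row in range(n_rows):
        distance = abs(row - active_row)
        distance = min(distance, n_rows - distance)  # wrap-around distance
        biases.append(v_rbb_max * distance / half)
    return biases


# Hypothetical 8-row array with row 0 active and a 0.35 V maximum reverse bias.
profile = ab2c_bulk_bias(8, active_row=0, v_rbb_max=0.35)
```

When the active row advances by one position, every entry of the profile changes by at most one step of the ladder in this model, which reflects the small charge/discharge of the row bulk capacitance noted above.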

Figure 5: Static power consumption at various body biasing levels for the presented smart image sensor with embedded SRAM employing an AB2C biasing scheme.

The static power reduction achieved with the application of the AB2C architecture is plotted in Figure 5 for the presented system, implemented in a standard TSMC 90nm CMOS process. The minimum energy point was achieved with a reverse biasing voltage of 350mV. A higher RBB results in higher power dissipation of the AB2C blocks, while a lower RBB results in more pixel/bitcell leakage power. At this point, static power is reduced by 26% compared to the same system without the biasing voltage or the AB2C power, as seen at the 0V point on the figure. This reduction improves with array size, and is even more effective in older technologies with a higher supply voltage, which are often used for image sensor implementation.

Conclusions and Further Research

A novel approach to power reduction in VLSI arrays was presented. The architectures of two types of arrays, image sensors and SRAM arrays, were described. Sources of power consumption were noted for each array type, and some common techniques for power reduction were shown. It was contended that the similarities between the array types provide many opportunities for adapting power reduction and optimization methods and techniques from one to the other. A number of architectural concepts based on peripheral sharing were suggested for systems employing both types of arrays. Finally, an example of a system that implements both approaches (method adaptation and peripheral sharing) was presented. The example showed an AB2C scheme that was originally developed for image sensors and was also implemented on an SRAM array, providing a substantial static power reduction for the entire system. The image sensor and SRAM array were connected in a column-wise scheme, further saving both area and power, while optimizing the operation process.

Bibliography

[Rabaey2003] J.M. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd Edition, Prentice Hall, 2003.

[Itoh2001] K. Itoh, VLSI Memory Chip Design, Springer-Verlag, 2001.

[Mamidipaka, 2004] M. Mamidipaka, K. Khouri, N. Dutt, M. Abadir, Analytical models for leakage power estimation of memory array structures, Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, Stockholm, Sweden, Sep. 2004.

[Flautner2002] K. Flautner, N.S. Kim, S. Martin, D. Blaauw, T. Mudge, Drowsy caches: simple techniques for reducing leakage power, Proceedings of the 29th annual international symposium on Computer architecture, Anchorage, Alaska, 2002.

[Powell 2000] M. Powell, S.H. Yang, B. Falsafi, K. Roy, T.N. Vijaykumar, Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories, Proceedings of the 2000 international symposium on Low power electronics and design, pp. 90-95, Rapallo, Italy, 2000.

[Nii1998] K. Nii, H. Makino, Y. Tujihashi, C. Morishima, Y. Hayakawa, H. Nunogami, T. Arakawa, H. Hamano, A low power SRAM using auto-backgate-controlled MT-CMOS, Proceedings of the 1998 international symposium on Low power electronics and design, pp. 293-298, Monterey, California, 1998.

[Wang2007] C.C. Wang, C.L. Lee, W.J. Lin, A 4-kb Low-Power SRAM Design with Negative Word-Line Scheme, IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, Vol. 54, No. 5, pp. 1069-1076, May 2007.

[Hanson2003] H. Hanson, M.S. Hrishikesh, V. Agarwal, S.W. Keckler, D. Burger, Static energy reduction techniques for microprocessor caches, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 11, No. 3, pp. 303-313, June 2003.

[Kim2002] N.S. Kim, K. Flautner, D. Blaauw, T. Mudge, Drowsy instruction caches: leakage power reduction using dynamic voltage scaling and cache sub-bank prediction, Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture, pp. 219-230, Istanbul, Turkey, 2002.

[Chandrakasan 2007] B.H. Calhoun, A.P. Chandrakasan, A 256-kb 65-nm Sub-threshold SRAM Design for Ultra-Low-Voltage Operation, IEEE Journal of Solid-State Circuits, Vol. 42, No. 3, pp. 680-688, March 2007.

[Chandrakasan 2008] N. Verma, A.P. Chandrakasan, A 256 kb 65 nm 8T Subthreshold SRAM Employing Sense-Amplifier Redundancy, IEEE Journal of Solid-State Circuits, pp. 141-149, January 2008.

[Calhoun2008] J. Wang, B.H. Calhoun, Techniques to Extend Canary-based Standby VDD Scaling for SRAMs to 45nm and Beyond, IEEE Journal of Solid-State Circuits, Vol. 43, No. 11, pp. 2514-2523, November 2008.

[Fish2008] A. Fish, O. Yadid-Pecht, "Considerations for Power Reduction in "Smart" CMOS Image Sensors", by Kris Iniewski, CRC Press, 2008.

[Yadid-Pecht2004] O. Yadid-Pecht and R. Etienne-Cummings, CMOS imagers: from phototransduction to image processing, Kluwer Academic Publishers, 2004.

[Culuriciello2004] E. Culurciello and A. G. Andreou, A 16x16 Silicon on Sapphire CMOS Photosensor Array With a Digital Interface For Adaptive Wavefront Correction, Proc. ISCAS, Vancouver, May 2004.

[Teman2008] A. Teman, S. Fisher, L. Sudakov, A. Fish, O. Yadid-Pecht, Autonomous CMOS image sensor for real time target detection and tracking, Proc. of ISCAS 2008, pp. 2138-2141, 2008.

[Narendra2001] S. Narendra, et al., "Scaling of Stack Effect and its Application for Leakage Reduction," Proc. of ISLPED 2001, pp. 195-200, 2001.

[Staples2009] C. J. Stapels, P. Barton, E. B. Johnson, D. K. Wehe, P. Dokhale, K. Shah, F. L. Augustine, J. F. Christian, Recent developments with CMOS SSPM photodetectors, Proceedings of the Fifth International Conference on New Developments in Photodetection, pp. 145-149, October 2009.

[Chandrakasan2006] A. Wang, B.H. Calhoun, A.P. Chandrakasan, Sub Threshold Design For Ultra Low Power Systems, Springer, 2006.

[Fish2007] A. Fish, T. Rothschild, A. Hodes, Y. Shoshan and O. Yadid-Pecht, Low Power CMOS Image Sensors Employing Adaptive Bulk Biasing Control (AB2C) Approach, Proc. IEEE International Symposium on Circuits and Systems, pp. 2834-2837, New Orleans, USA, May 2007.

Authors' Information

Adam Teman – Masters Student, The VLSI Systems Center, Ben Gurion University of the Negev, P.O. Box 653 Be'er Sheva 84105, Israel; e-mail: [email protected]
Major Fields of Scientific Research: VLSI, Digital circuit design, Low power memories and CMOS image sensors.

Dr. Orly Yadid-Pecht – iCore Professor, Director of Integrated Sensors, Intelligent Systems (ISIS), University of Calgary, 2500 University Drive N.W. Calgary, Alberta, Canada T2N1N4; e-mail: [email protected]
Major Fields of Scientific Research: VLSI, CMOS Image Sensors, Neural Networks and Image Processing.

Dr. Alexander Fish – Senior Lecturer, Head of Ultra Low Power Circuits and Systems Lab, The VLSI Systems Center, Ben Gurion University of the Negev, P.O. Box 653 Be'er Sheva 84105, Israel; e-mail: [email protected]
Major Fields of Scientific Research: Ultra-low power VLSI, Low-power image sensors, Mixed signal design.

Appendix B: Review and Classification of Gain Cell eDRAM Implementations

As appears in the proceedings of the 27th IEEE Convention of Electrical and Electronics Engineers in Israel (IEEEI 2012) [21]. Presented in Eilat, Nov. 2012.

2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel

Review and Classification of Gain Cell eDRAM Implementations

Adam Teman∗, Pascal Meinerzhagen†, Andreas Burg†, and Alexander Fish‡
∗VLSI Systems Center, Ben-Gurion University of the Negev, Be’er Sheva, Israel
†Institute of Electrical Engineering, EPFL, Lausanne, VD, 1015 Switzerland
‡Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel
Email: [email protected], pascal.meinerzhagen@epfl.ch

Abstract—With the increasing requirement of a high-density, high-performance, low-power alternative to traditional SRAM, Gain Cell (GC) embedded DRAMs have gained a renewed interest in recent years. Several industrial and academic publications have presented GC memory implementations for various target applications, including high-performance processor caches, wireless communication memories, and biomedical system storage. In this paper, we review and compare the recent publications, examining the design requirements and the implementation techniques that lead to achievement of the required design metrics of these applications.

I. INTRODUCTION

Embedded memories consume a dominant part of the overall area of ASICs and Systems-on-Chip (SoCs), and according to the 2011 International Technology Roadmap for Semiconductors (ITRS) [1] this trend will continue into the foreseeable future. Power dissipation has become the main performance limiter in modern microprocessors, and larger cache memories significantly improve micro-architectural performance and utilization of multi-core systems with only a modest increase in power [2,3]. In state-of-the-art processors, the die area devoted to cache memories is approximately 50%; however, memories occupy significant portions of lower performance systems and components, as well. The standby power of ultra-low power systems, such as biomedical implants and wireless sensor networks, is also often dominated by their embedded memories that continue to leak during long periods of system standby.

The traditional choice of embedded memories has been the 6T SRAM, as it provides high-speed read and write performance with robust static data retention. However, growing memory capacities have led to significant efforts to replace the relatively large SRAM bitcell with a smaller alternative. Concurrent read/write access is an effective method for achieving high memory bandwidth [4], but two-ported SRAMs require additional transistors to implement the unit cells, resulting in even larger area demands. In addition, the off-transistor leakage currents of SRAM cells have become one of the major power consuming components in VLSI systems, especially in standby mode. To combat power consumption, one of the most effective solutions has been found to be lowering the system supply voltage (VDD). However, depleted read and write margins, coupled with increasing process variations, limit the minimum operating voltage of SRAM arrays. Hence, an appropriate candidate for the replacement of SRAM would need to provide high density, low power, and low-voltage operation, while retaining compatibility with standard logic fabrication processes [5].

Embedded DRAMs (eDRAMs) have long been a candidate for replacement of mainstream SRAMs in nanoscale CMOS due to their small cell size and non-ratioed circuit operation. However, the conventional 1-transistor, 1-capacitor (1T-1C) eDRAM requires costly process adders, and provides limited voltage downscaling, while requiring frequent power consuming refresh operations [6–8]. A partial solution to these drawbacks is provided by logic compatible gain-cell (GC) eDRAMs [7]. While the concept of gain cells dates back to the early 1970’s, they fell into oblivion due to the predominant development of dedicated process technologies for stand-alone SRAM and DRAM chips. Only during the last decade have GC memories been discovered again as a potential alternative to SRAM due to their potential for higher density, lower power consumption, higher reliability, and 2-port functionality in advanced nodes and at low voltages. Since 2005, around 20 publications from industry and academia show innovative GC designs and array architectures, mostly aiming at replacing SRAM as high-speed caches in high-end processors. A few recent publications design and optimize GC memories for use in wireless communications systems or other fault-tolerant systems, while other work has verified the feasibility of low-voltage operation, making GC arrays good candidates for biomedical systems [2,3,5,8–21].

Gain cells are dynamic memory bitcells comprised of 2–3 standard logic transistors and optionally an additional MOSCAP or diode. The additional devices (as compared to their 1T counterparts) are used to both increase the in-cell storage capacitance, as well as amplify the readout charge flow as compared to the stored charge level, thus providing the name “gain” cells [12]. The reduced device count results in a much higher bitcell density, as compared to a standard SRAM, while the decoupled read port provides both a non-destructive read operation and two-ported functionality. Neither read nor write operations suffer from the ratioed contention between devices in a 6T SRAM, resulting in increased margins and enabling voltage scaling [15,19]. Finally, leakage power is highly reduced, as fewer devices suffer from Drain Induced Barrier Lowering (DIBL), and scaled supply voltages reduce other leakage components.

Despite these favorable features, gain cells suffer from a number of drawbacks. The primary concern is the small internal storage capacitor that results in short retention times, requiring power-hungry refresh operations. In addition, the depleted storage voltages following a long retention period result in poor read performance. These characteristics are highly dependent on process-voltage-temperature (PVT) variations, thereby requiring careful margin distribution, cell tracking, and reference voltage control [2].

In this paper, we examine the various gain cell implementation options and consider the resulting trade-offs. We review the methods for contending with the drawbacks and improving the performance of the circuits. As a result, we will discuss
978-1-4673-4681-8/12/$31.00 ©2012 IEEE

the compatibility of the existing designs to various target applications according to energy-efficiency aspects of these implementations.

II. CATEGORIZATION OF GAIN CELL ARRAYS

From the large number of recent publications on GC memories, it is possible to identify three main categories of target applications: 1) high-end processors requiring large embedded cache memories; 2) fault-tolerant systems including channel decoders for wireless communications; and 3) low-voltage low-power biomedical systems.

A. Gain Cells for High-end processors

The vast majority of recent research on GC memories is dedicated to large embedded cache memories for microprocessors [2,3,5,9,11–14,16,22,23]. In fact, GC memories are considered to be an interesting alternative to SRAM, which has been the dominant solution for cache memories for decades. This is due to GC eDRAM's higher density, increased speed, and potentially lower leakage power. Besides the obvious advantage of high integration density, the main design goals for GC memories in this application category are high speed operation and high memory bandwidth, especially for industrial players like IBM [13] and Intel [5,22], and recently also for academia [3,23]. A smaller number of research groups specify low power consumption as their primary design goal [2,14]. A recent study shows that GC memories can potentially consume less data retention power (i.e., the sum of leakage power and refresh power) than SRAM arrays (leakage power only) [3].

B. General Systems-on-Chip

Several authors are not very specific about their target applications [7,10,24], as they only mention general SoCs. However, they follow the same trend as the aforementioned processor community by proposing GC memories as a replacement for the mainstream 6T SRAM solution. For these SoC applications, the main drivers are the potential for higher density and lower power consumption than SRAM.

C. Gain Cells for Wireless Communications Systems

A small number of recently presented GC memory designs are fundamentally different from the aforementioned work, as they are specifically built and optimized for systems which require only short retention times, and, in some cases, are tolerant to a small number of hardware defects (read failures) [25]. The refresh-free GC memory used in a recently published low-density parity check (LDPC) decoder is periodically updated with new data, and therefore requires a retention time of only 20 ns [20]. Besides safely skipping power-hungry refresh cycles and designing for low retention times, the work in [8,21] also exploits the fact that wireless communications systems and other fault-tolerant systems are inherently resilient to a small number of hardware defects. In fact, by proposing memories based on multilevel GCs, the storage density of GC memories is further increased at the price of a small number of read failures which do not significantly impede the system performance [8,21].

D. Gain Cells for Biomedical Systems

While the previously described target applications require relatively high memory bandwidth, several recent GC memory publications target low-voltage low-power biomedical applications. A GC memory implemented in a mature low-leakage 180 nm CMOS process achieves low retention power through voltage scaling well below the nominal supply voltage [15]. The positive impact of supply voltage scaling on retention time for given access statistics and a given write bitline control scheme is demonstrated in [18], proposing near-threshold (NVT) operation for longer retention times and therefore lower retention power. A recent study [19] shows that the supply voltage of GC arrays can even be scaled down to the subthreshold (sub-VT) domain, while still guaranteeing robust operation and high memory availability for read and write operations.

E. Comparison of State-of-the-Art Implementations

Fig. 1 shows the bandwidth and the technology node of state-of-the-art GC memory implementations, highlighted according to target application categories. References appearing multiple times correspond to different operating modes or operating points of the same design. There is a difference of more than four orders-of-magnitude in the achieved memory bandwidth among the various implementations. GC memories designed as cache memory for processors achieve around 10 Gb/s if implemented in older technologies and over 100 Gb/s if implemented in a more advanced 65 nm node. Most memories designed for wireless communications systems or generally for SoCs still achieve bandwidths between 1 and 10 Gb/s. Only the high-density multilevel GC array has a lower bandwidth due to a slow successive approximation multilevel read operation [21]. GC memories targeted towards biomedical systems are preferably implemented in a mature, reliable 180 nm CMOS node and achieve sufficiently high bandwidths between 10 Mb/s and several 100 Mb/s at NVT or sub-VT supply voltages.

Fig. 1. Bandwidth vs. technology node of several previously published studies. (Axes: bandwidth [Gb/s], logarithmic, vs. technology node [nm]; data points grouped as high-end processors, wireless & SoC, and biomedical.)

Fig. 2 plots the retention power (i.e., the sum of refresh power and leakage power) of previously reported GC memories versus their retention time. For energy-constrained biomedical systems, long retention times of 1–10 ms are a key design goal in order to achieve low retention power of between 600 fW/bit and 10 pW/bit. The memory banks of the LDPC decoder have a nominal retention time of 1.6 µs [20], which is around four orders-of-magnitude lower than that of the arrays targeted at biomedical systems. Even though the reported power consumption of 5 µW/bit corresponds to active

power [20], it is fair to compare it to the retention power of other implementations, as data would anyway need to be refreshed at the same rate as new data is written. Interestingly, the power consumption per bit of this refresh-free eDRAM is almost seven orders-of-magnitude higher than the retention power per bit of the most efficient eDRAM implementation for biomedical systems. The retention time and retention power of GC memories for processors are in between the values for the wireless and biomedical application domains. Overall, it is clearly visible that enhancing the retention time is an efficient way to lower the retention power.

Fig. 2. Retention Power vs. Retention Time for several previously published studies. (Axes: retention power [fW/bit] vs. retention time [µs], both logarithmic; data points grouped as high-end processors, wireless, and biomedical.)

The area cost per bit (ACPB) is defined as the silicon area of the entire memory macro (including peripheral circuits), divided by the storage capacity. As opposed to the simple bitcell size metric, ACPB accounts for the area overhead of peripheral circuits and is a more suitable metric to compare different memory implementations. Moreover, we define the array efficiency as the bitcell size divided by the ACPB to normalize this metric independent of technology node. Fig. 3 shows the comparably higher ACPB of biomedical GC memories due to the use of a mature 180 nm CMOS node. However, despite their small storage capacity requirements, these implementations achieve a high array efficiency of over 0.5, by using small yet slow peripherals [15]. On the other hand, none of the GC memories targeted toward processors, wireless communications, or SoC applications achieves an array efficiency as high as 0.5, meaning that over half of the area of the macrocell is occupied by peripheral circuits.

Fig. 3. Array efficiency vs. Area Cost Per Bit for several previously published studies. (Axes: array efficiency vs. area cost per bit [µm²/bit]; data points grouped as high-end processors, wireless, SoC, and biomedical.)

III. CIRCUIT TECHNIQUES FOR TARGET APPLICATIONS

In the previous section, we examined the recently proposed GC arrays and analyzed their target systems and applications. A primary conclusion was that gain cells have been shown to be an attractive alternative to traditional SRAM arrays for large caches, ultra-low power systems, and wireless communication systems. In this section, we will take a closer look at the circuits used in these proposals, and analyze the compatibility of these techniques with their target metrics.

A. Gain-Cell Topologies

An extensive comparison between recent GC topologies is presented in Table I. The common feature for all these circuits is their reduced device count, as compared to traditional SRAM circuits. The highest device count appears in [13], comprising three transistors and a gated diode, with all other proposals made up of two [3,5,11,15,18,19,22] or three [2,8–10,12,14,20,21,24] transistors. The obvious implication of the transistor count is the bitcell size; however, the choice of the topology is application dependent, as well. The simple structure of the 2T topologies usually includes a write transistor (MW) and a read transistor (MR). MW connects the write bit line (WBL) to the storage node when the write word line (WWL) is asserted, and MR amplifies the stored signal by driving a current through the read bit line (RBL) when the read word line (RWL) is asserted. The 2T structure results in coupling effects between the control lines and storage node, which can affect the data and degrade performance. Therefore, a third device is often added, primarily to decouple RBL from the storage node and reduce RBL leakage. This option enables the designer to trade off density for enhanced performance, robustness, and/or retention time. This trade-off is quite apparent in the cache designs, as the larger capacity systems [3,5,11] prefer the 2T topology at the cost of additional hardware to retain performance. The Boosted 3T topology of [2] actually utilizes the coupling effect to extend the retention time by connecting MR to RWL rather than ground, thereby negating some of the positive voltage step inherent to the PMOS MW configurations. An interesting choice of the 2T topology was used in [19] even though the target application was a small array for ultra-low power biomedical sensors. In this case, the stacked readout path of the 3T topology proved to be too slow under sub-VT biases.

One of the basic considerations that differentiate between high-performance and low-power systems is the refresh power. Whereas high-performance systems may employ a destructive read operation with write-back, low-power systems ensure a non-destructive read and try to maintain high retention times to minimize refresh power. This is apparent in the "Main Design Metric" row of Table I, showing orders-of-magnitude difference in retention time between the two target categories.

B. Device Choices

The majority of today's CMOS process technologies provide several device choices, manipulating different oxide

TABLE I
DRIVER OPERATING MODES
(Bitcell schematics not reproduced.)

Publication | Category | Tech. Node | Techniques | Main Design Metric
[9,12,24] | High Performance Processor Caches | 0.12 µm, 0.13 µm, 65 nm PTM | Gated Diode, Footer Power Gating, Foot Driver | 400 MHz, 70 µs retention, 100 kb
[11] | High Performance Processor Caches | 0.15 µm | Multi-Level Bitlines, Hybrid open bitline architecture | 400 MHz, 100 µs retention, 1 Mb
[13] | High Performance Processor Caches | 90 nm | Gated Diode Sense Amplifier | up to 2 GHz, 110 µs retention, 40 kb
[5,22] | High Performance Processor Caches | 65 nm | RBL Clamping, Pipelined Architecture | 2 GHz, 10 µs retention, 2 Mb
[2,14] | High Performance Processor Caches | 65 nm | Boosted 3T, PVT tracking read reference feedback, Regulated WBL | 500 MHz, up to 1.25 ms ret., 64 kb
[3] | High Performance Processor Caches | 65 nm | Half Swing WBL, Stepped WWL | 667 MHz, 110 µs ret., 192 kb
[10] | General SoC | 90 nm | Forced Feedback, Write Echo Refresh | VDD=0.5 V, 180 µA ref. power, 5 MHz
[8,21] | Wireless | 90 nm | Multi Level Bitcell, PVT Replica Column | 2–50 µs retention, 1.45 µm²/bit density
[20] | Wireless | 65 nm | Refresh Free, Sequential Decoding | 32×1 kb arrays, 700 MHz, 170 ns retention
[15] | Low Power Biomedical Systems | 0.18 µm | I/O Write Transistor, Low Area Sense Buffer | VDD=0.75 V, up to 306 ms ret., 0.1–1 MHz, 662 fW/bit ret. power
[18] | Low Power Biomedical Systems | 0.18 µm | Low Area Sense Buffer | VDD=0.75 V, 3.3 ms retention, 11.9 pW/bit ret. power
[19] | Low Power Biomedical Systems | 0.18 µm | Hybrid Cell with I/O MW, Sense Buffer | VDD=400 mV, over 40 ms ret., 500 kHz

thicknesses and channel implants to create several threshold (VT) and voltage tolerance options. Careful choice of the appropriate device (PMOS/NMOS, standard/high/low VT) can provide orders-of-magnitude improvement in GC performance, as apparent in Table I. PMOS devices suffer from lower drive strength than their NMOS counterparts, but have substantially lower sub-VT and gate leakage. Since the majority of GC implementations are read access limited, PMOS devices are used in the vast majority of the proposed circuits. For most of the common process technologies, the primary cause of storage node charge loss is sub-VT leakage through MW, and therefore the ultra-low power implementations [15,19] employ a high-VT or I/O PMOS to substantially extend retention time. Gate leakage is a substantial contributor in thin oxide nodes, and so the all-PMOS 2T configuration [5] balances the sub-VT and gate leakages out of and into the storage node to improve retention time. The decoder system of [20] requires high performance with very short retention times, and therefore an all-NMOS low-VT circuit is used. Low-VT devices are used in the readout path of several other publications [3,10], to improve read performance without a large static power penalty, as the voltage drop over the read node is minimal during write and standby cycles.

An important effect caused by the device choice is the storage node coupling and charge injection. WWL access significantly modifies the initial level of the storage node, depending on several factors. A PMOS write transistor passes a weak "0", and an NMOS passes a weak "1"; therefore an underdrive (PMOS) or boosted (NMOS) access voltage of WWL is necessary to pass a full level to the storage node. However, the larger the WWL swing is, the larger the step in the direction of the deassertion at the storage node. A PMOS MW, for example, is cut off by the rising edge of WWL, resulting in both capacitive coupling and charge injection to the storage node. Therefore, the initial "0" value will always be significantly higher than ground for a PMOS MW, and the initial "1" value will be significantly lower than VDD for an NMOS device. This limits the storage node range and degrades both the readout overdrive and the retention time. Using a same-type device for MR of a 2T cell induces an additional step in the same direction during read access, further impeding the performance. A hybrid cell, mixing NMOS and PMOS devices [3,8,10,19,21], can be used to combat these effects at the expense of in-bit well separation.

C. Circuit Techniques

In addition to the choice of a circuit topology and device options, several circuit techniques have been demonstrated to further improve system performance according to the target application. One simple and efficient technique is the employment of a sense buffer in place of a standard sense amplifier (SA) in low-power systems [15,18,19]. This implementation requires a larger RBL swing, trading off speed for area and PVT sensitivity. The area trade-off is apparent in Fig. 3 as [15] shows exceptionally high area efficiency. Several

other SA configurations have been demonstrated to deal with various design challenges. The authors of [11] proposed a force feedback SA to enable operation at voltages as low as 0.5 V. Chun, et al. [3] overcome the problem of small RBL voltage swing by using a current mode SA featuring a cross-coupled PMOS latch and pseudo-PMOS diode pairs. Other SA designs used include p-type gated diodes [9,12,13], offset compensating amps [11], single-ended thyristors [20], and standard latches [5]. The most complex sensing scheme is used for Multi-Level Bitcells in [8,21]. To decipher the four data levels, a successive approximation SA is used.

Several publications [10,15,18,19] discharge WBL during non-write operations to extend the retention time, which, with a PMOS MW, is worse for a stored "0" than for a stored "1". A Write Echo Refresh technique was employed by Ichihashi, et al. [10], to further reduce the WBL="1" disturbance. In this technique, the number of "1" write-back operations during refresh is counted and oppositely biased to combat the disturbance. The authors of [2] recognized that the steady state level of a "1" and "0" is common, so they monitor this level and use it as the WBL voltage for writing a "1". This minimizes the "0" level disturbance without impeding the worst-case "1" level. For the system proposed in [3], WBL switching speed is the performance bottleneck, and therefore a half-swing WBL is employed, improving the write speed and reducing the write power.

An issue that is rarely discussed in 2T bitcell implementations is the voltage saturation of RBL during readout. Depending on the implementation of MR, readout is achieved by either charging (NMOS) or discharging (PMOS) RBL. However, once RBL crosses a threshold (depending on the current ratio of the selected bitcell and the number of off unselected cells), a steady state is reached. This phenomenon not only limits the swing available for RBL sensing, but also causes static current dissipation that is present throughout the entire read operation. This is one of the phenomena considered in the analysis of [18], resulting in an optimal choice of VDD for a low-power GC. Somasekhar, et al. [5] combat the self clamping of RBL by explicitly clamping its voltage with designated devices.

IV. CONCLUSION

In this paper, we reviewed and compared the recently proposed GC memories, categorizing them according to target applications and overviewing the characteristics that make them appropriate for these applications. A closer look into the circuit design of these arrays provided further insight into the methods used to achieve the required design metrics through the use of different bitcell topologies, device options, technology nodes, and peripheral implementations. To summarize briefly, the following best practice guidelines should be followed when designing GC arrays for future applications:

• High-VT write access transistors for long retention times and low refresh power, in conjunction with area-efficient sense buffers for high array efficiency are most suitable to meet the storage requirements of biomedical systems.
• High-speed applications should use sensitive sense amplifiers to overcome small voltage differences, and should consider the use of LVT readout transistors for improved read access.
• Frequently updating wireless communication systems can trade off high-speed access for limited retention time to achieve improved bandwidth.

ACKNOWLEDGMENT

This work was kindly supported by the Swiss National Science Foundation under the project number PP002-119057.

REFERENCES

[1] "International technology roadmap for semiconductors," 2009. [Online]. Available: http://www.itrs.net
[2] K. C. Chun et al., "A 3T gain cell embedded DRAM utilizing preferential boosting for high density and low power on-die caches," IEEE JSSC, 2011.
[3] K. Chun et al., "A 667 MHz logic-compatible embedded DRAM featuring an asymmetric 2T gain cell for high speed on-die caches," IEEE JSSC, 2012.
[4] M. Kaku et al., "An 833MHz pseudo-two-port embedded DRAM for graphics applications," in Proc. IEEE ISSCC, 2008.
[5] D. Somasekhar et al., "2 GHz 2 Mb 2T gain cell memory macro with 128 GBytes/sec bandwidth in a 65 nm logic process technology," IEEE JSSC, vol. 44, no. 1, pp. 174–185, 2009.
[6] S. Hong et al., "Low-voltage DRAM sensing scheme with offset-cancellation sense amplifier," IEEE JSSC, 2002.
[7] N. Ikeda et al., "A novel logic compatible gain cell with two transistors and one capacitor," in Proc. of Symposium on VLSI Technology, 2000, pp. 168–169.
[8] P. Meinerzhagen et al., "Design and failure analysis of logic-compatible multilevel gain-cell-based DRAM for fault-tolerant VLSI systems," in Proc. IEEE GLSVLSI, 2011.
[9] W. Luk and R. Dennard, "2T1D memory cell with voltage gain," in Proc. of IEEE Symposium on VLSIC, 2004, pp. 184–187.
[10] M. Ichihashi et al., "0.5 V asymmetric three-tr. cell (ATC) DRAM using 90nm generic CMOS logic process," in Proc. IEEE Symposium on VLSI Circuits, 2005, pp. 366–369.
[11] D. Somasekhar et al., "A 10Mbit, 15GBytes/sec bandwidth 1T DRAM chip with planar mos storage capacitor in an unmodified 150nm logic process for high-density on-chip memory applications," in Proc. of IEEE ESSCIRC, 2005, pp. 355–358.
[12] W. Luk and R. Dennard, "A novel dynamic memory cell with internal voltage gain," IEEE JSSC, vol. 40, no. 4, pp. 884–894, April 2005.
[13] W. Luk et al., "A 3-transistor DRAM cell with gated diode for enhanced speed and retention time," in Proc. IEEE Symposium on VLSI Circuits, 2006, pp. 184–185.
[14] K. C. Chun et al., "A sub-0.9V logic-compatible embedded DRAM with boosted 3T gain cell, regulated bit-line write scheme and PVT-tracking read reference bias," in Proc. IEEE Symposium on VLSI Circuits, 2009.
[15] Y. Lee et al., "A 5.4nW/kB retention power logic-compatible embedded DRAM with 2T dual-VT gain cell for low power sensing applications," in Proc. IEEE A-SSCC, 2010.
[16] W. Zhang et al., "Variation aware performance analysis of gain cell embedded DRAMs," in Proc. ACM/IEEE ISLPED, pp. 19–24.
[17] K. C. Chun et al., "Logic-compatible embedded DRAM design for memory intensive low power systems," in Proc. of IEEE ISCAS, June 2010, pp. 277–280.
[18] R. Iqbal et al., "Two-port low-power gain-cell storage array: voltage scaling and retention time," in Proc. IEEE ISCAS, 2012.
[19] P. Meinerzhagen, A. Teman et al., "A sub-VT 2T gain-cell memory for biomedical applications," in Proc. IEEE Sub-VT, Pre-Publication - 2012.
[20] Y. Park et al., "A 1.6 mm² 38-mW 1.5 Gb/s LDPC decoder enabled by refresh-free embedded DRAM," in Proc. of IEEE Symposium on VLSIC, 2012, pp. 114–115.
[21] M. Khalid, P. Meinerzhagen, and A. Burg, "Replica bit-line technique for embedded multilevel gain-cell DRAM," in Proc. of IEEE NEWCAS, June 2012.
[22] D. Somasekhar et al., "2GHz 2Mb 2T gain-cell memory macro with 128GB/s bandwidth in a 65nm logic process," in Proc. IEEE ISSCC, 2008.
[23] K. C. Chun et al., "A 1.1V, 667MHz random cycle, asymmetric 2T gain cell embedded DRAM with a 99.9 percentile retention time of 110 µsec," in Proc. IEEE Symposium on VLSIC, June 2010, pp. 191–192.
[24] M. Chang et al., "A 65nm low power 2T1D embedded DRAM with leakage current reduction," in Proc. IEEE SOCC, 2007, pp. 207–210.
[25] G. Karakonstantis, C. Roth, C. Benkeser, and A. Burg, "On the exploitation of the inherent error resilience of wireless systems under unreliable silicon," in Proc. of IEEE DAC, June 2012, pp. 510–515.


Appendix C:

Exploration of Sub-VT and Near-VT 2T Gain-Cell Memories for Ultra-Low Power Applications under Technology Scaling

As appears on pages 54-72 of vol. 3, no. 2 of the MDPI Journal of Low Power Electronics and Applications (JLPEA) [25], 2013.

J. Low Power Electron. Appl. 2013, 3, 54-72; doi:10.3390/jlpea3020054
OPEN ACCESS
Journal of Low Power Electronics and Applications, ISSN 2079-9268, www.mdpi.com/journal/jlpea
Article

Exploration of Sub-VT and Near-VT 2T Gain-Cell Memories for Ultra-Low Power Applications under Technology Scaling

Pascal Meinerzhagen 1,*, Adam Teman 2, Robert Giterman 2, Andreas Burg 1 and Alexander Fish 3

1 Institute of Electrical Engineering, École Polytechnique Fédérale de Lausanne, Station 11, Lausanne, VD 1015, Switzerland; E-Mail: andreas.burg@epfl.ch
2 VLSI Systems Center, Ben-Gurion University of the Negev, POB 653, Be’er Sheva 84105, Israel; E-Mails: [email protected] (A.T.); [email protected] (R.G.)
3 Faculty of Engineering, Bar-Ilan University, Ramat-Gan 52900, Israel; E-Mail: alexander.fi[email protected]

* Author to whom correspondence should be addressed; E-Mail: pascal.meinerzhagen@epfl.ch; Tel.: +41-21-69-31027; Fax: +41-21-69-32687.

Received: 4 February 2013; in revised form: 16 March 2013 / Accepted: 19 March 2013 / Published: 29 April 2013

Abstract: Ultra-low power applications often require several kb of embedded memory and are typically operated at the lowest possible operating voltage (VDD) to minimize both dynamic and static power consumption. Embedded memories can easily dominate the overall silicon area of these systems, and their leakage currents often dominate the total power consumption. Gain-cell based embedded DRAM arrays provide a high-density, low-leakage alternative to SRAM for such systems; however, they are typically designed for operation at nominal or only slightly scaled supply voltages. This paper presents a gain-cell array which, for the first time, targets aggressively scaled supply voltages, down into the subthreshold (sub-VT) domain. Minimum VDD design of gain-cell arrays is evaluated in light of technology scaling, considering both a mature 0.18 µm CMOS node, as well as a scaled 40 nm node. We first analyze the trade-offs that characterize the bitcell design in both nodes, arriving at a best-practice design methodology for both mature and scaled technologies. Following this analysis, we propose full gain-cell arrays for each of the nodes, operated at a minimum VDD. We find that a 0.18 µm gain-cell array can be robustly operated at a sub-VT supply voltage of 400 mV, providing read/write availability over 99% of the time, despite refresh cycles. This is demonstrated on a 2 kb array, operated at 1 MHz, exhibiting full functionality under parametric variations. As opposed to sub-VT operation at the mature node, we find that the scaled 40 nm node requires a near-threshold 600 mV supply to achieve at least 97% read/write availability due to higher leakage currents that limit the bitcell’s retention time. Monte Carlo simulations show that a 600 mV 2 kb 40 nm gain-cell array is fully functional at frequencies higher than 50 MHz.

Keywords: embedded memory; gain cell; energy efficiency; subthreshold operation; near-threshold operation; retention time; access speed; technology scaling

1. Introduction

Many ultra-low power (ULP) systems, such as biomedical sensor nodes and implants, are expected to run on a single cubic-millimeter battery charge for days or even for years, and therefore are required to operate with extremely low power budgets. Aggressive supply voltage scaling, leading to near-threshold (near-VT) or even to subthreshold (sub-VT) circuit operation, is widely used in this context to lower both active energy dissipation and leakage power consumption, albeit at the price of severely degraded on/off current ratios (Ion/Ioff) and increased sensitivity to process variations [1]. The majority of these biomedical systems require a considerable amount of embedded memory for data and instruction storage, often amounting to a dominant share of the overall silicon area and power. Typical storage capacity requirements range from several kb for low-complexity systems [2] to several tens of kb for more sophisticated systems [3].

Over the last decade, robust, low-leakage, low-power sub-VT memories have been heavily researched [4–6]. In order to guarantee reliable operation in the sub-VT domain, many new SRAM bitcells consisting of 8 [7,8], 9 [5,9], 10 [4], and up to 14 [2] transistors have been proposed. These bitcells utilize the additional devices to solve the predominant problems of write contention and bit-flips during read, and, in addition, some of the designs reduce leakage by using transistor stacks. All these state-of-the-art sub-VT memories are based on static bitcells, while the advantages and drawbacks of dynamic bitcells for operation in the sub-VT regime have not yet been studied.

Conventional 1-transistor-1-capacitor (1T-1C) embedded DRAM (eDRAM) is incompatible with standard digital CMOS technologies due to the need for high-density stacked or trench capacitors. Therefore, it cannot easily be integrated into a ULP system-on-chip (SoC) at low cost. Moreover, low-voltage operation is inhibited by the offset voltage of the required sense amplifier, unless special offset cancellation techniques are used [10]. Gain-cells are a promising alternative to SRAM and to conventional 1T-1C eDRAM, as they are both smaller than any SRAM bitcell and fully logic-compatible. Much of the previous work on gain-cell eDRAMs focuses on high-speed operation, in order to use gain-cells as a dense alternative to SRAM in on-chip processor caches [11,12], while only a few publications deal with the design of low-power near-VT gain-cell arrays [13–15]. A more detailed review of previous work in the field of gain-cell memories, including target application domains and circuit techniques, can be found in [16]. The possibility of operating gain-cell arrays in the sub-VT regime for high-density, low-leakage, and voltage-compatible data storage in ULP sub-VT systems has not been exploited yet. One of the main objections to sub-VT gain-cells is the degraded Ion/Ioff current ratio, leading to rather short data retention times compared with the achievable data access times. However, the present study shows that these current ratios are still high enough in the sub-VT regime to achieve short access and refresh cycles and high memory availability, at least down to 0.18 µm CMOS nodes.

While gain-cells are considerably smaller than robust sub-VT SRAM bitcells, they also exhibit lower leakage currents, especially in mature CMOS nodes where sub-VT conduction is the dominant leakage mechanism. Recent studies for above-VT, high-speed caches show that gain-cell arrays can even have lower retention power (leakage power plus refresh power) than SRAM (leakage power only) [17]. However, a direct power comparison between gain-cell eDRAM and SRAM is difficult and not within the scope of this paper; for example, an ultra-low power sub-VT SRAM implementation [2] employs power gating of all peripheral circuits and of the read-buffer in the bitcell, while most power reports for gain-cell eDRAMs include the overhead of peripherals. Compared with SRAM, gain-cells are naturally suitable for two-port memory implementation, which provides an advantage in terms of memory bandwidth, and enables simultaneous and independent optimization of write and read reliability. Finally, while local parametric variations directly compromise the reliability of the SRAM bitcell (write contention, and bit-flips during read), such parametric variations only impact the access and retention times of gain-cells, which is not a severe issue when targeting the typically low speed requirements of ULP applications, such as sub-VT sensor nodes or biomedical implants.
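The notion of retention power used above (leakage power plus refresh power) can be illustrated with a first-order model. The sketch below is not taken from the article: the function name, the assumption of exactly one refresh per retention period, and all numeric values are hypothetical, and serve only to show why a longer retention time lowers the retention power.

```python
def retention_power_per_bit(leakage_w, refresh_energy_j, retention_time_s):
    """First-order retention power of one dynamic bitcell, in watts.

    Assumption (not from the article): the cell is refreshed exactly once
    per retention period, so refresh power = refresh energy / retention time.
    """
    if retention_time_s <= 0:
        raise ValueError("retention_time_s must be positive")
    return leakage_w + refresh_energy_j / retention_time_s


# Hypothetical numbers, chosen only for illustration.
LEAKAGE_W = 100e-15         # 100 fW/bit static leakage
REFRESH_ENERGY_J = 10e-15   # 10 fJ/bit per refresh (read plus write-back)

short_ret = retention_power_per_bit(LEAKAGE_W, REFRESH_ENERGY_J, 1e-3)
long_ret = retention_power_per_bit(LEAKAGE_W, REFRESH_ENERGY_J, 40e-3)
print(f"1 ms retention:  {short_ret * 1e12:.2f} pW/bit")
print(f"40 ms retention: {long_ret * 1e12:.2f} pW/bit")
```

In this simplified model the refresh term scales inversely with retention time, so once the retention time is long enough the leakage term dominates and further retention gains bring diminishing returns.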

To start with, we consider sub-VT gain-cell eDRAM design in a mature 0.18 µm CMOS node, which is typically used to: (1) easily fulfill the high reliability requirements of ULP systems; (2) reach the highest energy-efficiency of such ULP systems, typically requiring low frequencies and duty cycles [18]; and (3) achieve low manufacturing costs. In a second step, we investigate the feasibility of sub-VT gain-cell eDRAMs under the aspect of technology scaling. In particular, in addition to the mature 0.18 µm CMOS node, we analyze low voltage gain-cell operation in a 40 nm CMOS technology node. We show that deep-nanoscale gain-cell arrays are still feasible, despite the reduced retention times inherent to these nodes. Due to high refresh rates, we identify that the minimum supply voltage (VDDmin) that ensures an array availability of 97% is in the near-VT domain.

1.1. Contributions:

The contributions of this work can be summarized as follows:

• We investigate the minimum achievable supply voltage for ultra-low power gain-cell operation.
• We analyze gain-cell arrays from a technology scaling perspective, examining the design trade-offs that arise due to the inherent characteristics of various technology nodes.
• For the first time, we present a fully functional gain-cell array at a deeply scaled technology node, as low as 40 nm.
• For the first time, we present a gain-cell array operated in the sub-VT domain.

J. Low Power Electron. Appl. 2013, 3 57

1.2. Outline:

The remainder of this article is organized as follows. Section 2 explains the best-practice 2T gain-cell design in light of technology scaling, emphasizing the optimum choices of the write access transistor, read access transistor, storage node capacitance, and word line underdrive voltage for different nodes. Sections 3 and 4 present detailed implementation results of a 2 kb gain-cell memory in a 0.18 µm and in a 40 nm CMOS node, respectively. Section 5 summarizes the findings of this article.

2. Two-Transistor (2T) Sub-VT Gain-Cell Design

Previously reported gain-cell topologies include either two or three transistors and an optional MOSCAP or diode [16]. While the basic two-transistor (2T) bitcell has the smallest area cost, it limits the number of cells that can connect to the same read bitline (RBL) due to leakage currents from unselected cells masking the sense current [19]. However, as many ULP systems require only small memory arrays with relatively few cells per RBL, in the following section, we consider the implementation of a 2T bitcell as a viable low-voltage option and propose a best-practice 2T bitcell design for the considered technology nodes (0.18 µm and 40 nm).

2.1. 2T Gain-Cell Implementation Alternatives

Figure 1 shows the four basic options for implementing a 2T gain-cell, allowing both the write transistor (MW) and the combined storage and read transistor (MR) to be implemented with either an NMOS or a PMOS device. These standard topologies require the following control schemes to achieve robust write and read operations. A boosted write wordline (WWL) voltage is required during write access due to VT drop across MW; above VDD for the NMOS option (VBOOST) and below VSS for the PMOS option (VNWL). For a read operation with a PMOS MR, the parasitic RBL capacitance is pre-discharged, and the read wordline (RWL) is subsequently raised. If the selected bitcell’s storage node (SN) holds a “0”, MR is conducting and charges RBL past a detectable sensing threshold. If SN holds a “1”, MR is cut off, such that RBL remains discharged below the sensing threshold. Using an NMOS transistor to implement MR provides the exact opposite operation, i.e., RBL is pre-charged and RWL is lowered to initiate a read.

In the considered 0.18 µm CMOS technology, both MW and MR can be implemented with either standard-VT core or high-VT I/O devices. In more advanced technology nodes, typically starting with the 130 nm or 90 nm node for most semiconductor foundries, several VT options become available for core devices, most commonly low-VT (LVT), standard-VT (SVT), and high-VT (HVT) devices.

One of the primary considerations for gain-cell implementation is achieving high retention time, i.e., the time it takes for the level stored on SN to deteriorate through leakage currents. In mature, above-100 nm CMOS nodes, subthreshold conduction is the dominant leakage mechanism, compromising data retention in any 2T gain-cell through the channel of MW, as shown in Figure 2(a). Therefore, the primary selection criterion for the device type of MW is to minimize subthreshold conduction. Note that subthreshold conduction of MW weakens both a logic “1” and a logic “0” level, whenever the write bitline (WBL) voltage is opposite to the SN voltage.
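The read control schemes described above can be summarized in a short sketch. The following Python function is only an illustration of the behavior stated in the text; the function name `read_2t_cell` and the returned field names are hypothetical and do not appear in the original work.

```python
# Illustrative sketch (not from the paper): read behavior of a 2T gain-cell
# for the two read-transistor (MR) options described in the text.

def read_2t_cell(mr_type: str, stored_bit: int) -> dict:
    """Summarize RBL preset, RWL edge, and RBL outcome for one read access."""
    if mr_type not in ("PMOS", "NMOS"):
        raise ValueError("mr_type must be 'PMOS' or 'NMOS'")
    if stored_bit not in (0, 1):
        raise ValueError("stored_bit must be 0 or 1")

    if mr_type == "PMOS":
        # RBL is pre-discharged and RWL is raised; a stored "0" turns MR on,
        # which charges RBL past the sensing threshold.
        conducts = stored_bit == 0
        return {
            "rbl_preset": "discharged",
            "rwl_edge": "rising",
            "mr_conducts": conducts,
            "rbl_after_read": "charged" if conducts else "discharged",
        }

    # NMOS MR: the exact opposite scheme. RBL is pre-charged and RWL is
    # lowered; a stored "1" turns MR on, which discharges RBL.
    conducts = stored_bit == 1
    return {
        "rbl_preset": "charged",
        "rwl_edge": "falling",
        "mr_conducts": conducts,
        "rbl_after_read": "discharged" if conducts else "charged",
    }


if __name__ == "__main__":
    for mr in ("PMOS", "NMOS"):
        for bit in (0, 1):
            print(mr, bit, read_2t_cell(mr, bit))
```

The sketch ignores leakage from unselected cells and sensing thresholds; it only encodes which stored level makes MR conduct for each device type.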

Figure 1. 2T gain-cell implementation options including the schematic waveforms.

[Figure 1 panels: All-PMOS Cell; Mixed PMOS-NMOS Cell; Mixed NMOS-PMOS Cell; All-NMOS Cell. Each panel shows MW, MR, SN, CSN, WBL, RBL, WWL, and RWL, together with read ‘0’ and read ‘1’ waveforms. WWL levels are labeled VDD and −VNWL for the cells with a PMOS MW, and GND and VBOOST for the cells with an NMOS MW; RWL swings between GND and VDD.]

Figure 2. Leakage components that are considered for the choice of the best-practice write and read transistor implementations, for (a) Mature CMOS nodes; and (b) Scaled CMOS nodes.

[Figure 2 panels: (a) Above-100 nm CMOS; (b) Sub-100 nm CMOS. Annotated leakage paths: subthreshold conduction (Isub), edge-direct tunneling (IEDT), gate tunneling (Igate), GIDL (IGIDL), and junction leakage (Idiff).]

In more advanced, sub-100 nm CMOS nodes, there are other significant leakage mechanisms that can compromise data integrity. (Note that in the sub-VT region, these mechanisms are still negligible, as compared with subthreshold conduction. However, as shown in Section 2.3.2, at near-VT supplies, some of the mechanisms must be considered). Only leakage components that bring charge onto the SN or take charge away from SN need to be considered in terms of retention time, while other leakage components are merely undesirable in terms of static power consumption. Figure 2(b) schematically shows the main leakage components that can compromise the stored level in sub-100 nm nodes, including reverse-biased pn-junction leakage (Idiff), gate-induced drain leakage (IGIDL), gate tunneling leakage (Igate), edge-direct tunneling current (IEDT), and subthreshold conduction (Isub).

When employing a PMOS MW, the bulk-to-drain leakages (Idiff and IGIDL) weaken a logic “0” and strengthen a logic “1”, but have the opposite impact (strengthen a logic “0” and weaken a logic “1”) when MW is implemented with an NMOS device. During standby, MW is always off and has no channel; therefore, forward gate tunneling (Igate) from the gate into the channel region and into the two diffusion areas that would occur in a turned-on MOS device is of no concern here. Only the edge-direct tunneling current, from the diffusion connected to the SN in the absence of a strongly inverted channel, compromises data integrity. When using an NMOS MW, edge-direct tunneling discharges a logic “1”, while it charges a logic “0” for a PMOS MW.

The only leakage through MR that affects the stored data level is gate tunneling. During standby, there is no channel formation in MR, no matter what the stored data level is. For example, if using an NMOS MR, both RWL and RBL are charged to VDD during standby, such that even a logic “1” level results in zero gate overdrive. In this case, both diffusion areas of MR are at the same potential as the SN, eliminating tunneling currents between the diffusions and the gate (IEDT = 0). However, tunneling might occur from the gate directly into the grounded bulk (Igate), weakening a logic “1”. If the same cell stores a logic “0”, tunneling between the gate and bulk is avoided (Igate = 0), while reverse tunneling from the diffusions (IEDT) into the gate can charge the logic “0” level. The exact opposite biasing conditions and corresponding tunneling mechanisms are found when implementing MR with a PMOS.

2.2. Best-Practice Write Transistor Implementation

2.2.1. Mature 0.18 µm CMOS Node

For the ULP sub-VT applications, long retention times that minimize the number of power-consuming refresh cycles are of much higher importance than fast write access. Therefore, low subthreshold conduction becomes the primary factor in the choice of a best-practice write transistor in the 0.18 µm node. The subthreshold conduction of NMOS and PMOS, core and I/O devices offered in this process are shown in Figure 3(a). Clearly, the I/O PMOS device has the lowest subthreshold conduction Isub (VGS = 0 V, VDS = −VDD) among all device options and across all standard process corners, leading to the longest retention time. At a 400 mV sub-VT VDD, the on-current Ion (VGS = −VDD, VDS = −VDD) of this preferred I/O PMOS device is still four orders of magnitude larger than Isub, as shown in Figure 3(b), which results in sufficiently fast write and refresh operations compared with the achievable retention time. This holds for temperatures up to 37 °C, which is considered a maximum, worst-case temperature for ULP systems that are often targeted at biomedical applications, typically attached to the human body, and hardly suffer from self-heating due to low computational complexity. Nevertheless, for temperatures as high as 125 °C, a sufficiently high Ion/Isub ratio of four orders of magnitude is still achieved at a slightly higher supply voltage of 500 mV.

Figure 4(a) shows the worst-case time dependent data deterioration after writing into a 2T gain-cell with a PMOS I/O write transistor under global and local variations. The blue (bottom) curves show the deterioration of a logic “0” level with WBL tied to VDD, and the red (top) curves show the deterioration of a logic “1” level with WBL tied to ground. The plot was simulated with a sub-VT 400 mV VDD assuming a storage node capacitance of 2.5 fF. A worst-case retention time of 40 ms can be estimated from this figure, corresponding to the minimum time at which the “0” and “1” levels intersect. It is clear that a logic “0” level decays much faster than a logic “1” level, corresponding with previous reports for the above-VT domain [11,13]. In fact, the decay of a “1” level is self-limited due to the steady increase of the reverse gate overdrive (VGS,MW = VDD − VSN) and the increasing body effect (VBS,MW = VDD − VSN) of MW with progressing decay. Both of these effects suppress the device’s leakage. Furthermore, the charge injection (CI) and clock feedthrough (CF) that occur at the end of a write access (when MW is turned off) cause the SN voltage level to rise, strengthening a “1” and weakening a “0” level [16,20]. Therefore, careful consideration must be given to the initial state of the “0” level following a write access, as will be discussed in Section 2.4.
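To give a feeling for the magnitudes involved, the relation between storage capacitance, tolerable level drift, and leakage can be approximated to first order as t_ret ≈ CSN · ΔV / Ileak. The sketch below is not the simulation-based estimation used for Figure 4(a); it assumes a constant average leakage current, and the tolerable drift of 200 mV is an assumed value chosen purely for illustration. The function names are hypothetical.

```python
# First-order sketch (an assumption, not the paper's method): constant-current
# model relating storage capacitance, voltage drift, and retention time.

def retention_time(c_sn_farad: float, delta_v_volt: float, i_leak_amp: float) -> float:
    """Time for a constant leakage current to shift the SN voltage by delta_v."""
    if i_leak_amp <= 0:
        raise ValueError("leakage current must be positive")
    return c_sn_farad * delta_v_volt / i_leak_amp


def implied_leakage(c_sn_farad: float, delta_v_volt: float, t_ret_s: float) -> float:
    """Average leakage current implied by a given retention time."""
    if t_ret_s <= 0:
        raise ValueError("retention time must be positive")
    return c_sn_farad * delta_v_volt / t_ret_s


C_SN = 2.5e-15      # storage node capacitance from the text [F]
T_RET = 40e-3       # worst-case retention time from the text [s]
DELTA_V = 0.2       # ASSUMED tolerable drift of the "0" level [V]

i_avg = implied_leakage(C_SN, DELTA_V, T_RET)
print(f"implied average leakage: {i_avg * 1e15:.1f} fA")
```

Under these assumptions, the implied average leakage is on the order of 10 fA. The real decay is nonlinear, since the leakage of MW depends on the SN voltage, so this model only bounds the order of magnitude.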

Figure 3. (a) Subthreshold conduction of different transistor types in a 0.18 µm node; and (b) I/O PMOS Ion/Isub current ratio as a function of VDD for the typical-typical (TT) process corner at different temperatures.

[Plot (b): log(Ion/Isub) versus VDD [V] from 0 to 0.8 V, with curves for −40 °C, 0 °C, 27 °C, 37 °C, 80 °C, and 125 °C, and a marked minimum ratio.]

Figure 4. (a) Worst-case retention time estimation of 0.18 µm sub-VT gain-cell with VDD = 400 mV; (b) Best-practice gain-cell for sub-VT operation in 0.18 µm CMOS.

[Schematic (b) labels: WWL, WBL, SN, CSN, MR, RBL, RWL; I/O PMOS write transistor and core NMOS read transistor.]

2.2.2. Scaled 40 nm CMOS Node

While choosing the best device option for MW, subthreshold conduction must again be kept as small as possible, as it affects both a “1” and a “0” level. The diffusion leakage, the GIDL current, and the edge-direct tunneling current weaken one logic level, while they strengthen the other. However, all three leakage components work against the logic level that has already been weakened through CI and CF at the end of a write pulse. For example, with a PMOS MW, the logic “0” level is weakened through a positive SN voltage step when closing MW, while IGIDL, Idiff, and IEDT further pull up SN, deteriorating the stored “0”. Therefore, in order to protect the already weaker level, the optimum device selection aims at minimizing all of these leakage components. Figure 5(a) shows the leakage components of minimum sized devices provided in the 40 nm process (the LVT devices were left out of the figure for display purposes, as their leakage is significantly higher than the leakage of other devices) at a near-VT supply voltage of 600 mV. This figure clearly shows that despite the increasing significance of other leakage currents with technology scaling, Isub is still dominant at this node (some of the leakage components are not modeled for the I/O devices; however, this does not impact our analysis, as the PMOS HVT already provides the lowest total leakage). However, the advantage of using an I/O device is lost, and a more compact HVT PMOS device provides the lowest total leakage. This trend is confirmed when evaluating the leakage components of intermediate process nodes, as well, showing that the leakage benefits of using an I/O device deteriorate to the point where the area versus leakage trade-off favors the use of an HVT device at around the 65 nm node.

Leakage currents in 40 nm devices, VDD = 600 mV:

Device      Isub      Idiff     IGIDL     IEDT      Igate     IEDT
PMOS HVT    7.05E-13  2.47E-13  7.75E-21  3.18E-14  1.00E-17  6.36E-14
NMOS HVT    1.56E-12  4.83E-13  6.00E-38  1.09E-13  1.02E-16  2.18E-13
PMOS SVT    1.41E-11  3.54E-13  1.33E-19  2.76E-14  8.80E-18  5.52E-14
NMOS SVT    1.46E-11  3.45E-13  1.47E-31  9.30E-14  5.73E-21  1.87E-13
PMOS IO     1.72E-12  1.24E-14  0.00E+00  0         0         0
NMOS IO     8.08E-12  1.70E-14  0.00E+00  0         0         0
PMOS LVT    4.70E-11  1.94E-13  1.70E-22  3.25E-14  7.00E-18  6.50E-14
NMOS LVT    1.26E-10  0.00E+00  1.56E-28  9.26E-14  7.94E-16  1.85E-13

Figure 5. (a) Leakage components of various devices in the considered 40 nm node at a near-VT supply voltage of 600 mV; (b) Worst-case Ion(weak “1”)/Ioff(weak “0”) of MR, implemented with LVT, SVT, and HVT devices. Both plots were simulated under typical conditions.

[Plot (a): leakage current [A] by component (Isub, IEDT, IGIDL, Idiff, Igate) for PMOS HVT, NMOS HVT, PMOS SVT, NMOS SVT, PMOS IO, and NMOS IO. Plot (b): Ion/Ioff versus VDD [V] from 0.2 to 1.2 V for LVT, SVT, and HVT devices.]
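The device comparison above can be checked with a short calculation over the leakage values listed with Figure 5(a). The following sketch is only an illustration: the values are assumed to be in amperes (matching the axis label of Figure 5(a)), only the first five listed components (Isub, Idiff, IGIDL, IEDT, Igate) are summed, and the names `LEAKAGE_A`, `total_leakage`, and `dominant_component` are hypothetical.

```python
# Illustrative sketch: total leakage per device at VDD = 600 mV, using the
# values listed with Figure 5(a). Units are ASSUMED to be amperes.

COMPONENTS = ("Isub", "Idiff", "IGIDL", "IEDT", "Igate")

LEAKAGE_A = {
    "PMOS HVT": (7.05e-13, 2.47e-13, 7.75e-21, 3.18e-14, 1.00e-17),
    "NMOS HVT": (1.56e-12, 4.83e-13, 6.00e-38, 1.09e-13, 1.02e-16),
    "PMOS SVT": (1.41e-11, 3.54e-13, 1.33e-19, 2.76e-14, 8.80e-18),
    "NMOS SVT": (1.46e-11, 3.45e-13, 1.47e-31, 9.30e-14, 5.73e-21),
    "PMOS IO":  (1.72e-12, 1.24e-14, 0.0, 0.0, 0.0),   # some components not modeled
    "NMOS IO":  (8.08e-12, 1.70e-14, 0.0, 0.0, 0.0),   # some components not modeled
    "PMOS LVT": (4.70e-11, 1.94e-13, 1.70e-22, 3.25e-14, 7.00e-18),
    "NMOS LVT": (1.26e-10, 0.0, 1.56e-28, 9.26e-14, 7.94e-16),
}


def total_leakage(device: str) -> float:
    """Sum of the listed leakage components for one device."""
    return sum(LEAKAGE_A[device])


def dominant_component(device: str) -> str:
    """Name of the largest leakage component for one device."""
    values = LEAKAGE_A[device]
    return COMPONENTS[values.index(max(values))]


lowest = min(LEAKAGE_A, key=total_leakage)

if __name__ == "__main__":
    for name in sorted(LEAKAGE_A, key=total_leakage):
        print(f"{name}: total {total_leakage(name):.3e}, dominant {dominant_component(name)}")
    print("lowest total leakage:", lowest)
```

With these numbers, the PMOS HVT device has the lowest summed leakage and Isub is the largest component for every listed device, in line with the discussion above.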

2.3. Best-Practice Read Transistor Implementation

2.3.1. Mature 0.18 µm CMOS Node

At the onset of a read operation, capacitive coupling from RWL to SN causes a voltage step on SN [20]. Our analysis from the previous section showed that MW should be implemented with a PMOS device, resulting in a strong logic “1” and a weaker logic “0”. Therefore, it is preferable to implement MR with an NMOS transistor that employs a negative RWL transition for read assertion. The resulting temporary decrease in voltage on SN counteracts the previous effects of CI and CF, thus improving the “0” state during a read operation (effect is reversed upon deassertion of the RWL). As a side effect, this negative SN voltage step also lowers the “1” level and therefore slightly slows down the read operation; however, this level is already initially boosted due to deassertion of the WWL. An additional and perhaps more significant reason to choose an NMOS device for readout is that NMOS devices are approximately an order-of-magnitude stronger than their PMOS counterparts at sub-VT voltages. Therefore, implementing MR with an NMOS device provides a fast read access, which not only results in better performance but also is essential for ensuring high array availability. As mentioned, the considered 0.18 µm process provides only core and I/O devices, and considering the three-orders-of-magnitude higher on-current for core devices at sub-VT, the choice of an NMOS core MR is straightforward.

To summarize, the most appropriate 2T gain-cell for sub-VT operation in an above-100 nm CMOS node comprises an I/O PMOS write transistor and a core NMOS read transistor, as illustrated in Figure 4(b). The resulting hybrid NMOS/PMOS gain-cell shares the n-well on three sides between neighboring cells [19] to keep the area cost low, as discussed in Section 3.

2.3.2. Scaled 40 nm CMOS Node

When considering the best device type for scaled nodes, the large number of options presents some interesting trade-offs for the implementation of MR. The increasing gate leakage currents (Igate and IEDT) at scaled nodes could potentially present an advantage for a thick oxide I/O device due to its reduced gate tunneling. However, at low voltages, the tunneling currents are small in comparison with the subthreshold conduction through MW, as shown in Figure 5(a). In addition, Igate and IEDT actually appear in opposite directions, as the stored “0” level rises, further reducing their impact. On the other hand, the two primary considerations for the above-100 nm nodes are even more relevant at scaled nodes. The achievable retention time in the 40 nm process turns out to be approximately three orders-of-magnitude lower than that of the 0.18 µm node. Therefore, the negative step caused by RWL coupling to SN is even more important, and fast reads are essential to provide sufficient array availability, despite the high refresh rates. To further enhance the read step, layout techniques can be implemented to increase the capacitive coupling between RWL and SN.

However, when considering read access times, additional trade-offs arise. For maximum read performance, MR could be implemented with an LVT device. At the 40 nm node, an LVT NMOS provides an 8× increase in on-current at 400 mV compared with an SVT NMOS. However, as the supply voltage is increased, this benefit reduces to 3× at 600 mV. The superior on-currents of LVT devices, as compared with SVT or HVT options, come at the expense of much higher off-currents, as well as increased process variations. When choosing the read device, this trade-off must be taken into consideration, as it is mandatory to correctly differentiate between the discharged level of RBL due to a stored “1” and the depleted level due to a weak stored “0”. Furthermore, the unselected cells on the same column of a selected cell storing a “1” will start to counteract the discharge of RBL during a read, as VGS,MR = VDD − VRBL. In effect, this limits the speed and minimum discharge level of RBL, according to the drive strength of the unselected MR devices. When considering sub-VT operation in the 40 nm node, the relatively low subthreshold conduction of the SVT, HVT, and I/O devices renders the LVT the only feasible option for MR to achieve a reasonable RBL discharge time. However, as VDD is increased into the near-VT region, an SVT device provides sufficient on-current, while the higher VT and lower leakage enable better reliability under process variations, as well as improved array availability.

Figure 5(b) shows the worst-case current ratio Ion/Ioff of the NMOS read transistor MR, implemented with different device types as a function of VDD. Ion is given for a weak “1” level, estimated as the steady state high voltage of SN when tying WBL to VDD (VSN = 0.85VDD). Ioff is given for a weak “0” level, estimated at VSN = 0.4VDD, which would provide a sufficient margin to differentiate between the two levels (this is verified for the chosen implementation at the minimum feasible bias in Section 4). For supply voltages below 600 mV, the LVT device has the highest current ratio and is therefore preferred, as it provides the best achievable array availability. Likewise, the SVT device is preferred for VDD between 600 and 800 mV, while the HVT device is the best option for even higher VDD.
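The selection rule stated above can be written as a small lookup. This is a minimal sketch only: the function name `preferred_read_device` is hypothetical, and the treatment of the exact boundary values (600 mV and 800 mV assigned to SVT) is an assumption, since the text only gives the ranges.

```python
# Illustrative sketch: preferred VT option for the NMOS read transistor MR in
# the 40 nm node, following the VDD ranges stated in the text.
# Boundary handling at exactly 0.6 V and 0.8 V is an ASSUMPTION.

def preferred_read_device(vdd_volt: float) -> str:
    """Return the VT option with the highest worst-case Ion/Ioff ratio."""
    if vdd_volt <= 0:
        raise ValueError("VDD must be positive")
    if vdd_volt < 0.6:
        return "LVT"   # below 600 mV
    if vdd_volt <= 0.8:
        return "SVT"   # between 600 and 800 mV
    return "HVT"       # above 800 mV


if __name__ == "__main__":
    for vdd in (0.4, 0.6, 0.7, 1.0):
        print(f"VDD = {vdd:.1f} V -> {preferred_read_device(vdd)}")
```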

2.4. Storage Node Capacitance and WWL Underdrive Voltage

2.4.1. Mature 0.18 µm CMOS Node

To close the design of the 2T bitcell, two important design parameters must be taken into consideration. First, the storage node capacitance (CSN), primarily made up of the diffusion capacitance of MW and the gate capacitance of MR, is typically around 1 fF for minimum device sizes. However, we find that by applying layout techniques, such as metal stacking, this value can be extended by over 5×, providing a configurable design parameter. Second, to address the VT drop across MW especially affecting the write “0” operation (but also the write “1” operation in the sub-VT regime), an underdrive voltage (VNWL) needs to be applied to WWL, the magnitude of which affects the write access time and the SN voltage.

Figure 6(a) shows the storage node voltage (VSN) after a write “0” access as a function of CSN and VNWL, before and after closing MW. Figure 6(b) emphasizes the impact of CI and CF by showing the voltage step ∆V that occurs while closing MW. It is clear that any VNWL above −650 mV already results in a degraded logic “0” transfer prior to turning off MW. ∆V can be reduced by increasing CSN and by decreasing the magnitude of VNWL. Therefore, on the one hand, VNWL must be low enough to ensure a proper logic “0” transfer, while, on the other hand, it should be as high as possible to minimize ∆V. The optimum value for VNWL leading to the strongest “0” state after a completed write operation is found to be −650 mV, as shown in Figure 6(a). The optimum value for CSN is clearly the maximum displayed value of 2.5 fF.
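The two trends stated above (∆V shrinks with larger CSN and with a smaller WWL swing) follow from a simple capacitive-divider view of clock feedthrough. The sketch below is a first-order illustration only: it ignores charge injection, it assumes WWL rises from VNWL to VDD when MW is closed, and the WWL-to-SN coupling capacitance of 0.1 fF is a hypothetical value, not a figure from this work.

```python
# First-order sketch (an assumption, not the simulated result of Figure 6):
# clock feedthrough modeled as a capacitive divider between the WWL-to-SN
# coupling capacitance and the storage node capacitance.

def feedthrough_step(c_sn_farad: float, c_couple_farad: float,
                     vdd_volt: float, v_nwl_volt: float) -> float:
    """SN voltage step when WWL rises from VNWL (negative) to VDD."""
    if c_sn_farad <= 0 or c_couple_farad < 0:
        raise ValueError("capacitances must be positive")
    wwl_swing = vdd_volt - v_nwl_volt  # e.g. 0.4 V - (-0.65 V) = 1.05 V
    return c_couple_farad / (c_couple_farad + c_sn_farad) * wwl_swing


C_COUPLE = 0.1e-15  # HYPOTHETICAL WWL-to-SN coupling capacitance [F]
VDD = 0.4           # supply voltage from the text [V]

if __name__ == "__main__":
    for v_nwl in (-0.8, -0.65, -0.6):
        for c_sn in (0.5e-15, 1.5e-15, 2.5e-15):
            dv = feedthrough_step(c_sn, C_COUPLE, VDD, v_nwl)
            print(f"VNWL = {v_nwl:+.2f} V, CSN = {c_sn * 1e15:.1f} fF -> dV = {dv * 1e3:.1f} mV")
```

The printed values should not be compared with Figure 6(b) quantitatively; the model only reproduces the direction of the two dependencies.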

Figure 6. Following a write “0” operation: (a) VSN before and after closing MW, as a function of CSN and VNWL; (b) ∆V due to charge injection from MW and due to capacitive coupling from WWL to SN.

[Plot (a): VSN [mV] versus CSN [fF] from 0.5 to 2.5 fF, before and after CI & CF, for VNWL = −0.8 V, −0.65 V, and −0.6 V. Plot (b): ∆V [mV] versus CSN [fF] from 0.5 to 2.5 fF for the same VNWL values.]

2.4.2. Scaled 40 nm CMOS Node

It is clear that the storage node capacitance should always be as big as possible, regardless of the technology node. This not only results in an improved initial “0” level, as shown above, but also provides more stored charge and thus extends the retention time. A general characteristic of scaled CMOS nodes is the increased number of routing layers, which in the case of gain-cell design, can be used to build up the storage node capacitor. Here, we assume that all available metal layers can be used at no additional cost, as the memory is going to be embedded in a system-on-chip that already uses all the metal layers. Moreover, with technology scaling, the aspect ratio of metal wires changes to narrower but higher, and wires can be placed closer to each other, which is beneficial in terms of side-wall parasitic capacitance. However, much of this benefit is offset by the lower dielectric constants of the insulating materials (low-k) integrated into digital processes with technology scaling. In addition, the absolute footprint of the bitcell shrinks with technology, making it more challenging to allocate many inter-digit fingers for a high capacitance. In fact, in the considered 40 nm node, the footprint of a gain-cell containing only two core devices is so small that the minimum width and spacing rules for medium and thick metals are too large to exploit for increasing the capacitance of the SN. Therefore, our layout of the 40 nm cell is limited to 5 routing layers, and the overall SN capacitance is much lower than that achieved in the 0.18 µm node. Figure 7(a) summarizes the achievable storage node capacitance according to the number of thin metal layers provided by the two considered technology nodes. Figure 7(b) shows the 40 nm SN voltage step ∆V that occurs during the positive edge of WWL for a logic “0” transfer. 
As already observed for the 0.18 µm node, ∆V decreases with increasing SN capacitance and with decreasing WWL step size (i.e., with decreasing absolute value of the underdrive voltage, VNWL). While the charge injected from the large channel area of the selected I/O PMOS write transistor in the mature technology node results in a large voltage step severely threatening data integrity, the problem is slightly alleviated in more advanced nodes where small core transistors are preferred. The resulting voltage steps of 10 to 45 mV are rather small compared with the minimum VDD where high array availability is achieved (as will be shown in Section 4). Moreover, it is worth mentioning that strong “0” levels are transferred to SN even with the least aggressive underdrive voltage of −0.4 V (however, at the expense of write access time). Therefore, the ∆V values in Figure 7(b) also correspond to the final SN voltage right after the write access. The final choice of VNWL for the 40 nm node needs to account for the write access time, which must remain short to guarantee high array availability in a node with high leakage and short retention time (see Section 4). Therefore, Figure 7(c) shows the final VSN after CI and CF, as a function of the write pulse width. Over a large range of pulse widths as short as several ns, an underdrive voltage of −700 mV results in the strongest “0” levels, and is therefore preferred. Less underdrive, e.g., −500 mV, would result in weak “0” levels for pulse widths that are shorter than 3 ns.

Figure 7. (a) Storage node capacitance versus number of employed metal layers; (b) ∆V due to CI and CF, as a function of CSN and VNWL, for VDD = 700 mV; (c) VSN after CI and CF versus write pulse width.

[Plot (a): CSN [fF] versus number of metal layers (0 to 5) for 0.18 µm CMOS and 40 nm CMOS. Plot (b): ∆V [mV] versus CSN [fF] from 0.5 to 1 fF for VNWL = −0.8 V, −0.6 V, and −0.4 V. Plot (c): VSN after CI & CF [mV] versus write pulse width [ns] from 0 to 8 ns for VNWL = −0.9 V, −0.7 V, and −0.5 V.]

3. Macrocell Implementation in 0.18 µm CMOS

This section presents a 64 × 32 bit (2 kb) memory macro based on the previously elaborated 2T gain-cell configuration (Figure 4(b)), implemented in a bulk CMOS 0.18 µm technology. The considered VDD of 400 mV is clearly in the sub-VT regime, as VT of MW and MR are −720 mV and 430 mV, respectively. Special emphasis is put on the analysis of the reliability of sub-VT operation under parametric variations. While the address decoders and the sense buffers are built from combinational CMOS gates and operate reliably in the sub-VT domain [21], the analysis focuses on the write-ability, data retention, and read-ability of the gain-cell. All simulations assume a 1 µs write and read access time (1 MHz operation); a 3-metal SN capacitance of 2.5 fF, providing a retention time of 40 ms (according to the previously presented worst-case estimation); and a temperature of 37 °C. They also account for global and local parametric variations (1k-point Monte Carlo sampling). Figure 8 plots the distribution of the bitcell’s SN voltage at critical time points for the “0” and the “1” states. As expected, nominal 0 V and 400 mV levels are passed to SN just before the positive edge of the write pulse. CI and CF cause the internal levels to rise by 20–50 mV, resulting in a slightly degraded “0” level and an enhanced “1” level, while the distributions remain sharp. After a 40 ms retention period with a worst-case opposite WBL voltage, the distributions are spread out, but the “1” levels are still strong, while the extreme cases of the “0” levels have severely depleted, approaching 200 mV. However, the “0” and “1” levels are still well separated, and moreover, the “0” levels are improved following the falling RWL transition, resulting in a 10–20 mV decrease.

Figure 8. Distribution of the SN voltage of a logic “0” and a logic “1” at critical time points: (1) [circles] directly after a 1 µs write access (before turning off MW); (2) [squares] after turning off MW; (3) [diamonds] after a 40 ms retention period under worst-case WBL conditions; and (4) [triangles] during a read operation.

[Histogram: occurrences (0 to 1000) versus VSN [V] from −0.05 to 0.45 V, for logic ‘0’ and logic ‘1’ at the four time points: write, CI & CF, retention, and read pulse.]

To verify the read-ability of the bitcell, Figure 9 shows the distribution of the RBL voltage (VRBL) following read “0” and read “1” operations after the 40 ms retention period. In addition, the figure plots the distribution of the trip-point (VM) of the sense buffer. While read “0” is robust in any case (RBL stays precharged), read “1” is most robust if all unselected cells on the same RBL as the selected cell store “0” (see Figure 9(a)), while it becomes more critical if all unselected cells store “1” (see Figure 9(b)), thereby inhibiting the discharge of RBL through the selected cell. This worst-case scenario for a read “1” operation is illustrated in Figure 10(a). In order to make the read operation more robust, VM is shifted to a value higher than VDD/2 by appropriate transistor sizing in the sense inverter. Ultimately, the VRBL distributions for read “0” and read “1” are clearly separated, and the distribution of VM is shown to comfortably fit between them, as shown in Figure 9.

The layout of the 0.18 µm 2T gain-cell, comprising a PMOS I/O MW and an NMOS core MR, is shown in Figure 10(b). The figure presents a zoomed-in view of one bitcell (surrounded by a dashed line) as part of an array. The chosen technology requires rather large design rules for the implementation of I/O devices; however, by sharing the n-well on three sides and stacking the bitlines, a reasonable area of 4.35 µm² per bitcell is achieved. In the same node, a single-ported 6T SRAM bitcell for above-VT operation has a comparable area cost of 4.1 µm² (cell violates standard DRC rules), whereas SRAM bitcells optimized for robust operation at low voltages are clearly larger (e.g., the 14T SRAM bitcell in [2] has an area cost of 40 µm²). The depicted layout also enables metal stacking above the storage node to provide an increased SN capacitance of up to 5 fF (see Figure 7(a)).

Figure 9. Distribution of RBL voltage (VRBL) after read “1” [circles] and read “0” [diamonds] operations and distribution of the trip-point VM of the read buffer [squares], for (a) favorable and (b) unfavorable read “1” conditions.

[Histograms (a) and (b): occurrences (0 to 1000) versus VRBL [V], showing the Read 0, VM, and Read 1 distributions; the horizontal axis spans 0 to 0.4 V in (a) and 0.1 to 0.4 V in (b).]

Figure 10. 180 nm gain-cell array: (a) Worst-case for read “1” operation: all cells in the same column store data “1”; to make the read “1” operation more robust, the sense inverter is skewed, with a trip-point VM > VDD/2; (b) Zoomed-in layout.

[Schematic (a) labels: selected cell with Isense; 63 unselected cells with Idisturb; RBL connected to a skewed read buffer (VM > VDD/2) with output OUT; transfer curves of a standard inverter and the skewed inverter (OUT versus RBL, with VM and VDD marked).]

At an operating frequency of 1 MHz, a full refresh cycle of 64 rows takes approximately 128 µs. With a worst-case 40 ms retention time, the resulting availability for write and read is 99.7%. As summarized in Table 1, the average leakage power of the 2 kb array at room temperature (27 °C) is 1.95 nW, while the active refresh power of 1.68 nW is comparable, amounting to a total data retention power of 3.63 nW (or 1.7 pW/bit). This total data retention power is comparable with previous reports on low-voltage gain-cell arrays [13], given for room temperature as well.
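The availability figure quoted above follows directly from the refresh time and the retention time. The sketch below reproduces the arithmetic under the assumption that refreshing one row takes one read access followed by one write access (which matches the stated 128 µs for 64 rows at 1 µs per access); the function name `array_availability` is hypothetical.

```python
# Illustrative sketch: array availability as the fraction of each retention
# period that is NOT spent on refresh. ASSUMES one read plus one write access
# per row and one full refresh per worst-case retention period.

def array_availability(rows: int, t_write_s: float, t_read_s: float,
                       t_retention_s: float) -> float:
    """Fraction of time the array is available for external accesses."""
    if t_retention_s <= 0:
        raise ValueError("retention time must be positive")
    t_refresh = rows * (t_read_s + t_write_s)
    return 1.0 - t_refresh / t_retention_s


# Values for the 0.18 µm macro, taken from the text.
ROWS = 64
T_ACCESS = 1e-6       # 1 µs write and read access time (1 MHz operation)
T_RETENTION = 40e-3   # worst-case retention time

t_refresh = ROWS * 2 * T_ACCESS
availability = array_availability(ROWS, T_ACCESS, T_ACCESS, T_RETENTION)

print(f"full refresh: {t_refresh * 1e6:.0f} us, availability: {availability * 100:.1f}%")
```

The same expression shows why short access times matter in nodes with short retention times: the refresh time scales with the number of rows and the access time, while the denominator is set by the leakage.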

Table 1. Figures of Merit.

Technology Node                  180 nm CMOS                     40 nm LP CMOS
Number of thin metal layers      5                               5
Write Transistor                 PMOS I/O                        PMOS HVT
Read Transistor                  NMOS Core                       NMOS SVT
VDDmin                           400 mV                          600 mV
Storage Node Capacitance         1.1 fF–4.9 fF                   0.27 fF–0.72 fF
Bitcell Size                     1.12 µm × 3.89 µm (4.35 µm²)    0.77 µm × 0.42 µm (0.32 µm²)
Array Size                       64 × 32 (2 kb)                  64 × 32 (2 kb)
Write Access Time                1 µs                            3 ns
Read Access Time                 1 µs                            17 ns
Worst-Case Retention Time        40 ms                           44 µs
Leakage Power                    1.95 nW (952 fW/bit)            68.3 nW (33.4 pW/bit)
Average Active Refresh Energy    67 pJ                           21.2 pJ
Average Active Refresh Power     1.68 nW (818 fW/bit)            482 nW (235.5 pW/bit)
Average Retention Power          3.63 nW (1.7 pW/bit)            551 nW (268.9 pW/bit)
Array Availability               99.7%                           97.1%
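The power entries in Table 1 are related by simple arithmetic, which the following sketch makes explicit. It assumes that one full refresh (with the listed average refresh energy) is performed once per worst-case retention period, and that the retention power is the sum of leakage and refresh power; the names `TABLE_1` and `retention_power` are hypothetical. The results agree with the tabulated values to within rounding.

```python
# Illustrative sketch: refresh and retention power derived from Table 1.
# ASSUMES one full refresh per worst-case retention period.

TABLE_1 = {
    "180 nm": {"bits": 64 * 32, "t_ret_s": 40e-3,
               "e_refresh_j": 67e-12, "p_leak_w": 1.95e-9},
    "40 nm":  {"bits": 64 * 32, "t_ret_s": 44e-6,
               "e_refresh_j": 21.2e-12, "p_leak_w": 68.3e-9},
}


def retention_power(entry: dict) -> dict:
    """Average refresh power, total retention power, and per-bit power."""
    p_refresh = entry["e_refresh_j"] / entry["t_ret_s"]
    p_total = entry["p_leak_w"] + p_refresh
    return {
        "refresh_w": p_refresh,
        "total_w": p_total,
        "total_per_bit_w": p_total / entry["bits"],
    }


results = {node: retention_power(entry) for node, entry in TABLE_1.items()}

if __name__ == "__main__":
    for node, r in results.items():
        print(f"{node}: refresh {r['refresh_w'] * 1e9:.2f} nW, "
              f"total {r['total_w'] * 1e9:.2f} nW, "
              f"per bit {r['total_per_bit_w'] * 1e12:.2f} pW")
```

The comparison highlights that refresh power dominates in the 40 nm node, because the retention time is roughly three orders of magnitude shorter while the refresh energy is of the same order.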

4. Macrocell Implementation in 40 nm CMOS

Whereas gain-cell implementations in mature technologies have been frequently demonstrated in the recent past, 65 nm CMOS is the most scaled technology in which gain-cells have been reported to date [16]. In this section, for the first time, we present a 40 nm gain-cell implementation, and explore array sizes and the corresponding minimum operating voltages that result in sufficient array availability. As previously described, core HVT devices are more efficient than I/O devices for write transistor implementation at scaled nodes, providing similar retention times with relaxed design rules (i.e., reduced area). In addition, the multiple threshold-voltage options for core transistors provide an interesting design space for the read transistor selection, trading off on and off currents, depending on supply voltage. Two additional factors that significantly impact the design at scaled nodes are the reduced storage node capacitance, due to smaller cell area and low-k insulation materials, and severely impeded retention times, due to lower storage capacitance and increasing leakage currents. Therefore, array availability becomes a major factor in gain-cell design and supply voltage selection. For this implementation, a minimum array availability of 97% was defined. Considering a minimum array size of 1 kb (32 × 32), sufficient array availability is unattainable with the LVT MR implementation for a supply voltage lower than 500 mV, suitable for this device according to Figure 5(b). Therefore, an SVT device was considered with near-threshold supply voltages above 500 mV. Figure 11(a) shows the array availability achieved under varying supply voltages, considering array sizes from 1 kb to 4 kb. The red dashed line indicates the target availability of 97%, showing that this benchmark can be achieved with a 2 kb array at 600 mV. 
At this supply voltage, with a −700 mV underdrive write voltage, the write access time is 3 ns, and the worst-case read access time is 17 ns, while the worst-case retention time is 44 µs (see Table 1). Figure 12 shows the distribution of the time required to sense the discharged voltage of RBL during a read “1” operation following a full retention period (green bars). The red bars (read “0”) represent an incorrect readout, caused by a slow RBL discharge through leakage, such that the read access time must be shorter than the first occurrence of an incorrect read “0”. The clear separation between the two distributions shows that by setting the read access time to 17 ns, the system will be able to robustly differentiate between the two stored states.

Figure 11. 40 nm gain-cell array: (a) Array availability as a function of supply voltage and array size; (b) Zoomed-in layout.

[Figure 11(a): array availability [%] versus supply voltage [mV] from 500 mV to 900 mV, for 32x32 (1 kb), 64x32 (2 kb), and 128x32 (4 kb) arrays, with the 97% target marked.]

Figure 12. Read access time distribution for the 40 nm gain-cell implementation: RBL discharge time for correct data “1” sensing, and undesired RBL discharge time till sensing threshold through leakage for data “0”.

[Figure 12: histogram of occurrences versus read time [ns] on a logarithmic axis, for read “0” and read “1”.]
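The access-time criterion described for Figure 12 (longer than the slowest correct read “1”, shorter than the first erroneous read “0”) can be sketched as a simple window check. The sample times below are hypothetical placeholders, not data from the Monte Carlo runs in the paper, and the function names are illustrative.

```python
def read_window(read1_times_ns, read0_times_ns):
    """Return (slowest correct read-1 time, fastest erroneous read-0 time) in ns.

    Raises ValueError if the two distributions overlap, because then no
    access time separates the stored states.
    """
    slowest_read1 = max(read1_times_ns)
    fastest_read0 = min(read0_times_ns)
    if slowest_read1 >= fastest_read0:
        raise ValueError("distributions overlap: no safe read access time")
    return slowest_read1, fastest_read0


def is_safe_access_time(t_access_ns, read1_times_ns, read0_times_ns):
    """True if t_access_ns senses every read 1 and precedes every false read 0."""
    lower, upper = read_window(read1_times_ns, read0_times_ns)
    return lower <= t_access_ns < upper


# Hypothetical sample times in ns (illustration only).
read1_samples = [4.2, 6.8, 9.5, 12.1, 16.3]
read0_samples = [410.0, 950.0, 2300.0, 7800.0]

print(read_window(read1_samples, read0_samples))
print(is_safe_access_time(17.0, read1_samples, read0_samples))
```

A wide gap between the two bounds, as in the figure, leaves margin for choosing the access time just above the slowest read “1”.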

A zoomed-in layout of the 40 nm gain-cell array is shown in Figure 11(b), with a bitcell area of 0.32 µm² (surrounded by the dashed line). For comparison, a single-ported 6T SRAM bitcell in the same node has a slightly larger silicon area of 0.572 µm², while robust low-voltage SRAM cells are considerably larger (e.g., the 9T SRAM bitcell in [5] has an area cost of 1.058 µm²). As shown in Table 1, the implemented 40 nm array exhibits a leakage power of 68.3 nW, which is clearly higher than for the 0.18 µm array. Even though the active energy for refreshing the entire array is only 21.2 pJ, the required refresh power of 482 nW is again higher than for the 0.18 µm node, due to the three orders-of-magnitude lower retention time. Consequently, the total data retention power is around 150× higher in 40 nm CMOS, compared with 0.18 µm CMOS.
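The relation between refresh energy, retention time, and retention power can be checked numerically. The sketch below assumes one full-array refresh per worst-case retention period, so that the average refresh power is the refresh energy divided by the retention time; the function name is illustrative.

```python
def retention_power(leakage_w, refresh_energy_j, retention_time_s):
    """Return (average refresh power, total retention power) in watts.

    Assumes the whole array is refreshed once per retention period.
    """
    refresh_power = refresh_energy_j / retention_time_s
    return refresh_power, leakage_w + refresh_power


# Leakage power, refresh energy, and retention time taken from Table 1.
refresh_180, total_180 = retention_power(1.95e-9, 67e-12, 40e-3)
refresh_40, total_40 = retention_power(68.3e-9, 21.2e-12, 44e-6)

print(f"180 nm: refresh {refresh_180 * 1e9:.3f} nW, total {total_180 * 1e9:.3f} nW")
print(f"40 nm:  refresh {refresh_40 * 1e9:.1f} nW, total {total_40 * 1e9:.1f} nW")
print(f"ratio of total retention power: {total_40 / total_180:.0f}x")
```

Under this assumption the computed values land close to the refresh and retention powers in Table 1, and the ratio of the two totals is roughly the 150× quoted above.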

5. Conclusions

This paper investigates two-transistor sub-VT and near-VT gain-cell memories for use in ultra-low-power systems, implemented in two very different technology generations. For mature, above-100 nm CMOS nodes, the main design goals of the bitcell are long retention time and high data integrity. In the considered 0.18 µm CMOS node, a low-leakage I/O PMOS write transistor and an extended storage node capacitance ensure a retention time of at least 40 ms. At low voltages, data integrity is severely threatened by charge injection and capacitive coupling from read and write wordlines. Therefore, the positive storage-node voltage disturb at the culmination of a write operation is counteracted by a negative disturb at the onset of a read operation, which is only possible with an NMOS read transistor. Moreover, the write wordline underdrive voltage must be carefully engineered for proper level transfer at minimum voltage disturb during de-assertion. Monte Carlo simulations of an entire 2 kb memory array, operated at 1 MHz with a 400 mV sub-VT supply voltage, confirm robust write and read operations under global and local variations, as well as a minimum retention time of 40 ms leading to 99.7% availability for read and write. The total data retention power is estimated as 3.63 nW/2 kb, the leakage power and the active refresh power being comparable. The mixed gain-cell with a large I/O PMOS device has a large area cost of 4.35 µm², compared with an all-PMOS or all-NMOS solution with core devices only. In more deeply scaled technologies, such as the considered 40 nm CMOS node, subthreshold conduction is still dominant at reduced supply voltages. Gate tunneling and GIDL currents are still small, but of increasing importance, while reverse-biased pn-junction leakage and edge-direct tunneling currents are negligible. In the 40 nm node, the write transistor is best implemented with an HVT core PMOS device, which provides the lowest aggregated leakage current from the storage node, even compared with the I/O PMOS device. A write wordline underdrive voltage of −700 mV is employed to ensure strong “0” levels with a short write access time. Among various NMOS read transistor options, an SVT core device maximizes the sense current ratio between a weak “1” and a weak “0” for near-VT supply voltages (600–800 mV) where 97% array availability is achieved. Both the access times and the retention time are roughly three orders-of-magnitude shorter than in the 0.18 µm CMOS node, due to the increased leakage currents and smaller storage node capacitance. While the active refresh energy is low (21 pJ), the high refresh frequency results in high refresh power (482 nW), dominating the total data retention power (551 nW). As compared with the 0.18 µm CMOS implementation, the scaled down design provides better performance (17 ns read access and 3 ns write access), and a compact bitcell size of 0.32 µm².

To conclude, this analysis shows the feasibility of sub-VT gain-cell operation for mature process technologies and near-VT operation for a deeply scaled 40 nm process, providing a design methodology for achieving minimum VDD at these two very different nodes.

Acknowledgments

This work was kindly supported by the Swiss National Science Foundation under the project number PP002-119057. Pascal Meinerzhagen is supported by an Intel Ph.D. fellowship. The authors would like to thank Itzik Icin and Meitav Liber for their contribution to this work.

Declaration

Based on “A sub-VT 2T Gain-Cell Memory for Biomedical Applications”, by P. Meinerzhagen, A. Teman, A. Mordakhay, A. Burg, and A. Fish which appeared in the Proceedings of the IEEE 2012 Subthreshold Microelectronics Conference. ©2012 IEEE.

References

1. Sinangil, M.; Verma, N.; Chandrakasan, A. A Reconfigurable 65 nm SRAM Achieving Voltage Scalability from 0.25–1.2 V and Performance Scalability from 20 kHz–200 MHz. In Proceedings of the IEEE European Solid-State Circuits (ESSCIRC), Edinburgh, UK, 15–19 September 2008.
2. Hanson, S.; Seok, M.; Lin, Y.S.; Foo, Z.Y.; Kim, D.; Lee, Y.; Liu, N.; Sylvester, D.; Blaauw, D. A low-voltage processor for sensing applications with picowatt standby mode. IEEE J. Solid-State Circuit 2009, 44, 1145–1155.
3. Constantin, J.; Dogan, A.; Andersson, O.; Meinerzhagen, P.; Rodrigues, J.; Atienza, D.; Burg, A. TamaRISC-CS: An Ultra-Low-Power Application-Specific Processor for Compressed Sensing. In Proceedings of IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), Santa Cruz, CA, USA, 7–10 October 2012.
4. Calhoun, B.H.; Chandrakasan, A.P. A 256-kb 65-nm sub-threshold SRAM design for ultra-low-voltage operation. IEEE J. Solid-State Circuit 2007, 42, 680–688.
5. Teman, A.; Pergament, L.; Cohen, O.; Fish, A. A 250 mV 8 kb 40 nm ultra-low power 9T supply feedback SRAM (SF-SRAM). IEEE J. Solid-State Circuit 2011, 46, 2713–2726.
6. Meinerzhagen, P.; Andersson, O.; Mohammadi, B.; Sherazi, Y.; Burg, A.; Rodrigues, J. A 500fW/Bit 14fJ/Bit-Access 4 kb Standard-Cell Based Sub-Vt Memory in 65 nm CMOS. In Proceedings of the IEEE European Solid-State Circuits (ESSCIRC), Bordeaux, France, 17–21 September 2012.
7. Chiu, Y.W.; Lin, J.Y.; Tu, M.H.; Jou, S.J.; Chuang, C.T. 8T Single-Ended Sub-Threshold SRAM with Cross-Point Data-Aware Write Operation. In Proceedings of the IEEE International Symposium on Low Power Electronics and Design (ISLPED), Fukuoka, Japan, 1–3 August 2011.
8. Sinangil, M.; Verma, N.; Chandrakasan, A. A reconfigurable 8T ultra-dynamic voltage scalable (U-DVS) SRAM in 65 nm CMOS. IEEE J. Solid-State Circuit 2009, 44, 3163–3173.
9. Teman, A.; Mordakhay, A.; Fish, A. Functionality and stability analysis of a 400 mV quasi-static RAM (QSRAM) bitcell. Microelectron. J. 2013, 44, 236–247.
10. Hong, S.; Kim, S.; Wee, J.K.; Lee, S. Low-voltage DRAM sensing scheme with offset-cancellation sense amplifier. IEEE J. Solid-State Circuit 2002, 37, 1356–1360.
11. Chun, K.C.; Jain, P.; Lee, J.H.; Kim, C. A 3T gain cell embedded DRAM utilizing preferential boosting for high density and low power on-die caches. IEEE J. Solid-State Circuit 2011, 46, 1495–1505.
12. Somasekhar, D.; Ye, Y.; Aseron, P.; Lu, S.L.; Khellah, M.; Howard, J.; Ruhl, G.; Karnik, T.; Borkar, S.; De, V.K.; Keshavarzi, A. 2 GHz 2 Mb 2T Gain-Cell Memory Macro with 128 GB/s Bandwidth in a 65 nm Logic Process. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 3–7 February 2008.
13. Lee, Y.; Chen, M.T.; Park, J.; Sylvester, D.; Blaauw, D. A 5.42nW/kB Retention Power Logic-Compatible Embedded DRAM with 2T Dual-Vt Gain Cell for Low Power Sensing Applications. In Proceedings of the IEEE Asian Solid State Circuits Conference (A-SSCC), Beijing, China, 8–10 November 2010.
14. Chun, K.C.; Jain, P.; Kim, C. Logic-Compatible Embedded DRAM Design for Memory Intensive Low Power Systems. In Proceedings of the IEEE International Symposium on Circuits and Systems, Paris, France, 30 May–2 June 2010.
15. Iqbal, R.; Meinerzhagen, P.; Burg, A. Two-Port Low-Power Gain-Cell Storage Array: Voltage Scaling and Retention Time. In Proceedings of the IEEE International Symposium on Circuits and Systems, Seoul, Korea, 20–23 May 2012.
16. Teman, A.; Meinerzhagen, P.; Burg, A.; Fish, A. Review and Classification of Gain Cell eDRAM Implementations. In Proceedings of the IEEE Convention of Electrical & Electronics Engineers in Israel, Eilat, Israel, 14–17 November 2012.
17. Chun, K.C.; Jain, P.; Kim, T.H.; Kim, C. A 667 MHz logic-compatible embedded DRAM featuring an asymmetric 2T gain cell for high speed on-die caches. IEEE J. Solid-State Circuit 2012, 47, 547–559.
18. Seok, M.; Sylvester, D.; Blaauw, D. Optimal Technology Selection for Minimizing Energy and Variability in Low Voltage Applications. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, Bangalore, India, 11–13 August 2008.
19. Meinerzhagen, P.; Andic, O.; Treichler, J.; Burg, A. Design and Failure Analysis of Logic-Compatible Multilevel Gain-Cell-Based DRAM for Fault-Tolerant VLSI Systems. In Proceedings of the IEEE Great Lakes Symposium on VLSI, Lausanne, Switzerland, 2–4 May 2011.
20. Meinerzhagen, P.; Teman, A.; Mordakhay, A.; Burg, A.; Fish, A. A Sub-VT 2T Gain-Cell Memory for Biomedical Applications. In Proceedings of the IEEE Subthreshold Microelectronics Conference, Waltham, MA, USA, 9–10 October 2012.
21. Calhoun, B.; Wang, A.; Chandrakasan, A. Modeling and sizing for minimum energy operation in subthreshold circuits. IEEE J. Solid-State Circuit 2005, 40, 1778–1786.

© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Appendix D: A Low-Cost Low-Power Non-Volatile Memory for RFID Applications

As appears on pages 1827-1830 of the Proceedings of the 2012 IEEE International Symposium on Circuits and Systems (ISCAS 2012) [59]. Presented in Seoul, Korea, May 2012.

A Low-Cost Low-Power Non-Volatile Memory for RFID Applications

Hadar Dagan, Adam Teman, Alexander Fish
Low Power Circuits and Systems Lab (LPC&S), Ben-Gurion University of the Negev, Be’er Sheva, Israel

Evgeny Pikhay, Vladislav Dayan, Yakov Roizin
TowerJazz, Migdal HaEmek, Israel

[email protected]

Abstract—One of the main obstacles delaying a more widespread use of radio frequency identification (RFID) tags is cost. A critical element of any RFID system is a low power embedded non-volatile memory (NVM) that can be fabricated without additional masks to the core CMOS process. In this paper, we present a 256-bit re-writeable NVM array, implemented in the TowerJazz 0.18μm CMOS process using only standard logic process steps and masks. Based on the single-poly C-Flash bitcell, this array achieves an extremely low static power figure of 3.8μW during operation cycles.

I. INTRODUCTION

Radio-Frequency Identification (RFID) technology has found broader applications over recent years [1]. The application field could be broader if the RFID tags would cost less [2]. In 2001, Sarma [3] suggested a “Five-cent RFID Tag” as a goal in RFID engineering. Achieving such a low-cost tag assumes battery-free operation, low-cost chip fabrication technology, and low cost packaging and testing. It is clear that the RFID tag must be fabricated using no, or very few additional masks and process steps to the core CMOS technology.

One of the main contributors to the cost of RFID tags is the Non-Volatile Memory (NVM) used to store the identification data of the tag. One option is to incorporate a one-time programmable (OTP) memory. However, many applications require a multiple-write (MTP) functionality. The common way to implement MTP NVM is with a floating-gate (FG) double polysilicon technology. Such a technology requires many additional masks and process steps, significantly increasing the price of the tag. In recent years, several Embedded NVM solutions based on a single polysilicon layer have been proposed [4-8], including the ultra-low power consuming C-Flash bitcell [4], developed by Tower Semiconductor Ltd.

In this paper, we present a 256-bit C-Flash module, implemented in the TowerJazz core 0.18μm CMOS technology without mask adders. This module was developed for integration in a low-cost, low-power passive RFID system. The distinguishing features of the developed memory are: a record for low power consuming single Poly NVM cell area of 35μm²; low (<|5|V) voltages employed for programming and erasing the cells; ultra-low currents in all cell operation modes; and special drivers that ensure low currents in the CMOS periphery of the RFID chip.

The rest of this paper is constructed as follows: Section II presents the targeted RFID System, overviewing its general architecture and components; the architecture of the NVM Memory is presented in Section III, and each component is described thereafter. Section IV describes the system implementation. The paper is concluded in Section V.

II. TARGETED RFID SYSTEM

The NVM array described in this paper is intended for integration in a low-cost passive RFID system. The general architecture of this system is shown in Figure 1. The system’s power is supplied through an on-chip antenna that absorbs the energy of the electro-magnetic waves, transmitted by external readers (read and write units) that are not included in the scope of this paper. The Energy Harvesting unit rectifies the antenna signal and converts it to a voltage of 1.2V-1.8V that is used as the main supply voltage (VDD) of the chip. The voltage is dependent on the distance to the reader; the minimum 1.2V value corresponds to the maximum distance of the RFID chip from the reader. During Read operations, this is the only voltage required, whereas during Program and Erase operations, voltages of +5V and -5V are necessary. A pair of charge pumps is used to provide these voltages, assuming that the distance to the reader is small enough to provide the required energy. During Read operations, the charge pumps are deactivated and their rails are floated.

Figure 1: Targeted RFID System Block Diagram

The on-chip antenna is also connected to an Amplitude Shift Keying (ASK) Modem, where the high frequency signal is demodulated and transferred to the Digital Control Unit (DCU). The DCU interprets the transmitted command, and accordingly sends the Address, Data In, and Mode signals to the memory module. The Read Data Out signal is returned to the DCU following a Read command, and after processing, is transmitted back to the external unit. A digital oscillator is utilized to synchronize the various units.

III. NON-VOLATILE MEMORY DESCRIPTION

A. System Architecture

The core of the passive RFID tag, described in Section II, is the NVM module that employs the previously proposed C-Flash bitcells [4]. Each row of the memory array comprises an entire data word and is programmed, erased or read out at once. The word is selected through the digital Address signal that is propagated to a row decoder. The data is read out in parallel and sampled by a Parallel-Input Serial-Output (PISO) Read Register, and then serially transferred to the DCU. During Program cycles, the data to be written to the array is serially transferred from a DCU to the Serial-Input Parallel-Output (SIPO) Write Register, where it is used to set the voltages required to program the selected bits. During Erase cycles, the write register is reset, and the Erase operation is applied to an entire word.

In order to perform the Read, Program and Erase operations, specific biases need to be applied to the relevant bitcells. This is achieved with two voltage drivers, the Tunnel Gate (TG) Driver and the Control Gate (CG) Driver. These drivers multiplex several voltages to horizontal and vertical rails across the array according to the Mode signal, the selected Address, and the data to be written.

Figure 2: NVM Array Architecture

In the following subsections, each one of the NVM module components will be overviewed, and various design considerations will be discussed.

B. Memory Array

The memory array is comprised of 256 C-Flash bitcells. The general 3D view and schematic representation of the cell are shown in Figure 3. The bitcell comprises a pair of capacitors; a floating poly-silicon gate; a readout inverter; and an access transfer-gate. The capacitors are of a poly-silicon/gate oxide/IPW (isolated p-well) structure. All cells of the memory array reside in a deep n-well (DNW). One terminal of both capacitors is shared, thus forming the floating gate (FG), while the opposite terminals are connected to a pair of control signals, CG and TG. n+ and p+ diffusions are formed in the capacitive areas to eliminate MOS depletion effects. The capacitors are asymmetrically sized, such that the Tunnel Capacitor (connected to the TG signal) is approximately 10X smaller than the Control Capacitor (connected to the CG signal). When applying 5V signals with opposite polarities to the TG and CG signals, a potential of ~9.5V will fall on the Tunnel Capacitor, initiating F-N tunneling. For example, if CG=5V and TG=-5V, the potential at FG will be 4.54V, resulting in tunneling of electrons onto the floating gate. An opposite bias will remove these electrons.

Figure 3: C-Flash Bitcell Layout and Schematic

The FG node is the gate of a standard CMOS inverter that drives the output of the bitcell through a transmission gate to the column’s bitline. The switching threshold (VM) of the CMOS inverter is controlled by charging (programming) or removing the charge (erasing) from the FG. By applying a readout voltage to the CG capacitor that is in between the high (VM1) and low (VM0) switching thresholds of the inverter, the inverter digitally drives a high (VDD) or low (0) output state. The readout is static without the need to dynamically precharge the bitlines or employ any type of analog sensing scheme. The C-Flash cell is implemented entirely with standard process steps, including deep n-well (DNW) implants.

C. Voltage Drivers

Operation of the C-Flash cell, as described in the previous subsection, requires various biases to be applied to the TG and CG lines, according to the array’s operating mode (Read, Program or Erase). Application of these biases requires specialized voltage drivers that can output voltages as high as 5V and as low as -5V on the same control line. Design of such voltage drivers encounters several problems, such as the appearance of unwanted Gate Induced Drain Leakage (GIDL) currents, especially when a voltage of more than 5.5V is applied to a standard I/O device’s drain-to-gate nodes (VDG>5.5V). In addition, the voltage driver is required to

operate correctly in both Program/Erase modes, when the DC/DC Converters provide the high 5V/-5V biases, as well as in Read mode, when these biases are absent.

As a result, a pair of novel voltage drivers was designed to overcome these challenges. The architectures of the row-wise Tunneling Gate Driver and the column-wise Control Gate Driver are presented in Figure 4. Both drivers include global drivers that multiplex certain voltages to the entire set of row/column based End Drivers. Each End Driver receives a control signal from a logic block, according to a selector and the array’s operating mode. The TG Driver’s selector is based on the data address, whereas the CG Driver’s selector is based on the data to be written (programmed or erased) to the cell. A full description of the voltage drivers’ design will be presented separately.

Figure 4: Architecture of Voltage Drivers: (a) Tunneling Gate (Row) Driver (b) Control Gate (Col) Driver

D. Row Decoder

The NVM array employs a low-power 4-bit row decoder for address selection. The architecture of the row decoder is shown in Figure 5a. This decoder employs a pair of 3-bit dynamic tree decoders with pre-discharge, as shown in Figure 5b. The decoder design provides low power operation, by initially discharging all the rows and subsequently charging the selected row. Therefore, only one word line capacitance (CWL) is switched during each cycle. In addition, the tree-decoder structure switches a minimal number of capacitances per address change, further reducing the dynamic power consumption. This, of course, comes at the expense of a long transition time. This is acceptable in the power limited RFID system, as it is operated at relatively low frequencies. The EN signal is used to select the relevant dynamic tree decoder, or disable both of them during the Pre-Discharge (PC) cycles and when the array is not accessed.

Figure 5: (a) Row Decoder Architecture (b) Dynamic Tree Decoder Schematic

E. Access Scheme

The NVM array employs a full word program, erase, and read scheme through a pair of 16-bit registers. During program operations, the SIPO Write Register is loaded with the Data In word sent by the external reader. Accordingly, appropriate biases for programming are applied to the column-wise CG lines; the appropriate row is selected (according to the addr signal); and the word is programmed. When the mode signal is set to Erase, the Write Register is reset, causing a ‘0’ to be written to all columns. In Read mode, the selected row’s WL signal is asserted, enabling the bitcells’ transmission gates. This statically drives the bitcell digital output onto its column’s bitline. The bitline signals are fed into the 16-bit PISO Read Register and sampled. Subsequently, the register is serially read out into the DCU.

IV. IMPLEMENTATION

The 256-bit C-Flash based NVM memory was implemented in a TowerJazz 0.18μm CMOS process without any additional process steps or masks. The blocks were designed according to the architecture shown in Figure 2, using the Cadence Virtuoso IC6 environment. Each block was designed, implemented and verified through simulation with the Spectre circuit level simulator. The simulations included extreme process corners and testing possible statistical variations of the device parameters through Monte Carlo simulations. The array and peripherals were laid out according to standard logic design rules. The full NVM module layout is shown in Figure 6. The module’s functionality was verified through post-layout simulations.

Figure 6: Full Array Layout

Table 1: Figures of Merit

Technology                  TowerJazz 0.18μm
Bitcell Type                TowerJazz C-Flash
Array Size (rows x cols)    16X16 (256 bits)
Memory Array Size (mm²)     0.01mm² (154μmX68μm)
Module Size (mm²)           0.0336mm² (320μmX105μm)
Driver Voltages (V)         -5V, 1.2V-1.8V, 5V
Avg. Static Power (μW)      3.8μW

Figure 7: Functionality Waveform of the NVM Array performing a read-program-read sequence. Signals (from top to bottom): PGM, WL, SEL, ERS, CG, TG, FG, Bitline

Functionality of the NVM module is demonstrated in Figure 7. The four top graphs in the figure are: (i) the simulated row’s word line (WL) potential; (ii) the Program (PGM) signal; (iii) the Erase (ERS) signal; and (iv) the Write Data (SEL) signal. In this demonstration, the control signals initiated a read-program-read sequence, with the initial value of the selected cell set to ‘0’. Below the digital signals, the CG and TG line potentials are shown. During the read cycles, both lines are driven to 1.8V, whereas during the Program cycle, the CG line is driven to 5V and the TG line is driven to -5V. The voltage at the floating gate is at an intermediate level for reads and rises to above 4V during the program cycle. This creates the high potential across the TG capacitor that enables the F-N tunneling required to program the bit cell. The bottom wave shows the output driven onto the bitline that changes from ‘0’ to ‘1’ after the bitcell is programmed. Table 1 summarizes the major figures of merit of the NVM array.

V. CONCLUSIONS

In this paper we presented an embedded low-power NVM module for integration into low-cost passive RFID chips. The module employs the ultra-low power C-Flash memory cell that uses a CMOS inverter readout scheme and is programmed and erased through F-N tunneling. The design of the module incorporates several challenging blocks, in particular high-voltage drivers and an original access scheme. The module was implemented in the TowerJazz 0.18μm core CMOS process without any additional masks or process steps. The system was designed, simulated and taped out for fabrication. Comprehensive measurements and test results will be presented in a future work.

We would like to thank the Alpha Consortium for their funding and support of this work.

REFERENCES

[1] D.C. Wyld, "RFID 101: the next big thing for management," Management Research News, vol. 29, pp. 154-173, 2006.
[2] R. Want, "An introduction to RFID technology," Pervasive Computing, IEEE, vol. 5, pp. 25-33, 2006.
[3] S.E. Sarma, "Towards the Five-Cent Tag," MIT Auto ID Center Technical Report, MIT-AUTOID-WH-006, Nov. 2001.
[4] Y. Roizin, E. Aloni, A. Birman, V. Dayan, A. Fenigstein, D. Nahmad, E. Pikhay and D. Zfira, "C-Flash: An Ultra-Low Power Single Poly Logic NVM," in Non-Volatile Semiconductor Memory Workshop, 2008 and 2008 International Conference on Memory Technology and Design. NVSMW/ICMTD 2008. Joint, pp. 90-92, 2008.
[5] R. Barsatan, Y.M. Tsz and M. Chan, "A zero-mask one-time programmable memory array for RFID applications," in Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on, pp. 4 pp., 2006.
[6] Z.Y. Cui, M.H. Choi, Y.S. Kim, H.G. Lee, K.W. Kim and N.S. Kim, "Single poly-EEPROM with stacked MIM and n-well capacitor," Electronics Letters, vol. 45, pp. 185-186, 2009.
[7] A. Atrash, G. Cassuto, W. Chen, V. Dayan, O. Galzur, M. Gutman, A. Heiman, G. Hunsinger, D. Nahmad, A. Parag, E. Pikhay, Y. Roizin, B. Smith, A. Strum, T. Tishbi and R. Teggatz, "Zero-cost MTP high density NVM modules in a CMOS process flow," in Memory Workshop (IMW), 2010 IEEE International, pp. 1-4, 2010.
[8] S. Shukuri, S. Shimizu, N. Ajika, T. Ogura, M. Mihara, Y. Kawajiri, K. Kobayashi and M. Nakashima, "A 10k-Cycling Reliable 90nm Logic NVM "eCFlash" (Embedded CMOS Flash) Technology," in Memory Workshop (IMW), 2011 3rd IEEE International, pp. 1-2, 2011.

Appendix E: Autonomous CMOS Image Sensor For Real Time Target Detection and Tracking

As appears on pages 2138-2141 of the Proceedings of the 2008 IEEE International Symposium on Circuits and Systems (ISCAS 2008) [51]. Presented in Seattle, Washington, May 2008.

Autonomous CMOS Image Sensor for Real Time Target Detection and Tracking

Adam Teman1, Sagi Fisher1, Liby Sudakov1, Alexander Fish2, member IEEE and Orly Yadid-Pecht1,2, Fellow IEEE

1. The VLSI Systems Center, Ben-Gurion University, Beer-Sheva, Israel 2. Dept of Electrical and Computer Engineering, University of Calgary, Alberta, Canada

Abstract— An autonomous image sensor for real time target detection and tracking is presented. The sensor is based on a CMOS APS array, equipped with in-pixel functionality and integrates analog and digital components to achieve autonomous operation with minimal power dissipation. The system employs a two-phased operation flow; during the initial acquisition stage, the digital controller detects and acquires the brightest targets in the field of view within a single frame and defines windows of interest (WOI) around the center of mass coordinates of each object. Subsequently, the system moves into the analog tracking mode during which all areas outside of the WOI are entirely shut down, thus saving power to a number of orders of magnitude. In addition to its low power dissipation, the sensor features real-time operation, low fixed pattern noise, linearity and the ability to track a predefined number of targets throughout the entire field of view. A 64x64 pixel sensor array has been designed in 0.18µm CMOS technology and is operated via a 1.8V supply. The imager architecture is discussed, the circuits’ descriptions are shown and simulation results are presented.

I. INTRODUCTION

Visual tracking of salient targets in the field of view (FOV) is a very important operation in many applications, such as machine vision, eye tracking, star tracking, navigation, teleconferencing, robotics and many others [1]-[7]. To accomplish real time operation a large amount of information needs to be processed in parallel. This is a very complicated task that demands huge computation resources. Serial selection of regions of interest for processing, while eliminating most or all of the processing of all other regions, can greatly reduce the computation complexity. Active Pixel Sensors (APS), implemented in a standard CMOS technology, recently became a very attractive solution for these applications, rivaling traditional charge coupled devices (CCDs). Offering significant advantages in terms of low power, low voltage and monolithic integration, CMOS technology allows fabrication of so called "smart" image sensors [8]. The term "smart" sensor relates to the ability of the imager to integrate analog and digital signal processing onto the same substrate with the sensor and its digital interface. Contrary to a conventional imager, which only captures the image and transfers it for further processing, "smart" image sensors carry out an extensive amount of computation at the focal plane itself and transmit only the result of this computation. Both analog and digital processing can be performed either in the pixel or in the array periphery.

Many papers on various tracking imagers, implemented in CMOS technology, can be found in the literature. Good examples of these papers include a digital vision chip for high-speed target tracking in machine vision applications by Komuro et al. [1], single-chip eye tracker by Kim et al. [2], CMOS imager for pointing and tracking applications by Pain et al. [3], A foveated silicon retina for two-dimensional tracking by Etienne-Cummings et al. [5] and an adaptive visual tracking sensor with a hysteretic winner-take-all network by Indiveri et al. [7].

This paper presents an implementation of a real time CMOS image sensor for tracking purposes. The proposed sensor can be useful in star tracking, machine vision and navigation applications. This implementation is a continuation of our previous work, in which we implemented a low-power multiple target tracking sensor based on biological models of attention [10]. Similar to the previous solution, the proposed sensor is designed in CMOS technology and includes several features, such as the capability of tracking multiple salient targets in the field of view by real time target centroid computation, low power dissipation, fully autonomous operation, operation in integration mode, high fill factor and high spatial resolution. However, a number of significant improvements differ this newly proposed imager architecture from our previous solution. First, the proposed imager utilizes a current mode Correlated Double Sampling (CDS) to reduce FPN in both acquisition and tracking modes of operation. Secondly, contrary to the previous implementation, where target acquisition and tracking were performed in an analog manner, in this solution the acquisition is done digitally, while the target tracking is performed entirely by analog computations. On one hand this provides better precision and flexibility during the target acquisition and on the other hand it allows distributed architecture and fully parallel computations during tracking. In addition, the sensor achieves better linearity, higher fill factor and improved power dissipation.

Section II presents the architecture of the proposed sensor with a special emphasis on its different modes of operation. A detailed description of the pixel, CDS and centroid computation is provided in Section III. Section IV specifies various simulations and tests performed on the sensor. Section V concludes the paper.

II. SYSTEM ARCHITECTURE AND PRINCIPLE OF OPERATION

Fig. 1 shows a general architecture of the proposed tracking sensor. The sensor consists of: (a) APS array, (b) column and row shift registers for target windows definition and for image readout control, (c) column and row CDS circuits for FPN reduction, (d) column and row Center of Mass (COM) Update Units for target tracking purposes, (e) column and row Winner-Take-All (WTA) circuits used in the tracking mode of operation, (f) an Analog to Digital Converter (ADC) for target detection in the acquisition mode and (g) a digital control unit responsible for mode selection, system timing and shift register control.

The operation of the proposed sensor can be divided into two main modes: the acquisition mode and the tracking mode. In the acquisition mode the sensor’s goals are to locate the N most salient targets in the FOV and calculate their centroid coordinates. These aims are achieved in the following way: an image is captured and read out using a “Rolling Shutter” readout method, similar to a conventional CMOS image sensor array [8]. Current mode CDS circuits are used in the imager analog output chain to reduce FPN, while a successive approximation ADC is utilized to digitize the
analog readout data. The digitized output image is transferred to the digital control unit for further image processing. This unit defines a Window of Interest (WOI) around each target using a very simple COM detection algorithm. The algorithm is based on a single-scan real time calculation and it is performed in the following way: during the image readout, the digital values of every three adjacent pixels in each row are added. The M (M>N) highest sums and the coordinates associated with their middle pixels are stored. The maximal sum value indicates the COM of the first target. The second target’s COM is found by measuring the distance from the first COM. If it is too close, we assume that it is still a part of the first object. The COM coordinates of target k are found in the same way, while measuring distances from all previously found targets. Note, this centroid computation algorithm was chosen due to its simplicity and relaxed hardware requirements. Once centroid coordinates of the N most salient targets are detected, WOIs are defined around the COM coordinates of all detected targets by loading row and column shift registers with the targets’ COM locations. Subsequently, tracking mode is activated. Each active window definition is performed using two different shift registers and is illustrated by the emphasized area in Fig. 1. This window definition method allows switching from one target of interest to another without the need to access the memory and load the new target coordinates. The ADC is switched off, since all computations in the tracking mode are performed in an analog manner.

Figure 1. General architecture of the proposed sensor.

During the tracking mode only the regions inside the predefined windows are processed. Because the sensor is in the tracking mode most of the time, it is very important to achieve very low power dissipation in this mode. In the proposed system this is achieved in two ways:
1. Only the pixels of N active windows and the circuitry responsible for proper centroid detection and pixel readout are active. The remaining circuits (including most pixels of the array) are disconnected from the power supply. In addition, WTA circuits are only activated for a short period for sampling the COM update output.
2. The circuit does not recalculate the new centroid coordinates. The CDS circuits receive a sum of all currents from all WOI pixels in their relevant row/column and transfer the output to the COM Update unit. This unit relies on current distribution to decide which way the target has moved. The WTA determines the target’s direction of movement according to the output signals from the COM Update units. If no movement was detected, the circuit does not need to perform any action, significantly reducing system power dissipation. If the target changes its position, the "shift left" or "shift right" (for both x and y) signals are produced by the COM update blocks. These signals are input to the tracking mode control block and the appropriate shift register is shifted in the appropriate direction, correcting the location of the window. This enables WOI movement in 9 directions (N, S, E, W, NE, NW, SE, SW or no movement) per frame.

978-1-4244-1684-4/08/$25.00 ©2008 IEEE

III. CIRCUITS DESCRIPTION

In this section the most important analog components of the proposed tracking sensor, such as the pixel, CDS circuit and COM update circuits are described. The detailed WTA circuit description can be found in [11].

A. Active Pixel Sensor

In order to implement the described system architecture with an analog COM update capability, a current-mode pixel was chosen to enable analog summation of multiple pixels. The implementation of a single pixel is shown in Fig. 2. In this implementation the photodetector (n+/psub photodiode) is connected to a transconductance amplifier (M2) and a reset transistor (M1). The transconductance amplifier is realized through a PMOS Common Source, biased in the linear region, thus ensuring a linear output. The NMOS reset transistor causes a voltage drop in the reset voltage value, ensuring the opening of the transconductance transistor M2 (VSG>Vth), but also reducing the output swing of the pixel. The pixel operates in integration mode, discharging the capacitance associated with the photodiode and adjacent nodes at a constant pace throughout the integration period. At the end of the integration period, the voltage on the photodiode (that is proportional to the light intensity) is linearly converted to current through the linear biased transconductance amplifier.

A readout transistor (M3) disconnects the biasing voltage (Vref) at all times except during signal and reset readout sampling. This both reduces power and prevents output error, by zeroing the current supplied to the output buffers. This configuration is preferred over a serially connected readout switch which affects the signal and distorts the linear response.

In this implementation, the fill factor was increased by reducing the conveyed signals to a total of three and an addition of only 6 transistors for combinatory logic. These enable complete random access and pixel timing with minimal noise effect on the signal. The pixel select signals (Row and Column Select) are output from the shift register in the pixel’s row/column through a multiplexer used for selecting the active window. In the array periphery, each Row Select signal is passed through two AND gates along with the global Reset and Readout signals, creating two new signal busses: “Reset Row” and “Readout Row”. These signals are conveyed to all the pixels in the row along with the Column Select bus. The two-input AND gates, made up of transistors M10-M11 and M12-M13, control the Reset and Readout switches (M1 and M3), described above.

The output stage of the pixel comprises a biasing unit and two output buffers. The biasing unit was implemented by a complementary Wilson current mirror, connected to an external bias voltage (Vref). This unit enables copying both the current from the current mirror input to the output, as well as copying the bias voltage Vref to the current mirror input (drain of M4). Here, the reference voltage is conveyed to the drain of the transconductance amplifier (M2), biasing it in the linear region. The value of Vref is slightly lower than Vdd, thus keeping the transconductance amplifier in the linear region, while ensuring it isn’t accidentally shut off or saturated. No other circuits are connected to the bias node, so as not to affect the bias voltage which is critical for pixel operability. Disconnecting this bias voltage shuts down the pixel output, resulting in low power dissipation at all times the pixel is not being read out.
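The single-scan COM detection used in the acquisition mode (Section II) can be sketched in software as follows. This is only an illustration under stated assumptions, not the logic of the digital control unit: the function name `detect_targets`, the `min_distance` parameter and the use of Chebyshev distance for the "too close" test are hypothetical, since the text does not specify how the distance is measured.

```python
import heapq


def detect_targets(image, n_targets, m_candidates, min_distance):
    """Return up to n_targets (row, col) centroid estimates, brightest first.

    image is a list of rows of digitized pixel values. The distance test
    (Chebyshev distance >= min_distance) is an assumption of this sketch.
    """
    # Single scan: keep the M largest sums of three adjacent pixels per row,
    # together with the coordinates of the middle pixel.
    candidates = []  # min-heap of (sum, row, col)
    for r, row in enumerate(image):
        for c in range(1, len(row) - 1):
            item = (row[c - 1] + row[c] + row[c + 1], r, c)
            if len(candidates) < m_candidates:
                heapq.heappush(candidates, item)
            elif item > candidates[0]:
                heapq.heapreplace(candidates, item)

    # Walk the stored sums from largest to smallest. A candidate that lies
    # too close to an already accepted target is treated as part of it.
    targets = []
    for _, r, c in sorted(candidates, reverse=True):
        if all(max(abs(r - tr), abs(c - tc)) >= min_distance
               for tr, tc in targets):
            targets.append((r, c))
            if len(targets) == n_targets:
                break
    return targets
```

The returned coordinates correspond to the values that would be loaded into the row and column shift registers to define the WOIs. A `min_distance` on the order of the window size would be a natural choice, although the paper does not state the value used.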

Figure 2. Implementation of a single pixel of the proposed tracking sensor.

The current is mirrored through a single NMOS to any number of output nodes without affecting the bias voltage. In this configuration the pixel outputs the current to both a column bus and a row bus, allowing linear readout from a single pixel during the acquisition mode and readout from a group of pixels within the WOI during the tracking mode. The concurrent readout of a group of pixels is allowed by connecting the currents from all pixels in the row/column. The sum of these currents is conveyed to the CDS unit, as explained below.

B. Correlated Double Sampling Circuit

The proposed sensor integrates the use of a current mode CDS circuit for noise reduction and elimination of spatial variations across the array. The concept of the CDS is based on the design presented by Gruev et al. [12], where one CDS circuit was employed for the whole array. In our proposed tracking sensor, the concept was adapted for use with the pixel array described above, as a per-column/row circuit with the corresponding operation range. Each CDS circuit receives a current sink from its corresponding row or column, consisting of the accumulative output current of the pixels in the active window. This current is sampled during the readout phase of the system clock and then, during the reset phase, the reset current is subtracted from the readout current and output from the CDS.

The transistor scheme of the CDS circuit is shown in Fig. 3. The first stage of the CDS is a simple current mirror that converts the current sink into a current source (M1, M2). This current is accurately mirrored from M3-M8 to the analog memory stage created by M11-M13. This is executed during the system clock’s readout phase with the SW signal high, turning on M9-M10 and charging capacitors C1 and C2. The SW signal falls upon the system’s reset phase, turning off M9-M10 and trapping the charge gathered on C1 and C2, thus holding the pixel output current on M12-M13. Subsequently, the readout current is sunk from the pixel column/row and mirrored to M11. The remainder of the current required by M12 and M13 is supplied by M14 and mirrored through M15 as the output current. While providing linearity of the pixel output, subtraction of accumulative noise is achieved.

Figure 3. CDS circuit schematics

C. Center of Mass Update Unit

An implementation of the COM update unit is shown in Fig. 4. The current output by the CDS of each row and column is sourced into an array of matched resistors. Each resistor is connected in parallel to a bypass switch controlled by simple logic (a 2-input NAND gate). The NAND gates are connected to every pair of adjacent outputs of the shift registers, thus opening the PMOS bypass gates if one of the columns/rows is active. This structure creates a balanced resistance for the two ends of the active window, and accordingly, a larger current will reach the end closer to the target’s center of mass. The outputs are connected to a 2-input WTA circuit. According to its output, the digital control will execute a Shift-Left or Shift-Right command to the relevant shift register, thus instantly updating the COM. The WTA circuit is designed to differentiate between current differences down to a threshold, below which no decision is taken and the COM stays unmoved.

Figure 4: Implementation of the COM Update Unit

IV. SIMULATION RESULTS

The system was designed and simulated in a standard CMOS 0.18µm TSMC technology available through MOSIS. All analog components have been simulated in the Cadence Virtuoso environment using the Spectre simulator. The simulations were carried out under typical conditions (27°C) as well as extreme corners (SS 75°C, FF 0°C, SF and FS both 0°C and 75°C) and under 10% supply power uncertainty. The circuits were carefully laid out taking analog features, matching and sensitivity into consideration.

Fig. 5 shows the simulation of a single pixel output current (see currents ICOL and IROW in Fig. 2) as a function of the photodiode voltage (source of the M1 transistor in Fig. 2). The simulation was carried out for the full voltage swing available on the photodiode. As can be seen, the pixel achieves approximately linear response for a wide range of photodiode voltages, especially for low illumination levels (high photodiode voltages), where the linearity is more important.

Fig. 6 shows the simulation of response of the CDS circuit in the tracking mode for different photodiode voltages. In this case the CDS subtracts the sum of 7 pixels (belonging to a single column of the WOI) reset values from their signal values. As can be seen, the CDS provides a linear response (introducing a constant gain) for a wide range of WOI outputs.

Fig. 7 shows the simulation of the WTA circuit, including all decision making at various ratios of target movement. The simulation starts (1st section) with substantial window movement to the left (W on the column WTA or S on the row WTA), causing a much higher current output from the left side of the COM update
unit. The WTA responds with a logic “0” on the “Right Winner” output and a logic “1” on the “Left Winner” output, causing the digital control unit to execute a “shift left” command on the relevant shift register. In the second section the movement is again to the left, though slighter (only a 3µA difference between left and right), and again the system shifts left. The third section shows movement within the noise region of the WTA (<3µA), conceived as lack of movement, thus the output is a logic “0” both on the right and left winner signals, causing no change in the shift register. Finally in the 4th section, the target moves to the right, causing the “Right Winner” signal to rise to “1” and therefore update the WOI to the right.

Table I summarizes the expected properties and features of the proposed tracking system. Note, at this stage the system was designed for tracking of up to 3 salient targets.

TABLE I
MAIN PARAMETERS OF THE SYSTEM

Technology                          TSMC 0.18µm
Power supply                        1.8V
Array size                          64 x 64
Fill Factor                         35%
Tracking Capability                 3 Targets of 7x7 pixels
Maximum targets movement rate       1 pixel per frame in any direction
Frame rate                          up to 100 Frames per second
Pixel Size                          10µm x 10µm
ADC type                            Successive approximation, 8bit
Estimated Power Dissipation of      34.5µW in acquisition mode
64x64 Pixel Array                   0.4µW in tracking mode
Estimated System Power              < 2mW in acquisition mode
Dissipation                         < 100µW in tracking mode

Figure 5: Single pixel output current as function of the photodiode voltage (Post-Layout Simulation).

Figure 6: Response of CDS circuit in the tracking mode of operation (Post-Layout Simulation).

Figure 8: Winner Take All decision outputs at different ratios of active window movement (Post-Layout Simulation).

V. CONCLUSIONS AND FURTHER RESEARCH

An autonomous CMOS image sensor for target detection and tracking was presented. Employing a novel architecture, the proposed imager enables acquisition and real time tracking of up to 3 bright targets in the field of view. A 64x64 pixel sensor array has been designed in 0.18µm CMOS technology and is operated via a 1.8V supply. The detailed description of the proposed imager architecture was presented. Pixel, CDS and COM update unit circuits were shown and simulation results were described. Further research includes fabrication of the proposed sensor in 0.18µm CMOS technology and testing it.

REFERENCES

[1] T. Komuro, I. Ishii, M. Ishikawa, A. Yoshida, “A digital vision chip specialized for high-speed target tracking”, IEEE Transactions on Electron Devices, vol. 50, issue 1, pp. 191-199, Jan 2003.
[2] D. Kim, S. Lim and G. Han, “Single-Chip Eye Tracker Using Smart CMOS Image Sensor Pixels”, Analog Integrated Circuits and Signal Processing, 45, 131–141, 2005.
[3] B. Pain, C. Sun, G. Yang and J. B. Heynssens, “CMOS imager for pointing and tracking applications”, US Patent 7,030,356, Apr 18, 2006.
[4] D. Coombs, M. Herman, T.-H. Hong, and M. Nashman, “Real-time obstacle avoidance using central flow divergence, and peripheral flow”, IEEE Transactions on Robotics & Automation, vol. 14, pp. 49–59, Feb. 1998.
[5] R. Etienne-Cummings, J. Van der Spiegel, P. Mueller, and M. Zhang, “A Foveated Silicon Retina for Two-Dimensional Tracking”, IEEE Transactions on Circuits and Systems II, vol. 47, no. 6, June 2000.
[6] T. G. Morris, T. K. Horiuchi, and P. DeWeerth, “Target-Based Selection Within an Analog Visual Attention System”, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 12, December 1998.
[7] G. Indiveri, P. Oswald, and J. Kramer, “An adaptive visual tracking sensor with a hysteretic winner-take-all network”, Proc. ISCAS 2002, 2: 324-327, May 2002.
[8] O. Yadid-Pecht and R. Etienne-Cummings, "CMOS imagers: from phototransduction to image processing", Kluwer Academic Publishers, 2004.
[9] A. El Gamal and H. Eltoukhy, "CMOS Image Sensors", IEEE Circuits & Devices Magazine, pp. 6-20, May/June 2005.
[10] A. Fish, A. Spivakovsky, A. Golberg and O. Yadid-Pecht, “VLSI Sensor for multiple targets detection and tracking”, Proc. ICECS 2004, pp. 543-546, 13-15 Dec. 2004, Tel-Aviv, Israel.
[11] A. Fish, V. Milirud and O. Yadid-Pecht, “High speed and high resolution current winner-take-all circuit in conjunction with adaptive thresholding”, IEEE Transactions on Circuits and Systems II, vol. 52, no. 3, pp. 131-135, March 2005.
[12] V. Gruev, R. Etienne-Cummings, T. Horiuchi, "Linear Current Mode Imager with Low Fix Pattern Noise", Proc. ISCAS '04, Vancouver, Canada, May 2004.


Appendix F: Data Retention Voltage Detection for Minimizing the Standby Power of SRAM Arrays

As appears in the Proceedings of the 27th IEEE Convention of Electrical and Electronics Engineers in Israel (IEEEI 2012) [58]. Presented in Eilat, Nov. 2012.


2012 IEEE 27-th Convention of Electrical and Electronics Engineers in Israel

Data Retention Voltage Detection for Minimizing the Standby Power of SRAM Arrays

Noa Edri∗†, Sharon Fraiman∗, Adam Teman∗, and Alexander Fish† ∗Low Power Circuits and Systems Lab (LPC&S) of the VLSI Systems Center Ben-Gurion University of the Negev, Be’er Sheva, Israel †Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel Email: [email protected]

Abstract—Lowering the supply voltage can substantially decrease SRAM leakage power. There are several different approaches to determine the minimum standby voltage at which an SRAM array can preserve its data, also known as the array’s Data Retention Voltage (DRV). The main goal of these approaches is to try and find the tail of the DRV distribution, since the worst SRAM bit cell sets the DRV of an entire memory block. The analytical approach concentrates on solving the sub-threshold voltage-transfer-characteristic equations of the core of a standard 6T bit cell. Another straight-forward method is running numerous Monte Carlo simulations to obtain the DRV at the required probability level. More advanced approaches based on static noise margin statistics remain valid for the extreme tails of the DRV distribution. Finally, the most accurate method is to use designated on-chip hardware to measure the DRV post-silicon fabrication. This paper overviews several recently proposed methods for DRV estimation and determination and discusses the advantages and trade-offs of these methods.

I. INTRODUCTION

With the progression of technology and continuous device scaling, leakage power has become the major factor in IC power consumption. SRAM cells are commonly used in VLSI circuits due to their speed and compatibility with logic, and are often the largest component of digital systems. Therefore, SRAM leakage power is one of the primary components of overall system power consumption and in many systems is the dominant factor. Several designs have been proposed for the implementation of low-power SRAM cells in order to reduce leakage power; however, these designs generally trade-off density, performance, and/or robustness to achieve this reduction. Another approach to reduce leakage currents is to lower the standby supply voltage of the SRAM array to its lowest limit. This paper overviews several recently proposed methods to estimate this minimum voltage, and therefore provide a means to maximize leakage reduction. An introduction to the concept of Data Retention Voltage is given in Section II. Section III presents five methods for estimation and/or detection of the low limit and their advantages and trade-offs. Section IV provides a discussion about these methods, and Section V concludes the paper.

Fig. 1. (a) Schematic of standard 6T SRAM cell (b) Butterfly curves above and at DRV for a symmetric 6T cell (c) Hold ‘1’ state as a function of supply voltage for a symmetric 6T bit cell

II. DATA RETENTION VOLTAGE

The most common SRAM circuit is the standard 6T bit cell shown in Fig. 1(a). Lowering the supply voltage (VDD) significantly reduces the leakage power of this cell, both due to the voltage factor in the static power expression, as well as the dependence of leakage components on supply voltage [1]. However, VDD reduction also lowers the noise margins of the bit cell, resulting in degraded stability. The lowest voltage, at which an SRAM bit cell retains its bi-stability during hold, and therefore can still preserve its data is known as the Data Retention Voltage (DRV). This voltage can be found by plotting the so-called “butterfly curves” (Fig. 1(b)) and measuring the bit cell’s Static Noise Margin (SNM) [2]. Under standby (hold), the SNM metric represents the maximum level of a DC noise source that can be applied to the internal data nodes of the SRAM bit cell without losing the data. The DRV is the point where the SNM is zero (the collapsed curves in Fig. 1(b)), and therefore, reducing the voltage below the DRV will cause the cell to be unreliable [1], [3].

There are two scenarios in which the zero DRV point of
a bit cell is reached. The first is for the ideal case, assuming a completely symmetric bit cell as shown in Fig. 1(b), such that the SNM for holding a 1 (SNML) and for holding a 0 (SNMH) are identical. Reducing the standby voltage equally decreases SNMH and SNML, until a point is reached where the butterfly curves collapse, leaving a single meta-stable point at which the internal data nodes (Q and QB) have the same value (Fig. 1(c)). Since no physical circuit is completely symmetric, the realistic case takes into account mismatch between the devices in the bit cell. The butterfly curves of an asymmetric bit cell above and at DRV are illustrated in Fig. 2(a). For the asymmetric case, the reduction of the supply voltage causes one of the lobes of the butterfly curve to deflate faster than the other, until bi-stability is lost. Therefore, the bit cell flips to the remaining stable point, regardless of the initial state, as shown in Fig. 2(b) [1].

Fig. 2. (a) Butterfly curves above and at DRV for an asymmetric 6T cell. (b) Hold ‘1’ state as a function of supply voltage for an asymmetric 6T bit cell.

The minimum standby voltage that can be provided to the array is determined by the specific inter- and intra-die variations that affect the transistors that comprise it. Intra-die variations (i.e. local mismatch) cause the DRV to be different for each memory cell, and therefore, the DRV of the entire array is determined by the worst case DRV (i.e. the maximum DRV among the SRAM bit cells). Inter-die variation (i.e. global variations) causes the mean of the die distribution to vary from die to die. For example, the mean is expected to be larger than the nominal value for dies in the Fast-NMOS, Slow-PMOS (FS) corner, due to the large NMOS pull down leakage currents. The most problematic variation is caused by random inter-device variation sources, such as Random Doping Fluctuation (RDF), which increases with technology scaling. The randomness of the threshold voltage (VT) due to RDF can be modeled as a normal distribution with a standard deviation that is inversely proportional to the channel area [1]. The combination of the inter- and intra-die variation on the densely packed SRAM bit cells leads to a skewed normal DRV distribution with a heavy tail on the right [3], [4]. Figure 3 shows the DRV distribution of 5 K Monte Carlo samples for a 6T SRAM bit cell in a commercial 40 nm process. Accurate DRV estimation, especially within the tail of the distribution, is essential for optimizing the trade-off between SRAM yield and standby power savings.

Fig. 3. DRV Distribution of a 6T SRAM bit cell in a commercial 40 nm process at a typical and a worst-case process corner.

978-1-4673-4681-8/12/$31.00 ©2012 IEEE

III. METHODS FOR DETERMINING THE DRV

This section describes five recently proposed methods for determining the DRV of an SRAM bit cell or array, including a post-fabrication method proposed by our group. The following sub-sections provide a short introduction to each of these methods, their primary advantages, and their limitations.

A. Analytic Method [3]

Examining the behavior of a 6T SRAM memory bit cell when operated at DRV shows that all six transistors are biased in the sub-VT region. Hence, the values stored on the internal data nodes (Q and QB in Fig. 1(a)) are determined by the behavior of sub-VT leakage currents. As mentioned before, when VDD is lowered to DRV, the bit cell’s noise margin degrades to zero. Therefore, an analytic formula for DRV calculation can be derived by solving the sub-VT cross-coupled inverter voltage transfer equations. The formula is developed under two major assumptions: 1. Leakage current from the access transistor is negligible; and 2. All leakage currents other than sub-VT leakage are ignored, since in current technologies their effect is negligible under sub-VT biasing conditions. The sub-VT current for a transistor can be written as [3]:

$$A_i = S_i I_0 \exp\left(\frac{-V_{th}}{n_i \phi_T}\right) \tag{1}$$

where $S_i$ is the transistor’s W/L ratio; $I_0$ is a process specific current measured at $V_{GS} = V_T$ for a transistor with $S_i = 1$; $\phi_T = kT/q$ is the thermal voltage; and $n_i$ is the sub-VT factor. Using this equation, an initial value of DRV can be estimated as:

$$\mathrm{DRV}_{\mathrm{initial}} = \phi_T \left(n_2^{-1} + n_3^{-1}\right)^{-1} \cdot \log\left[\left(n_3^{-1} + n_4^{-1}\right) \frac{A_4}{A_2 A_3} \left(\frac{A_5}{n_2} + \frac{A_1}{\left(n_1^{-1} + n_2^{-1}\right)^{-1}}\right)\right] \tag{2}$$

Assuming that during standby, the voltage at node Q ($V_Q$) is 0 and the voltage at node QB ($V_{QB}$) is VDD:

$$V_Q = \phi_T \cdot \frac{A_1 + A_5}{A_2} \exp\left(\frac{-\mathrm{DRV}_{\mathrm{initial}}}{n_2 \phi_T}\right), \qquad V_{QB} = \mathrm{DRV}_{\mathrm{initial}} - \phi_T \cdot \frac{A_4}{A_3} \exp\left(\frac{-\mathrm{DRV}_{\mathrm{initial}}}{n_3 \phi_T}\right) \tag{3}$$
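As a rough numerical companion to Eqs. (1)-(3), the sketch below evaluates the initial DRV estimate and the resulting node voltages. It assumes that Eq. (2) has the form of a thermal voltage scaled by the inverse of (1/n2 + 1/n3), multiplied by the logarithm of a product of leakage-coefficient and sub-VT-factor terms, and that the DRV in the exponents of Eq. (3) is the initial estimate. The function names, the dictionary-based indexing of transistors 1 to 5 and the room-temperature value of the thermal voltage are hypothetical choices made for illustration only.

```python
import math

PHI_T = 0.0259  # assumed thermal voltage kT/q near 300 K, in volts


def leakage_coeff(s, i0, vth, n, phi_t=PHI_T):
    """Eq. (1): A_i = S_i * I0 * exp(-Vth / (n_i * phi_T))."""
    return s * i0 * math.exp(-vth / (n * phi_t))


def drv_initial(a, n, phi_t=PHI_T):
    """Initial DRV estimate under the assumed form of Eq. (2).

    a and n are dicts mapping transistor index (1..5) to A_i and n_i.
    """
    arg = ((1 / n[3] + 1 / n[4]) * a[4] / (a[2] * a[3])
           * (a[5] / n[2] + a[1] * (1 / n[1] + 1 / n[2])))
    return phi_t / (1 / n[2] + 1 / n[3]) * math.log(arg)


def node_voltages(drv, a, n, phi_t=PHI_T):
    """Eq. (3): degraded node voltages V_Q and V_QB at the estimated DRV."""
    v_q = phi_t * (a[1] + a[5]) / a[2] * math.exp(-drv / (n[2] * phi_t))
    v_qb = drv - phi_t * a[4] / a[3] * math.exp(-drv / (n[3] * phi_t))
    return v_q, v_qb
```

Calling `drv_initial(a, n)` with coefficients produced by `leakage_coeff` returns the estimate in volts, which `node_voltages` then converts into the two degraded node voltages. Realistic inputs would come from transistor characterization, as the text notes.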

Using (3), a final DRV expression can be obtained:

$$DRV = DRV_{initial} + \frac{V_Q}{2} + \frac{(DRV_{initial} - V_{QB})\, n_2}{2} \qquad (4)$$

The suggested formula utilizes values that can be extracted from transistor characterization, through either measurement or simulation. This model captures the link between DRV and process variations, chip temperature, and sizing. The impact of these factors is exemplified in another formula:

$$DRV = DRV_0 + \sum_i a_i \frac{\Delta S_i}{S_i} + \sum_i b_i \Delta V_{T_i} + c \Delta T, \qquad (5)$$

where $DRV_0$ is the nominal DRV at room temperature, and the sizing, variation, and temperature coefficients ($a_i$, $b_i$, and $c$, respectively) are extracted from simulation. Measurements and simulations in a bulk 130 nm technology have shown that process variations ($b_i$) play the most critical role in the determination of DRV [3].

This analytical model was verified with measurement results, and showed a better than 90% reduction in leakage power. However, two major drawbacks of this model are the complexity of transistor modeling at deeply nano-scaled technologies, especially in light of extreme process variations, and the fact that it can only model the DRV of a single cell and not an entire SRAM array.

B. Monte Carlo and Mixture Importance Sampling [4], [5]

The contemporary approach for finding the DRV is to run full Monte Carlo (MC) simulations until the desired probability is reached. The number of MC simulations is inversely proportional to the probability of the event [5]. Therefore, a very large number of samples is required to find the right tail of the DRV distribution, which is made up of extremely rare cases. This number is often so large that it is impractical to achieve a high enough precision for the evaluation of the tail distribution through ordinary MC. Inaccuracies in tail evaluation (such as matching the samples to a normal distribution) can cause failure of the memory array, resulting in costly yield reduction. On the other hand, if the DRV is overestimated, part of the potential power savings is lost.

To extract the tail distribution within a reasonable simulation time, an Importance Sampling technique can be used [6], [7]. This technique increases the accuracy of the estimation obtained with a given number of iterations. The idea behind the method is to increase the number of samples in areas of interest; in our case, in the tail of the distribution. This is achieved by determining the sample points using a shifted distribution function, rather than the original distribution. Kanj, [...] original distribution is also included in order to avoid under-representation of the body samples, as shown in the following equation:

$$g(X) = \lambda_1 N(X) + \lambda_2 U(X) + (1 - \lambda_1 - \lambda_2) N(X - \mu_s). \qquad (6)$$

Simulation of an Importance Sampling or Mixture Importance Sampling set is achieved by first generating a sample vector (in Matlab) that represents the distance of each of the six transistors' threshold voltages from the nominal value. Then, to find the DRV, a transient simulation is performed, writing and reading both a 1 and a 0 from a bit cell characterized by each sample.

The simulation outputs should then be appropriately weighted to compensate for the use of a biased sampling distribution. While this method allows us to effectively sample the rare cases, it also changes their statistics. In addition, the accuracy of the failure probability depends upon the number of simulated samples. Another limitation is that this method estimates the failure probability of a single voltage and needs to be re-run for every new voltage value.

C. Statistical Modeling [1]

The DRV distribution model is another approach that tries to address the problem of estimating extreme probabilities. This method relies on small MC simulations and stems from the connection between DRV and SNM. Due to the fact that this model is a consequence of the physical behavior of the transistors, it remains valid in the tail of the distribution, and enables estimation of the DRV even for large SRAM arrays.

In this approach, $SNM_H$ and $SNM_L$ are approximated by normal distributions. Their mean ($\mu$) and standard deviation ($\sigma$) can be extracted from a small (1.5 k–5 k samples) MC simulation. The DRV is determined when SNM reaches zero; therefore it is important to understand the sensitivity of SNM to $V_{DD}$. It was observed by [1] that $SNM_H$ and $SNM_L$ stay normally distributed for various values of $V_{DD}$. Since the shape of the distribution is mainly determined by the intrinsic parametric $V_T$ variation, $\sigma$ is almost constant with the lowering of the supply voltage, and $\mu$ is a linear function of $V_{DD}$:

$$\frac{\partial \sigma}{\partial V_{DD}} \approx 0, \qquad \frac{\partial \mu}{\partial V_{DD}} \approx k. \qquad (7)$$

The linear factor $k$ can be extracted from a single DC sweep since it is almost identical to the slope of the nominal $SNM_H$ as a function of $V_{DD}$. Given a minimum acceptable noise margin ($s$) at a given voltage, the hold failure probability of a cell can be computed using:

Pf(v, s) = P(SNMv [...]

that satisfies the required cell hold failure probability can be computed with (9):

$$F_{DRV}^{-1} = \frac{1}{k}\left(\sqrt{2}\,\sigma_0\,\mathrm{erfc}^{-1}\left(2 - 2\sqrt{P}\right) - \mu_0 + s\right) + v_0, \quad \text{with } P = 1 - P_f. \qquad (9)$$

The main advantage of this method is that it requires only a few thousand MC simulations to accurately predict extreme DRV tail values. As a result, it provides a significant speedup over standard MC methods.

Another approach that relies on a relatively small number of MC simulations is the recursive Statistical Blockade (SB) method [1]. However, unlike the previous method, the advantage of this method is that it does not impose any a priori limitations. The general idea behind SB [8] is to block the samples that are not expected to fall within the tail of the distribution. In other words, a large number of samples is generated, but only a few are simulated. The basis for the filter that selects the samples is an MC simulation with a small number of points. After the tail samples have been simulated, the results are fitted to a statistical model. In order to adjust this technique to DRV estimation, the authors of [1] expanded this method into a recursive formulation. As a result, accurate and reliable DRV evaluations can be achieved, taking into account extremely rare cases (6σ-8σ).

D. Post Fabrication - Built In Self Test

For the majority of systems that include a low-power standby mode for leakage reduction in SRAMs, the standby voltage is set at a level substantially higher than the DRV simulated at design time. Even though this method provides a very high yield, the standby voltage is set by the worst-case die, and a major opportunity for substantial power reduction is lost. An alternative approach is to measure the DRV post-fabrication and set the standby voltage independently for each fabricated die. This approach provides the means to cope with both global and local variation, which has become extremely critical in today's advanced technology nodes. An efficient way to implement post-silicon DRV measurement is through the employment of the on-chip Built in Self Test (BIST) circuits that are generally included for SRAM testing.

A simple DRV test algorithm can be implemented in the BIST that writes data into the memory blocks at the nominal $V_{DD}$ and then decreases the voltage of the memories for a given duration, before raising the voltage back to its nominal value and reading out the data for comparison. By adding a DRV measurement feature to these components, the die can be operated close to its DRV; however, a 50–100 mV guard band is usually taken to cope with voltage and temperature fluctuations.

Due to the long right tail of the DRV distribution, by eliminating only a few worst-case bit cells, the DRV can be substantially reduced, providing further power savings. We propose adding spare rows or cells that can be accessed in place of the original worst-case cells found during the DRV measurement. The Built-in Self Repair (BISR) mechanism, prevalent in many of today's systems, can be slightly modified to accommodate this spare row elimination. In addition to the obvious trade-off of this method, which is the silicon area of the added hardware, another drawback of this method is the increased testing time to map the chip's DRV.

E. Post Fabrication - Canary Replica Cells [9]–[11]

An alternative approach is not to determine the DRV during the startup stage, but rather to track it throughout chip operation. Since even a temporary reduction of the supply voltage below the DRV will result in data loss, the authors of [9] developed a technique that relies on Canary replica cells to track the DRV. These replica cells are affected by environmental changes and global variations in the same way as the core cells, but are configured to fail at a higher voltage than the SRAM cells. This is done by inserting a PMOS header in between the supply voltage and the canary cell voltage, as shown in Fig. 4(a). Multiple canary sets are designed, each with a different gate voltage biasing for the extended PMOS header. This feature produces a unique failure threshold for every set of canary cells, which can be used to select more power savings or higher reliability, depending on the application. This concept is illustrated through the DRV distributions of the sets of Canary cells as compared to SRAM cells in Fig. 4(b).

Fig. 4. (a) Canary bit cell design. (b) Principle of Canary based closed-loop $V_{DD}$ scaling approach.

One limitation of the canary technique is its inability to track local variations. Therefore, a DRV detecting BIST unit is employed by the system to determine the nominal DRV of the core cells. The value computed by the BIST is set as the minimum threshold of the system. For each new supply voltage, the canary cells are checked, and when a failure occurs, the current voltage is compared to the failure threshold. If the failure threshold is exceeded, the system raises $V_{DD}$ by one step, and continues monitoring the canaries at this level.

IV. DISCUSSION

The previous section overviewed several methods and approaches for determining the minimum standby voltage of an SRAM array. A general categorization of the methods can divide them into those that provide DRV estimation during the design phase, and those that aim to determine it only after

chip fabrication. One of the significant advantages of DRV estimation during the design phase is the possibility to change the memory array if the calculated DRV is higher than required and the leakage target is missed. This could lead to redesign, including improvement of the cell's immunity to process variations, changes in memory architecture, or choosing a different bit cell topology. A more efficient approach is to include additional hardware for post-fabrication DRV measurement. These methods enable operation of the majority of the fabricated chips at a much lower supply voltage than the worst-case DRV by independently adjusting the standby voltage of each die according to global and local variation. By adding additional hardware, these methods achieve a substantial reduction in standby power, and can often be implemented through small modifications to pre-existing components. Table I provides a general comparison of the methods that were overviewed in Section III according to the accuracy of the DRV estimation; the design effort in implementing the method; the ability of the method to cope with Process/Voltage/Temperature (PVT) variations; and the area cost of implementation.

TABLE I
COMPARISON OF METHODS
Analytical Mixture IS Statistical BIST Canary
Accuracy low medium medium high high
Design medium H/W H/W low medium
Effort to high design design
Coping very process
low low high
with PVT low only
low to
Area none none none medium medium

V. CONCLUSION

One of the best opportunities for leakage power reduction is to reduce the standby supply voltage to a minimum level; however, this requires accurate DRV detection to ensure robust data storage. This paper presented several recently introduced methods for DRV estimation. Design-time calculation can be achieved through analytical calculations; however, a more accurate estimation is obtained through statistical simulation. Since the DRV distribution is characterized by a long right tail, extremely long run times are required to account for extreme cases. Several advanced statistical methods have been proposed to significantly reduce the number of simulations, while still providing an accurate tail distribution. A more accurate and efficient DRV estimation can be achieved post-silicon by the introduction of hardware peripherals to measure and/or monitor the on-chip DRV. These methods help cope with process and environmental variations to operate each chip at its lowest possible standby voltage. In systems with large amounts of on-chip memories and long standby periods, the addition of these components can provide a very rewarding reduction in static power.

ACKNOWLEDGMENT

The authors would like to thank the Alpha Consortium of the office of the Chief Scientist of Israel for their support of this work.

REFERENCES

[1] J. Wang, A. Singhee, R. Rutenbar, and B. Calhoun, “Two fast methods for estimating the minimum standby supply voltage for large SRAMs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 12, pp. 1908–1920, 2010.
[2] E. Seevinck, F. List, and J. Lohstroh, “Static-noise margin analysis of MOS SRAM cells,” IEEE JSSC, vol. 22, no. 5, pp. 748–754, 1987.
[3] H. Qin, Y. Cao, D. Markovic, A. Vladimirescu, and J. Rabaey, “SRAM leakage suppression by minimizing standby supply voltage,” in Proc. of ISQED, 2004, pp. 55–60.
[4] A. Nourivand, A. Al-Khalili, and Y. Savaria, “Postsilicon tuning of standby supply voltage in SRAMs to reduce yield losses due to parametric data-retention failures,” IEEE TVLSI, no. 99, pp. 1–1, 2011.
[5] R. Kanj, R. Joshi, and S. Nassif, “Mixture importance sampling and its application to the analysis of SRAM designs in the presence of rare failure events,” in Proc. of ACM/IEEE DAC, 2006, pp. 69–72.
[6] J. Jess, K. Kalafala, S. Naidu, R. Otten, and C. Visweswariah, “Statistical timing for parametric yield prediction of digital integrated circuits,” IEEE TCAD, vol. 25, no. 11, pp. 2376–2392, 2006.
[7] T. Hesterberg, “Advances in importance sampling,” Ph.D. dissertation, Stanford University, 1988.
[8] A. Singhee and R. Rutenbar, “Statistical blockade: a novel method for very fast Monte Carlo simulation of rare circuit events, and its application,” in Proc. DATE, 2007, pp. 1379–1384.
[9] J. Wang and B. Calhoun, “Canary replica feedback for near-DRV standby VDD scaling in a 90nm SRAM,” in Proc. of IEEE CICC, 2007, pp. 29–32.
[10] J. Wang et al., “Techniques to extend canary-based standby VDD scaling for SRAMs to 45 nm and beyond,” IEEE JSSC, vol. 43, no. 11, pp. 2514–2523, 2008.
[11] J. Wang, A. Hoefler, and B. Calhoun, “An enhanced canary-based system with BIST for SRAM standby power reduction,” IEEE TVLSI, no. 99, pp. 1–5, 2011.
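The mixture importance sampling idea of Section III-B can be illustrated with a small Python sketch. It is a one-dimensional toy under stated assumptions, not the six-dimensional threshold-voltage flow described in the text: the variable X is a single deviation in units of sigma, the transient simulation is replaced by a hypothetical pass/fail test (`toy_fails`), and the mixture weights, shift, and uniform range are arbitrary illustrative choices. Samples are drawn from g(X) in (6), and each failing sample is weighted by the ratio of the original density to g(X) to compensate for the biased sampling.

```python
import math
import random


def normal_pdf(x, mean=0.0):
    """Unit-variance normal density N(x - mean)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def uniform_pdf(x, low, high):
    """Uniform density U(x) on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def mixture_pdf(x, lam1, lam2, shift, low, high):
    """g(X) from (6): original normal + uniform + shifted normal."""
    return (lam1 * normal_pdf(x)
            + lam2 * uniform_pdf(x, low, high)
            + (1.0 - lam1 - lam2) * normal_pdf(x, mean=shift))


def draw_from_mixture(rng, lam1, lam2, shift, low, high):
    """Pick one mixture component at random, then sample from it."""
    u = rng.random()
    if u < lam1:
        return rng.gauss(0.0, 1.0)
    if u < lam1 + lam2:
        return rng.uniform(low, high)
    return rng.gauss(shift, 1.0)


def estimate_failure_probability(fails, n_samples, lam1=0.1, lam2=0.1,
                                 shift=4.0, low=-6.0, high=6.0, seed=0):
    """Weighted failure-probability estimate under the biased distribution."""
    rng = random.Random(seed)
    weighted_failures = 0.0
    for _ in range(n_samples):
        x = draw_from_mixture(rng, lam1, lam2, shift, low, high)
        if fails(x):
            # Weight = original density / sampling density.
            weighted_failures += normal_pdf(x) / mixture_pdf(
                x, lam1, lam2, shift, low, high)
    return weighted_failures / n_samples


def toy_fails(x):
    """Hypothetical stand-in for the transient write/read simulation."""
    return x > 4.0


estimate = estimate_failure_probability(toy_fails, n_samples=20000)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # exact 4-sigma tail probability
print(f"estimate = {estimate:.3e}, exact = {exact:.3e}")
```

Ordinary MC would see, on average, fewer than one failure in 20,000 samples at this probability level, whereas the shifted component places a large share of the samples in the failing region.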

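The BIST-based DRV test of Section III-D (write at nominal voltage, lower the supply, restore it, read back and compare) can be sketched in Python as a simple downward voltage sweep. Everything below is a behavioral assumption for illustration: the memory is modeled as a list of per-cell retention voltages in millivolts, `retains_data` is a hypothetical stand-in for the on-chip write/hold/read-compare sequence, and the step size, guard band, and example values are arbitrary. Spare cells are modeled by dropping the worst-case cells from the test.

```python
def retains_data(cell_drvs_mv, standby_mv):
    """Hypothetical stand-in for: write, lower VDD, restore VDD, read, compare.

    The array passes if every cell's retention voltage is at or below the
    applied standby voltage.
    """
    return all(standby_mv >= drv for drv in cell_drvs_mv)


def measure_array_drv(cell_drvs_mv, nominal_mv=1100, step_mv=10, floor_mv=0):
    """Step the standby voltage down until the first read-back mismatch."""
    lowest_passing = nominal_mv
    voltage = nominal_mv
    while voltage >= floor_mv:
        if not retains_data(cell_drvs_mv, voltage):
            break
        lowest_passing = voltage
        voltage -= step_mv
    return lowest_passing


def standby_voltage(cell_drvs_mv, guard_band_mv=50, spare_cells=0,
                    nominal_mv=1100, step_mv=10):
    """Measured DRV plus guard band, after replacing the worst cells by spares."""
    kept = sorted(cell_drvs_mv)[:len(cell_drvs_mv) - spare_cells]
    return measure_array_drv(kept, nominal_mv, step_mv) + guard_band_mv


# Illustrative array with one outlier cell in the right tail.
cells_mv = [120, 135, 150, 180, 310]
print(measure_array_drv(cells_mv))               # 310
print(standby_voltage(cells_mv))                 # 360
print(standby_voltage(cells_mv, spare_cells=1))  # 230
```

In this toy example a single spare cell lowers the standby voltage by 130 mV, which mirrors the argument that removing a few tail cells can substantially reduce the DRV.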

Abstract

With the continued progress of Moore's Law, low power has replaced performance improvement as the central focus of integrated circuit (VLSI) design. The continuous scaling of technology has degraded the on-current to off-current ratio (Ion/Ioff) of standard devices, and as a result, leakage currents have grown to the point where static power is the primary component of power consumption in many modern systems. Due to their large area and the large number of devices that compose them, array-based subsystems (such as memories and pixel arrays) are major contributors to the power consumption of many of today's systems. A high percentage of these components are in a static state of "data retention" or "integration" during most clock cycles, so that their main contribution to the overall system power is due to static consumption. Accordingly, many research groups around the world have focused on developing techniques and methodologies for saving power in all types of arrays, and especially for reducing static power. In this work, for the first time, research on power saving was unified across different types of arrays, and in particular across a variety of memory arrays and image sensors. This research included studying and comparing the characteristics, topologies, and architectures of each array type and, accordingly, developing novel techniques for low-power operation. These techniques take into account the unique challenges of each array field, and exploit opportunities to adapt techniques known for one array type for use with another array type. Each of the design disciplines posed new challenges that required a deep understanding of its mechanisms in order to design robust solutions. In the field of static memory (SRAM) design, the subject of stability was studied in depth in order to ensure data reliability when incorporating power-saving methods, such as lowering the supply voltage. This research yielded a large number of publications, three of which are presented in the body of this text. They present three novel SRAM cells that enable ultra-low-power operation and use the dynamic stability evaluation methods that were developed to prove the effectiveness of the developed circuits. In the field of dynamic gain cell design (Gain Cell embedded DRAM), data retention was examined to ensure suitable refresh rates and to calculate the overall power consumption of the system. This work was carried out in collaboration with a leading research group at EPFL in Switzerland, and has so far yielded the fabrication of a first test chip, the publication of a journal paper included in the body of this work, and several additional publications in conferences and journals (some still in the acceptance process).

In the third part of this thesis, a publication in the leading journal in the field of chip design (JSSC) is presented, describing a major part of a research project that lasted two years, in which a low-cost, low-power RFID system was developed. The non-volatile memory (NVM) array included in this system required the development of special operating methods to achieve low-power and reliable operation, and led to the development of a novel design methodology for circuits of this type. The last field on which this thesis focuses is the design of low-power "smart" image sensors. In this framework, a sensor was developed that combines embedded memory and a novel AB2C circuit to achieve power savings, which was published in a paper included in the body of this work. This sensor included an implementation of the component-sharing principle that was first presented in an additional publication, and several more smart sensors were developed, which are presented in the appendices and citations.

All circuits, systems, algorithms, and methods were examined theoretically and simulated with advanced CAD tools such as the Spectre simulator from Cadence. Some of these designs were fabricated in silicon and tested in the lab after fabrication, as part of four primary test chips in 04 nm, 04 nm, and 004 nm technologies. These works were carried out in fruitful collaboration with Zoran, Tower, and Altair, and with research groups at Tel Aviv University and EPFL (Switzerland). A total of 04 papers have been published so far in international journals, and 01 additional papers were published at international IEEE conferences. The main publications make up the body of this work or are attached in its appendices.

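Research Student's Declaration upon Submission of the Doctoral Thesis for Evaluation

I, the undersigned, hereby declare that:

X I have written this Thesis by myself, except for the guidance I received from my advisor(s).

X The scientific material included in this Thesis is the product of my own research from the period during which I was a research student.

___ This Thesis includes research material produced in collaboration with others, excluding the technical assistance customary in experimental work. Accordingly, a declaration of my contribution and that of my research partners, approved by them and submitted with their consent, is attached.

Date: 12/1/2014 Student's name: Adam Teman Signature: ______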

This work was carried out under the supervision of

Prof. Alexander Fish

In the Department of Electrical and Computer Engineering

Faculty of Engineering Sciences

Low Power Integrated Circuit Arrays

Thesis submitted in partial fulfillment of the requirements for the degree of "Doctor of Philosophy"

by

Adam Teman

Submitted to the Senate of Ben-Gurion University of the Negev

Approved by the advisor, Prof. Fish: ______

Approved by the Dean of the Kreitman School of Advanced Graduate Studies: ______

25 Av 5773 13 January 2014

Be'er Sheva

Low Power Integrated Circuit Arrays

Thesis submitted in partial fulfillment of the requirements for the degree of "Doctor of Philosophy"

by

Adam Teman

Submitted to the Senate of Ben-Gurion University of the Negev

25 Av 5773 1 August 2013

Be'er Sheva