Design of Coarse-Grained Power Gating for a Fine-Grained Many-Core Processor Array

Total Page:16

File Type:pdf, Size:1020Kb

Design of Coarse-Grained Power Gating for a Fine-Grained Many-Core Processor Array DESIGN OF COARSE-GRAINED POWER GATING FOR A FINE-GRAINED MANY-CORE PROCESSOR ARRAY by Shylesh Umapathy A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Engineering, Computer Engineering Option in the Lyles College of Engineering California State University, Fresno May 2018 ii APPROVED For the Department of Electrical and Computer Engineering We, the undersigned, certify that the thesis of the following student meets the required standards of scholarship, format, and style of the university and the student's graduate degree program for the awarding of the master's degree. Shylesh Umapathy Aaron Stillmaker (Chair) Electrical and Computer Engineering Reza Raeisi Electrical and Computer Engineering Nan Wang Electrical and Computer Engineering For the University Graduate Committee: Dean, Division of Graduate Studies iii AUTHORIZATION FOR REPRODUCTION OF MASTER’S THESIS Yes, I grant permission for the reproduction of this thesis in part or in its entirety without further authorization from me, on the condition that the person or agency requesting reproduction absorbs the cost and provides proper acknowledgment of authorship. Permission to reproduce this thesis in part or in its entirety must be obtained from me. Signature of thesis author: iv ABSTRACT DESIGN OF COARSE-GRAINED POWER GATING FOR A FINE-GRAINED MANY-CORE PROCESSOR ARRAY With the 53rd commemoration of Moore’s law and transistor sizing heading towards 4 nm, the number of transistors on an integrated circuit continue to double every year. However, there are many factors limiting this growth rate such as power consumption, which is a serious impediment for design of high-speed, low-power integrated circuits. In modern semiconductor manufacturing, leakage power in high- performance processors accounts for 20-30% of the total power. Power gating is one approach to reduce the power consumption of an integrated circuit by effectively disconnecting the power supply from blocks during idle mode and is commonly used in the design of commercial high-end processors and in SoC for portable applications. In this thesis, coarse-grained power gating techniques are explored to reduce the power consumption of a fine-grained many-core processor array. This work provides a detailed analysis and comparison of the design tradeoffs between using the ring and grid methods of power gates placement. The RTL to GDSII flow was performed using Synopsys EDA tools and NanGate FreePDK45 standard open cell library. The results depict that, on a test 128-bit MAC unit, placing power gates using the grid methodology created an 8.52% less IR drop but an 8-9% congestion increase when compared to the ring methodology. The leakage power of the 128-bit MAC unit was reduced by 99.98% with both the grid and ring placement of power gates when compared to a design without power gates. Shylesh Umapathy May 2018 ACKNOWLEDGMENTS Foremost, I would like to express my sincere gratitude to my advisor Dr. Aaron Stillmaker for the continuous support during my thesis, for his patience, immense knowledge, and motivation. He has always been there for my help and without his guidance, this thesis and report wouldn’t have been possible. I was fortunate enough to work under his guidance and I would cherish this experience for a long time. I am very thankful to my committee members Dr. Reza Raeisi, who has always inspired me with his leadership qualities and has always been there to help a student in every possible way. I wouldn’t have had a better understanding of Computer Architecture, if I hadn’t worked with Dr. Nan Wang, his seminars have helped me to sharpen my presentation skills. I am also very thankful to all the faculty members of Electrical and Computer Engineering for all the help. I would like to thank my friends, especially Meet Shah, who has always helped me, whenever I ran into a trouble while installing and configuring the Synopsys tools. I would also like to thank Brian Cardwell, who had helped me to generate all the great plots in this report. Finally, I must express my very profound gratitude to my family for providing me with unfailing support and continuous encouragement throughout my years of study. TABLE OF CONTENTS Page LIST OF TABLES ......................................................................................... viii LIST OF FIGURES .......................................................................................... ix CHAPTER 1: INTRODUCTION .................................................................... 11 CHAPTER 2: BACKGROUND ...................................................................... 14 Low Power Architectures ......................................................................... 16 Power Gating ............................................................................................ 21 Related Work ............................................................................................ 26 CHAPTER 3: TOOLS AND LIBRARIES ...................................................... 33 NanGate FreePDK45 ................................................................................ 34 CHAPTER 4: METHODOLOGY ................................................................... 36 Library preparation for NanGate FreePDK45 .......................................... 36 Designing 128-bit MAC unit .................................................................... 37 Synthesizing gate-level netlist .................................................................. 38 Place and Route Physical Design ............................................................. 40 Header design in Custom Compiler .......................................................... 50 HSPICE simulation of Power gates .......................................................... 52 CHAPTER 5: RESULTS AND ANALYSIS .................................................. 53 ModelSim Simulation ............................................................................... 53 Place and Route ........................................................................................ 53 CHAPTER 6: CONCLUSION AND FUTURE WORK ................................. 66 Conclusion ................................................................................................ 66 Future Work .............................................................................................. 67 APPENDICES ....................................................................................................... 73 vii vii Page APPENDIX A: Library Compiler and Milkyway ................................................. 74 APPENDIX B: TLUplus File ................................................................................ 76 APPENDIX C: Verilog Code ................................................................................ 79 APPENDIX D: Design Compiler Script ................................................................ 81 APPENDIX E: IC Compiler Script ........................................................................ 84 APPENDIX F: HSPICE Deck ............................................................................... 88 APPENDIX G: Detailed Timing Report ................................................................ 90 APPENDIX H: Detailed Congestion Report ....................................................... 100 LIST OF TABLES Page Table 1. Impact of scaling on ARM processor MIPS/mW [4] ........................ 11 Table 2. Fine-grained power gating techniques [18] ....................................... 27 Table 3. Comparison between NEMS and CMOS header switch [24] ............ 32 Table 4. Leakage Power of Power gates .......................................................... 52 Table 5. Impact of power gates placed in grid manner .................................... 58 Table 6. Impact of power gates placed in ring manner .................................... 60 Table 7. Impact of power gates on voltage and current ................................... 62 LIST OF FIGURES Page Figure 1. Header and Footer power gates [17] ................................................ 22 Figure 2. Grid of switches [19] ........................................................................ 23 Figure 3. Ring of switches [19] ........................................................................ 24 Figure 4. Leakage Power reduction [20] .......................................................... 28 Figure 5. Power reduction based on gating [20] .............................................. 28 Figure 6. Power reduction per core in AMD x86-64 [12] ............................... 29 Figure 7. Noise due to power gating [21] ........................................................ 29 Figure 8. Incremental scheduling [21] ............................................................. 30 Figure 9. Energy saved based on temperature [22] .......................................... 30 Figure 10. Impact of temperature on Leakage Power [23] .............................. 31 Figure 11. 128-bit MAC unit ........................................................................... 38 Figure 12. Block diagram of the 128-bit MAC unit in Design Compiler ........ 39 Figure 13. Detailed circuit diagram in Design Compiler ................................. 39 Figure 14. Voltage domain .............................................................................. 41 Figure 15. Voltage rails .................................................................................... 41 Figure 16. Power gates
Recommended publications
  • UCLA Electronic Theses and Dissertations
    UCLA UCLA Electronic Theses and Dissertations Title Implications of Modern Semiconductor Technologies on Gate Sizing Permalink https://escholarship.org/uc/item/56s9b2tm Author Lee, John Publication Date 2012 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California University of California Los Angeles Implications of Modern Semiconductor Technologies on Gate Sizing A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical Engineering by John Hyung Lee 2012 c Copyright by John Hyung Lee 2012 Abstract of the Dissertation Implications of Modern Semiconductor Technologies on Gate Sizing by John Hyung Lee Doctor of Philosophy in Electrical Engineering University of California, Los Angeles, 2012 Professor Puneet Gupta, Chair Gate sizing is one of the most flexible and powerful methods available for the timing and power optimization of digital circuits. As such, it has been a very well-studied topic over the past few decades. However, developments in modern semiconductor technologies have changed the context in which gate sizing is performed. The focus has shifted from custom design methods to standard cell based designs, which has been an enabler in the design of modern, large-scale designs. We start by providing benchmarking efforts to show where the state-of-the-art is in standard cell based gate sizing. Next, we develop a framework to assess the impact of the limited precision and range available in the standard cell library on the power-delay tradeoffs. In addition, shrinking dimensions and decreased manufacturing process control has led to variations in the performance and power of the resulting designs.
    [Show full text]
  • Instruction-Level Distributed Processing
    COVER FEATURE Instruction-Level Distributed Processing Shifts in hardware and software technology will soon force designers to look at microarchitectures that process instruction streams in a highly distributed fashion. James E. or nearly 20 years, microarchitecture research In short, the current focus on instruction-level Smith has emphasized instruction-level parallelism, parallelism will shift to instruction-level distributed University of which improves performance by increasing the processing (ILDP), emphasizing interinstruction com- Wisconsin- number of instructions per cycle. In striving munication with dynamic optimization and a tight Madison F for such parallelism, researchers have taken interaction between hardware and low-level software. microarchitectures from pipelining to superscalar pro- cessing, pushing toward increasingly parallel proces- TECHNOLOGY SHIFTS sors. They have concentrated on wider instruction fetch, During the next two or three technology generations, higher instruction issue rates, larger instruction win- processor architects will face several major challenges. dows, and increasing use of prediction and speculation. On-chip wire delays are becoming critical, and power In short, researchers have exploited advances in chip considerations will temper the availability of billions of technology to develop complex, hardware-intensive transistors. Many important applications will be object- processors. oriented and multithreaded and will consist of numer- Benefiting from ever-increasing transistor budgets ous separately
    [Show full text]
  • Placement and Routing in Computer Aided Design of Standard Cell Arrays by Exploiting the Structure of the Interconnection Graph
    325 Computer-Aided Design and Applications © 2008 CAD Solutions, LLC http://www.cadanda.com Placement and Routing in Computer Aided Design of Standard Cell Arrays by Exploiting the Structure of the Interconnection Graph Ioannis Fudos1, Xrysovalantis Kavousianos 1, Dimitrios Markouzis 1 and Yiorgos Tsiatouhas1 1University of Ioannina, {fudos,kabousia,dimmark,tsiatouhas}@cs.uoi.gr ABSTRACT Standard cell placement and routing is an important open problem in current CAD VLSI research. We present a novel approach to placement and routing in standard cell arrays inspired by geometric constraint usage in traditional CAD systems. Placement is performed by an algorithm that places the standard cells in a spiral topology around the center of the cell array driven by a DFS on the interconnection graph. We provide an improvement of this technique by first detecting dense graphs in the interconnection graph and then placing cells that belong to denser graphs closer to the center of the spiral. By doing so we reduce the wirelength required for routing. Routing is performed by a variation of the maze algorithm enhanced by a set of heuristics that have been tuned to maximize performance. Finally we present a visualization tool and an experimental performance evaluation of our approach. Keywords: CAD VLSI design, geometric constraints, graph algorithms. DOI: 10.3722/cadaps.2008.325-337 1. INTRODUCTION VLSI circuits become more and more complex increasingly fast. This denotes that the physical layout problem is becoming more and more cumbersome. From the early years of VLSI design, engineers have sought techniques for automated design targeted to a) decreasing the time to market and b) increasing the robustness of the final product.
    [Show full text]
  • Clock Gating for Power Optimization in ASIC Design Cycle: Theory & Practice
    Clock Gating for Power Optimization in ASIC Design Cycle: Theory & Practice Jairam S, Madhusudan Rao, Jithendra Srinivas, Parimala Vishwanath, Udayakumar H, Jagdish Rao SoC Center of Excellence, Texas Instruments, India (sjairam, bgm-rao, jithendra, pari, uday, j-rao) @ti.com 1 AGENDA • Introduction • Combinational Clock Gating – State of the art – Open problems • Sequential Clock Gating – State of the art – Open problems • Clock Power Analysis and Estimation • Clock Gating In Design Flows JS/BGM – ISLPED08 2 AGENDA • Introduction • Combinational Clock Gating – State of the art – Open problems • Sequential Clock Gating – State of the art – Open problems • Clock Power Analysis and Estimation • Clock Gating In Design Flows JS/BGM – ISLPED08 3 Clock Gating Overview JS/BGM – ISLPED08 4 Clock Gating Overview • System level gating: Turn off entire block disabling all functionality. • Conditions for disabling identified by the designer JS/BGM – ISLPED08 4 Clock Gating Overview • System level gating: Turn off entire block disabling all functionality. • Conditions for disabling identified by the designer • Suspend clocks selectively • No change to functionality • Specific to circuit structure • Possible to automate gating at RTL or gate-level JS/BGM – ISLPED08 4 Clock Network Power JS/BGM – ISLPED08 5 Clock Network Power • Clock network power consists of JS/BGM – ISLPED08 5 Clock Network Power • Clock network power consists of – Clock Tree Buffer Power JS/BGM – ISLPED08 5 Clock Network Power • Clock network power consists of – Clock Tree Buffer
    [Show full text]
  • Saber Eletrônica, Designers Pois Precisamos Comprovar Ao Meio Anunciante Estes Números E, Assim, Carlos C
    editorial Editora Saber Ltda. Digital Freemium Edition Diretor Hélio Fittipaldi Nesta edição comemoramos o fantástico número de 258.395 downloads da edição 460 digital em PDF que tivemos nos primeiros 50 dias de circu- www.sabereletronica.com.br lação. Assim, esperamos atingir meio milhão em twitter.com/editora_saber seis meses. Na fase de teste, no ano passado, com Editor e Diretor Responsável a Edição Digital Gratuita que chamamos de “Digital Hélio Fittipaldi Conselho Editorial Freemium (Free + Premium) Edition”, já atingimos João Antonio Zuffo este marco e até ultrapassamos. Redação Fica aqui nosso agradecimento a todos os que Hélio Fittipaldi Augusto Heiss seguiram nosso apelo, para somente fazerem Revisão Técnica Eutíquio Lopez download das nossas edições através do link do Portal Saber Eletrônica, Designers pois precisamos comprovar ao meio anunciante estes números e, assim, Carlos C. Tartaglioni, Diego M. Gomes obtermos patrocínio para manter a edição digital gratuita. Publicidade Aproveitamos também para avisar aos nossos leitores de Portugal, Caroline Ferreira, cerca de 6.000 pessoas, que infelizmente os custos para enviarmos as Nikole Barros revistas impressas em papel têm sido altos e, por solicitação do nosso Colaboradores Alexandre Capelli, distribuidor, não enviaremos mais os exemplares impressos em papel Bruno Venâncio, para distribuição no mercado português e ex-colônias na África. César Cassiolato, Dante J. S. Conti, Em junho teremos a edição especial deste semestre e o assunto prin- Edriano C. de Araújo, cipal é a eletrônica embutida (embedded electronic), ou como dizem os Eutíquio Lopez, Tsunehiro Yamabe espanhóis e portugueses: electrónica embebida. Como marco teremos também no Centro de Exposições Transamérica em São Paulo, a 2ª edição da ESC Brazil 2012 e a 1ª MD&M, o maior evento de tecnologia para o mercado de design eletrônico que, neste ano, estará sendo promovido PARA ANUNCIAR: (11) 2095-5339 pela UBM junto com o primeiro evento para o setor médico/odontoló- [email protected] gico (a MD&M Brazil).
    [Show full text]
  • Predictive Aging of Reliability of Two Delay Pufs
    Predictive Aging of Reliability of two Delay PUFs Naghmeh Karimi1, Jean-Luc Danger2;3, Florent Lozac'h3, and Sylvain Guilley2;3 1 ECE Department, Rutgers University, Piscataway, NJ, USA 08854 [email protected] 2 LTCI, CNRS, T´el´ecom ParisTech, Universit´eParis-Saclay, 75013 Paris, France [email protected] 3 Secure-IC SAS, 35510 Cesson-S´evign´e,France [email protected] Abstract. To protect integrated circuits against IP piracy, Physically Unclonable Functions (PUFs) are deployed. PUFs provide a specific signature for each integrated circuit. However, environmental variations, (e.g., temperature change), power supply noise and more influential IC aging affect the functionally of PUFs. Thereby, it is important to evaluate aging effects as early as possible, preferentially at design time. In this paper we investigate the effect of aging on the stability of two delay PUFs: arbiter-PUFs and loop-PUFs and analyze the architectural impact of these PUFS on reliability decrease due to aging. We observe that the reliability of the arbiter-PUF gets worse over time, whereas the reliability of the loop-PUF remains constant. We interpret this phenomenon by the asymmetric aging of the arbiter, because one half is active (hence aging fast) while the other is not (hence aging slow). Besides, we notice that the aging of the delay chain in the arbiter-PUF and in the loop-PUF has no impact on their reliability, since these PUFs operate differentially. 1 Introduction With the advancement of VLSI technology, people are increasingly relying on electronic devices and in turn integrated circuits (ICs).
    [Show full text]
  • Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: a Practical Approach∗
    IEICE TRANS. INF. & SYST., VOL.E101–D, NO.4 APRIL 2018 1116 PAPER Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: A Practical Approach∗ Carlos Cesar CORTES TORRES†a), Nonmember, Hayate OKUHARA†, Student Member, Nobuyuki YAMASAKI†, Member, and Hideharu AMANO†, Fellow SUMMARY In the past decade, real-time systems (RTSs), which must in RTSs. These techniques can improve energy efficiency; maintain time constraints to avoid catastrophic consequences, have been however, they often require a large amount of power since widely introduced into various embedded systems and Internet of Things they must control the supply voltages of the systems. (IoTs). The RTSs are required to be energy efficient as they are used in embedded devices in which battery life is important. In this study, we in- Body bias (BB) control is another solution that can im- vestigated the RTS energy efficiency by analyzing the ability of body bias prove RTS energy efficiency as it can manage the tradeoff (BB) in providing a satisfying tradeoff between performance and energy. between power leakage and performance without affecting We propose a practical and realistic model that includes the BB energy and the power supply [4], [5].Itseffect is further endorsed when timing overhead in addition to idle region analysis. This study was con- ducted using accurate parameters extracted from a real chip using silicon systems are enabled with silicon on thin box (SOTB) tech- on thin box (SOTB) technology. By using the BB control based on the nology [6], which is a novel and advanced fully depleted sili- proposed model, about 34% energy reduction was achieved.
    [Show full text]
  • "Routing-Directed Placement for the Triptych FPGA" (PDF)
    ACM/SIGDA Workshop on Field-Programmable Gate Arrays, Berkeley, February, 1992. Routing-directed Placement for the TRIPTYCH FPGA Elizabeth A. Walkup, Scott Hauck, Gaetano Borriello, Carl Ebeling Department of Computer Science and Engineering, FR-35 University of Washington Seattle, WA 98195 [Carter86], where the routing resources consume more Abstract than 90% of the chip area. Even so, the largest Xilinx Currently, FPGAs either divorce the issues of FPGA (3090) seldom achieves more than 50% logic placement and routing by providing logic and block utilization for random logic. A lack of interconnect as separate resources, or ignore the issue interconnect resources also leads to decreased of routing by targeting applications that use only performance as critical paths are forced into more nearest-neighbor communication. Triptych is a new circuitous routes. FPGA architecture that attempts to support efficient Domain-specific FPGAs like the Algotronix implementation of a wide range of circuits by blending CAL1024 (CAL) and the Concurrent Logic CFA6000 the logic and interconnect resources. This allows the (CFA) increase the chip area devoted to logic by physical structure of the logic array to more closely reducing routing to nearest-neighbor communication match the structure of logic functions, thereby [Algotronix91, Concurrent91]. The result is that these providing an efficient substrate (in terms of both area architectures are restricted to highly pipelined dataflow and speed) in which to implement these functions. applications, for which they are more efficient than Full utilization of this architecture requires an general-purpose FPGAs. Implementing circuits using integrated approach to the mapping process which these FPGAs requires close attention to routing during consists of covering, placement, and routing.
    [Show full text]
  • Research Needs in Computer-Aided Design: Logic and Physical Design and Analysis
    Research Needs in Computer-Aided Design: Logic and Physical Design and Analysis Logic synthesis and physical design, once considered separate topics, have grown together with the increasing need for tools which reach upward toward design at the system level and at the same time require links to detailed physical and electrical circuit characteristics. This list is a summary of important areas of research in computer-aided design seen by SRC members, ranging from behavioral-level design and synthesis to capacitance and inductance analysis. Note that research in CAD for analog/mixed signal is particularly encouraged, in addition to digital and hybrid system-on-chip design and analysis tools. Not included here are research needs in test and in verification, which are addressed in separate needs documents, and needs in circuit and system design itself, rather than tools research; these needs are assessed in the science area page for Integrated Circuits and Systems Sciences. Earlier needs lists with additional details are found in the "Top Ten" lists in synthesis and physical design developed by SRC task forces. System-Level Estimation and Partitioning System-level design tools require estimates to rapidly select a small number of potentially optimum candidates from a possibly enormous set of possible solutions. Models should consider power consumption, performance, die area (manufacturing cost), package size and cost, noise, test cost and other requirements and constraints. Partitioning of functional tasks between multiple hardware and software components, considering trade-offs between requirements and constraints, must be made using reasonably accurate models. Standard methods are also needed to estimate the performance of microprocessors and DSP cores to enable automated core selection.
    [Show full text]
  • PDF of the Configured Flow
    mflowgen Sep 20, 2021 Contents 1 Quick Start 3 2 Reference: Graph-Building API7 2.1 Class Graph...............................................7 2.1.1 ADK-related..........................................7 2.1.2 Adding Steps..........................................7 2.1.3 Connecting Steps Together...................................8 2.1.4 Parameter System........................................8 2.1.5 Advanced Graph-Building...................................8 2.2 Class Step................................................9 2.3 Class Edge................................................ 10 3 User Guide 11 3.1 User Guide................................................ 11 3.2 Connecting Steps Together........................................ 11 3.2.1 Automatic Connection by Name................................ 11 3.2.2 Explicit Connections...................................... 13 3.3 Instantiating a Step Multiple Times................................... 14 3.4 Sweeping Large Design Spaces..................................... 15 3.4.1 More Details.......................................... 17 3.5 ADK Paths................................................ 19 3.6 Assertions................................................ 20 3.6.1 The File Class and Tool Class................................ 21 3.6.2 Adding Assertions When Constructing Your Graph...................... 21 3.6.3 Escaping Special Characters.................................. 21 3.6.4 Multiline Assertions...................................... 21 3.6.5 Defining Python Helper Functions..............................
    [Show full text]
  • A Fine-Grained 3D IC Technology with NP-Dynamic Logic
    Received 13 June 2016; revised 24 February 2017; accepted 8 March 2017. Date of publication 20 March 2017; date of current version 7 June 2017. Digital Object Identifier 10.1109/TETC.2017.2684781 NP-Dynamic Skybridge: A Fine-Grained 3D IC Technology with NP-Dynamic Logic JIAJUN SHI, MINGYU LI, MOSTAFIZUR RAHMAN, SANTOSH KHASANVIS, AND CSABA ANDRAS MORITZ J. Shi, M. Li, and C.A. Moritz are with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003 M. Rahman is with the School of Computing and Engineering, University of Missouri, Kansas City, MO 65211 S. Khasanvis is with BlueRISC Inc., Amherst, MA 01002 CORRESPONDING AUTHOR: J. SHI ([email protected]) ABSTRACT A new 3D IC fabric named NP-Dynamic Skybridge is proposed that provides fine-grained vertical 3D integration for future technology scaling. Relying on a template of vertical nanowires, it expands our prior work to incorporate and utilize both n- and p-type transistors in a novel NP-Dynamic circuit-style compatible with true 3D integration. This enables a wide range of elementary logics leading to more compact circuits, simple clocking schemes for cascading logic stages and low buffer requirement. We detail new design concepts for larger-scale circuits, and evaluate our approach using a 4-bit nanoprocessor implemented in 16 nm technology node. A new pipelining scheme specifically designed for our 3D NP-Dynamic circuits is employed in the nanoprocessor. We compare our approach with 2D CMOS as well as state-of-the-art tran- sistor-level monolithic 3D IC (T-MI) approach. Benchmarking results for the 4-bit nanoprocessor show bene- fits of up to 56.7x density, 3.8x power and 1.7x throughput over 2D CMOS.
    [Show full text]
  • Register Allocation and VDD-Gating Algorithms for Out-Of-Order
    Register Allocation and VDD-Gating Algorithms for Out-of-Order Architectures Steven J. Battle and Mark Hempstead Drexel University Philadelphia, PA USA Email: [email protected], [email protected] Abstract—Register Files (RF) in modern out-of-order micro- 100 avg Int avg FP processors can account for up to 30% of total power consumed INT → → by the core. The complexity and size of the RF has increased due 80 FP to the transition from ROB-based to MIPSR10K-style physical register renaming. Because physical registers are dynamically 60 allocated, the RF is not fully occupied during every phase of the application. In this paper, we propose a comprehensive power 40 management strategy of the RF through algorithms for register allocation and register-bank power-gating that are informed by % of runtime 20 both microarchitecture details and circuit costs. We investigate algorithms to control where to place registers in the RF, when to 0 disable banks in the RF, and when to re-enable these banks. We 60 80 100 120 140 160 include detailed circuit models to estimate the cost for banking Num. Registers Occupied and power-gating the RF. We are able to save up to 50% of the leakage energy vs. a baseline monolithic RF, and save 11% more Fig. 1. Average Reg File occupancy CDF for SPEC2006 workloads. leakage energy than fine-grained VDD-gating schemes. 1 1 Index Terms—Computer architecture, Gate leakage, Registers, SRAM cells 0.8 0.8 I. INTRODUCTION 0.6 0.6 F.cactus I.astar 0.4 0.4 Out-of-order superscalar processors, historically found only F.gems I.libq in high-performance computing environments, are now used in F.milc I.go 0.2 F.pov 0.2 Imcf a diverse range of energy-constrained applications from smart- F.zeus Iomn phones to data-centers.
    [Show full text]