Power1.Ps (Mpage)

Total Pages: 16

File Type: pdf, Size: 1020 KB

Low Energy & Power Design Issues

Slide 1 — Low Energy & Power Design Issues
Low Power Design Problem
• Processor trends
• Circuit and technology issues
• Architectural optimizations
• Low power µP research project

Slide 2 — Microprocessor Power (source: ISSCC)
[Figure: microprocessor power (Watts) vs. year, 1975–1995, rising toward 30 W]
• When the supply voltage drops to 1 Volt, 100 Watts = 100 Amps

Slide 3 — Portable Devices
• Portable functions: multimodal radio; protocols, ECC, ...; voice I/O compression & decompression; handwriting recognition; text/graphics processing; video decompression; speech recognition; Java interpreter
• [Figure: portable device tethered to a 40+ lb battery]
• How to get 1 month of operation?

Slide 4 — Two Kinds of Computation Required
• General purpose processing (what you have been studying so far)
  • Bursty — mostly idle, with bursts of computation
  • Maximum possible throughput required during active periods
• Signal processing (for multimedia, wireless communications, etc.)
  • Stream-based computation
  • No advantage in increasing the processing rate above that required for real-time operation

Slide 5 — Optimizing for Energy Consumption
• Conventional general purpose processors (e.g. Pentiums)
  • Performance is everything ... somehow we'll get the power in and back out
  • 10–100 Watts, 100–1000 Mips = 0.01 Mips/mW
• Energy optimized but general purpose
  • Keep the generality, but reduce the energy as much as possible — e.g. StrongARM
  • 0.5 Watts, 160 Mips = 0.3 Mips/mW
• Energy optimized and dedicated
  • 100 Mops/mW

Slide 6 — Switching Energy
[Figure: CMOS inverter with supply Vdd, input Vin, output Vout driving load capacitance CL]
• Energy/transition = CL · Vdd²
• Power = Energy/transition · f = CL · Vdd² · f

Slide 7 — Low Power & Low Energy System Design
• System: design partitioning, power down
• Algorithm: complexity, concurrency, locality, regularity, data representation
• Architecture: voltage scaling, parallelism, instruction set, signal correlations
• Circuit/Logic: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Technology: threshold reduction, multi-thresholds

Slide 8 — Energy Reduction in CPUs
• Standard power management helps
  • Sleep modes
  • Power down blocks
• Clock rate reduction doesn't help
  • Number of operations = Nops
  • Energy/operation = C · V²
  • Total energy = Nops · C · V²
  • Energy is independent of clock rate!
• Reducing the clock rate only degrades throughput, with no savings in battery life — unless the voltage is changed

Slide 9 — Node Transition Activity and Power
• Switch a CMOS gate for N clock cycles:
  EN = CL · Vdd² · n(N)
  where EN is the energy consumed over N clock cycles and n(N) is the number of 0→1 transitions in those N cycles
• Pavg = lim(N→∞) (EN / N) · fclk = [lim(N→∞) n(N) / N] · CL · Vdd² · fclk
• α(0→1) = lim(N→∞) n(N) / N
• Pavg = α(0→1) · CL · Vdd² · fclk

Slide 10 — Factors Affecting Transition Activity α(0→1)
• "Static" component (does not account for timing)
  • Type of logic function (NOR vs. XOR)
  • Type of logic style (static vs. dynamic)
  • Signal statistics
  • Inter-signal correlations
• "Dynamic" or timing-dependent component
  • Circuit topology
  • Signal statistics and correlations
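As a quick numeric illustration of the switching-energy and transition-activity formulas above, the Python sketch below evaluates E = CL · Vdd² and Pavg = α(0→1) · CL · Vdd² · fclk. The 50 fF load and 40 MHz clock are assumed placeholder values for illustration, not figures taken from the slides.

```python
def energy_per_transition(c_load, vdd):
    """Energy drawn from the supply per 0->1 output transition: E = CL * Vdd^2."""
    return c_load * vdd ** 2

def average_switching_power(alpha_01, c_load, vdd, f_clk):
    """Average dynamic power: P = alpha_01 * CL * Vdd^2 * f_clk."""
    return alpha_01 * energy_per_transition(c_load, vdd) * f_clk

# Assumed example values: 50 fF node, 5 V supply, activity factor 3/16, 40 MHz clock.
c_load = 50e-15      # F
vdd = 5.0            # V
alpha_01 = 3 / 16    # probability of a 0->1 transition per clock cycle
f_clk = 40e6         # Hz

print(f"Energy/transition: {energy_per_transition(c_load, vdd) * 1e12:.2f} pJ")
print(f"Average power:     {average_switching_power(alpha_01, c_load, vdd, f_clk) * 1e6:.2f} µW")
```

The same two functions make the quadratic dependence on Vdd obvious: halving the supply voltage cuts both the energy per transition and the average power by a factor of four, before any change in clock rate.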
Slide 11 — Static 2-Input NOR Gate
[Figure: static CMOS NOR gate with inputs A and B driving load capacitance CL]
Truth table: A=0, B=0 → Out=1; A=0, B=1 → Out=0; A=1, B=0 → Out=0; A=1, B=1 → Out=0
• Assume: prob(A=1) = 1/2 and prob(B=1) = 1/2
• Then: prob(Out=1) = 1/4
• prob(0→1) = prob(Out=0) · prob(Out=1) = 3/4 × 1/4 = 3/16
• α(0→1) = 3/16

Slide 12 — Type of Logic Style: Static vs. Dynamic
[Figure: static NOR gate vs. dynamic NOR gate precharged by CLK, each driving load capacitance CL]
• Static NOR: α(0→1) = 3/16
• Dynamic NOR: power is dissipated whenever Out = 0 (the node must be precharged again), so α(0→1) = N0 / N = 3/4

Slide 13 — "Dynamic" or Glitching Activity in CMOS
[Figure: 16-bit ripple-carry adder (Add0 ... Add15 producing S0 ... S15) and the sum output voltages S1, S10, S15 vs. time (ns), showing spurious transitions]
• α(0→1) can be > 1 due to glitching!

Slide 14 — Glitch Reduction Using Balanced Paths
[Figure: inputs A0–A7 combined into output F by a ripple (chain) structure vs. a balanced (tree) structure; balancing path delays so inputs arrive together reduces spurious transitions]

Slide 15 — Switching Activity and Capacitance Minimization
• Gated clocks (disable all modules not in use each cycle)
• Block enables (enable only those modules using a bus)
• Instruction buffer (0th-level cache)
• Add stop and sleep instructions to the instruction set
• Minimum-size busses
• Minimize I/O — use on-chip memory

Slide 16 — Minimum Supply Voltage
[Figure: normalized delay vs. Vdd in a 2.0 µm technology for a multiplier, clock generator, ring oscillator, microcoded DSP chip, and adder (measured and SPICE)]
• Td = CL · Vdd / I, with I ~ (Vdd − Vt)²
• Td(Vdd = 1.5) / Td(Vdd = 5) = [1.5 · (5 − 0.7)²] / [5 · (1.5 − 0.7)²] ≈ 8, i.e. roughly 8 times slower at 1.5 V
• Velocity saturated devices: I ~ (Vdd − Vt), so the penalty is not quite so bad
• Lowering Vdd reduces energy but increases delays; the critical difference is the amount above Vt

Slide 17 — Architecture Trade-offs: Reference Datapath
[Figure: latched adder followed by a comparator (A + B compared against C); area = 636 × 833 µm²]
• Critical path delay ⇒ Tadder + Tcomparator (= 25 ns)
• fref = 40 MHz
• Total capacitance being switched = Cref
• Vdd = Vref = 5 V
• Power for the reference datapath: Pref = Cref · Vref² · fref
• From [Chandrakasan92] (IEEE JSSC)

Slide 18 — Parallel Datapath
[Figure: two adder/comparator datapaths operating in parallel, each at half rate, with a multiplexer recombining the results; area = 1476 × 1219 µm²]
• The clock rate can be reduced by half with the same throughput ⇒ fpar = fref / 2
• Vpar = Vref / 1.7, Cpar = 2.15 · Cref
• Ppar = (2.15 · Cref) · (Vref / 1.7)² · (fref / 2) ≈ 0.36 · Pref
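The Slide 16 delay ratio and the Slide 18 power ratio can be reproduced with a few lines of Python. This is a rough sketch of the slides' first-order models (Td ∝ Vdd / (Vdd − Vt)² and P = C · V² · f), using the 0.7 V threshold and the 1.7× voltage reduction quoted above; it is illustrative arithmetic, not a circuit simulation.

```python
def relative_delay(vdd, vt=0.7):
    """First-order gate delay from the slides: Td ~ CL * Vdd / I with I ~ (Vdd - Vt)^2."""
    return vdd / (vdd - vt) ** 2

# Slide 16: slowdown when dropping the supply from 5 V to 1.5 V (Vt = 0.7 V).
slowdown = relative_delay(1.5) / relative_delay(5.0)
print(f"Td(1.5 V) / Td(5 V) ≈ {slowdown:.1f}")   # ≈ 8.7, i.e. roughly 8x slower

# Slide 18: parallel datapath power relative to the reference, P = C * V^2 * f.
c_par, v_par, f_par = 2.15, 1 / 1.7, 0.5          # relative to Cref, Vref, fref
p_par = c_par * v_par ** 2 * f_par
print(f"Ppar / Pref ≈ {p_par:.2f}")
# ≈ 0.37 with Vpar = Vref / 1.7 exactly; the slide uses Vpar = 2.9 V and quotes ≈ 0.36.
```

The point the slides make is visible directly in the code: running two copies at half the rate leaves the switched capacitance per result roughly doubled, but the quadratic voltage term more than pays for it.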
Slide 19 — The More Parallel the Better??
[Figure: normalized power vs. Vdd at fixed throughput, marking the minimal-area and minimal-power design points]
• Capacitance overhead starts to dominate at "high" levels of parallelism and results in an optimum voltage

Slide 20 — Pipelined Datapath
[Figure: adder and comparator separated by a pipeline latch; area = 640 × 1081 µm²]
• Critical path delay is less ⇒ max[Tadder, Tcomparator]
• Keeping the clock rate constant: fpipe = fref
• Voltage can be dropped ⇒ Vpipe = Vref / 1.7
• Capacitance slightly higher: Cpipe = 1.15 · Cref
• Ppipe = (1.15 · Cref) · (Vref / 1.7)² · fref ≈ 0.39 · Pref

Slide 21 — Architecture Summary
Architecture type                               Voltage  Area  Power
Simple datapath (no pipelining or parallelism)  5 V      1     1
Pipelined datapath                              2.9 V    1.3   0.39
Parallel datapath                               2.9 V    3.4   0.36
Pipeline-Parallel datapath                      2.0 V    3.7   0.2

Slide 22 — A Simple Memory Architecture
• Serial access: the cell array is read one 4-bit nibble at a time through a mux and latch clocked at f, feeding a 4-bit display interface; requires Voltage = 3 V
• Parallel access: eight nibbles are read in parallel into latches clocked at f/8 and recombined by a mux at f; allows Voltage = 1.1 V

Slide 23 — Proposed CPU Architecture: LP-ARM
[Block diagram: ARM core with instruction cache, data cache, write buffers, bus/DMA controller, bus I/O, clock generator (fCLK, VDD, ∆V, fref), interrupt controller (Int[7:0]), and processor state bus I/O; external Mem[N:0], Add[31:0], Data[31:0], and Abort/Rst/Complete signals]
• The ARM core is the dominant energy consumer

Slide 24 — LP-ARM: Energy Estimation
• Instruction cache (8 kB): low-power SRAM, 2 kByte block = 78 pJ [Burstein]; instruction cache design ≈ 150 pJ
• Clock generation: global & external 50 pF line = 70 pJ; total clock generation ≈ 100 pJ
• ARM core: register file + ALU + shifter > 50% of total [ARM, Burd]; register file 30 pJ, ALU 24 pJ, shifter 16 pJ; total core ≈ 140 pJ

Slide 25 — Research Goal
• 10 MIPS at 1 nJ/inst. (10 mW) ⇔ 80 MIPS at 9 nJ/inst. (720 mW)
[Block diagram: DC-DC converter (100 pJ), LP-ARM CPU (500 pJ), processor bus (<< 100 pJ), I/O interface (100 pJ), 0.5 MB SRAM in 8 ICs (300 pJ)]
• Improve energy efficiency by an order of magnitude
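To tie the Slide 25 numbers together, the sketch below sums the block-level energy estimates into an energy-per-instruction figure and converts the two quoted design points into average power. The grouping is illustrative: the processor-bus contribution is given only as "<< 100 pJ", so it is treated as negligible here, and the per-block values are the slides' own approximations.

```python
# Per-instruction system energy budget (pJ), from the Slide 25 block diagram.
budget_pj = {
    "DC-DC converter": 100,
    "LP-ARM CPU": 500,
    "processor bus": 0,          # quoted only as "<< 100 pJ"; assumed negligible
    "I/O interface": 100,
    "0.5 MB SRAM (8 ICs)": 300,
}
total_nj = sum(budget_pj.values()) / 1000
print(f"Energy per instruction ≈ {total_nj:.1f} nJ")   # ≈ 1 nJ/inst

def power_mw(mips, nj_per_inst):
    """Average power: P = (instructions/s) * (energy/instruction)."""
    return mips * 1e6 * nj_per_inst * 1e-9 * 1e3

# The two end points quoted on Slide 25.
print(f"10 MIPS x 1 nJ/inst -> {power_mw(10, 1):.0f} mW")
print(f"80 MIPS x 9 nJ/inst -> {power_mw(80, 9):.0f} mW")
```

The arithmetic shows why the research goal is stated as an order-of-magnitude improvement: a ~1 nJ/instruction budget at 10 MIPS yields about 10 mW, versus roughly 720 mW for the 9 nJ/instruction, 80 MIPS reference point.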