Ram K. Krishnamurthy Senior Principal Engineer

High Performance Energy Efficient Near Threshold Circuits: Challenges and Opportunities
2012 MICRO Near Threshold Computing Workshop Keynote, December 2012

Ram K. Krishnamurthy, Senior Principal Engineer
Circuits Research Lab, Circuits & Systems Research, Intel Labs
Intel Corporation, Hillsboro, OR 97124, USA
[email protected]

Acknowledgements: Intel Circuits Research Lab, Vivek De, Rick Forand, Wen-Hann Wang, Shekhar Borkar, Greg Taylor, IPR Bangalore Design Lab, Stefan Rusu, Jim Held

Era of Tera-scale Computing
Teraflops of performance operating on Terabytes of data
[Figure: performance (KIPS, MIPS, GIPS, TIPS) versus dataset size (kilobytes, megabytes, gigabytes, terabytes), spanning single-core, multi-core, and tera-scale. Workload labels: text; multimedia; 3D & video; personal media creation and management; entertainment, learning and virtual travel; financial analytics; health; model-based apps (recognition, mining, synthesis).]

Tera-scale Platform Vision
[Figure: platform block diagram. Labels: special purpose engines, integrated IO devices, caches, scalable on-die interconnect fabric, last level caches, integrated memory controllers, off-die interconnect, socket inter-connect, high bandwidth memory, IO.]

Silicon Process Technology Innovation
[Figure: process roadmap. 65nm (2005), 45nm (2007), 32nm (2009), 22nm (2011), 14nm (2013*), 10nm (2015*), 7nm (2017*), 2019+. Stages marked manufacturing, development, and research; Hi-K and Tri-Gate marked. *projected]
Process innovation leads to energy efficient performance and predictable 2-year technology cycles

22nm Performance and Energy Scaling
[Figure not recovered. Source: M. Bohr, Intel Developer Forum 2012]

Silicon Integration Providing Greater End-User Value
• More transistors/area: enables substantial system-on-chip integration opportunities

Extreme Scale (Exa-Scale) Computing Research
• 2W - 100 GigaFLOPS
• 20MW - ExaFLOPS
• 10 year goal: ~300X improvement in energy efficiency
• Equal to 20 pJ/FLOP at the system level
Source: J. Rattner, ISCA 2012 Keynote

Ultra Low Power Graphics/Video & Security Circuits
10-100X higher performance/watt vs. GP cores (Intel ISSCC, VLSI 2008-2012)
[Figure: flexibility vs. energy-efficiency (GOPS/W), source: ISSCC. Classes: microprocessors, DSPs, dedicated HW, with "More flexible" and "More efficient" arrows and 10x and 100x efficiency steps. Plotted designs include P4, x86, PPC, MUD, Alpha, Sparc, Sparc1, Sparc2, Itanium, PPC770, PPC970, PPC1-SOI, PPC2-SOI, Cell-SPE, SA-DSP, Fuj-DSP, Fuj-Multi, NEC-DSP, KAIST-DSP, Hitachi-DSP, MPEG2, 802.11a, Encrypt, Video ME. Panel labels: SIMD Vector, AES Encryption, SIMD Permutation.]
DSP functions are highly throughput-oriented and amenable to parallelism/pipelining
⇒ Better power-performance optimization
⇒ Optimal partitioning of tasks between GP processor and dedicated engines

Specialized HW Accelerators for Exa-Scale
• General purpose cores, special-purpose accelerators, interconnect fabric
• Efficient, adaptive, reconfigurable, resilient
• Low-power general-purpose core
• SP HW accelerators
• Fixed function vs. limited programmability
• Operation over wide supply voltage range (near-threshold to nominal)

NTV Operation & Energy Efficiency
[Figure: 65nm CMOS, 50°C. Maximum frequency (MHz), total power (mW), energy efficiency (GOPS/Watt), and active leakage power (mW), each versus supply voltage (0.2 V to 1.4 V). Annotations: 9.6X, 320mV, subthreshold region. Source: H. Kaul, R. Krishnamurthy et al, ISSCC 2008]
• Frequency reduces almost linearly first, then exponentially
• Total power reduces by three to four orders of magnitude
• Leakage power reduces by two to three orders of magnitude
• Energy efficiency improves by one order of magnitude at NTV
• Energy efficiency reduces in subthreshold operation

NTV Across Technology Generations
Sources: H. Kaul et al., ISSCC 2009; A. Agarwal et al., ISSCC 2010; S. K. Hsu et al., ISSCC 2012
[Figure: energy efficiency and active leakage power versus supply voltage (V) in three panels: 45nm CMOS at 50°C (32b multiply, 16b SIMD multiply, 72b add; normalized energy efficiency), 22nm CMOS at 50°C (register file, permute crossbar; energy efficiency in GOPS/W), and a 32nm CMOS reconfigurable fabric at 50°C (energy efficiency in TOPS/W). Sub-threshold regions are marked. Annotations include 5.7x, 8X, 9x, 300mV, 340mV, 1.1V, and 0.8mW.]
NTV operation improves energy efficiency across 45nm-22nm CMOS

NTV Opportunities for Wide Dynamic Range
[Figure not recovered. Source: T. Thakkar, Intel Developer Forum 2012]

ATOM™ 32nm SOC V/F Islands
SoC integration of many unrelated functions in their own power 'islands'.
• On-die voltage regulation leading to power 'islands' that can have different voltage levels.
• Power management that shuts functional units off.
• Voltage-frequency pairs: CPUs can run at several operating points, where the power supply is adjusted to reduce power while various functional blocks are kept at constant voltage:
  - lowest frequency: 100 - 600MHz
  - medium frequency: 700 - 1500MHz
  - burst frequency: 1600 - 2500MHz
• Off-chip drivers have to support various voltage levels, whereas the controller logic is powered by a lower voltage:
  - LPDDR: 1.25V
  - MIPI-display: 1.25V
  - HDMI-display: 3.3V
  - SD cards: 2.85V
  - GPIO: 1.25V, 1.80V
[Figure: SoC floorplan. Labels: DDR, GPIO, south complex, audio, security, CPU, EMMC, NC, 2D/3D graphics, PLLs/clocks, video, image signal processor, display, HDMI, MIPI.]
Source: T. Thakkar, Medfield, Intel Developer Forum 2012

NTV Opportunities for Converged Core
[Figure not recovered. Source: T. Piazza, Intel Developer Forum 2012]

Impact of Variation on NTV
[Figure: relative frequency and frequency spread (0% to 60%, circuit vulnerability to 5% noise) versus relative Vdd, for +/- 5% variation in Vdd or Vt, with Vdd scaling towards threshold.]
$\text{frequency} \propto (V_{dd} - V_t)^{\alpha} / V_{dd}$
5% variation in Vt or Vdd results in up to 50% variation in circuit performance

Variation Modeling & Measurements
[Figure, left: Monte-Carlo simulations, 65nm CMOS, 50°C. Normalized distribution versus normalized frequency, showing frequency variation across fast-slow dies: ±18% at 1.2V, ±2X at 320mV.]
[Figure, right: 65nm CMOS typical die measurements. Maximum frequency (MHz) versus supply voltage (V), showing frequency variation across 0-110°C: ±5% at 1.2V, ±2X at 320mV.]
Source: H. Kaul, R. Krishnamurthy et al, ISSCC 2008
Monte-Carlo simulations:
• 18% nominal frequency spread
• 2X spread at NTV
65nm CMOS measurements:
• 5% nominal spread due to temperature
• 2X spread at NTV

Using Vdd to Compensate for Variation
[Figure: frequency (MHz) versus temperature (0, 50, 110 °C) for a typical 65nm CMOS die at 320mV, and frequency (MHz) versus process skew (slow, typical, fast) for 65nm CMOS at 320mV, 50°C. 23MHz is marked on both panels.]
• Adaptive Voltage Compensation for variation tolerance
• Adjust supply voltage to maintain constant performance
• ±50mV adjustment about 320mV: nominal 23MHz performance sustained across 0-110°C and fast-slow skews

Subthreshold Leakage at NTV
[Figure: SD leakage power as a percentage (0% to 60%) across 45nm, 32nm, 22nm, 14nm, 10nm, 7nm, and 5nm, for supply voltages of 40%, 50%, 75%, and 100% Vdd. Annotation: increasing variations.]
• NTV operation reduces total power, improves energy efficiency
• Subthreshold leakage power is a substantial portion of the total

Low Voltage SRAM and Register File
• 6T SRAM suffers in stability and yield at NTV
• 6T SRAM cell with larger transistors
• 8T/10T SRAM for improved stability and yield
• Variation tolerant register file for NTV
[Figure: conventional dual-ended (DE) write cell (write failure due to strong P and weak N) and dual-ended transmission gate (DETG) write cell. Signals: wrbl, wrbl#, rdbl.]
Source: S. Hsu, R. Krishnamurthy et al, ISSCC 2012

Low Voltage Latches and Flip-flops
Designing flip-flops for NTV; averaging with vector flip-flops
[Figure: flip-flop schematics. Labels: upsized, non-minimum channel length, shared min-sized clock drivers, Ck, D, Q.]
• Vmin improves by 175 mV
• Hold time margin by 7 to 30%

Low Voltage Logic: Multiplexers & Gates
Designing multiplexers for NTV
• Issue: large off-current paths, weak on-current paths
• One-hot 4:1 vs. encoded 4:1
• Up to 3X reduction in worst case static droop
Transmission gates, logic gates
• Body effect
• Avoid series connected transmission gates
• Logic fan in limited to 3 stack

Low Voltage Level Converters
CVSL level converter between a low voltage circuit block and a high voltage circuit block: significant energy consumed in contention currents
[Figure: level shifter schematics. Labels: VCC LOW, VCC MID, VCC HIGH, CVSL stage, IN, OUT.]
Source: H. Kaul, R. Krishnamurthy et al, ISSCC 2009
Two-stage cascaded split-output level shifter:
• CVSL split into two stages to reduce contention current
• Decoupled output for smaller CVSL devices
• 20% energy reduction
Ultra-low voltage split-output level shifter:
• Decoupled output from CVSL
• Interrupts contention
• Vmin improved by 125 mV

Soft Errors and Reliability
[Figure, left: impact of NTV on soft error rate. n-SER/cell (sea-level) versus voltage (V) for 250nm, 180nm, 130nm, 90nm, and 65nm.]
[Figure, right: latch and memory soft error rate relative to 130nm versus technology (180, 130, 90, 65, 45, 32 nm), assuming 2X bit/latch count increase per generation.]
• Soft error/bit reduces each generation
• Soft error at the system level will continue to increase
Positive impact of NTV on reliability:
• Low V, lower E fields; low power, lower temperature
• Device aging effects mitigated
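
The frequency relation on the "Impact of Variation on NTV" slide, and the supply adjustment described under "Using Vdd to Compensate for Variation", can be explored numerically. The following is a minimal Python sketch, not the model behind the slides: the threshold voltage (0.30 V), the exponent alpha (1.5), the function names, and the bisection search are all illustrative assumptions, and the relation itself is only meaningful for Vdd above Vt, so it says nothing about subthreshold behaviour.

```python
def relative_frequency(vdd, vt, alpha=1.5):
    """Frequency in arbitrary units from f ∝ (Vdd - Vt)^alpha / Vdd."""
    if vdd <= vt:
        raise ValueError("relation only applies for vdd > vt")
    return (vdd - vt) ** alpha / vdd


def frequency_shift(vdd, vt, vt_variation, alpha=1.5):
    """Fractional frequency change when Vt moves by the fraction vt_variation."""
    nominal = relative_frequency(vdd, vt, alpha)
    shifted = relative_frequency(vdd, vt * (1.0 + vt_variation), alpha)
    return (shifted - nominal) / nominal


def compensating_vdd(target, vt, alpha=1.5, vdd_max=2.0, tol=1e-9):
    """Bisect for the Vdd at which relative_frequency equals target.

    Relies on the relation increasing monotonically with Vdd above Vt,
    which holds for alpha >= 1.
    """
    if relative_frequency(vdd_max, vt, alpha) < target:
        raise ValueError("target not reachable below vdd_max")
    lo, hi = vt, vdd_max  # frequency tends to 0 at lo and is >= target at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if relative_frequency(mid, vt, alpha) < target:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    VT, ALPHA = 0.30, 1.5  # assumed values, not taken from the slides

    # Sensitivity to a +/-5% Vt shift at a nominal and a near-threshold supply.
    for vdd in (1.2, 0.4):
        slow = frequency_shift(vdd, VT, +0.05, ALPHA)
        fast = frequency_shift(vdd, VT, -0.05, ALPHA)
        print(f"Vdd = {vdd:.2f} V: slow die {slow:+.1%}, fast die {fast:+.1%}")

    # Supply needed for a slow die (Vt +5%) to match the typical die at 0.4 V.
    target = relative_frequency(0.4, VT, ALPHA)
    restored = compensating_vdd(target, VT * 1.05, ALPHA)
    print(f"compensating Vdd for the slow die: {restored:.3f} V")
```

With these assumed parameters the same ±5% threshold shift moves frequency far more at the near-threshold supply than at the nominal one, which is the qualitative point of the variation slide. The measured ±50mV adjustment about 320mV quoted in the slides comes from silicon and is not something this simple relation should be expected to reproduce.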