A Microcode-Based Control Unit for Deep Learning Processors

Total Page:16

File Type:pdf, Size:1020Kb

A Microcode-Based Control Unit for Deep Learning Processors RAW2020, May 18-19, 2020 Kyutech A Microcode-based Control Unit for Deep Learning Processors Qian Zhao1 ([email protected]) , Yasuhiro Nakahara2, Motoki Amagasaki2, Masahiro Iida2, and Takaichi Yoshida1 1 Kyushu Institute of Technology, 2 Kumamoto University Japan Introduction A fixed ISA-based controller may limit the flexibility of DPU Instruction mem Control store (ISA-based Program) (Horizontal microcode) SW HW We propose a Control Unit microcode-Based (Instruction to control unit, which Once the control Control Unit control sequence) allows control logic logic is hard-wired, (Soft-sequencer) to Be determined By we cannot change it software. for feature upgrade, bug fix, etc. Processing Unit Processing Unit (PEs, Data mem, …) (PEs, Data mem, …) ISA architecture Our proposal Provide full control over the DPU in a writable firmware, and thus delivers high flexibility and shorter dev. life cycle. 1 ISA: Instruction set architecture, DPU: Deep learning processing unit DPU architecture using our approach DPU Microcode MEM Soft-sequencer-based Control Unit - SW-defined control sequence for entire Fixed datapath system - Cycle-accurate Pre- & Post- Processing signal generation processing logics block array (PB array) Weight MEM Feature MEM Datapath Control signals Microcode-based control unit is suitable for DPU because: • DPUs have fewer instructions than general-purpose processors. • The control flow and data flow of DPUs are much less complex and more predictable than those of general-purpose processors. 2 Soft-sequencer control unit design Soft-sequencer structure Microcode format Control store Address Microcode Address Microcode PC CUSTOM WAIT generation controller controller {pc[14:0],1'b0} unit (AGU) CLK Output latches RSTL PURGE Ctrl signals from Address Custom ctrl microcode signals signals Soft-sequencer fetches a microcode and generates control signal sequences according to the binary status saved in it. • Simple flow control like loops are implemented in the PC controller, allowing the size of the microcode to be significantly shortened. • The AGU holds and computes addresses in user memory. • A designer can add a custom controller to generate signals by combining other control signals. 3 State transition of PC and AGU design PC=PC-1 Sequential execution INST_WAIT RSTL & WAIT=1 Waiting PURGE WAIT=0 PC=0 PC=PC+1 INST_JUMP CNT= Initial Normal INST_COUNTER Loop execution PC=INST_PC PC state Loop CNT=1 transition diagram CNT>1 PC=PC+1 CNT=CNT-1 INST_JUMP ADDR=INST_ADD R RSTL INST_ADDR_WE Set with PURGE Microcode • In our DPU design, three addresses for ADDR=0 ADDR=ADDR+ !INST_ADDR_WEINST_ADDR_BAK INST_ADDR &ADDR_BAK==0 feature memory write, feature memory !INST_ADDR_BAK Initial Normal read, and weight memory read are INST_ADDR_BAK !INST_ADDR_BA&ADDR_BAK!=0 generated By the AGU. K ADDR_BAK=ADDR • FlexiBle address sequence generation, Address like: AUG state backup • 1, 2, 3, 4 transition diagram • 1, 3, 5, 7 ADDR=ADDR_BAK ADDR_BAK = 0 • 1, 2, 3, 4, 2, 3, 4, 5 4 (using Backup & restore) Restore from backup Implementation case study FPGA implementation Resource utilization • Device: ADM-PCIE-KU3 • DPU: 32*32 PB array • Freq.: 100MHz • Power: 1.255W in total, control unit(<1%), control store(<3%) ASIC implementation • Process: TSMC 40-nm Processing • DPU: 64*64 PB array block array • Freq.: 350MHz • Area: control unit (5%), control store (5.2%) Weight MEM Weight Feature MEM Feature • Power: 0.516W in total, control unit(<0.1%), Soft-sequencer & control store(5.2%) Control store logic Control store Microprogramming DPU layout • Demo DNN: 4 CNN layers, 2 FC layers, CIFAR-10 task • Lines-of-code: 759 lines of microprogram 5 Conclusion • Problem: A fixed ISA-based controller may limit the flexibility of DPU. • Approach: We propose a microcode-based control unit, which allows control logic to be determined by SW. • Pros: Full control over the DPU in a writable firmware, and thus delivers high flexibility and shorter dev. life cycle. • Result: We evaluate the proposed control unit using two DPU design cases implemented by FPGA and ASIC. The results show that high flexibility of the proposed control unit can be achieved without obvious performance overhead. 6.
Recommended publications
  • Microcode Revision Guidance August 31, 2019 MCU Recommendations
    microcode revision guidance August 31, 2019 MCU Recommendations Section 1 – Planned microcode updates • Provides details on Intel microcode updates currently planned or available and corresponding to Intel-SA-00233 published June 18, 2019. • Changes from prior revision(s) will be highlighted in yellow. Section 2 – No planned microcode updates • Products for which Intel does not plan to release microcode updates. This includes products previously identified as such. LEGEND: Production Status: • Planned – Intel is planning on releasing a MCU at a future date. • Beta – Intel has released this production signed MCU under NDA for all customers to validate. • Production – Intel has completed all validation and is authorizing customers to use this MCU in a production environment.
    [Show full text]
  • 18-447 Computer Architecture Lecture 6: Multi-Cycle and Microprogrammed Microarchitectures
    18-447 Computer Architecture Lecture 6: Multi-Cycle and Microprogrammed Microarchitectures Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 1/28/2015 Agenda for Today & Next Few Lectures n Single-cycle Microarchitectures n Multi-cycle and Microprogrammed Microarchitectures n Pipelining n Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, … n Out-of-Order Execution n Issues in OoO Execution: Load-Store Handling, … 2 Reminder on Assignments n Lab 2 due next Friday (Feb 6) q Start early! n HW 1 due today n HW 2 out n Remember that all is for your benefit q Homeworks, especially so q All assignments can take time, but the goal is for you to learn very well 3 Lab 1 Grades 25 20 15 10 5 Number of Students 0 30 40 50 60 70 80 90 100 n Mean: 88.0 n Median: 96.0 n Standard Deviation: 16.9 4 Extra Credit for Lab Assignment 2 n Complete your normal (single-cycle) implementation first, and get it checked off in lab. n Then, implement the MIPS core using a microcoded approach similar to what we will discuss in class. n We are not specifying any particular details of the microcode format or the microarchitecture; you can be creative. n For the extra credit, the microcoded implementation should execute the same programs that your ordinary implementation does, and you should demo it by the normal lab deadline. n You will get maximum 4% of course grade n Document what you have done and demonstrate well 5 Readings for Today n P&P, Revised Appendix C q Microarchitecture of the LC-3b q Appendix A (LC-3b ISA) will be useful in following this n P&H, Appendix D q Mapping Control to Hardware n Optional q Maurice Wilkes, “The Best Way to Design an Automatic Calculating Machine,” Manchester Univ.
    [Show full text]
  • The Von Neumann Computer Model 5/30/17, 10:03 PM
    The von Neumann Computer Model 5/30/17, 10:03 PM CIS-77 Home http://www.c-jump.com/CIS77/CIS77syllabus.htm The von Neumann Computer Model 1. The von Neumann Computer Model 2. Components of the Von Neumann Model 3. Communication Between Memory and Processing Unit 4. CPU data-path 5. Memory Operations 6. Understanding the MAR and the MDR 7. Understanding the MAR and the MDR, Cont. 8. ALU, the Processing Unit 9. ALU and the Word Length 10. Control Unit 11. Control Unit, Cont. 12. Input/Output 13. Input/Output Ports 14. Input/Output Address Space 15. Console Input/Output in Protected Memory Mode 16. Instruction Processing 17. Instruction Components 18. Why Learn Intel x86 ISA ? 19. Design of the x86 CPU Instruction Set 20. CPU Instruction Set 21. History of IBM PC 22. Early x86 Processor Family 23. 8086 and 8088 CPU 24. 80186 CPU 25. 80286 CPU 26. 80386 CPU 27. 80386 CPU, Cont. 28. 80486 CPU 29. Pentium (Intel 80586) 30. Pentium Pro 31. Pentium II 32. Itanium processor 1. The von Neumann Computer Model Von Neumann computer systems contain three main building blocks: The following block diagram shows major relationship between CPU components: the central processing unit (CPU), memory, and input/output devices (I/O). These three components are connected together using the system bus. The most prominent items within the CPU are the registers: they can be manipulated directly by a computer program. http://www.c-jump.com/CIS77/CPU/VonNeumann/lecture.html Page 1 of 15 IPR2017-01532 FanDuel, et al.
    [Show full text]
  • A 4.7 Million-Transistor CISC Microprocessor
    Auriga2: A 4.7 Million-Transistor CISC Microprocessor J.P. Tual, M. Thill, C. Bernard, H.N. Nguyen F. Mottini, M. Moreau, P. Vallet Hardware Development Paris & Angers BULL S.A. 78340 Les Clayes-sous-Bois, FRANCE Tel: (+33)-1-30-80-7304 Fax: (+33)-1-30-80-7163 Mail: [email protected] Abstract- With the introduction of the high range version of parallel multi-processor architecture. It is used in a family of the DPS7000 mainframe family, Bull is providing a processor systems able to handle up to 24 such microprocessors, which integrates the DPS7000 CPU and first level of cache on capable to support 10 000 simultaneously connected users. one VLSI chip containing 4.7M transistors and using a 0.5 For the development of this complex circuit, a system level µm, 3Mlayers CMOS technology. This enhanced CPU has design methodology has been put in place, putting high been designed to provide a high integration, high performance emphasis on high-level verification issues. A lot of home- and low cost systems. Up to 24 such processors can be made CAD tools were developed, to meet the stringent integrated in a single system, enabling performance levels in performance/area constraints. In particular, an integrated the range of 850 TPC-A (Oracle) with about 12 000 Logic Synthesis and Formal Verification environment tool simultaneously active connections. The design methodology has been developed, to deal with complex circuitry issues involved massive use of formal verification and symbolic and to enable the designer to shorten the iteration loop layout techniques, enabling to reach first pass right silicon on between logical design and physical implementation of the several foundries.
    [Show full text]
  • Embedded Multi-Core Processing for Networking
    12 Embedded Multi-Core Processing for Networking Theofanis Orphanoudakis University of Peloponnese Tripoli, Greece [email protected] Stylianos Perissakis Intracom Telecom Athens, Greece [email protected] CONTENTS 12.1 Introduction ............................ 400 12.2 Overview of Proposed NPU Architectures ............ 403 12.2.1 Multi-Core Embedded Systems for Multi-Service Broadband Access and Multimedia Home Networks . 403 12.2.2 SoC Integration of Network Components and Examples of Commercial Access NPUs .............. 405 12.2.3 NPU Architectures for Core Network Nodes and High-Speed Networking and Switching ......... 407 12.3 Programmable Packet Processing Engines ............ 412 12.3.1 Parallelism ........................ 413 12.3.2 Multi-Threading Support ................ 418 12.3.3 Specialized Instruction Set Architectures ....... 421 12.4 Address Lookup and Packet Classification Engines ....... 422 12.4.1 Classification Techniques ................ 424 12.4.1.1 Trie-based Algorithms ............ 425 12.4.1.2 Hierarchical Intelligent Cuttings (HiCuts) . 425 12.4.2 Case Studies ....................... 426 12.5 Packet Buffering and Queue Management Engines ....... 431 399 400 Multi-Core Embedded Systems 12.5.1 Performance Issues ................... 433 12.5.1.1 External DRAMMemory Bottlenecks ... 433 12.5.1.2 Evaluation of Queue Management Functions: INTEL IXP1200 Case ................. 434 12.5.2 Design of Specialized Core for Implementation of Queue Management in Hardware ................ 435 12.5.2.1 Optimization Techniques .......... 439 12.5.2.2 Performance Evaluation of Hardware Queue Management Engine ............. 440 12.6 Scheduling Engines ......................... 442 12.6.1 Data Structures in Scheduling Architectures ..... 443 12.6.2 Task Scheduling ..................... 444 12.6.2.1 Load Balancing ................ 445 12.6.3 Traffic Scheduling ...................
    [Show full text]
  • The Central Processor Unit
    Systems Architecture The Central Processing Unit The Central Processing Unit – p. 1/11 The Computer System Application High-level Language Operating System Assembly Language Machine level Microprogram Digital logic Hardware / Software Interface The Central Processing Unit – p. 2/11 CPU Structure External Memory MAR: Memory MBR: Memory Address Register Buffer Register Address Incrementer R15 / PC R11 R7 R3 R14 / LR R10 R6 R2 R13 / SP R9 R5 R1 R12 R8 R4 R0 User Registers Booth’s Multiplier Barrel IR Shifter Control Unit CPSR 32-Bit ALU The Central Processing Unit – p. 3/11 CPU Registers Internal Registers Condition Flags PC Program Counter C Carry IR Instruction Register Z Zero MAR Memory Address Register N Negative MBR Memory Buffer Register V Overflow CPSR Current Processor Status Register Internal Devices User Registers ALU Arithmetic Logic Unit Rn Register n CU Control Unit n = 0 . 15 M Memory Store SP Stack Pointer MMU Mem Management Unit LR Link Register Note that each CPU has a different set of User Registers The Central Processing Unit – p. 4/11 Current Process Status Register • Holds a number of status flags: N True if result of last operation is Negative Z True if result of last operation was Zero or equal C True if an unsigned borrow (Carry over) occurred Value of last bit shifted V True if a signed borrow (oVerflow) occurred • Current execution mode: User Normal “user” program execution mode System Privileged operating system tasks Some operations can only be preformed in a System mode The Central Processing Unit – p. 5/11 Register Transfer Language NAME Value of register or unit ← Transfer of data MAR ← PC x: Guard, only if x true hcci: MAR ← PC (field) Specific field of unit ALU(C) ← 1 (name), bit (n) or range (n:m) R0 ← MBR(0:7) Rn User Register n R0 ← MBR num Decimal number R0 ← 128 2_num Binary number R1 ← 2_0100 0001 0xnum Hexadecimal number R2 ← 0x40 M(addr) Memory Access (addr) MBR ← M(MAR) IR(field) Specified field of IR CU ← IR(op-code) ALU(field) Specified field of the ALU(C) ← 1 Arithmetic and Logic Unit The Central Processing Unit – p.
    [Show full text]
  • Demystifying Internet of Things Security Successful Iot Device/Edge and Platform Security Deployment — Sunil Cheruvu Anil Kumar Ned Smith David M
    Demystifying Internet of Things Security Successful IoT Device/Edge and Platform Security Deployment — Sunil Cheruvu Anil Kumar Ned Smith David M. Wheeler Demystifying Internet of Things Security Successful IoT Device/Edge and Platform Security Deployment Sunil Cheruvu Anil Kumar Ned Smith David M. Wheeler Demystifying Internet of Things Security: Successful IoT Device/Edge and Platform Security Deployment Sunil Cheruvu Anil Kumar Chandler, AZ, USA Chandler, AZ, USA Ned Smith David M. Wheeler Beaverton, OR, USA Gilbert, AZ, USA ISBN-13 (pbk): 978-1-4842-2895-1 ISBN-13 (electronic): 978-1-4842-2896-8 https://doi.org/10.1007/978-1-4842-2896-8 Copyright © 2020 by The Editor(s) (if applicable) and The Author(s) This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book’s Creative Commons license, unless indicated otherwise in a credit line to the material.
    [Show full text]
  • Pep8cpu: a Programmable Simulator for a Central Processing Unit J
    Pep8CPU: A Programmable Simulator for a Central Processing Unit J. Stanley Warford Ryan Okelberry Pepperdine University Novell 24255 Pacific Coast Highway 1800 South Novell Place Malibu, CA 90265 Provo, UT 84606 [email protected] [email protected] ABSTRACT baum [5]: application, high-order language, assembly, operating This paper presents a software simulator for a central processing system, instruction set architecture (ISA), microcode, and logic unit. The simulator features two modes of operation. In the first gate. mode, students enter individual control signals for the multiplex- For a number of years we have used an assembler/simulator for ers, function controls for the ALU, memory read/write controls, Pep/8 in the Computer Systems course to give students a hands-on register addresses, and clock pulses for the registers required for a experience at the high-order language, assembly, and ISA levels. single CPU cycle via a graphical user interface. In the second This paper presents a software package developed by an under- mode, students write a control sequence in a text window for the graduate student, now a software engineer at Novell, who took the cycles necessary to implement a single instruction set architecture Computer Organization course and was motivated to develop a (ISA) instruction. The simulator parses the sequence and allows programmable simulator at the microcode level. students to single step through its execution showing the color- Yurcik gives a survey of machine simulators [8] and maintains a coded data flow through the CPU. The paper concludes with a Web site titled Computer Architecture Simulators [9] with links to description of the use of the software in the Computer Organiza- papers and internet sources for machine simulators.
    [Show full text]
  • Reverse Engineering X86 Processor Microcode
    Reverse Engineering x86 Processor Microcode Philipp Koppe, Benjamin Kollenda, Marc Fyrbiak, Christian Kison, Robert Gawlik, Christof Paar, and Thorsten Holz, Ruhr-University Bochum https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/koppe This paper is included in the Proceedings of the 26th USENIX Security Symposium August 16–18, 2017 • Vancouver, BC, Canada ISBN 978-1-931971-40-9 Open access to the Proceedings of the 26th USENIX Security Symposium is sponsored by USENIX Reverse Engineering x86 Processor Microcode Philipp Koppe, Benjamin Kollenda, Marc Fyrbiak, Christian Kison, Robert Gawlik, Christof Paar, and Thorsten Holz Ruhr-Universitat¨ Bochum Abstract hardware modifications [48]. Dedicated hardware units to counter bugs are imperfect [36, 49] and involve non- Microcode is an abstraction layer on top of the phys- negligible hardware costs [8]. The infamous Pentium fdiv ical components of a CPU and present in most general- bug [62] illustrated a clear economic need for field up- purpose CPUs today. In addition to facilitate complex and dates after deployment in order to turn off defective parts vast instruction sets, it also provides an update mechanism and patch erroneous behavior. Note that the implementa- that allows CPUs to be patched in-place without requiring tion of a modern processor involves millions of lines of any special hardware. While it is well-known that CPUs HDL code [55] and verification of functional correctness are regularly updated with this mechanism, very little is for such processors is still an unsolved problem [4, 29]. known about its inner workings given that microcode and the update mechanism are proprietary and have not been Since the 1970s, x86 processor manufacturers have throughly analyzed yet.
    [Show full text]
  • Control Unit Operation
    PART SIX: THE CONTROL UNIT CHAPTER 19 CONTROL UNIT OPERATION 19.1 MICRO-OPERATIONS ............................................................... 3 The Fetch Cycle ...................................................................... 5 The Indirect Cycle................................................................... 8 The Interrupt Cycle ................................................................. 9 The Execute Cycle................................................................... 9 The Instruction Cycle............................................................. 12 19.2 CONTROL OF THE PROCESSOR ............................................... 13 Functional Requirements........................................................ 13 Control Signals ..................................................................... 16 A Control Signals Example ..................................................... 19 Internal Processor Organization .............................................. 23 The Intel 8085...................................................................... 24 19.3 HARDWIRED IMPLEMENTATION .............................................. 30 Control Unit Inputs ............................................................... 30 Control Unit Logic ................................................................. 33 19.4 RECOMMENDED READING ...................................................... 35 19.5 KEY TERMS, REVIEW QUESTIONS, AND PROBLEMS ................... 35 Key Terms ..........................................................................
    [Show full text]
  • IBM Z Connectivity Handbook
    Front cover IBM Z Connectivity Handbook Octavian Lascu John Troy Anna Shugol Frank Packheiser Kazuhiro Nakajima Paul Schouten Hervey Kamga Jannie Houlbjerg Bo XU Redbooks IBM Redbooks IBM Z Connectivity Handbook August 2020 SG24-5444-20 Note: Before using this information and the product it supports, read the information in “Notices” on page vii. Twentyfirst Edition (August 2020) This edition applies to connectivity options available on the IBM z15 (M/T 8561), IBM z15 (M/T 8562), IBM z14 (M/T 3906), IBM z14 Model ZR1 (M/T 3907), IBM z13, and IBM z13s. © Copyright International Business Machines Corporation 2020. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Notices . vii Trademarks . viii Preface . ix Authors. ix Now you can become a published author, too! . xi Comments welcome. xi Stay connected to IBM Redbooks . xi Chapter 1. Introduction. 1 1.1 I/O channel overview. 2 1.1.1 I/O hardware infrastructure . 2 1.1.2 I/O connectivity features . 3 1.2 FICON Express . 4 1.3 zHyperLink Express . 5 1.4 Open Systems Adapter-Express. 6 1.5 HiperSockets. 7 1.6 Parallel Sysplex and coupling links . 8 1.7 Shared Memory Communications. 9 1.8 I/O feature support . 10 1.9 Special-purpose feature support . 12 1.9.1 Crypto Express features . 12 1.9.2 Flash Express feature . 12 1.9.3 zEDC Express feature . 13 Chapter 2. Channel subsystem overview . 15 2.1 CSS description . 16 2.1.1 CSS elements .
    [Show full text]
  • Introduction to Microcoded Implementation of a CPU Architecture
    Introduction to Microcoded Implementation of a CPU Architecture N.S. Matloff, revised by D. Franklin January 30, 1999, revised March 2004 1 Microcoding Throughout the years, Microcoding has changed dramatically. The debate over simple computers vs complex computers once raged within the architecture community. In the end, the most popular microcoded computers survived for three reasons - marketshare, technological improvements, and the embracing of the principles used in simple computers. So the two eventually merged into one. To truly understand microcoding, one must understand why they were built, what they are, why they survived, and, finally, what they look like today. 1.1 Motivation Strictly speaking, the term architecture for a CPU refers only to \what the assembly language programmer" sees|the instruction set, addressing modes, and register set. For a given target architecture, i.e. the architecture we wish to build, various implementations are possible. We could have many different internal designs of the CPU chip, all of which produced the same effect, namely the same instruction set, addressing modes, etc. The different internal designs could then all be produced for the different models of that CPU, as in the familiar Intel case. The different models would have different speed capabilities, and probably different prices to the consumer. But the same machine languge program, say a .EXE file in the Intel/DOS case, would run on any CPU in the family. When desigining an instruction set architecture, there is a tradeoff between software and hardware. If you provide very few instructions, it takes more instructions to perform the same task, but the hardware can be very simple.
    [Show full text]