The 2014 MICRO Test of Time Award Winners: From 1978 to 1992

Awards

ONUR MUTLU, Carnegie Mellon University
RICH BELGARD

As you may know, the International Symposium on Microarchitecture (MICRO)—the flagship microarchitecture conference, and a premier computer architecture conference for nearly five decades—selected 10 papers as recipients of the first set of MICRO Test of Time (ToT) Awards in December 2014. We announced the winning papers and described the selection process in the March/April 2015 issue of IEEE Micro.1 The authors of these 10 distinguished papers were invited to write short retrospectives to reflect on their work, which was done at least 20 years ago. This issue features retrospectives written by the original coauthors of two of the award-winning papers. We briefly introduce these papers and retrospectives, and we hope that you will enjoy reading them as much as we have.

The first retrospective is for the oldest paper that won the 2014 MICRO ToT Award. “Microprogrammed Implementation of a Single Chip Microprocessor” by Skip Stritter and Nick Tredennick was published in MICRO 1978.2 It introduced the idea of a two-level control store (textbook material in computer architecture today) with the goal of minimizing the chip real estate dedicated to the control logic used in microprogrammed processor designs, and in particular the memory used to store the microinstructions. The two-level control store is essentially two carefully codesigned programmable logic arrays (PLAs) that together comprise more compact storage for microinstructions than a single monolithic control store. It was born from the necessity to “maximize the contribution of every transistor spent” (to quote Tredennick’s retrospective) in the design of the Motorola MC68000 processor. The paper also described in detail the microprogrammed control logic implementation of a single-chip microarchitecture, based on the MC68000 experience. In his retrospective, Tredennick describes his experience at Motorola that led to this paper and discusses his subsequent experiences in industry, which were partially shaped by his involvement with the MC68000. He also muses about the connection between design and design automation processes, which makes the retrospective a fun historical perspective and a delightful read for the IEEE Micro audience.
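To make the space argument concrete: the savings come from factoring. Most micro-addresses can share a small set of wide control words, so the first level holds only a short pointer (plus sequencing bits) into a second, much smaller table of unique wide words. The C sketch below illustrates just that factoring; every size and field width is invented for the example, and the paper itself realizes the two levels as codesigned PLAs rather than the plain lookup tables modeled here.

#include <stdint.h>
#include <stdio.h>

/* Illustrative two-level control store. All sizes below are made up
   for the example; they are not the MC68000's actual parameters. */
#define MICRO_ENTRIES 1024            /* level 1: one entry per micro-address    */
#define NANO_ENTRIES   256            /* level 2: unique wide control words only */
#define CONTROL_WIDTH   72            /* bits in one full control word           */

/* Level 1: narrow words, each holding sequencing information plus an
   index into level 2. Level 2: the wide control words themselves.    */
static uint16_t micro_store[MICRO_ENTRIES];
static uint8_t  nano_store[NANO_ENTRIES][(CONTROL_WIDTH + 7) / 8];

/* Fetch the wide control word driven during one microcycle. */
static const uint8_t *control_word(unsigned micro_pc) {
    unsigned nano_index = micro_store[micro_pc] & (NANO_ENTRIES - 1u);
    return nano_store[nano_index];
}

int main(void) {
    /* Rough storage comparison: monolithic store vs. factored store. */
    unsigned long monolithic = (unsigned long)MICRO_ENTRIES * CONTROL_WIDTH;
    unsigned long two_level  = (unsigned long)MICRO_ENTRIES * 16
                             + (unsigned long)NANO_ENTRIES * CONTROL_WIDTH;
    printf("monolithic: %lu bits  two-level: %lu bits\n", monolithic, two_level);
    return (control_word(0) != NULL) ? 0 : 1;
}

With these made-up parameters the factored organization needs roughly half the bits of a monolithic store; the actual benefit on a chip depends on how much sharing the microcode exhibits, which is exactly the regularity the paper exploits.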
The second retrospective is for one of the youngest papers that won the 2014 MICRO ToT Award. “Code Generation Schema for Modulo Scheduled Loops,” authored by Bob Ramakrishna Rau, Michael S. Schlansker, and P.P. Tirumalai, was published in MICRO 1992.3 The paper provides a “recipe book” (to quote Schlansker) that discusses and enumerates code-generation choices for correctly and efficiently optimizing instruction schedules of loops for various architectures, including very long instruction word (VLIW) and superscalar. The paper covers architectures incorporating a varying set of features for loop schedule optimization, using the notions of software pipelining and modulo scheduling. The work is based on the authors’ extensive (about a decade long) experience in hardware/software codesign for realizing Cydrome’s Cydra 5 processor. Yet, the paper’s loop code scheduling strategies apply far beyond the extensive architectural support provided by the Cydra 5 for loop scheduling purposes, as Schlansker’s retrospective beautifully describes.
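For readers new to the terminology: modulo scheduling starts a new loop iteration every II cycles (the initiation interval), so an operation placed at cycle c claims its resource in kernel slot c mod II of every repetition. The toy C sketch below shows only that modulo reservation table idea for a single, hypothetical resource class, ignoring dependences and latencies; the paper's contribution is the full set of code-generation schemas (prologue and epilogue versus kernel-only code, predication, and related hardware support) built around such schedules.

#include <stdio.h>

/* Toy modulo reservation table: one resource class, dependences and
   latencies ignored. The operation and unit counts are made up.      */
#define NUM_OPS   6
#define NUM_UNITS 2

int main(void) {
    /* Resource-constrained lower bound on the initiation interval (II). */
    int ii = (NUM_OPS + NUM_UNITS - 1) / NUM_UNITS;

    int table[16][NUM_UNITS] = {0};   /* rows 0..ii-1 are the kernel slots */
    int cycle_of[NUM_OPS];

    for (int op = 0; op < NUM_OPS; op++) {
        int placed = 0;
        /* An op issued at cycle c occupies its unit in kernel row c % ii. */
        for (int cycle = 0; !placed; cycle++) {
            int row = cycle % ii;
            for (int unit = 0; unit < NUM_UNITS && !placed; unit++) {
                if (!table[row][unit]) {
                    table[row][unit] = 1;
                    cycle_of[op] = cycle;
                    placed = 1;
                }
            }
        }
    }

    printf("II = %d\n", ii);
    for (int op = 0; op < NUM_OPS; op++)
        printf("op %d -> cycle %d (kernel row %d)\n",
               op, cycle_of[op], cycle_of[op] % ii);
    return 0;
}

With six operations and two units, the resource-constrained II is 3, so the printed kernel repeats every three cycles.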
As we conclude, we would like to take the opportunity to pay tribute to the extremely valuable impact that Bob Rau has had in our field, especially in the development of compiler technology and VLIW processors, as well as hardware/software cooperation in instruction-level parallelism. It has been 13 years since Bob died, but his impact is wonderfully felt in the compiler technology commonly in use today, along with the many clearly articulated technical articles he contributed to academic literature. His works are taught in many modern compiler and computer architecture classes today. Bob was one of the most prominent contributors to MICRO for decades, and our selection of the 1992 article, for which he was the primary driver (according to Schlansker’s retrospective), as part of the first set of MICRO ToT Awards points to the technical excellence and value of insight he upheld as a leading member of our community. We hope these two key values continue to thrive as microarchitecture/architecture and hardware/software codesign become even more important with fundamental challenges threatening the large improvements obtained from the scaling of the underlying circuit and device technologies.

References

1. O. Mutlu and R. Belgard, “Introducing the MICRO Test of Time Awards: Concept, Process, 2014 Winners, and the Future,” IEEE Micro, vol. 35, no. 2, 2015, pp. 85–87.
2. S. Stritter and N. Tredennick, “Microprogrammed Implementation of a Single Chip Microprocessor,” Proc. 11th Ann. Workshop Microprogramming, 1978, pp. 8–16.
3. B.R. Rau, M.S. Schlansker, and P.P. Tirumalai, “Code Generation Schema for Modulo Scheduled Loops,” Proc. 25th Ann. Int’l Symp. Microarchitecture, 1992, pp. 158–169.

Onur Mutlu is the Strecker Early Career Professor at Carnegie Mellon University. Contact him at [email protected].

Rich Belgard is an independent consultant for computer manufacturers, software companies, and investor groups and an expert and consultant to law firms. Contact him at [email protected].

Evolution of Microprocessor Logic Design

NICK TREDENNICK, Jonetix

In the summer of 1977, I was teaching as an assistant professor at the University of Texas in Austin when Tom Gunter walked into my office and introduced himself. He asked if I’d like to work for Motorola on a microprocessor design project. My areas of expertise were computers and logic design, and I had a little experience with microprocessor applications, but no experience with microprocessor design or semiconductor design. Nevertheless, there was mutual interest and I took a job with Motorola beginning in September 1977. The project was a next-generation microprocessor design called MACS (Motorola Advanced Computer System). Motorola’s previous microprocessor designs had been 8-bit accumulator-based designs suitable for embedded applications; MACS was to be a 16-/32-bit design more suitable for computer applications. Tom said he eventually wanted me to work on the design of the on-chip cache, but that he first needed me to begin work on the microprocessor’s logic design “until we find a competent logic designer.” Of course, that never happened, and I spent my time doing the logic design for what became the MC68000.

I began looking for books and articles on microprocessor logic design. I was unable to find documentation for any microprocessor logic design methods. That seemed odd, given that the Design Automation Conference was already 14 years old in 1977. Just what processes were all those software engineers automating? Well, OK, I’d have to make up the design process as I went along.

At the time, the biggest differences between computer design and microprocessor design were in the constraints placed on the microprocessor’s designer. The microprocessor’s entire design had to fit on a single power-, pin-, and transistor-constrained, size-limited silicon chip. All of the microprocessor’s computing resources (data registers, address registers, program counter, and arithmetic units), interrupt logic, interface logic (pin and external bus control), and control logic had to fit inside the transistor, area, and power budget. Since we began by doubling or more than doubling the width of the data and address registers and the arithmetic units, as well as substantially increasing the number of data and address registers compared to an 8-bit accumulator-based design, we quickly ate into the transistor-budget increases provided by our move to the next most advanced semiconductor process. The consequence of these decisions was that, in the implementation of the control logic, we had to maximize the contribution of every transistor spent.

In 1977, moving to the next semiconductor process meant designers had somewhat more than twice the number of transistors enabled by the previous-generation […]

[…] sequences. One instruction decoder pointed to the operand address calculation sequence and a second pointed to the required operation sequence. The address calculation sequence computed the operand address and sent a request […]

[…] First, I decided to make a list of the problematic design decisions that I had made during the project, so that I could avoid those errors in the next design. Second, because available design tools did not support […]