The Design Space of Register Renaming Techniques in Superscalar Processors

Dezső Sima
Kandó Polytechnic, Institute of Informatics, Budapest

Abstract

In this paper we first survey techniques which are used or proposed to cope with data dependencies. Then we focus on register renaming, a technique that has appeared in virtually all recent superscalars to boost performance. To elucidate this complex issue we next give an overview of the rename process and identify the design space of feasible techniques used to rename registers in superscalars. We discuss the main dimensions of the design space, and in each dimension we indicate possible design choices. Finally, we identify and discuss basic alternatives and possible implementation schemes of register renaming.

Keywords: Register Renaming, Register Rename Techniques, False Data Dependency Elimination, Microarchitecture of Superscalar Processors, Design Space.

Introduction

Instruction-level parallel execution (ILP) is constrained both by control dependencies and by data dependencies. Control dependencies, caused by conditional control transfer instructions, call for a serialization of instructions at conditional forks and joins along the control flow graph of the program. Data dependencies, on the other hand, arising between the operands of subsequent instructions, enforce serialization of the affected instructions1-3.

The standard technique to cope with control dependencies is speculative branch processing. Pioneered as early as the 1960s4,5, speculative branch processing was first studied and employed around 19806-10, and finally came into widespread use in advanced pipelined microprocessors in the early 1990s. Superscalars have inherited this technique and have considerably refined the speculation techniques used3,11,12.

Data dependencies may appear either between operands of subsequent instructions in a straight-line code segment or between operands of instructions belonging to subsequent loop iterations. In straight-line code, read-after-write (RAW), write-after-read (WAR) or write-after-write (WAW) dependencies may be encountered (see box "Types of data dependencies").
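To make the three dependency classes concrete, consider the following minimal C sketch; the variables r1 to r8 merely stand in for registers, and the instruction sequence is a hypothetical example, not taken from the paper.

    #include <stdio.h>

    int main(void)
    {
        int r2 = 2, r3 = 3, r5 = 5, r6 = 6, r7 = 7, r8 = 8;
        int r1, r4;

        r1 = r2 + r3;  /* I1: writes r1, reads r2, r3                  */
        r4 = r1 * 2;   /* I2: reads r1;  RAW (true dependency) on I1   */
        r2 = r5 - r6;  /* I3: writes r2; WAR (anti dependency) on I1   */
        r1 = r7 + r8;  /* I4: writes r1; WAW (output dependency) on I1 */

        /* Only the RAW dependency I1 -> I2 reflects a real producer-
           consumer relation. The WAR and WAW conflicts exist solely
           because r1 and r2 are reused; they vanish if I3 and I4 write
           to renamed (fresh) registers, which is what register renaming
           does in hardware. */
        printf("%d %d %d %d\n", r1, r2, r3, r4);
        return 0;
    }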
To tackle data dependencies occurring either between register operands or between memory operands, various static, dynamic or hybrid techniques have been introduced. Of these, Figure 1 summarizes the dynamic techniques which are used in, or have been proposed for, superscalars to cope with data dependencies in straight-line code.

[Figure 1: Major dynamic techniques to cope with data dependencies in straight-line code (assuming a load/store architecture). For register operands (RAW, WAR, WAW) the figure lists instruction shelving, result bypassing, register renaming, three-operand FP instructions, value prediction and value reuse; for memory operands it lists load bypassing, memory renaming, in-order execution of stores, out-of-order and speculative execution of loads and stores, store forwarding, speculative store forwarding, load value prediction and load value reuse. Each technique is marked as a standard technique, an emerging technique or a research topic.]

Instruction shelving (also known as indirect issue or dynamic instruction scheduling) removes the issue bottleneck which arises in the direct issue mode, when executable instructions are issued directly to the execution units. In that mode, all kinds of data and control dependencies, as well as busy execution units, block instruction issue. The main idea of shelving is to decouple instruction issue from dependency checking. In this scheme instructions are first issued into shelves, basically without any dependency checks. Shelved instructions are subsequently checked for dependencies, and executable instructions are forwarded to free execution units in a separate processing step, called dispatching; see box "The principle of instruction shelving". Here we note that the above distinction between the terms instruction issue and instruction dispatch is not common in the literature, and both terms are used in either interpretation.

Next we focus on techniques used to cope with data dependencies occurring between register operands. Register renaming (or renaming for short) is a widely used technique to remove false data dependencies (WAR and WAW dependencies) between the register operands of subsequent instructions in straight-line code. In the following sections of our paper we give a detailed review of the various techniques used to implement register renaming.
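As a first flavour of the mechanism reviewed there, the following minimal C sketch shows one common implementation idiom: a mapping table plus a free list of physical registers. This is just one point in the design space discussed later; the names and sizes are illustrative, and register reclamation at instruction completion is omitted.

    #include <stdio.h>

    #define ARCH_REGS  8    /* architectural registers (hypothetical ISA) */
    #define PHYS_REGS 16    /* physical registers backing them            */

    static int map_table[ARCH_REGS];  /* architectural -> physical mapping */
    static int free_list[PHYS_REGS];  /* pool of unassigned physical regs  */
    static int free_head = 0, free_tail = 0;

    /* Rename one instruction "rd = op(rs1, rs2)": translate the sources
       through the current mapping, then give the destination a fresh
       physical register. Earlier instructions keep using the old mapping
       of rd, so WAR and WAW dependencies on rd disappear. */
    static void rename(int rd, int rs1, int rs2)
    {
        int ps1 = map_table[rs1];
        int ps2 = map_table[rs2];
        int pd  = free_list[free_head++];  /* allocate; no reclamation here */
        map_table[rd] = pd;
        printf("r%d = op(r%d, r%d)  ->  p%d = op(p%d, p%d)\n",
               rd, rs1, rs2, pd, ps1, ps2);
    }

    int main(void)
    {
        for (int i = 0; i < ARCH_REGS; i++) map_table[i] = i;  /* identity */
        for (int p = ARCH_REGS; p < PHYS_REGS; p++) free_list[free_tail++] = p;

        rename(1, 2, 3);  /* I1: r1 is mapped to p8                         */
        rename(4, 1, 2);  /* I2: reads r1 as p8; the RAW chain is preserved */
        rename(1, 5, 6);  /* I3: r1 remapped to p10; WAW with I1 eliminated */
        return 0;
    }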
Assuming that the processor removes false data dependencies between register operands by renaming and counters control dependencies by speculative branch processing, only RAW dependencies (and resource constraints) can keep instructions from being executed in parallel. In this case the processor executes the instructions held in the shelves (in other words, in the dispatch window) according to the dataflow principle. This means that, given renaming and speculative branch processing, basically only producer-consumer relations give rise to serialization constraints; they constitute the hard limit of parallelization, called the dataflow limit. As a consequence, much research work has been done, and is currently being carried out, on different aspects of this problem. Below we review two techniques which are already used in a number of superscalars to attack the dataflow limit, and two further techniques which are still in the research stage.

Result bypassing is now a standard technique employed to lessen the impediments of RAW dependencies between instructions. Result bypassing means forwarding generated results from the outputs of the execution units immediately back to their inputs, omitting a register write and a register read in the producer-consumer chain.

Three-operand floating-point instructions eliminate the RAW dependencies arising when the dot product (A*B+C) is calculated in floating-point code using traditional two-operand instructions. Introduced mainly in the first half of the 1990s in the POWER13, PowerPC14, PA-RISC 2.015 and MIPS IV16 instruction set architectures, three-operand floating-point multiply-add instructions (also called multiply-and-accumulate or fused multiply-add instructions) allow the dot product to be executed immediately, without causing any RAW dependency.

In order to exceed the dataflow limit related to register operands, two techniques have been the focus of recent investigations: (i) value prediction and (ii) value reuse.

With value prediction17-19 (also called data value speculation) a guess is first made as to whether the result of an operation is predictable, based on the execution history. If it is considered predictable, the processor may execute the dependent instruction, using the predicted source operand value, in parallel with the instruction which generates the required result. Once the producer instruction has been executed, the processor checks whether the speculative result is actually correct. If it is, the processor validates the computation. If not, a recovery mechanism is activated to undo the faulty execution and resume execution using the correct result.

With value reuse20-24 (memoing, dynamic instruction reuse) the processor lessens the aftermath of RAW dependencies by removing repeated complex computations (first of all divide and square-root calculations). The basic idea is as follows. The processor stores the relevant components of complex computations carried out on register operands (the source operands, the operation and the result) in an appropriate operation cache (a dynamic look-up table). Thus a number of previous computations is held in a sliding operation window. If a new complex computation is to be performed on register operands, the processor checks whether it has already executed the same computation before. If it has, the processor reuses the result of the previous computation held in the operation cache instead of calculating it anew; a sketch of such an operation cache is given at the end of this section.

Now we turn to memory operands, where we restrict our discussion to load/store architectures. This is justified by the fact that even superscalar processors with an underlying memory (CISC) architecture usually employ an internal CISC/RISC conversion25-31. Assuming a load/store architecture, only load and store instructions access the memory, so memory data dependencies are restricted to dependencies between load and store instructions. Dependencies arise if a load and a store address the same memory location. Beyond shelving, the following techniques are used or have been suggested to cope with memory data dependencies.

WAR and WAW memory dependencies are usually countered by executing stores in order. In this case the results conveyed by the store instructions are written into the memory (into the cache) in program order, so neither WAW nor WAR dependencies can arise. On the other hand, there are three main approaches to address RAW dependencies between load and store instructions: (i) load bypassing (the underlying store-queue check is sketched at the end of this section), (ii) out-of-order execution of loads (this actually includes a number of techniques), and (iii) avoiding memory RAW dependencies
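The operation cache behind value reuse can be pictured as a small look-up table indexed by the operation and its source operands. The following minimal C sketch memoizes divide results; the table size, the hash, and the restriction to division are illustrative assumptions, not details taken from the cited proposals.

    #include <stdio.h>
    #include <string.h>

    #define OP_DIV  0
    #define ENTRIES 64   /* toy direct-mapped operation cache */

    typedef struct {
        int    valid;
        int    op;
        double src1, src2;
        double result;
    } OpCacheEntry;

    static OpCacheEntry op_cache[ENTRIES];

    static unsigned index_of(int op, double a, double b)
    {
        /* crude hash over the operation and the operand bit patterns */
        unsigned long long ha, hb;
        memcpy(&ha, &a, sizeof ha);
        memcpy(&hb, &b, sizeof hb);
        return (unsigned)((ha ^ (hb >> 1) ^ (unsigned long long)op) % ENTRIES);
    }

    static double divide(double a, double b)
    {
        OpCacheEntry *e = &op_cache[index_of(OP_DIV, a, b)];
        if (e->valid && e->op == OP_DIV && e->src1 == a && e->src2 == b)
            return e->result;           /* hit: reuse, skip the long divide */
        e->valid = 1; e->op = OP_DIV;   /* miss: compute and remember       */
        e->src1 = a; e->src2 = b;
        e->result = a / b;
        return e->result;
    }

    int main(void)
    {
        printf("%f\n", divide(355.0, 113.0));  /* computed */
        printf("%f\n", divide(355.0, 113.0));  /* reused   */
        return 0;
    }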
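Load bypassing, in turn, presupposes that the address of a load is checked against all older, still-pending stores. The following minimal C sketch shows such a check, assuming a simple program-order store queue; the names, sizes and the conservative stall policy are illustrative assumptions rather than any particular processor's scheme.

    #include <stdio.h>

    #define QSIZE 8
    #define ADDR_UNKNOWN (~0u)   /* store address not yet computed */

    typedef struct { unsigned addr; int data; } StoreEntry;

    static StoreEntry store_q[QSIZE];  /* older stores, in program order */
    static int q_len = 0;

    static void queue_store(unsigned addr, int data)
    {
        store_q[q_len].addr = addr;
        store_q[q_len].data = data;
        q_len++;
    }

    /* Returns 1 if the load may bypass all queued (older) stores. */
    static int load_may_bypass(unsigned load_addr)
    {
        for (int i = 0; i < q_len; i++) {
            if (store_q[i].addr == ADDR_UNKNOWN)
                return 0;  /* possible RAW: conservatively wait           */
            if (store_q[i].addr == load_addr)
                return 0;  /* definite RAW: wait, or forward              */
                           /* store_q[i].data (store forwarding)          */
        }
        return 1;          /* load is independent of all older stores     */
    }

    int main(void)
    {
        queue_store(0x1000, 42);
        printf("%d\n", load_may_bypass(0x2000)); /* 1: may bypass          */
        printf("%d\n", load_may_bypass(0x1000)); /* 0: address match (RAW) */
        queue_store(ADDR_UNKNOWN, 7);
        printf("%d\n", load_may_bypass(0x2000)); /* 0: unknown address     */
        return 0;
    }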