The Design Space of Register Renaming Techniques in Superscalar Processors

Dezső Sima
Kandó Polytechnic, Institute of Informatics, Budapest

Abstract

In this paper we first survey techniques that are used or have been proposed to cope with data dependencies. We then focus on register renaming, a technique that has appeared in virtually all recent superscalars to boost performance. To elucidate this complex issue, we give an overview of the rename process and identify the design space of feasible techniques used to rename registers in superscalars. We discuss the main dimensions of the design space and indicate the possible design choices in each dimension. Finally, we identify and discuss basic alternatives and possible implementation schemes of register renaming.

Keywords: Register Renaming, Register Rename Techniques, False Data Dependency Elimination, Microarchitecture of Superscalar Processors, Design Space.

Introduction

Instruction-level parallel execution (ILP) is constrained by both control dependencies and data dependencies. Control dependencies, caused by conditional control transfer instructions, call for a serialization of instructions at conditional forks and joins along the control flow graph of the program. Data dependencies, which arise between the operands of subsequent instructions, enforce serialization of the affected instructions [1-3].

The standard technique to cope with control dependencies is speculative branch processing. Pioneered as early as the 1960s [4,5], speculative branch processing was first studied and employed around 1980 [6-10]. It finally came into widespread use in advanced pipelined microprocessors in the early 1990s. Superscalars inherited this technique and made the speculation considerably more sophisticated [3,11,12].
Data dependencies may appear either between operands of subsequent instructions in a straight-line code segment or between operands of instructions belonging to subsequent loop iterations. In straight-line code, read after write (RAW), write after read (WAR) or write after write (WAW) dependencies may be encountered (see box “Types of data dependencies”).

To tackle data dependencies occurring either between register operands or between memory operands, various static, dynamic or hybrid techniques have been introduced. Of these, Figure 1 summarizes the dynamic techniques that are used in, or have been proposed for, superscalars to cope with data dependencies in straight-line code.

[Figure 1 (diagram): data dependencies in straight-line code are divided into dependencies between register operands and dependencies between memory operands, each of type RAW, WAR or WAW. Techniques shown: instruction shelving, result bypassing, three-operand FP instructions, register renaming, value prediction, value reuse, load bypassing, out-of-order and speculative execution of loads and stores, dynamically speculated loads, in-order execution of stores, store forwarding, speculative store forwarding, load value prediction, load value reuse and memory renaming. The legend distinguishes standard techniques, emerging techniques and research topics.]

Figure 1: Major dynamic techniques to cope with data dependencies in straight-line code (assuming a load/store architecture)

Instruction shelving (also known as indirect issue or dynamic instruction scheduling) removes the issue bottleneck that arises in the direct issue mode, in which executable instructions are issued directly to the execution units. In direct issue, all kinds of data and control dependencies, as well as busy execution units, block instruction issue. The main idea of shelving is to decouple instruction issue from dependency checks. In this scheme instructions are first issued into shelves, basically without any dependency checks.
Shelved instructions are subsequently checked for dependencies, and executable instructions are forwarded to free execution units in a separate processing step called dispatching (see box “The principle of instruction shelving”). Note that this distinction between the terms instruction issue and instruction dispatch is not common in the literature, where both terms are used in either sense.

Next we focus on techniques used to cope with data dependencies occurring between register operands. Register renaming (or renaming for short) is a widely used technique to remove false data dependencies (WAR and WAW dependencies) between register operands of subsequent instructions in straight-line code. In the following sections of this paper we give a detailed review of the various techniques used to implement register renaming.

Assuming that the processor removes false data dependencies between register operands by renaming and handles control dependencies by speculative branch processing, only RAW dependencies (and resource constraints) can prevent instructions from being executed in parallel. In this case the processor executes the instructions held in the shelves, in other words in the dispatch window, according to the dataflow principle. This means that, given renaming and speculative branch processing, essentially only producer-consumer relations give rise to serialization constraints. They thereby constitute the hard limit of parallelization, called the dataflow limit. Consequently, much research has been done, and is still being carried out, on different aspects of this problem. Below we review two techniques that are already used in a number of superscalars to attack the dataflow limit, and two further techniques that are still in the research stage.
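To make the distinction between true (RAW) and false (WAR, WAW) register dependencies concrete, the following is a minimal Python sketch, not one of the implementation schemes surveyed in this paper. The instruction format `(dest, [sources])`, the rename register names `p0, p1, ...` and the helper names `classify`, `rename` and `all_deps` are assumptions made for illustration only. The pairwise check is deliberately naive: it ignores intervening writes.

```python
from itertools import count

def classify(earlier, later):
    """Return the register dependencies of `later` on `earlier`.
    Each instruction is a hypothetical (dest, [sources]) tuple."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    deps = set()
    if e_dest in l_srcs:
        deps.add("RAW")   # later reads what earlier writes (true dependency)
    if l_dest in e_srcs:
        deps.add("WAR")   # later overwrites what earlier reads (false)
    if l_dest == e_dest:
        deps.add("WAW")   # both write the same register (false)
    return deps

def rename(program):
    """Allocate a fresh rename register for every result; sources
    are redirected to the most recent mapping of their register."""
    mapping = {}
    fresh = (f"p{i}" for i in count())
    renamed = []
    for dest, srcs in program:
        new_srcs = [mapping.get(s, s) for s in srcs]  # read mappings first
        mapping[dest] = next(fresh)                   # then remap the destination
        renamed.append((mapping[dest], new_srcs))
    return renamed

def all_deps(program):
    """Collect dependencies for every (earlier, later) instruction pair."""
    found = {}
    for j in range(len(program)):
        for i in range(j):
            deps = classify(program[i], program[j])
            if deps:
                found[(i, j)] = deps
    return found

program = [
    ("r1", ["r2", "r3"]),  # i0: r1 <- r2 op r3
    ("r4", ["r1", "r5"]),  # i1: r4 <- r1 op r5  (RAW on i0)
    ("r1", ["r6", "r7"]),  # i2: r1 <- r6 op r7  (WAW with i0, WAR with i1)
    ("r8", ["r1", "r4"]),  # i3: r8 <- r1 op r4  (RAW on i2 and i1)
]

before = all_deps(program)
after = all_deps(rename(program))
print(before)
print(after)
```

In this example the WAW dependency between i0 and i2 and the WAR dependency between i1 and i2 disappear after renaming, and only producer-consumer (RAW) pairs remain, which is the situation the dataflow limit describes.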
Result bypassing is now a standard technique employed to lessen the impediments of RAW dependencies between instructions. Result bypassing means forwarding the generated results from the outputs of the execution units immediately to their inputs, omitting a register write and a register read operation in the producer-consumer chain.

Three-operand floating point instructions eliminate the RAW dependencies that arise when the dot product (A*B+C) is calculated in floating point code using traditional two-operand instructions. Introduced mainly in the first half of the 1990s in the POWER [13], PowerPC [14], PA-RISC 2.0 [15] and MIPS IV [16] instruction set architectures, three-operand floating point multiply-add instructions (also called multiply-and-accumulate or fused multiply-add instructions) allow the dot product to be executed immediately, without causing any RAW dependency.

In order to exceed the dataflow limit related to register operands, two techniques have been the focus of recent investigations: (i) value prediction and (ii) value reuse.

With value prediction [17-19] (also called data value speculation), a guess is first made as to whether the result of an operation is predictable based on the execution history. If it is considered predictable, the processor may execute the dependent instruction using the predicted source operand value, in parallel with the instruction that generates the required result. Once the producer instruction has been executed, the processor checks whether the speculative result is actually correct. If it is, the processor validates the computation. If not, a recovery mechanism is activated to undo the faulty execution and resume execution using the correct result.
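The predict, verify and recover cycle described above can be illustrated with a minimal Python sketch. It assumes one simple possibility, a last-value predictor with a saturating confidence counter per instruction address; the text does not prescribe any particular predictor. The class name, the threshold values and the outcome labels are hypothetical.

```python
class LastValuePredictor:
    """Hypothetical last-value predictor: predicts that an instruction
    produces the same result as last time, once it has been confident."""

    def __init__(self, threshold=2, max_conf=3):
        self.table = {}             # instruction address -> [last value, confidence]
        self.threshold = threshold  # minimum confidence needed to predict
        self.max_conf = max_conf    # saturation limit of the counter

    def predict(self, pc):
        """Return a predicted value, or None if the result is not
        considered predictable from the execution history."""
        entry = self.table.get(pc)
        if entry is not None and entry[1] >= self.threshold:
            return entry[0]
        return None

    def update(self, pc, actual):
        """Record the actual result once the producer has executed."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0]
        elif entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.max_conf)
        else:
            entry[0], entry[1] = actual, 0

def run(pc, actual_values, predictor):
    """Simulate repeated executions of one producer instruction."""
    outcomes = []
    for actual in actual_values:
        guess = predictor.predict(pc)
        if guess is None:
            outcomes.append("no-prediction")  # consumer waits for the producer
        elif guess == actual:
            outcomes.append("validated")      # speculative execution was correct
        else:
            outcomes.append("recovered")      # undo and re-execute with `actual`
        predictor.update(pc, actual)
    return outcomes

outcomes = run(0x40, [7, 7, 7, 7, 9, 9], LastValuePredictor())
print(outcomes)
```

The confidence counter models the first step in the text, the guess as to whether a result is predictable at all. A real design would also have to bound the table size and define the recovery mechanism, both of which are omitted here.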
With value reuse [20-24] (memoing, dynamic instruction reuse), the processor lessens the impact of RAW dependencies by removing repeated complex computations (first of all divide and square root calculations). The basic idea is as follows. The processor stores the relevant components of complex computations carried out with register operands (the source operands, the operation and the result) in an appropriate operation cache (dynamic lookup table). Thus a number of previous computations are held in a sliding operation window. If a new complex computation is to be performed with register operands, the processor checks whether it has already executed the same computation before. If it has, the processor reuses the result of the previous computation held in the operation cache instead of calculating the result anew.

Now we turn to memory operands. Here we restrict our discussion to load/store architectures. This is justified by the fact that even superscalar processors with an underlying memory architecture (CISC architecture) usually employ an internal CISC/RISC conversion [25-31]. Assuming a load/store architecture, only load and store instructions access the memory, so memory data dependencies are restricted to dependencies between load and store instructions. Dependencies arise if they address the same memory location. Beyond shelving, the following techniques are used or suggested to cope with memory data dependencies.

WAR and WAW memory dependencies are usually handled by executing stores in order. In this case the results conveyed by the store instructions are written into the memory (into the cache) in program order, so neither WAW nor WAR dependencies can arise.
On the other hand, there are three main approaches to address RAW dependencies between load and store instructions: (i) load bypassing, (ii) out-of-order execution of loads (this actually includes a number of techniques), and (iii) avoiding memory RAW dependencies
