THE DESIGN SPACE OF REGISTER RENAMING TECHNIQUES

TO BOOST AND SYSTEM PERFORMANCE, VIRTUALLY ALL RECENT

SUPERSCALARS RENAME REGISTERS.

Register renaming is a technique to Register renaming remove false data dependencies—write after The principle of register renaming is read (WAR) and write after write (WAW)— straightforward. If the processor encounters that occur in straight line code between reg- an instruction that addresses a destination reg- ister operands of subsequent instructions.1-3 ister, it temporarily writes the instruction’s By eliminating related precedence require- result into a dynamically allocated rename ments in the execution sequence of the buffer rather than into the specified destina- instructions, renaming increases the average tion register. For instance, in the case of the number of instructions that are available for following WAR dependency: parallel execution per cycle. This results in increased IPC (number of instructions exe- i1: add …, r2, …; [… ← (r2) + (…)] cuted per cycle). i2: mul r2, …, …; [r2 ← (…) ∗ (…)] The identification and exploration of the design space of register-renaming lead to a com- the destination register of i2 (r2) is renamed, prehensive understanding of this intricate tech- say to r33. Then, instruction i2 becomes nique. As this article shows, the design space of register renaming is spanned by four main i2′ : mul r33, …, …; [r33 ← (…) * (…)] Dezsö Sima dimensions: the scope of register renaming, the layout of the rename buffers, the method of Its result is written into r33 instead of into r2. Budapest Polytechnic register mapping, and the rename rate. Rele- This resolves the previous WAR dependency vant aspects of the design space give rise to eight between i1 and i2. In subsequent instructions, basic alternatives for register-renaming. In addi- however, references to source registers must tion, the kind of operand fetch policy signifi- be redirected to the rename buffers allocated cantly affects how the processor carries out the to them as long as this renaming remains rename , which duplicates the eight valid.3 basic alternatives to 16 possible implementa- A precursor to register renaming was intro- tion schemes. The article indicates which basic duced for floating-point instructions in 1967 implementation scheme is used in relevant by Tomasulo in the IBM 360/91,4 a scalar superscalar processors. supercomputer of that time that pioneered As register renaming is usually implement- both pipelining and shelving (dynamic ed in conjunction with shelving, the under- instruction issue). The 360/91 renamed float- lying is assumed to employ ing-point registers to preserve the logical con- shelving. (See the “Instruction shelving prin- sistency of the program execution rather than ciple” box for a discussion of this technique.) to remove false data dependencies.

70 0272-1732/00/$10.00  2000 IEEE Tjaden and Flynn5 first suggested the use exception of Sun’s UltraSparc line. At present, of register renaming for removing false data register renaming is considered to be a stan- dependencies for a limited set of instructions dard feature of performance-oriented super- that corresponds more or less to the load scalar processors. instructions. However, they didn’t use the term “register renaming.” Keller6 introduced Design space of register-renaming this designation in 1975 and extended renam- techniques ing to cover all instructions including a desti- The main dimensions of register renaming nation register. He also described how to are as follows: implement register renaming in processors. Even so, due to the complexity of this tech- • scope of register renaming, nique almost two decades passed after its con- • layout of the renamed registers, ception before register renaming came into widespread use in superscalars at the begin- ning of the 1990s. Instruction shelving principle Early superscalars such as the HP PA 7100, In early superscalars, decoded and executable instructions are issued immediately to the Sun SuperSparc, DEC Alpha 21064, MIPS execution units. However, using this scheme control and data dependencies, and busy exe- R8000, and Intel Pentium typically didn’t use cution units, cause issue bottlenecks. The basic technique used to remove an issue bottle- renaming. Renaming appeared gradually— neck is instruction shelving, also known as dynamic instruction issue.3,35,45 first in a restricted form called partial renam- Shelving presumes the availability of dedicated buffers, called shelving buffers, in front of ing (to be discussed in the next section) —in the execution units. The processor first issues instructions into available shelving buffers with- the early 1990s in the IBM RS/6000 out checking for data or control dependencies, or for busy execution units. As data depen- (Power1), Power2, PowerPC 601, and the dencies or busy execution units no longer restrict instruction issue, the issue bottleneck NextGen Nx586 processors. See Figure 1. Full problem occurring in early superscalars is removed. In a second step, instructions held in the renaming emerged later, beginning in 1992, shelving buffers are dispatched for execution. During dispatching, instructions are checked for first in the high-end models of the IBM main- dependencies, and not-dependent instructions are forwarded to free execution units. frame ES/9000, then in the PowerPC 603. At the time being, there’s no consensus on the use of terms instruction issue and instruc- Subsequently, renaming spread into virtually tion dispatch. Both terms are used in both possible interpretations. all superscalar processors with the notable

Compaq Alpha Alpha 21064 (2) Alpha 21164 (4) (4)7

Motorola MC88000 MC88110 (2)

HP PA PA7100 (2) PA7200 (2)8 PA8000 (4)9 PA82000 (4)10 PA 8500 (4)11 Power1 (4)12 Power2 (6/4)13*** IBM Power P2SC (6/4)16*** (RS/6000) 14* PPC 604 (4)16 * PowerPC PPC 601 (3) PowerPC 17 PPC 620 (4)19* Power3 (4)20 Alliance PPC 603 (3)15* PPC 602 (2) * RISC processors MIPS R R8000 (4) UltraSparc (4) (4)21 R12000 (4)22 PMI (4)23 Sun/Hal Sparc SuperSparc (3) UltraSparc-2 (4) UltraSparc-3 (4) (Sparc64)

Pentium/MMX (2)

25 Intel 80x86 Pentium (2) (3)24 Pentium II (3) Pentium III (3)26

IBM ES ES/9000 (2)

TRON Gmicro Gmicro/500 (2)

Cyrix M M1 (2)28 MII (2)29

CISC processors Motorola MC68000 MC68060 (3)

AMD Nx/K Nx586 (1/3)30** K5 (4)31 K6 (3)32 K7 (3)33

1990 1991 1992 1993 1994 1995 1996 1997 1998 1999

∗PPC designates PowerPC. Partial renaming ∗∗ Full renaming The Nx586 has scalar issue for CISC instructions but a 3-way superscalar core for converted RISC instructions. ∗∗∗The issue rate of the Power2 and P2SC is 6 along the sequential path while only 4 immediately after a branch.

Figure 1. Chronology of register renaming in commercial superscalar processors. The introduction date indicates the first year of volume production. Following the model designation is the issue rate of the processors (in parentheses). Note that for the issue rate of CISC processors, one instruction is considered to be the equivalent of 1.3 to 1.9 RISC instructions.34

SEPTEMBER–OCTOBER 2000 71 RENAMING REGISTERS

Renaming is restricted Renaming comprises fixed-point instructions. to particular all eligible instruction types instruction types Full renaming covers all instructions includ- The indicated superscalar ing a destination register. As Figure 1 demon- Examples are a few line examples begin with the strates, virtually all recent superscalar early superscalar processors: PowerPC 603 (1993) processors employ full renaming. Notewor- PA 7200 (1995) thy exceptions are the Sun UltraSparc line and Power11 (RS/6000, 1990) Pentium Pro (1995) Alpha processors preceding the Alpha 21264. Power22 (1993) R10000 (1996) PowerPC3 601 (1993), and K5 (1995), and Nx5864 (1994). MII (1997). Rename buffer layout Rename buffers establish the actual frame- Most notable exceptions are Alpha processors work for renaming. There are three essential preceding the Alpha 21264 design aspects in their layout: and Sun UltraSparc line • type of rename buffers, (a) (b) • number of rename buffers, and Trend • number of read and write ports.

1 Renames only FP loads. 2 Extends renaming to all FP instructions. 3 Renames only the link and count register. Rename buffer types 4 Since the Nx586 has an integer core, it renames only FX instructions. The choice of which type of rename buffers to use in a processor has far-reaching impact Figure 2. Register renaming scope: partial (a) and full (b). on the implementation of the rename process. Given its importance, designers must consid- er the various design options. To simplify this • method of register mapping and, presentation, I initially assume a common • rename rate architectural register file for all processed data types, and then extend the discussion to the as indicated in subsequent sections. Due to split-register scenario that is commonly volume restrictions we ignore additional employed. dimensions that are related to the renaming As Figure 3 illustrates, there are four fun- process such as recovery from a misprediction. damentally different ways to implement rename buffers. The range of choices includes Scope of register renaming a) using a merged architectural and rename To indicate how extensively the processor , b) employing a stand-alone makes use of renaming, I distinguish between rename register file, c) keeping renamed val- partial and full renaming. Partial renaming is ues either in the reorder buffer (ROB), or d) restricted to one or only a few instruction types, in the shelving buffers. for instance, only to floating-point instructions. In the first approach, rename buffers are Early processors typically employed this incom- implemented along with the architectural reg- plete form of renaming; the Power1 (RS/6000), isters in the same physical register file called Power2, PowerPC 601, and the Nx586 are the merged architectural and rename register examples, as shown in Figure 2. Of these, the file or the merged register file for short. Here, Power1 (RS/6000) renames only floating-point both architectural and rename registers are loads. As the Power1 has only a single floating- dynamically allocated to particular registers point unit, it executes floating-point instruc- of the same physical file. tions in sequence. Thus there’s no need for Each physical register of the merged archi- renaming floating-point register instructions. tectural and rename register file is at any time The Power2 processor introduces multiple in one of four possible states.27 These states floating-point units and for this reason it reflect the actual use of a physical register as extends renaming to all floating-point instruc- follows: tions. The Power PC 601 renames only the link and count register. The Nx586, which has an • uncommitted (available) state, integer core, obviously restricted renaming to • used as an architectural register (archi-

72 IEEE MICRO Merged Holding renamed architectural and Stand-alone Holding renamed values in the rename register file rename register file values in the ROB shelving buffers

Method of operand fetching

Merged Method of rename and Rename Architectural Architectural Shelving Architectural updating the ROB architectural register file register file register file buffers register file program file

Power1 (1990) PowerPC 603 (1993) Am290000 superscalar (1995) Power2 (1993) PowerPC 604 (1995) K5 (1995) ES/9000 (1992) PowerPC 620 (1996) M1 (1995) Nx586 (1994) Power3 (1998) K6 (1997) PMI (Sparc64, 1995) PA 8000 (1996) Pentium Pro (1995) R10000 (1996) PA 8200 (1997) Pentium II (1997) R12000 (1999) PA 8500 (1999) Pentium III (1999) Alpha 21264 (1998)

Figure 3. Generic types of rename buffers. Shaded boxes indicate the rename buffers.

tectural register state), Initialized Entry is allocated (remaining registers) to an issued instruction • used as a rename buffer— but this register doesn’t yet contain the result of RB, the associated instruction Available not valid (rename buffer, not-valid state), and Instruction Architectural register • used as a rename buffer— is canceled Instruction is finished is reclaimed this register already con- tains the result of the associated instruction (rename buffer, valid AR RB, state). valid

During instruction pro- Initialized Instruction cessing, the states of the phys- (first n registers) is completed ical registers are changed as described in the following and Figure 4. State transition diagram of a particular register of the merged architectural and indicated in the state transi- rename register file.27 AR: architectural register, RB: rename buffer. tion diagram in Figure 4. As part of the initialization the first n physical registers are assigned to the and allocated to the concerned destination reg- architectural registers, where n is the number ister. Accordingly, its state is set to the rename of the registers declared by the instruction set buffer, not-valid state, and its valid bit is reset. architecture (ISA). These registers are set to be After the associated instruction finishes execu- in the architectural register (AR) state; the tion, the produced result is written into the allo- remaining physical registers take on the avail- cated rename buffer. Its valid bit is set, and its able state. When an issued instruction includes state changes to rename buffer, valid. Later, a destination register, a new rename buffer is when the associated instruction completes, the needed. For this reason, one physical register is allocated rename buffer will be declared to be selected from the pool of the available registers the architectural register that implements the

SEPTEMBER–OCTOBER 2000 73 RENAMING REGISTERS

Alternatively, renaming can also be based ROB principle on the reorder buffer (ROB); see “ROB prin- The reorder buffer is implemented basically as a circular buffer whose entries are allocated ciple” box. The ROB has recently been wide- and deallocated by means of two revolving pointers.3,37 It operates as follows. ly used to preserve the sequential consistency When instructions are issued, a ROB entry is allocated to each instruction, strictly in pro- of instruction execution. When using a ROB, gram order. Each ROB entry keeps track of the execution status of the associated instruc- an entry is assigned to each issued instruction tion. The ROB allows instructions to complete (commit, retire) only in program order by for the duration of its execution. It’s quite nat- permitting an instruction to complete only if it has finished its execution and all preceding ural to use this entry for renaming as well— instructions are already completed. In this way, instructions update the program state in basically by extending it with a new field that exactly the same way as a sequential processor would have done. After an instruction has will hold the result of that instruction. Exam- completed, the associated ROB entry is deallocated and becomes eligible for reuse. ples of processors using the ROB for renam- ing are the Am29000 superscalar, K5, K6, Pentium Pro, Pentium II, and Pentium III. destination register specified in the just- The ROB can even be extended further to completed instruction. Its state then changes serve as a central shelving buffer as well. In to the architectural register state to reflect this. this case, the ROB is also occasionally desig- Note that in contrast to other rename buffer nated as the DRIS (deferred scheduling reg- types, no data transfer is required for updat- ister renaming instruction shelf). The ing the architectural registers in merged archi- Lightning processor proposal36 and the K6 tectural and rename register files. Instead, only made use of this solution. Because the Light- the status of the related registers needs to be ning proposal—which dates back to the changed. beginning of the 1990s—was too ambitious Finally, when an old architectural register in the light of the technology available at that is reclaimed, it is set to the available state. time, it couldn’t be economically implement- Assuming dispatch-bound operand fetching, ed and never reached the market. a possibility for reclaiming such old architec- The last conceivable implementation alter- tural registers is to keep track of the physical native of rename buffers is to use the shelving registers that have been the previous instances buffers for renaming (see the “Instruction of the same architectural register, and reclaim shelving principle” box again). In this case, the previous instance when the instruction each shelving buffer must be extended func- incorporating the new instance completes. tionally to also perform the task of a rename Also, not-yet-completed instructions must buffer. But this alternative has a drawback be canceled for exceptions or faulty executed resulting from the different deallocation mech- speculative instructions. Then allocated anisms of the shelving and rename buffers. rename buffers in a) the rename buffer, not- While shelving buffers can be reclaimed as valid and b) the rename buffer, valid states are soon as the instruction has been dispatched, deallocated and changed to available states. In rename buffers can be deallocated only at a addition, the corresponding mappings—kept later time, not earlier than the instruction has in either the mapping table or the rename been completed. Thus, a deeper analysis is buffer (as discussed later)—must be canceled. needed to reveal the appropriateness of using Merged architectural and rename register shelving buffers for renaming. To date, no files are employed, for instance, in the high- processor has chosen this alternative. end models (520-based models) of the IBM ES/9000 mainframes, Power and R1x000 Split rename register files processors, and Alpha 21264s. For simplicity’s sake, I’ve so far assumed that All other alternatives separate rename all data types are stored in a common archi- buffers from architectural registers. tectural register file. But usually, processors In the first “separated” variant, a stand- provide distinct architectural register files for alone rename register file (or rename register fixed-point and floating-point data. Conse- file for short) is used exclusively to implement quently, they typically employ distinct rename rename buffers. The PowerPC 603/620 and register files, as shown in Figure 5. PA8x00 processors are examples of using As depicted in this figure, when the proces- rename register files. sor employs the split-register principle, dis-

74 IEEE MICRO Merged FX Merged FP FX FP FX rename FP rename architectural architectural architectural architectural register file register file register file register file register file register file

(a) (b) Figure 5. Using split registers in the case of (a) merged register files, and (b) a stand-alone rename register file. FX indicates fixed point; FP indicates floating point. tinct fixed-point and floating-point register Thus, the maximal number of instructions files are needed for merged files and stand- that may have been issued but not yet com- alone rename register files. In this case separate pleted in the processor (npmax) is data paths are also needed to access the fixed- point and the floating-point registers. npmax = wdw + nEU + nLq + nSq (1) Recent processors typically incorporate split rename registers. When renaming takes where place within the ROB, usually a single mech- anism is maintained for the preservation of wdw is the width of the dispatch window the sequential consistency of instruction exe- (total number of shelving buffers), cution. Then all renamed instructions are nEU is the number of the execution units kept in the same ROB queue despite using that may operate in parallel, split architectural register files for fixed-point nLq is the number of the entries in the load and floating-point data. In this case, clearly, queue, and each ROB entry is expected to be long nSq is the number of the entries in the store enough to hold either fixed-point or floating- queue. point data. Assuming a worst-case design approach, Number of rename buffers from this formula we can determine that the Rename buffers keep register results tem- total number of rename buffers required porarily until instructions complete. By tak- (nrmax) is ing into account that not every instruction produces a register result, we can state that in nrmax = wdw + nEU + nLq, (2) a processor up to as many rename buffers are needed as the maximum number of instruc- since instructions held in the store queue don’t tions that are in execution; that is, issued but require rename buffers. not yet completed. Issued but not yet com- Furthermore, if the processor includes a pleted instructions are either ROB, based on Equation 1 we can say that the total number of ROB entries required • held in shelving buffers waiting for exe- (nROBmax) is cution (if shelving is employed), • just been processed in execution units, nROBmax = npwax. (3) • in the load queue waiting for access (if there is a load queue), or Nevertheless, if the processor has fewer • in the store queue waiting for comple- rename buffers or fewer ROB entries than tion, then forwarded to the cache to exe- expected, according to the worst-case cute the required store operation (if there approach (as given by Equations 2 and 3) issue is a store queue). blockages can occur due to missing free

SEPTEMBER–OCTOBER 2000 75 RENAMING REGISTERS

Table 1. Type and available number of rename buffers in recent superscalars as well as four related parameters of the enlisted processors.

Processor type Type of No. of Width of dispatch Total no. Reorder (year of volume rename rename buffers Issue window of rename buffers width shipment) buffer FX FP rate (wdw)(nr)(nROB) RISC processors PowerPC 603 (1993) Ren. reg. file N/A 4 3 3 N/A 5 PowerPC 604 (1995) Ren. reg. file 12 8 4 12 20 16 PowerPC 620 (1996) Ren. reg. file 8 8 4 15 16 16 Power3 (1998) Ren. reg. file 16 24 4 20 (?) 40 32 R10000 (1996) Merged 32 32 4 48 64 32 R12000 (1998) Merged 32 32 4 48 64 48 Alpha 21264 (1998) Merged 48 41 4 35 89 80 PA 8000 (1986) Ren. reg. file 56 56 4 56 112 56 PM1 (1996) Merged 38 24 4 36 62 62 x86 (CISC) processors Pentium Pro (1995) In the ROB 40 32 201 40 401 Pentium II (1997) In the ROB 40 32 201 40 401 K5 (1995) In the ROB 16 42 111 (?) 16 161 K6 (1996) In the ROB 24 32 241 24 241 M3 (2000 expected) Merged 32 N/A 32 561 N/A 322

1 RISC operations 2 x86 instructions (on average, produce 1.3 to 1.9 RISC operations34) ? Questionable data N/A Not available

rename buffers or ROB entries. With a of rename buffers provided (nr), and the decreasing number of entries provided, we reorder width (nROB). expect a smooth, slight performance degra- As the data in Table 1 indicates, the designs dation. Hence, a stochastic design approach of most processors have taken into account is also feasible. There, the required number of the four interrelations. There are, however, entries is derived from the tolerated level of two obvious exceptions. First, the PowerPC performance degradation. 604 provides 20 rename buffers, more than Based on Equations 1 to 3, the following the processor’s reorder width of 16. In the sub- relations are typically valid concerning the sequent PowerPC 620, Intel decreased the width of the processor’s dispatch window number to 16. Second, the R10000 provides (wdw), the total number of the rename only 32 ROB entries. This number is far too buffers (nr), and the reorder width (nROB), low compared to the dispatch width (48) and which equals the total number of ROB to the number of available rename buffers entries available: (64). MIPS addressed this disproportion in the R12000 by increasing the reorder width wdw < nr ≤ nROB (4) of the processor to 48.

Table 1 summarizes the type and the num- Number of read and write ports ber of rename buffers provided in recent By taking into account current practice, in RISC and x86 superscalar processors. In addi- the following discussion I assume split regis- tion, Table 1 shows four key parameters of ter files. the enlisted processors: the issue rate, width Clearly, as many read ports are required in of the dispatch window (wdw), total number the rename buffers as there are data items that

76 IEEE MICRO the rename buffers may need to supply in any one cycle. Note that rename buffers supply Operand fetch policies required operands for the instructions to be If a processor uses the issue-bound fetch policy, it fetches referenced register operands executed and also forward the results of the during instruction issue—that is, while it forwards decoded instructions into the shelving completed instructions to the addressed archi- buffers.3,11 In contrast, the dispatch-bound fetch policy defers operand fetching until exe- tectural registers. cutable instructions are forwarded from the shelving buffers to the execution units. When a The number of operands that need to be processor fetches issue-bound operands, shelving buffers hold the source operand values. delivered in the same cycle depends first of all In contrast, in dispatch-bound operand fetching, shelving buffers have much shorter entries on whether the processor fetches operands dur- as they contain only the register identifiers. ing instruction issue or during instruction dis- patch. (See the “Operand fetch policies” box.) If operands are fetched issue bound, the 2 × 1 operands for the 2 load-store units). rename buffers need to supply the operands In addition, if rename buffers are imple- for all instructions that are issued into the mented separately from the architectural reg- shelving buffers in the same cycle. Thus, both isters, the rename buffers must forward in each the fixed-point and floating-point rename cycle as many result values to the architectur- buffers are expected to deliver in each cycle all al registers as the completion rate (retire rate) required operands for up to as many instruc- of the processor. Since recent processors usu- tions as the issue rate. This means that in ally complete up to four , recent four-way superscalar processors the this task typically increases the required num- fixed-point and the floating-point rename ber of read ports in the rename buffers by four. buffers typically need to supply 8 and 12 Too many read ports in a register file may operands respectively, assuming up to two unduly increase the physical size of the data fixed-point and three floating-point operands path and consequently the cycle time. To in each fixed-point and floating-point instruc- avoid this problem, a few high-performance tion, respectively. If, however, there are some processors (such as the Power2, Power3, and issue restrictions, the required number of read Alpha 21264) implement two copies of par- ports is decreased accordingly. ticular register files. The Power2 duplicates In contrast, if the processor uses the dis- the fixed-point architectural register file, the patch-bound fetch policy, the rename buffers Power3 doubles both the fixed-point rename should provide the operands for all instruc- and the architectural file, and the Alpha tions that are forwarded from the dispatch 21264 provides two copies of the fixed-point window (instruction window) for execution merged architectural and rename register file. in the same cycle. In this case, the fixed-point As a result, fewer read ports are needed in rename buffers need to supply the required each of the copies. For example, with two fixed-point operands for the integer and load- copies of the fixed-point merged register file, store units (including register operands for the the Power3 needs only 10 read ports in each specified address calculations and fixed-point file, instead of 16 read ports in a fixed-point data for the fixed-point store instructions). register file. A drawback of this approach is, The floating-point rename buffers need to however, that a scheme is also required to keep deliver operands for the floating-point units both copies coherent. (floating-point register data) and also for the Now let’s turn to the required number of load-store units (floating-point operands of write ports (input ports). Since in each cycle the floating-point store instructions). In the rename buffers need to accept all results pro- Power3, for instance, this implies the follow- duced by the execution units, these buffers ing read port requirements. The fixed-point must provide as many write ports as results the rename buffers need to have 12 read ports (up execution units may produce per cycle. The to 3 × 2 operands for the 3 integer-units as fixed-point rename buffers receive results from well as 2 × 2 address operands and 2 × 1 data the available integer-execution units and from operands for the 2 load-store units). On the the load-store units (fetched fixed-point data). other hand, the floating-point rename regis- In contrast, the floating-point rename buffers ters need to have 8 read ports (up to 2 × 3 hold the results of the floating-point execution operands for the 2 floating-point units and units and the load-store units (fetched float-

SEPTEMBER–OCTOBER 2000 77 RENAMING REGISTERS

Dest. Entry RB Entry register Latest Value valid index valid no. bit Value valid 0 0 Mapping Rename buffers table Associative lookup for r7 5 1 17 9 1 8 1 80 1 Lookup 6 0 10 1 7 0 7 1 for r7 7 1 12 11 1 9 1 - 0 8 1 14 12 1 7 1 70 1

12 12 (RB index = 12) (RB index = 12)

(a) (b) Figure 6. Methods for keeping track of the actual mapping of architectural registers to rename buffers: using a mapping table (a) and mapping within rename buffers (b).

ing-point data). Most results are single data become available in the last execution cycle. items requiring one write port. However, there Delaying the allocation of rename buffers to are a few exceptions. When execution units the instructions37 saves rename buffer space. generate two data items, they require two write Various schemes have been proposed for this, ports as well, similar to the PowerPC proces- such as virtual renaming37-40 and others.41 In sor load-store units. After execution of the fact, a virtual allocation scheme has already LOAD-WITH-UPDATE instruction, these been introduced into the Power3.37 units return both the fetched data value and Established mappings must be maintained the updated address value. until their invalidation. There are two possi- bilities for keeping track of the actual map- Register mapping methods ping of particular architectural registers to During renaming, the processor needs to allocated rename buffers. The processor can allocate rename buffers to the destination reg- use a mapping table for this or can track the isters of the instructions (or usually to every actual register mapping within the rename instruction to simplify logic). It also must buffers themselves. See Figure 6. keep track of the mappings actually used and A mapping table has as many entries as deallocate rename buffers no longer used. there are architectural registers provided by Accordingly, the related aspect of the design the instruction set architecture (usually 32). space has three components: Each entry holds a status bit (called the entry valid bit in Figure 6a), which indicates • allocation scheme of the rename buffers, whether the associated architectural register • method of keeping track of actual map- is renamed. Each valid entry supplies the pings, and index of the rename buffer, which is allocat- • deallocation scheme of rename buffers. ed to the architectural register belonging to that entry (called the RB index). For instance, As far as the allocation scheme of rename Figure 6a shows that the mapping table holds buffers is concerned, rename buffers are usu- a valid entry for architectural register r7, ally allocated to instructions during instruc- which contains the RB index of 12. This indi- tion issue. If rename buffers are assigned to cates that architectural register r7 is actually the instructions as early as during instruction renamed to rename buffer 12. issue, rename buffer space is wasted, since Each entry is set up during instruction rename buffers are not needed until the results issue, while new rename buffers are allocated

78 IEEE MICRO to the issued instructions. A valid mapping is al register r7 has two subsequent allocations. updated when the architectural register From these, entry 12 is the latest one as its lat- belonging to that entry is renamed again. It est bit has been set. Thus, in Figure 6b, renam- will be invalidated when the related mapping ing source register r7 would yield the RB is no longer needed and the allocated rename index of 12. This method of renaming source buffer is reclaimed. In this way, the mapping registers requires an associative lookup in all table continuously provides the latest alloca- entries searching for the latest allocation of tions. Source registers are renamed by access- the given source register. ing the mapping table with the register With issue-bound operand fetching, source numbers as indices and fetching the associat- registers are both renamed and accessed during ed rename buffer identifiers (RB indices), as the issue process. For this reason, in this case, Figure 6a shows. processors usually integrate renaming and Obviously, for split architectural register operand accessing, and therefore keep track of files, separate fixed-point and floating-point the register mapping within the rename mapping tables are needed. buffers. For dispatch-bound operand fetching, As discussed earlier, mapping tables should however, these tasks are separated. Source reg- provide one read port for each source operand isters are renamed during instruction issue, that may be fetched in any one cycle and one whereas the source operands are accessed while write port for each rename buffer that may be the processor dispatches the instructions to the allocated in any one cycle. execution units. Therefore, in this case, proces- The other fundamentally different alterna- sors typically use mapping tables. tive for keeping track of actual register map- If rename buffers are no longer needed, they pings relies on an associative mechanism (see should be reclaimed (deallocated). The Figure 6b). In this case no mapping table scheme of deallocation depends on key aspects exists, but each rename buffer holds the iden- of the overall renaming process. In particular, tifier of the associated architectural register they depend on the allocation scheme of the (usually the register number of the renamed rename buffers, the type of rename buffers destination register) and additional status bits. used, the method of keeping track of actual These entries are set up during instruction allocations, and even whether issue-bound or issue when a particular rename buffer is allo- dispatch-bound operands are fetched. How- cated to a specified destination register. As Fig- ever, lack of space restricts discussion of this ure 6b shows, in this case each rename buffer aspect of the design space in detail here. holds the following five pieces of information: Rename rate • a status bit, which indicates that this As its name suggests, the rename rate stands rename buffer is actually allocated (called for the maximum number of renames that a the entry valid bit in the figure), processor can perform in a cycle. Basically, the • the identifier of the associated architec- processor should rename all instructions tural register (Dest. Reg. No.), issued in the same cycle to avoid performance • a further status bit, called the latest bit, degradation. Thus, the rename rate should whose role will be explained later, equal the issue rate. This is easier said than • another status bit, called the value valid done, since it is not at all an easy task to imple- bit, which shows whether the actual value ment a high rename rate (four or higher). Two of the associated architectural register has reasons make it difficult. already been generated, and First, for higher rename rates the detection • the value itself, provided that the value and handling of interinstruction dependencies valid bit signifies an already produced during renaming becomes a more complex result. task. Second, higher rename rates require a larger number of read and write ports on reg- The latest bit marks the last allocation of a ister files and mapping tables. For instance, the given architectural register if it has more than four-way superscalar R10000 can issue any one valid allocation due to repeated renam- combination of four fixed-point and floating- ing. For instance, in our example architectur- point instructions. Accordingly, its fixed-point

SEPTEMBER–OCTOBER 2000 79 RENAMING REGISTERS

Basic alternatives of register remaining

Merged architectural Separate Renaming Renaming within and rename register file rename register files within the ROB the shelving buffers

Basic Using a Mapping Using a Mapping Using a Mapping Using a Mapping alternatives mapping table within the RBs mapping table within the RBs mapping table within the RBs mapping table within the RBs

Issue- Dispatch- Issue- Dispatch- Issue- Dispatch- Issue- Dispatch- Issue- Dispatch- Issue- Dispatch- Implementation bound bound bound bound bound bound bound bound bound bound bound bound schemes operand operand operand operand operand operand operand operand operand operand operand operand fetching fetching fetching fetching fetching fetching fetching fetching fetching fetching fetching fetching

Proposals Keller (1996)6 Smith-Pleszkun42 (1987) Johnson43 (1987) Sohi, Vajapeyem44 Processors Power1 (1990) PowerPC 603 (1993) PentiumPro (1995) (1987) ES/9000 (1992) PowerPC 604 (1995) Pentium II (1997) Power 2 (1993) PowerPC 620 (1996) Pentium III (1999) P2SC (1996) PA 8000 (1996) AM29000 (1995)* Nx586 (1994) PA 8200 (1997) K5 (1995) 45 R10000 (1996) Power3 (1998) Lighting (1991) R12000 (1999) PA 8500 (1999) K6* (1997) M3 (2000) PMI (1995) (Sparc64)

*The shelving buffers are also implemented in the ROB. The resulting unit is occasionally called the DRIS.

Figure 7. Basic implementation alternatives of register renaming. RB designates the rename buffer.

mapping table needs 12 read ports and 4 write renaming because recent processors typically ports, and its floating-point table requires 16 implement full renaming, and the rename rate read and 4 write ports. This many ports are because of its quantitative character. Two needed since fixed-point instructions can refer main design aspects remain: the layout of the up to 3, and floating-point instructions up to rename buffers and the method of register 4 source operands in this processor. mapping. Furthermore, the layout of the rename buffers itself covers three design Basic alternatives, possible implementation aspects: the type of rename buffers, their num- schemes ber, and the number of the read and write Theoretically in the design space of regis- ports. Of these only the type of the rename ter renaming, each possible combination of buffers is of qualitative character. It follows the available design choices yields one possi- that the design space of register renaming ble implementation alternative. However, includes only two relevant qualitative aspects: instead of considering all possible implemen- the type of the rename buffers and the method tation alternatives, it makes sense to focus only of register mapping. on those that differ in relevant qualitative The design choices available for these two aspects from each other—the basic alterna- relevant design aspects result in eight possible tives. Possible basic alternatives can be derived combinations, called the basic alternatives for from the design space in two steps: by identi- register renaming, as shown in Figure 7. In fying the relevant qualitative design aspects addition, this figure also takes into account involved and then by composing their possi- that the processor’s operand fetch policy— ble combinations. which is a design aspect of shelving—signifi- When selecting the relevant qualitative cantly affects how the renaming process is design aspects, we should recall the design carried out. This splits the eight basic renam- space of renaming mentioned earlier. First, we ing alternatives into 16 feasible implementa- ignore two main aspects: the scope of register tion schemes. Figure 7 also indicates which

80 IEEE MICRO implementation schemes are used in relevant 1, 1993, pp. 9-50. superscalar processors and some hints about 2. P.E. Smith and G.S. Sohi, “The Microarchi- their origins. tecture of Superscalar Processors,” Proc. As Figure 7 indicates, relevant superscalar IEEE, IEEE Press, Piscataway, N.J., Vol. 83, processors make use of only four of the eight No. 12, Dec. 1995, pp. 1609-1624. possible basic alternatives of renaming. More- 3. D. Sima, T. Fountain, and P. Kacsuk, over, most of the latest processors employ the Advanced Computer Architectures, Addison following basic alternatives of renaming when Wesley Longman, Harlow, England, 1997. fetching dispatch-bound operands: 4. R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM • use of merged architectural and rename J. Research and Development, Vol. 11, No. register files and mapping tables 1, 1967, pp. 25-33. (R10000, R12000, M3). 5. G.S. Tjaden and M.J. Flynn, “Detection and • use of separate rename register files and Parallel Execution of Independent Instruc- mapping registers within the rename reg- tions,” IEEE Trans. Computers, Vol. C-19, isters (PA8x00 line, Power3), and No. 10, 1970, pp. 889-895. • renaming within the ROB and using 6. R.M. Keller, “Look-Ahead Processors,” mapping tables (Pentium Pro, Pentium Computing Surveys, Vol. 7, No. 4, 1975, pp. II, Pentium III). 177-195. 7. D. Leibholz and R. Razdan, “The Alpha It’s also conceivable to use different basic 21264: A 500 MIPS Out-of-Order Execution alternatives for renaming fixed-point and float- ,” Proc. Compcon, IEEE ing-point instructions, as is done in the K7. Computer Society Press, Los Alamitos, This processor uses the ROB for renaming Calif., 1997, pp. 28-36. fixed-point instructions and a merged archi- 8. G. Kurpanek et al., “PA-7200: A PA-RISC tectural and rename register file for renaming Processor with Integrated High Performance floating-point ones. However, as AMD didn’t MP Interface,” Proc. Compcon, IEEE CS disclose the method of register mapping, this Press, 1994, pp. 375-382. processor isn’t included in Figure 7. 9. D. Hunt, “Advanced Performance Features As this figure shows, the latest processors of the 64-Bit PA-8000,” Proc. Compcon, fetch predominantly dispatch-bound IEEE CS Press, 1995, pp. 123-128. operands due to the comparative advantage 10. A.P. Scott et al., “Four-Way Superscalar PA- of this fetch policy.40 The move away from the RISC Processors,” Hewlett-Packard J., Aug. issue-bound operands to the dispatch-bound 1997, pp. 1-9. fetch policy is manifested in AMD’s subse- 11. G. Lesartre and D. Hunt, PA-8500: The Con- quent K5 and K6, and by the fact that the tinuing Evolution of the PA-8000 Family, PowerPC 620-based Power3 has also made Hewlett-Packard Co., Palo Alto, Calif., 1998, this transition. pp. 1-11. 12. G.F. Grohoski, “Machine Organization of the egister renaming and shelving mark the IBM RISC System/6000 Processor,” IBM J. R first major step in the evolution of super- Research and Development, Vol. 34, No. 1, scalar processors. The introduction of these 1990, pp. 37-58. techniques became necessary to resolve the 13. S. White and J. Reysa, PowerPC and issue bottleneck of early superscalar designs. POWER2: Technical Aspects of the New As revealed in this article, register renaming IBM RISC System/6000, IBM Corp., Austin, is a complex technique whose understanding Texas,1994. requires the identification and exploration of 14. M. Becker et al., “The PowerPC 601 Micro- its design space. MICRO processor,” IEEE Micro, Oct. 1993, pp. 54-68. References 15. B. Burgess et al., “The PowerPC 603 Micro- 1. B.R. Rau and J.A. Fisher, “Instruction Level processor,” Comm. ACM, ACM Press, New Parallel Processing: History, Overview and York, Vol. 37, No. 6, 1994, pp. 34-42. Perspective,” J. Supercomputing, Vol. 7. No. 16. S.P. Song et al., “The PowerPC 604 RISC

SEPTEMBER–OCTOBER 2000 81 RENAMING REGISTERS

Microprocessor,” IEEE Micro, Oct. 1994, pp. 1-11. pp. 8-17. 32. B. Shriver and B. Smith, The Anatomy of a 17. D. Ogden et al., “A New PowerPC Micro- High-Performance Microprocessor, IEEE CS processor for Low Power Computing Sys- Press, 1998. tems,” Proc. Compcon, IEEE CS Press, 33. K. Diefendorff, “K7 Challenges Intel,” Micro- 1995, pp. 281-284. processor Report, Micro Design Resources, 18. L. Gwennap, “IBM Crams Power2 Onto Sin- Vol. 12, No. 14, 1998, pp. 1, 6-11. gle Chip,” Microprocessor Report, Micro 34. L. Gwennap, “Nx686 Goes Toe-to-Toe with Design Resources, Sunnyvale, Calif., Vol. 10, Pentium Pro,” Microprocessor Report, No. 11, 1996, pp. 14-16. Micro Design Resources, Vol. 9, No. 14, 19. D. Levitan et al., “The PowerPC 620 Micro- 1995, pp. 1, 6-10. processor: A High Performance Superscalar 35. D. Sima, “Superscalar Instruction Issue,” RISC Microprocessor,” Proc. Compcon, IEEE Micro, Sept.-Oct. 1997, pp. 29-39. IEEE CS Press, 1995, pp. 285-291. 36. V. Popescu et al., “The Metaflow Architec- 20. S.P. Song, “IBM’s Power3 to Replace P2SC,” ture,” IEEE Micro, June 1991, pp. 10-13, 63- Microprocessor Report, Micro Design 71. Resources, Vol. 11, No. 15, 1997, pp. 23-27. 37. T. Monreal et al., “Delaying Physical Regis- 21. L. Gwennap, “MIPS R10000 Uses Decou- ter Allocation Through Virtual-Physical Reg- pled Architecture,” Microprocessor Report, isters,” Proc. MICRO-32, IEEE CS Press, Micro Design Resources, Vol. 8, No. 14, 1999, pp. 186-192. 1994, pp. 18-22. 38. S. Wallace and N. Bagheryadeh, “A Scalable 22. L. Gwennap, “MIPS R12000 to Hit 300 MHz,” Register File Architecture for Dynamically Microprocessor Report, Micro Design Scheduled Processors,” Proc. 1996 Conf. Resources, Vol. 11, No. 13, 1997, pp. 1,6-7,17. Parallel Architectures and Compilation Tech- 23. N. Patkar et al., “Microarchitecture of HaL’s niques, 1996, pp. 179-184. CPU,” Proc. Compcon, IEEE CS Press, 39. A. González et al., “Virtual Registers,” Proc. 1995, pp. 259-266. Third Int’l Symp. High-Performance Com- 24. L. Gwennap, “Intel’s Uses Decoupled puter Architecture, IEEE CS Press, 1997, pp. Superscalar Design,” Microprocessor 364-369. Report, Micro Design Resources, Vol. 9, No. 40. A. González, J. González, and M. Valero, 2, 1995, pp. 9-15. “Virtual-Physical Register,” Proc. Fourth Int’l 25. L. Gwennap, “Klamath Extends P6 Family,” Symp. High-Performance Computer Archi- Microprocessor Report, Micro Design tecture, IEEE CS Press, 1998, pp. 175-184. Resources, Vol. 11, No. 2, 1997, pp. 1, 6-8. 41. S. Jourdan et al., “A Novel Renaming 26. Pentium III Processor, Product Overview, Scheme to Exploit Value Temporal Locality Intel Corp, Santa Clara, Calif., 1999. Through Physical Register Reuse and Unifi- 27. J.S. Liptay, “Design of the IBM Enterprise cation,” Proc. MICRO-31, IEEE CS Press, Sytem/9000 High-End Processor,” IBM J. 1998, pp. 216-225. Research and Development, Vol. 36, No. 4, 42. J.E. Smith and A.R. Pleszkun, “Implement- July 1992, pp. 713-731. ing Precise Interrupts in Pipelined Proces- 28. B. Burkhardt, “Delivering Next-Generation sors,” IEEE Trans. Computers, IEEE CS Performance on Today’s Installed Computer Press, Vol. C-37, No. 5, 1988, pp. 562-573. Base,” Proc. Compcon, IEEE CS Press, 43. M. Johnson, Superscalar Microprocessor 1994, pp. 11-16. Design, Prentice Hall, Englewood Cliffs, 29. Cyrix 686MX, Cyrix Corporation, Richardson, N.J., 1991. Texas, Order No. 94329-00, July 1997. 44. G.S. Sohi and S. Vajapayem, “Instruction 30. L. Gwennap, “NexGen Enters Market with Issue Logic for High Performance, Inter- 66-MHz Nx586,” Microprocessor Report, ruptible Pipelined Processors,” Proc. 14th Micro Design Resources, Vol. 8, No. 4, 1994, ISCA, IEEE CS Press, 1987, pp. 27-36. pp. 12-17. 45. D. Sima, “The Design Space of Shelving,” 31. M. Slater, “AMD’s K5 Designed to Outrun J. Systems Architecture, Vol. 45, No. 11, Pentium,” Microprocessor Report, Micro 1999, pp. 863-885. Design Resources, Vol. 8, No. 14, 1994,

82 IEEE MICRO Dezsö Sima is the dean of the John von Neu- degree in telecommunications, both from the mann Faculty of Informatics at Budapest Technical University of Dresden, Germany. Polytechnic, Hungary. He has taught com- He has authored more than 40 papers and a puter architecture at the Technical Universi- book used in advanced architecture courses. ty of Dresden; at Kandó Polytechnic, He served as president of the John von Neu- Budapest; and at South Bank University, Lon- mann Computer Society in Hungary and is don; and been a guest lecturer on computer an IEE Fellow and a member of the IEEE. architectures at several European universities. He was the first professor to hold the Barkhausen Chair at the Technical Universi- Direct comments about this article to the ty of Dresden. His research interests include author at Budapest Polytechnic, John von computer architectures and computer-assist- Neumann Faculty of Informatics, PO Box ed teaching and learning. Sima holds an MSc 112, H-1431 Budapest 8, Hungary; degree in electrical engineering and a PhD [email protected].

IEEE Micro 2001 Editorial Calendar

January-February July-August Hot Interconnects General Interest This issue focuses on the hardware and software architec- IEEE Micro gathers together the latest details on new devel- ture and implementation of high-performance interconnec- opments in chips, systems, and applications. tions on chips. Topics include network-attached storage; voice Ad close date: 1 July and video transport over packet networks; network interfaces, novel switching and routing technologies that can provide dif- September-October ferentiated services, and active network architecture. Embedded Fault-Tolerant Systems Ad close date: 2 January To avoid loss of life, certain computer systems—such as those in automobiles, railways, satellites, and other vital sys- March-April tems—cannot fail. Look for articles that focus on the verifi- Hot Chips cation and validation of complex computers, embedded An extremely popular annual issue, Hot Chips presents the computing system design, and chip-level fault-tolerant designs. latest developments in microprocessor chip and system tech- Ad close date: 1 September nology used to construct high-performance workstations and systems. November-December Ad close date: 1 March RF-ID and noncontact smart card applications Equipped with radio-frequency signals, small electronic tags May-June can locate and recognize people, animals, furniture, and other Mobile/Wearable computing items. The new generation of cell phones and powerful PDAs has Ad close date: 1 November made mobile computing practical. Wearable computing will soon be moving into the deployment stage. Ad close date: 1 May

IEEE Micro is a bimonthly publication of the IEEE Computer Society. Authors should submit paper proposals to micro- [email protected], include author name(s) and full contact information.

SEPTEMBER–OCTOBER 2000 83