Appears in Proc. HPCA-18 (2012)

Flexible Register Management using Reference Counting

Steven Battle (Drexel University), Andrew D. Hilton (IBM Corporation), Mark Hempstead (Drexel University), Amir Roth (University of Pennsylvania)
[email protected]  [email protected]  [email protected]  [email protected]

Abstract

Conventional out-of-order processors that use a unified physical register file allocate and reclaim registers explicitly using a free list that operates as a circular queue. We describe and evaluate a more flexible register management scheme—reference counting.
We implement reference counting using a bit-matrix with a column for every physical register and a row for every entity that can hold a physical register, e.g., an in-flight instruction. Columns are NOR'ed together to create a bitvector free list from which registers are allocated using priority encoders.
We describe reference counting designs that support micro-architectural techniques including register file power gating, dynamic register move elimination, register file checkpointing, and latency tolerant execution. Performance and circuit simulation show that the energy cost of reference counting is low and is easily recouped by the savings of the techniques it enables.

1. Introduction

Most contemporary out-of-order processors use a unified physical register file to hold both architectural and speculative register state. This design requires fewer register value copies than one that uses a separate architectural register file and a ROB that holds destination register values. Processors manage registers—for the rest of the paper, we use register to mean physical register—using a free list that operates as a circular queue. Dispatching instructions dequeue (allocate) registers at one end. Committing instructions enqueue (free) the registers they overwrite at the other end. The destination registers of squashed instructions are freed in bulk by moving the free list head pointer.
This paper describes an alternative register management algorithm, reference counting, in which these actions decrement reference counts rather than explicitly freeing registers. A register becomes free when its reference count becomes zero. Reference counting is more general and powerful than circular-queue driven register management. In addition to conventional retirement-driven register reclamation, it supports execution-driven register reclamation used in latency-tolerant, scalable instruction window designs like CFP (Continual Flow Pipelines) [23] and BOLT (Better Out-of-Order Latency Tolerance) [8]. It also supports register sharing, which allows multiple logical registers that contain the same value to share the same (physical) register, reducing execution overhead and (potentially) the height of the dynamic dataflow graph. Register sharing has been used in previously proposed techniques Unified Renaming [10], RENO (Rename Optimizer) [18], and NoSQ (No Store Queue) [22]. We show how reference counting supports these and other features, both alone and in combination.
The first description of reference counting used per-register multi-input up-down counters [16]. More recent designs use a two-dimensional bit-matrix [20]. The matrix has a column per register and a row per entity that can hold a register, e.g., an in-flight instruction or the rename map table. The bit at row e and column p is set if entity e holds register p. The bits in each column are NOR'ed together to create a bitvector-style free list from which registers are allocated using priority encoders. This structure is similar to the register renaming and map table checkpointing mechanism used in the Alpha 21264 [11], but is used only for register reclamation and can operate in conjunction with any renaming and checkpointing scheme. Reference counts are manipulated by operating on matrix bits. Different uses of reference counting require different matrix organizations. Depending on the type of register-holding entity and on manipulation bandwidth requirements, a given matrix row may be implemented using RAM or latches.
We use cycle-level performance simulation and circuit simulation, described in Section 2, to evaluate the performance benefits and area and power costs of various register reference counting configurations. Where analogous free-list configurations exist, we evaluate those side-by-side.
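For illustration, the bit-matrix formulation can be sketched behaviorally in a few lines of Python; the class and method names below are ours, and the sketch ignores timing and port constraints.

# Illustrative behavioral sketch of bit-matrix reference counting (not the paper's RTL).
class RefCountMatrix:
    def __init__(self, num_regs, entities):
        self.num_regs = num_regs
        # One row per entity that can hold a register (e.g., "rename", "rob-5").
        self.rows = {e: [0] * num_regs for e in entities}

    def acquire(self, entity, preg):
        self.rows[entity][preg] = 1          # entity now holds register preg

    def release(self, entity, preg):
        self.rows[entity][preg] = 0

    def free_list(self):
        # A register is free when no row holds it: the NOR of its column.
        return [int(not any(row[p] for row in self.rows.values()))
                for p in range(self.num_regs)]

    def allocate(self, entity):
        # Priority encoder: pick the lowest-numbered free register, if any.
        for p, free in enumerate(self.free_list()):
            if free:
                self.acquire(entity, p)
                return p
        return None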


a) Conventional queue-based register management. There are four logical registers (r0–r3) and ten physical registers (p0–p9). Hardwired logical register r0 is mapped to hardwired register p0. The circular queue free list has six slots. Instruction A commits and frees overwritten (O) physical register p3 by adding it to the tail of the free list. It also writes its destination (D) physical register p4 into the architectural map table. Instruction E dispatches and allocates destination physical register p8 from the head of the free list. It also writes p8 into the rename map table. The processor took a checkpoint immediately after renaming instruction C. The checkpoint includes a copy of the map table and a copy of the free list head pointer. Restoring the checkpoint involves restoring the map table copy and the free list head pointer.
b) Register management using reference counting. The circular queue free list is replaced by a vector with one bit per physical register. When instruction A commits, it clears the bit corresponding to its overwritten register p3. When instruction E dispatches, it sets the bit corresponding to its destination register p8. This destination was selected from the free list using an encoder. Instruction C's checkpoint includes a copy of the map table and a copy of the bitvector. Restoring the checkpoint involves restoring the map table copy and ANDing the checkpointed and active bitvectors.
Figure 1. Implementations of ROB register management.

Section 3 introduces reference counting and shows how it can re-implement traditional register management. Section 4 presents register-file power-gating, a new application of reference counting that exploits its ability to allocate registers in any order. Section 5 discusses reference counting support for register sharing with an application to dynamic register move elimination. Although reference counting consumes more area and power than a traditional free list, the gating and move elimination optimizations it supports offset its energy cost. Sections 6 and 7 demonstrate reference counting support for speculative retirement and execution-driven register reclamation.

2. Methodology

We evaluate reference counting and its applications using both performance and circuit simulation.
Performance evaluation. Our cycle-level performance simulator executes user-level x86_64 code, breaking x86 instructions into RISC-like three-register micro-operations. The simulated core is modeled roughly after Nehalem. It is 4-wide issue with a 23-stage pipeline, 128-entry reorder buffer, 36-entry issue queue, and 96 rename registers. We model both a single-threaded configuration with 160 total registers and a dual-threaded configuration with 224 registers. The branch predictor is a 16 Kbyte 3-table PPM predictor [15]. The three-level cache hierarchy has 32 Kbyte, 8-way set-associative, 3-cycle access instruction and data caches, a 256 Kbyte, 8-way set-associative, 10-cycle access L2 and an 8 MByte, 16-way set-associative, 40-cycle L3. Main memory latency is 150 cycles—corresponding to a 3.2GHz clock—and bandwidth is 12.8 Gbytes/s.
We compiled the SPEC2006 benchmarks using gcc 4.3 at optimization level -O3. We simulate the benchmarks to completion on their training inputs, sampling 10 million of every 500 million instructions with 10 million instructions of warmup.
Power and area estimation. We created custom RTL for the reference counting structures and used the NCSU FabScalar memory generator [4] to create netlists for the register file, free list, and checkpoint SRAMs, all in NCSU's 45nm FreePDK CMOS technology. We derived power estimates for synthesized netlists using the power modeling features in Synopsys PrimeTime and also used HSPICE for transistor-level netlists. Our input activity traces included both the average port access rate and the average bit toggle rate to accurately reflect energy consumption. Static power is scaled to bring the P(st):P(dy) ratio in line with the typical 1:3 ratio [27] measured in commercial 45nm processors.

3. Conventional Register Reclamation

Out-of-order processors with unified register files use a variant of the MIPS R10000 register management algorithm [26]. The rename map table is a multi-ported RAM containing (physical) register numbers indexed by logical register number. The free list is a RAM of register numbers managed as a circular queue. For P registers and L logical registers the free list contains P–L entries as L registers are always mapped. Each entry is log2(P) bits wide.
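As a point of comparison for the reference counting designs that follow, the queue discipline just described can be sketched as follows; this Python model is purely illustrative and omits wrap-around and occupancy checks that real hardware enforces by construction.

# Illustrative model of a circular-queue free list (allocate at head, free at tail).
class CircularFreeList:
    def __init__(self, free_regs):
        self.slots = list(free_regs)          # RAM of register numbers, P - L entries
        self.size = len(self.slots)
        self.head = 0                         # dequeue (allocate) pointer
        self.tail = 0                         # enqueue (free) pointer

    def allocate(self):                       # dispatch: dequeue a destination register
        preg = self.slots[self.head]
        self.head = (self.head + 1) % self.size
        return preg

    def free(self, preg):                     # commit: enqueue the overwritten register
        self.slots[self.tail] = preg
        self.tail = (self.tail + 1) % self.size

    def squash_to(self, saved_head):          # squash: bulk-free destinations by moving head
        self.head = saved_head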

a) Reference counting for a scalar processor. The 160-register reference count vector supports one bit set (from dispatch) and one bit clear (from commit/overwrite) per cycle. Registers are allocated using a priority encoder. The allocated register is the dispatched destination register.
b) Reference counting for a 4-wide super-scalar processor. The reference counting vector is divided into 4 disjoint segments. 4 quarter-width priority encoders allocate 4 registers, 1 from each segment, per cycle. Because the dispatched destination registers are the allocated registers, those decoders are quarter-width as well. The 4 decoders for the overwritten registers are full width.
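A behavioral sketch of the segmented allocation that Figure 2b depicts is given below; the rotation policy and widths are assumptions drawn from the figure and the text in Section 3, not a literal rendering of the encoders.

# Illustrative N-wide allocation over N disjoint segments of the free vector.
def allocate_group(free_vector, width, rotate):
    seg_size = len(free_vector) // width
    allocated = []
    for slot in range(width):
        seg = (slot + rotate) % width              # rotate segment assignment per cycle
        base = seg * seg_size
        for p in range(base, base + seg_size):     # 1/width-wide priority encoder
            if free_vector[p]:
                free_vector[p] = 0                 # dispatch takes the register out of the free list
                allocated.append(p)
                break
        else:
            allocated.append(None)                 # segment empty: this slot must stall
    return allocated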

Figure 2. Reference counting mechanism.

At rename, a register-writing instruction is allocated a destination register D from the head of the free list. A parallel map table lookup also notes the register previously allocated to the instruction's logical destination register—this is the overwritten physical register O. Both destination and overwritten registers are noted in the instruction's ROB entry. When an instruction commits, overwritten register O is freed and added to the free list at the tail. When an instruction is squashed, destination register D is freed and added to the free list at the head—this action only moves the free list head pointer.
Figure 1a shows an abstract processor with ten registers (p0–p9), four logical registers (r0–r3), and a six-slot free list. Logical register r0 is hardwired to the value zero and mapped to hardwired "register" p0. The figure shows the architectural and rename (speculative) map tables and five instructions (A–E) in the ROB. Each instruction is shown in raw (logical register) and renamed (physical register) form along with its destination (D) and overwritten (O) registers. The figure shows two actions. Instruction A commits—it writes its destination register (p4) to the architectural map table and frees its overwritten register p3 by writing it at the tail of the free list. Instruction E renames and dispatches—it allocates destination register p8 from the head of the free list and writes that register number into the rename map table.
Reference counting formulation. Figure 1b shows the corresponding reference counting implementation which uses a vector with one bit per register. Hardwired "register" p0 does not need to be reference counted because it is not allocated or freed. The vector has a 1 in position p if register p is mapped to either an architectural register or the destination of an instruction in the ROB. An instruction with destination register D and overwritten register O sets bit D at dispatch and clears bit O at commit. If it is squashed, it clears bit D. The bitvector is essentially an inverted free list.
Scalar implementation. A scalar processor allocates and frees at most one register every cycle, so a traditional free list needs one read port for allocation and one write port for reclamation. Reference count bitvectors support set/clear operations rather than read/write operations and so we implement the bits using set-reset (S-R) latches and the set/reset "ports" using encoders and decoders. A scalar processor with reference counting uses two P-wide decoders, to set and to clear one bit per cycle. The allocator is a P-wide priority encoder. Figure 2a shows this design for a processor with 160 registers.
Because reference counting bits are implemented using latches and not flip-flops, the identification of free registers and the bitvector updates to indicate that the registers are no longer free cannot take place in the same pipeline stage. We avoid a cycle by "pre-selecting" a free register one pipeline stage (cycle) ahead. If the register is subsequently not used, the corresponding bit is not set and the register remains free.
Superscalar implementation. An N-wide superscalar processor allocates and frees up to N registers per cycle. Rather than supporting N-wide operation with N read and N write ports, a traditional free list is organized with one read port, one write port, and N entries per row. The queue discipline makes this organization conflict free, while latching recent results reduces accesses and facilitates management.
A superscalar processor with reference counting must allocate N registers per cycle and also support setting and clearing up to N bits per cycle. To support N-wide allocation, we divide the register space into N sets and make each encoder responsible for only P/N registers, i.e., P/N-wide. The decoders that set bits corresponding to the destinations of dispatched instructions are divided in a similar way. The assignment of sets to superscalar renaming slots is rotated on a per-cycle basis to avoid stalling dispatch when set 0 is empty. The decoders that clear bits corresponding to registers overwritten at commit cannot be divided without inducing conflicts—registers are not overwritten in allocation order—and are replicated at full P-width. Figure 2b shows reference counting for a 4-wide superscalar processor.
Support for fast squash recovery. The MIPS R10000 supported fast single-cycle recovery to a small number of checkpoints—snapshots of the rename map table that can be created and restored in a single cycle but do not support incremental updates. Checkpoint recovery frees an arbitrary number of in-flight destination registers in one cycle. The free list supports high-bandwidth squash-triggered reclamation by checkpointing and restoring the head pointer. Figure 1a shows a checkpoint taken after instruction C was renamed.
Reference counting bulk register release is implemented by adding a P-column, C-row RAM where C is the number of checkpoints. When a checkpoint is created, the contents of the latch vector are written into the table in the corresponding row. When a checkpoint is restored, the table row is read and its contents are ANDed with those of the current reference count bitvector. Figure 1b shows this action.
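The checkpoint create/restore operations just described reduce to a row copy and a bitwise AND, as in the following illustrative Python fragment (function names are ours).

# Illustrative checkpointed bulk release: save the reference vector, restore by ANDing.
def create_checkpoint(ref_vector, checkpoint_ram, row):
    checkpoint_ram[row] = list(ref_vector)       # copy the latch vector into the RAM row

def restore_checkpoint(ref_vector, checkpoint_ram, row):
    saved = checkpoint_ram[row]
    for p in range(len(ref_vector)):
        # Registers allocated after the checkpoint (set now, clear then) become free in bulk.
        ref_vector[p] &= saved[p]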
Support for multi-threading. Many contemporary out-of-order processors employ SMT (Simultaneous Multi-Threading) [12, 25]. Extending reference counting for SMT requires replicating the reference count bitvector on a per-thread basis. Instructions update bits in the vector corresponding to their thread. Thread reference counts are maintained in separate vectors so that they can be checkpointed separately. However, the reference count checkpoint RAM is shared among the threads. Figure 4a shows reference counting support for an SMT processor with two threads. An additional 64 registers hold the architected state of a second thread.
Area and power results. Table 1 compares the power and area of the free list and reference counting designs for single- and dual-threaded processors. We use a low-leakage register file design [1] with 6 read ports and 3 write ports, typical for a 4-wide issue processor. In the table, dynamic power assumes maximum activity whereas average power is weighted by dynamic activity factors, collected from the SPEC benchmarks, and incorporates leakage power. For instance, the reference counting checkpoint RAM is a wide, high-powered structure but is written only once every ten cycles on average and read less frequently than that.

Component                    µm²     mW(dy)  mW(st)  mW(av)
Single-threaded
  Regfile, 160x64b, 6r3w     108733  22.70   2.49    9.97
  Freelist, 27x32b, 1r1w     2279    2.08    0.06    0.82
  Regfile + freelist         111012  24.78   2.55    10.79
  Refcount, 1x160b, 4r4w     7622    1.24    0.08    0.72
  Checkpoint, 4x160b, 1rw    2639    1.33    0.06    0.19
  Regfile + refcount         118994  25.27   2.63    10.88
Multi-threaded
  Regfile, 224x64b, 6r3w     152226  24.50   3.49    11.00
  Freelist, 27x32b, 1r1w     2279    2.08    0.06    0.82
  Regfile + freelist         155190  27.45   3.56    11.82
  Refcount, 2x224b, 4r4w     14515   1.54    0.14    1.00
  Checkpoint, 4x224b, 1rw    3658    1.91    0.08    0.27
  Regfile + refcount         170399  27.79   3.71    12.27
Table 1. Register management area and power for single- and multi-threaded processors.

Our free list has 108 total entries—more than the 96 rename registers our simulated core has—to support simple register sharing [24] described in Section 5. For the single-threaded core, reference counting occupies 3 times more area and consumes 11% more power than a conventional free list, although the marginal overhead is small relative to the area and power consumption of the register file—7% and 1%, respectively. In the single-threaded configuration, reference counting and a conventional free list have identical performance.
For the multi-threaded core, we retain the 108-entry free list and add a set of head/tail pointers to statically partition it—and consequently the register file—between the two threads. This setup also allows all 96 rename registers to be used by a single thread when one thread is quiesced. Dynamically partitioning the register file between two threads requires a more complicated setup—3 free lists with 108 slots each.
With reference counting, multi-threading requires additional hardware over a single-threaded design, both an additional vector and more bits per vector. In a multi-threaded processor, reference counting consumes 54% more power than a free list—4% of register file power. However, reference counting implements dynamic register partitioning which, our experiments show, improves throughput by 1%.

[Figure 3: bars show the percent reduction in register file plus register management power for the reference counting, + gating, and + move elimination configurations across SpecFP and SpecINT benchmarks.]
Figure 3. Power reduction due to register file power-gating. Percent of gated registers is shown above each bar.

4. Register File Power-Gating

Register files are typically sub-banked to reduce dynamic power consumption and access latency. Sub-banking chops up the bitlines into shorter, lower-capacitance segments. This enhances clock-gating by disabling pre-charge not only for unused ports but also for ports that are used on all banks except the one that contains the register accessed. The cost of banking is an additional set of lines that collect bank outputs using tri-states and distribute bank inputs. These incur both leakage and dynamic overheads. Our simulations show that banks of eight registers minimize total power by balancing the benefits of banking with the cost of the additional lines. Even with banking and clock-gating, the register file consumes 15–20% of core power [6, 27].
Clock-gating reduces dynamic power consumption. Reference counting allows register file banks to be power-gated (Vdd-gated), reducing leakage power as well. Register file power-gating is difficult to implement with a conventional free list. Even if the processor can dynamically resize the free list and take some registers out of circulation, the remaining registers are likely to be distributed evenly across the register file because the "un-disciplined" register overwriting continuously shuffles the contents of the free list. "Packing" the registers is unattractive as it involves a power-intensive and potentially slow procedure that adds overhead. Reference counting naturally packs registers, allowing power-gating to be applied opportunistically.
Gated register file design. We extend the existing banks to support power-gating with a PMOS gating transistor and driver. Our circuit simulations show that a gating transistor that is 2% the width of the gated SRAM bank minimizes leakage while maintaining switching performance [9, 19]. With this design, a register may be written two cycles after its bank is powered on. Because our simulated processor has a minimum of three cycles between issue and writeback, this design allows us to power on a bank when an instruction which writes a register in it issues. Table 2 shows the area and power of the power-gated register file. Power-gating support introduces area and leakage overheads of 2.4%.

Component                    µm²     mW(dy)  mW(st)  mW(av)
No gating
  Regfile, 160x64b, 6r3w     108733  22.70   2.49    9.97
  Refcount, 1x160b, 4r4w     7622    1.24    0.08    0.72
  Checkpoint, 4x160b, 1rw    2639    1.33    0.06    0.19
  Regfile + refcount         118994  25.27   2.63    10.88
Gating
  Regfile, 160x64b, 6r3w     111320  22.70   2.55    9.68
  Regfile + refcount         121581  25.27   2.69    10.59
Table 2. Register file power-gating.

Reference counting formulation. For power-gating, we add OR gates to the free list; these determine if banks are "used", i.e., have any registers allocated. One important issue is the relationship of gating banks to allocation banks. Recall, in a 4-way superscalar processor, the register file is divided into 4 sets. To maximize the number of empty banks, the allocation sets are interleaved relative to the banks—bank 0 covers registers 0–7, allocation set 0 covers registers 0, 4, 8, etc.
Gating algorithm and results. Our HSPICE simulations show that a register bank must remain powered down for 15 cycles in order to offset the energy cost of power-up. This behavior requires that we build hysteresis into the power-gating algorithm to avoid powering down a bank as soon as it empties only to have to power it back up immediately. We use eight counters to track the number of "in use" banks in 4-cycle intervals over the last 32 cycles. If the number of powered banks is larger than the maximum number of used banks over this interval, the unused banks are powered down on the rationale that the current phase does not need them.
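The bank-usage detection and hysteresis heuristic can be summarized by the following illustrative Python sketch; the bank size, interval length, and window depth are the values quoted above, while the class structure and names are ours.

# Illustrative sketch of bank "used" detection and the power-down hysteresis window.
BANK_SIZE, INTERVALS, INTERVAL_LEN = 8, 8, 4       # 8 regs/bank, 8 x 4-cycle = 32-cycle window

def used_banks(ref_vector):
    n = len(ref_vector) // BANK_SIZE
    # A bank is "used" if any of its reference bits is set (an OR per bank).
    return sum(any(ref_vector[b*BANK_SIZE:(b+1)*BANK_SIZE]) for b in range(n))

class GatingController:
    def __init__(self):
        self.history = [0] * INTERVALS             # max banks used in each recent interval
        self.cycle = 0

    def should_power_down(self, ref_vector, powered_banks):
        used = used_banks(ref_vector)
        idx = (self.cycle // INTERVAL_LEN) % INTERVALS
        self.history[idx] = used if self.cycle % INTERVAL_LEN == 0 else max(self.history[idx], used)
        self.cycle += 1
        # Gate unused banks only if more banks are powered than this phase has needed.
        return powered_banks > max(self.history)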

Figure 3 shows power savings in the register file and register management. The baseline is a register file with a conventional free list. The left (light gray) bar shows the constant overhead of replacing the free list with reference counting. The middle (gray) bar shows the effects of register file power-gating. The number on top of the bar is the average percentage of power-gated registers. SpecFP programs, which have good branch prediction and high window utilization, provide few gating opportunities—5% of the register file is gated on average—yielding only a 0.4% improvement over the free list. SpecINT programs generally have worse branch prediction and lower window utilization. This behavior yields significant opportunities for gating—20% on average—and provides a 4% overall reduction. Table 2 shows that on average, register-file power gating reduces register file energy consumption by 3%, offsetting the cost of the gating hardware. These relative savings will increase with smaller, leakier technologies. The right (black) bar adds the effects of move elimination, which is described in the next section.
Gating and multithreading. The power-saving potential of gating the register file portion of a quiesced thread was previously noted. However, it was only applied to in-order processors in which threads occupy statically distinct portions of the register file [21]. Again, in SMT processors with conventional register management, free-list shuffling interferes with coarse-grain power-gating. Reference counting makes it easy to power-gate not only an inactive thread context's share of the renaming registers, but its architectural registers as well, assuming these have been saved to memory of course.

a) Reference counting for multithreading (left). The reference vector is replicated per thread (REF-T0 and REF-T1). These are NORed together to create the free list. Although not shown, the vectors are checkpointed independently. Instructions set and clear bits in the vector corresponding to their thread (disp-T and ovwr-T). The dispatch and overwrite decoders are not replicated.
b) Reference counting for register sharing (right). The reference count vector is replicated (REF0 and REF1) to allow two entities to share a register. The free list is created by NORing the two vectors together. Instructions track which of the two vectors their bit resides in and so the set and clear decoders are not replicated. The read port reads the two reference count bits of the candidate register, testing for sharing overflow and—if sharing is allowed—which vector the current instruction should set or clear. Although not shown, both vectors are checkpointed and restored in unison so the reference counting checkpoint table is widened to 320 bits.
Figure 4. Reference counting for SMT and dynamic move elimination.

5. Dynamic Register Move Elimination

With conventional renaming, every dispatching instruction is allocated a new register to hold its output value. Dynamically converting the instruction stream into single register assignment form allows instructions that write the same logical register—and their dependent computations—to execute in parallel. Register sharing is the complement to register renaming. If renaming is dynamic single assignment then sharing is dynamic register expression elimination. In register sharing, a dynamic instruction might not be dispatched to the out-of-order execution engine at all. If at rename it is known that the value the instruction will produce already exists (or will exist) in some register, then the map table entry corresponding to the instruction's destination is set to the register that contains (or will contain) the value. Renaming will naturally route the instruction's dependents to this older source. Out-of-order issue will naturally wake them up as needed. Register sharing effectively removes instructions from the dynamic dataflow graph, reducing its height. It also reduces activity in the out-of-order execution engine. The difficult aspect of register sharing is identifying sharing opportunities. Register moves and instructions that store the value zero are low-hanging and surprisingly abundant fruit. Other sharing opportunities require speculation to identify and/or execution unit modifications to exploit [18, 22].

[Figure 5: five instructions (A: r1+r2→r3, B: r3→r2, C: r0→r3, D: r2→r1, E: r1+r3→r2) with their raw and renamed forms, the rename map table, and the two-bit reference counts after each rename.]
There are five instructions (A–E), four logical registers (r0–r3), and eight physical registers (p0–p7). Logical register r0 and "physical register" p0 are hardwired to the value 0. For every instruction, the figure shows the state of the Rename Map Table and Reference Counts after renaming.
Simple register sharing. The processor supports "simple register sharing" on register p0, which does not require reference counting. Instruction C, which loads the constant 0 to register r3, exploits this sharing mode. Rather than allocating a new register (say p7) to r3 and dispatching an instruction that moves p0 to p7, r3 is set to point to p0 in the map table and no instruction is dispatched.
Move elimination using physical register sharing. To support two-way sharing on the other registers, reference counts are extended to two bits and every register number in the map table and ROB is appended with a bit indicating which of the two reference count bits this entity "holds." Instruction B is a register move that is eliminated because only one of the two reference count bits on source register p4 (bit 0) is set. Instruction B is not dispatched—note the empty renamed instruction slot which would otherwise have contained the identity operation "p4 → p4". Its destination register, r2, is set to p4, and the second reference count bit for p4 is set. r3, written by instruction A and overwritten by C, uses bit 0. r2, "written" by instruction B and overwritten by E, uses bit 1. Instruction D is a move that is not eliminated because sharing for register p4 is already saturated.

Figure 5. Physical register sharing with an application to dynamic register move elimination.
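A behavioral sketch of the two-way sharing bookkeeping used for move elimination is shown below. It assumes, as in Figure 5, that a freshly allocated register holds only reference bit 0 and that a second sharer takes bit 1; the data structures and names are illustrative.

# Illustrative sketch of rename-time move elimination with two-bit reference counts.
def rename_move(src_lreg, dst_lreg, map_table, refcnt):
    src_preg, src_bit = map_table[src_lreg]      # map table stores (preg, refcount bit index)
    if refcnt[src_preg][1] == 0:                 # sharing not saturated: bit 1 is available
        refcnt[src_preg][1] = 1                  # second sharer takes bit 1
        map_table[dst_lreg] = (src_preg, 1)      # destination maps to the same register
        return True                              # move eliminated, nothing dispatched
    return False                                 # saturated: allocate normally and dispatch

def commit_overwrite(overwritten_preg, bit, refcnt):
    # The stored bit index makes the commit-time clear a simple write, not a read-modify-write.
    refcnt[overwritten_preg][bit] = 0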

Simple register sharing. Simple register sharing refers to sharing of hardwired values in registers—or pseudo-registers—that are never explicitly allocated or freed [24]. Our simulated core implements simple register sharing for the value zero, held in hardwired "register" p0. Our simulated core's micro-ISA has an explicit zero register, r0, and translates the common x86_64 idiom of XOR'ing a register with itself to a move from r0. Simple register sharing can be implemented without reference counting, but does require additional free list entries to hold registers that would otherwise be allocated to architected registers. Our simulated core uses a 108-entry free list to allow up to 12 instructions to share p0.
In Figure 5, instruction C is a move from the zero register r0 that exploits simple register sharing. C is eliminated and its destination, r3, is mapped to p0. Being hardwired, p0 is not reference counted.
General purpose register sharing. Sharing of general purpose registers—even the restricted form associated with a single register move—is difficult to implement without reference counting. The reason is that responsibility for freeing the shared register has to be transferred from the instruction that overwrites the older sharer to the instruction that overwrites the younger one. This transfer is difficult to orchestrate. It may also have to be undone if the younger sharer is squashed. This complexity is illustrated using Figure 5. Eliminating register move B involves assigning the register allocated to r3, p4, to r2 as well. Without sharing, responsibility for freeing p4 resides with instruction C, the instruction that overwrites r3. With sharing, that responsibility shifts to E, the instruction that overwrites r2. If E is squashed, freeing responsibility returns to C, unless C has already committed in which case it remains with E and p4 must be freed on the spot.
Reference counting simplifies the book-keeping by tracking register occupancy in a central location and eliminating the need to assign register freeing responsibility to any specific instruction a priori.
Reference counting formulation. Figure 5 shows a reference counting formulation that supports a maximum sharing degree of two—a single register can be shared by at most two entities, where an entity is either an architected register or the destination of an in-flight instruction. This restricted implementation captures most sharing opportunities because any sharing group that comprises more than two entities is broken up into multiple sharing pairs.

Component                    µm²     mW(dy)  mW(st)  mW(av)
No move elimination
  Regfile, 160x64b, 6r3w     111320  22.70   2.55    9.68
  Refcount, 1x160b, 4r4w     7574    1.24    0.08    0.72
  Checkpoint, 4x160b, 1rw    2639    1.33    0.06    0.19
  Regfile + refcount         121533  25.27   2.69    10.59
Move elimination
  Regfile, 160x64b, 6r3w     111320  22.70   2.55    9.25
  Refcount, 2x160b, 4r4w     12317   1.32    0.12    0.84
  Checkpoint, 4x320b, 1rw    5186    1.80    0.11    0.29
  Regfile + refcount         128823  25.82   2.75    10.38
Table 3. Register management area and power for register sharing.


[Figure 6: bars show the percent reduction in µops executed for the 1/cycle, 1/cycle + 32-bit, 2/cycle + 32-bit, and Ideal configurations across SpecFP and SpecINT benchmarks.]
Figure 6. Reduction in µops executed with move-elimination.

The design replaces every reference count bit with two bits—shared registers have both bits set, newly allocated registers have only bit 0. Reference count checkpoints are similarly extended. Register number fields in the map table and ROB are extended by one bit to track which of the two reference counting bits the architected register or in-flight instruction is using, respectively. This additional bit indicates which bit to clear at commit and allows the bit clearing operation to be a write as opposed to a read-modify-write. In Figure 5, instructions A (writing r3) and B (writing r2) share register p4—r3 uses bit 0 of p4's reference count and r2 uses bit 1. Noting this information in the map table allows it to pass to instructions C and E which overwrite r3 and r2 respectively—note C's and E's overwritten (O) register fields in the ROB. At commit, C clears bit 0 in p4's reference count, and E clears bit 1.
Figure 4b shows the reference counting circuit for a scalar processor with register sharing. Register moves have to read the reference count bits of their input register to know whether sharing is already saturated and, if it isn't, which bit to write. In Figure 5, instruction D is a move that is not eliminated because register p4 is already shared by A and B. Because register sharing serializes map table read with map table write, it requires rename to be pipelined over two stages with map table writes taking place in parallel with dispatch [18]. However, inserting a reference count read—which is a combinational read from latches—into the map table read-write sequence should not induce additional delays. If the move is eliminated, then the register speculatively allocated for it can be returned to the free list without disturbing younger instructions in the rename group—this would not be possible with a queue-style free list. If the move is not eliminated, it is not necessary to repair the inputs of younger instructions in the rename group that may depend on the move—it does not matter whether those instructions read the move's source register or new destination register. Because only register moves read reference counts and fewer than 1/4th of dynamic instructions are moves, our design uses a single reference count "read port". If a rename group contains more than one move, the younger moves are simply not considered for sharing; they are allocated new registers and dispatched like non-move instructions.
Table 3 shows the area and power cost of reference counting support for register sharing with support for a single move elimination per cycle. The marginal cost over vanilla reference counting is an extra bitvector and read port in the primary circuit, and wider rows in the checkpoint RAM. No additional decoders are needed.
Performance results. Figure 6 shows reduction in µops executed over the simple register sharing baseline. The first configuration (light grey) can detect and eliminate one register move per cycle, and permits two simultaneous sharers for any general purpose register.
The second configuration (grey) adds support for eliminating 32-bit moves. In x86_64, a 32-bit move (e.g., mov eax, ebx) copies the low bits and clears the high bits of the 64-bit destination register. We extend move elimination to capture this case by adding an extra bit to map table entries and register specifiers in the issue queue. Renaming/eliminating a 32-bit register move sets this bit which consumer instructions propagate to the source register specifiers in their issue queue entries—the upper bits of these sources are treated as zero during execution. This feature benefits SpecINT programs which manipulate 32-bit data, e.g., bzip2 and sjeng. Overall, this configuration eliminates 4% of dynamic µops, offsetting the cost of reference counting by reducing register file accesses—register file power drops from 9.68 mW to 9.25 mW in Table 3. And by reducing register file occupancy, it also enhances register-file power-gating as shown in the right (dark) bar in Figure 3. Move elimination allows power-gating 6% more registers on SpecFP and 3% more registers on SpecINT, and as much as 10% more registers on individual programs.
The third configuration in Figure 6 (dark grey) retains 32-bit move elimination support but can detect and eliminate two moves per cycle. The fourth (black stripes) models unlimited sharing on both per-cycle and per-register bases. Unlimited sharing doubles the number of instructions eliminated to 10% on SpecFP and over 6% on SpecINT, but at a larger hardware cost—additional reference count bitvectors, a wider reference count checkpoint table, and multiple reference count read ports.
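The zero-upper-bits handling described for 32-bit move elimination amounts to carrying one extra flag with the mapping and masking at operand read; the sketch below is illustrative and its field names are ours.

# Illustrative sketch of 32-bit move elimination: a "zero-upper" flag travels with the mapping.
def eliminate_move32(dst_lreg, shared_preg, shared_bit, map_table):
    # Sharing is established as for a 64-bit move; the extra flag marks zeroed upper bits.
    map_table[dst_lreg] = (shared_preg, shared_bit, True)

def read_source(value, zero_upper):
    # Consumers propagate the flag into their source specifiers and mask at read.
    return value & 0xFFFFFFFF if zero_upper else value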

6. Speculative Retirement

Several recently proposed architecture and microarchitecture techniques including TM (Transactional Memory) [3, 7, 14, 17] rely on speculative retirement, i.e., the ability to execute, complete, and release the pipeline resources of large numbers of instructions, and then either commit or abort their effects in bulk. The register state of speculatively retired instructions is collapsed into a checkpoint created at (speculative) retirement.

a) Reference counting for retirement stage checkpointing (left). The reference vector is replicated to split tracking of registers bound to architectural state from those bound to in-flight destinations. A third decoder clears the bit for the committing instruction's destination register in the in-flight vector while setting it in the architectural vector.
b) Reference counting for retirement stage checkpointing and multithreading (right). This combination uses four reference count vectors—in-flight and architectural for each thread. "Port" decoders grow with dispatch/commit width, not thread count.
c) Reference counting for retirement stage checkpointing and multithreading on a two-wide superscalar processor (right). The reference vectors are split into two banks. The allocation encoder and dispatch decoders are split per bank, as is the commit decoder—commit order follows dispatch order. Overwrite decoders are replicated at full width—overwrite does not follow dispatch order.
Figure 7. Reference counting for scalar processors with speculative retirement and SMT.

Component                      µm²     mW(dy)  mW(st)  mW(av)
Single-threaded
  Regfile-sb, 160x64b, 6r3w    130828  24.40   3.00    11.30
  Freelist, 27x32b, 1r1w       2279    2.08    0.06    0.82
  Regfile-sb + freelist        133107  26.48   3.06    12.12
  Regfile, 160x64b, 6r3w       108733  22.70   2.49    9.97
  Refcount, 2x160b, 4r4w4rw    10641   1.38    0.11    0.86
  Checkpoint, 4x160b, 1rw      2639    1.33    0.06    0.19
  Regfile + refcount           130828  25.41   2.66    11.02
Multi-threaded
  Regfile-sb, 224x64b, 6r3w    185900  26.60   4.27    12.10
  Freelist, 27x32b, 1r1w       2279    2.08    0.06    0.82
  Regfile-sb + freelist        188179  28.68   4.33    12.92
  Regfile, 224x64b, 6r3w       152226  24.50   3.49    11.00
  Refcount, 4x224b, 4r4w4rw    23373   2.54    0.22    1.68
  Checkpoint, 4x224b, 1rw      3658    1.91    0.08    0.27
  Regfile + refcount           179257  28.95   3.71    12.95
Table 4. Register management area and power for speculative retirement.

Implementing speculative retirement in a processor with a unified register file requires a retirement map table. It is this map table—not the rename map table—whose contents are checkpointed and restored to support checkpoint/abort operations. Additionally, it is necessary to either checkpoint the contents of the register file itself or to "pin down" the registers named in the retirement map table, temporarily taking them out of circulation. With a conventional free list, checkpointing register file values is the only option.

Value checkpointing is implemented with low area and latency overheads and moderate power overhead using "shadow bitcells" [5]. It also allows a checkpointed thread to use the full complement of rename registers. In an SMT processor, tracking thread register usage in a bitvector—a simple form of reference counting—allows threads to checkpoint independently using a single set of shadow bitcells.
Reference counting formulation. Reference counting supports the pinning approach. Pinning does not add area, power, or latency to the register file, but reduces the number of rename registers a checkpointed thread can use. This is not as bad as it first appears because only registers mapped to logical registers that are redefined are lost. Redefinition of all logical registers in one speculative episode is unlikely, e.g., an integer program is unlikely to redefine many floating point registers.
Figure 7a shows reference counting support for speculative retirement on a scalar processor. The reference vector is replicated—one replica tracks register usage by the ROB, the second tracks usage by the retirement map table (ARCH). These two vectors are manipulated using three P-wide decoders. One decoder sets the bit for the dispatched destination register in the ROB vector. A second decoder clears the bit for the overwritten register in the ARCH vector. The third decoder "moves" the bit for the destination register of the committing instruction from the ROB vector to the ARCH vector. Speculative retirement checkpoints can share the RAM used to hold checkpoints for single-cycle branch misprediction recovery.
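The three decoder operations just described map onto a pair of bitvectors as in the following illustrative fragment; vector and function names are ours.

# Illustrative sketch of the ROB/ARCH dual-vector formulation for speculative retirement.
def dispatch(rob_vec, dest):
    rob_vec[dest] = 1                     # decoder 1: destination held by an in-flight instruction

def commit(rob_vec, arch_vec, dest, overwritten):
    arch_vec[overwritten] = 0             # decoder 2: release the overwritten architectural register
    rob_vec[dest] = 0                     # decoder 3: "move" the destination bit from the
    arch_vec[dest] = 1                    #            in-flight vector to the architectural vector

def free_list(rob_vec, arch_vec):
    return [int(not (r or a)) for r, a in zip(rob_vec, arch_vec)]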
Figure 7b shows reference counting which supports both speculative retirement and SMT.
Table 4 shows the hardware cost of implementing retirement-stage register file checkpointing for speculative retirement for both single-threaded and dual-threaded 4-wide superscalar processors. In the single-threaded core, the reference counting/pinning implementation has 10% lower power consumption and 2% lower area than the freelist/shadow bitcell implementation. In the multi-threaded core, the power advantage of reference counting disappears because of the increased cost of reference counting multiple threads. In both cores, reference counting also supports register-file power-gating, physical register sharing, and other optimizations whose benefits are not accounted for here.

7. Execution-Driven Register Reclamation

In retirement-driven reclamation, a register is reclaimed when the instruction that overwrites the corresponding logical register commits. Some proposed uses of register reference counting implement execution-driven register reclamation. Here, a register is reclaimed earlier, when the overwriting instruction is dispatched and all readers have executed [16].
Execution-driven register reclamation scales the register file, but breaks the instruction-granularity state recovery mechanism that supports branch speculation and precise traps. CPR (Checkpoint Processing and Recovery) [2] marries execution-driven register reclamation with precise state. CPR creates periodic rename map-table checkpoints. A checkpoint holds all registers named in it until it is released. CPR then implements execution-driven register reclamation between checkpoints. Precise recovery to any inter-checkpoint instruction is implemented by rolling back to an older checkpoint and executing forward—the additional execution is called checkpoint overhead.
Execution-driven register reclamation supports non-blocking execution. A non-blocking processor uses dependence-tracking (e.g., wakeup/select) to identify instructions that depend on long-latency cache misses, drain them from the window, and release their issue queue entries and (physical) registers, freeing up these resources for use by younger instructions. The miss-dependent instructions wait in a buffer until the miss returns, at which time they reacquire registers and issue queue entries, re-enter the window, and execute. CFP (Continual Flow Pipelines) [23] is a non-blocking microarchitecture based on CPR. The BOLT (Better Out-of-Order Latency Tolerance) proposal [8] hybridizes execution-driven with retirement-driven reclamation to eliminate checkpoint overhead, reduce checkpointing requirements, and simplify re-execution of deferred instruction slices by storing those slices in program order.
Execution-driven reclamation is difficult to implement using traditional free lists, which require assigning responsibility for freeing a given register to a particular instruction a priori. With execution-driven reclamation, the instruction responsible for freeing a given register may be the last reader of the register to execute. Because execution occurs out of order, this instruction cannot be easily identified in advance.
Reference counting formulation. We describe two reference counting designs—one supports CPR/CFP, the second supports BOLT. The CPR/CFP design uses a reference vector implemented using latches and two reference RAMs. The P-element reference vector tracks the register contents of the rename map table. It is identical to the one that implements retirement-driven register reclamation described in Section 3 and shown in Figure 2 except that the overwritten registers are those that are overwritten at rename rather than those that are overwritten at commit. The RAMs represent reference counts of registers held by checkpoints and by issue queue entries, respectively. The checkpoint RAM resembles the RAM used to implement fast recovery in a ROB microarchitecture, except that its contents are NOR'ed into the free list column-wise. The issue queue RAM has one row per issue queue entry, one write port per dispatch slot and one "clear" port per issue slot—a clear port is a wordline-only port that resets an entire row. Again, the columns of the issue queue reference RAM are NOR'ed into the free list.
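Under execution-driven reclamation, the free list is the column-wise NOR over every structure that can hold a register (the rename vector, the checkpoint RAM, and the issue queue RAM), as in this illustrative sketch (names are ours):

# Illustrative sketch: a register is free only when no holder structure references it.
def free_list(rename_vec, checkpoint_ram, iq_ram):
    free = []
    for p in range(len(rename_vec)):
        held = rename_vec[p] \
            or any(row[p] for row in checkpoint_ram) \
            or any(row[p] for row in iq_ram)
        free.append(int(not held))
    return free

def issue_queue_clear(iq_ram, entry):
    # The "clear" port is a wordline-only port that resets an entire issue queue row at once.
    iq_ram[entry] = [0] * len(iq_ram[entry])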

BOLT is a latency-tolerant microarchitecture that uses both speculative retirement and multithreading. It has a hybrid register management algorithm. Primary (i.e., first-pass) execution uses retirement-driven reclamation with support for speculative retirement. Deferred instructions re-execute using an available thread context and use execution-driven reclamation. BOLT reference counting combines the mechanism for speculative retirement and SMT (Section 6, Table 4) with the issue queue reference counting RAM used for CPR/CFP. BOLT's issue queue RAM needs fewer write and clear ports than CPR's because BOLT uses execution-driven reclamation only during re-execution.
Table 5 shows the area and power of the reference counting structures required to implement CPR/CFP (top) and BOLT (bottom). For ease of comparison, both micro-architectures are implemented on a 224-register SMT substrate although only BOLT exploits SMT. CFP cannot reuse the additional map tables present in an SMT processor to re-rename re-executing instructions, requiring a bespoke physical-to-physical map table instead. One of the previously unstated advantages of BOLT over CPR/CFP—previously stated advantages include better performance and lower implementation complexity, and a lower dynamic instruction execution count—is its greatly reduced reliance on issue queue based reference counting. The issue queue reference counting structure is large and power hungry: every dispatching instruction requires a 224-bit write. CPR/CFP write and clear rows in this structure for every dynamic instruction, incurring a power surcharge that is equal to 23% of register file power. BOLT uses issue queue reference counting to track only re-executing instructions and incurs a surcharge of only 5% as a result.

Component                     µm²     mW(dy)  mW(st)  mW(av)
CPR/CFP
  Regfile, 224x64, 6r3w       152226  24.50   3.49    11.66
  Refcount, 2x224b, 4r4w      14515   1.54    0.14    1.00
  Checkpoint, 8x224, 1rw      5196    2.11    0.11    0.32
  Issue queue, 32x224, 4c4w   37831   5.06    1.13    2.55
  Regfile + refcount          209768  33.21   4.87    15.53
BOLT
  Regfile, 224x64, 6r3w       152226  24.50   3.49    11.70
  Refcount, 4x224b, 4r4w4rw   23373   2.54    0.22    1.68
  Checkpoint, 4x224, 1rw      3658    1.91    0.08    0.27
  Issue queue, 32x224, 2c2w   18256   1.78    0.55    0.59
  Regfile + refcount          197513  30.73   4.34    14.24
Table 5. Register management area and power for execution-driven register reclamation.
8. ED² Impact

Although reference counting has a small energy (and area) overhead relative to a traditional free list implementation for a vanilla microarchitecture, the techniques it enables—specifically register-file power-gating and dynamic move elimination—conserve enough energy to turn it into a net energy reduction technique. Further, by improving performance in addition to reducing energy consumption, dynamic move elimination makes reference counting an ED² reduction technique, i.e., a technique that is more effective at trading power for performance than dynamic voltage and frequency scaling (DVFS) and which is suitable for a processor that implements DVFS [13].
Whereas we have detailed models of the register file and reference counting logic, we do not have similar models for the rest of the processor and therefore must make some assumptions. Specifically, we use published data from the Power7 [27] to scale our results. In the Power7, the register file accounts for 20% of total core power—specifically, 14% of its leakage power and 24% of its dynamic power. Also, register file dynamic power is three times higher than register file leakage power. Given these, we compute E_l,base = E_l,base,RF / 0.14 and E_l,base,nonRF = E_l,base - E_l,base,RF. Similarly, E_d,base = (E_l,base,RF x 3) / 0.24 and E_d,base,nonRF = E_d,base - E_d,base,RF. E_base = E_l,base + E_d,base. For a non-baseline configuration, E_l = (E_l,base,nonRF + (P_l,RF / P_l,base,RF) x E_l,base,RF) x CC / CC_base, where P is power and CC is cycle count. Essentially, we leave the non-register file portion untouched and scale the register file portion using the circuit-modeled ratio of old to new register file—here 'register file' includes register management structures, free list or reference counting. We then scale by relative execution time. We apply the same technique to dynamic power, scaling this time by relative instruction execution count. We discount the execution of eliminated register moves by a factor of 0.5 because elimination does not affect fetch, decode, or commit. E_d = (E_d,base,nonRF + (P_d,RF / P_d,base,RF) x E_d,base,RF) x (IC - 0.5 x IC_elim,move) / IC_base, where IC is instruction count.

[Figure 8: bars show the percent change in core ED² for the reference counting, + gating, and + move elimination configurations across SpecFP and SpecINT benchmarks.]
Figure 8. Processor ED² reduction due to reference counting enabled optimization.

Figure 8 shows the relative ED² of 4-wide superscalar processors with different reference counting configurations and reference counting enabled optimizations. The first bar corresponds to a configuration that replaces a conventional free list with reference counting but implements no other optimization. This bar is essentially invisible because there is no performance change and negligible energy overhead. The next bar adds register file gating. Here again, the changes are small although they generally are for the better. Register-file gating does not affect performance, and while it does reduce leakage energy, this benefit is small at 45nm. The energy savings will be more significant in smaller, leakier technologies. Move elimination has a larger effect, reducing ED² by 3.5% and 4.8% for SpecINT and SpecFP by improving register file gating opportunities and reducing execution time. Latency tolerant execution, another technique enabled by reference counting, also reduces ED² [8].
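For concreteness, the scaling arithmetic above can be written out as a short function; the 0.14, 0.24, and 3x ratios are the Power7 figures quoted in the text, and any inputs passed to it would be illustrative.

# Illustrative worked form of the leakage/dynamic energy scaling described above.
def scaled_energy(El_base_RF, Pl_RF, Pl_base_RF, Pd_RF, Pd_base_RF,
                  CC, CC_base, IC, IC_elim_move, IC_base):
    El_base = El_base_RF / 0.14                 # RF is 14% of baseline leakage power
    El_base_nonRF = El_base - El_base_RF
    Ed_base_RF = 3.0 * El_base_RF               # RF dynamic power is 3x RF leakage power
    Ed_base = Ed_base_RF / 0.24                 # RF is 24% of baseline dynamic power
    Ed_base_nonRF = Ed_base - Ed_base_RF
    El = (El_base_nonRF + (Pl_RF / Pl_base_RF) * El_base_RF) * CC / CC_base
    Ed = (Ed_base_nonRF + (Pd_RF / Pd_base_RF) * Ed_base_RF) \
         * (IC - 0.5 * IC_elim_move) / IC_base
    return El + Ed                              # E = El + Ed for the non-baseline configuration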
9. Conclusion

Reference counting is a flexible register management scheme for processors that enables optimizations not possible with a traditional free list. We presented a set of reference counting designs that support a range of micro-architectures. Circuit simulation shows that reference counting has a slight area and energy overhead over a traditional free list for a vanilla microarchitecture. However, reference counting supports optimizations like register-file power-gating and dynamic register sharing that offset this overhead, even turning it to an advantage. Finally, the real power of reference counting is that it supports designs in which traditional free lists cannot function at all, specifically scalable-window latency-tolerant designs.

10. Acknowledgments

We thank the reviewers for their comments. This work was supported by multi-institution NSF grants CCF-1017184 (UPenn) and CCF-1017654 (Drexel).

References

[1] A. Agarwal, R. Kaushik, and R. Krishnamurthy. A leakage-tolerant low-leakage register file with conditional sleep transistor. In Proc. 2004 Intl. SOC Conference, Sep. 2004.
[2] H. Akkary, R. Rajwar, and S. Srinivasan. Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors. In Proc. 36th Intl. Symp. on Microarchitecture, Dec. 2003.
[3] C. Blundell, M. Martin, and T. Wenisch. InvisiFence: Performance-Transparent Memory Ordering in Conventional Multiprocessors. In Proc. 36th Intl. Symp. on Computer Architecture, Jun. 2009.
[4] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. In Proc. 38th Intl. Symp. on Computer Architecture, 2011.
[5] O. Ergin, D. Balkan, D. Ponomarev, and K. Ghose. Increasing Processor Performance Through Early Register Release. In Proc. 22nd Intl. Conf. on Computer Design, Oct. 2004.
[6] X. Guan and Y. Fei. Reducing power consumption of embedded processors through register file partitioning and compiler support. In Proc. 2008 Intl. Conf. on Application-Specific Systems, Architectures and Processors, Jul. 2008.
[7] M. Herlihy and J. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. In Proc. 20th Intl. Symp. on Computer Architecture, Jun. 1993.
[8] A. Hilton and A. Roth. BOLT: Energy-Efficient Out-of-Order Latency-Tolerant Processing. In Proc. 16th Intl. Symp. on High Performance Computer Architecture, Jan. 2010.
[9] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose. Microarchitectural techniques for power gating of execution units. In Proc. 2004 Intl. Symp. on Low Power Electronics and Design, 2004.
[10] S. Jourdan, R. Ronen, M. Bekerman, B. Shomar, and A. Yoaz. A Novel Renaming Scheme to Exploit Value Temporal Locality Through Physical Register Reuse and Unification. In Proc. 31st Intl. Symp. on Microarchitecture, Dec. 1998.
[11] J. Keller. The 21264: An Alpha processor with out-of-order execution. In 9th Annual Microprocessor Forum, Oct. 1996.
[12] D. Koufaty and D. Marr. HyperThreading Technology in the NetBurst Microarchitecture. IEEE Micro, 23(2), Mar. 2003.
[13] A. Martin, M. Nystroem, and P. Penzes. ET2: A Metric for Time and Energy Efficiency of Computation. Technical Report CSTR:2001.007, CalTech, 2001.
[14] J. Martinez, J. Renau, M. Huang, M. Prvulovic, and J. Torrellas. Cherry: Checkpointed Early Resource Recycling in Out-of-Order Microprocessors. In Proc. 35th Intl. Symp. on Microarchitecture, Nov. 2002.
[15] P. Michaud. A PPM-like, Tag-Based Branch Predictor. Journal of Instruction Level Parallelism, 7(1), Apr. 2005.
[16] M. Moudgill, K. Pingali, and S. Vassiliadis. Register Renaming and Dynamic Speculation: An Alternative Approach. In Proc. 26th Intl. Symp. on Microarchitecture, Dec. 1993.
[17] N. Neelakantam, R. Rajwar, S. Srinivas, U. Srinivasan, and C. Zilles. Hardware Atomicity for Reliable Software Speculation. In Proc. 34th Intl. Symp. on Computer Architecture, Jun. 2007.
[18] V. Petric, T. Sha, and A. Roth. RENO: A Rename-Based Instruction Optimizer. In Proc. 32nd Intl. Symp. on Computer Architecture, Jun. 2005.
[19] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories. In Proc. 2000 Intl. Symp. on Low Power Electronics and Design, 2000.
[20] A. Roth. Physical Register Reference Counting. Computer Architecture Letters, 7(1), Jan. 2008.
[21] S. Roy, N. Ranganathan, and S. Katkoori. State-Retentive Power Gating of Register Files in Multi-core Processors featuring Multithreaded In-Order Cores. IEEE Transactions on Computers, PP(99), 2010.
[22] T. Sha, M. Martin, and A. Roth. NoSQ: Store-Load Communication without a Store Queue. In Proc. 39th Intl. Symp. on Microarchitecture, Dec. 2006.
[23] S. Srinivasan, R. Rajwar, H. Akkary, A. Gandhi, and M. Upton. Continual Flow Pipelines. In Proc. 11th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 2004.
[24] L. Tran, N. Nelson, F. Ngai, S. Dropsho, and M. Huang. Dynamically Reducing Pressure on the Physical Register File Through Simple Register Sharing. In Proc. 2004 Intl. Symp. on Performance Analysis of Systems and Software, Mar. 2004.
[25] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proc. 22nd Intl. Symp. on Computer Architecture, Jun. 1995.
[26] K. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, Apr. 1996.
[27] V. Zyuban, J. Friedrich, C. J. Gonzalez, R. Rao, M. D. Brown, M. M. Ziegler, H. Jacobson, S. Islam, S. Chu, P. Kartschoke, G. Fiorenza, M. Boersma, and J. A. Culp. Power optimization methodology for the IBM POWER7 microprocessor. IBM Journal of Research and Development, 55(3), May–Jun. 2011.