Appears in MICRO-38, November 2005

How to Fake 1000 Registers

David W. Oehmke†, Nathan L. Binkert, Trevor Mudge, Steven K. Reinhardt

Advanced Computer Architecture Lab, University of Michigan, Ann Arbor
{doehmke, binkertn, tnm, stever}@eecs.umich.edu

† Currently at Cray Inc.

Abstract

Large numbers of logical registers can improve performance by allowing fast access to multiple subroutine contexts (register windows) and multiple thread contexts (multithreading). Support for both of these together requires a multiplicative number of registers that quickly becomes prohibitive. We overcome this limitation with the virtual context architecture (VCA), a new register-file architecture that virtualizes logical register contexts. VCA works by treating the physical registers as a cache of a much larger memory-mapped logical register space. Complete contexts, whether activation records or threads, are no longer required to reside in their entirety in the physical register file. A VCA implementation of register windows on a single-threaded machine reduces data cache accesses by 20%, providing the same performance as a conventional machine while requiring one fewer cache port. Using VCA to support multithreading enables a four-thread machine to use half as many physical registers without a significant performance loss. VCA naturally extends to support both multithreading and register windows, providing higher performance with significantly fewer registers than a conventional machine.

1. Introduction

Twenty-five years ago architects expected VLSI trends to enable processors with thousands of registers, and they devised schemes such as register windows and multithreading to take advantage of this resource [25]. Unfortunately, the realities of wire delay and power consumption, along with the large number of ports needed to support wide-issue superscalar pipelines, made these extremely large register files impractical. Nevertheless, those original ideas have merit. This paper shows how architects can achieve the benefits of large logical register files using a standard physical register file.

Registers are a central component of both instruction-set architectures (ISAs) and processor microarchitectures. From the ISA's perspective, a small register namespace allows the encoding of multiple operands in an instruction of reasonable size. Registers also provide a simple, unambiguous specification of data dependences, because—unlike memory locations—they are specified directly in the instruction and cannot be aliased. From the microarchitecture's point of view, registers comprise a set of high-speed, high-bandwidth storage locations that are integrated into the datapath more tightly than a data cache, and are thus far more capable of keeping up with a modern superscalar execution core.

As with many architectural features, these two purposes of registers can conflict. For example, the dependence specification encoded by the register assignment is adequate for the ISA's sequential execution semantics. However, out-of-order execution requires that these logical registers be renamed into an alternate, larger physical register space to eliminate false dependencies.

This paper addresses a different conflict between the abstraction of registers and its implementation: that of context. A logical register identifier is meaningful only in the context of a particular procedure instance (activation record) in a particular thread of execution. From the ISA's perspective, the processor supports exactly one context at any point in time. However, a processor designer may wish for an implementation to support multiple contexts for several reasons: to support multithreading, to reduce context switch overhead, or to reduce procedure call/return overhead (e.g., using register windows) [15, 27, 29, 5, 20, 25]. Conventional designs require that each active context be present in its entirety; thus each additional context adds directly to the size of the register file. Unfortunately, larger register
files are inherently slower to access, leading to a slower cycle time or additional cycles of register access The virtual context architecture enables a near latency, either of which reduces overall performance. optimal implementation of register windows, improv- This problem is further compounded by the additional ing performance and greatly reducing traffic to the data rename registers necessary to support out-of-order exe- cache (by up to 10% and 20%, in our simulations). cution. VCA naturally supports simultaneous multithreading We seek to bypass this trade-off between multiple (SMT) as well, allowing large numbers of threads to be context support and register file size by decoupling the multiplexed on relatively small physical register files. logical register requirements of active contexts from Our results show that a VCA SMT machine can the contents of the physical register file. Just as caches achieve speedups comparable to a conventional SMT and virtual memory allow a processor to give the illu- implementation with twice as many registers. sion of numerous multi-gigabyte address spaces with Implementing both register windows and multi- an average access time approaching that of several threading in the same processor requires a multiplica- kilobytes of SRAM, we propose an architecture that tive number of contexts. The one such machine of gives the illusion of numerous active contexts with an which we are aware (Sun’s Niagara [12]) has 640 reg- average access time approaching that of a convention- isters per core—even though it is an in-order pipeline, ally sized register file. Our design treats the physical it does not require renaming registers. VCA provides register file as a cache of a practically unlimited mem- unified support for both techniques in an out-of-order ory-backed logical register space. We call this scheme pipeline without the multiplicative increase in register the virtual context architecture (VCA). count normally required. 
We show that VCA can In VCA, an individual instruction needs only its achieve near-peak performance on an out-of-order source operands and its destination register to be pipeline using register windows and four threads (as in present in the register file to execute. Inactive register Niagara) with only 192 physical registers. values are automatically saved to memory as needed, We describe VCA in more detail in Section 2. and restored to the register file on demand. The archi- Section 3 discusses our simulation-based evaluation tecture modifies the rename stage of the pipeline to methodology. Section 4 evaluates a VCA-based imple- trigger the movement of register values between the mentations of register windows, SMT, and a combina- physical register file and the data cache. Furthermore, a tion of both techniques. Section 5 presents previous thread can change its register context simply by chang- work. Section 6 concludes and discusses future work. ing a base pointer—either to another register window on a call or return, or to an entirely different software 2. The Virtual Context Architecture thread context. Compared to prior proposals (see Section 5), VCA: This section presents the virtual context architec- • unifies support for both multiple independent ture in two phases. First, we outline the basic changes threads and register windowing within each thread; to a conventional out-of-order processor required to • completely decouples the physical register file size support VCA. Second, we describe a handful of imple- from the number of logical registers by using mem- mentation optimizations that make VCA more effi- ory as a backing store; cient. • enables the physical register file to hold just the most active subset of logical register values, instead 2.1. 
Basic VCA implementation of the complete register contexts, by allowing the hardware to spill and fill registers on demand; The fundamental operation of VCA lies in map- • is backward compatible with existing ISAs at the ping logical registers to physical registers. VCA builds application level for multithreaded contexts, and on the register renaming logic already found in a mod- requires only minimal ISA changes for register ern dynamically scheduled processor. A key feature of windowing; VCA is that it has minimal impact outside of the • requires no changes to the physical register file rename stage. Figure 1 illustrates the blocks that are design and the performance-critical schedule/exe- changed. We describe the basic operation of VCA by cute/writeback loop; first discussing the VCA renaming process, then detail- • is orthogonal to the other common techniques for ing the states and state transitions of physical registers. reducing the latency of the register file—register We then discuss the implications for branch recovery, banking and register caching; and the requirements for supporting multithreading and • does not involve speculation or prediction, avoid- register windows. ing the need for recovery mechanisms. physical register allocation are discussed in the follow- F D R1 R2 IQ ALU ing section.) RF In the simplest incarnation, fills and spills can be LSU D$ ASTQ accomplished by injecting load and store operations, I$ respectively, into the processor’s instruction queue. Figure 1: Pipeline diagram. These operations can use the same register ports, data- Shaded regions depict pipeline changes for VCA. paths, and cache ports as regular load and store instruc- R2 represents the extra rename stage necessary. tions, but are simpler in some ways. Because the full ASTQ is the architectural state transfer queue, memory address is generated in rename, the execution described in Section 2.2.2. 
stage need not read a base address register or perform an effective address calculation. The operations are not 2.1.1. VCA renaming. Renaming in VCA is a two- entered into the reorder buffer and do not go through stage process. First, the source and destination register the commit stage because they are not part of the actual indices in the machine instruction are combined with program flow. the thread's context base pointer to generate memory Once physical registers have been identified for addresses. In the simplest case, each register index is the instruction’s source and destination values, the simply added to the base pointer to form a logical reg- instruction is dispatched to the instruction queue and ister memory address. reorder buffer using these physical register indices, as The second stage of renaming maps the logical in a conventional machine. By creating appropriate register memory addresses to physical registers. dependences between any required spills and fills and Because VCA maps registers from a large, sparse the instruction itself, this dispatch need not wait for the space, the rename table must have tags like a cache or spills and fills to complete. The instruction then passes TLB. To avoid deadlock, the rename table must be able through remainder of the pipeline (schedule, execute, to concurrently map all of an instruction's source oper- writeback, and commit) just as it would in a conven- ands; thus the table must either guarantee that no two tional out-of-order processor. registers from the same context can conflict or have an associativity at least equal to the maximum number of 2.1.2. VCA physical register states. This section pro- source register operands (typically two). Although vides a more thorough description of VCA operation higher associativity provides higher performance by by detailing the possible physical register states and reducing rename-table conflicts, we find that a four- state transitions. 
Because the VCA register file serves way set associative table provides good performance as both a rename register file and a cache, physical reg- and use this configuration in our simulations. ister management combines aspects of conventional Section 2.2 describes further rename-table optimiza- register free-list management and cache replacement. tions that allow a significant reduction in the number of Conceptually, VCA associates four pieces of state tag bits required for each entry. with each physical register: the corresponding logical The key difference between VCA and a conven- register memory address (if any), a reference count, a tional architecture is that the rename table lookup may committed bit (C), and a dirty bit (D). The reference miss, i.e., there may be no physical register mapping count tracks the number of uses of the register by for the desired logical register. For destination regis- instructions in the execution portion of the pipeline, as ters, a miss is not a problem, as the previous value of in previous work [18, 1]. A register’s reference count is the logical register is not needed. The lookup of the incremented when an instruction using the register destination register is only to determine which physical passes through rename, and decremented when the register, if any, to mark as free when the current instruction commits or is flushed due to an exception instruction is committed. For sources, however, the or misspeculation. The committed bit, if set, indicates logical register must be allocated to a free physical reg- that the register holds a valid, non-speculative register ister and the register value must be read in from mem- value. A set dirty bit indicates that the value is more ory. We call this operation a fill. up-to-date than the corresponding memory address, If there are no free physical registers, a physical and needs to be written back before being replaced. 
register already allocated to a different logical register Physical registers with non-zero reference counts must be freed. If the value in the physical register is are considered pinned (P) and cannot be reallocated to modified with respect to the logical register’s memory other logical registers. Pinning guarantees that all the location, the value must first be written back to mem- physical register indices an instruction needs to exe- ory; we call this operation a spill. (Further details of (through state PCD), but from the implementation per- alloc dst spective it is atomic. ++ref alloc src ++ref Committed values remain allocated to physical PC PC registers even after they become unpinned (i.e., their reference counts decrement to zero), so that the values commit dst commit vrrt spill overwrite st --ref > 0 are available to future instructions without requiring a d t i 0 m = fill. These cached unpinned values correspond to states = m f co e PCD and PCD (depending on whether the value was -r - generated by an instruction or brought in by a fill, alloc src alloc src respectively). This behavior provides the physical reg- ++ref ++ref fill PCD PCD ister file’s caching effect. These committed registers commit src can become free in two ways. First, the logical register commit src --ref > 0 --ref == 0 being cached can be overwritten when a later instruc- tion with the same logical destination commits. This overwrite deallocate or transition corresponds to the freeing of a physical reg- alloc src ister in a conventional speculative out-of-order proces- ++ref alloc src sor. Second, the register can be reallocated to a ++ref different logical register. This transition corresponds to PCD PCD a cache replacement. When there are no free registers commit src commit src available, the rename stage selects an unpinned register --ref > 0 --ref == 0 for replacement using an LRU algorithm. 
If the selected register is dirty, it must be spilled before it can Figure 2: Physical register state machine. be reallocated. To avoid spilling physical registers that cute will remain valid from the time the instruction are about to be overwritten, registers for which an receives them at rename until that instruction commits. overwriting instruction has been dispatched are given The complete register state diagram is shown in the lowest priority for replacement. From the imple- Figure 2. All physical registers start out in the mentation’s perspective, a reallocation takes a physical unpinned, uncommitted state (PC), which we refer to register directly from PCD or PCD to PCD or PC as free. In this state (and this state only), a physical reg- (depending on whether the register is reallocated as a ister has no association with any logical register. A source or a destination). For clarity, Figure 2 breaks physical register leaves the free state in one of two these transitions down into multiple independent tran- ways. It may be allocated as a destination register for a sitions passing through the free state (PC). new instruction, in which case it is associated with the The rename stage may be forced to stall if there logical register’s memory location and becomes pinned are no free or unpinned registers. Forward progress is (the reference count is incremented) but remains guaranteed, however, because all instructions that have uncommitted (state PC). As long as the value is passed rename are guaranteed to be able to commit uncommitted, it should not be written back, so the (due to register pinning). As these instructions commit, value of the dirty bit is irrelevant. When the producer they will decrement the register reference counts, even- instruction commits, the physical register’s value tually unpinning physical registers. (If rename stalls becomes the committed value for the corresponding indefinitely, all physical registers will be unpinned logical register. 
The committed and dirty bits are both once all renamed instructions have committed.) set. The instruction may remain pinned or become unpinned (state PCD or PCD) depending on whether 2.1.3. Branch recovery. In processors with merged other instructions in the pipeline are referencing the physical register files, the processor must revert its reg- register as a source operand. ister map to an earlier state when a misspeculated A free register may also be allocated to hold an branch is encountered. Some processors, such as the existing logical register value that is needed as a source Compaq Alpha 21264 [9], checkpoint the rename table operand but is not already in the physical register file. at each branch and restore the appropriate checkpoint In this case, a fill operation will bring the value in from on a misprediction. Others, such as the Intel Pentium 4, memory. The reference count will also be incremented, maintain a separate rename table that is updated as placing the register in the PCD state. For clarity, instructions commit, thus tracking the machine’s com- Figure 2 shows this transition as requiring two steps mitted architectural state. This scheme recovers from a mispredicted branch at the head of the reorder buffer (ROB) by copying the commit-stage rename table to ing this partitioning on VCA requires that each thread the rename stage. The processor need not wait for the have two base pointers, one for each class of registers. mispredicted branch to reach the head of the ROB: on a The base pointer for windowed registers is updated on detected misprediction, the processor can iterate from procedure calls and returns, while the base pointer for the head of the ROB up to the mispredicted branch, non-windowed registers is modified only on context updating the commit-stage table in the process, to gen- switches. The architectural register index selects the erate the appropriate rename table state. 
Due to the size appropriate base pointer before being summed to form and complexity of VCA’s rename table, the check- the logical register address. pointing approach is likely to be infeasible. We use the Pentium 4 approach to provide misspeculation recov- 2.2. VCA optimizations ery in VCA. The virtual context architecture as described in the 2.1.4. Multithreading support. A conventional pipe- previous section is complete and functionally correct. line requires that each thread have its own rename table However, the straightforward implementation detailed and sufficient physical registers to hold its complete there suffers from some potential inefficiencies. This architectural state. VCA requires neither of these; it section discusses two optimizations we apply to the requires only that each thread have a separate context base VCA. First, we add a level of indirection to the base pointer. A machine instruction’s register index rename table to reduce the size of that structure. Sec- combined with the thread’s base pointer creates a glo- ond, we add a new structure to the pipeline, the archi- bal memory address that uniquely identifies the logical tectural state transfer queue (ASTQ), to handle spill register across all contexts. The rename table takes and fill operations more efficiently. these global addresses as input, and thus a single table suffices for all threads. A large number of threads 2.2.1. Rename table. VCA requires a cache-like could create more conflicts in the rename table, so a rename table with tags to map arbitrary addresses to heavily multithreaded machine could require a larger physical registers. Adding a full address tag to each or more associative rename table for optimum perfor- entry in the table significantly increases its size. mance, but the minimum configuration for functional Because the rename table will likely be replicated to correctness is unchanged. 
implement the required number of ports, this size Because VCA requires only the registers refer- increase may be costly. We can take advantage of reg- enced by in-flight instructions to be in the physical reg- ister address locality to drastically reduce the tag size ister file, there is also no requirement to increase the with only a modest increase in complexity. Although physical register size when increasing the level of mul- any memory address could be used as a base pointer, in tithreading (though again, performance concerns may practice only a relatively small number of addresses make a larger register file more desirable). will be used as base pointers within a given period of time. Register memory addresses also exhibit spatial 2.1.5. Register window support. The logical register locality around each base pointer address. context for a thread can be modified instantaneously by To exploit these characteristics, we introduce an updating its context base pointer value. Because the additional translation table that maps the upper bits of rename map and all in-flight instructions use memory each register memory address to a much smaller regis- addresses to identify logical registers, this update does ter space identifier (RSID). The concatenation of the not require a pipeline flush or any other state update or RSID and the remaining low-order memory address synchronization. Using the basic VCA as described bits (the register space offset) are then used for the thus far to support fully windowed registers for proce- rename table lookup. Figure 3 illustrates this process. dure calls and returns is as simple as allocating a stack In the example shown there, the upper 48 bits of the of contexts and updating the base pointer appropriately 64-bit register memory address are mapped to a 4-bit on calls and returns. Updating the base pointer by less RSID. 
As a result, the tag on each rename table entry is than the total size of the architectural register context only 11 bits rather than the 55 bits required without provides overlapping windows for parameter passing. RSIDs. In this case, the translation table could be Commercial ISAs that support register windows implemented with a small 16-entry fully associative (i.e., SPARC and IA-64) partition the logical registers buffer where the RSID is simply the index of the into windowed and non-windowed subsets. Registers matching entry. in the latter set do not change on procedure calls and If a register memory address does not find a match returns, and are used to hold global constants. Support- in the RSID table, it must allocate a new table entry, Register Address Tag PR Tag PR they can be supported by a streamlined hardware path. (Base Address + Logical Index) Adding simpler, dedicated resources for fills and spills 48 13 3 64 Entries allows them to bypass the instruction queue and 48 13 RS offset load/store queue, preserving these more precious resources for regular program instructions. 17 6 11b 8b 11b 8b There are three key differences between fill/spill 4 RSID 11 and load/store operations. First, fills and spills do not

Translation Table Translation = = require effective address calculation; fill addresses are calculated in rename, and spill addresses are obtained via table lookup. Second, fills and spills do not require Figure 3: Optimized rename table. (See Section 2.2.1.) memory disambiguation with respect to program loads and stores. VCA does not provide coherence between potentially replacing an existing valid table entry. In the register file and backing memory, so the system this (rare) situation, any physical registers using the provides no guarantees on the results of regular pro- current RSID must be flushed to memory before the gram accesses to the memory-mapped register space. RSID can be reused. It may be desirable to associate Third, fills and spills do not have data dependences on reference counters with each RSID so that unused regular instructions. A regular instruction may have a RSIDs can be identified and reused without requiring a dependence on a fill, but fill/spill addresses do not flush operation. RSIDs can be managed in hardware come from registers (as mentioned) and spills are never (or low-level software, such as PAL code), and thus the generated unless the target physical register has no out- very existence of RSIDs is hidden from the operating standing references. Spills are always ready to execute system and user-level code. as soon as they are generated in rename, and the only We can reduce the overhead of the translation pro- dependence a fill may have is on a prior spill that is cess by caching the RSID associated with each base freeing its physical register target. pointer, accessing the RSID table only when a base As a result, spill and fill operations can be sched- pointer is updated (due to a context switch or a register- uled independently from all other program instructions window call or return). This optimization requires a using a simple FIFO buffer. 
We call this buffer the modest alignment restriction on the base pointer so that architectural state transfer queue (ASTQ). The location a logical register offset cannot generate a memory of the ASTQ in the pipeline is illustrated in Figure 1. address outside the range of the current RSID. A simi- ASTQ operations share issue ports with memory oper- lar scheme could be used to cache the physical ations so that additional register file and cache ports addresses associated with each RSID, eliminating the are not required. If there are fewer ready load or store need for TLB lookups (and the possibility of page operations than memory issue ports, the entry at the faults) on spill and fill operations. head of the ASTQ (if any) is issued to a free port. We Note that the optimal configuration of the RSID found that only four entries are required in the ASTQ scheme (both the number of RSIDs and the division of to provide maximum benefit. the register memory address bits between the RSID table lookup and the register space offset) will depend 3. Methodology on the expected usage model. The number of RSIDs should clearly be at least as large as the number of base We used simulation to compare VCA to a conven- pointers, and perhaps larger to enable caching of regis- tional architecture. We begin with common features, ters across context switches. For a system that does not then discuss details related specifically to register win- support register windows, the register space offset need dows and SMT, respectively. only be large enough to map a single logical register We simulated VCA using M5, a detailed execu- file, while a register-windowed machine would per- tion-driven architecture simulator [3]. We used an form best with larger register spaces that can map the aggressive but realistic four-issue superscalar proces- working set of an active register stack. sor as our baseline. Table 1 lists the simulated baseline system parameters. We chose a short pipeline depth (8 2.2.2. 
ASTQ. The basic VCA implementation described above inserts standard load and store operations into the pipeline to implement fills and spills, respectively. However, fill and spill operations have simpler requirements than program loads and stores, so …

The size of the VCA rename table varies based on the number of threads supported. The table is set-associative with 64 entries per way, with associativity of 3, 5, or 6 (192, 320, or 384 entries) for one, two, and four threads, respectively. The VCA rename table has only 8 ports (compared with 12 on the baseline) to further reduce its complexity. Reads of the same register are combined and use a single port. At most two spill or fill operations each cycle can be written into the ASTQ, which has four entries. If the rename table or ASTQ ports are exhausted, one or more instructions will be delayed in rename until the following cycle.

… cycles) to exaggerate the impact of the added rename cycle we assume for VCA (see Figure 1). We also chose a conservative 3-cycle data cache hit latency to reflect realistic spill and fill overheads.

Table 1: Baseline processor parameters.

  Machine Width                   4
  Instruction Queue               128
  Reorder Buffer                  192
  Pipeline depth (fetch to exec)  8 cycles
  DL1 Cache Ports                 2 R/W
  DL1 Cache Size                  64K, 4-way, 3-cycle hit
  IL1 Cache Size                  64K, 4-way, 1-cycle hit
  L2 Cache Size                   1M, 4-way, 15-cycle hit
  Memory Latency                  250 cycles
  Branch Predictor                Hybrid

We evaluated the designs using the SPEC CPU2000 benchmarks [10] (except for the four Fortran 90 benchmarks, which the GNU compiler suite could not compile). The benchmarks were compiled with gcc 3.3.3 at -O3 optimization, which includes function inlining. We generated the best single SimPoint [24] for each binary running the reference input. For benchmarks with more than one input, the input closest to the average IPC of all the inputs was selected. Each program was simulated for 100 million instructions after a 5 million instruction warm-up period.

3.1. Register windows

Comparing executions with and without register windows requires extra effort. Windowed registers visibly change the ISA semantics, requiring recompilation to eliminate the stores and loads used to explicitly spill and fill registers on a non-windowed machine. Furthermore, the elimination of these loads and stores changes the program path length, meaning that IPC is no longer a valid metric for comparison.

Because our simulation environment supports only the Alpha ISA [26], we defined a variant of Alpha that includes windowed registers. To maintain compatibility with pre-existing binaries, any register used to communicate values across a function call (in either direction) is treated as non-windowed (it does not change), while all other registers are treated as windowed (they change on function calls and returns). We overload the call and return instructions to allocate and deallocate register windows, respectively. We modified the GNU compiler suite (gcc 3.3.3) [8] to support the new architecture and ABI. The GNU standard C library (glibc 2.3.2) was also recompiled using the new ABI for linking with the windowed binaries.

To deal with the differing instruction path lengths of the windowed and non-windowed binaries, we measured the number of instructions required to execute both versions of each benchmark to completion using fast functional simulation. We estimate execution time as the product of the CPI from the detailed SimPoint simulation and the complete benchmark's dynamic instruction count. Total cache accesses are calculated similarly, by multiplying the rate at which the cache is accessed (per committed instruction) by the total dynamic instruction count. The ratio between the path lengths is shown in Table 2. Because register windows only affect performance when there is a reasonably high frequency of function calls, we use only those benchmarks that make a function call at least once every 500 instructions.

Table 2: Path length ratio (register window to baseline).

  Benchmark        Ratio    Benchmark       Ratio
  bzip2_graphic    0.92     twolf           0.99
  crafty           0.93     vortex_2        0.82
  eon_rushmeier    0.94     vpr_route       0.90
  gap              0.91     ammp (FP)       0.98
  gcc_expr         0.92     equake (FP)     0.94
  gzip_graphic     0.92     mesa (FP)       0.92
  parser           0.92     wupwise (FP)    0.93
  perlbmk_535      0.85     Average         0.92

We simulated the windowed binaries from execution points corresponding to the SimPoints in the baseline binaries. We identified these points by counting dynamic conditional control instructions, which remained constant (within 0.001%) between the windowed and non-windowed binaries.

3.2. Simultaneous multithreading

We used a scheme similar to that of Raasch [22] to generate representative multi-thread workloads. We simulated all 253 possible two-thread workloads using the baseline architecture. We then generated a vector of 14 statistics (IPC, cache miss rate, etc.) for each workload and ran a linkage-based clustering algorithm to identify workloads with similar characteristics.
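The path-length-corrected estimates used in this methodology reduce to simple arithmetic. The sketch below uses invented CPI and instruction-count numbers (none are measured data from the paper) to show why raw IPC is misleading when the windowed and non-windowed binaries execute different numbers of instructions:

```python
# Sketch of the estimation method: a detailed SimPoint simulation supplies
# per-instruction rates, and a fast functional run of the complete benchmark
# supplies the dynamic instruction count. All numbers here are hypothetical.

def exec_time(simpoint_cpi, dynamic_insns):
    # estimated cycles = CPI (from SimPoint) x complete dynamic path length
    return simpoint_cpi * dynamic_insns

def cache_accesses(accesses_per_insn, dynamic_insns):
    # total accesses = per-committed-instruction rate x dynamic path length
    return accesses_per_insn * dynamic_insns

baseline_insns = 1_000_000_000        # hypothetical non-windowed path length
path_ratio = 0.92                     # e.g. the gcc_expr ratio in Table 2
windowed_insns = baseline_insns * path_ratio

# Even if both binaries sustain the same CPI, the windowed binary finishes
# sooner because it executes fewer instructions.
t_base = exec_time(1.10, baseline_insns)
t_win = exec_time(1.10, windowed_insns)
assert abs(t_win / t_base - path_ratio) < 1e-12
```

Comparing IPC alone would rate the two binaries as equal here; multiplying by each binary's own full dynamic instruction count is what makes the comparison fair.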

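The workload-selection procedure can be approximated in a few lines. The sketch below is a deliberate simplification: greedy distance-threshold clustering on two-statistic vectors stands in for the paper's PCA-plus-linkage clustering, and every workload name and number is invented:

```python
# Toy sketch of the workload-clustering idea: group multi-thread workloads
# by similarity of their statistics vectors, then keep one representative
# per cluster. This greatly simplifies the paper's PCA + linkage method.
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(workloads, threshold):
    """Greedy single-pass clustering: each workload joins the first cluster
    whose representative (its first member's vector) is within threshold."""
    reps = []        # first member's stats vector for each cluster
    members = []     # list of workload-name lists, one per cluster
    for name, stats in workloads:
        for i, rep in enumerate(reps):
            if distance(stats, rep) < threshold:
                members[i].append(name)
                break
        else:
            reps.append(stats)
            members.append([name])
    return members

# Hypothetical (IPC, L1 miss rate) vectors for four two-thread workloads.
workloads = [
    ("gzip+gcc",   (1.20, 0.020)),
    ("gzip+bzip2", (1.25, 0.021)),
    ("mcf+art",    (0.40, 0.150)),
    ("eon+mesa",   (1.60, 0.005)),
]
clusters = cluster(workloads, threshold=0.2)
# The two gzip pairings behave alike and share a cluster; one workload
# per cluster would then be simulated in detail.
```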
After reducing the dimensionality using principal components analysis, we identified 43 workload clusters and selected the workload nearest the centroid of each cluster for our experiments. We repeated this process on all pairs of two-thread workloads to generate a set of 127 four-thread workloads.

Each workload is warmed up until one thread reaches 5 million instructions, then run until one thread commits 100 million instructions. Two statistics are used to measure performance. The first is weighted execution time. This metric is calculated by summing the relative execution time of all threads—the execution time of each thread in the SMT workload, divided by the execution time of the same benchmark running as a single thread. Execution time is calculated by multiplying the cycles per instruction (CPI) by the dynamic path length of the benchmark. The second statistic is weighted cache accesses. It is calculated similarly, but using data cache accesses per instruction instead of execution time.

4. Results

We present three sets of results. First, we examine the performance of register windows and SMT individually, each implemented using VCA. We then look at the behavior of combining both SMT and register windows on VCA.

4.1. Register windows

This section compares register windows implemented using the virtual context architecture to three other architectures. We perform this comparison across various physical register file sizes. We start with 64 physical registers, equal to the number of architectural registers—32 integer and 32 FP. We go up to 256 registers, equal to the number of architectural registers plus the size of the reorder buffer. Performance does not improve beyond this point, as the reorder buffer fills before any additional physical registers can be used.

In addition to our non-windowed baseline, we model a conventional register window implementation in which the number of logical registers is expanded to hold multiple contiguous register windows—specifically, the maximum number of windows that can fit in the physical register file while leaving at least 64 rename registers available. When a register window overflow or underflow occurs, the pipeline delays for 10 cycles to model the time needed for trapping to the operating system's overflow/underflow handler. After the delay, load or store instructions are inserted into the pipeline to either fill a new window on an underflow, or to save all the dirty registers in the departing window on an overflow.

We also include results for an idealized register-window implementation. This model provides a lower bound on execution time by handling spills and fills instantaneously and without accessing the data cache.

Figure 4 presents execution times, normalized to the baseline architecture with 256 physical registers. Note that the baseline architecture cannot run with only 64 physical registers; there are 64 architectural registers, so this size does not provide any rename registers needed for out-of-order execution.

Figure 4: Register window execution time. Normalized to baseline with 256 phys. regs.

VCA performs very close to the ideal architecture—within 1% with 256 physical registers. With fewer physical registers, VCA is forced to generate many more spills and fills, increasing the gap between it and ideal.

VCA with register windows provides a performance improvement over the baseline non-windowed architecture at all register file sizes. VCA's performance advantage increases as the number of physical registers decreases, from 4% at 256 registers to 9% at 128 registers. In the conventional architecture, fewer physical registers means fewer rename registers, which reduces the effective instruction scheduling window size. In contrast, VCA can move infrequently used architectural state out to memory to free more physical registers for renaming.

By eliminating the loads and stores needed to fill and spill registers, register windows not only reduce the number of instructions executed but also the number of data cache accesses. Figure 5 presents the number of data cache accesses generated by the four architectures over the same range of physical register file sizes. The results are again normalized to the baseline binary with 256 physical registers.

Figure 5: Register window cache accesses. Normalized to baseline with 256 phys. regs.

Note that some architectures experience more data cache accesses as the number of physical registers is decreased, while others experience the opposite effect. In all cases, having fewer physical registers throttles misspeculation, resulting in a slight decrease in cache accesses. For the baseline and ideal register window architectures, only explicit loads and stores in the binary generate non-speculative data cache accesses, so the misspeculation effect is the only source of variation.

For VCA and the conventional register window architecture, smaller physical register files cause significant increases in window fills and spills, dwarfing the impact of reduced misspeculation. Note that this traffic increases at a much slower rate with VCA than with the conventional window architecture. The conventional register-window scheme saves and restores entire windows on each overflow or underflow—including dead and unused registers—while VCA spills and fills values to memory at the granularity of individual registers. Even when the conventional register window architecture provides nearly the same data cache access savings as VCA, the performance of VCA is much better. The incremental single-register spills and fills generated by VCA have much less effect on pipeline performance than the bursty sequences of loads and stores generated by the conventional implementation. VCA also avoids the underflow/overflow trap overhead seen in conventional register window implementations.

We can take advantage of VCA's reduced cache traffic by reducing the number of data-cache ports, saving cache area, latency, and power. Figure 6 shows the execution time of our four architectures modified to have only a single data-cache port, normalized to the two-port baseline architecture. The results are qualitatively similar to Figure 4. However, VCA's lower cache traffic provides an even larger performance advantage on this pipeline, especially at 256 physical registers. The VCA configuration is almost 7% faster than the baseline in the single-port pipeline versus only 4% faster in the dual-port pipeline. Furthermore, at 256 registers, the single-port VCA provides effectively the same performance as the dual-port baseline (0.5% slowdown). Comparing the VCA data points in Figure 6 with the baseline data points in Figure 4, we see that with 128 registers, VCA with one cache port is nearly 2.5% faster than the dual-port baseline.

Figure 6: Single cache port execution time. Normalized to dual-port baseline with 256 phys. regs.

4.2. Simultaneous multithreading

This section examines the effectiveness of VCA in supporting multiple thread contexts for SMT (without register windows). Figure 7 presents weighted speedups for two- and four-thread workloads on both VCA and a conventional SMT architecture. The speedups are relative to single-threaded execution on the baseline architecture with 256 physical registers.

For these runs, the number of physical registers is varied from 64 to 448. For the two-thread runs, there is no performance benefit past 320 physical registers, since this size is large enough to hold two copies of the architectural state (128 regs) plus one rename register for each ROB entry (192 regs). In addition, the conventional architecture cannot operate unless the number of physical registers is strictly greater than the number of architectural registers needed (128 for two threads or 256 for four threads).

The virtual context architecture is able to maintain its performance with a much smaller physical register file. With two threads and 192 physical registers, VCA achieves 97% of the performance of the baseline with a full set of 320 physical registers, while the baseline only achieves 88% of this level.
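The weighted execution time statistic defined in Section 3.2 is easy to misread, so here is a minimal sketch of the computation. All CPI and path-length values are invented for illustration, not measured data:

```python
# Weighted execution time: sum over threads of (time in the SMT workload)
# divided by (time of the same benchmark running alone). Execution time is
# CPI x dynamic path length. Numbers below are hypothetical.

def exec_time(cpi, path_length):
    return cpi * path_length

def weighted_exec_time(smt_runs, single_runs):
    total = 0.0
    for bench, (smt_cpi, path) in smt_runs.items():
        single_cpi, _ = single_runs[bench]
        total += exec_time(smt_cpi, path) / exec_time(single_cpi, path)
    return total

single = {"gcc": (1.0, 5e8), "mcf": (2.0, 4e8)}   # CPI alone, path length
smt    = {"gcc": (1.4, 5e8), "mcf": (2.6, 4e8)}   # CPI when co-scheduled
w = weighted_exec_time(smt, single)
# Each thread runs slower than when alone, so the sum exceeds 2.0 here;
# a lower value means less interference between co-scheduled threads.
assert w > 2.0
```

The weighted cache-access statistic follows the same shape, with data cache accesses per instruction substituted for execution time.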

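The mechanism underlying all of the results above (treating the physical register file as a cache of a memory-mapped logical register space, with spills and fills at single-register granularity) can be sketched as a toy model. Every structural parameter, size, and address below is illustrative only, not taken from the paper:

```python
# Toy model of the VCA idea: physical registers act as a small
# set-associative, LRU structure over memory-mapped logical registers.

class VirtualContextFile:
    def __init__(self, sets=8, ways=2):
        self.sets, self.ways = sets, ways
        self.table = [[] for _ in range(sets)]  # per set: (addr, value), LRU first
        self.memory = {}                        # backing store for spilled registers
        self.spills = self.fills = 0

    def access(self, base, logical_reg):
        """Map (context base pointer, logical register) to a memory address
        and look it up; on a miss, spill one victim and fill one register."""
        addr = base + logical_reg
        entries = self.table[addr % self.sets]
        for i, (a, v) in enumerate(entries):
            if a == addr:
                entries.append(entries.pop(i))  # mark most recently used
                return v
        if len(entries) == self.ways:           # miss in a full set:
            victim_addr, victim_val = entries.pop(0)
            self.memory[victim_addr] = victim_val   # spill LRU register
            self.spills += 1
        value = self.memory.get(addr, 0)        # fill from memory (default 0)
        self.fills += 1
        entries.append((addr, value))
        return value

vcf = VirtualContextFile()
# Two thread contexts share one small physical file; only the registers
# each context actually touches occupy physical storage.
for reg in range(4):
    vcf.access(base=0x1000, logical_reg=reg)
    vcf.access(base=0x2000, logical_reg=reg)
```

A context switch in this model is just a change of base pointer; untouched registers of the departing context stay wherever they are, which is the contrast the related-work discussion draws against whole-window copying.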
Figure 7: SMT performance. Weighted speedup vs. single-thread baseline with 256 phys. regs.

The four-thread results are even more persuasive. With 192 physical registers, VCA achieves 98.7% of the performance of the baseline with a full set of 448 physical registers. VCA achieves this speedup with fewer physical registers than logical registers, a point where the conventional architecture is unable to operate. The baseline requires double the number of physical registers to achieve a comparable speedup.

4.3. SMT with register windows

In a conventional architecture, register windows and SMT each require a significant increase in the size of the physical register file. Worse yet, combining the two techniques has a multiplicative impact. This section shows that VCA can easily accommodate both SMT and register windows at the same time with a conventionally sized physical register file.

Figure 8 presents weighted speedups for VCA and the baseline architecture with one, two, and four threads. As before, speedups are relative to the single-thread baseline with 256 physical registers. As in the previous section, the ROB size places an upper bound on the effective register file size for one and two threads, while the conventional architecture cannot operate without more physical than logical registers.

Figure 8: SMT + register window performance. Weighted speedup vs. single-thread baseline with 256 phys. regs.

By combining the efficiencies of register windows and SMT, while enabling SMT on smaller register files, the virtual context architecture provides a higher speedup at every physical register file size than the baseline architecture. In fact, VCA achieves 98% of its peak performance using four threads on only 192 registers. At this same register file size, the conventional architecture can support only two threads, resulting in 22% lower speedup.

The ability of the virtual context architecture to support more threads on a smaller register file comes at the cost of additional data cache accesses. A key benefit of combining register windows and SMT is that the reduction in cache accesses from register windows counters the increase due to SMT. For example, although the four-thread non-windowed VCA with 192 physical registers provides 98% of the performance of the 448-register baseline, it requires 24% more cache accesses. Adding register windows to the same VCA configuration reduces its cache accesses by 23%, resulting in 5% fewer cache accesses than the non-windowed baseline.

5. Related work

The idea of using memory to provide a backing store for the register file has been explored before. Ditzel et al. [7] describe the C machine stack cache, where the register file is replaced by a large circular buffer that is mapped contiguously onto the stack. Huguet and Lang [11] use a table of memory addresses and compiler support to allow the hardware to perform background saves/restores. Nuth et al. [19] proposed the Named State Register File, which like VCA treats the physical register file as a cache of memory-mapped logical registers. Unlike our design, the Named State Register File requires a content-associative lookup on the entire register file on every access, which would likely impact the processor's cycle time.

Register caches [2, 21, 30] and banking [2, 6] have been proposed to improve the access latency of large register files. In these designs, the full architectural state is still kept in the register file, so context switches and register window over- and underflows still require explicit copying of many (often unused) registers to and from memory, and die area requirements continue to scale with the product of the number of thread contexts and the number of windows per context. Monreal et al. [17] improve physical register utilization by delaying the allocation of physical registers until writeback. The physical register file implementation is orthogonal to VCA, so these techniques may still be useful underneath the VCA scheme.

One of the major costs of SMT is the larger physical register file needed to accommodate additional architectural state. Earlier proposals sought to reduce this burden by dividing the logical registers among multiple threads [28, 23]. These designs require extensive compiler and operating system support.

Martin et al. [16] and Lo et al. [14] propose software identification of dead register values to reduce save/restore traffic and to free physical registers for other SMT threads, respectively. VCA achieves similar goals in part by moving dead values out of the register file into memory. VCA does not require software annotations, and it also handles values that are not dead but have not been accessed recently. Dead-value annotations would be a useful addition to VCA, allowing it to avoid spilling dead values to memory and to reclaim dead registers preferentially over live but inactive ones. We hope to explore this extension in future work.

Two commercial architectures that use register windows are SPARC [29] and Itanium [5]. Overflow and underflow conditions are handled by trapping to the operating system on SPARC processors, while Itanium designs use a hardware engine. In either case, the handling of an underflow or overflow halts the execution of the program to copy entire windows, adding a significant amount of overhead.

Previous work has also proposed directing stack accesses to a separate pipeline and cache [4, 13]. These designs reduce data cache bandwidth demand, but only by diverting this demand to a separate cache.

6. Conclusions and future work

The virtual context architecture (VCA) uses a novel mapping scheme to effectively decouple the number of supported logical registers from physical register storage requirements. VCA enables the physical register file to hold just the most active subset of logical register values by spilling and filling individual registers on demand in hardware. The removal of the logical register storage requirement allows architects to support register-hungry schemes such as register windows and simultaneous multithreading, while choosing the physical register file size based on performance considerations alone.

A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation. VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.

VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture. Furthermore, VCA handles thread and function contexts at the same time in a unified way, allowing SMT to be combined with register windows with no additional physical registers. As a result, a four-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers. In comparison, Sun's Niagara [12] uses 640 registers per core to support 4 threads and register windows on an in-order machine (i.e., no renaming registers required).

While this paper only examined relatively low levels of multithreading, VCA requires negligible per-thread state (a PC and a register base pointer), so it can in principle support dozens of threads. This opportunity opens the door to efficient support for microkernel operating systems, virtual machines, and very fast interrupt and exception processing.

VCA could also benefit from novel compiler register allocation algorithms that seek to minimize the "footprint" of a thread in the logical register space. For example, the allocator could re-use recently dead logical registers in preference to other logical register identifiers. VCA can exploit this locality by leaving unused logical register values in memory.

Furthermore, VCA lends itself to other applications beyond register windows and threads, such as emulation of stack-based architectures or of ISAs with large logical register files. The separation of register state from the standard call stack in a VCA implementation of register windows also makes it more secure, as it is immune to stack-smashing attacks.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Nos. CCR-0219640 and CCF-0325898. This work was also supported by gifts from Intel, an Intel Fellowship, and a Sloan Research Fellowship.

References

[1] H. Akkary, R. Rajwar, and S. T. Srinivasan. Checkpoint processing and recovery: Towards scalable large instruction window processors. In 36th Ann. Int'l Symp. on Microarchitecture, pages 423–434, Dec. 2003.
[2] R. Balasubramonian, S. Dwarkadas, and D. Albonesi. Reducing the complexity of the register file in dynamic superscalar processors. In 34th Ann. Int'l Symp. on Microarchitecture, pages 237–248, Dec. 2001.
[3] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt. Network-oriented full-system simulation using M5. In Proc. Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads, Feb. 2003.
[4] S. Cho, P.-C. Yew, and G. Lee. Decoupling local variable accesses in a wide-issue superscalar processor. In Proc. 26th Ann. Int'l Symp. on Computer Architecture, pages 100–110, May 1999.
[5] Intel Corporation. Intel IA-64 Architecture Software Developer's Manual. Santa Clara, CA, 2000.
[6] J.-L. Cruz, A. González, M. Valero, and N. P. Topham. Multiple-banked register file architectures. In Proc. 27th Ann. Int'l Symp. on Computer Architecture, pages 316–325, June 2000.
[7] D. R. Ditzel and H. R. McLellan. Register allocation for free: The C machine stack cache. In Proc. Symp. on Architectural Support for Programming Languages and Operating Systems, pages 48–56, Mar. 1982.
[8] Free Software Foundation. GNU Compiler Collection. http://gcc.gnu.org.
[9] L. Gwennap. Digital 21264 sets new standard. Microprocessor Report, 10(14):11–16, Oct. 28, 1996.
[10] J. L. Henning. SPEC CPU2000: Measuring CPU performance in the new millennium. IEEE Computer, 33(7):28–35, July 2000.
[11] M. Huguet and T. Lang. Architectural support for reduced register saving/restoring in single-window register files. ACM Trans. Computer Systems, 9(1):66–97, Feb. 1991.
[12] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded processor. IEEE Micro, 25(2):21–29, March/April 2005.
[13] H.-H. S. Lee, M. Smelyanskiy, G. S. Tyson, and C. J. Newburn. Stack value file: Custom microarchitecture for the stack. In Proc. 7th Int'l Symp. on High-Performance Computer Architecture (HPCA), pages 5–14, Jan. 2001.
[14] J. L. Lo, S. S. Parekh, S. J. Eggers, H. M. Levy, and D. M. Tullsen. Software-directed register deallocation for simultaneous multithreaded processors. IEEE Trans. Parallel and Distributed Systems, 10(9):922–933, Sept. 1999.
[15] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton. Hyper-threading technology architecture and microarchitecture. Intel Technology Journal, 6(1), Feb. 2002.
[16] M. M. Martin, A. Roth, and C. N. Fischer. Exploiting dead value information. In 30th Ann. Int'l Symp. on Microarchitecture, pages 125–135, Dec. 1997.
[17] T. Monreal, A. González, M. Valero, J. González, and V. Viñals. Delaying physical register allocation through virtual-physical registers. In 32nd Ann. Int'l Symp. on Microarchitecture, pages 186–192, Nov. 1999.
[18] M. Moudgill, K. Pingali, and S. Vassiliadis. Register renaming and dynamic speculation: an alternative approach. In Proc. 26th Ann. Int'l Symp. on Microarchitecture, Dec. 1993.
[19] P. R. Nuth and W. J. Dally. The named-state register file: Implementation and performance. In Proc. 1st Int'l Symp. on High-Performance Computer Architecture (HPCA), pages 4–13, Jan. 1995.
[20] D. A. Patterson and C. H. Séquin. RISC I: A reduced instruction set VLSI computer. In Proc. 8th Int'l Symp. on Computer Architecture, pages 443–457, 1981.
[21] M. Postiff, D. Greene, S. Raasch, and T. N. Mudge. Integrating superscalar processor components to implement register caching. In Proc. 2001 Int'l Conf. on Supercomputing, pages 348–357, 2001.
[22] S. E. Raasch and S. K. Reinhardt. The impact of resource partitioning on SMT processors. In Proc. 12th Ann. Int'l Conf. on Parallel Architectures and Compilation Techniques, Sept. 2003.
[23] J. A. Redstone, S. J. Eggers, and H. M. Levy. Mini-threads: Increasing TLP on small-scale SMT processors. In Proc. 9th Int'l Symp. on High-Performance Computer Architecture (HPCA), pages 19–30, Feb. 2003.
[24] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proc. Tenth Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS X), pages 45–57, Oct. 2002.
[25] R. L. Sites. How to use 1000 registers. In Caltech Conference on VLSI, pages 527–532. Caltech Computer Science Dept., 1979.
[26] R. L. Sites, editor. Alpha Architecture Reference Manual. Digital Press, 3rd edition, 1998.
[27] D. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In Proc. 23rd Ann. Int'l Symp. on Computer Architecture, pages 191–202, May 1996.
[28] C. A. Waldspurger and W. E. Weihl. Register relocation: Flexible contexts for multithreading. In Proc. 20th Ann. Int'l Symp. on Computer Architecture, pages 120–130, May 1993.
[29] D. L. Weaver and T. Germond, editors. SPARC Architecture Manual (Version 9). PTR Prentice Hall, 1994.
[30] R. Yung and N. C. Wilhelm. Caching processor general registers. In Proc. 1995 Int'l Conf. on Computer Design, pages 307–312, Oct. 1995.