Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance
Several simple techniques can make runahead execution more efficient by reducing the number of instructions executed and thereby reducing the additional energy consumption typically associated with runahead execution.

Onur Mutlu, Hyesoon Kim, and Yale N. Patt, University of Texas at Austin

IEEE Micro, January–February 2006 (Micro Top Picks). Published by the IEEE Computer Society. 0272-1732/06/$20.00 © 2006 IEEE.

Today's high-performance processors face main-memory latencies on the order of hundreds of processor clock cycles. As a result, even the most aggressive processors spend a significant portion of their execution time stalling, waiting for main-memory accesses to return data to the execution core. Previous research has shown that runahead execution significantly increases a high-performance processor's ability to tolerate long main-memory latencies.1,2 Runahead execution improves a processor's performance by speculatively pre-executing the application program while the processor services a long-latency (L2) data cache miss, instead of stalling the processor for the duration of the L2 miss. Thus, runahead execution lets a processor execute instructions that it otherwise couldn't execute under an L2 cache miss. These preexecuted instructions generate prefetches that the application program will use later, improving performance.

Runahead execution is a promising way to tolerate long main-memory latencies because it has modest hardware cost and doesn't significantly increase processor complexity.3 However, runahead execution significantly increases a processor's dynamic energy consumption by increasing the number of speculatively processed (executed) instructions, sometimes without enhancing performance. For runahead execution to be efficiently implemented in current or future high-performance processors, which will be energy-constrained, processor designers must develop techniques to reduce these extra instructions. Our solution to this problem includes both hardware and software mechanisms that are simple, implementable, and effective.

Background on runahead execution

Conventional out-of-order execution processors use instruction windows to buffer instructions so they can tolerate long latencies. Because a cache miss to main memory takes hundreds of processor cycles to service, a processor needs to buffer an unreasonably large number of instructions to tolerate such a long latency. Runahead execution1 provides the memory-level parallelism (MLP) benefits of a large instruction window without requiring the large, complex, slow, and power-hungry structures (such as large schedulers, register files, load/store buffers, and reorder buffers) associated with a large instruction window. The execution timelines in Figure 1 illustrate the differences between the operation of a conventional out-of-order execution processor (Figure 1a) and a runahead execution processor (Figure 1b).

[Figure 1. Execution timelines showing a high-level overview of the concept of runahead execution: conventional out-of-order execution processor (a) and runahead execution processor (b).]
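The cycle savings shown in Figure 1 can be sketched with a toy back-of-the-envelope model. The latencies below are assumed for illustration and do not come from the article; the model simply contrasts servicing two L2 misses serially (conventional) with overlapping them (runahead).

```python
# Toy model of Figure 1 (illustrative; latencies are made up).
# Two loads, A and B, both miss in the L2 cache.
MEM = 400   # cycles to service one L2 miss from main memory
C   = 50    # useful compute cycles between the two miss points

# (a) Conventional processor: the window fills after each miss,
# so the two miss latencies are serviced one after the other.
conventional = C + MEM + C + MEM + C            # 950 cycles

# (b) Runahead processor: B's miss is discovered C cycles into A's
# miss, so B completes at time (C + C + MEM). Normal mode re-reaches
# B at time (C + MEM + C) and stalls only for the uncovered part.
b_done   = C + C + MEM                          # when B's miss finishes
resume_b = C + MEM + C                          # when normal mode hits B again
stall_b  = max(0, b_done - resume_b)            # 0: B's miss is fully hidden
runahead = resume_b + stall_b + C               # 550 cycles

print("cycles saved:", conventional - runahead)  # cycles saved: 400
```

With these numbers, runahead execution hides all of miss B's latency under miss A's, which is exactly the overlap Figure 1b depicts.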
A conventional processor's instruction window becomes full soon after a load instruction incurs an L2 cache miss. Once the instruction window is full, the processor can't decode and process any new instructions and stalls until it has serviced the L2 cache miss. While the processor is stalled, it makes no forward progress on the running application. Therefore, a memory-intensive application's execution timeline on a conventional processor consists of useful compute periods interleaved with long, useless stall periods due to L2 cache misses, as Figure 1a shows. With increasing memory latencies, stall periods start dominating the compute periods, leaving the processor idle for most of its execution time and thus reducing performance.

Runahead execution avoids stalling the processor when an L2 cache miss occurs, as Figure 1b shows. When the processor detects that the oldest instruction is waiting for an L2 cache miss that is still being serviced, it checkpoints the architectural register state, the branch history register, and the return address stack, and enters a speculative processing mode called runahead mode. The processor then removes this L2-miss instruction from the instruction window. While in runahead mode, the processor continues to execute instructions without updating the architectural state. It identifies the results of L2 cache misses and their dependents as bogus or invalid (INV) and removes instructions that source INV results (INV instructions) from the instruction window so they don't prevent the processor from placing independent instructions into the window. Pseudoretirement is the program-order removal of instructions from the processor during runahead mode.

Some of the instructions executed in runahead mode that are independent of L2 cache misses might miss in the instruction, data, or unified caches (for example, Load B in Figure 1b). The memory system overlaps their miss latencies with the latency of the runahead-causing cache miss. When the runahead-causing cache miss completes, the processor exits runahead mode by flushing the instructions in its pipeline. It restores the checkpointed state and resumes normal instruction fetch and execution starting with the runahead-causing instruction (Load A in Figure 1b).

When the processor returns to normal mode, it can make faster progress without stalling because, during runahead mode, it has already prefetched into the caches some of the data and instructions needed during normal mode. For example, in Figure 1b, the processor doesn't need to stall for Load B because it discovered the L2 miss caused by Load B in runahead mode and serviced it in parallel with L2 miss A. Runahead execution provides these benefits at a small hardware cost, as we've shown in previous work.1,3
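The INV-marking and pseudoretirement behavior described above can be sketched as follows. This is an illustrative software model, not the authors' hardware design; the `runahead_step` helper, register names, and miss addresses are hypothetical.

```python
# Minimal sketch of runahead-mode INV tracking (illustrative only).
# Each register carries an INV bit. An instruction that sources an INV
# value produces an INV result and is dropped from the window
# (pseudoretired) instead of blocking independent instructions.
MISS_ADDRS = {0x1000, 0x2000}   # addresses assumed to miss in the L2

inv = {}          # register -> True if its value is bogus (INV)
prefetches = []   # L2 misses discovered (and prefetched) in runahead mode

def runahead_step(op, dst, srcs, addr=None):
    """Process one instruction in runahead mode and report its fate."""
    if any(inv.get(r, False) for r in srcs):
        inv[dst] = True              # depends on an L2 miss: result bogus,
        return "dropped (INV)"       # instruction leaves the window early
    if op == "load" and addr in MISS_ADDRS:
        inv[dst] = True              # the miss itself: dest marked INV...
        prefetches.append(addr)      # ...but the memory request is still sent
        return "miss -> prefetch"
    inv[dst] = False                 # independent instruction: executes
    return "executed"

runahead_step("load", "rA", srcs=["r1"], addr=0x1000)  # Load A misses
runahead_step("add",  "r2", srcs=["rA"])               # dependent: dropped
runahead_step("load", "rB", srcs=["r3"], addr=0x2000)  # Load B: second miss
assert prefetches == [0x1000, 0x2000]  # both misses now overlap in memory
```

The key point the sketch captures is that dropping INV instructions keeps the window free for independent work such as Load B, whose miss is then serviced under the shadow of Load A's miss.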
Related work on runahead execution

As a promising technique for increasing tolerance to main-memory latency, runahead execution has recently inspired and attracted research from many other computer architects in both industry1-3 and academia.4-6 For example, architects at Sun Microsystems are implementing a version of runahead execution in their next-generation microprocessor.3 To our knowledge, none of the previous work addressed the runahead execution efficiency problem. Here we provide a brief overview of related work on runahead execution.

Dundas and Mudge first proposed runahead execution as a means to improve the performance of an in-order scalar processor.7 In other work (see the main article), we proposed runahead execution to increase the main-memory latency tolerance of more aggressive out-of-order superscalar processors. Chou and colleagues demonstrated that runahead execution effectively improves memory-level parallelism in large-scale database benchmarks because it prevents the instruction and scheduling windows, along with serializing instructions, from being performance bottlenecks.1 Three recent articles1,4,6 combined runahead execution with value prediction, and Zhou5 proposed using an idle processor core to perform runahead execution in a chip multiprocessor. Applying the efficiency mechanisms we propose to these variants of runahead execution can improve their power efficiency.

References
1. Y. Chou, B. Fahs, and S. Abraham, "Microarchitecture Optimizations for Exploiting Memory-Level Parallelism," Proc. 31st Int'l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 76-87.
2. S. Iacobovici et al., "Effective Stream-Based and Execution-Based Data Prefetching," Proc. 18th Int'l Conf. Supercomputing, ACM Press, 2004, pp. 1-11.
3. S. Chaudhry et al., "High-Performance Throughput Computing," IEEE Micro, vol. 25, no. 3, May/June 2005, pp. 32-45.
4. L. Ceze et al., "CAVA: Hiding L2 Misses with Checkpoint-Assisted Value Prediction," Computer

Efficiency of runahead execution

A runahead processor executes some instructions in the instruction stream more than once because it speculatively executes instructions in runahead mode. Because each executed instruction consumes dynamic energy, a runahead processor consumes more dynamic energy than a conventional processor. Reducing the number of instructions executed in runahead mode reduces the energy consumed by a runahead processor. Unfortunately, reducing the number of instructions can significantly reduce runahead execution's performance improvement, because runahead execution relies on the execution of instructions in runahead mode to discover L2 cache misses further down in the instruction stream. Our goal is to increase a runahead processor's efficiency without significantly decreasing its instructions per cycle (IPC) performance improvement. We define efficiency as

Efficiency = Percent increase in IPC performance / Percent increase in executed instructions

where percent increase in IPC performance is the IPC improvement the runahead processor achieves over the conventional processor, and percent increase in executed instructions is the corresponding increase in the total number of instructions the processor executes.
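As a sketch of how this metric is computed, consider a hypothetical runahead processor; the IPC values and instruction counts below are made-up illustrative numbers, not results from the article.

```python
# Worked example of the efficiency metric (numbers are illustrative).
def efficiency(ipc_base, ipc_runahead, insts_base, insts_runahead):
    """Percent IPC gain divided by percent increase in executed instructions."""
    ipc_gain   = 100.0 * (ipc_runahead - ipc_base) / ipc_base
    inst_extra = 100.0 * (insts_runahead - insts_base) / insts_base
    return ipc_gain / inst_extra

# A runahead processor that lifts IPC from 1.00 to 1.22 while executing
# 25 percent more instructions: efficiency = 22 / 25 = 0.88.
print(round(efficiency(1.00, 1.22, 100, 125), 2))   # 0.88
```

An efficiency of 1.0 would mean every percent of extra executed instructions buys a percent of IPC; the techniques this article proposes aim to push the denominator down without sacrificing the numerator.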