EFFICIENT RUNAHEAD EXECUTION: POWER-EFFICIENT MEMORY LATENCY TOLERANCE

Several simple techniques can make runahead execution more efficient by reducing the number of instructions executed and thereby reducing the additional energy consumption typically associated with runahead execution.

Onur Mutlu
Hyesoon Kim
Yale N. Patt
University of Texas at Austin

Today’s high-performance processors face main-memory latencies on the order of hundreds of processor clock cycles. As a result, even the most aggressive processors spend a significant portion of their execution time stalling and waiting for main-memory accesses to return data to the execution core. Previous research has shown that runahead execution significantly increases a high-performance processor’s ability to tolerate long main-memory latencies.1,2 Runahead execution improves a processor’s performance by speculatively pre-executing the application program while the processor services a long-latency (L2) data cache miss, instead of stalling the processor for the duration of the L2 miss. Thus, runahead execution lets a processor execute instructions that it otherwise couldn’t execute under an L2 cache miss. These preexecuted instructions generate prefetches that the application program will use later, improving performance.

Runahead execution is a promising way to tolerate long main-memory latencies because it has modest hardware cost and doesn’t significantly increase processor complexity.3 However, runahead execution significantly increases a processor’s dynamic energy consumption by increasing the number of speculatively processed (executed) instructions, sometimes without enhancing performance. For runahead execution to be efficiently implemented in current or future high-performance processors, which will be energy-constrained, processor designers must develop techniques to reduce these extra instructions. Our solution to this problem includes both hardware and software mechanisms that are simple, implementable, and effective.

Background on runahead execution
Conventional out-of-order execution processors use instruction windows to buffer instructions so they can tolerate long latencies. Because a cache miss to main memory takes hundreds of processor cycles to service, a processor needs to buffer an unreasonably large number of instructions to tolerate such a long latency. Runahead execution1 provides the memory-level parallelism (MLP) benefits of a large instruction window without requiring the large, complex, slow, and power-hungry structures—such as large schedulers, register files, load/store buffers, and reorder buffers—associated with a large instruction window.

The execution timelines in Figure 1 illustrate the differences between the operation of a conventional out-of-order execution processor (Figure 1a) and a runahead execution processor (Figure 1b).

Figure 1. Execution timelines showing a high-level overview of the concept of runahead execution: conventional out-of-order execution processor (a) and runahead execution processor (b).

A conventional processor’s instruction window becomes full soon after a load instruction incurs an L2 cache miss. Once the instruction window is full, the processor can’t decode and process any new instructions and stalls until it has serviced the L2 cache miss. While the processor is stalled, it makes no forward progress on the running application. Therefore, a memory-intensive application’s execution timeline on a conventional processor consists of useful compute periods interleaved with long useless stall periods due to L2 cache misses, as Figure 1a shows. With increasing memory latencies, stall periods start dominating the compute periods, leaving the processor idle for most of its execution time and thus reducing performance.

Runahead execution avoids stalling the processor when an L2 cache miss occurs, as Figure 1b shows. When the processor detects that the oldest instruction is waiting for an L2 cache miss that is still being serviced, it checkpoints the architectural register state, the branch history register, and the return address stack, and enters a speculative processing mode—the runahead mode. The processor then removes this L2-miss instruction from the instruction window. While in runahead mode, the processor continues to execute instructions without updating the architectural state. It identifies the results of L2 cache misses and their dependents as bogus or invalid (INV) and removes instructions that source INV results (INV instructions) from the instruction window so they don’t prevent the processor from placing independent instructions into the window. Pseudoretirement is the program-order removal of instructions from the processor during runahead mode.

Some of the instructions executed in runahead mode that are independent of L2 cache misses might miss in the instruction, data, or unified caches (for example, Load B in Figure 1b). The memory system overlaps their miss latencies with the latency of the runahead-causing cache miss. When the runahead-causing cache miss completes, the processor exits runahead mode by flushing the instructions in its pipeline. It restores the checkpointed state and resumes normal instruction fetch and execution starting with the runahead-causing instruction (Load A in Figure 1b).

When the processor returns to normal mode, it can make faster progress without stalling because, during runahead mode, it has already prefetched into the caches some of the data and instructions needed in normal mode. For example, in Figure 1b, the processor doesn’t need to stall for Load B because it discovered the L2 miss caused by Load B in runahead mode and serviced it in parallel with the L2 miss caused by Load A.
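To make these mode transitions concrete, the following C sketch models them in a deliberately simplified, single-miss processor model. The structures and names (the checkpoint, the per-register INV bits, and so on) are our own illustrative assumptions, not the actual hardware design:

    #include <stdbool.h>
    #include <stdio.h>

    /* Deliberately simplified model: one runahead-causing miss at a time
       and one INV bit per architectural register. Names are illustrative. */
    enum Mode { NORMAL, RUNAHEAD };

    static enum Mode mode = NORMAL;
    static bool inv[32];        /* INV: result is bogus (L2-miss dependent) */
    static int  checkpoint_pc;  /* stands in for the checkpointed register
                                   state, branch history, and return stack  */
    static int  pc;

    static void enter_runahead(void) {
        checkpoint_pc = pc;     /* checkpoint at the L2-miss load           */
        mode = RUNAHEAD;        /* results no longer update arch. state     */
    }

    static void exit_runahead(void) {
        /* The runahead-causing miss completed: flush the pipeline, restore
           the checkpoint, and refetch starting at the miss-causing load.   */
        for (int r = 0; r < 32; r++) inv[r] = false;
        pc = checkpoint_pc;
        mode = NORMAL;
    }

    int main(void) {
        pc = 100;               /* Load A at pc 100 misses in the L2 cache  */
        enter_runahead();
        inv[3] = true;          /* Load A's destination marked INV; any
                                   instruction sourcing r3 propagates INV   */
        exit_runahead();        /* miss data returns from memory            */
        printf("resumed at pc=%d, mode=%d\n", pc, mode);
        return 0;
    }

The essential invariant is that runahead mode never updates architectural state: everything computed in runahead mode is discarded at the pipeline flush, and only the cache and prefetcher contents survive into normal mode.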


In this way, runahead execution uses otherwise-idle clock cycles caused by L2 misses to preexecute instructions in the program and generate accurate prefetch requests. Previous research has shown that runahead execution increases processor performance mainly because it parallelizes independent L2 cache misses3 (see also the “Related work on runahead execution” sidebar). Furthermore, the memory latency tolerance provided by runahead execution comes at a small hardware cost, as we’ve shown in previous work.1,3

Sidebar: Related work on runahead execution
As a promising technique for increasing tolerance to main-memory latency, runahead execution has recently inspired and attracted research from many other computer architects in both industry1-3 and academia.4-6 For example, architects at Sun Microsystems are implementing a version of runahead execution in their next-generation processors.3 To our knowledge, none of the previous work addressed the runahead execution efficiency problem. We hereby provide a brief overview of related work on runahead execution.

Dundas and Mudge first proposed runahead execution as a means to improve the performance of an in-order scalar processor.7 In other work (see the main article), we proposed runahead execution to increase the main-memory latency tolerance of more aggressive out-of-order superscalar processors. Chou and colleagues demonstrated that runahead execution effectively improves memory-level parallelism in large-scale database benchmarks because it prevents the instruction and scheduling windows, along with serializing instructions, from being performance bottlenecks.1 Three recent articles1,4,6 combined runahead execution with value prediction, and Zhou5 proposed using an idle processor core to perform runahead execution in a chip multiprocessor. Applying the efficiency mechanisms we propose to these variants of runahead execution can improve their power efficiency.

Sidebar references
1. Y. Chou, B. Fahs, and S. Abraham, “Microarchitecture Optimizations for Exploiting Memory-Level Parallelism,” Proc. 31st Int’l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 76-87.
2. S. Iacobovici et al., “Effective Stream-Based and Execution-Based Data Prefetching,” Proc. 18th Int’l Conf. Supercomputing, ACM Press, 2004, pp. 1-11.
3. S. Chaudhry et al., “High-Performance Throughput Computing,” IEEE Micro, vol. 25, no. 3, May/June 2005, pp. 32-45.
4. L. Ceze et al., “CAVA: Hiding L2 Misses with Checkpoint-Assisted Value Prediction,” Computer Architecture Letters, vol. 3, Dec. 2004, http://www.cs.virginia.edu/~tcca/2004/ceze_dec04.pdf.
5. H. Zhou, “Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window,” Proc. 14th Int’l Conf. Parallel Architectures and Compilation Techniques (PACT 05), IEEE CS Press, 2005, pp. 231-242.
6. N. Kirman et al., “Checkpointed Early Load Retirement,” Proc. 11th Int’l Symp. High-Performance Computer Architecture (HPCA-11), IEEE CS Press, 2005, pp. 16-27.
7. J. Dundas and T. Mudge, “Improving Data Cache Performance by Pre-executing Instructions Under a Cache Miss,” Proc. 1997 Int’l Conf. Supercomputing, IEEE Press, 1997, pp. 68-75.

Efficiency of runahead execution
A runahead processor executes some instructions in the instruction stream more than once because it speculatively executes instructions in runahead mode. Because each executed instruction consumes dynamic energy, a runahead processor consumes more dynamic energy than a conventional processor. Reducing the number of instructions executed in runahead mode reduces the energy a runahead processor consumes. Unfortunately, reducing the number of instructions can also significantly reduce runahead execution’s performance improvement, because runahead execution relies on the execution of instructions in runahead mode to discover L2 cache misses further down in the instruction stream. Our goal is to increase a runahead processor’s efficiency without significantly decreasing its instructions per cycle (IPC) performance improvement.

We define efficiency as

    Efficiency = (Percent increase in IPC performance) / (Percent increase in executed instructions)

where the percent increase in IPC performance is the IPC increase after adding runahead execution to a conventional baseline processor, and the percent increase in executed instructions is the increase in the number of executed instructions after adding runahead execution.

We can increase a runahead processor’s efficiency in two ways:

• We can reduce the number of executed instructions (the denominator) without affecting the increase in IPC (the numerator) by eliminating the causes of inefficiency.
• We can increase the IPC improvement without increasing the number of executed instructions. To do this, we increase the usefulness of each runahead execution period by extracting more useful prefetches from the executed instructions.

Our techniques increase efficiency in both ways.
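For concreteness, the small C helper below evaluates this metric; the helper itself is just arithmetic, and the inputs are the average, parser, and art figures quoted in the next section:

    #include <stdio.h>

    /* Efficiency = %increase in IPC / %increase in executed instructions */
    static double efficiency(double ipc_increase_pct, double extra_insts_pct) {
        return ipc_increase_pct / extra_insts_pct;
    }

    int main(void) {
        printf("SPEC average: %+.2f\n", efficiency(22.6, 26.5));   /* ~0.85 */
        printf("parser:       %+.2f\n", efficiency(-0.8, 47.8));   /* < 0: pure waste  */
        printf("art:          %+.2f\n", efficiency(108.4, 235.4)); /* ~0.46 */
        return 0;
    }

A perfectly efficient mechanism would have an efficiency of 1 or more; a negative value, as in parser, means the extra instructions actually cost performance.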


Figure 2. Increase in instructions per cycle (IPC) performance and executed instructions due to runahead execution, for the SPEC CPU2000 benchmarks.

Figure 2 shows by how much runahead execution increases IPC and the number of executed instructions compared to an aggressive conventional out-of-order processor. Our baseline processor model includes an effective stream-based prefetcher, a 1-Mbyte L2 cache, and a detailed model of a main-memory system with a 500-cycle latency. Detailed information on our experimental methodology is available elsewhere.4

On average, for the SPEC CPU2000 benchmarks, runahead execution increases IPC by 22.6 percent at a cost of increasing the number of executed instructions by 26.5 percent. Unfortunately, in some benchmarks runahead execution results in a large increase in the number of executed instructions without yielding a correspondingly large IPC improvement. For example, in parser, runahead execution increases the number of executed instructions by 47.8 percent while decreasing the IPC by 0.8 percent. In art, the IPC increase is impressive at 108.4 percent, but it is overshadowed by a 235.4 percent increase in the number of executed instructions.

Eliminating the causes of inefficiency
The three major causes of inefficiency in runahead execution processors are short, overlapping, and useless runahead periods. Runahead execution episodes with these properties rarely provide performance benefit but result in unnecessary speculative instruction execution. Because exiting from runahead mode has a performance cost (it requires a full pipeline flush), such runahead periods can actually decrease performance. We propose simple techniques to eliminate such periods.

Short runahead periods
In a short runahead period, the processor stays in runahead mode for tens of cycles instead of hundreds. A short runahead period occurs because the processor can enter runahead mode in response to an already outstanding L2 cache miss that was initiated—but not yet completed—by the hardware or software prefetcher, a wrong-path instruction, or a previous runahead period.

Figure 3a shows a short runahead period caused by an incomplete prefetch generated by a previous runahead period. Load B generates an L2 miss when it is speculatively executed in runahead period A. When the processor executes Load B again in normal mode, the associated L2 miss (L2 miss B) is still in progress. Therefore, Load B causes the processor to enter runahead mode again. Shortly afterward, the memory system completely services L2 miss B, and the processor exits runahead mode. Hence, the runahead period caused by Load B is short.


Figure 3. Example execution timelines illustrating the causes of inefficiency in runahead execution and how they can occur: short runahead period (a), overlapping runahead period (b), and useless runahead period (c).

Short runahead periods are undesirable because the processor is unlikely to preexecute enough instructions far ahead into the instruction stream and hence is unlikely to uncover any useful L2 cache misses during runahead mode.

We eliminate short runahead periods by associating a timer with each outstanding L2 miss. If the L2 miss has been outstanding for more than N cycles, where N is determined statically or dynamically, the processor predicts that the miss will return from memory soon and doesn’t enter runahead mode on that miss. We found that a static threshold of 400 cycles for a processor with a 500-cycle minimum main-memory latency eliminates almost all short runahead periods and reduces the extra instructions from 26.5 to 15.3 percent with negligible impact on performance (the performance improvement decreases slightly, from 22.6 to 21.5 percent).
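A minimal sketch of this check, assuming each outstanding L2 miss (for example, each miss-status-holding register) carries a cycle counter; the 400-cycle value is the static threshold quoted above:

    #include <stdbool.h>

    #define SHORT_PERIOD_THRESHOLD 400 /* static N for a 500-cycle memory latency */

    /* One entry per outstanding L2 miss (for example, in an MSHR). The
       counter is incremented every cycle while the miss is in flight.    */
    struct L2Miss {
        long cycles_outstanding;
    };

    /* Consulted when the oldest instruction is blocked on this L2 miss.
       If the miss has already been outstanding for a long time, it will
       likely return soon, so the runahead period it would start would be
       too short to be useful: don't enter runahead mode.                 */
    static bool should_enter_runahead(const struct L2Miss *m) {
        return m->cycles_outstanding <= SHORT_PERIOD_THRESHOLD;
    }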

Overlapping runahead periods
Two runahead periods are overlapping if some of the instructions executed in the two periods are the same dynamic instructions. Overlapping periods can be caused by independent L2 misses that have significantly different latencies or by dependent L2 misses (for example, L2 misses due to pointer-chasing loads). In Figure 3b, runahead periods A and B overlap because of dependent L2 misses. During period A, the processor executes Load B and finds that it is dependent on the miss caused by Load A. Because the processor hasn’t serviced L2 miss A yet, Load B can’t calculate its address, and the processor marks Load B as INV. The processor executes and pseudoretires N instructions after Load B and exits period A. In normal mode, the processor reexecutes Load B and finds it to be an L2 miss, which causes runahead period B. The first N instructions executed during period B are the same dynamic instructions that were executed at the end of period A. Hence, period B repeats the work done by period A.

Overlapping runahead periods can benefit performance because the completion of Load A can provide data values for more instructions in runahead period B, which can result in the generation of useful L2 misses that the processor couldn’t have generated in runahead period A. However, in the benchmark set we examined, overlapping runahead periods rarely benefited performance. In any case, overlapping runahead periods can be a major cause of inefficiency because they result in the execution of the same instructions multiple times in runahead mode, especially if many L2 misses are clustered together in the program.

Our solution to reducing the inefficiency due to overlapping periods involves not entering a runahead period if the processor predicts it will overlap with a previous runahead period. During a runahead period, the processor counts the number of pseudoretired instructions. During normal mode, the processor counts the number of instructions fetched since the exit from the last runahead period. If the number of instructions fetched after runahead mode is less than the number of instructions pseudoretired in the previous runahead period, the processor doesn’t enter runahead mode. This technique, implemented with two simple counters and a comparator, reduces the extra instructions resulting from runahead execution from 26.5 to 11.8 percent while reducing the performance benefit only slightly, from 22.6 to 21.2 percent.
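This predictor is small enough to sketch directly; a minimal version, with our own names for the two counters:

    #include <stdbool.h>

    /* Two counters and a comparator, as described above. Names are ours. */
    static long pseudoretired_last_period; /* counted during runahead mode;
                                              frozen at runahead exit       */
    static long fetched_since_exit;        /* reset at runahead exit; then
                                              incremented per fetched
                                              instruction in normal mode    */

    /* Consulted on an L2 miss in normal mode: if fetch hasn't yet passed
       the instructions pseudoretired in the previous runahead period, a
       new period would mostly repeat that period's work, so skip it.      */
    static bool period_would_overlap(void) {
        return fetched_since_exit < pseudoretired_last_period;
    }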
Useless runahead periods
Useless runahead periods are those in which the processor generates no useful L2 misses that are needed by normal-mode execution, as Figure 3c shows. These periods exist because of the lack of MLP5 in the application program—that is, because the application lacks independent cache misses under the shadow of an L2 miss. Useless periods are inefficient because they increase the number of executed instructions without benefiting performance. To eliminate useless runahead periods, we propose four simple mechanisms for predicting whether a runahead period will be useful (that is, whether it will generate an L2 cache miss).

In the first technique, the processor records the usefulness of past runahead periods caused by static load instructions in the Runahead Cause Status Table (RCST), a small table of two-bit counters.4 If recent runahead periods initiated by the same load were useful, the processor initiates runahead execution if that load misses in the L2 cache. Otherwise, the processor doesn’t enter runahead mode on an L2 miss due to that static load instruction. The insight behind this technique is that the processor can usually predict the usefulness of future runahead periods from the recent behavior of runahead periods caused by the same static load.

The second technique predicts the available MLP during the ongoing runahead period. If the fraction of INV (that is, L2-miss-dependent) load instructions encountered during the ongoing runahead mode is greater than a statically determined threshold, the processor predicts that there isn’t enough MLP for runahead execution to exploit and exits runahead mode.

The third technique uses sampling to predict runahead execution’s usefulness in a more coarse-grained fashion. This technique aims to turn off runahead execution in program phases with low MLP. To do so, the processor periodically monitors the total number of L2 misses generated during N consecutive runahead periods. If this number is less than a static threshold T, the processor doesn’t enter runahead mode for the next M L2 misses. We found that even with untuned values of N, M, and T (100, 1,000, and 25, respectively, in our experiments), sampling can significantly reduce the extra instructions resulting from runahead execution.

The fourth uselessness prediction technique leverages compile-time profiling. The compiler profiles the application and identifies load instructions that consistently cause useless runahead periods. The compiler marks such load instructions as nonrunahead loads. When the hardware encounters a nonrunahead load instruction that misses in the L2 cache, it doesn’t initiate runahead execution on that load.

Combining the four uselessness prediction techniques reduces the extra instructions from 26.5 to 14.9 percent while reducing the performance benefit only slightly, from 22.6 to 20.8 percent. Experiments analyzing each technique’s effectiveness are available elsewhere.4
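The three hardware techniques lend themselves to a compact sketch. The C fragment below shows one possible shape for each; the table size, the direct hash, and the counter handling are our assumptions, not the exact hardware:

    #include <stdbool.h>
    #include <stdint.h>

    /* Technique 1: the RCST, a small table of two-bit saturating counters
       indexed by the runahead-causing load's PC.                           */
    #define RCST_ENTRIES 1024                 /* size is an assumption      */
    static uint8_t rcst[RCST_ENTRIES];        /* counters, 0..3             */

    static bool rcst_predicts_useful(uint32_t load_pc) {
        return rcst[load_pc % RCST_ENTRIES] >= 2;
    }

    static void rcst_train(uint32_t load_pc, bool period_was_useful) {
        uint8_t *c = &rcst[load_pc % RCST_ENTRIES];
        if (period_was_useful) { if (*c < 3) (*c)++; }
        else                   { if (*c > 0) (*c)--; }
    }

    /* Technique 2: exit the ongoing runahead period if too large a fraction
       of the loads seen so far are INV (L2-miss dependent), that is, if
       there is little MLP left to exploit.                                  */
    static bool should_exit_for_low_mlp(long inv_loads, long total_loads,
                                        double inv_fraction_threshold) {
        return total_loads > 0 &&
               (double)inv_loads / (double)total_loads > inv_fraction_threshold;
    }

    /* Technique 3: sampling. If the last N runahead periods together
       produced fewer than T L2 misses, skip runahead for the next M L2
       misses. N = 100, M = 1,000, T = 25 are the untuned values above.     */
    static long period_count, misses_in_window, skip_budget;

    static bool sampling_allows_runahead(void) {
        if (skip_budget > 0) { skip_budget--; return false; }
        return true;
    }

    static void sampling_on_period_end(long l2_misses_generated) {
        misses_in_window += l2_misses_generated;
        if (++period_count == 100) {
            if (misses_in_window < 25) skip_budget = 1000;
            period_count = misses_in_window = 0;
        }
    }

The fourth technique has no hardware predictor state: the compiler's nonrunahead marking is simply a bit in the load instruction's encoding that the hardware checks before entering runahead mode.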


Increasing the usefulness of runahead periods
Because runahead execution’s performance improvement is mainly a result of the useful L2 misses prefetched during runahead mode,3 discovering more L2 misses during runahead mode can increase the benefit. We propose two optimizations that increase efficiency by increasing runahead periods’ usefulness.

Eliminating useless instructions
Because runahead execution aims to generate L2 cache misses, instructions that don’t contribute to the generation of L2 cache misses are essentially useless for its purposes. Therefore, eliminating these instructions during runahead mode can increase a runahead period’s usefulness.

Floating-point (FP) operate instructions, which don’t contribute to the address computation of load instructions, are an example of such useless instructions. We turn off the FP unit during runahead mode and drop FP operate instructions after they are decoded. This optimization spares processor resources for more useful instructions that lead to the generation of load/store addresses, which increases the likelihood of generating an L2 miss during a runahead period. Furthermore, by not executing the energy-intensive FP instructions and powering down the FP unit during runahead mode, the processor can save significant dynamic and static energy.

However, turning off the FP unit during runahead mode can reduce performance. If a processor mispredicts a control-flow instruction that depends on an FP instruction’s result during runahead mode, it has no way of recovering from that misprediction while the FP unit is turned off, because the branch’s source operand wouldn’t be computed. Nevertheless, our simulations show that turning off the FP unit is a valuable optimization that both increases runahead execution’s performance improvement (from 22.6 to 24.0 percent) and reduces the extra instructions (from 26.5 to 25.5 percent).
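In sketch form, the decode-stage filter is a one-line predicate; the structure and field names below are illustrative:

    #include <stdbool.h>

    enum Mode { NORMAL, RUNAHEAD };

    struct Uop {
        bool is_fp_operate; /* FP arithmetic only; FP loads and stores still
                               generate memory addresses and are kept       */
    };

    /* Applied right after decode: in runahead mode, FP operate instructions
       can't contribute to load/store address generation, so they are
       dropped and the FP unit can stay powered down.                       */
    static bool drop_in_runahead(const struct Uop *u, enum Mode mode) {
        return mode == RUNAHEAD && u->is_fp_operate;
    }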
Optimizing runahead execution and hardware prefetcher interaction
A potential benefit of runahead execution is that the processor can update the hardware data prefetcher during runahead mode. If the updates are accurate, the prefetcher can generate prefetches earlier than it would in the baseline processor, improving the timeliness of the accurate prefetches. On the other hand, if the prefetches generated by updates during runahead mode are inaccurate, they’ll waste memory bandwidth and can cause cache pollution. Moreover, inaccurate hardware prefetcher requests can cause resource contention for the more accurate runahead memory requests during runahead mode and thus reduce runahead execution’s effectiveness.

Runahead execution and hardware data prefetching have synergistic behavior1 (see also reference 2 in the “Related work on runahead execution” sidebar). We propose optimizing the prefetcher update policy in runahead mode to increase the synergy between these two prefetching mechanisms. Our analysis shows that creating new hardware prefetch streams is sometimes harmful in runahead mode because these streams contend with the more accurate runahead requests. Thus, not creating prefetch streams in runahead mode increases the usefulness of runahead periods. This optimization increases runahead execution’s IPC improvement (from 22.6 to 25.0 percent) and also reduces the extra instructions (from 26.5 to 24.7 percent).
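A sketch of this update policy, using a toy stream table; the table size and the region-match rule are our assumptions, and the point is only that training is allowed in both modes while allocation is restricted to normal mode:

    #include <stdbool.h>

    enum Mode { NORMAL, RUNAHEAD };

    /* Toy stream prefetcher: a few active streams, each tracked by its
       last miss address; streams match within a 4-Kbyte region.           */
    #define NSTREAMS 4
    #define REGION   4096L
    static long stream_addr[NSTREAMS];
    static bool stream_valid[NSTREAMS];

    static int find_stream(long addr) {
        for (int i = 0; i < NSTREAMS; i++)
            if (stream_valid[i] && addr / REGION == stream_addr[i] / REGION)
                return i;
        return -1;
    }

    /* Update policy: always train existing streams (keeps prefetches
       timely), but allocate new streams only in normal mode, so that
       speculative runahead misses can't spawn streams that contend with
       runahead's own, more accurate, requests.                            */
    static void update_prefetcher(long miss_addr, enum Mode mode) {
        int s = find_stream(miss_addr);
        if (s >= 0) {
            stream_addr[s] = miss_addr;          /* advance the stream      */
        } else if (mode == NORMAL) {
            for (int i = 0; i < NSTREAMS; i++)   /* allocate in a free slot */
                if (!stream_valid[i]) {
                    stream_valid[i] = true;
                    stream_addr[i] = miss_addr;
                    break;
                }
        }
    }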


Putting it all together
Figure 4 shows the increase in executed instructions and IPC resulting from runahead execution when we incorporate our proposed techniques into a runahead processor. We examine the effect of profiling-based useless-period elimination separately because it requires modifying the instruction set architecture (ISA).

Figure 4. Increase in executed instructions (a) and IPC (b) resulting from runahead execution after incorporating all of our efficiency techniques.

Applying all of our proposed techniques significantly reduces the average increase in executed instructions in a runahead processor, from 26.5 to only 6.7 percent (6.2 percent with profiling). Using the proposed techniques reduces the average IPC increase of runahead execution only slightly, from 22.6 to 22.0 percent (22.1 percent with profiling). Hence, a runahead processor using the proposed techniques is much more efficient than a traditional runahead processor, yet it increases performance almost as much.

Figure 5 shows that the proposed techniques are effective for a wide range of memory latencies. As memory latency increases, both the IPC improvement and the extra instructions resulting from runahead execution increase. Hence, runahead execution is more effective with longer memory latencies. For almost all memory latencies, using the proposed efficiency techniques increases the average IPC improvement on the FP benchmarks while only slightly reducing the IPC improvement on the integer (INT) benchmarks. For all memory latencies, using the proposed dynamic techniques significantly reduces the extra instructions.

Figure 5. Increase in executed instructions (a) and IPC (b) with and without the efficiency techniques for five different memory latencies. Data shown is averaged separately over integer (INT) and floating-point (FP) benchmarks.

Efficient runahead execution has two major advantages:

• It doesn’t require large, complex, and power-hungry structures in the processor core. Instead, it uses the already-existing processing structures to improve memory latency tolerance.
• With the simple efficiency techniques described in this article, it requires only a small number of extra instructions to provide significant performance improvements.

Hence, efficient runahead execution provides a simple, energy-efficient, and complexity-effective solution to the pressing memory latency problem in high-performance processors.


Orthogonal approaches can be developed to solve the inefficiency problem in runahead processors, which we believe is an important research area in runahead execution and other memory-latency tolerance techniques. In particular, solutions to two important problems in computer architecture can significantly increase runahead execution’s efficiency: branch mispredictions and dependent cache misses.

Because processors rely on correct branch predictions to stay on the correct program path during runahead mode, the development of more accurate branch predictors will increase runahead execution’s efficiency and performance benefits. Irresolvable branch mispredictions that depend on L2 cache misses cause the processor to stay on the wrong path, which might not always provide useful prefetching benefits, until the runahead period ends. Reducing such branch mispredictions with novel techniques is a promising area of future work.

Dependent L2 cache misses reduce a runahead period’s usefulness because runahead execution can’t parallelize them. Therefore, runahead execution is inefficient, and sometimes ineffective, for pointer-chasing workloads in which dependent load instructions are common. In previous work, we’ve shown that a simple value-prediction technique for pointer-load instructions—address-value delta prediction—significantly increases runahead execution’s efficiency and performance by parallelizing dependent L2 cache misses.6 Enabling the parallelization of dependent cache misses is another promising area of future research in runahead execution.

Our future research will also focus on refining the methods for increasing the usefulness of runahead execution periods. Combined compiler-microarchitecture mechanisms can be instrumental in eliminating useless runahead instructions. Through simple modifications to the ISA, the compiler can convey to the hardware which instructions are important to execute or not execute during runahead mode. Furthermore, the compiler might be able to increase runahead periods’ usefulness by arranging code so that independent L2 cache misses are clustered close together during program execution.

Eliminating the reexecution of instructions executed in runahead mode via result reuse7 or value prediction8 can potentially increase runahead execution’s efficiency. However, even an ideal reuse mechanism doesn’t significantly improve performance7 and likely has significant hardware cost and complexity, which can offset the energy reduction resulting from improved efficiency. Value prediction might not significantly improve efficiency because of its low accuracy.8 Nevertheless, further research on eliminating the unnecessary reexecution of instructions might yield low-cost mechanisms that significantly improve runahead efficiency.

Finally, the scope of our efficient processing techniques isn’t limited to runahead execution. In general, the proposed runahead uselessness predictors are techniques for predicting the available MLP at a given point in a program. They are therefore applicable to other mechanisms designed to exploit MLP. Other methods of preexecution that are targeted for prefetching, such as helper threads,9,10 can use our techniques to eliminate inefficient threads and useless speculative execution. MICRO

Acknowledgments
We thank Mike Butler, Nhon Quach, Jared Stark, Santhosh Srinath, and other members of the HPS research group for their helpful comments on earlier drafts of this article. We gratefully acknowledge the commitment of the Cockrell Foundation, Intel Corporation, and the Advanced Technology Program of the Texas Higher Education Coordinating Board.

References
1. O. Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors,” Proc. 9th Int’l Symp. High-Performance Computer Architecture (HPCA-9), IEEE Press, 2003, pp. 129-140.
2. J. Dundas and T. Mudge, “Improving Data Cache Performance by Pre-executing Instructions Under a Cache Miss,” Proc. 1997 Int’l Conf. Supercomputing, IEEE Press, 1997, pp. 68-75.
3. O. Mutlu et al., “Runahead Execution: An Effective Alternative to Large Instruction Windows,” IEEE Micro, vol. 23, no. 6, Nov./Dec. 2003, pp. 20-25.
4. O. Mutlu, H. Kim, and Y.N. Patt, “Techniques for Efficient Processing in Runahead Execution Engines,” Proc. 32nd Int’l Symp. Computer Architecture (ISCA 05), IEEE CS Press, 2005, pp. 370-381.
5. A. Glew, “MLP Yes! ILP No!” Architectural Support for Programming Languages and Operating Systems (ASPLOS 98) Wild and Crazy Idea Session, Oct. 1998; http://www.cs.berkeley.edu/~kubitron/asplos98/slides/andrew_glew.pdf.
6. O. Mutlu, H. Kim, and Y.N. Patt, “Address Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns,” Proc. 38th Int’l Symp. Microarchitecture (Micro-38), IEEE CS Press, 2005, pp. 233-244.
7. O. Mutlu et al., “On Reusing the Results of Pre-executed Instructions in a Runahead Execution Processor,” Computer Architecture Letters, vol. 4, Jan. 2005, http://www.cs.virginia.edu/~tcca/2005/mutlu_jan05.pdf.
8. N. Kirman et al., “Checkpointed Early Load Retirement,” Proc. 11th Int’l Symp. High-Performance Computer Architecture (HPCA-11), IEEE CS Press, 2005, pp. 16-27.
9. R.S. Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” Proc. 26th Int’l Symp. Computer Architecture (ISCA 99), IEEE CS Press, 1999, pp. 186-195.
10. J.D. Collins et al., “Dynamic Speculative Precomputation,” Proc. 34th Int’l Symp. Microarchitecture (Micro-34), IEEE CS Press, 2001, pp. 306-317.

Onur Mutlu is a PhD candidate in computer engineering at the University of Texas at Austin. His research interests include computer architectures, with a focus on high-performance energy-efficient microarchitectures, data prefetching, runahead execution, and novel latency-tolerance techniques. Mutlu has an MS in computer engineering from UT Austin and BS degrees in psychology and computer engineering from the University of Michigan. He is a student member of the IEEE and the ACM.

Hyesoon Kim is a PhD candidate in electrical and computer engineering at the University of Texas at Austin. Her research interests include high-performance energy-efficient microarchitectures and compiler-microarchitecture interaction. Kim has master’s degrees in mechanical engineering from Seoul National University and in computer engineering from UT Austin. She is a student member of the IEEE and the ACM.


Yale N. Patt is the Ernest Cockrell Jr. Centennial Chair in Engineering at the University of Texas at Austin. His research interests include harnessing the expected fruits of future process technology into more effective microarchitectures for future processors. Patt has a PhD in electrical engineering from Stanford University. He is co-author of Introduction to Computing Systems: From Bits and Gates to C and Beyond (McGraw-Hill, 2nd edition, 2004). His honors include the 1996 IEEE/ACM Eckert-Mauchly Award and the 2000 ACM Karl V. Karlstrom Award. He is a Fellow of both the IEEE and the ACM.

Direct questions and comments about this article to Onur Mutlu, 2501 Lake Austin Blvd., Apt. N204, Austin, TX 78703; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.

