Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors
Onur Mutlu§   Jared Stark†   Chris Wilkerson‡   Yale N. Patt§
§ECE Department, The University of Texas at Austin, {onur,patt}@ece.utexas.edu
†Microprocessor Research, Intel Labs, [email protected]
‡Desktop Platforms Group, Intel Corporation, [email protected]

Abstract

Today's high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large, in terms of both design complexity and power consumption. And, the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor, without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations, allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel® Pentium® 4 processor, having a 128-entry instruction window, adding runahead execution improves the IPC (Instructions Per Cycle) by 22% across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.
1. Introduction

Today's high performance processors tolerate long latency operations by implementing out-of-order instruction execution. An out-of-order execution engine tolerates long latencies by moving the long-latency operation "out of the way" of the operations that come later in the instruction stream and that do not depend on it. To accomplish this, the processor buffers the operations in an instruction window, the size of which determines the amount of latency the out-of-order engine can tolerate.

Today's processors are facing increasingly larger latencies. With the growing disparity between processor and memory speeds, operations that cause cache misses out to main memory take hundreds of processor cycles to complete execution [25]. Tolerating these latencies solely with out-of-order execution has become difficult, as it requires ever-larger instruction windows, which increases design complexity and power consumption. For this reason, computer architects developed software and hardware prefetching methods to tolerate these long memory latencies.

We propose using runahead execution [10] as a substitute for building large instruction windows to tolerate very long latency operations. Instead of moving the long-latency operation "out of the way," which requires buffering it and the instructions that follow it in the instruction window, runahead execution on an out-of-order execution processor tosses it out of the instruction window.

When the instruction window is blocked by the long-latency operation, the state of the architectural register file is checkpointed. The processor then enters "runahead mode." It distributes a bogus result for the blocking operation and tosses it out of the instruction window. The instructions following the blocking operation are fetched, executed, and pseudo-retired from the instruction window. By pseudo-retire, we mean that the instructions are executed and completed as in the conventional sense, except that they do not update architectural state. When the blocking operation completes, the processor re-enters "normal mode." It restores the checkpointed state and refetches and re-executes instructions starting with the blocking operation.
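The sequence of events above (checkpoint, bogus result, pseudo-retire, restore) can be illustrated with a toy software sketch. The function name, instruction format, and latency parameter below are invented for illustration, and hardware details such as bogus-value propagation during runahead are not modeled:

```python
# Toy sketch of runahead mode transitions (illustrative only; this is not
# the paper's hardware design).
def run(program, miss_latency=4):
    """program: list of (name, needs_memory) tuples; returns an event log."""
    prefetched = set()   # cache lines brought in by runahead prefetches
    log = []
    pc = 0
    while pc < len(program):
        name, needs_memory = program[pc]
        if needs_memory and name not in prefetched:
            # A long-latency miss blocks the window head:
            # checkpoint architectural state and enter runahead mode.
            log.append(f"runahead at {name}")
            for ahead_name, ahead_mem in program[pc + 1 : pc + 1 + miss_latency]:
                log.append(f"pseudo-retire {ahead_name}")   # no arch-state update
                if ahead_mem:
                    prefetched.add(ahead_name)   # independent miss found early
            prefetched.add(name)   # the blocking miss has now been serviced
            log.append(f"restore, refetch {name}")
            continue   # normal mode: re-execute starting at the blocking op
        log.append(f"retire {name}")
        pc += 1
    return log
```

Running this on a stream with two misses shows the point of the technique: the second miss is discovered and prefetched during the first runahead episode, so it never blocks the window itself.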
Runahead's benefit comes from transforming a small instruction window which is blocked by long-latency operations into a non-blocking window, giving it the performance of a much larger window. The instructions fetched and executed during runahead mode create very accurate prefetches for the data and instruction caches. These benefits come at a modest hardware cost, which we will describe later.

Intel® and Pentium® are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA-9'03) 1530-0897/02 $17.00 © 2002 IEEE

In this paper we only evaluate runahead for memory operations that miss in the second-level cache, although it can be initiated on any long-latency operation that blocks the instruction window. We use Intel's IA-32 ISA, and throughout this paper, microarchitectural parameters (e.g., instruction window size) and IPC (Instructions Per Cycle) performance are reported in terms of micro-operations. Using a machine model based on the Intel Pentium 4 processor, which has a 128-entry instruction window, we first show that current out-of-order execution engines are unable to tolerate long main memory latencies. Then we show that runahead execution can better tolerate these latencies and achieve the performance of a machine with a much larger instruction window. Our results show that a baseline machine with a realistic memory latency has an IPC performance of 0.52, whereas a machine with a 100% second-level cache hit ratio has an IPC of 1.26. Adding runahead increases the baseline's IPC by 22% to 0.64, which is within 1% of the IPC of an identical machine with a 384-entry instruction window.
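A back-of-the-envelope calculation (illustrative numbers, not measurements from the paper) shows why the window must scale with memory latency: to keep accepting new work while one load waits on main memory, the window needs roughly latency times issue-rate entries.

```python
# Rough estimate of the instruction-window size needed to cover one full
# main-memory miss (illustrative numbers, not data from the paper).
def entries_needed(miss_latency_cycles, uops_per_cycle):
    # While the miss is outstanding, the front end keeps delivering uops;
    # all of them must be buffered behind the blocked load.
    return miss_latency_cycles * uops_per_cycle

# A 300-cycle miss at 3 uops per cycle implies roughly a 900-entry window,
# far beyond the 128 entries of the Pentium 4-like baseline machine.
print(entries_needed(300, 3))
```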
2. Related work

Memory access is a very important long-latency operation that has concerned researchers for a long time. Caches [29] tolerate memory latency by exploiting the temporal and spatial reference locality of applications. Kroft [19] improved the latency tolerance of caches by allowing them to handle multiple outstanding misses and to service cache hits in the presence of pending misses.

Software prefetching techniques [5, 22, 24] are effective for applications where the compiler can statically predict which memory references will cause cache misses. For many applications this is not a trivial task. These techniques also insert prefetch instructions into applications, increasing instruction bandwidth requirements.

Hardware prefetching techniques [2, 9, 16, 17] use dynamic information to predict what and when to prefetch. They do not require any instruction bandwidth. Different prefetch algorithms cover different types of access patterns. The main problem with hardware prefetching is the hardware cost and complexity of a prefetcher that can cover the different types of access patterns. Also, if the accuracy of the hardware prefetcher is low, cache pollution and unnecessary bandwidth consumption degrade performance.
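As one concrete instance of such hardware schemes, the sketch below shows a minimal stride prefetcher: it watches a demand-access stream and, once the same stride is seen twice in a row, fetches one line ahead. This is a generic textbook mechanism, not a design from [2, 9, 16, 17], and the class and method names are invented:

```python
# Minimal stride prefetcher sketch (generic textbook scheme, illustrative).
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None     # address of the previous demand access
        self.last_stride = None   # stride between the previous two accesses

    def access(self, addr):
        """Record a demand access; return an address to prefetch, or None."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prefetch = addr + stride   # stride confirmed twice: fetch ahead
            self.last_stride = stride
        self.last_addr = addr
        return prefetch
```

On a sequential stream with a 64-byte stride (0, 64, 128, ...), the third access confirms the stride and every access from then on triggers a prefetch of the next line; an irregular pointer-chasing stream would defeat it, which is the coverage problem the paragraph above describes.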
Thread-based prefetching techniques [8, 21, 31] use idle thread contexts on a multithreaded processor to run threads that help the primary thread [6]. These helper threads [...]tion bandwidth), which are not available when the processor is well used.

Runahead execution [10] was first proposed and evaluated as a method to improve the data cache performance of a five-stage pipelined in-order execution machine. It was shown to be effective at tolerating first-level data cache and instruction cache misses [10, 11]. In-order execution is unable to tolerate any cache misses, whereas out-of-order execution can tolerate some cache miss latency by executing instructions that are independent of the miss. We will show that out-of-order execution cannot tolerate long-latency memory operations without a large, expensive instruction window, and that runahead is an alternative to a large window. We also introduce the "runahead cache" to effectively handle store-load communication during runahead mode.

Balasubramonian et al. [3] proposed a mechanism to execute future instructions when a long-latency instruction blocks retirement. Their mechanism dynamically allocates a portion of the register file to a "future thread," which is launched when the "primary thread" stalls. This mechanism requires partial hardware support for two different contexts. Unfortunately, when the resources are partitioned between the two threads, neither thread can make use of the machine's full resources, which decreases the future thread's benefit and increases the primary thread's stalls. In runahead execution, both normal mode and runahead mode can make use of the machine's full resources, which helps the machine to get further ahead during runahead mode.

Finally, Lebeck et al. [20] proposed that instructions dependent on a long-latency operation be removed from the (relatively small) scheduling window and placed into a (relatively big) waiting instruction buffer (WIB) until the operation is complete, at which point the instructions are moved back into the scheduling window. This combines the latency tolerance benefit of a large instruction window with the fast cycle time benefit of a small scheduling window. However, it still requires a large instruction window (and a large physical register file), with its associated cost.

3. Out-of-order execution and memory latency tolerance

3.1. Instruction and scheduling windows

Out-of-order execution can tolerate cache misses better than in-order execution by scheduling operations that are independent of the miss. An out-of-order execution machine accomplishes this using two windows: the instruction win-