Beating In-Order Stalls with “Flea-Flicker”∗ Two-Pass Pipelining

Ronald D. Barnes, Erik M. Nystrom, John W. Sias, Sanjay J. Patel, Nacho Navarro, Wen-mei W. Hwu
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
{rdbarnes, nystrom, sias, sjp, nacho, hwu}@crhc.uiuc.edu

∗ In American football, the flea-flicker offense tries to catch the defense off guard with the addition of a forward pass to a lateral pass play. Defenders covering the ball carrier thus miss the tackle and, hopefully, the ensuing play.

Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36 2003) 0-7695-2043-X/03 $17.00 © 2003 IEEE

Abstract

Accommodating the uncertain latency of load instructions is one of the most vexing problems in in-order microarchitecture design and compiler development. Compilers can generate schedules with a high degree of instruction-level parallelism but cannot effectively accommodate unanticipated latencies; incorporating traditional out-of-order execution into the microarchitecture hides some of this latency but redundantly performs work done by the compiler and adds additional pipeline stages. Although effective techniques, such as prefetching and threading, have been proposed to deal with anticipable, long-latency misses, the shorter, more diffuse stalls due to difficult-to-anticipate, first- or second-level misses are less easily hidden on in-order architectures. This paper addresses this problem by proposing a microarchitectural technique, referred to as two-pass pipelining, wherein the program executes on two in-order back-end pipelines coupled by a queue. The “advance” pipeline executes instructions greedily, without stalling on unanticipated latency dependences (executing independent instructions while otherwise blocking instructions are deferred). The “backup” pipeline allows concurrent resolution of instructions that were deferred in the other pipeline, resulting in the absorption of shorter misses and the overlap of longer ones. This paper argues that this design is both achievable and a good use of transistor resources and shows results indicating that it can deliver significant speedups for in-order processor designs.

1 Introduction

Modern instruction set architectures offer the compiler several features supporting the enhancement of instruction-level parallelism and the generation of aggressive schedules for wide issue processors. Large register files grant the compiler broad computation restructuring ability needed to overlap the execution latency of instructions. Explicit control speculation features allow the compiler to mitigate control dependences, further increasing static scheduling freedom. Predication enables the compiler to optimize program decision and to overlap independent control constructs while minimizing code growth. In the absence of unanticipated run-time delays such as cache miss-induced stalls, the compiler can effectively utilize execution resources, overlap execution latencies, and work around execution constraints [1]. For example, we have measured that, when run-time stall cycles are discounted, the Intel reference compiler can achieve an average throughput of 2.5 instructions per cycle (IPC) across SPECint2000 benchmarks for a 1.0 GHz Itanium 2 processor.

Run-time stall cycles of various types prolong the execution of the compiler-generated schedule, in the noted example reducing throughput to 1.3 IPC. This paper focuses on the majority of those stall cycles: those that arise due to a load instruction missing in the data cache, when a load’s result does not arrive in time for consumption by its consumer instruction, triggering an interlock. Cache miss stall cycles are significant in the current generation of microprocessors and are expected to increase as the gap between processor and memory speeds continues to grow [2]. Achieving high performance in any processor design requires that they be mitigated effectively.

There are two important issues with data stall cycles. First, the run-time occurrence of data cache misses is in general hard to predict at compile time. Compilers can attempt to schedule instructions according to their expected cache miss latency; such strategies, however, fail to capitalize on cache hits and can over-stress critical resources such as machine registers. Second, when a data stall arises, it is desirable to overlap the data stall cycles with other data stall cycles as well as computing cycles. This requires the ability to defer the execution of an instruction waiting for its data while allowing other load and compute instructions to proceed. Contemporary out-of-order designs rely on register renaming, dynamic scheduling, and large instruction windows to provide such concurrency.

Although these out-of-order execution mechanisms effectively hide data cache miss delays, they replicate, at great expense, much work done by the compiler. Register renaming duplicates the effort of compile-time register allocation. Dynamic scheduling repeats the work of the compile-time scheduler. These mechanisms incur additional power consumption, add instruction pipeline latency, reduce predictability of performance, complicate EPIC feature implementation, and occupy substantial additional chip real estate.

Attempting to exploit the efficiencies of EPIC compilation and an in-order pipeline design while avoiding the penalty of cache miss stalls, this paper proposes a new microarchitectural organization employing two in-order sub-pipelines bridged by a first-in-first-out buffer (queue). The “advance” sub-pipeline, referred to as the A-pipe, executes all instructions speculatively without stalling. Instructions dispatching without all of their input operands ready, rather than incurring stalls, are suppressed, bypassing and writing specially marked non-results to their consumers and destinations. Other instructions execute normally. This propagation of non-results in the A-pipe to identify instructions affected by deferral is inspired by EPIC control speculation work [3]. The “backup” sub-pipeline, the B-pipe, executes instructions deferred in the A-pipe and incorporates all results in a consistent order. This two-pipe structure allows cache miss latencies incurred in one pipe to overlap with independent execution and cache miss latencies in the other while preserving in-order semantics in each.

This paper presents the design and evaluation of the flea-flicker two-pass pipelining model. We argue that two-pass pipelining effectively hides the latency of near cache accesses (such as hits in the L2 cache) and provides substantial performance benefit while preserving the most important characteristics of EPIC design. This argument is supported with simulations of SPEC95 and SPEC2000 benchmarks that characterize the prevalence and effects of the targeted latency events, demonstrate the effectiveness of the proposed model in achieving concurrent execution through these events, and inform the design decisions involved in building two-pass systems.

2 Motivation and case study

[...] available parallelism, but rather a particular implementation within that constraint. In Itanium, these groups are separated by variably-positioned “stop bits.” All instructions within an “issue group” are essentially fused with respect to dependence-checking [4]. If an issue group contains an instruction whose operands are not ready, the entire group and all groups behind it are stalled. This design accommodates wide issue by reducing the complexity of the issue logic, but introduces the likelihood of “artificial” dependences between instructions of unanticipated latency and instructions grouped with or subsequent to their consumers.

Not surprisingly, therefore, a large proportion of EPIC execution time is spent stalled waiting for data cache misses to return. When, for example, SPECint2000 is compiled with a commercial reference compiler (Intel ecc v.7.0) at a high level of optimization (-O3 -ipo -prof_use) and executed on a 1.0 GHz Itanium 2 processor with 3 MB of L3 cache, 38% of execution cycles are consumed by data memory access-related stalls. Furthermore, depending on the benchmark, between 10% and 95% of these stall cycles are incurred due to accesses satisfied in the second-level cache, despite its having a latency of only five cycles. As suggested previously, the compiler’s carefully generated, highly parallel schedule is being disrupted by the injection of many, short, unanticipated memory latencies. The two-pass design absorbs these events while allowing efficient exploitation of the compiler’s generally good schedule.

Figure 1 shows an example from one of the most significant loops in the SPECint2000 benchmark with the most pronounced data cache problems, 181.mcf. The figure, in which each row constitutes one issue group and arrows indicate data dependences, shows one loop iteration plus one issue group from the next. In a typical EPIC machine, on the indicated cache miss stall caused by the consumption of r42 in group 1, all subsequent instructions (dotted box) are prevented from issuing until the load is resolved, although only those instructions enclosed in the solid box are truly dependent on the cache miss. (Since the last of these is a branch, the instructions subsequent to the branch are, strictly speaking, control dependent on the cache miss, but a prediction effectively breaks this control dependence.) An out-of-order
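As a rough illustration of the deferral idea from the introduction and of the stall scenario described in Section 2, the following Python sketch contrasts a conventional in-order interlock with an A-pipe-style pass that propagates non-results. It is a minimal model under several assumptions that are not from the paper: the names `Instr`, `advance_pass`, and `in_order_stalled` and the six-instruction `PROGRAM` are hypothetical (this is not the 181.mcf loop of Figure 1), each instruction is treated as its own issue group, the missing load does not return within the window shown, and stores, branches, and cycle timing are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    name: str                 # label used only for reporting
    dest: Optional[str]       # destination register, or None
    srcs: Tuple[str, ...]     # source registers
    misses: bool = False      # assumption: True marks a load that misses in cache


def in_order_stalled(program: List[Instr]) -> List[str]:
    """Baseline interlock: the first consumer of a missing value stalls,
    and every later instruction is held behind it."""
    pending = set()
    for i, ins in enumerate(program):
        if any(src in pending for src in ins.srcs):
            return [later.name for later in program[i:]]
        if ins.dest is not None:
            if ins.misses:
                pending.add(ins.dest)
            else:
                pending.discard(ins.dest)
    return []


def advance_pass(program: List[Instr]) -> Tuple[List[str], List[str]]:
    """A-pipe-style pass: never stall. An instruction with a not-ready
    source is deferred and marks its destination as a non-result, so its
    own consumers are deferred as well."""
    invalid = set()           # registers currently holding a non-result
    issued, deferred = [], []
    for ins in program:
        if any(src in invalid for src in ins.srcs):
            deferred.append(ins.name)
            if ins.dest is not None:
                invalid.add(ins.dest)
        else:
            issued.append(ins.name)
            if ins.dest is not None:
                if ins.misses:
                    invalid.add(ins.dest)      # load issued, data not back yet
                else:
                    invalid.discard(ins.dest)  # valid write replaces a non-result
    return issued, deferred


# Hypothetical instruction sequence in program order.
PROGRAM = [
    Instr("ld1", "r1", ("r10",), misses=True),  # load that misses
    Instr("add1", "r2", ("r1", "r11")),         # consumes the missing value
    Instr("ld2", "r3", ("r12",)),               # independent load (hit)
    Instr("add2", "r4", ("r3", "r13")),         # independent
    Instr("cmp", "p1", ("r2", "r14")),          # depends on add1
    Instr("sub", "r5", ("r4", "r15")),          # independent
]

issued, deferred = advance_pass(PROGRAM)
print("held by in-order interlock:", in_order_stalled(PROGRAM))
print("issued in advance pass:    ", issued)
print("deferred to backup pass:   ", deferred)
```

In this toy sequence the interlock model holds back five instructions, while the advance pass defers only the two that actually depend on the missing load; those two are what a B-pipe would later execute once the data is available.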
