Vector Runahead
Ajeya Naithani† Sam Ainsworth‡ Timothy M. Jones◦ Lieven Eeckhout†
†Ghent University ‡University of Edinburgh ◦University of Cambridge

Abstract—The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve their performance in conventional processors has remained elusive.

Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, standard runahead execution skips ahead of cache misses: in modern workloads, this means it only prefetches the first cache-missing load in each dependent chain. We argue that this is not a fundamental limitation. If runahead were instead to stall on cache misses to generate dependent-chain loads, it could regain performance by stalling on many at once. With this insight, we present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once. Vectorization of the runahead instruction stream increases the effective fetch/decode bandwidth with reduced resource requirements, achieving high degrees of memory-level parallelism at a much faster rate. Across a variety of memory-latency-bound indirect workloads, Vector Runahead achieves a 1.79× performance speedup on a large out-of-order superscalar system, significantly improving on state-of-the-art techniques.

I. INTRODUCTION

Modern-day workloads are poorly served by current out-of-order superscalar cores. From databases [40], to graph workloads [49, 67], to HPC codes [11, 30], many workloads feature sparse, indirect memory accesses [9] characterized by high-latency cache misses that are unpredictable by today's stride prefetchers [1, 19]. For these workloads, even out-of-order superscalar processors spend the majority of their time stalled, since their ample reorder buffer and issue queue resources are still insufficient to capture the memory-level parallelism necessary to hide today's DRAM latencies.

Still, this performance gap is not insurmountable. Many elaborate accelerators [1, 3, 40, 44, 45, 48] have shown significant success in improving the performance of these workloads through domain-specific programming models and/or specialized hardware units dedicated to the task of fetching in new data. Part of the gap can even be eliminated with sophisticated programmer- and compiler-directed software-prefetching mechanisms [2, 4, 30, 41].
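To make these access patterns concrete, the sketch below shows the kind of dependent indirect load chain such workloads contain (as in hash joins or graph codes), together with a programmer-directed software prefetch in the style of the works cited above. It is a minimal, hypothetical illustration: the kernel, the array names, and the prefetch distance of 64 are ours rather than taken from the cited papers; `__builtin_prefetch` is the standard GCC/Clang prefetch intrinsic.

```c
#include <stddef.h>
#include <stdint.h>

/* A two-level dependent indirect chain: key[i] is a streaming access
 * that a stride prefetcher covers, but bucket[key[i]] and
 * value[bucket[key[i]]] depend on prior loads and miss unpredictably. */
int64_t indirect_sum(const uint32_t *key, const uint32_t *bucket,
                     const int64_t *value, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Software prefetching hides part of the latency by fetching
         * the first indirection level a fixed distance ahead; deeper
         * levels still wait on the loads they depend on. */
        if (i + 64 < n)
            __builtin_prefetch(&bucket[key[i + 64]], /*rw=*/0, /*locality=*/1);
        sum += value[bucket[key[i]]];  /* two dependent cache misses */
    }
    return sum;
}
```

Note the limitation this exposes: prefetching `value[bucket[key[i+64]]]` would itself require the (likely missing) load of `bucket[key[i+64]]` to complete first, which is exactly why dependent chains resist both cache-level prefetchers and simple software schemes.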
More ideal is a pure-microarchitecture technique that can do the same: achieving high performance for memory-latency-bound workloads, while requiring no programmer input, and being binary-compatible with today's applications. For simple stride patterns, such techniques are endemic in today's cache systems [19]. For more complex indirection patterns, the inability at the cache-system level to identify complex load chains and generate their addresses limits existing techniques to simple array-indirect [88] and pointer-chasing [23] codes. To achieve the instruction-level visibility necessary to calculate the addresses of complex access patterns seen in today's workloads [3], we conclude that this ideal technique must operate within the core, instead of within the cache.

Runahead execution [25, 32, 34, 57, 58, 64] is the most promising technique to date: upon a memory stall at the head of the reorder buffer (ROB), execution enters a speculative 'runahead' mode designed to prefetch future memory accesses. In runahead mode, the addresses of future memory accesses are calculated and the memory accesses are speculatively issued. When the blocking load miss returns, the ROB unblocks and the processor resumes normal-mode execution, at which time the memory accesses hit in the nearby caches. Runahead execution is thus highly effective at identifying independent load misses in the future instruction stream while the processor is stalled on a first long-latency load miss. By speculatively issuing multiple independent memory accesses, runahead execution significantly increases memory-level parallelism (MLP), ultimately improving overall application performance.

While runahead execution successfully prefetches independent load misses in the future instruction stream, it suffers from three fundamental limitations. First, runahead is unsuitable for complex indirection patterns that consist of chains of dependent load misses [62]. While combining a traditional hardware stride prefetcher with runahead might enable prefetching one level of indirection, it does not provide a general solution to prefetch all loads in a chain of dependent loads. Second, runahead execution is limited by the processor's front-end (fetch/decode/rename) width: the rate at which runahead execution can generate MLP is slow if there is a large number of instructions between the independent loads in the future instruction stream. Third, the speculation depth of runahead is limited by the amount of available back-end resources (issue queue slots and physical registers) [64].

In this paper, we propose Vector Runahead, a novel runahead technique that overcomes the above limitations through three key innovations. First, Vector Runahead alters runahead's termination condition by remaining in runahead mode until all loads in the dependence chain have been issued, as opposed to returning to normal mode as soon as the blocking load miss returns from main memory. This enables Vector Runahead to prefetch entire load chains. Second, Vector Runahead vectorizes the runahead instruction stream by reinterpreting scalar instructions as vector operations to generate many different cache misses at different offsets. It subsequently gathers many dependent loads in the instruction sequence, hiding cache latency even for complex memory-access patterns. In effect, vectorization virtually increases the effective fetch/decode bandwidth during runahead mode, while requiring very few back-end resources: only a single issue queue slot is needed for a vector operation with 8 (memory) operations. Third, Vector Runahead issues multiple rounds of these vectorized instructions through vector unrolling and pipelining to speculate even deeper and increase the effective runahead fetch/decode bandwidth further: 8 rounds of vector runahead with 8 vector loads each lead to 64 speculative prefetches issued in parallel.
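As a concrete, hypothetical illustration of the second and third innovations, the sketch below re-expresses eight scalar iterations of an indirect chain (the same shape as the earlier `indirect_sum` example) as AVX2 vector gathers. Vector Runahead performs this reinterpretation microarchitecturally on the runahead instruction stream, with no programmer-visible SIMD; the intrinsics merely show the resulting dataflow, in which each gather occupies a single issue-queue slot while issuing eight memory operations, and each indirection level becomes one dependent gather whose lanes miss in parallel.

```c
#include <immintrin.h>
#include <stdint.h>

/* Eight iterations' worth of value[bucket[key[i]]] as vector gathers:
 * one instruction per indirection level issues eight parallel accesses. */
static __m256i gather_chain_8(const uint32_t *key, const uint32_t *bucket,
                              const int64_t *value, uint32_t i)
{
    /* Lane indices i, i+1, ..., i+7: the striding first level. */
    __m256i idx = _mm256_add_epi32(_mm256_set1_epi32((int)i),
                                   _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
    /* Level 1: k = key[i..i+7] -- eight loads from one gather. */
    __m256i k = _mm256_i32gather_epi32((const int *)key, idx, 4);
    /* Level 2: b = bucket[k] -- eight dependent loads, again one gather. */
    __m256i b = _mm256_i32gather_epi32((const int *)bucket, k, 4);
    /* Level 3: value[b] -- 64-bit elements, so gather four lanes at a time. */
    __m256i lo = _mm256_i32gather_epi64((const long long *)value,
                                        _mm256_castsi256_si128(b), 8);
    __m256i hi = _mm256_i32gather_epi64((const long long *)value,
                                        _mm256_extracti128_si256(b, 1), 8);
    return _mm256_add_epi64(lo, hi);  /* keep all eight lanes live */
}
```

Issuing further rounds at i+8, i+16, and so on corresponds to the vector unrolling and pipelining of the third innovation: 8 rounds of 8-wide gathers put 64 speculative prefetches in flight at once.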
We evaluate Vector Runahead through detailed simulation using a variety of graph, database and HPC workloads,

A long-latency memory access that blocks the head of the ROB eventually leads to a full-window stall. Such a memory access typically stalls the core for tens to hundreds of cycles. Figure 1 shows CPI stacks for a range of benchmarks on an OoO core [27] (see Section V for our experimental setup). In addition to the cycles spent on performing useful work (shown as the 'Base' component), the CPI stacks also show the number of cycles the processor is waiting due to a full issue queue ('IQ full') or a memory access. Memory-access cycles are divided into three types of load instruction stalling the processor: (1) striding load instructions ('Stride'); (2) indirect load instructions that directly or indirectly depend on a striding load instruction ('Indirect'); and (3) other types of load instructions ('Other'). We find that indirect load instructions stall the processor for 61.6% of the total execution time on average, and up to 89.8% (HJ8). For high performance, it is critical to eliminate stalls from indirect memory accesses in OoO cores.

Fig. 1: CPI stacks for an out-of-order (OoO) core, Precise Runahead Execution (PRE), and our new Vector Runahead (VR). The memory component is broken down and attributed to striding loads and indirect dependent-chain loads. The previous state-of-the-art runahead cannot prefetch the majority of (indirect) memory accesses, unlike Vector Runahead. (Benchmarks: Camel, G5-s16, G5-s21, HJ2, HJ8, Kangar, NAS-CG, NAS-IS, Rand, and the average; y-axis: cycles per instruction.)

B. Limitations of Runahead Techniques

To alleviate the bottleneck caused by memory accesses, standard runahead execution [25, 32, 34, 57, 58] checkpoints and releases the architectural state of the application after a full-window stall and enters runahead mode. The processor then continues speculatively generating memory accesses. When the blocking memory access returns, the runahead interval terminates: the pipeline is flushed, the architectural state is restored to the point of entry into runahead, and normal execution resumes. The prefetches generated during runahead mode bring future data into the processor caches, reducing the number of upcoming stalls during normal mode and thus improving performance.

Precise Runahead Execution (PRE) [63, 64], the state-of-the-art in runahead execution, improves upon standard runahead through three key mechanisms. (1) PRE leverages the available back-end (issue queue and physical register file)