Stream-Based Memory Access Specialization for General Purpose Processors

Stream-based Memory Access Specialization for General Purpose Processors Zhengrong Wang Tony Nowatzki University of California, Los Angeles University of California, Los Angeles [email protected] [email protected] ABSTRACT 1 INTRODUCTION Because of severe limitations in technology scaling, architects have Inspired by the slowing of technology scaling and the limits of innovated in specializing general purpose processors for compu- improving general purpose processors, architects have developed tation primitives (e.g. vector instructions, loop accelerators). The many techniques which specialize aspects of general purpose ex- general principle is exposing rich semantics to the ISA. An oppor- ecution. This includes well-known techniques like vector instruc- tunity to explore is whether richer semantics of memory access tions, custom application specific instructions, and more recently patterns could also be used to improve the efficiency of memory and a number of in-core accelerators which specialize for particular communication. Two important open questions are how to convey computational patterns [3–9]. These approaches add rich semantic higher level memory information and how to take advantage of information into an ISA and leverage this with a more aggressive this information in hardware. or streamlined computation substrate. For example, vector archi- We find that a majority of memory accesses follow a small num- tectures expose instruction parallelism in the ISA; CCA and DySER ber of simple patterns; we term these streams (e.g. affine, indirect). expose instruction dependence in the ISA; BERET exposes the pres- Streams can often be decoupled from core execution, and patterns ence of common control patterns to the ISA, and so on. persist long enough to express useful behavior. Our approach is Though these techniques deliver significant performance and en- therefore to express streams as ISA primitives, which we argue ergy advantages, their specialization capabilities end at the bound- can enable: prefetch stream accesses to hide memory latency, semi- ary to the memory system. This is problematic considering Am- binding decoupled access to remove address computation and opti- dahl’s law for energy; one study of general purpose cores showed mize the memory interface, and finally inform cache policies. that the cache hierarchy and network consumed more than 50% of In this work, we propose ISA-extensions for decoupled-streams, power on chip [10]. Also, specialization paradigms typically make which interact with the core using a FIFO-based interface. We imple- strong assumptions on common code behaviors, limiting their po- ment optimizations for each of the aforementioned opportunities tential scope1. Specializing the core computation substrate and on an aggressive wide-issue OOO core and evaluate with SPEC CPU relying on only traditional memory abstractions is insufficient. 2017 and CortexSuite[1, 2]. Across all workloads, we observe about One unexplored opportunity is to perform the equivalent form 1.37× speedup and energy efficiency improvement over hardware of specialization for memory primitives: expose rich semantic infor- stride prefetching. mation about memory operations at fine grain at the ISA level, and take advantage with microarchitecture mechanisms. To be worth CCS CONCEPTS the specialization effort, such an approach should be designed to • Computer systems organization → Architectures; Hetero- address many of the sources of inefficiency of memory access, in- geneous (hybrid) systems. cluding within the core pipeline, at the interface to cache, and within the cache system itself. More specifically, this includes the KEYWORDS overhead in the pipeline for address generation instructions, the Stream, Specialization, Decoupled Access Execute, DAE, Prefetch- overhead of requiring long instruction windows for generating ing, Cache Bypassing, Microarchitecture, ISA overlapping memory requests, and also reducing the number of requests and transfers in the memory system. ACM Reference Format: A critical question then is: what is the structure of memory Zhengrong Wang and Tony Nowatzki. 2019. Stream-based Memory Access accesses which can be exploited? Our perhaps unsurprising answer Specialization for General Purpose Processors. In ISCA ’19: 46th International is streams – repeated patterns of memory access, occurring due to Symposium on Computer Architecture, June 22–26, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3307650.3322229 loops and nested loops. Figure 1 shows the prevalence of streams in CortexSuite [1, 2] and SPEC CPU 2017, with a breakdown to affine, indirect, and pointer-chasing types. On average, more than half of Permission to make digital or hard copies of all or part of this work for personal or dynamic accesses across both benchmark suites can be considered classroom use is granted without fee provided that copies are not made or distributed as streams, with the simplest affine streams being the most common. for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM Given the premise of specialization for streams, a number of must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, interesting questions arise: How should memory access patterns be to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. communicated to avoid high-overhead? Should data be consumed ISCA ’19, June 22–26, 2019, Phoenix, AZ, USA in a binding way, or only through the cache? How to ensure the © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6669-4/19/06...$15.00 1For example, CCA [3] and DySER [5] rely on clusters of dependent computations, https://doi.org/10.1145/3307650.3322229 BERET [4] relies on dynamic control-speculation being unnecessary, etc. 1 Affine Indirect Pointer Chase • Stream Specialization Principles/Mechanisms: The con- SPEC CPU 2017 cept of specializing multiple facets of general purpose proces- CortexSuite sor execution (core, memory-interface, and cache) with stream avg. abstractions, along with the development of mechanisms which leverage this information. 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 • Stream Specialization ISA/Microarchitecture: A lightweight Figure 1: Prevalence of Streams set of ISA-extensions for applying stream-specialization to an ISA, along with microarchitecture design for implementation right decoupling distance? Which patterns should be expressible, and integration with the core pipeline. and can patterns be expressed in the presence of control flow? How Paper Organization: In the remainder of the paper, we first should the cache take advantage of this information? present an overview (§2), then discuss key related work from sev- Our goal in this work is to explore some possible answers to eral relevant areas (§3). We next characterize the prevalence of these questions. The crux of our approach is twofold: 1. the ISA is streams (§4), and use this insight to propose ISA extensions for de- extended to express coarse grain streaming patterns, and 2. the core coupling streams (§5). This is followed by a description of how the pipeline is responsible to keep track of its relative index into each new information is exploited in an SSP microarchitecture (§6), as stream. With such a design, we argue the following opportunities well as stream-aware prefetch and cache policies (§7). Finally, we dis- are possible (depicted by Figure 2): cuss evaluation methodology (§8), results (§9), and conclude (§10). Opportunity 1: Stream-based Prefetching: Having knowledge of access patterns and their relationship to the core’s control flow can lead to a very effective stream-based programmable prefetcher. 2 OVERVIEW The prefetcher can understand when exactly to make a request In this section, we first make an argument for the requirements of based on how far ahead of the core the prefetch is. Similar to other a stream-specialized interface. We then overview the approach, in programmable prefetchers, this would enable the scheduling of terms of the ISA and microarchitecture of the stream-specialized memory requests far past the limits of a traditional OOO core processor (SSP), as well as the basics of how we exploit each stream- processor’s instruction window. specialization opportunity. Opportunity 2: Stream-decoupling: Stream primitives can fur- Stream ISA Requirements: In our stream-characterization study ther be incorporated into the functional semantics of the program to (Section 4), we find that streams are common (>50% of dynamic enable what we refer to as semi-binding prefetching. Following the access), which is promising. However, some streams are shorter principle of decoupled access execute [11], a specialized memory ac- (37% less than 100 accesses), streams often have indirect access cess engine would generate requests corresponding to streams, and (about 11%), and streams often coexist with control-flow (>50% of ordinary core instructions can access stream data through registers stream accesses). Based on this characterization, we argue that a that are mapped to this data (we call these pseudo-registers). There decoupled-stream ISA interface should have five qualities: are several potential advantages, including the removal of address 1 Integration-simplicity It should be lightweight and not re- generation instructions

Stream-Based Memory Access Specialization for General Purpose Processors

Memory Interference Characterization Between CPU Cores and Integrated Gpus in Mixed-Criticality Platforms

A Cache-Efficient Sorting Algorithm for Database and Data Mining

How to Solve the Current Memory Access and Data Transfer Bottlenecks: at the Processor Architecture Or at the Compiler Level?

Prefetching for Complex Memory Access Patterns

Eth-28647-02.Pdf

Detecting False Sharing Efficiently and Effectively

A Hybrid Analytical DRAM Performance Model

Performance Analysis of Cache Memory

Classifying Memory Access Patterns for Prefetching

STEALTHMEM: System-Level Protection Against Cache-Based Side Channel Attacks in the Cloud

Embedded Memory Architecture for Low-Power Application Processor

A Visual Performance Analysis Tool for Memory Bound GPU Kernels