How to Solve the Current Memory Access and Data Transfer Bottlenecks: at the Processor Architecture Or at the Compiler Level?

Hot topic session: How to solve the current memory access and data transfer bottlenecks: at the processor architecture or at the compiler level? Francky Catthoor, Nikil D. Dutt, IMEC , Leuven, Belgium ([email protected]) U.C.Irvine, CA ([email protected]) Christoforos E. Kozyrakis, U.C.Berkeley, CA ([email protected]) Abstract bedded memories are correctly incorporated. What is more controversial still however is how to deal with dynamic ap- Current processor architectures, both in the pro- plication behaviour, with multithreading and with highly grammable and custom case, become more and more dom- concurrent architecture platforms. inated by the data access bottlenecks in the cache, system bus and main memory subsystems. In order to provide suf- 2. Explicitly parallel architectures for mem- ficiently high data throughput in the emerging era of highly ory performance enhancement (Christo- parallel processors where many arithmetic resources can work concurrently, novel solutions for the memory access foros E. Kozyrakis) and data transfer will have to be introduced. The crucial question we want to address in this hot topic While microprocessor architectures have significantly session is where one can expect these novel solutions to evolved over the last fifteen years, the memory system rely on: will they be mainly innovative processor archi- is still a major performance bottleneck. The increasingly tecture ideas, or novel approaches in the application com- complex structures used for speculation, reordering, and piler/synthesis technology, or a mix. caching have improved performance, but are quickly run- ning out of steam. These techniques are becoming ineffec- tive in terms of resource utilization and scaling potential. In addition, memory system performance will become even 1. Motivation and context more critical in the future, as the processor-memory performance gap is increasing at the rate of 50% per year [28]. At the processor architecture side, previous work has fo- Multimedia and embedded applications, that are expected to cused on microarchitecture enhancements like intelligent dominate the processing cycles on future microprocessors, management of cache hierarchies, streaming buffers and have significantly different memory access characteristics value or address prediction, techniques that exploit spa- compared to typical engineering workloads [26]. Temporal tial/temporal locality in memory references. But can this locality is not always available, leading to poor performance approach provide the memory performance necessary to from traditional caching structures. feed highly parallel processors? We will discuss alternative We believe the enhancement of memory system perfor- instruction set architectures that attempt to provide the hard- mance requires the synergy of software and hardware tech- ware with explicit information about parallelism in mem- niques. Each system component should be focused on the ory, and examine the opportunities and challenges from the task it is most efficient with: emerging combination of multiprocessing and multithreading in a single chip. Compilers and/or run-time tools can view and analyze At the side of the system design technology and compi- several hundred lines of source code. They can de- lation for embedded data-dominated multi-media applica- tect instruction/data parallelism and regular memory tions, also much evolution is present. We will show that access patterns, or transform the code so that paral- decisions made at this stage heavily influence the final out- lelism and desired patterns for the given hardware are come when the appropriate architectural issues of the em- created (see the two following sections). The processor can utilize the parallelism information memory instructions explicitly specify a large number to execute concurrently a large number of memory and memory accesses to be issued in parallel, along with computation operations using the massive hardware re- their spatial relation (sequential, strided, or indexed ac- sources available in current and future chips. The run- cesses). The architecture includes support for software time (dynamic) information available to the hardware speculation as well. The implementation allows con- can also be used to apply further optimizations within trol of address mapping at a per process or a per mem- a small window of instructions. ory page granularity. The instruction sets used by CISC and RISC processors The Impulse architecture [25] allows software to de- today are inherently sequential and hide from the hardware scribe regular memory access patterns directly to the the parallelism and memory access information that is avail- memory controller. The controller accesses memory able at various software levels. For example, a group of in an optimized manner for each pattern, performs independent operations are described to the hardware us- prefetching and groups the requested data for maxi- ing sequentially ordered instructions. Complex dependency mum caching efficiency. In addition, it provides sup- analysis has to be performed in hardware for the parallelism port for data remapping by the application or the com- to be ”re-discovered” and utilized. Similarly, a set of se- piler. quential or strided accesses to an array will be expressed as While instruction set architectures should focus on ex- a sequence of simple loads or stores to a single memory lo- posing parallelism to the hardware, processor design should cation. To apply any prefetching or other memory access focus on implementations of these ISAs that can tolerate optimization, the hardware must observe the sequence of high memory latency, even in the case of poor caching be- addresses and guess the access pattern first. havior. Increased memory latency is a technology prob- To enable high-performance, yet efficient, processor de- lem unlikely to be solved in the near future. On the signs, future instruction set architectures (ISAs) should al- other hand, high memory bandwidth is already available low software to pass explicit parallelism and memory access through technologies like multi-bank embedded DRAMs, information to the hardware. Instructions should explicitly high-performance memory interfaces (Rambus, DDR), and identify operations that can be issued and executed in par- cost-efficient MCM packaging. Processor designs can uti- allel. Memory instructions should allow control of memory lize high memory bandwidth in order to hide the perfor- system features such as cache space allocation, and provide mance penalty of high memory latency. This is in har- information that describes the access characteristics of the mony with the characteristics of many multimedia and e- application. This could include access pattern (e.g. sequen- commerce workloads, where throughput is by far more im- tial or strided), temporal and spatial locality hints, and ex- portant than latency. Frequently, latency can also be ad- pected caching behavior at the various levels of memory hi- dresseed with software techniques like those presented in erarchy. Using this information, the processor can issue in the next section. But hardware techniques for Ha hiding parallel or overlap a large number of memory accesses, em- latency are still necessary as they can be applied to all ap- ploy aggressive prefetching, and tune the use of the memory plications, regardless of the availability of source code or hierarchy for maximum performance and efficiency. From their suitability to compiler analysis. the point of view of compilers and run-time tools, explicitly There are several architectural approaches to utilizing parallel architectures allow fine control of hardware features high memory bandwidth. Some of the techniques proposed and create new opportunities for higher-level optimizations. or used recently include the following: There have been several recent examples of explicitly parallel architectures both from academia and industry. Multithreaded processors, such as Sun MAJC and Some examples of different approaches are: Compaq Alpha EV-8, attempt to execute concurrently several fine-grain threads. When some thread is The EPIC (IA-64) architecture [27] exposes par- blocked due to memory latency (e.g. a cache miss) allelism to the hardware using VLIW instructions. or lack of parallelism, the hardware switches to issu- Memory operations provide explicit hints about the ex- ing instructions from another thread within a couple pected caching behavior at each level of the memory of clock cycles. Given a large number of fine-grain hierarchy, which is used for guiding cache space al- threads and high memory bandwidth, multithreaded location. Software speculation is used to issue mem- designs can efficiently hide high memory latency. ory accesses as early as possible without compromis- ing program correctness. Multiprocessing designs combine two to four processors on the same chip forming a symmetric multipro- The VIRAM architecture [30] expresses parallelism cessor (IBM Power4, Sun MAJC, Stanford Hydra). to hardware in the form of vector operations. Vector Each processor can run a separate execution thread, 2 System bus load = crucial bottle-neck process or application. The high memory bandwidth is System-chip used to feed data to the multiple processors and keep On-chip Memory/ Main Cache Hierarchy Disk the overall system busy despite the high latency of each L2 system SDRAM access Hardware Proc bus bus bus individual access. Data L1 L2 (Main Disk Paths Memory) Vector processors (Berkeley

How to Solve the Current Memory Access and Data Transfer Bottlenecks: at the Processor Architecture Or at the Compiler Level?

Memory Interference Characterization Between CPU Cores and Integrated Gpus in Mixed-Criticality Platforms

A Cache-Efficient Sorting Algorithm for Database and Data Mining

Prefetching for Complex Memory Access Patterns

Eth-28647-02.Pdf

Detecting False Sharing Efficiently and Effectively

A Hybrid Analytical DRAM Performance Model

Performance Analysis of Cache Memory

Classifying Memory Access Patterns for Prefetching

STEALTHMEM: System-Level Protection Against Cache-Based Side Channel Attacks in the Cloud

Embedded Memory Architecture for Low-Power Application Processor

A Visual Performance Analysis Tool for Memory Bound GPU Kernels

Predicting Multiprocessor Memory Access Patterns with Learning Models