The Performance of Runtime Data Cache Prefetching in a Dynamic Optimization System ∗
Jiwei Lu, Howard Chen, Rao Fu, Wei-Chung Hsu, Bobbie Othmer, Pen-Chung Yew
Department of Computer Science and Engineering, University of Minnesota, Twin Cities
{jiwei, chenh, rfu, hsu, bothmer, yew}@cs.umn.edu

Dong-Yuan Chen
Microprocessor Research Lab, Intel Corporation
[email protected]

Abstract

Traditional software controlled data cache prefetching is often ineffective due to the lack of runtime cache miss and miss address information. To overcome this limitation, we implement runtime data cache prefetching in the dynamic optimization system ADORE (ADaptive Object code REoptimization). Its performance has been compared with static software prefetching on the SPEC2000 benchmark suite. Runtime cache prefetching shows better performance. On an Itanium 2 based Linux workstation, it can increase performance by more than 20% over static prefetching on some benchmarks. For benchmarks that do not benefit from prefetching, the runtime optimization system adds only 1%-2% overhead. We have also collected cache miss profiles to guide static data cache prefetching in the ORC compiler. With that information the compiler can effectively avoid generating prefetches for loops that hit well in the data cache.

1. Introduction

Software controlled data cache prefetching is an efficient way to hide cache miss latency. It has been very successful for dense matrix oriented numerical applications. However, for other applications that include indirect memory references, complicated control structures and recursive data structures, the performance of software cache prefetching is often limited due to the lack of cache miss and miss address information. In this paper, we try to keep the applicability of software data prefetching by using cache miss profiles and a runtime optimization system.

∗ This work is supported in part by the U.S. National Science Foundation under grants CCR-0105574 and EIA-0220021, and grants from Intel, Hewlett Packard and Unisys.

1.1. Impact of Program Structures

We first give a simple example to illustrate several issues involved in software controlled data cache prefetching. Fig. 1 shows a typical loop nest used to perform a matrix multiplication.

void matrix_multiply(long A[N][N], long B[N][N], long C[N][N])
{
    int i, j, k;
    for ( i = 0; i < N; ++i )
        for ( j = 0; j < N; ++j )
            for ( k = 0; k < N; ++k )
                A[i][j] += B[i][k] * C[k][j];
}

Figure 1. Matrix Multiplication

Experienced programmers know how to use cache blocking, loop interchange, unroll-and-jam, or library routines to increase performance. However, typical programmers rely on the compiler to conduct such optimizations.

In this simple example, the innermost loop is a candidate for static cache prefetching. Note that the three arrays in the example are passed as parameters to the function. This introduces the possibility of aliasing and results in less precise data dependence analysis. We used two compilers for Itanium systems in this study: the Intel C/C++ Itanium Compiler (i.e., ECC V7.0) [15] and the ORC open research compiler 2.0 [26]. Both compilers have implemented static cache prefetch optimizations that are turned on at O3. For the example function above, the ECC compiler generates cache prefetches for the innermost loop, but the ORC compiler does not. Moreover, if the example is re-coded to treat the three arrays as global variables, the ECC compiler generates much more efficient code (over 5x faster on an Itanium 2 machine) by using loop unrolling. This simple example shows that program structure has a significant impact on the performance of static cache prefetching.
Some program structures require more complicated analysis (such as interprocedural analysis) to conduct efficient and effective cache prefetching.

Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36 2003) 0-7695-2043-X/03 $17.00 © 2003 IEEE

1.2. Impact of Runtime Memory Behavior

It is in general difficult to predict memory working-set size and reference behavior at compile time. Consider Gaussian elimination, for example. The execution of the loop nest usually generates frequent cache misses at the beginning of the execution and few cache misses near the end of the loop nest. Initially, the sub-matrix to be processed is usually too large to fit in the data caches; hence frequent cache misses will be generated. As the sub-matrices to be processed get smaller, they may fit in the cache and produce fewer cache misses. It is hard for the compiler to generate one binary that meets the cache prefetch requirements at both ends.

The memcpy library routine is another example. In some applications, a call to memcpy may involve a large amount of data movement and intensive cache misses. In other applications, calls to the memcpy routine incur few cache misses. Once again, it is not easy¹ to provide one memcpy routine that meets all the requirements.

¹A compiler may generate multiple versions of memcpy to handle different cases.

1.3. Impact of Micro-architectural Constraints

The microarchitecture can also limit the effectiveness of software controlled cache prefetching. For example, the issue bandwidth of memory operations, the memory bus and bank bandwidth [14][34], the miss latency, the non-blocking degree of the caches, and the memory request buffer size all affect the effectiveness of software cache prefetching.

void daxpy( double *x, double *y, double a, int n )
{
    int i;
    for ( i = 0; i < n; ++i )
        y[i] += a * x[i];
}

Figure 2. DAXPY

Consider the DAXPY loop in Fig. 2, for example. On the latest Itanium 2 processor, two iterations of this loop can be computed in one cycle (2 ldfpds, 2 stfds and 2 fmas, which fit in two MMF bundles). If prefetches must be generated for both the x and y arrays, the two extra memory operations required per iteration would exceed the "two bundles per cycle" constraint. Since the array references in this example exhibit unit stride, the compiler can unroll the loop to reduce the number of prefetch instructions. For non-unit stride loops, prefetch instructions are more difficult to reduce.

Stride-based prefetching is easy to perform efficiently. Prefetching for pointer-chasing references and indirect memory references [23][25][29][35][22] is relatively challenging, since it incurs higher overhead and must be applied more carefully. A typical compiler would not attempt high-overhead prefetching unless there is sufficient evidence that a code region has frequent data cache misses.

Due to the lack of cache miss information, static cache prefetching is usually less aggressive in order to avoid undesirable runtime overhead. We therefore attempt to conduct data cache prefetching at runtime through a dynamic optimization system. This system automatically monitors the performance of the executing binary, identifies the performance-critical loops/traces with frequent data cache misses, re-optimizes the selected traces by inserting cache prefetch instructions, and patches the binary to redirect subsequent execution to the optimized traces. Using the cache miss and miss address information collected from the hardware performance monitors, our runtime system conducts more efficient and effective software cache prefetching, and significantly increases the performance of several benchmark programs over static cache prefetching. Also in this paper, as a comparison, we feed the cache miss profile back to the ORC compiler so that the compiler can perform cache prefetch optimizations only for the code regions/loops that have frequent cache misses.

The remainder of the paper is organized as follows. Section 2 introduces the performance monitoring features of Itanium processors and the framework of our runtime optimization system. Section 3 discusses the implementation of runtime data cache prefetching. In Section 4, we present the performance evaluation of runtime prefetching and of a profile-guided static prefetching approach. Section 5 highlights related work and Section 6 offers the conclusion and future work.

2. Runtime Optimization and ADORE

The ADORE (ADaptive Object code REoptimization) system is a trace-based, user-mode dynamic optimization system. Unlike other dynamic optimization systems [10][3][7], its profiling and trace selection are based on sampling of hardware performance monitors. This section explains the performance monitoring, trace selection and optimization mechanisms in ADORE. Runtime prefetching is part of the ADORE system and will be discussed in detail in Section 3.

2.1. Performance Monitoring and Runtime Sampling on Itanium Processors

Intel's Itanium architecture offers extensive hardware support for performance monitoring [19]. The Itanium 2 processor provides more than one hundred counters that measure performance events such as memory latency, branch prediction rate, CPU cycles, retired-instruction counts, and pipeline stalls. Two usage models are supported using these performance counters: workload characterization, which gives the overall runtime cycle breakdown, and profiling, which provides information for identifying program bottlenecks.
Both models are programmable and fully supported in the latest IA-64 Linux kernel. Based on these two models, Stéphane Eranian developed a generic kernel interface called perfmon [28] to help programmers access the IA-64 PMU (Performance Monitoring Unit) and collect performance profiles on Itanium processors. The sampling of ADORE is built on top of this kernel interface.

[Figure 3. ADORE Framework — the main thread (Program, Trace Cache, Patcher) and the DynOpt thread (Trace Selector, Optimizer, Phase Detector) communicate through the User Event Buffer (UEB); samples come from the kernel-space System Sample Buffer (SSB).]

In ADORE, sample collection is achieved through a signal handler communicating with the perfmon interface.

int __libc_start_main (…)
{
    …
    dyn_open( );
    on_exit (dyn_close);