Cacheminer: a Runtime Approach to Exploit Cache Locality on SMP
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 4, APRIL 2000

Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP

Yong Yan, Member, IEEE Computer Society, Xiaodong Zhang, Senior Member, IEEE, and Zhao Zhang, Member, IEEE

Abstract—Exploiting the cache locality of parallel programs at runtime is a complementary approach to compiler optimization. This is particularly important for applications with dynamic memory-access patterns. We propose a memory-layout-oriented technique to exploit the cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and targeted architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned to a cache and minimizes the data sharing among the partitioned data regions assigned to all caches. The executions of tasks in the partitions are scheduled in an adaptive and locality-preserving way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and a SimOS-simulated SMP. Our simulation and measurement results show that our runtime approach can achieve performance comparable to compiler optimizations for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our experimental results show that our approach is able to significantly improve the memory performance of applications with irregular computation and dynamic memory-access patterns. These types of programs are usually hard to optimize with static compiler optimizations.
Index Terms—Cache locality, nested loops, runtime systems, simulation, symmetric multiprocessors (SMP), and task scheduling.

Y. Yan is with the Computer Systems Labs, Hewlett-Packard Laboratories, Palo Alto, CA 94304. E-mail: [email protected].
X. Zhang and Z. Zhang are with the Computer Science Department, College of William and Mary, Williamsburg, VA 23187-8709. E-mail: {zhang, zzhang}@cs.wm.edu.
Manuscript received 1 Oct. 1997; revised 24 June 1999; accepted 6 Jan. 2000. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 105746.
1045-9219/00/$10.00 © 2000 IEEE

1 INTRODUCTION

The recent developments in circuit design, fabrication technology, and Instruction-Level Parallelism (ILP) technology have increased microprocessor speed by about 100 percent every year. However, memory-access speed has improved by only about 20 percent every year. In a modern computer system, the widening gap between processor performance and memory performance has become a major bottleneck to improving overall computer performance. Since the increase in memory-access speed cannot match that of processor speed, memory-access contention increases, which results in longer memory-access latency. This makes memory-access operations much more expensive than computation operations. In multiprocessor systems, the effect of the widening processor-memory speed gap on performance becomes more significant due to the heavier access contention on the network and the shared memory, and to the cache coherence cost.

Recently, Symmetric Multi-Processor (SMP) systems have emerged as a major class of parallel computing platforms, such as the HP/Convex Exemplar S-class [3], SGI Challenge [7], Sun SMP servers [5], and DEC AlphaServer [19]. SMPs dominate the server market for commercial applications and are used as desktops for scientific computing. They are also important building blocks for large-scale systems. The improvement of the memory performance of applications is therefore critical to the successful use of SMP systems.

In order to narrow the processor-memory speed gap, hardware caches have been widely used to build a memory hierarchy in all kinds of computers, from supercomputers to personal computers. The effectiveness of the memory hierarchy in improving program performance comes from the locality property of both the instruction executions and the data accesses of programs. In a short period of time, the execution of a program tends to stay in a set of instructions close in time or close in the allocation space of a program, called instruction locality. Similarly, the set of instructions executed tends to access data that are also close in time or in the allocation space, called data locality. A fast and small cache close to a CPU is expected to hold the working set of a program so that memory accesses can be avoided or reduced.

Unfortunately, the memory hierarchy is not a panacea for eliminating the processor-memory performance gap. Low-level memory accesses are still substantial for many applications and are becoming more expensive as the processor-memory performance gap continues to widen. The reasons for possible slow memory accesses are:

- Applications may not be programmed with an awareness of the memory hierarchy.
- Applications have a wide range of working sets which cannot be held by a hardware cache, resulting in capacity misses at cache levels of the memory hierarchy.
- The irregular data-access patterns of applications result in excessive conflict misses at cache levels of the memory hierarchy.
- In a time-sharing system, the dynamic interaction among concurrent processes and the underlying operating system causes a considerable amount of memory accesses as processes are switched in context. This effect cannot be handled by the memory hierarchy on its own.
- In a cache-coherent multiprocessor system, false data sharing and true data sharing result in considerable cache coherence misses.
- In a CC-NUMA system, processes may not be perfectly co-located with their data, which results in remote memory accesses that significantly degrade overall performance.

Due to the increasing cost of memory accesses, techniques for eliminating the effect of long memory latency have been intensively investigated by researchers, from application designers to hardware architects. In general, the proposed techniques fall into two categories: latency avoidance and latency tolerance. The latency tolerance techniques are aimed at hiding the effect of memory-access latencies by overlapping computations with communications or by aggregating communications. The latency avoidance techniques, also called locality optimization techniques, are aimed at minimizing memory accesses by using software and/or hardware techniques to maximize the reusability of data or instructions at cache levels of the memory hierarchy. In an SMP system, reducing the total number of memory accesses is a substantial solution to reducing cache coherence overhead, memory contention, and network contention. So, the locality optimization techniques, i.e., the latency avoidance techniques, are more demanding than the latency tolerance techniques. In addition, because instruction accesses are more regular than data accesses, designing novel data-locality optimization techniques is more challenging and more important for performance improvement.

Fig. 1. SMP shared memory system model.

In the SMP model of Fig. 1, multiple processors, each with a hierarchy of caches, share a global memory. When a processor accesses its data, it first looks up the cache hierarchy. If the data is not found in the caches, the processor reads the memory block that contains the required data from the shared memory and brings a copy of the memory block into an appropriate cache block in the cache hierarchy. Data is copied into the cache hierarchy so that subsequent accesses to the data can be satisfied from the cache and memory accesses can be avoided. Cache locality optimization is aimed at optimizing the cache-access pattern of an application so that memory accesses can be satisfied in the cache as often as possible (in other words, so that cache data can be reused as much as possible).

To increase the chance of cache data being reused, we must reduce the interference that would replace or invalidate the cached data. In an SMP system, there are two types of interference that affect the reuse of cache data: interference from the local processor, which refills a cache block with new data, and interference from a remote processor, which invalidates stale data copies to maintain data consistency. The two types of interference in a parallel computation are determined by how the data is accessed in the computation, called the data-access pattern, and how the data is mapped into a cache, called the cache-mapping pattern. Hence, it is essential for a cache locality optimization technique to obtain and use information on the data-access pattern and the cache-mapping pattern of a parallel program.

The data-access pattern of a program is determined by program characteristics. Because the compilation time of an application is not a part of its execution time, a compiler can

The objective of this work is to propose and implement an efficient technique to optimize the data cache locality of parallel programs on SMP systems. The contributions of this work are: 1) We propose an effective cache locality exploitation method on SMP multiprocessors based on simple runtime information, which is complementary to compiler optimizations. To our knowledge, based