IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 4, APRIL 2000 357 Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP

Yong Yan, Member, IEEE Computer Society, Xiaodong Zhang, Senior Member, IEEE, and Zhao Zhang, Member, IEEE

AbstractÐExploiting cache locality of parallel programs at runtime is a complementary approach to a compiler optimization. This is particularly important for those applications with dynamic memory access patterns. We propose a memory-layout oriented technique to exploit cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and targeted architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory- access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned in a cache, and minimizes the data sharing among the partitioned data regions assigned to all caches. The executions of tasks in the partitions are scheduled in an adaptive and locality-preserved way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and an SimOS simulated SMP. Our simulation and measurement results show that our runtime approach can achieve comparable performance with the compiler optimizations for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by the tiling and other program transformations. However, our experimental results show that our approach is able to significantly improve the memory performance for the applications with irregular computation and dynamic memory access patterns. These types of programs are usually hard to optimize by static compiler optimizations.

Index TermsÐCache locality, nested loops, runtime systems, simulation, symmetric multiprocessors (SMP), and task scheduling. æ

1INTRODUCTION

HE recent developments in circuit design, fabrication improvement on the memory performance of applications Ttechnology, and Instruction-Level Parallelism (ILP) is critical to the successful use of SMP systems for technology have increased speed about applications. 100 percent every year. However, memory-access speed In order to narrow the processor-memory speed gap, has only improved about 20 percent every year. In a hardware caches have been widely used to build a memory modern computer system, the widening gap between hierarchy in all kinds of computers, from supercomputers processor performance and memory performance has to personal computers. The effectiveness of the memory become a major bottleneck to improving overall computer hierarchy for improving performance of programs comes performance. Since the increase in memory-access speed from the locality property of both instruction executions cannot match that of the processor speed, memory-access contention is increased, which results in a longer memory- and data accesses of programs. In a short period of time, the access latency. This makes memory-access operations much execution of a program tends to stay in a set of instructions more expensive than computation operations. In multi- close in time or close in the allocation space of a program, processor systems, the effect of the widening processor- called the instruction locality. Similarly, the set of instruc- memory speed gap on performance becomes more sig- tions executed tends to access data that are also close in nificant due to the heavier access contention on the network time or in the allocation space, called the data locality. Using and the shared memory and to the cache coherence cost. a fast and small cache close to a CPU is expected to hold the Recently, Symmetric Multi-Processor (SMP) systems have working set of a program so that memory accesses can be emerged as a major class of parallel computing platforms, avoided or reduced. such as HP/Convex Exemplar S-class [3], SGI Challenge [7], Unfortunately, the memory hierarchy is not a panacea Sun SMP servers [5], and DEC AlphaServer [19]. SMPs dominate the server market for commercial applications for eliminating the processor-memory performance gap. and are used as desktops for scientific computing. They are Low-level memory accesses are still substantial for many also important building blocks for large-scale systems. The applications and are becoming more expensive as the processor-memory performance gap continues to widen. The reasons for possible slow memory accesses are: . Y. Yan is with the Computer Systems Labs, Hewlett Packard Laboratories, Palo Alto, CA 94304. Email: [email protected]. . Applications may not be programmed with an . X. Zhang and Z. Zhang are with the Computer Science Department, awareness of the memory hierarchy. College of William and Mary, Williamsburg, VA 23187-8709. E-mail: {zhang, zzhang}@cs.wm.edu. . Applications have a wide range of working sets which cannot be held by a hardware cache, resulting Manuscript received 1 Oct. 1997; revised 24 June 1999; accepted 6 Jan. 2000. For information on obtaining reprints of this article, please send e-mail to: in capacity misses at cache levels of the memory [email protected], and reference IEEECS Log Number 105746. hierarchy.

1045-9219/00/$10.00 ß 2000 IEEE 358 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 4, APRIL 2000

. The irregular data-access patterns of applications result in excessive conflict misses at cache levels of the memory hierarchy. . In a time-sharing system, the dynamic interaction among concurrent processes and the underlying operating system causes a considerable amount of memory accesses as processes are switched in context. This effect cannot be handled by the memory hierarchy on its own. . In a cache coherent multiprocessor system, false data sharing and true data sharing result in considerable cache coherent misses. . In a CC-NUMA system, processes may not be perfectly co-located with their data, which results Fig. 1. SMP shared memory system model. in remote memory accesses to significantly degrade overall performance. global memory. When a processor accesses its data, it first Due to the increasing cost of memory accesses, techni- looks up the cache hierarchy. If the data is not found in ques for eliminating the effect of long memory latency have caches, the processor reads the memory block that contains been intensively investigated by researchers from applica- the required data from the shared memory and brings a tion designers to hardware architects. In general, the copy of the memory block in an appropriate cache block in proposed techniques fall into two categories: latency the cache hierarchy. Data is copied into the cache hierarchy avoidance and latency tolerance. The latency tolerance so that the subsequent accesses to the data can be satisfied techniques are aimed at hiding the effect of memory-access from the cache and memory accesses can be avoided. The latencies by overlapping computations with communica- cache locality optimization is aimed at optimizing the tions or by aggregating communications. The latency cache-access pattern of an application so that memory avoidance techniques, also called locality optimization accesses can be satisfied in the cache as often as possible (or techniques, are aimed at minimizing memory accesses by in other words, cache data can be reused as much as using software and/or hardware techniques to maximize possible). To increase the chance of cache data to be reused, the reusability of data or instructions at cache levels of we must reduce the interference that would replace or the memory hierarchy. In a SMP system, reducing the total invalidate the cache data. In a SMP system, there are two number of memory accesses is a substantial solution to types of interference that would affect the reuse of cache reduce cache coherence overhead, memory contention, data: the interference from the local processor which refills and network contention. So, the locality optimization a cache block with new data, and the interference from a techniques, i.e., the latency avoidance techniques, are more remote processor which invalidates stale data copies to demanding than the latency tolerance techniques. In maintain data consistency. The two types of interference in addition, because instruction accesses are more regular a parallel computation are determined by how the data is than data accesses, designing novel data-locality optimiza- accessed in the computation, called the data-access pattern, tion techniques is more challenging and more important for and how the data is mapped into a cache, called the cache- performance improvement. mapping pattern. 
Hence, it is essential for a cache locality The objective of this work is to propose and implement optimization technique to obtain and use the information an efficient technique to optimize the data cache locality of on the data-access pattern and the cache-mapping pattern parallel programs on SMP systems. The contributions of of a parallel program. this work are: 1) We propose an effective cache locality The data-access pattern of a program is determined by exploitation method on SMP multiprocessors based on program characteristics. Because the compilation time of an simple runtime information, which is complimentary to application is not a part of its execution time, a compiler can compiler optimizations. To our knowledge based on the use sophisticated techniques to analyze the data-access existing literature, our design method and its implementa- pattern of a program. However, there is a large class of real tions on SMPs are unique. 2) Abstracting the estimated world applications whose data-access patterns cannot be physical memory-access patterns of program loops into analyzed at compile-time. The data-access patterns of many internal data structures, we are able to partition and of these functions are dependent on runtime data. In schedule parallel tasks by optimizing cache locality of the addition, many real-world applications have indirect data program. 3) We have built an application programming accesses [22], which are difficult for a compiler to analyze. interface (API) to collect simple program hints and For example, pointers may point to different objects during automatically generate memory-layout oriented parallel the execution of a program, and the subscripts of an array programs for users. variable may be given by another array variable. The existence of these complicated applications recommands 1.1 The Problem runtime techniques for analyzing data-access patterns. In a SMP system as shown in Fig. 1, each processor has a In Fig. 2, we present a sparse matrix multiplication hierarchy of local caches (such as the on-chip cache and the algorithm where two sparse source matrices have dense off-chip cache in the figure) and all the processors share a representations. In the innermost loop, the two elements to YAN ET AL.: CACHEMINER: A RUNTIME APPROACH TO EXPLOIT CACHE LOCALITY ON SMP 359

Exploiting cache locality at runtime has been done in loop scheduling. References [12], [13], [25] present dynamic scheduling algorithms that take into consideration the affinity of loop iterations to processors. The main idea is to let each iteration stay on the same processor while it is repeatedly executed. Although significant performance improvement can be acquired for some applications, the type of affinity exploited by this approach is not very popular because it does not take into consideration the relationship between memory references of different itera- tions. Reference [12] does give some consideration to the data distribution of a loop by allocating iterations close to their data. In a SMP shared-memory system, all shared data reside in a unique shared memory where the proposed method in [12] is not applicable. The proposed technique in this paper not only takes into consideration the affinity of parallel tasks to processors, it also uses information on the underlying cache architecture and memory reference Fig. 2. A Sparse Matrix Multiplication (SMM) which has a dynamic data- patterns of tasks to minimize cache misses and false access pattern and an irregular computation pattern. sharing. In the design of the COOL language [4], the locality be multiplied, A[k] and B[r], are indirectly determined by exploitation issue is addressed using language mechanisms the data in indexing arrays Arow, Acol, Bcol, and Brow. and a runtime system. Both task affinity and data affinity The data-access pattern of this program can only be are specified by users and then are implemented by the determined at runtime when input data is available. runtime system. A major limit with this approach is that the quality of locality optimizations heavily depends on a 1.2 Comparisons with Related Work programmer. Our proposed technique uses a simple Because the cache locality of an application is affected by programming interface for a user or compiler to specify program characteristics, by parallel compilation methods, simple information about data, not about complicated by the interference of the underlying operating system, and affinity relations. The affinity relations will be recognized by the architectural features of the cache, many locality at runtime. exploitation methods have been proposed to improve the Regarding the runtime locality optimization of sequen- memory performance of applications at different system tial programs, reference [16] proposes a memory-layout levels (see, e.g., [2], [4], [6], [8], [10], [11], [12], [13], [14], [15], oriented method. It reorganizes the computation of a loop [16], [20], [25]). based on some simple hints about the memory reference At compilation phase, the main idea of locality optimiza- patterns of loops and cache architectural information. tion techniques is to conduct two types of transformations: Compared with a uniprocessor system, a cache coherent program transformations and data layout transformations. shared memory system has more complicated factors that Some program transformation-based techniques use a data- should be considered for locality exploitation, such as data reuse prediction model to determine a sequence of program sharing and load balancing. Our proposed memory-layout transformations in order to maximize the cache locality of oriented method is aimed at attacking the following two applications (see, e.g., [11], [14]). 
These techniques are distinct tasks: 1) It uses a more precise multidimensional control-centric where transformations are mainly con- hash structure to reorganize tasks so that the task can be ducted based on control flow analysis. The other type of partitioned easily into groups to maximize data reuse in a transformations reorganize program structures based on group and minimize data sharing among groups. 2) It data layout (see, e.g., [6], [9]). These techniques are data- dynamically trades off cache locality with parallelism, load centric in the logic space of a program, and have only been imbalance, and runtime scheduling overhead. used on sequential programs so far. Data transformation based techniques aim at restructuring data layout so that 1.3 Models and Notations data locality is optimized (see, e.g., [8], [20]). More recently, The program structures addressed in this paper are nested some techniques are proposed to take into consideration loops in applications as shown in Fig. 3 (all the programs both program transformations and data transformations presented in this paper are in C-language format consistent (see, e.g., [1], [2]). The success of compiler based transfor- with the C-language implementation of our runtime mations in improving memory performance of applications system). In Fig. 3, lj and uj are the lower bound and upper depends on static analysis at compiler-time on control flow bound of loop index variable ij for j ˆ 1; 2; ÁÁÁ;k k  1†, and data flow. For applications with irregular and/or which are usually functions of the outer loop index dynamic data access patterns, it is difficult for compiler variables, i1;i2; ÁÁÁ;ij1, and are determined at runtime. based techniques to conduct control flow analysis and data The loop body B is a set of statements where the statements flow analysis. In this case, runtime analysis is necessary and can also be loops. An execution instance of the loop body B important. can be considered as a fine-grained task, which can be 360 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 4, APRIL 2000

Fig. 3. Data-independent nested loop. expressed as B(t1;t2; ÁÁÁ;tk), where tj is the value of index variable ij for j ˆ 1; 2; ÁÁÁ;k. The condition that the above nested loop must satisfy is defined as follows: All the execution instances of the loop body B are data-independent, i.e., for any two instances, 1 1 1 2 2 2 B(t1;t2; ÁÁÁ;tk) and B(t1;t2; ÁÁÁ;tk), the following condition is valid:

1 1 1 2 2 2 out B t1;t2; ÁÁÁ;tk†† \ out B t1;t2; ÁÁÁ;tk†† ˆ ; ^ 1 1 1 2 2 2 out B t1;t2; ÁÁÁ;tk†† \ in B t1;t2; ÁÁÁ;tk††ˆ;^ 1† 1 1 1 2 2 2 in B t1;t2; ÁÁÁ;tk†† \ out B t1;t2; ÁÁÁ;tk†† ˆ ;; where notations out and in represent, respectively, Fig. 4. Framework of the runtime system. the output variable set and input variable set of an instance [22]. reorganizing tasks into cache affinity groups where the The rest of the paper is organized as follows: The next tasks in a group are expected to heavily reuse their data in section describes our runtime optimization technique in the cache, partitioning task-affinity groups onto multiple detail. Section 3 presents our performance evaluation processors so that data sharing among multiple processors method. The performance results are given in Section 4. is minimized, and then adaptively scheduling the execution Finally, we conclude our work in Section 5. of tasks. In order to minimize runtime overhead, a multidimen- 2RUNTIME CACHE LOCALITY EXPLOITATION sional hash table is internally built to manage a set of task- METHOD affinity groups. Meanwhile, a set of hash functions are given to map into an appropriate task-affinity group in the A runtime system should be highly effective in order to hash table. Locality oriented task reorganization and prevent the benefit of cache locality optimizations from being nullified by the associated runtime overhead. Thus, a partitioning are integratedly done in the task mapping. simple and heuristic runtime approach is the only realistic This section describes our information estimation method, choice. internal data structures for task reorganization and parti- The basic idea of our method is to group and partition tioning, and our task scheduling algorithm. tasks through shrinking and partitioning the memory access 2.1 Memory-Access Pattern Estimation space of parallel tasks. Shrinking the memory access space In a loop structure, data is usually declared and stored as is to group tasks that have shared data accessing regions. arrays. Let A ;A ; ÁÁÁ;A be the n arrays accessed in the These tasks are expected to reuse data in a cache when they 1 2 n loop body of a nested data-independent loop. Each array is execute as a group. Partitioning the memory access space is to divide tasks into several partitions so that tasks usually laid out in a contiguous memory region, indepen- belonging to different partitions have minimal data sharing dent of the other arrays. Many operating systems attempt to among their memory access regions. Finally, task partitions map contiguous virtual pages to cache blocks contiguously, are adaptively scheduled to execute on the multiprocessor so that our virtual-address-based study is practically system for possible load balance, subject to minimize the meaningful and effective. In rare cases, an array may be execution time of the tasks. This runtime approach is laid out across several uncontiguous memory pages. primarily memory-layout oriented. Although our runtime system may not handle these rare Our runtime technique has been implemented as a set of cases efficiently, the system works well for most memory library functions. Fig. 4 presents a framework of the system. layout cases in practice. A given sequential application program is first transformed The memory regions of the n independent arrays can be by a compiler or rewritten by a user to insert runtime represented by an n-dimensional memory-access space, functions. 
The generated executable file is encoded with expressed as A1;A2; ÁÁÁ;An†, where arrays are arranged in application-dependent hints. At runtime, the encoded any selected order by a user. This n-dimensional memory- functions are executed to fulfill the following functional- access space practically contains all the memory addresses ities: estimating the memory-access pattern of a program, that are accessed by a loop. YAN ET AL.: CACHEMINER: A RUNTIME APPROACH TO EXPLOIT CACHE LOCALITY ON SMP 361

Hint 4. A memory-access vector of task tj:

aj1;aj2; ÁÁÁ;ajn†;

where aji is the starting address of the referenced region on ith array by task tj i ˆ 1; 2; ÁÁÁ;n†. In some loop structures, a parallel iteration may not contiguously access an array so that the access region may not be precisely abstracted by the starting address. In this case, the loop iteration should be further split into smaller iterations so that each iteration accesses a contiguous region on each array. In addition, the following hint is also provided to assist task partitioning. Hint 5. p, the number of processors. Based on the above hints, the memory-access space of the loop is abstracted as an n-dimensional memory-access space:

b1 : b1 ‡ s1 1;b2 : b2 ‡ s2 1; ÁÁÁ;bn : bn ‡ sn 1†:

Task tj is abstracted as point aj1;aj2; ÁÁÁ;ajn† in the memory-access space based on the runtime estimation on its memory-access pattern. Fig. 5 presents an example of the abstract representation of the memory accesses based on the physical memory layout of arrays A and B in the SMM given in Fig. 2. Fig. 5a gives the hints on the memory-access space. Fig. 5b illustrates the memory layout of two arrays where B and A are laid out at starting address 100 and 1,000, respectively. Each array element has size of 8 bytes. Then, the memory-access space is represented by a 2D space as shown in Fig. 5c where each point gives a pair of possible Fig. 5. Memory-access space representations of the SMM program. (a) Hints on memory layouts of two accessed arrays. (b) Physical starting memory-access addresses on A and B, respectively, memory layout. (c) A 2D memory-accessing space. by a task. For example, t(1,000, 100) means task t will access array A at starting memory address 1,000, and access In order for the runtime system to capture the memory array B at starting physical address 100. space information, the following three hints are provided 2.2 Task Reorganization by the interface. In the memory-access space, same or nearby task points Hint 1. n, the number of arrays accessed by tasks. access the same or nearby memory addresses. So, grouping Hint 2. The size in bytes of each array. Based on this, the nearby tasks in the memory-access space for execution take runtime system maintains a Memory-access Space Size advantage of temporal locality and spatial locality of

vector s1;s2; ÁÁÁ;sn†, denoted as the MSS vector, where programs. This is achieved by shrinking the memory-access space based on the underlying cache capacity (size). si is the size of ith array i ˆ 1; 2; ÁÁÁ;n†. Let ft a ;a ; ÁÁÁ;a †ji ˆ 1; 2; ÁÁÁ;mg be a set of m data Hint 3. The starting memory address of each array. From this, i i1 i2 in independent tasks of a parallel loop, and b : b ‡ s the underlying runtime system constructs a starting 1 1 1 1;b2 : b2 ‡ s2 1; ÁÁÁ;bn : bn ‡ sn 1† be the memory-access address vector b1;b2; ÁÁÁ;bn†, denoted the SA vector, space of the parallel loop. Conceptually, task ti (i=1, ÁÁÁ, n) where bi i ˆ 1; 2; ÁÁÁ;n† is the starting memory address is mapped onto point ai1;ai2; ÁÁÁ;ain† in the memory-access of ith array. space based on the starting memory addresses of their Hint 1 is static and can be collected at the user or memory-access regions. In addition, let p be the number of compiler level. Hint 2, the array size, may be static and is processors and C be the capacity of the underlying known at compiler-time, or dynamic and is determined at secondary cache in bytes. runtime. Hint 3, the starting addresses, are dynamic Task reorganization consists of two steps. In the first because memory addresses can only be determined at step, the memory-access space b1 : b1 ‡ s1 1;b2 : b2 ‡ s2 runtime. 1; ÁÁÁ;bn : bn ‡ sn 1† is shifted to origin point (0; ÁÁÁ; 0)by After determining the global memory-access space of a subtracting b1;b2; ÁÁÁ;bn† from the coordinates of all task loop, we need to determine how each parallel iteration points. In the second step, we use the equal-shrinking accesses the global memory-access space so that we can method to shrink each dimension of the shifted memory reorganize them to improve memory performance. Here, by fC=n. The n-dimensional space resulted from shrinking we abstract each instance of a loop body of a parallel loop as is called a n-dimensional bin space. Here, f is a weight a parallel task. The accessing region of a task in an array is constant in 0; 1Š. When f ˆ 1, the cache is fully utilized, simply represented by the starting address of the region. So, otherwise the cache is partially used for the tasks. This gives the following hint is provided by the runtime system. an alternative to tune program performance. In the bin 362 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 4, APRIL 2000

Fig. 6. Equally shrinking a memory-access space. (a) Abstract 2D memory-access space. (b) Shifting memory-access space by function f1 : f1 x; y†ˆ x 1; 000;y 100†; (c) Shrinking memory-access space by function f2 : f2 x; y†ˆ x=1; 000;y=100† on a cache of 200 bytes. space, each point is associated with a task bin which holds 1. Optimally partitioning tasks into p parties: all the tasks that are mapped into it. P1;P2; :::; Pp so that In Fig. 6, the shrinking procedure of the memory-access X space is exemplified by the 2D memory-access space given jsharing Pi;Pj†j in Fig. 5. Before shrinking, the original memory-access 1i;jp†^ i6ˆj† space is shifted to origin point 0; 0† (see Fig. 6b). The is minimized, where sharing Pi;Pj† is called sharing shifting function is shown in Fig. 6b. Then each dimension degree measured by the boundary space among the of the shifted memory-access space is shrunk by C/2 into a partitions. This optimization is aimed at minimizing new 2D bin space in Fig. 6c. The tasks in the shadow square cache coherence overhead. in Fig. 6b would not access larger space than the cache size, 2. Balancing the task volumes among p partitions. and are mapped onto one point in the bin space. All the The major function of partitioning an n-dimensional bin tasks in a bin can be grouped together to execute. n space B 0:L1; 0:L2; ÁÁÁ; 0:Ln† is to find a partitioning ~ 2.3 Task Partitioning vector k k1;k2; ÁÁÁ;kn† so that the above two conditions are After shrinking the n-dimensional memory-access space, satisfied. Because finding an optimal partitioning vector is tasks have been grouped based on locality affinity an NP-complete problem, we propose a heuristic algorithm information in an n-dimensional bin space. Task partition- based on the following partitioning rules. Detailed proofs ing is aimed at partitioning the n-dimensional bin space can be found in [24]. into p partitions (p is the number of processors and each Theorem 1. Ordering Rule. For a given partitioning vector partition is an n-dimensional polyhedron) by achieving: ~ ~ k k1;k2; ÁÁÁ;kn† not in decreasing order, by sorting k in YAN ET AL.: CACHEMINER: A RUNTIME APPROACH TO EXPLOIT CACHE LOCALITY ON SMP 363

decreasing order, the sharing degree of the resulting vector is at 2.4 Locality-Preserved Adaptive Scheduler least as low as that of k~. The dynamic scheduler in the runtime system is aimed at Theorem 2. Increment Rule 1. For an n-dimensional bin space minimizing the parallel computing time of a set of data- n ~ B , and partitioning vectors k k1;k2; ÁÁÁ;ki;ki‡1  independent tasks created in the task reorganization step. ~0 q; 1; ÁÁÁ; 1† and k k1;k2; ÁÁÁ;ki  q; ki‡1; 1; 1; ÁÁÁ; 1†; where The task groups generated in the task reorganization step q>1, k~has smaller shared data regions than that of k~0 if and are locality oriented. This may not guarantee that all the only if partitions in the second step have the same execution time, due to the following possible reasons: ki  Li‡1 >ki‡1  Li: 1. Irregular data access patterns of programs will generate different amount of tasks in each partition; Corollary 1. Increment Rule 2. For an n-dimensional bin space 2. irregular computation patterns will directly result in n ~ B , and partitioning vectors k k1;k2; ÁÁÁ;ki;ki‡1; 1; ÁÁÁ; 1† different execution times among the partitions; ~0 ~ and k k1;k2; ÁÁÁ;ki  ki‡1; 1; 1; ÁÁÁ; 1†, where ki‡1 > 1, k has 3. possible interference of a paging system in the ~0 smaller shared data regions than that of k if and only if operating system will also generate different amount k  L > ÂL : of tasks in each partition; and i i‡1 i 4. different data locality exploited in different parti- Based on the above three rules, we design an efficient tions may speed up their executions at different heuristic algorithm as follows: rates. The scheduling problem of parallel loops in shared 1. Factoring p, the number of processors, we generate memory systems has been studied in a wide range (e.g, [13], all the prime factors of p in decreasing order. [17], [25]). In [25], we proposed an adaptive algorithm to Assume that there are q prime factors: balance workload among multiple processors while ex- ploiting the affinity of parallel tasks to processors, which r  r ÁÁÁr . Initially, the n-dimensional parti- 1 2 q was shown to outperform many existing algorithms for a ~ tioning vector k, stored in k‰1:nŠ,is(1; 1; ÁÁÁ; 1) for wide range of applications. The task scheduling problem n the bin space B 0:L1; 0:L2; ÁÁÁ; 0:Ln†. here is different from the traditional loop scheduling 2. Let index last be a chosen position in k‰1:nŠ where problem, because the tasks are locality dependent. To k‰iŠ > 1 for i < last and k‰iŠˆ1 for i  last. Initially, schedule a set of locality dependent tasks, the scheduler

last ˆ 1. For each prime factor rj where j increases must take advantage of the locality exploited in the task from 1 to q, do the following: reorganization phase as much as possible, while balancing workload to minimize the parallel computing time. Here, a. When (last  n), use the increment rule 2 to we extend our linearly adaptive algorithm proposed in [25] determine whether rj should be put in k‰lastŠ. to address this issue in the runtime system. The extended Based on the ordering rule, the best place to put algorithm is called the Locality-preserved Adaptive Sche- rj must be in k‰1:lastŠ. So, we use increment duling algorithm, denoted as LAS. Because the number of rules to find a better place in k‰1:lastŠ. If so, last processors in the targeted SMP system is in the range of is increased by 1 and go back; otherwise, use the small scale to medium scale, the linearly adaptive algorithm increment rule 1 to put r together with k‰last j is aggressive enough to reduce synchronization overhead. 1Š or k‰last 2Š, then reorder k‰1:last 1Š in Initially, the ith task group chain in the TCL list is decreasing order and go back. considered as the local task chain for processor i, for i ˆ b. Otherwise, use the increment rule 1 to put r j 1; 2; ÁÁÁ;p (p is the total number of processors). This initial together with k‰last 1Š or k‰last 2Š,then allocation maintains the minimized data sharing among reorder k‰1:last 1Š in decreasing order and go back. processors. The remaining number of tasks in the local chain of processor i is tracked by the corresponding TCL The above algorithm has a computational complexity p counter variable, denoted as Count , which is used in the O n ‡ p†. After the determination of a partitioning vector, i LAS algorithm to estimate load imbalance, just like what the the bin space is partitioned into multiple independent spaces that are further reconstructed in an n ‡ 1†-dimen- processor speed variables do in [25]. However, the TCL sional space. An example of this procedure is shown in counter variables are different from the processor speed Fig. 7, where the bin space produced in Fig. 6 is partitioned variables. A processor speed variable records the number of by vector (2; 2). The partitions in Fig. 7a are first tasks that have been finished in the corresponding transformed into four independent spaces in Fig. 7b, which processor, which can precisely estimate how many tasks are further transformed into a 3D space shown in Fig. 7c. remain in the processor because the tasks are evenly The 3D space in Fig. 7c is implemented as a 3D hash table partitioned among processors initially. It does not work in where task bins in each partition are chained together to be the runtime system, because the task chains in the TCL list pointed by a record in a Task Control Linked (TCL) list. The may contain imbalanced numbers of tasks. In addition, each hashing of tasks into the hash table is performed by the processor has a chunking control variable of initial value of space transformation functions. For detailed information of p, denoted Ki for processor i, to determine how many tasks these functions, the interested readers may refer to [24]. to be executed at each scheduling step. 364 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 4, APRIL 2000

Fig. 7. Partitioning: the bin space is evenly divided into 4 partitions from X and Y dimensions. (a) Indexing of partitions. (b) Independent address space of each partition. (c) 3D internal representation of the memory access space.

Xp The LAS algorithm still works in two phases: the local light if Counti < Countj=p 3† scheduling phase and the global scheduling phase. How- jˆ1 ever, it considers a task group as a minimal schedulable normal otherwise: 4† unit, never breaking up the tasks in a group. The LAS P Here, is d p Count =p†= 2p†e, which decreases with algorithm can take good advantage of locality-oriented task jˆ1 j execution to control the load distribution more closely. assignments while achieving a good load balance because During the above computation, if the number of the tasks are usually fine-grained. All the processors start at remaining tasks in one processor's local chain is found to the local scheduling phase. The algorithm is described for be 0, i.e., 9j2‰1;pŠ Countj ˆ 0†, processor i sets its chunking control variable, K ,top, then goes to the global scheduling processor i i ˆ 1; 2; ÁÁÁ;p† as follows. i phase. Otherwise, processor i linearly adjusts its chunking Local scheduling. Processor i first calculates its load status control variable according to its load status as follows: relative to the other processors as follows: 8 <> maxfp=2;Ki 1g if its load is light Xp Ki ˆ minf2p; Ki ‡ 1g if its load is heavy 1† heavy if Counti > Countj=p ‡ 2† :> jˆ1 Ki otherwise: YAN ET AL.: CACHEMINER: A RUNTIME APPROACH TO EXPLOIT CACHE LOCALITY ON SMP 365

The varying range ‰p=2; 2pŠ for the chunking control to have a larger number of parameters than that of variables has been shown to be safe for balancing the load the accessed data arrays. However, m cannot be [13], [17], [25]. Then processor i gets the following number smaller than n, which is easily achieved in program- of task groups from its local chain: ming by using dummy parameters in a task function. The order of hints here must be the same dCounti= Ki à gs†e; as that of those hints presented in cacheminer_- init, i.e., a , s , and b are hints on the same array where gs is the size of a task group. Having finished the j j j allocated tasks, processor i goes back to repeat the local for j ˆ 1; 2; ÁÁÁ;n. scheduling. 3. void task_run(int repeat)This function starts the runtime system to execute the tasks in the hash Global scheduling. First, processor i always gets structure in parallel. If the tasks are going to execute Count = K  g † groups of tasks from its local task i i s at second time, the variable repeat is set to 1 so that chain to execute until its local task chain is empty. Then, the runtime system can keep the hash structure in processor i gets 1= K  g † groups of the remaining i s order to eliminate the overhead of rebuilding it. In tasks in the local task chain of the most heavily loaded this situation, the runtime system exploits cache processor until all the processors empty their local task locality by using existing partitions of tasks. Other- chains. wise, the variable repeat is set to 0. In the local scheduling phase of the LAS algorithm, tasks are executed in a bin-by-bin order. Only when a processor is 2.6 Classifications of Application Programs trying to help some processors in the global phase are tasks In order to reflect all types of applications while not getting in a bin allocated by groups to execute remotely. In the into exhaustive investigation which is not necessary, we global scheduling phase, emphasis is put on load balancing. classify applications based on three factors: computation 2.5 The Application Programming Interface (API) pattern, memory-access pattern, and data-dependence The interface functions are mainly used to provide applica- pattern. Computation patterns can be classified as two tion-dependent hints for the run-time system. The current types: regular, where the computation tasks of an applica- system is implemented in C language. The API of the tion are naturally balanced, and irregular, where the system is simple, and consists of the following three computation tasks of an application are not naturally functions: balanced. Furthermore, memory-access patterns and data- dependence patterns can be respectively classified into 1. void cacheminer_init(int csize, float f, static patterns that can be determined at compile-time, and int p, int n, long s1, void à b1, ÁÁÁ, long sn, void dynamic patterns which can not be determined at compile- à bn). This function provides the following types of time. Intuitively, based on these patterns, applications can hints: be classified into eight possible types. However, the . Cache size csize gives the size in Kbytes of the computation pattern, the memory-access pattern, and the secondary cache on each processor. data-dependence pattern interact one another. Usually, . Task control parameter f is a floating point when an application has a dynamic memory-access pattern number in 0; 1Š, giving the usage percentage of or a dynamic data-dependence pattern, it has an irregular a cache to cluster tasks. 
computation pattern. In addition, the memory-access . p is the number of processors. pattern of an application is affected by its data-dependence . n is the total number of arrays, on which hints pattern. When the data-dependence pattern can not be are provided. determined at compile-time, its memory-access pattern . Hints on arrays are given in pairs. Each pair of sj must not be determined, either. Considering these effects, and bj (j ˆ 1; 2; ÁÁÁ;n†, give respectively the size applications finally fall into the following four types, listed and the starting physical addresses of a refer- in increasing difficulty degree for locality optimization. enced array by tasks. All the arrays are arranged in a user-defined order. (There is no specific . Type 1. Applications with regular computation requirement on the array order.) patterns, static memory-access patterns, and static Based on the information provided by this data dependence. function, the runtime system builds a n ‡ 1†- . Type 2. Applications with irregular computation dimensional hash structure. patterns, static memory-access patterns, and static data dependence. 2. task_create(void (à fun), int m, int t1, ÁÁÁ, . Type 3. Applications with irregular computation int tm, void Ãa1, ÁÁÁ, void Ãam). This function creates a task with its computing function, denoted patterns, dynamic memory-access patterns, and static data dependence. as void fun (t1, t2; ...;tm), and carries hints . Type 4. Applications with irregular computation a1;a2; :::; am on the access pattern of the task into patterns, dynamic memory-access patterns, and the runtime system. Here ai (i ˆ 1, ÁÁÁ, m) is the starting access address of the task on the ith array. dynamic data dependence. If the number of hints, m, is larger than the Based on the above classification, we choose each number of hinted data arrays, n, only the first n hints benchmark from the first three application types that fit are used. This flexibility would allow a task function into our programming model. For the most difficult 366 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 4, APRIL 2000

Fig. 8. Dense matrix multiplication: the sequential program is given on the left and the parallelized version on the runtime system is given on the right.

Fig. 9. Adjoint convolution (AC): the sequential program is given on the left and the parallelized version on the Cacheminer system is given on the right. applications in Type 4, our technique can also be used to Cacheminer system is shown on the right in Fig. 9. Each improve memory performance by combining some data- task accesses respectively a contiguous partion of arrays B dependence detection techniques. Next, we will show how and C. Different tasks have different sizes of work sets so to use the Cacheminer to rewrite ordinary C programs to that the tasks have imbalanced load. Hence, cache locality exploit different types of cache localities in a SMP system. must be traded off with load balance to minimize parallel These applications are used to evaluate the runtime system. execution time. This application is more difficult than the DMM for a compiler to exploit its cache locality. 2.6.1 Programming Examples The third application is Sparse Matrix-Matrix (SMM) The first program is dense matrix multiplication (DMM) multiplication with dense representation, which belongs to with a regular memory access pattern, which belongs to program Type 3. The elements of two source matrices are program Type 1. The sequential program and its paralle- input at runtime. To simplify the presentation, we assume lized version on the runtime system are presented in Fig. 8. that two source M Â M sparse matrices have been stored in The array B is assumed to have been transposed so that the one-dimensional arrays, A and B, respectively. The elements cache locality of the sequential program is optimized. In the in A are stored by rows and the elements in B are stored by parallelized version, the innermost loop is treated as a task columns. Array Arow gives the position in A of the first nonzero element of each row for the first source matrix. function with two parameters, i and j. The two outer loops Array Acol gives the column number of each nonzero are data independent, which create N2 tasks with the same element in A. Similarly, array Bcol gives the position in B of computation load. Each task reads, respectively, a row of A the first nonzero element of each column for the second and B. source matrix. Array Brow gives the row number of each The second application is adjoint convolution (AC), element in B. The sequential sparse matrix multiplication is which belongs to program Type 2. The sequential program described on the left side of Fig. 10. The two outer loops are is given on the left in Fig. 9. The outer loop is data data independent and are parallelized on the runtime independent and the inner loop is data dependent. So, we system as described by the program presented on the right parallelize the outer loop by creating N Â N tasks, each task side of Fig. 10. The created tasks have irregular memory executes the inner loop. The parallelized version on the access patterns so that tasks have an imbalanced workload, YAN ET AL.: CACHEMINER: A RUNTIME APPROACH TO EXPLOIT CACHE LOCALITY ON SMP 367

Fig. 10. Sparse matrix-matrix multiplication: the left side is the sequential program version and the right side is the rewritten version on the runtime system. which is closely related to the input matrices. The elements Replacement misses. Misses caused by reads or writes on of arrays A and B accessed by each task are determined at data that were brought into the cache but replaced by runtime. This application is hard for a compiler to exploit other block data at the most recent time. cache locality. In our experiments, two sparse matrices, A Coherence misses. Misses caused by reads or writes on and B, are set to have 30 percent nonzero elements, which data that were brought into the cache but invalidated by are randomly generated from a uniform distribution. other processors at the most recent time. The total number of the first two types of misses is a 3PERFORMANCE EVALUATION METHOD AND good measure of the data reuse in caches. The last type of ENVIRONMENTS misses evaluate the data sharing degree among processors. The selected benchmark programs are the DMM, the AC, This section introduces our performance evaluation method and the SMM, which are described in Section 2.6.1. Their and evaluation environments. The runtime system was first optimized versions, which exploit locality on the runtime implemented and tested on an event-driven simulator built system, are denoted as DMM_LO, AC_LO, and SMM_LO, on MINT [21]. The preliminary results were reported in respectively. We parallelized the three programs by using [26]. We have recently implemented the system on an well-known compiler optimizations or runtime techniques SimOS simulated SMP, a much more reliable environment as follows: for performance evaluation. We have also measured the performance of the system on two commercial SMP multi- 1. For the dense matrix-matrix multiplication program, processors. we used the sequential block algorithm proposed by Wolf and Lam [23]. This algorithm has been shown 3.1 Evaluation Method and Metrics to effectively exploit cache locality. The locality of The performance evaluation of the runtime system was the block algorithm is further improved by transpos- conducted using simulation and measurement. Simulation ing one matrix so that the innermost loop accesses was used to study the effectiveness of the runtime system in contiguous memory regions on the two arrays. exploiting the cache locality of applications with respect to Based on this transposed block algorithm, a paralle- the changes of the cache miss rate, bus traffics, execution lized algorithm, denoted as DMM_WL presented in time, and cache interference. Measurements were used to Fig. 11, is given by uniformly partitioning computa- further verify the effectiveness of the runtime technique on tions on multiple processors, so that good load given commercial SMP systems. Miss rate, load coefficiency balance is achieved. 2. For the adjoint convolution program, each iteration (i.e., the ratio of deviation to mean), and execution time are of the outermost loop accesses a contiguous segment three metrics used in the performance evaluation. Cache of arrays A and C, respectively. In order to misses are classified as the following three types: investigate the effect of locality optimization, we Compulsory misses. Misses caused by reads or writes on assume that arrays A and C are too large to fill in a data that have never been brought into the cache before. cache. 
Two iterations of the outer loop that have 368 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 4, APRIL 2000

locality. By comparing with this parallel program, denoted AC_BF, we evaluated the quality of our runtime technique which optimizes load balance and exploits cache locality. 3. The SMM program has an irregular memory access pattern that is determined by runtime input data. This type of application is very hard for compiler- based techniques to effectively conduct partition and cache locality optimization. Here we use the linearly adaptive scheduling technique, proposed in [25], to schedule the executions of parallel iterations in the SMM where parallel iterations are initially created as Fig. 11. A well-tuned parallel version of the DMM program published in parallel tasks in multiple task queues. The adaptive [23]. Here, pid is a thread id of value from 0 to p-1. p is the number of runtime scheduling technique has been shown to threads. outperform previous runtime scheduling techniques for imbalanced parallel loops [25]. Hereafter, We closer values of index i have larger overlap between denote this parallelized version as SMM_A. their data sets. From the standpoint of optimizing Although the SMM_A has a similar execution cache locality, the AC program should be paralle- procedure to the SMM_LO, three significant differ- lized by using the blocking technique to chunk the ences are: 1) Initially, the SMM_LO groups and outermost loop. By this, each processor will be partitions tasks with the objective of minimizing allocated with a set of adjacent outermost iterations. data sharing between partitions and maximizing Because the iterations of the outermost loop have data reuse in a partition. The SMM_A just cyclically decreasing workload as index i decreases, a vary- puts tasks in local queues of processors. The ing-sized blocking technique should be used to SMM_LO has higher runtime scheduling overhead optimize both locality and load balance. For any than the SMM_A. 2) Although both the SMM_LO given N and the number of processors, it is difficult and the SMM_A use the linearly adaptive schedul- or impossible to choose a set of different block sizes ing algorithm, the scheduling in the SMM_LO is locality-oriented, which has a better chance to to balance load among processors. Here, we have reduce its number of memory accesses. Compared integrated several compiler optimizations to solve with the SMM_A, the SMM_LO is expected to this problem. At first, we equally split the outermost further reduce execution time by optimizing mem- loop into two loops and reversed the computations ory performance on modern computers. of the second loop, as shown in Fig. 12b. Then, the The last issue is how to select the problem sizes of the second loop was aligned and fused with the first loop to make a new loop with balanced iterations, tested programs. Because our goal is to evaluate the which is illustrated by Fig. 12c and Fig. 12d. The new effectiveness of the runtime system in exploiting cache loop can be equally blocked onto multiple proces- locality, we select the problem sizes based on the under- sors to maintain both load balance and cache lying cache size so that the data set of a program is too large

Fig. 12. A well-tuned parallel version of the AC application. (a) Original AC. (b) Splited AC with reversed execution order in the second loop. (c) Align the second loop with the first loop. (d) Fused AC. YAN ET AL.: CACHEMINER: A RUNTIME APPROACH TO EXPLOIT CACHE LOCALITY ON SMP 369 enough for the cache to hold. In order to shorten simulation TABLE 1 time without losing the confidence of simulation results, we Architectural Parameters of the Simulation selected relatively small problem sizes for the programs. We scaled down the cache size accordingly for these programs. These selections can prevent the advantage of the runtime system from being shadowed by hardware caches, because on a given commercial system, cache capacity is fixed but the problem size of an application varies. 3.2 The SimOS Simulation Environment The SimOS [18] is used to simulate a bus-based cache coherent SMP system. SimOS is a machine simulation environment developed by Stanford University. SimOS simulates the complete hardware of a computer system booting a commercial operating system and running a realistic workload, such as our application programs 509 nanoseconds. The ratio of cache miss time to cache hit supported by the Cacheminer runtime system, on top of time is about 30. it. The simulator contains software simulation of all the Our Sun Hyper-SPARCstation-20 has four hyperSPARC hardware components of a computer system: processors, processors operating at 100MHz. Each processor has a two- memory management units, caches, memory systems, as level cache hierarchy: a 64 Kbyte on-chip cache and a well as I/O devices such as SCSI disks, Ethernets, hardware 256 Kbyte virtual secondary cache where the cache line size clocks, and consoles. The current version of SimOS is 64 bytes. Compared with the S-class, the larger cache line simulates the hardware of MIPS-based multiprocessors to of the Hyper-SPARCstation-20 exploits better spatial lo- run Silicon Graphics' IRIX operating system. The cache cality for applications. The cache hit time is about 300 coherence protocol of the simulated SMP is write- nanoseconds. A cache miss time is about 13,360 nanose- invalidate. conds. The ratio of cache miss time to cache hit time is about The simulated SMP is very similar to popular commer- 36. Cache coherence is maintained by the well known bus- cial SMPs. The simulated SMP consists of a number of 200 based snooping protocol. MHz R10000 processors, each of which has 8 KB instruction Compared with HP/Convex S-class with respect to cache, 8 KB data cache, and a unified 64 KB L2 cache. The instructions issuing rate and memory access latency, the size of the memory is set to 64 MB. The access times to L1, Sun Hyper-SPARCstation-20 is much slower. In measure- L2 caches and the shared-memory are 2, 7, and 100 cycles, ment, we focused on the comparisons of relative perfor- respectively. As the memory size is scaled down eight mance results. times, so is the workload. This has significantly reduced the simulation time. 4PERFORMANCE EVALUATION 3.3 HP/Convex S-class and Sun Hyper-SPARC 4.1 Simulation Results Station-20 The architectural parameters of the simulation are shown in The HP/Convex S-class [3] is a crossbar-based cache Table 1. The selected array sizes of programs DMM, AC, coherent SMP system with 16 processors, while the Sun and SMM are 256 Â 256, 16,384, and 512 Â 512, respectively, Hyper-SPRACstation-20 is a bus-based cache coherent SMP which correspond to working set sizes 1,536 Kbytes, system with four processors. 
The architectural differences 384 Kbytes, and about 1,350 Kbytes, respectively. The cache of these two SMP systems provide the runtime system with block size is 32 bytes. different opportunities/challenges to improve the perfor- mance of applications. 4.1.1 Cache Performance The HP/Convex S-class has 16 PA8000 processors of 720 Table 2 comparatively presents the cache performance of peak MFLOPS. A PA8000 is a four-way super-scalar RISC the parallel programs optimized by different techniques. processor, supporting 64-bit virtual address space, which Regarding regular program DMM, the locality optimized operates at 180MHz. A PA8000 processor has a single level parallel version (DMM_LO) using the runtime technique is primary cache with separate instruction cache and data 10 percent higher than the well-tuned version (DMM_WL) cache of size 1 Mbytes each. The cache is direct-mapped in the number of cache misses. Both versions had similar using a write back policy, which has cache line size of 32 performance with respect to their compulsory misses and bytes and cache hit time of three cycles (about 16.7 invalidations. As the number of processors increases, the nanoseconds). Cache coherence is maintained by a dis- numbers of compulsory misses and invalidations in both tributed directory-based hardware cache coherent protocol. optimized versions increase, but cache replacement perfor- The S-class has a pipelined, 32-way interleaved shared mance is also improved. The former is due to the increase in memory of eight memory boards, which is interconnected sharing degree between caches. The latter is due to the use with processors by a 8 Â 8 nonblocking crossbar. The data of more caches. This is consistent with previous research path from the crossbar to the memory controller is 64-bits work. The measured speedup for SMM_A is 1.81, 3.04, and wide and operates at 120 MHz. The access latency of a 4.67 on two, four, and eight processors, respectively, 32-byte cache line from the shared memory to a cache is compared with the sequential time of DMM_WL. 370 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 4, APRIL 2000

TABLE 2 Cache Performance Comparisons

misses, comp., and inv., respectively, give the numbers of cache misses, compulsory misses, and invalidations. K = 10^3 and M = 10^6. DMM_LO was measured with shrinking factor f = 1.

The memory access pattern of the AC program is not as regular as that of the DMM program. The AC_LO, a locality-optimized version using the runtime technique, is shown to achieve moderately better cache performance than the AC_BF, a well-tuned version. The numbers of replacement misses were improved by the runtime technique. The cache performance of both parallel versions does not show a significant change when the number of processors is increased. This is different from the execution of the DMM program. The measured speedup of AC_LO is 1.99, 3.94, and 7.73 on two, four, and eight processors, respectively.

Regarding program SMM, the runtime locality technique is shown to be highly effective in reducing cache misses. This reduction is mainly contributed by a significant reduction in replacement cache misses. Both parallel versions present similar invalidation performance. These results show the great potential of optimizing cache locality using runtime techniques for applications with dynamic memory access patterns. The measured speedup of SMM_LO is 1.88, 3.33, and 5.00 on two, four, and eight processors, respectively.

4.1.2 Execution Performance and Bus Traffic
In Table 3, the execution performance of the different parallel versions is presented. The overall performance of a program was measured by its execution time. The performance differences between the parallel versions can be explained by the differences in bus traffic and load balance quality.

Regarding program DMM, the DMM_WL slightly outperformed the DMM_LO. This is mainly because the DMM_LO had worse load balance and longer data transfer time. Although the parallel iterations were perfectly partitioned among multiple processors in the DMM_WL, slight load imbalance was also observed. Regarding program AC, the AC_LO outperformed the AC_BF by 5 percent to 7 percent in terms of execution time. This improvement was also mainly contributed by certain reductions in load traffic. The AC_LO had worse load balance than the AC_BF because the AC_LO was trying to balance load based on pre-grouped tasks. However, this imbalance does not impact the overall performance significantly, which also shows that locality optimization is more important in this case. Regarding program SMM, the SMM_LO performed much better than the SMM_A, reducing execution time by 8 to 16 percent using the runtime technique.

4.1.3 Runtime Overhead
Runtime overhead is another important factor which affects the effectiveness of a runtime technique. In our proposed runtime technique, runtime overhead is mainly caused by task organization and task runtime scheduling. The task organization overhead is affected by the number of tasks created at runtime and the number of arrays accessed. The runtime scheduling overhead is affected by the imbalance in the initial task partition and in the runtime executions of multiple processors. Table 4 gives the percentage of the runtime overhead in the total execution time. For both the DMM_LO and the SMM_LO, the runtime overhead had a bigger influence on execution performance than for the AC_LO. This difference is mainly due to the difference in the computation granularities of tasks: the tasks in the AC_LO had the largest computation granularity, and the tasks in the SMM_LO had the smallest.
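The overhead percentages reported in Table 4 (and later in Table 5) amount to timing the task-organization phase separately from the rest of the run. The fragment below is a minimal sketch of such an instrumentation; cm_organize_tasks and cm_run_tasks are hypothetical placeholders standing in for the actual Cacheminer calls, their stub bodies only mark where the real work would go, and the scheduling part of the overhead, which is interleaved with execution, would need additional instrumentation inside the run phase.

    #include <stdio.h>
    #include <sys/time.h>

    /* Hypothetical placeholders for the Cacheminer runtime phases. */
    static void cm_organize_tasks(void) { /* build the task memory-access space and group tasks into bins */ }
    static void cm_run_tasks(void)      { /* locality-preserving scheduling and parallel execution */ }

    static double now_sec(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
    }

    int main(void)
    {
        double t0 = now_sec();
        cm_organize_tasks();              /* task-organization overhead */
        double t1 = now_sec();
        cm_run_tasks();                   /* useful parallel work */
        double t2 = now_sec();

        double total = t2 - t0;
        if (total > 0.0)
            printf("runtime overhead: %.1f%% of total execution time\n",
                   100.0 * (t1 - t0) / total);
        return 0;
    }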

TABLE 3 Execution Performance and Bus Traffic

All the timing results are given in simulation seconds; M = 10^6. Load balance was measured by the ratio of the time deviation to the mean time. The locality-optimized programs using our runtime approach use shrinking factor f = 1.
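The load-balance metric quoted in this and the later table footnotes can be written out explicitly. Assuming T_1, ..., T_P are the measured per-processor execution times (notation introduced here only for illustration) and that "deviation" denotes their standard deviation, the reported balance figure is

\[
\mathrm{balance} \;=\; \frac{\sigma(T)}{\bar{T}}, \qquad
\bar{T} = \frac{1}{P}\sum_{i=1}^{P} T_i, \qquad
\sigma(T) = \sqrt{\frac{1}{P}\sum_{i=1}^{P}\bigl(T_i-\bar{T}\bigr)^{2}},
\]

so smaller values indicate better load balance.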

4.2 Measurements
4.2.1 Measurements on HP/Convex S-class
Measurement results of the different parallel versions on the HP/Convex S-class are presented in Table 5. Regarding program DMM, the DMM_WL consistently performed a little better than the DMM_LO; the better load balance in the DMM_WL is one reason for this. For program AC, the AC_LO performed significantly better than the AC_BF on two processors. When more processors were applied, the execution times were close. The AC_BF always balanced loads better due to its perfect initial partition, but the load imbalance that occurred in the AC_LO was not larger than 1 percent. For the SMM, the SMM_LO achieved a significant performance improvement over the SMM_A. This further confirms the effectiveness of the runtime technique in improving the performance of applications with dynamic memory access patterns. However, the SMM_A still achieved better load balance than the SMM_LO. One reason for this is that the SMM_LO used a locality-preserving scheduling algorithm, which tried to keep the tasks in a group executing together on a processor. This can increase data reuse in a cache, but it also tended to cause more imbalance.

TABLE 4 Runtime Overhead in Percentage of Total Execution Time

Table 5 also gives the runtime overhead of the task reorganization. Among all the applications, the SMM_LO had the largest runtime overhead in terms of the percentage of the total execution time, and the AC_LO had the lowest. This is consistent with the simulation results. As mentioned before, this is mainly affected by the task granularity.

Regarding the effect of different values of f on performance, Table 6 presents the measurement results. For the DMM_LO, the execution time decreased as f decreased, resulting in groups with a smaller number of tasks. The AC_LO is not sensitive to the change of f, which is consistent with our simulation results. The SMM_LO had a longer execution time when a smaller f was used.

4.2.2 Measurements on HyperSPARC Station-20
Table 7 gives the execution times of the parallel versions on the HyperSPARC station-20, a much slower multiprocessor workstation than the S-class. The DMM_LO still achieved performance close to that of the DMM_WL, no worse than 9 percent in execution time. The runtime overhead in the DMM_LO was about 10 percent of its execution time. For program AC, the AC_LO outperformed the AC_BF by 8.5 percent in execution time reduction, although it had worse load balance. Compared with the SMM_A, the SMM_LO reduced execution time by up to 40 percent. These measurements are consistent with those on the S-class, although the absolute performance results are different.

The effects of different values of f are presented in Table 8. The DMM_LO, the AC_LO, and the SMM_LO achieved the best performance, respectively, at f = 0.25, f = 0.5, and f = 1.

5 CONCLUSION
The locality of a program is affected by a wide range of performance factors, and the design of efficient locality-optimization techniques relies on an insightful understanding of these performance factors.

TABLE 5 Execution Time (in Seconds) Based Comparison on HP/Convex S-class

TABLE 7 Execution Time (in Seconds) Based Comparison on HyperSPARC Station-20

Columns time and overhead, respectively, give the total execution time and the task organization overhead in seconds. Balance presents the load balance measurement in terms of the ratio of the time deviation to the mean time. (f = 1.)

This paper presents a runtime approach to exploit cache locality on SMP multiprocessors. We have built a runtime library including the following three functionalities:

1. Information acquisition, which collects information on the cache-access pattern of a program. The information on the data-access sequence of a program is essential for locality optimization. Higher precision in information acquisition is achieved at the cost of higher runtime overhead.
2. Optimization, which reorganizes the data layout and execution sequences of a program to maximize data reuse in caches and to minimize data sharing among caches.
3. Integration, which trades off locality with other performance factors to improve overall performance when the tasks are scheduled on an SMP.

Our locality optimization technique targets applications with dynamic memory-access patterns. We have shown that the multidimensional internal structure is effective for integrating both static and dynamic hints. We have shown that the runtime overhead is acceptable; in most of our test cases, it is not larger than 10 percent of the total execution time of a program.

We have also shown that there is good potential for the runtime locality optimization technique to improve the performance of application programs with irregular computation and dynamic memory access patterns. The runtime technique reduces the number of memory accesses to alleviate the increasing demand on memory-bus bandwidth. In comparison with a regular application which was well optimized by compiler-based techniques, we have shown that the runtime optimizations could perform competitively as well. Our run-time system was implemented as a set of simple and portable library functions; it can be conveniently used on commercial SMPs. The run-time system is not aimed at replacing compiler-based techniques, but at complementing a compiler to optimize those applications that are beyond its optimization capability.
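To make the intended usage concrete, the following is a minimal sketch of how a parallel loop might be run through a hint-driven runtime library of this kind. All identifiers (cm_init, cm_add_hint, cm_run) are hypothetical placeholders chosen for illustration and are not the actual Cacheminer interface; the stub bodies simply run the loop sequentially so that the sketch is self-contained.

    #include <stddef.h>

    /* Hypothetical runtime-library interface; illustrative names only. */
    typedef void (*cm_task)(long iter, void *arg);   /* one parallel iteration */

    static void cm_init(int nprocs, size_t cache_bytes)
    { (void)nprocs; (void)cache_bytes; }             /* would record architecture hints */
    static void cm_add_hint(int array_id, size_t stride, size_t length)
    { (void)array_id; (void)stride; (void)length; }  /* would record an application access hint */
    static void cm_run(cm_task body, long niters, void *arg)
    { for (long i = 0; i < niters; i++) body(i, arg); }  /* would group tasks by locality, then schedule them */

    /* Example parallel loop body accessing two arrays. */
    struct args { const double *a; double *b; };

    static void loop_body(long i, void *p)
    {
        struct args *ap = (struct args *)p;
        ap->b[i] = 2.0 * ap->a[i];                   /* the per-iteration work */
    }

    void run_example(const double *a, double *b, long n)
    {
        struct args ap = { a, b };
        cm_init(4, 64 * 1024);                       /* 4 processors, 64 KB cache: example values */
        cm_add_hint(0, sizeof(double), (size_t)n);   /* access pattern of array a */
        cm_add_hint(1, sizeof(double), (size_t)n);   /* access pattern of array b */
        cm_run(loop_body, n, &ap);
    }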

TABLE 6 The Effect of Different Values of f on Execution Time (in Seconds) for the DMM_LO and the AC_LO on Four Processors of HP/Convex S-Class

TABLE 8 The Effect of Different Values of f on Execution Time (in Seconds) for the DMM_LO and the AC_LO on Four Processors of HyperSPARC Station-20

Our runtime technique has the following limits:

• We assume that the array data elements are contiguously allocated. Many operating systems attempt to do so, and our experiments using SimOS with SGI's IRIX operating system also show consistent results. However, in rare cases where this assumption does not hold, a task may access an array in different memory regions, and the estimation of our method would not be accurate.
• When the task memory-access space is reduced to the bin-space by the cache capacity, a uniform scalar is used in each dimension for the reduction. However, a task may access different arrays of different sizes. The reduction by such a uniform scalar is the cheapest way to predict, but it may not fully use the cache capacity in some cases.
• In some loop structures, a parallel iteration may not contiguously access an array. In order to fully utilize the cache, the loop iteration may need to be split further into smaller iterations so that each iteration accesses a contiguous region in each array.

The runtime technique only takes into consideration nested loops without data dependence. In the program behavior classification given in Section 2.6, Type 4 has irregular computational patterns, dynamic memory patterns, and dynamic data-dependence patterns. This type of program is the most difficult for locality optimization, because both the data dependence and the locality optimization must be resolved at runtime. We are developing new methods to address this problem by combining our runtime technique with existing methods on data dependence recognition.

ACKNOWLEDGMENTS
This work is supported in part by the U.S. National Science Foundation under grants CCR-9400719, CCR-9812187, and EIA-9977030, by the Air Force Office of Scientific Research under grant AFOSR-95-1-0215, and under grant EDUE-NAFO-980405. We thank Craig Douglas for sending us his thread library. We are grateful to Greg Astfalk for his constructive suggestions and help in using the HP/Convex S-class. Dr. Neal Wagner carefully read the manuscript and made constructive comments. Finally, we appreciate the insightful comments and critiques from the anonymous referees, which helped improve the quality and readability of the paper.

REFERENCES
[1] J.M. Anderson, S.P. Amarasinghe, and M.S. Lam, "Data and Computation Transformations for Multiprocessors," Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 166-178, July 1995.
[2] J.M. Anderson and M.S. Lam, "Global Optimizations for Parallelism and Locality on Scalable Parallel Machines," Proc. ACM SIGPLAN '93 Conf. Programming Language Design and Implementation, pp. 112-125, June 1993.
[3] G. Astfalk and T. Brewer, "An Overview of the HP/Convex Exemplar Hardware," technical report, Hewlett-Packard, Mar. 1997.
[4] R. Chandra, A. Gupta, and J.L. Hennessy, "Data Locality and Load Balancing in COOL," Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 249-259, May 1993.
[5] A. Charlesworth, N. Aneshansley, M. Haakmeester, D. Drogichen, G. Gilbert, R. Williams, and A. Phelps, "The Starfire SMP Interconnect," Proc. Supercomputing '97, Nov. 1997.
[6] S. Coleman and K.S. McKinley, "Tile Size Selection Using Cache Organization and Data Layout," Proc. SIGPLAN '95 Conf. Programming Language Design and Implementation, pp. 279-289, June 1995.
[7] M. Galles and E. Williams, "Performance Optimizations, Implementation, and Verification of the SGI Challenge Multiprocessor," Proc. 27th Hawaii Int'l Conf. System Sciences, vol. 1, pp. 134-143, Jan. 1994.
[8] T.E. Jeremiassen and S.J. Eggers, "Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations," Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 179-188, July 1995.
[9] I. Kodukula, N. Ahmed, and K. Pingali, "Data-Centric Multi-Level Blocking," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 346-357, May 1997.
[10] A.R. Lebeck and D.A. Wood, "Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors," Proc. 22nd Annual Int'l Symp. Computer Architecture, pp. 48-59, 1995.
[11] M.S. Lam, E.E. Rothberg, and M.E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms," Proc. ASPLOS '91, pp. 63-74, Apr. 1991.
[12] H. Li, S. Tandri, M. Stumm, and K.C. Sevcik, "Locality and Loop Scheduling on NUMA Multiprocessors," Proc. Int'l Conf. Parallel Processing, vol. II, pp. 140-144, 1993.
[13] E.P. Markatos and T.J. LeBlanc, "Using Processor Affinity in Loop Scheduling Scheme on Shared-Memory Multiprocessors," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 4, pp. 379-400, Apr. 1994.
[14] K.S. McKinley, S. Carr, and C.W. Tseng, "Improving Data Locality with Loop Transformations," ACM Trans. Programming Languages and Systems, vol. 18, no. 4, pp. 424-453, July 1996.
[15] K.S. McKinley and O. Temam, "A Quantitative Analysis of Loop Nest Locality," Proc. ASPLOS '96, pp. 94-104, Oct. 1996.
[16] J. Philbin, J. Edler, O.J. Anshus, C.C. Douglas, and K. Li, "Thread Scheduling for Cache Locality," Proc. ASPLOS '96, pp. 60-71, Oct. 1996.
[17] C. Polychronopoulos and D. Kuck, "Guided Self-Scheduling: A Practical Self-Scheduling Scheme for Parallel Supercomputers," IEEE Trans. Computers, vol. 36, no. 12, pp. 1,425-1,439, Dec. 1987.
[18] M. Rosenblum, E. Bugnion, S. Devine, and S.A. Herrod, "Using the SimOS Machine Simulator to Study Complex Computer Systems," ACM Trans. Modeling and Computer Simulation, vol. 7, no. 1, pp. 78-103, 1997.
[19] M.B. Steinman, G.J. Harris, A. Kocev, V.C. Lamere, and R.D. Pannell, "The AlphaServer 4100 Cached Processor Module Architecture and Design," Digital Technical J., vol. 8, no. 4, pp. 21-37, 1996.
[20] J. Torrellas, M.S. Lam, and J.L. Hennessy, "False Sharing and Spatial Locality in Multiprocessor Caches," IEEE Trans. Computers, vol. 43, no. 6, pp. 651-663, June 1994.
[21] J.E. Veenstra and R.J. Fowler, "MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors," Proc. MASCOTS '94, pp. 201-207, Jan. 1994.
[22] M. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[23] M.E. Wolf and M. Lam, "A Data Locality Optimizing Algorithm," Proc. ACM SIGPLAN '91 Conf. Programming Language Design and Implementation, pp. 30-44, June 1991.
[24] Y. Yan, "Exploiting Cache Locality on Symmetric Multiprocessors: A Run-Time Approach," PhD dissertation, Dept. of Computer Science, College of William and Mary, June 1998.
[25] Y. Yan, C.M. Jin, and X. Zhang, "Adaptively Scheduling Parallel Loops in Distributed Shared-Memory Systems," IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 1, pp. 70-81, Jan. 1997.
[26] Y. Yan, X. Zhang, and Z. Zhang, "A Memory-Layout Oriented Runtime Technique for Locality Optimization," Proc. 1998 Int'l Conf. Parallel Processing (ICPP '98), pp. 189-196, Aug. 1998.

Yong Yan received the BS and MS degrees in computer science from Huazhong University of Science and Technology, China, in 1984 and 1987, respectively, and his PhD in computer science from the College of William and Mary in 1998. He is currently a member of the technical staff of the Computer Systems Labs in the Hewlett Packard Laboratories. He was a system architect at HAL Computer Systems Inc. from 1998 to 1999. His current research objectives and interests are to build effective multiprocessor servers for both scientific and commercial workloads and to provide efficient system support for those servers. He has published extensively in the areas of parallel and distributed computing, computer architecture, performance evaluation, operating systems, and algorithm analysis. He is a member of the IEEE and ACM.

Zhao Zhang received his BS and MS degrees in computer science from Huazhong University of Science and Technology, China, in 1991 and 1994, respectively. He is a PhD candidate in computer science at the College of William and Mary. His research interests are computer architecture and parallel systems. He is a member of the IEEE and ACM.

Xiaodong Zhang received the BS in electrical engineering from Beijing Polytechnic University, China, in 1982, and the MS and PhD degrees in computer science from the University of Colorado at Boulder in 1985 and 1989, respectively. He is a professor of computer science at the College of William and Mary. His research interests are parallel and distributed systems, computer system performance evaluation, and scientific computing. He is an associate editor of the IEEE Transactions on Parallel and Distributed Systems and has chaired the IEEE Computer Society Technical Committee on Supercomputing Applications. He is a senior member of the IEEE.