Detailed Coherence Characterization for OpenMP Benchmarks

Anita Nagarajan, Jaydeep Marathe, Frank Mueller Dept. of Computer Science, North Carolina State University, Raleigh, NC 27695-7534 [email protected], phone: (919) 515-7889

Abstract

Past work on studying cache coherence in shared-memory symmetric multiprocessors (SMPs) concentrates on studying aggregate events, often from an architecture point of view. However, this approach provides insufficient information about the exact sources of inefficiencies in parallel applications. For SMPs in contemporary clusters, application performance is impacted by the pattern of shared memory usage, and it becomes essential to understand coherence behavior in terms of the application program constructs — such as data structures and source code lines. The technical contributions of this work are as follows. We introduce ccSIM, a cache-coherent memory simulator fed by data traces obtained through on-the-fly dynamic binary rewriting of OpenMP benchmarks executing on a Power3 SMP node. We explore the degrees of freedom in interleaving data traces from the different processors and assess the simulation accuracy by comparing with hardware performance counters. The novelty of ccSIM lies in its ability to relate coherence traffic — specifically coherence misses as well as their progenitor invalidations — to data structures and to their reference locations in the source program, thereby facilitating the detection of inefficiencies. Our experiments demonstrate that (a) cache coherence traffic is simulated accurately for SPMD programming styles as its invalidation traffic closely matches the corresponding hardware performance counters, (b) we derive detailed coherence information indicating the location of invalidations in the application code, i.e., source line and data structures, and (c) we illustrate opportunities for optimizations from these details. By exploiting these unique features of ccSIM, we were able to identify and locate opportunities for program transformations, including interactions with OpenMP constructs, resulting in both significantly decreased coherence misses and savings of up to 73% in wall-clock execution time for several real-world benchmarks.

This work was supported in part by NSF CAREER grant CCR-0237570.

1. Introduction

Prior work on cache coherence concentrates on two areas: simulation and performance tuning. Many architectural and system simulators support different coherence models in their implementation (e.g., [6], [13], [26], [24], [2], [7]), and they operate at different abstraction levels ranging from cycle accuracy over instruction level to the operating system interface. On the performance tuning end, work mostly concentrates on program analysis to derive optimized code (e.g., [15], [27]). Recent support for performance counters opens new opportunities to study the effect of applications on architectures, with the potential to complement them with per-reference statistics obtained by simulation.

In this paper, we concentrate on cache coherence simulation without cycle accuracy or even instruction-level simulation. We constrain ourselves to an SPMD programming paradigm on dedicated SMPs. Specifically, we assume the absence of workload sharing, i.e., only one application runs on a node, and we enforce a one-to-one mapping between threads and processors. These assumptions are common for high-performance scientific computing [29], [30].

We make the following contributions in the paper. We have designed and implemented ccSIM, a cache-coherent simulator. We demonstrate good correlation between ccSIM results and hardware performance counters for a 4-way Power3 SMP node on a variety of OpenMP benchmarks. We obtain address traces per processor through dynamic binary rewriting. We demonstrate that ccSIM obtains detailed information indicating causes of invalidations and relates these events to their program location and data structures. This enables us to detect coherence bottlenecks and allows us to infer opportunities for optimizations.

The paper is structured as follows. We introduce the instrumentation framework METRIC that is utilized to obtain data traces. We then provide details on the design and implementation of ccSIM, the cache coherence simulator. Next, we describe the experimental methodology followed by simulation results as well as performance counter measurements. From these results, we derive opportunities for code transformations and assess their benefits. We conclude with related work and a summary of our contributions.

2. Framework Overview

Figure 1 depicts our simulation framework. Simulation of coherence traffic is based on two software components: a binary instrumentation tool to extract data traces from actual application executions and a cache simulator that consumes these traces to simulate coherence traffic.

[Fig. 1. Overview of Framework: the instrumented application binary (METRIC) writes one compressed trace file per thread; each trace file is decompressed by an expander and fed into a uniprocessor cache model producing per-processor statistics; the uniprocessor caches are connected by a shared bus inside ccSIM.]

The instrumentation tool uses dynamic binary rewriting to instrument a running application. Memory accesses are instrumented to emit address reference information, including the relation of a reference to its source line in the program [20]. This information is compressed on-the-fly to conserve space before it is written to storage. Notice that the instrumentation does not affect the trace data, i.e., we collect the original trace addresses (even though performance is perturbed by the instrumentation). Later comparisons with performance counters are based on the uninstrumented application to ensure that the same work is being performed and instrumentation is excluded in the measurements.

The trace of each thread comprises the input to a uniprocessor cache hierarchy. Each cache hierarchy (corresponding to a separate processor) emits coherence messages on a shared bus according to the selected coherence protocol. Events, such as invalidation messages and cache misses, are logged in the simulator and associated with source data structures as well as instruction locations in source files as they occur. This allows the generation of detailed statistics on a per-reference basis with regard to invalidations, resulting misses and data subsequently evicted from cache.

3. Instrumentation and Trace Generation

Cache simulation for SMPs is based on address traces collected by METRIC, our framework for dynamic binary instrumentation of memory references [20]. METRIC inserts probes through a control program into an executing application to generate highly compressed address traces. The process of dynamic binary rewriting operates as follows. A control program instruments the target OpenMP application executable using our customized extensions of the DynInst binary rewriting API [4]. These customizations, part of the METRIC framework, have been further extended to capture traces of OpenMP threads for this work. For each OpenMP thread, the memory access points (i.e., the loads and stores) are instrumented to capture the application access trace. To reduce the overhead on target execution, METRIC can trade off simulation accuracy for tracing speed by instrumenting only floating point or integer accesses. It also allows certain accesses, such as local stack accesses, to be ignored, since they often do not perceptibly affect the overall access metrics of the target program.

Once the instrumentation is complete, the target is allowed to continue. The trace logging of the access stream of each OpenMP thread proceeds in parallel without interaction with other OpenMP threads, thus increasing the tracing speed. For each thread, the instrumentation code calls handler functions in a shared library. The handler functions compress the generated trace online and write the compressed trace to stable storage.

For each thread, the accesses generated are compressed using the compression primitives described in our previous work [20]. Each compression primitive has a unique sequence id field, which is globally unique for that thread, and anchors the compression primitive in the overall access stream for that thread. OpenMP supports SMP parallelism via compiler directives (#pragma omp or !$OMP). We instrument the compiler-generated functions implementing these directives.
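For illustration, a per-thread trace record and handler might look like the following sketch. The field names, the event kinds and the emit_access handler are hypothetical, not METRIC's actual interface; they only illustrate how a sequence id can anchor each access in its thread's stream and tie it back to an instruction and source line.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical trace record layout -- field names are illustrative only. */
    typedef enum { EV_LOAD, EV_STORE, EV_OMP_ENTRY, EV_OMP_EXIT } event_kind_t;

    typedef struct {
        uint64_t     seq_id;    /* globally unique within this thread's access stream  */
        event_kind_t kind;      /* data access or OpenMP construct entry/exit          */
        uint64_t     addr;      /* effective address of the access                     */
        uint32_t     insn_id;   /* maps the access back to its instruction/source line */
    } trace_event_t;

    /* Per-thread tracing state: each OpenMP thread logs independently. */
    typedef struct {
        FILE    *out;           /* this thread's trace file                */
        uint64_t next_seq;      /* next sequence id to hand out            */
    } thread_trace_t;

    /* Hypothetical handler invoked by the inserted probes. A real handler would
     * feed the record to the on-line compression primitives of [20] rather than
     * writing it out verbatim. */
    static void emit_access(thread_trace_t *t, event_kind_t kind,
                            uint64_t addr, uint32_t insn_id)
    {
        trace_event_t ev = { t->next_seq++, kind, addr, insn_id };
        fwrite(&ev, sizeof ev, 1, t->out);
    }

    int main(void)
    {
        thread_trace_t t = { fopen("trace.thread0", "wb"), 0 };
        if (!t.out) return 1;
        emit_access(&t, EV_STORE, 0x1000, 141);   /* e.g., a store at source line 141 */
        fclose(t.out);
        return 0;
    }

The important property is the per-thread sequence id, which lets each stream be replayed independently while still allowing the simulator to reason about ordering relative to the recorded OpenMP events.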
4. ccSIM: A Multi-Processor Cache Simulator

The compressed access trace generated from the instrumented OpenMP application is used for incremental multiprocessor simulation. We have designed and implemented a memory access simulator for cache-coherent shared-memory multiprocessor systems. The uniprocessor components were derived from MHSim [23]. The ccSIM tool feeds each trace file into a driver object, each of which corresponds to a separate processor (see Figure 1). Such a trace contains entries corresponding to the sequence of events during its execution, e.g., data references as well as OpenMP directives. Each driver feeds an instance of a uniprocessor cache hierarchy. During the simulation, fine-grained statistics on cache metrics and coherence traffic are maintained on a per-reference and per-data-structure basis. These statistics provide important feedback about the application behavior and performance. The causes of bottlenecks on the SMP architecture of interest can be determined from the information obtained.

For each thread, the address of the memory access is mapped to the unique machine instruction location that generated that access. The access address is also mapped to the language-level data structure to which it belongs. These mappings allow us to tag cache access and coherence statistics with higher-level abstractions, such as line numbers and source code data structure identifiers.

4.1 Interleaving of Reference Streams

During application execution, the actual interleaving of data references is non-deterministic between synchronization points. However, at synchronization points, such as barriers, threads are guaranteed to eventually reach a certain point of execution within the program. We study the effect of different reference interleavings by supporting two simulation modes. Recall that each OpenMP thread is assumed to be executing on a separate processor. Hence, every driver object maps to a unique processor on an SMP node.

ccSIM implements the semantics of OpenMP constructs that affect the execution order of threads at synchronization points, i.e., barriers, critical sections, atomic sections and accesses protected by explicit mutex locks (omp_set_lock, omp_unset_lock). Entry and exit events for these constructs are recorded in the trace for each thread. We refer to the program code between two synchronization points in an SPMD model as a region. At the start of a region, the simulator can operate in one of two modes.

Interleaved Mode: The simulator processes one data reference from each trace (corresponding to a thread or processor) before processing the second reference for each trace, and so on. Effectively, the simulator enforces a fine-grained interleaving in a round-robin fashion on a per-reference basis in this mode.

Piped Mode: The simulator processes all data references from one trace up to the next synchronization point before processing data references from the second trace, and so on, effectively enforcing a coarse-grained interleaving at the level of regions.

A comparison of results from the interleaved and piped modes reflects the extent to which program latency is affected by the non-deterministic order of execution of OpenMP threads and may provide extremes (bounds) on metrics for coherence traffic.

Example: Figure 2 shows the trace events and simulator actions for a simple OpenMP program with two active OpenMP threads. A and B are shared arrays of size N, and i is a local variable. Static loop scheduling is assumed for the OpenMP for loop. The entry into the parallel OpenMP region is logged as a trace event and causes the simulator to activate two driver objects. Accesses generated by each OpenMP thread to the A and B arrays are logged separately. The drivers may simulate these accesses in parallel, as shown for the interleaved mode. When an OpenMP thread exits from the implicit barrier at the end of the for loop, a barrier exit event is logged for that thread. Detection of a barrier event causes drivers to synchronize. Another synchronization takes place when the parallel end event is processed. After an OpenMP parallel region, a serial phase starts, and only one driver (corresponding to the master thread) will remain active. All others remain unused till the start of the next parallel phase.

[Fig. 2. Illustration: Trace Events and Simulator Actions: for the loop A[i] = A[i] * B[i], each OpenMP thread logs Parallel Start, the reads and writes of its A and B elements (e.g., A[0], B[0] for thread 0 and A[N/2], B[N/2] for thread 1), Barrier Exit and Parallel End events; the corresponding simulator threads react with Activate, Simulate Accesses, Synchronize and Deactivate actions.]
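To make the difference between the two modes concrete, here is a minimal sketch of the two driver loops. The driver_t type and its helpers are hypothetical stand-ins for ccSIM's internal driver objects, not its actual interface; the sketch only illustrates the order in which trace references are consumed within one region.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical per-processor trace driver: a cursor into one thread's trace,
     * bounded by the next synchronization point (barrier, parallel end, ...). */
    typedef struct { int remaining_in_region; int proc_id; } driver_t;

    static bool driver_has_event(driver_t *d) { return d->remaining_in_region > 0; }

    static void driver_simulate_one(driver_t *d)
    {
        /* A real driver would pull the next trace record and feed it to this
         * processor's cache hierarchy; here we merely consume it. */
        d->remaining_in_region--;
    }

    /* Interleaved mode: round-robin, one reference per processor per turn. */
    static void simulate_region_interleaved(driver_t *drv, size_t nproc)
    {
        bool progress = true;
        while (progress) {
            progress = false;
            for (size_t p = 0; p < nproc; p++)
                if (driver_has_event(&drv[p])) { driver_simulate_one(&drv[p]); progress = true; }
        }
    }

    /* Piped mode: each processor's whole region is simulated before the next one. */
    static void simulate_region_piped(driver_t *drv, size_t nproc)
    {
        for (size_t p = 0; p < nproc; p++)
            while (driver_has_event(&drv[p])) driver_simulate_one(&drv[p]);
    }

    int main(void)
    {
        driver_t drv[4]  = { {5,0}, {7,1}, {6,2}, {5,3} };   /* 4 processors, one region each */
        simulate_region_interleaved(drv, 4);                  /* fine-grained interleaving    */
        driver_t drv2[4] = { {5,0}, {7,1}, {6,2}, {5,3} };
        simulate_region_piped(drv2, 4);                       /* coarse-grained interleaving  */
        puts("both interleavings processed one region per processor");
        return 0;
    }

In both cases the drivers synchronize at the recorded barrier and parallel-end events before the next region starts; only the order in which the shared bus sees the references differs, which is precisely what changes the invalidation counts reported later for NBF.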

4.2 Studying Invalidations and Misses

A key metric for the identification of memory performance bottlenecks in a multiprocessor system is the number of invalidations to lines in the lowermost level of cache of each processor. This is a major source of coherence traffic, potentially causing the shared bus to become a bottleneck in a symmetric multiprocessor architecture. More significantly, these invalidations could lead to coherence misses, thus increasing memory latency. It is these invalidation misses that we focus on in our experiments, since an increasing number of invalidations leading to coherence misses can greatly hamper performance.

Since the main motivation in reducing the invalidation traffic is to decrease the number of coherence misses, it is imperative to distinguish between coherence misses and uniprocessor misses in a processor. Invalidations to cache lines can further be classified as true-sharing invalidations and false-sharing invalidations in each level of cache. True-sharing invalidations arise from accesses to the same shared memory location by more than one processor, with at least one access being a write access. False-sharing invalidations are caused by accesses to different memory locations that map to the same cache line on more than one processor. This level of classification gives a better view of the causes of the invalidations, which helps in determining the applicability of various techniques for optimizations.

With respect to OpenMP parallel programs, another level of classification can be introduced, which is instrumental in determining the feasibility of using certain optimization techniques to reduce the coherence traffic. This involves determining whether the invalidations to cache lines occur due to references across synchronization points or between synchronization points in a parallel program (Figure 3).

[Fig. 3. Classification of Invalidations: a write and a subsequent read of x by different threads in different regions (separated by a sync point) cause an across-region invalidation, whereas a write and read of y within the same region cause an in-region invalidation.]

References across processors leading to true-sharing invalidations within a region can be distinguished as follows:

References not protected by locks: These typically occur in the single-writer, single/multiple-reader scenario where one processor writes to a common location and one or more processors read from it.

References protected by locks: These typically occur in the multiple-writer, single/multiple-reader scenario where multiple processors write and read from a common location.

In addition to these metrics, the simulator also generates per-processor statistics for hits, misses, temporal and spatial locality, and eviction-related information.

For each of the above-mentioned metrics, aggregate numbers for the application help in an overall analysis of the observed performance. A further breakdown of these statistics for each reference or for each data structure in the program provides deeper insight into the behavior of the application. Statistics are computed for each of the globally shared data structures in order to provide information at a greater level of detail and to determine the exact causes of inefficiencies in the memory hierarchy. This enables us to pinpoint the data structures contributing to latency caused by coherence misses. A detailed analysis of the compiled metrics helps in determining the particular choice of optimization techniques for a benchmark.
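To restate the classification operationally, the sketch below shows the decision an invalidation handler could make when a remote write invalidates a locally cached line. All types, the word-granular bookkeeping and the region-id comparison are illustrative assumptions, not ccSIM's implementation.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_WORDS 32                 /* assumed 8-byte words per cache line */

    typedef struct {
        uint64_t tag;
        bool     touched[LINE_WORDS];     /* words this processor actually referenced   */
        uint64_t last_region_id;          /* region in which the line was last accessed */
    } cache_line_t;

    typedef struct {
        uint64_t addr;                    /* address written by the remote processor    */
        uint64_t region_id;               /* region (sync-point interval) of the writer */
        bool     writer_held_lock;        /* would further split in-region true sharing
                                             into lock-protected vs. unprotected        */
    } inval_msg_t;

    typedef enum {
        TRUE_SHARE_IN_REGION, TRUE_SHARE_ACROSS_REGION,
        FALSE_SHARE_IN_REGION, FALSE_SHARE_ACROSS_REGION
    } inval_class_t;

    /* Classify an incoming invalidation against the local copy of the line. */
    inval_class_t classify_invalidation(const cache_line_t *line, const inval_msg_t *m)
    {
        /* True sharing: the invalidating write hits a word this processor used. */
        bool true_share = line->touched[(m->addr / sizeof(double)) % LINE_WORDS];
        /* In-region: writer and local accesses fall between the same sync points. */
        bool in_region  = (m->region_id == line->last_region_id);

        if (true_share) return in_region ? TRUE_SHARE_IN_REGION  : TRUE_SHARE_ACROSS_REGION;
        else            return in_region ? FALSE_SHARE_IN_REGION : FALSE_SHARE_ACROSS_REGION;
    }

    int main(void)
    {
        cache_line_t line = { .tag = 0x42, .last_region_id = 7 };
        line.touched[3] = true;                              /* this processor read word 3 */
        inval_msg_t inv = { .addr = 3 * sizeof(double), .region_id = 7,
                            .writer_held_lock = true };
        return classify_invalidation(&line, &inv) == TRUE_SHARE_IN_REGION ? 0 : 1;
    }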

5. Experiments

First, we present the OpenMP benchmarks used for experiments with ccSIM. Next, we discuss the ccSIM comparison with hardware performance counters. We then use ccSIM to characterize the shared memory usage of representative OpenMP benchmarks and show how ccSIM statistics are useful in detecting and isolating coherence bottlenecks.

Benchmarks: We selected 7 OpenMP benchmarks for our experiments. Of these, IS, MG, CG, FT, SP and BT are from the NAS OpenMP benchmark suite [14]. NBF is a part of GROMOS [12]. A brief description of each benchmark is given below.

1. IS: A large integer sort used in "particle method" codes.
2. MG: A V-cycle MultiGrid method to compute the solution of the 3-D scalar Poisson equation.
3. CG: A Conjugate Gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix.
4. FT: An implementation of a 3-D Fast Fourier Transform (FFT)-based spectral method.
5. SP: A simulated CFD application with scalar pentagonal bands of linear equations that are solved sequentially along each dimension.
6. BT: A simulated CFD application with block tridiagonal systems of 5x5 blocks solved sequentially along each dimension.
7. NBF (Non-Bonded Force Kernel): A molecular dynamics simulation computing non-bonded forces due to molecular interactions.

Comparison with Hardware Counters: Next, we validate ccSIM against measurements from hardware performance counters. From a developer's perspective, the number of coherence misses is the most important facet of the access pattern of an application. However, there are no hardware counters capable of measuring coherence misses on our target platform. Instead, we compare the number of invalidations for ccSIM against the actual number of invalidations measured by the hardware counters. The total number of invalidations is an upper bound on the number of coherence misses for the application. Reducing invalidations will also lower the number of coherence misses, thereby improving application performance.

Hardware Environment: The hardware counter measurements were carried out on a single 4-way Power3 SMP node. The hardware counters were accessed through the proprietary Hardware Performance Monitor (HPM) API. The system has a 64 KB, 128-way associative L1 cache with round-robin replacement and an 8 MB, 4-way associative L2 cache. All experiments were carried out with 4 active OpenMP threads bound to distinct processors. The IBM OpenMP compilers, xlc_r and xlf_r, were used to compile the benchmarks at the default optimization level (-O2) with the following flag settings: -qarch=auto, -qsmp=omp, -qnosave.

HPM measurements: The Power3 hardware implements the MESI coherence protocol within an SMP node. The PM_SNOOP_L2_E_OR_S_TO_I and PM_SNOOP_M_TO_I HPM events were used to measure the number of L2 cache invalidations with E-to-I, S-to-I and M-to-I transitions, respectively. The OpenMP runtime system also contributes to the number of invalidations measured. Since we are interested only in the invalidations of the application, we need to remove these invalidations from the measured numbers.

To assess the side effect of the OpenMP runtime system on invalidations, we measured invalidations for OpenMP runtime constructs with empty bodies in a set of microbenchmarks. For example, the overhead in terms of invalidations for a barrier construct was determined. The microbenchmarks were subsequently used to adjust the raw HPM data obtained from application runs by removing the extrapolated effect of OpenMP runtime invalidations for n occurrences of a construct. For example, we removed n = 100 times the overhead for a single barrier if the benchmark contained 100 barriers. We refer to these measurements as the raw HPM metrics and the OpenMP-adjusted HPM metrics.
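The adjustment itself is a simple linear extrapolation from the microbenchmark overheads, as the following sketch illustrates; all counter values and per-construct overheads in it are made-up placeholders, not measured numbers.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical per-construct invalidation overheads obtained from the
     * empty-body microbenchmarks, and construct counts for one benchmark run. */
    struct omp_overhead { const char *construct; long invals_per_use; long uses; };

    int main(void)
    {
        long raw_hpm = 165000;                   /* raw L2 invalidations from HPM (illustrative) */
        struct omp_overhead rt[] = {
            { "barrier",  20, 100 },             /* e.g., 100 barriers at ~20 invalidations each */
            { "parallel", 35,  10 },
        };
        long adjusted = raw_hpm;
        for (size_t i = 0; i < sizeof rt / sizeof rt[0]; i++)
            adjusted -= rt[i].invals_per_use * rt[i].uses;   /* subtract extrapolated runtime share */
        printf("OpenMP-adjusted HPM invalidations: %ld\n", adjusted);
        return 0;
    }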

Table I shows the raw and OpenMP-adjusted HPM-measured invalidations for the L2 cache. The invalidations were measured for each processor separately using the HPM events discussed above and summed up to get the total invalidations shown in the table. Each HPM measurement is the mean of 5 samples.

TABLE I. Total L2 invalidations with HPM
Benchmark   HPM (raw)   HPM (OpenMP-adjusted)
IS          165246      162964
MG          24631       13629
CG          134964      100488
FT          326595      325257
SP          282269      258923
BT          185317      157384
NBF         474121      135926

Comparison with ccSIM: ccSIM was configured with the MESI coherence protocol and with the cache parameters of the hardware platform (4-way Power3 SMP node). Both L1 and L2 caches were simulated. Table II compares total L2 invalidations for HPM and the two ccSIM modes, piped and interleaved.

TABLE II. HPM vs. ccSIM
Benchmark                       IS      MG     CG      FT      BT      SP      NBF
HPM                             162964  13629  100487  325257  157384  258922  135926
ccSIM-Intl                      163073  13174  117117  302630  157503  268334  137498
ccSIM-Pipe                      159913  12355  116318  302607  157480  268334  14629
% Error (Interleaved vs. HPM)   -0.06   3.3    -16.5   6.9     -0.07   -3.6    -1.15

The results indicate a good correlation between ccSIM and HPM for most benchmarks. The absolute error between ccSIM and HPM is less than 17% for all benchmarks and less than 7% for most. Moreover, for the NAS benchmarks, both interleaved and piped modes result in closely matching numbers of invalidations. This indicates that for these benchmarks, fine-grained round-robin simulation is not necessary to achieve a high level of simulation accuracy. NBF stands out as an anomalous case with a significant difference between the interleaved and piped modes of simulation. ccSIM allows us to categorize invalidations into true- and false-sharing invalidations as well as to distinguish between across-region and in-region invalidations, as explained in Section 4.2. The cause of the discrepancy becomes apparent when we examine the in-region true-sharing critical invalidations shown in Figure 4. Metrics are plotted on a log scale. The number of true-sharing invalidations occurring within a region is much higher (at least an order of magnitude) in the interleaved simulation mode. The interleaved simulation mode involves fine-grained round-robin simulation, which leads to "ping-ponging" of shared data across processors. Ping-ponging does not take place with the piped mode of simulation, leading to a very small number of invalidations being recorded. A look at the per-reference ccSIM statistics indeed shows that the most significant invalidation source is a data access point inside an OpenMP critical construct. This demonstrates the necessity of interleaved simulation for codes containing critical sections to closely resemble the interleaving of references during actual execution. Next, let us consider the detailed results for two benchmarks and general trends for the benchmark suite.

[Fig. 4. NBF: Interleaved and Piped]

6. Characterization of Benchmarks

Our simulation results are not only accurate with respect to actual executions, we also obtain detailed classifications indicating the cause of invalidations as well as the corresponding location in the program. Due to space limitations, we discuss only MG and NBF. The former is largely representative of any of the first six benchmarks while the latter has a unique coherence footprint, which indicates room for optimizations subsequently discussed.

6.1 MG and General Trends

Figures 5(a) to 5(e) represent the behavior of this benchmark as observed by ccSIM. Figure 5(a) shows that coherence misses are rare in L1 while they constitute 55-63% of total misses in the second level of cache. The size of the L1 cache causes uniprocessor misses to completely dominate the misses occurring in this level. The total misses and coherence misses are almost uniform across processors, which is common for SPMD programming styles. Small variations are typically due to imbalanced sharing of data across sharing boundaries, such as in stencil problems and grid-based calculations.
Inner processors have more neighbors, resulting in a larger number of invalidations. Hence, the number of invalidations depicted in Figure 5(b) is higher for processors two and three, since this metric amplifies these variations. Detailed access simulation by ccSIM also allows us to distinguish the cause of invalidations as true-sharing and false-sharing invalidations, shown in Figure 5(b). Interestingly, a significant fraction of the total invalidations are false-sharing invalidations. Within these classes, we can further determine whether invalidations resulted from two references crossing a synchronization point (across multiple regions) or not (within a synchronization region), depicted in Figure 5(c). True-sharing invalidations mainly arise from references occurring across regions. A majority of the false-sharing invalidations also occur across region boundaries. We can further classify in-region invalidations into two classes: those due to references within a critical section (while holding a lock) and those outside of critical sections (without holding a lock). Figure 5(d) indicates that only an insignificant number of in-region true-sharing invalidations occur, and these invalidations are due to accesses without locks. (While this metric is not indicative for MG, it is significant for NBF discussed in the following and is depicted for symmetry.) Finally, not all invalidations may lead to subsequent misses, but ccSIM allows us to determine if an invalidation is followed by a miss, as depicted in Figure 5(e). The percentage of invalidations leading to misses is significant (around 50-70%) in the L1 cache and very high (approximately 95%) in the L2 cache.

For the other benchmarks, with the exception of NBF, we observe similar trends in the ratio between total misses and coherence misses and the ratio between L1 and L2 misses. We find that a significant fraction of invalidations are false-sharing invalidations, a majority of which cross region boundaries. Most true-sharing invalidations also cross region boundaries. Finally, a large portion of invalidations, particularly in L2, subsequently result in a coherence miss. Hence, the results discussed for MG are representative for the other benchmarks as well, with the exception of NBF discussed in the following.

6.2 NBF: Non-Bonded Force Kernel

Figures 6(a) to 6(e) represent the results obtained from ccSIM by simulated execution of this benchmark. NBF contains a critical section with updates of shared data inside a parallel region. From Figure 6(a), we observe that a significant percentage of misses in the L1 and L2 caches are coherence misses. Almost all invalidations are true-sharing invalidations in both the L1 and L2 caches, except for processor 4 (Figure 6(b)), where false-sharing invalidations dominate. This is an artifact of the order of access scheduling in the fine-grained round-robin ccSIM simulation. As we shall see in the next section, almost all invalidations occur in a loop updating a global shared array, executed by every processor. Since the processors are ordered by their logical ids, the array element written to by processor-1 is different from the element accessed by processor-4 in the last iteration, but they are adjacent in the L1/L2 cache line. This causes the resulting invalidation (since processor-1 is writing) to be classified as a false-sharing invalidation for processor-4's caches. In an execution with non-deterministic ordering, these false-sharing invalidations will still occur but will be distributed evenly over the caches of all processors. Figure 6(c) shows that almost all of the true-sharing invalidations take place within regions. Locks protect the references that cause true-sharing invalidations within regions (see Figure 6(d)). This observation is central to the opportunities for optimizations discussed in the next section. An important characteristic is the percentage of invalidations resulting in coherence misses, shown in Figure 6(e). We see that more than 95% of the invalidations subsequently caused a coherence miss. The coherence results obtained for NBF indicate opportunities for optimizations with respect to in-region true sharing with locks (critical sections), but only more detailed simulation can provide conclusive information to determine beneficial transformations, as discussed in the following.

7. Opportunities for Transformations

In this section, we demonstrate how ccSIM can be used to detect and isolate coherence traffic bottlenecks, and we derive opportunities for transformations leading to reduced coherence traffic and, thereby, potential performance gains.

7.1 NBF: Non-Bonded Force Kernel

We first discuss the NBF kernel described in the previous section. A full access trace was obtained for the OpenMP NBF kernel. The OpenMP environment was set to four threads and static scheduling (OMP_NUM_THREADS = 4, OMP_SCHEDULE = STATIC).

Analysis: Consider the results for NBF again. Figure 6(a) shows the breakdown of misses for L1 and L2 caches

(b) MG: Invalidations (b) NBF: Invalidations

(c) MG: Region-wise Classi®cation of Invalidations (c) NBF: Region-wise Classi®cation of Invalidations

(d) MG: In-Region True-Sharing Invalidations (d) NBF: In-Region True-Sharing Invalidations

(e) MG: Invalidations causing Misses (e) NBF: Invalidations causing Misses Fig. 5. Results for MG Fig. 6. Results for NBF TABLE III. NBF: Comparison of per-reference statistics for each optimization strategy Invalidations Line Ref Optimization Misses Miss % Coherence True False No. Strategy Ratio Misses Total In Across In Across Region Region Region Region 141 f Read Original 32500 0.99 96.87% 32768 32768 0 0 0 Serialized 2050 1.0 50.30% 2048 2048 0 0 0 Round-robin 1790 0.87 42.84% 2048 2048 0 0 0 227 x Read Original 1540 0.997 99.74% 768 1 765 0 2 Serialized 1540 0.997 99.74% 768 1 765 0 2 Round-robin 1540 0.997 99.74% 768 0 766 0 2 217 f Read Original 512 1.0 100% 256 256 0 0 0 Serialized 512 1.0 100% 256 256 0 0 0 Round-robin 512 1.0 100% 256 0 255 0 1 for each processor obtained by ccSIM. the updates of the shared global array f. We shall discuss We observe that almost all L2 misses and a significant two ways of reducing the number of coherence misses. One number of L1 misses are coherence misses. A coherence method eliminates the ping-ponging of the f array by seri- miss is caused when a processor accesses a cache line that alizing the updates to the array f since they require mutu- was invalidated due to a write from another processor. How- ally exclusive writes. This is achieved by moving the criti- ever, a large number of invalidations does not necessarily cal section to encompass the entire for loop instead of the imply a large number of coherence misses, since the inval- single update. The modified code is shown below. idated cache lines may not be referenced by the processor #pragma omp parallel again before being flushed out of the cache. The breakup of ... invalidations in Figure 6(e) shows that a significant number #pragma omp critical of invalidations resulted in coherence misses, especially in for(i = 0; i < natoms; i++) { f[i] = f[i] + flocal[i]; the L2 cache. This indicates that minimizing the total num- } ber of invalidations will reduce the magnitude of coherence Moving the critical statement outside the loop also misses correspondingly. reduces the number of times that the mutual exclusion re- We have detected that a coherence bottleneck exists. We gion must be entered and exited, decreasing the execution can use the per-reference coherence and cache statistics overhead. generated by ccSIM to determine the cause of the bottle- Although reducing the number of coherence misses, this neck. Table III shows the per-reference statistics on proces- method does not exploit the potential for parallel updates to sor one for the top three references of the original code and separate parts of the f array by different threads. Hence, two optimization strategies (serialized and round-robin) dis- we consider an alternate transformation. We can exploit cussed in the following. Only L2 cache statistics are shown. parallelism by partitioning the array f into a number of seg- We observe that access metrics across all processors are ments. Each thread updates a distinct segment until all seg- uniform. The f Read reference on line 141 of the source ments are updated. We call this scheme the round-robin up- code has an exceptionally high miss rate in all processors. date scheme. The modified code is shown below as pseudo- Moreover, more than 96% of the misses for this reference code. are coherence misses. The invalidation data shows that i=0; the large number of in-region invalidations are the primary for each thread { 1. segment_number = i + thread_id; cause for these misses. 
The relation of this reference to the 2. update segment source code indicates that line 141 is of interest: 3. synchronize w/ other threads (barrier) #pragma omp parallel 4. i = (i+1) MOD max_segments ... }//run till all segments are updated for (i = 0; i < natoms; i++) { #pragma omp critical Results: Table III compares the L2 coherence misses and 141: [i] = f[i] + flocal[i]; invalidations for the two optimization strategies. Statistics } are depicted only for processor-1 and are similar for the The for loop updates the global shared array f with val- other processors. We observe that both strategies lead to a ues from the local private copy flocal for each OpenMP significant decrease in the volume of coherence misses for thread. The large number of invalidations attributed to the the f Read reference. Table V shows the wall-clock exe- f Read reference is due to the ping-ponging of the shared cution time for (a) the routine that updates the shared array f array between processors as all of them try to update the f, (b) the remainder and (c) the entire program. We ob- global f array simultaneously. serve that the transformations cause a significant improve- Optimizing Transformations: Using ccSIM’s per- ment in wall-clock execution time. Table VI shows the total reference statistics, we isolated the coherence bottleneck to L2 invalidations for each approach measured with HPM. It TABLE IV. IRS: Per-Reference Statistics Invalidations Optimized Proc No. Reference Group Coherence True False Optimization Coherence Misses In Across In Across Strategy Misses 1 1 v1[] Read 8627 4 7517 31 1342 Code Transforms 1980 2 v2[] Read 8568 310 5093 78 3085 for data 1971 3-16 matrix.dbl[] Write 1 2547 25 2325 0 455 segregation 719 17 x[] Read 1803 0 1402 391 2 968 18 timers¯ag Read 2 3182 1 0 3122 70 Padding 0 19 thread ¯op[] Read 1789 0 2 1789 0 0 20 clock last Read 3 2165 2166 0 0 0 Remove Sharing 0 2 1 clock last Read 3 5997 5644 353 0 0 0 2-3 timers¯ag Read 2907 18 0 2908 0 0 4-6 thread ¯op[] Read 2 2734 0 0 2407 327 0 7 thread wall secs[] Read 1022 0 0 652 371 Padding 0 8 thread cpu secs[] Read 811 0 71 742 0 0 confirms that the actual number of invalidations occurring into groups with distinct coherence characteristics (Groups in the hardware indeed decreases by an order of magnitude 1, 2 and 3). Multiple references are shown with only a sin- for the serialized and round-robin schemes over the original gle representative reference. For example, there are a set of program. fourteen references to different arrays in the the matrix structure, all of which show similar coherence character- TABLE V. NBF: Wall clock Times (Seconds) istics; these are represented by a single representative ref- Code Original Serialized Round-robin erence matrix.dbl[] in the table. We observe that the Segment the set of references with significant coherence behavior are f-Update 4.981 0.003 (99.9%) 0.003 (99.9%) quite different for processor-1 and processor-2. We shall Other 2.141 2.076 (3%) 2.190 (-2.28%) now analyze references belonging to each group in detail. Overall 7.122 2.079 (70.8%) 2.193 (69.2%) Group 1: These references account for the largest frac- TABLE VI. NBF: L2 Invalidations (HPM raw) tion of coherence misses. True sharing across-region inval- Code Original Serialized Round-robin idations are dominant for this group. This indicates that the Segment data elements accessed by these references move across the f-Update 503654 921 6209 L2 caches of multiple processors. 
Consider the first two Other 37987 32916 38863 references (v1[] and v2[]). These references occur in Overall 541641 33837 45072 the icdot function, which is only called at three locations from the MatrixSolveCG function. All call sites are in 7.2 IRS: Implicit Radiation Solver serial code, i.e., they are executed only by the master thread. Between successive calls, the argument arrays are updated IRS-1.4 is part of the ASCI Purple codes [1]. IRS by other processors in parallel regions: can use MPI, OpenMP or a mixture of both for paral- /* only master */ lelization. We use the pure OpenMP version of IRS for

for (i =0; i nblk; i++) our study. Existing OpenMP parallelization uses “omp dotprev += icdot(r[i], z[i],...); /* Reads r,z */ parallel do” constructs for loop level parallelization. ... For the analysis below, we ran IRS for 10 calls to the top- /* parallel updates to r,z */ level xirs function, with a limited data set (NDOMS=10, #pragma omp parallel for

ZONES PER SIDE=NDOMS PER SIDE) with 4 OpenMP ¡ for (i =0; i nblk; i++) threads and static scheduling. This partial data trace is setpz1( r[i],...); /* Writes to r */ comparatively small, yet captures essential coherence traf- setpz1(z[i],...); /* Writes to z */

fic. Once our optimizations are complete, we compare the ¢ ... wall-clock time for the recommended full-sized data set for /* only master */ IRS (zrad.008.seq).

for (i =0; i nblk; i++) Analysis: Figure 7 shows that for all processors, coher- dotrz += icdot(r[iblk], z[iblk],...); /* Reads r,z */ ence misses constitute almost the entire volume of L2 cache misses. Interestingly, the coherence miss magnitudes are Thus parts of arrays r and z move between processor- asymmetric with processor-1 experiencing more than twice 1 and other processors. We can eliminate this unneces- the number of coherence misses of any another processor. sary movement using code transformations for data segre- Table IV shows the per reference coherence statistics for gation. In this case, we can parallelize the icdot calls processors 1 and 2. Statistics for other processors were sim- using OpenMP. This allocates segments of r and z arrays ilar to those for processor-2. References have been collected to specific processors, which eliminates unnecessary data TABLE VII. SMG: Per-Reference Statistics (Processor-1) Invalidations Optimized No. Reference Group Coherence True False Optimization Coherence Misses In Across In Across Strategy Misses 1 rp[] Read 1 170046 0 0 156585 13387 Code Transforms 256 2 rp[] Read 83509 0 0 80145 3529 for coarse-level 0 3 rp[] Write 43640 0 0 43305 3373 interleaving 0 4 xp[] Write 23193 0 0 22309 1284 2764 5 num threads 2 44362 44929 0 0 0 Remove sharing 0 movement. More significantly, icdot calls now operate ditionally removes the shared global clock (Group 3 refer- in parallel, which potentially has a much bigger impact on ence). DSeg+Crit+Padding represents the fully opti- performance than the elimination of data movement alone. mized benchmark. We observe that DSeg causes signifi- Similar transformations are carried out for other refer- cant decrease in wall clock execution time (over 30%), com- ences from Group-1, which we do not further discuss here. pared to the original program. The performance impact is Group 2: In-region false sharing invalidations constitute due to a combination of reduction in coherence traffic and almost the entire volume of invalidations for these refer- the OpenMP parallelization on several serial code sections ences. The number of coherence misses closely matches required for it. the number of invalidations received. All these references 7.3 SMG2000: Semi-coarsening Grid Solver are related to timer routines used for performance bench- marking. Most of the coherence misses arise due to parallel SMG2000 is part of the ASCI Purple codes [1]. The updates to counter arrays indexed by thread id. Since array SMG code utilizes the hypre library [10], which can se- elements are contiguous, this leads to false-sharing, causing lect between OpenMP and MPI parallelization. We use the ping-ponging of cache lines across processors. We use intra default settings of SMG2000 for our analysis (10 x 10 data-structure padding to align individual array elements at x 10 grid, cx=cy=cx=1.0). We then compare the cache line boundaries, which eliminates coherence misses. wall-clock execution time for the recommended full-sized Group 3: This group has a single reference exhibiting workloads for different optimization strategies. large volumes of true in-region invalidations. These invali- Analysis: For all processors, the L2 miss rate is quite dations occur inside a omp critical region updating a high, ranging from 64% to 81%. Figure 9 shows that al- shared global clock variable. We eliminate this sharing by most all of the L2 misses are coherence misses. The per- maintaining clock variables for each thread separately. 
reference statistics for the top 5 references from processor-1 Results: The coherence misses for each reference after are shown in Table VII. The statistics for other processors optimization are shown in the last column of Table IV. We were similar to those of processor-1. As with IRS, we clas- see that coherences misses for Groups 2 and 3 have been sify references into groups based on coherence characteris- eliminated (by padding and sharing elimination, respec- tics to facilitate analysis. 9 078808 tively) and have decreased significantly for Group 1. Figure  4 0703.0 8 shows the wall-clock execution times for the different op- 8808 timization strategies on the recommended OpenMP data set. The readings were obtained on a non-interactive node with 8 OpenMP threads. DSeg represents code transformations  for data segregation (Group 1 references). DSeg+Crit ad-

Fig. 9. SMG2000: Breakdown of L2 misses

  8

/  3

4 Group 1: References in this group are all array access

 . 0

$ references. All references experience a very large volume   of in-region false-sharing invalidations. This indicates that

 multiple processors are updating different data elements on  the same cache line, causing the cache line to ping-pong between L2 caches of different processors. The cause of ! ! ! ! Fig. 7. IRS: Break- the large volume of invalidations lies in the sub-optimal im- down of L2 misses Fig. 8. IRS: Time w/ plementation of loop-level parallelization by the hypre li- 4 Optimizations brary function. The function chooses a loop to out of a triply loop nest. Each loop in the nest iterates over a single coor- ecution time, with a maximum improvement of 73% for the dinate axis. The order of iteration is x,y,z from the inner 4th workload (8 OpenMP threads). to the outer loop. The function always chooses the largest dimension for parallelization, with the default being the in- 8. Related Work nermost loop (x dimension). This results in fine-grained interleaving of thread accesses to adjacent array elements, There are several software-based and hardware-based resulting in large amounts of coherence traffic. To correct approaches for memory performance characterization of this, we hoist the OpenMP parallelization to the outermost shared memory multiprocessor systems. Gibson et al. pro- loop (z dimension) ensuring that threads access data on dif- vides a good overview of the trade-offs of each approach ferent cache lines. [11]. At one end of the spectrum are complete software Group 2: This group has a single store reference that machine simulators. RSim is a simulator for ILP multi- exhibits large volumes of true-sharing in-region invali- processors with support for CC-NUMA architectures with dations. The data element referenced is a shared vari- a invalidation-based directory mapped coherence protocol able which is simultaneously updated by all threads with [13]. SimOS is a complete machine simulator capable the number of runnable OpenMP threads, inside an omp of booting commercial operating systems [26]. However, parallel construct. We eliminate this sharing by replac- these frameworks simulate hardware and architecture state ing the omp parallel construct with separate calls to to a great detail, increasing simulation overhead. This lim- omp get max threads() in each thread. its the size of the programs and workloads that they can Results: The coherence misses after optimization are run. In contrast, ccSIM is an event-based simulator which shown in the last column of Table VII. Our optimizations simulates only memory hierarchies. Our instrumentation have eliminated almost all the coherence misses for these tool is flexible and allows us to collect partial traces of only references. We compare the performance impact of our op- the pertinent memory access. Thus, we can handle a much timizations on wall clock execution time for the following larger range of programs and workloads. More importantly, workloads, as recommended by the SMG2000 benchmark- these simulators provide only bulk statistics intended for ing criteria: evaluating architecture mechanisms. In contrast, we aim at providing the application programmer with information on 1. 35x35x35 grid, OpenMP threads=1 2. 35x35x70 grid, OpenMP threads=2 the shared-memory behavior of the program and correlate 3. 35x70x70 grid, OpenMP threads=4 metrics to higher levels of abstraction, such as line numbers 4. 70x70x70 grid, OpenMP threads=8 and source code data structures. All workloads have processor configuration 1x1x1 (-P Execution-driven simulators are a popular approach for 1 1 1), cx=0.1, cy=1.0, cz=10.0. 
The workloads scale implementing memory access simulators. Code annotation up the input grid size with increasing number of threads tools annotate memory access points. Annotations call han- keeping the overall data processed per processor constant. dlers, which invoke the memory access simulator. Augmint Figure 10 compares the wall-clock times for the differ- [24], Proteus [2] and Tango [7] are examples of this ap- ent workloads. Coarsening represents code transfor- proach. All these tools use static code annotation, i.e., they mations for coarse-level interleaving of accesses (Group annotate the target code at the source, assembly or object 1). Coarsening+Sharing Removal additionally re- code level. MemSpy [22] and CProf [18] are cache pro- moves unnecessary shared data access (Group 2). We ob- filers that aim at detecting memory bottlenecks. CProf re- serve that both optimizations have significant impact on ex- lies on post link-time binary editing through EEL [16], [17]. Lebeck and Wood also applied binary editing to substitute instructions that reference data in memory with function  73, calls to simulate caches on-the-fly [19]. Other approaches 4,78033 rely on hardware support, such as watchdogs [5] or statis-  4,78033$,70/#024;, tical sampling with hardware support in ProfileMe [8], to

8 gather information on data references. Scal-Tool detects /  3

4 and quantifies bottlenecks in distributed shared . 0

$ memory architectures, such as the SGI Origin 2000 [28]. It  determines inefficiencies due to cache capacity constraints, load imbalance and synchronization. Nikolopoulos et al.  discuss OpenMP optimizations for irregular codes based on memory reference tracing to indicate when page migration and loop redistribution is beneficial. This results in compa-     47 4,/8 rable performance of optimized OpenMP with MPI paral- Fig. 10. SMG: Time for different Workloads lelization, again on the Origin 2000 [25]. None of the above tools allow misses to be related to based on dataflow techniques to analyze thread interactions source code and data structures. Furthermore, our work [27]. Optimizations include barrier removal and data priva- differs in the fundamental approach of rewriting binaries, tization to reduce coherence-induced messages. Our work which is neither restricted to a special compiler or program- shares the aim at optimizing shared-memory applications ming language, nor does it preclude the analysis of library with these approaches. However, we take a radically differ- routines. In addition, execution-driven simulators are often ent approach by analyzing traces to determine if and where tied to one architecture due to the requirements of annotat- inefficiencies in terms of coherence traffic exist and if there ing the code at assembly or object level. DynInst is avail- is room for improvements. able on a number of architectures. Porting our framework to these platforms only involves changing the memory in- 9. Conclusion structions to be instrumented. Another major difference ad- dresses the overhead of large data traces inherent to all these In this paper, we introduced ccSIM, a cache-coherent approaches. We allow the analysis of partial traces and em- memory simulator fed by data traces obtained via on-the- ploy trace compression to provide compact representations. fly dynamic binary rewriting of OpenMP benchmarks exe- The SIGMA (Simulator Infrastructure to Guide Memory cuting on a Power3 SMP node. We explored the degrees Analysis) [9] system has many similarities with our work. of freedom in interleaving data traces from the different It uses post-link binary instrumentation and online trace processors with respect to simulation accuracy compared to compression, and allows tagging of metrics to source code hardware performance counters. We also provided detailed constructs. A toolkit by Marin and Mellor-Crummey uses coherence information per data reference and relate them to statistical sampling of dynamically instrumented data trace their data structures and reference locations in the code. points to predict memory behavior across different architec- The experiments conducted by us show close matches of tural platforms [21]. Both of these approaches are limited hardware performance counters for coherence events. Fur- to uniprocessor systems while we focus on analyzing co- thermore, we demonstrated the feasibility and the benefits herence traffic for SMPs. The latter work does not focus on of deriving detailed coherence information indicating the transformations, unlike our work. location of invalidations in the application code. This de- Recently, most architectures have added hardware coun- tailed level of information allowed us to infer opportuni- ters, which provide information on the frequency of hard- ties for optimizations that, without ccSim, could not easily ware events, e.g., to count shared memory events. Portable be obtained and localized. 
This led to program transfor- like PAPI provide a reasonably platform-independent mations resulting in both significantly decreased coherence method of accessing these counters [3]. Hardware counters traffic and execution time savings. impose no runtime overhead, and querying counters is typ- ically of low overhead. However, they only provide aggre- References gate statistics without any relation to the source code, and [1] Asci purple codes. http://www.llnl.gov/asci/purple, 2002. there are only a limited number of counters available. In ad- [2] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl. Pro- dition, there are often restrictions on the type of events that teus: A high-performance parallel-architecture simulator. In Pro- can be counted simultaneously. It is possible to get higher ceedings of the SIGMETRICS and PERFORMANCE '92 Interna- tional Conference on Measurement and Modeling of Computer Sys- levels of information with customized hardware. The Flash- tems, pages 247±248, New York, NY, USA, June 1992. ACM Press. Point system uses a custom system node controller to mon- [3] S. Browne, J. Dongarra, N. Garner, K. London, , and P. Mucci. itor coherence events [11]. In general, hardware monitors A scalable cross-platform infrastructure for application performance are fast but may constrain the number of events that can be tuning using hardware counters. In Supercomputing, Nov. 2000. [4] B. Buck and J. K. Hollingsworth. An API for runtime code patching. monitored. At this point in time, they lack a wide accep- The International Journal of High Performance Computing Applica- tance in practice. tions, 14(4):317±329, Winter 2000. Krishnamurthy and Yelick develop compiler analysis and [5] B. R. Buck and J. K. Hollingsworth. Using hardware performance optimization techniques for the shared-memory program- monitors to isolate memory bottlenecks. In ACM, editor, Supercom- puting, pages 64±65, 2000. ming paradigm using SplitC as an example [15]. Their [6] D. Burger, T. M. Austin, and S. Bennett. Evaluating future micropro- main concern is the hardware-supported coherence model, cessors: The simplescalar tool set. Technical Report CS-TR-1996- namely weak consistency. They are specifically concerned 1308, University of Wisconsin, Madison, July 1996. about writes and invalidations occurring out-of-order. Their [7] H. Davis, S. R. Goldschmidt, and J. Hennessy. Multiprocessor sim- ulation and tracing using tango. In Proceedings of the 1991 Interna- optimizations reflect the constraints of reordering writes in tional Conference on Parallel Processing, volume II, Software, pages the presence of locks and barriers with respect to weak II±99±II±107, Boca Raton, FL, Aug. 1991. CRC Press. consistency and employ message pipelining (aggregation of [8] J. Dean, J. Hicks, C. Waldspurger, W. Weihl, and G. Chrysos. Pro- writes) and reduction of communication (two-way to one ®leMe: Hardware support for instruction-level pro®ling on out-of- order processors. In Proc. 30th Annual IEEE/ACM Int. Symp. on way or elimination). Satoh et al. study compiler optimiza- (MICRO-97), pages 292±302, Dec. 1997. tions for OpenMP in a distributed shared memory system [9] L. DeRose, K. Ekanadham, J. K. Hollingsworth, , and S. Sbaraglia. SIGMA: A simulator infrastructure to guide memory analysis. In Apr. 2002. Supercomputing, Nov. 2002. [30] J. Vetter and F. Mueller. Communication characteristics of large- [10] R. D. F. E. Chow, A. J. Cleary. 
Design of the hypre preconditioner scale scienti®c applications for contemporary cluster architectures. library. In SIAM Workshop on Object Oriented Methods for Inter- Journal of Parallel , 63(9):853±865, Sept. operable Scienti®c and Engineering Computing, Oct. 1998. 2003. [11] J. Gibson. Memory Pro®ling on Shared Memory Multiprocessors. PhD thesis, Stanford University, July 2003. [12] W. Gunsteren and H. Berendsen. Gromos: Groningen molecular sim- ulation software. Tr, Laboratory of Physical Chemistry, University of Groningen, 1988. [13] C. Hughes, V. Pai, P. Ranganathan, and S. Adve. Rsim: Simulating Shared-Memory Multiprocessors with ILP Processors. IEEE Com- puter, 35(2):40±49, February 2002. [14] H. Jin, M. Frumkin, and J. Yan. The implementations of nas parallel benchmarks and its performance. TR NAS-99-011, NASA Ames Research Center, Oct. 1999. [15] A. Krishnamurthy and K. Yelick. Optimizing parallel programs with explicit synchronization. In ACM SIGPLAN Conference on Pro- gramming Language Design and Implementation, pages 196±204, 1995. [16] J. Larus and T. Ball. Rewriting executable ®les to measure pro- gram behavior. Software Practice & Experience, 24(2):197±218, Feb. 1994. [17] J. R. Larus and E. Schnarr. EEL: Machine-independent executable editing. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 291±300, June 1995. [18] A. R. Lebeck and D. A. Wood. Cache pro®ling and the SPEC bench- marks: A case study. Computer, 27(10):15±26, Oct. 1994. [19] A. R. Lebeck and D. A. Wood. Active memory: A new abstraction for memory system simulation. ACM Transactions on Modeling and Computer Simulation, 7(1):42±77, Jan. 1997. [20] J. Marathe, F. Mueller, T. Mohan, B. R. de Supinski, S. A. McKee, and A. Yoo. Metric: Tracking down inef®ciencies in the memory hierarchy via binary rewriting. In International Symposium on Code Generation and Optimization, pages 289±300, Mar. 2003. [21] G. Marin and J. Mellor-Crummey. Cross architecture performance predictions for scienti®c applications using parameterized models. In SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT), page (to appear), 2004. [22] M. Martonosi, A. Gupta, and T. Anderson. Memspy: analyzing memory system bottlenecks in programs. In Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, pages 1±12, 1992. [23] J. Mellor-Crummey, R. Fowler, and D. Whalley. Tools for application-oriented performance tuning. In International Confer- ence on Supercomputing, pages 154±165, June 2001. [24] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The augmint multiprocessor simulation toolkit: Implementation, experimentation and tracing facilities. In IEEE International Conference on Com- puter Design: VLSI in Computers and Processors, pages 486±491, Washington - Brussels - Tokyo, Oct. 1996. IEEE Computer Society. [25] D. Nikolopoulos, C. Polychronopoulos, and E. Ayguade. Scaling irregular parallel codes with minimal programming effort. In Super- computing, 2001. [26] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete computer system simulation: The SimOS approach. IEEE parallel and distributed technology: systems and applications, 3(4):34±43, Winter 1995. [27] S. Satoh, K. Kusano, and M. Sato. Compiler optimization tech- niques for openMP programs. Scienti®c Programming, 9(2-3):131± 142, 2001. [28] Y. Solihin, V. Lam, and J. Torrellas. Scal-tool: Pinpointing and quan- tifying scalability bottlenecks in dsm multiprocessors. 
In Supercom- puting, Nov 1999. [29] J. Vetter and F. Mueller. Communication characteristics of large- scale scienti®c applications for contemporary cluster architectures. In International Parallel and Distributed Processing Symposium,