Chapter 3

Software Interfaces to Hardware Counters

Shirley V. Moore University of Tennessee

Daniel K. Terpstra University of Tennessee

Vincent M. Weaver University of Tennessee

3.1 Introduction
3.2 Processor Counters
3.3 Off-Core and Shared Counter Resources
3.4 Platform Examples
3.4.1 AMD
3.4.2 Intel
3.4.3 IBM Blue Gene
3.5 Operating System Interfaces
3.5.1 Perf Events
3.5.2 Perfmon2
3.5.3 Oprofile
3.6 PAPI in Detail
3.6.1 Extension to Off-Processor Counters
3.6.2 Countable Events
3.7 Counter Usage Modes
3.7.1 Counting Mode
3.7.2 Sampling Mode
3.7.3 Data Address Sampling
3.8 Uses of Hardware Counters
3.8.1 Optimizing Cache Usage
3.8.2 Optimizing Bus Bandwidth
3.8.3 Optimizing Prefetching
3.8.4 TLB Usage Optimization
3.8.5 Other Uses
3.9 Caveats of Hardware Counters
3.9.1 Accuracy
3.9.2 Overhead
3.9.3 Determinism
3.10 Summary
3.11 Acknowledgment

3.1 Introduction

Hardware performance counters are registers available on modern CPUs that count low-level events within the processor with minimal overhead. Hardware counters may also exist on other system components, such as memory controllers and network interfaces. Data from the counters can be used for performance evaluation and tuning. Each processor family has a different set of hardware counters, often with different names even for the same types of events. Models in the same processor family can also differ in the specific events available. In general, similar types of events are available on most CPUs.

Hardware counters are a powerful tool, but widespread adoption has been hindered by scant documentation, lack of cross-platform interfaces, and poor operating system support. Historically hardware counters were used internally by chip designers, and interfaces were not always documented or made publicly available. As application and performance tool developers began to discover the usefulness of counters, vendors started supporting the interfaces (SGI's perfex interface is an early example). These interfaces were platform-specific; an application or tool developer had to use a different interface on each platform. Typically special operating system support is needed to access the counters, and for some popular operating systems (such as Linux) this support was missing.

Linux support for hardware counters has been substandard for years. Until recently the default kernel did not support access to the counters; the kernel had to be specifically patched. Various patch sets were available, including the perfctr patches (developed by Mikael Pettersson of Uppsala University) and the perfmon and perfmon2 [129] projects (developed by Stéphane Eranian). In 2009 performance counter support was finally merged into the Linux kernel, as the Performance Counters for Linux (PCL) project (since renamed to simply “perf events”).

The divergent nature of performance counters across various platforms, architectures, and operating systems has led to the need for a layer of abstraction that hides these differences. This is addressed by the Performance API (PAPI) [64] project at the University of Tennessee, which has developed a platform-independent interface to the underlying hardware counter implementations. The remainder of this chapter explores the features, uses, and limitations of hardware performance counters, and describes operating system interfaces and the PAPI interface.

3.2 Processor Counters

The hardware counters available on CPUs expose the underlying architecture of the chip for program analysis. Various types of events are available, but care should be taken when extrapolating conclusions from the results. Counters are a limited resource; in general only a few (typically 2 to 5) events can be counted at a time. On modern CPUs there may be hundreds of available events, making it difficult to decide which events are relevant and worth counting. Commonly available events include the following:

• Cycle count
• Instruction count
• Cache and memory statistics
• Branch statistics
• Floating point statistics
• Resource utilization

Counts can be collected as aggregate totals, or else sampled. Sampling also makes it possible to count more events than there are available counters, although this introduces some degree of error into the results. Counts can be restricted to only userspace or kernel events, and monitoring can be performed per-thread or system-wide.

3.3 Off-Core and Shared Counter Resources

In addition to the counters found on CPUs, various other hardware subsystems found in modern computers also provide performance counters. This includes hardware such as hard disks, graphical processing units (GPUs), and networking equipment.

Many network switches and network interface cards (NICs) contain counters that can monitor various performance and reliability events. Possible events include checksum errors, dropped packets, and packets sent and received. Communication in OS-bypass networks such as Myrinet and Infiniband is asynchronous to the application; therefore hardware monitoring, in addition to being low overhead, may be the only way to obtain important data about communication performance.

As processor densities climb, the thermal properties and energy usage of high performance systems are becoming increasingly important. Such systems contain large numbers of densely packed processors that require a great deal of electricity. Power and thermal management are becoming critical to successful resource utilization [133]. Modern systems are beginning to provide interfaces for monitoring thermal and power consumption sensors.

Unlike on-processor counters, off-processor counters and sensors usually measure events in a system-wide rather than a process- or thread-specific context. When an application has exclusive use of a machine partition (or runs on a single core of a multi-core node) it may be possible to interpret such events in the context of the application. Alternatively, when running a multi-threaded application it may be possible to deconvolve the system-wide counts and develop a coarse picture of single-thread response.

Section 3.6 discusses extensions to the PAPI hardware counter library to handle off-processor counters. There is currently no standard mechanism to provide such performance information in a way that is useful for application performance tuning. New methods need to be developed to collect and interpret hardware counter information for shared resources in an appropriate manner. Research is underway in the PAPI project to explore these issues.

3.4 Platform Examples

Hardware performance counters are available for most modern CPU implementations. This includes not only the popular x86 line of processors, but also VLIW systems (ia64), RISC systems (Alpha, MIPS, Power, and SPARC), as well as embedded systems (ARM, SH, AVR32). Unfortunately counter implementations vary wildly across architectures, and even within chip families. Below is a brief sampling of available implementations.

3.4.1 AMD

The AMD performance counter implementation has remained relatively stable from the 32-bit processors up to and including current 64-bit systems. Each core supports four counters, with the available events varying slightly across models.

Performance counter measurements become more complicated as multiple cores are included per chip package. Some measurements are package-wide as opposed to being per-core. These values include counts such as per-package cache and memory events. The L3 cache events in AMD quad-core processors are not monitored in four independent sets of hardware counter registers but in a single set of registers not associated with a specific core (often referred to as “shadow” registers). When an L3 event is programmed into one of the four per-core counters, it gets copied by hardware to the shadow register. Thus, only the last event programmed into any core is the one used by all cores. When several cores try to share a shadow register, the results are not clearly defined. Performance counter measurement at the process or thread level relies on the assumption that counter resources can be isolated to a single thread of execution. That assumption is generally no longer true for resources shared between cores.

Newer AMD processors also support an advanced feature known as Instruction Based Sampling (IBS) that can provide more information than is usually available from regular counters. This is described in detail in Section 3.7.2.

3.4.2 Intel

Intel processor counter implementations are model specific. The Pentium 4 has a complicated implementation that is completely different from anything before or after it. Most modern processors are based on an enhanced version of an interface similar to that available on the original Pentium Pro. Despite similarities at the interface level, the Core2, Atom, and Nehalem implementations vary in the available number of counters and which events are available. The best references for these are the Intel manuals [179].

More recent multi-core Intel processors (starting with Nehalem) have “uncore” counters that are shared among all the cores of the chip. These are different from the off-core AMD measurements, as they are implemented separately from the core counters and count only at a system-wide level. On some systems this is further complicated by the availability of HyperThreading, which splits the counters yet again.

Early Intel dual-core processors addressed the issue of shared resources by providing SELF and BOTH modifiers on events. This allowed each core to independently monitor the event stream and to collect counts for its activity or for all activities involving that resource. With the introduction of the Nehalem (Core i7) architecture, Intel moved measurement of chip-level shared resources off the cores and onto the chip. Intel also supports an advanced instrumentation feature, PEBS, described in Section 3.7.2.

3.4.3 IBM Blue Gene

IBM's Blue Gene series of machines shows another approach to performance counters and the problem of measuring activities on shared resources. Blue Gene/L uses dual-core processors and Blue Gene/P uses quad-core processors. In both cases hardware counters are implemented in the UPC, a Universal Performance Counter module that is completely external to any core. In Blue Gene/P, for example, the UPC contains 256 independent hardware counters [288]. Events on each core can be measured independently, but the core must be specified in the event identifier. This can create great difficulty for code that does not know (or care) which core it is running on. Another difficulty is that these counters can only be programmed to measure events on either cores 0 and 1, or cores 2 and 3, but not on a mixture of all cores at once.

3.5 Operating System Interfaces

The interface to performance counters varies by operating system. We will focus on the interfaces available under Linux; other operating systems have similar functionality.

3.5.1 Perf Events

In December of 2008, a new approach to monitoring hardware performance counters was introduced to the Linux community by Ingo Molnar and Thomas Gleixner. This interface was rapidly accepted by Linus Torvalds and incorporated into the mainline Linux kernel as of version 2.6.31. This approach was originally informally called “Performance Counters for Linux” (PCL), but recently the scope of the project has been broadened and it now goes by “Perf Events.” The speed with which this code was accepted into the Linux kernel was startling to many in the community, as perfctr and perfmon had both met with intense resistance to inclusion in the kernel for many years. The code base is evolving rapidly, incorporating counter grouping capabilities, virtual counters for metrics such as page faults and context switches, and inheritance capabilities for monitoring performance events in child processes. Unfortunately the code is still not as mature and feature-rich as the previous perfmon2 and perfctr projects.

Perf Events introduces a single system call as an entry point to a single counter abstraction. The call, sys_perf_counter_open, returns a file descriptor that provides access to a hardware event. A continuing point of contention between Perf Events and other approaches is that Perf Events incorporates all information about counters and events into the kernel code itself, adding a significant amount of descriptive code to the kernel. This is a significant departure from tools such as perfmon that put as little knowledge about specific events as possible in the kernel and provide event description information in libraries. Embedding such data in the kernel can make it difficult for user-space tools to decide which events can be counted simultaneously, and for vendors to provide updates without resorting to patching the kernel or waiting on new kernel releases. It is likely that some combination of user-space library and kernel-level infrastructure will continue to exist after Perf Events becomes part of a released Linux kernel.
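As an illustration of this file-descriptor model, the following minimal sketch counts retired instructions for a region of code. In released kernels the call is exposed under the name perf_event_open; glibc provides no wrapper, so it is invoked through syscall(). Error handling is abbreviated and the choice of event is arbitrary.

    /* Minimal sketch: count retired user-space instructions with Perf Events. */
    #include <linux/perf_event.h>
    #include <asm/unistd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;           /* generic hardware event */
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* retired instructions */
        attr.disabled = 1;                        /* start disabled */
        attr.exclude_kernel = 1;                  /* count user space only */

        /* pid = 0, cpu = -1: monitor this process on any CPU. */
        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... region of interest ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

Additional events can be opened as a group by passing an existing file descriptor as the group_fd argument, and sampling is requested through fields of the same attribute structure.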

3.5.2 Perfmon2

The perfmon2 project was the leading contender for inclusion in the Linux kernel before the surprise emergence of Perf Events. The development of perfmon2 has slowed now that Perf Events has been merged, and parts of the infrastructure (such as the libpfm4 library) have been adapted so that they can work with the Perf Events subsystem.

The goal of the perfmon2 project is to design a generic low-level performance monitoring interface to access all hardware performance monitoring unit (PMU) implementations and support a variety of performance tools. Although PMUs exhibit considerable variation between processor architectures and even within the same processor architecture, the perfmon2 developers aimed to exploit some common characteristics. For example, all modern PMUs use a register-based interface with simple read and write operations. A PMU commonly exports a set of control registers for configuring what is to be measured.

Perfmon2 developers aimed to support the diverse needs of performance tools. Some tools simply collect counts of events while others sample to collect profiles. Some collect measurements for a single thread (per-thread) while others measure the entire system (system-wide).

The interface consists of the perfmonctl system call. A system call was preferred over a device-driver model because it is built in and offers greater flexibility for the type and number of parameters. The system call also makes it easier for implementations to support a per-thread monitoring mode, which requires saving and restoring the PMU state on thread context switch. The entire PMU state is encapsulated by a software abstraction called the perfmon context.

The PMU hardware interface is abstracted by considering it to consist of two sets of registers: a set of control (PMC) registers and a set of data (PMD) registers. The interface exports all PMU registers as 64-bit data, even though many PMU implementations use fewer bits. The interface does not know what each register does, how many there are, or how they are associated with each other. The implementation for each platform maps the logical PMC and PMD registers onto the actual PMU registers. All event-specific information is relegated to the user level, where it can be easily encapsulated in a library.

The interface supports time-based sampling (TBS) at the user level. The sampling period is determined by a timeout. When the timeout occurs, a sample is recorded. The user can implement the timeout using existing interfaces such as setitimer(). The interface also supports event-based sampling (EBS), for which the sampling period is determined by the number of occurrences of an event rather than by time. This mode requires the PMU to be capable of generating an interrupt when a counter overflows.

3.5.3 Oprofile

Oprofile [213] is a low-overhead profiler for Linux which is capable of using hardware performance counters to profile processes, shared libraries, and the kernel (including device drivers). Oprofile is modeled after Compaq/HP's Digital Continuous Profiling Infrastructure for the Alpha processor and was first introduced in 2002 for Linux 2.4 kernels.

Oprofile does not modify the kernel, but rather uses a daemon to monitor hardware counter performance across the entire system. It works by allowing the user to specify a hardware event to be monitored and a count threshold for that event. It relies on interrupt-on-overflow to statistically profile where events are occurring in the system image. Using debug symbol tables, it can then map addresses back to functions to expose where the system is spending large fractions of time. More recent versions can also provide call-graph profiles.

3.6 PAPI in Detail

The PAPI library [64] was originally developed to address the problem of accessing the processor hardware counters found on a diverse collection of modern microprocessors in a portable manner. The original PAPI design consists of two internal layers: a large portable layer optimized for platform independence and a smaller hardware-specific layer containing platform-dependent code. By compiling and statically linking the independent layer with the hardware-specific layer, an instance of the PAPI library can be produced for a specific operating system and hardware architecture.
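As a minimal sketch of how the library is typically used (return codes should be checked against PAPI_OK in real code, and the two preset events are an arbitrary choice), an event set is created, events are added, and counts are read around a region of interest:

    /* Minimal sketch of aggregate counting with the PAPI low-level API. */
    #include <papi.h>
    #include <stdio.h>

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long values[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;

        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TOT_INS);   /* total instructions */
        PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */

        PAPI_start(eventset);
        /* ... region of interest ... */
        PAPI_stop(eventset, values);

        printf("instructions = %lld, cycles = %lld, IPC = %.2f\n",
               values[0], values[1],
               values[1] ? (double)values[0] / values[1] : 0.0);
        return 0;
    }

The same event set mechanism is used for native events and, in PAPI-C, for events supplied by non-processor components.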

3.6.1 Extension to Off-Processor Counters

Modern complex architectures have made it important to be able to access off-processor performance monitoring hardware. The current situation with off-processor counters is similar to the situation that existed with on-processor counters before PAPI. A number of different platform-specific interfaces exist, some of which are poorly documented or not documented at all.

PAPI, as of version 4.0, has been extended to encompass off-processor counters and sensors. Version 4.0 (called Component-PAPI, or PAPI-C) is a complete redesign of the PAPI library. The one-to-one relationship between the platform independent layer and a hardware specific layer has been replaced by a one-to-many coupling between the independent framework layer and a collection of hardware specific components. Using PAPI-C, it is now possible to access counters from different subsystems simultaneously, for example from processor, network, and thermal sensor components.

Available events and hardware information.
--------------------------------------------------------------------------------
PAPI Version             : 4.0.0.2
Vendor string and code   : GenuineIntel (1)
Model string and code    : Intel Core i7 (21)
CPU Revision             : 5.000000
CPUID Info               : Family: 6  Model: 26  Stepping: 5
CPU Megahertz            : 2926.000000
CPU Clock Megahertz      : 2926
Hdw Threads per core     : 1
Cores per Socket         : 4
NUMA Nodes               : 2
CPU’s per Node           : 4
Total CPU’s              : 8
Number Hardware Counters : 7
Max Multiplex Counters   : 32
--------------------------------------------------------------------------------
The following correspond to fields in the PAPI_event_info_t structure.

Name          Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM   0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM   0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM   0x80000002  Yes    Yes    Level 2 data cache misses
PAPI_L2_ICM   0x80000003  Yes    No     Level 2 instruction cache misses
PAPI_L3_DCM   0x80000004  No     No     Level 3 data cache misses
PAPI_L3_ICM   0x80000005  No     No     Level 3 instruction cache misses
PAPI_L1_TCM   0x80000006  No     No     Level 1 cache misses
PAPI_L2_TCM   0x80000007  Yes    No     Level 2 cache misses
PAPI_L3_TCM   0x80000008  No     No     Level 3 cache misses
...
PAPI_DP_OPS   0x80000068  Yes    Yes    Floating point operations
PAPI_VEC_SP   0x80000069  Yes    No     Single precision vector/SIMD instructions
PAPI_VEC_DP   0x8000006a  Yes    No     Double precision vector/SIMD instructions
--------------------------------------------------------------------------------
Of 107 possible events, 35 are available, of which 9 are derived.

FIGURE 3.1: Sample output from the papi_avail command.

Example components have been implemented in the initial release of PAPI-C for ACPI temperature sensors, the Myrinet network counters, and the lm-sensors interface. Work is ongoing on Infiniband and Lustre components.

3.6.2 Countable Events

Countable events in PAPI are either preset events (defined across architectures) or native events (unique to a specific architecture). To date, preset events have only been defined for processor hardware counters; all events on off-processor components are native events.

A preset event can be defined in terms of a single event native to a given CPU, or it can be derived as a linear combination of native events. More complex derived combinations of events can be expressed in reverse Polish notation and computed at run-time by PAPI. For many platforms, the preset event definitions are provided in a comma separated values file, papi_events.csv, which can be modified by developers to explore novel or alternative definitions of preset events. Because not all preset events are implemented on all platforms, a utility called papi_avail is provided to examine the list of available preset events on the underlying platform. A portion of the output for an Intel Nehalem (Core i7) processor is shown in Figure 3.1.

PAPI components contain tables of native event information that allow native events to be programmed in essentially the same way as preset events. Each native event may have a number of attributes, called unit masks, that can act as filters to determine exactly what gets counted. An example of a native event name with unit masks for the Intel Nehalem architecture is L2_DATA_RQSTS:DEMAND_M_STATE:DEMAND_I_STATE. Attributes can be appended in any order and combination, and are separated by colon characters. Some components, such as LM-SENSORS, may have hierarchically defined native events. An example of such a hierarchy is LM_SENSORS.max1617-i2c-0-18.temp2.temp2_input. In this case, levels of the hierarchy are separated by period characters. Complete listings of the native events available on a given platform can be obtained by running a utility analogous to papi_avail, called papi_native_avail. The PAPI low-level interface has support for advanced features such as multiplexing and sampling.
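As a hedged sketch of how a native event is programmed (error checking omitted; the event string is the Nehalem example above and would differ on other platforms), the name is first translated to an event code and then added to an event set like any preset event:

    /* Sketch: program a native event (with unit masks) by name. */
    #include <papi.h>
    #include <stdio.h>

    int main(void)
    {
        int eventset = PAPI_NULL, code;
        long long value;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;

        PAPI_create_eventset(&eventset);

        /* Translate the native event name to a PAPI event code. */
        if (PAPI_event_name_to_code("L2_DATA_RQSTS:DEMAND_M_STATE:DEMAND_I_STATE",
                                    &code) != PAPI_OK)
            return 1;
        PAPI_add_event(eventset, code);

        PAPI_start(eventset);
        /* ... region of interest ... */
        PAPI_stop(eventset, &value);
        printf("native event count = %lld\n", value);
        return 0;
    }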

3.7 Counter Usage Modes

Hardware performance monitors are used in one of two modes:

1. Counting mode, to collect aggregate counts of event occurrences.

2. Statistical sampling mode, to collect profiling data based on counter overflows.

Both modes have their uses in performance modeling, analysis, tuning, and in feedback-directed compiler optimization. In some cases, one mode is required or preferred over the other. Platforms vary in their hardware and operating system support for the two modes. Either mode may be derived from the other. On platforms that do not support hardware interrupt on counter overflow, timer interrupts can be used to periodically check for overflow and implement statistical sampling in software. On platforms that primarily support statistical profiling, event counts can be estimated by aggregating profiling data. The degree of platform support for a particular mode can greatly affect the accuracy of that mode.

3.7.1 Counting Mode

A first step in performance analysis is to measure the aggregate performance characteristics of the application or system under study. Aggregate event counts are determined by reading hardware event counters before and after the workload is run. Events of interest include cycle and instruction counts, cache and memory accesses at different levels of the memory hierarchy, branch mispredictions, and other processor events. Event rates, such as completed instructions per cycle, cache miss rates, and branch misprediction rates, can be calculated by dividing one event count by another or by the elapsed time.
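As a simple worked example (the numbers are illustrative, not taken from a real measurement): a run that retires 2.0 x 10^9 instructions in 2.5 x 10^9 cycles has an IPC of 2.0/2.5 = 0.8, and 4.0 x 10^6 L2 misses against 2.0 x 10^8 L2 accesses gives an L2 miss rate of 2%.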

3.7.2 Sampling Mode

Sampling is a powerful methodology that can give deeper insight into program behavior than that obtained through simple counting. Different platforms implement sampling in differing ways.

Intel PEBS: Precise Event-Based Sampling (PEBS) is an advanced sampling feature of the Intel Core-based processors in which the processor directly records samples into a designated memory region. With PEBS the format of the samples is mandated by the processor. Each sample contains the machine state of the processor at the time of counter overflow. The precision of PEBS comes from the instruction pointer recorded in each sample being at most one instruction away from where the counter actually overflowed. This instruction pointer difference (known as “skid”) is minimized compared to other methods of gathering performance data. Another key advantage of PEBS is that it minimizes overhead by only involving the operating system when the PEBS buffer is full. A constraint of PEBS is that it only works with certain events, and there is no flexibility in what is recorded in each sample. For instance, in system-wide mode the process identification is not recorded.

AMD IBS: Instruction-Based Sampling (IBS) is a sampling technique available on AMD Family 10h processors [30]. Traditional performance counter sampling is not precise enough to isolate performance issues to individual instructions. IBS, however, precisely identifies instructions responsible for sampled events. There are two types of IBS sampling. The first, IBS fetch sampling, is a statistical sampling method that counts completed fetch operations. When the number of completed fetch operations reaches the maximum fetch count (the sampling period), IBS tags the fetch operation and monitors that operation until it either completes or aborts. When a tagged fetch completes or aborts, a sampling interrupt is generated, and an IBS fetch sample is taken. An IBS fetch sample contains a timestamp, the identifier of the interrupted process, the virtual fetch address, and several event flags and values that describe what happened during the fetch operation. The second type of sampling is IBS op sampling, which selects, tags, and monitors macro-ops as issued from AMD64 instructions.

Two types of sample selection are available. Cycles-based selection counts CPU clock cycles; the op is tagged and monitored when the count reaches a threshold (the sampling period) and a valid op is available. The second option is dispatched op-based selection, which counts dispatched macro-ops. When the count reaches a threshold, the next valid op is tagged and monitored. With both options, an IBS sample is generated only if the tagged op retires. Thus, IBS op event information does not measure activity due to speculative execution. The execution stages of the pipeline monitor the tagged macro-op; when the tagged macro-op retires, the processor generates a sampling interrupt and takes an IBS op sample. An IBS op sample contains a timestamp, the identifier of the interrupted process, the virtual address of the AMD64 instruction from which the op was issued, and several event flags and values that describe what happened when the macro-op executed.
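At the library level, event-based sampling is typically exposed through an overflow callback. The following minimal sketch uses PAPI's PAPI_overflow call to invoke a handler roughly every one million cycles; the threshold and the choice of PAPI_TOT_CYC are arbitrary, and error checking is omitted.

    /* Sketch: event-based sampling via PAPI counter-overflow callbacks. */
    #include <papi.h>
    #include <stdio.h>

    static void overflow_handler(int eventset, void *address,
                                 long long overflow_vector, void *context)
    {
        (void)eventset; (void)overflow_vector; (void)context; /* unused here */
        /* Record the instruction address at which the overflow occurred. */
        printf("overflow at address %p\n", address);
    }

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long value;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;

        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TOT_CYC);

        /* Call the handler every 1,000,000 cycles. */
        PAPI_overflow(eventset, PAPI_TOT_CYC, 1000000, 0, overflow_handler);

        PAPI_start(eventset);
        /* ... region of interest ... */
        PAPI_stop(eventset, &value);
        return 0;
    }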

3.7.3 Data Address Sampling

Data address sampling captures the operand address of a memory instruction (load or store) upon the occurrence of a user-specified threshold for events such as data cache or TLB misses. Hardware support for data address sampling is available to some extent on several platforms, including IBM POWER5 [236], Intel processors [179], Sun UltraSparc [182], and AMD Family 10h processors (e.g., Barcelona and Shanghai) [30]. Hardware data address sampling is especially useful for analyzing performance on multicore architectures, since the interaction among cores typically occurs through the sharing of data [35].

3.8 Uses of Hardware Counters

Hardware performance counters have a variety of uses, including many which probably were not envisioned by the original designers. The most common use of performance counters is for performance optimization. Performance on a multicore system can often be evaluated by means of core counter metrics.

3.8.1 Optimizing Cache Usage

If each core has its own performance monitoring unit (as, for example, on the Intel Core Duo), and if it is possible to count the number of unhalted cycles on each core, then the sum of this metric over all the cores divided by the total execution time converted to cycles (assuming dynamic frequency scaling is disabled) will yield a scaling metric; a formula summarizing this metric is given at the end of this section. The scaling metric will range between 1 for completely serialized execution and n (where n = number of cores) for perfect scaling.

Poor scaling due to contention for a shared cache occurs when cache sharing results in increased cache line eviction and reloading. Faster and/or bigger caches may not help with this problem, since the issue is an effective reduction of useful cache size. Most PMUs provide an event that counts last level cache (LLC) misses. In the case of strong scaling, non-contention would mean that the number of LLC misses remains constant as the number of cores is increased. In the case of weak scaling, non-contention would mean that the number of misses per core remains constant.

If poor scaling is occurring due to cache contention, it may be possible to use hardware counter data to pin down the exact cause of the problem. Possible causes include:

1. True or false sharing of cache lines.

2. Parallel threads evicting each other's cache lines due to capacity limitations.

True or false cache line sharing can be detected on some processors by measuring a hardware event that counts the number of times an LLC cache miss is caused by the required cache line being in another cache in a modified state (known as a “hit modified” event; for example, EXT_SNOOPS.ALL_AGENTS.HITM on the Intel Core 2). Comparing the number of these events between single-threaded and multi-threaded executions can reveal when such problems are occurring. False sharing is likely to occur in parallelized loops for which the iterations have been distributed to different threads. Two threads evicting each other's cache lines due to competition for space would result in an increase in the total number of cache lines being evicted but not an increase in the hit-modified metric.
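For reference, the scaling metric described at the start of this section can be written as follows (an illustrative formulation; the symbols are ours, not from the original text):

    S = \frac{\sum_{i=1}^{n} C_i}{f \cdot T}

where C_i is the count of unhalted cycles on core i, f is the (fixed) core clock frequency, T is the elapsed wall-clock time, and n is the number of cores. S approaches 1 for completely serialized execution and n for perfect scaling.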

3.8.2 Optimizing Bus Bandwidth

A resource that is always shared by multiple cores on a chip is the front side bus (FSB). The FSB bandwidth is affected by a number of factors, including bus frequency, processor frequency, cache coherency requirements, and configuration of hardware prefetchers. The only reliable way to determine bus bandwidth on a specific platform is to measure it with a synthetic benchmark. Once the bus bandwidth has been determined, the likelihood of FSB saturation during parallel execution can be predicted from hardware counter measurements taken during a single-threaded execution. If the metric of cache lines transferred per cycle can be determined, then the expected value for a multi-threaded execution can be obtained by multiplying by the number of threads. Some platforms provide hardware counter metrics that break down the various contributions to FSB utilization.
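As a worked example of this prediction (all numbers are illustrative): if a single-threaded run transfers 0.01 cache lines per cycle, with 64-byte lines on a 3 GHz processor that corresponds to roughly 0.01 x 64 x 3 x 10^9 ≈ 1.9 GB/s; four such threads would then be expected to demand about 7.7 GB/s, which can be compared against the bandwidth measured by the synthetic benchmark to judge whether the FSB is likely to saturate.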

3.8.3 Optimizing Prefetching

Another problem that can occur on multicore systems, and that may become more of an issue as the number of cores is increased, is excessive cache line prefetching.


FIGURE 3.2: Performance counter measurements showing repeating program phase behavior in the mcf benchmark from SPEC CPU 2000. [The plot shows L1 D-cache miss rate (%) and CPI versus instruction interval (in units of 100M instructions) for 181.mcf with the default input.]

While prefetching can substantially improve performance for some applications, it can degrade performance for others by aggravating bus bandwidth limits. There are different types of hardware prefetchers available (some of which can be enabled or disabled, depending on the system configuration) as well as software prefetching. Some platforms provide hardware counter events related to hardware prefetches. These counters can be used to determine the number of unread cache lines that hardware prefetchers have requested. A nonzero value for a hardware event associated with a hardware prefetcher will indicate that the prefetcher is enabled.

3.8.4 TLB Usage Optimization

Chip multi-processor (CMP) systems may have shared DTLBs or a DTLB dedicated to each core. In either case, the ideal behavior is for the number of DTLB misses per core to decrease in proportion to the per-core problem size in the case of strong scaling, and to remain constant with weak scaling. An increase in DTLB misses, which can usually be measured by a hardware counter event, may be the cause of poor scaling behavior.

3.8.5 Other Uses

Performance counters can be used in ways other than code optimization. This includes:

1. Visualization of program behavior (see Figure 3.2).

2. Computer simulator validation [113].

3. Power or temperature estimation [305].

4. Optimized operating system schedulers [132].

5. Multi-thread deterministic execution [266].

6. Feedback directed optimization [91].

3.9 Caveats of Hardware Counters

While performance counters are a powerful tool, care must be taken when using the results.

3.9.1 Accuracy

Aggregate event counts are sometimes referred to as “exact counts,” while profiling is statistical in nature; however, sources of error exist for both modes. As in any physical system, the act of measuring perturbs the phenomenon being measured. The counter interfaces necessarily introduce overhead in the form of extra instructions (including system calls), and the interfaces cause cache pollution that can change the cache and memory behavior of the monitored application. The cost of processing counter overflow interrupts can be a significant source of overhead in sampling-based profiling. Furthermore, a lack of hardware support for precisely identifying an event's address may result in incorrect attribution of events to instructions on modern superscalar, out-of-order processors, thereby making profiling data inaccurate.

Hardware designers use performance counters for internal testing, and they publish their use mainly as a courtesy. When asked, they will admit that very little testing goes into the counters, and that caution should be exercised when using any counter (though commonly used ones such as retired instructions or cycle count can generally be counted on to be reliable). Before using an event, it is a good idea to read the documentation for your chip and check the vendor-supplied errata to make sure there are no known problems with the event. Also, a few sanity checks should be run on small programs to make sure the results are what you expect. Modern chips have advanced features that can cause unexpected results, such as hardware prefetchers that can vastly change cache behavior compared to a traditional view of how caches behave.

Another issue to worry about is multiplexing. When not enough counters are available, the operating system may multiplex the counters. This adds a certain amount of inaccuracy to the returned results.

3.9.2 Overhead

Even though performance counters have relatively low overhead, the overhead is enough that it can be measured. On a counter overflow, an interrupt routine interrupts program flow, which can cause timing (and even cache) behavior changes that would not happen if counters were not being used. Also, libraries such as PAPI add a certain amount of overhead each time a routine is called, which can show up in counter results.

3.9.3 Determinism

It is often expected that if the same program is run twice, the resulting performance counts should be the same between runs. Even if a program is specifically set up to be deterministic, various factors can cause different counts [356]. The operating system can place parts of the program in different parts of memory from run to run (address space randomization, a security feature of Linux that can be disabled). Some counts can vary due to operating system conditions and other programs running on the system. Branch prediction and cache miss rates can be affected by the behavior of other programs and by the operating system accessing memory. Especially on x86, some counters increment whenever a hardware interrupt happens, which can artificially increase counts. These determinism problems are magnified on multi-core workloads, as the different interacting threads are scheduled in a non-deterministic fashion.

3.10 Summary

Hardware performance counters are a powerful tool that can offer insight into the underlying system behavior. As machines become more complex, interactions between applications and systems become non-intuitive and harder to understand. This is compounded by the move to large massively parallel systems. The use of performance counters and tools such as PAPI can greatly increase understanding and enable performance improvements of difficult codes.

3.11 Acknowledgment

This work was sponsored in part by the Performance Engineering Research Institute, which is supported by the SciDAC Program of the U.S. Department of Energy, and by National Science Foundation grants NSF CNS 0910899 and NSF OCI 0722072.