Perfmon2: a Flexible Performance Monitoring Interface for Linux

Stéphane Eranian
HP Labs
[email protected]

2006 Linux Symposium, Volume One

Abstract

Monitoring program execution is becoming more than ever key to achieving world-class performance. A generic, flexible, and yet powerful monitoring interface to access the performance counters of modern processors has been designed. This interface allows performance tools to collect simple counts or profiles on a per-kernel-thread or system-wide basis. It introduces several innovations such as customizable sampling buffer formats and time- or overflow-based multiplexing of event sets. The current implementation for the 2.6 kernel supports all the major processor architectures. Several open-source and commercial tools based on the interface are available. We are currently working on getting the interface accepted into the mainline kernel. This paper presents an overview of the interface.

1 Introduction

Performance monitoring is the action of collecting information about the execution of a program. The type of information collected depends on the level at which it is collected. We distinguish two levels:

• the program level: the program is instrumented by adding explicit calls to routines that collect certain metrics. Instrumentation can be inserted by the programmer or the compiler, e.g., the -pg option of GNU cc. Tools such as HP Caliper [5] or Intel PIN [17] can also instrument at runtime. With those tools, it is possible to collect, for instance, the number of times a function is called, the number of times a basic block is entered, a call graph, or a memory access trace.

• the hardware level: the program is not modified. The information is collected by the CPU hardware and stored in performance counters. They can be exploited by tools such as OProfile and VTUNE on Linux. The counters measure the micro-architectural behavior of the program, e.g., the number of elapsed cycles, how many data cache stalls, how many TLB misses.

When analyzing the performance of a program, a user must answer two simple questions: where is time spent, and why is it spent there? Program-level monitoring can, in many situations, answer the first, albeit with high overhead, but the second question is best solved with hardware-level monitoring. For instance, gprof can tell you that a program spends 20% of its time in one function. The difficulty is to know why. Is this because the function is called a lot? Is this due to algorithmic problems? Is it because the processor stalls? If so, what is causing the stalls? As this simple example shows, the two levels of monitoring can be complementary.

Current CPU hardware trends are increasing the need for powerful hardware monitoring. New hardware features present the opportunity to gain considerable performance improvements through software changes. To benefit from a multi-threaded CPU, for instance, a program must become multi-threaded itself. To run well on a NUMA machine, a program must be aware of the topology of the machine to adjust memory allocations and thread affinity to minimize the number of remote memory accesses. On the Itanium [3] processor architecture, the quality of the code produced by compilers is a big factor in the overall performance of a program, i.e., the compiler must extract the parallelism of the program to take advantage of the hardware.

Hardware-based performance monitoring can help pinpoint problems in how software uses those new hardware features. An operating system scheduler can benefit from cache profiles to optimize the placement of threads and avoid cache thrashing in multi-threaded CPUs. Static compilers can use performance profiles to improve code quality, a technique called Profile-Guided Optimization (PGO). Dynamic compilers, in Managed Runtime Environments (MRE), can also apply the same technique. Profile-Guided Optimizations can also be applied directly to a binary by tools such as iSpike [11]. In virtualized environments, such as Xen [14], system managers can also use monitoring information to guide load balancing. Developers can also use this information to optimize the layout of data structures, improve data prefetching, and analyze code paths [13]. Performance profiles can also be used to drive future hardware requirements such as cache sizes, cache latencies, or bus bandwidth.

Hardware performance counters are logically implemented by the Performance Monitoring Unit (PMU) of the CPU. By nature, this is a fairly complex piece of hardware distributed all across the chip to collect information about key components such as the pipeline, the caches, and the CPU buses. The PMU is, by nature, very specific to each processor implementation, e.g., the Pentium M and Pentium 4 PMUs [9] have little in common. The Itanium processor architecture specifies the framework within which the PMU must be implemented, which helps develop portable software.

One of the difficulties in standardizing on a performance monitoring interface is ensuring that it supports all existing and future PMU models without preventing access to some of their model-specific features. Indeed, some models, such as the Itanium 2 PMU [8], go beyond just counting events: they can also capture branch traces, record where cache misses occur, or filter on opcodes.

In Linux and across all architectures, the wealth of information provided by the PMU is oftentimes under-exploited because of the lack of a flexible and standardized interface on which tools can be developed.

In this paper, we give an overview of perfmon2, an interface designed to solve this problem for all major architectures. We begin by reviewing what Linux offers today. Then, we describe the various key features of this new interface. We conclude with the current status and a short description of the existing tools.

2 Existing interfaces

The problem with performance monitoring in Linux is not the lack of an interface, but rather the multitude of interfaces. There are at least three interfaces:

• OProfile [16]: it is designed for DCPI-style [15] system-wide profiling. It is supported on all major architectures and is enabled by major Linux distributions. It can generate a flat profile and a call graph per program. It comes with its own tool set, such as opcontrol. Prospect [18] is another tool using this interface.

• perfctr [12]: it supports per-kernel-thread and system-wide monitoring for most major processor architectures, except for Itanium. It is distributed as a stand-alone kernel patch. The interface is mostly used by tools built on top of the PAPI [19] performance toolkit.

• VTUNE [10]: the Intel VTUNE performance analyzer comes with its own kernel interface, implemented by an open-source driver. The interface supports system-wide monitoring only and is very specific to the needs of the tool.

All these interfaces have been designed with a specific measurement or tool in mind. As such, their design is somewhat limited in scope, i.e., they typically do one thing very well. For instance, it is not possible to use OProfile to count the number of retired instructions in a thread. The perfctr interface is the closest match to what we would like to build, yet it has some shortcomings. It is very well designed and tuned for self-monitoring programs, but sampling support is limited, especially for non-self-monitoring configurations.

With the current situation, it is not necessarily easy for developers to figure out how to write or port their tools. There is the question of the functionality of each interface and then the question of distributions, i.e., which interface ships with which distribution. We believe this situation does not make it attractive for developers to build modern tools on Linux. In fact, Linux is lagging in this area compared to commercial operating systems.

3 Design choices

First of all, it is important to understand why a kernel interface is needed. A PMU is accessible through a set of registers. Typically those registers are only accessible, at least for writing, at the highest privilege level of execution (pl0 or ring0), which is where only the kernel executes. Furthermore, a PMU can trigger interrupts, which need kernel support before they can be converted into a notification to a user-level application, such as a signal. For those reasons, the kernel needs to provide an interface to access the PMU.

The goal of our work is to solve the hardware-based monitoring interface problem by designing a single, generic, and flexible interface that supports all major processor architectures. The new interface is built from scratch and introduces several innovations. At the same time, we recognize the value of certain features of the other interfaces and we try to integrate them wherever possible.

The interface is designed to be built into the kernel. This is key for developers, as it ensures that the interface will be available and supported in all distributions.

To the extent possible, the interface must allow existing monitoring tools to be ported without much difficulty. This is useful to ensure undisrupted availability of popular tools such as VTUNE or OProfile, for instance.

The interface is designed from the bottom up, first looking at what the various processors provide and building up an operating system interface to access the performance counters in a uniform fashion. Thus, the interface is not designed for a specific measurement or tool.

4 Core Interface

The interface leverages a common property of
