Computationally Efficient Multiplexing of Events on Hardware Counters

Robert V. Lim, David Carrillo-Cisneros, Wail Y. Alkowaileet, and Isaac D. Scherson
University of California, Irvine
[email protected] [email protected] [email protected] [email protected]

Abstract

This paper proposes a novel approach for scheduling n performance monitoring events onto m hardware performance counters, where n > m. Whereas existing scheduling approaches overlook monitored task information, the proposed approach utilizes the monitored task's behavior and schedules the combination of the most costly events. The proposed algorithm was implemented in the Linux Perf Event subsystem in kernel space (build 3.11.3), which provides finer granularity and less system perturbation in event monitoring when compared to existing user space approaches. Benchmark experiments in the PARSEC and SPLASH.2x suites compared the existing round-robin scheme with the proposed rate-of-change approach. Results demonstrate that the rate-of-change approach reduces the mean-squared error on average by 22%, confirming that the proposed methodology not only improves the accuracy of performance measurements read, but also makes scheduling multiple event measurements feasible with a limited number of hardware performance counters.

1 Introduction

Modern performance tools (PAPI, Perf Event, Intel vTune) incorporate hardware performance counters in systems monitoring by sampling low-level hardware events, where each performance monitoring counter (PMC) is programmed to count the number of occurrences of a particular event, and its counts are periodically read from these registers. The monitored results collectively can provide insights into how the task behaves on a particular architecture. Projecting performance metrics such as instructions-per-cycle (IPC), branch mispredictions, and cache utilization rates not only helps analysts identify hotspots, but can lead to code optimization opportunities and performance tuning enhancements. Hardware manufacturers provide hundreds of performance events that can be programmed onto the PMCs for monitoring. For instance, Intel provides close to 200 events for the current i7 architecture [6], while AMD provides close to 100 events [12]. Other architectures that provide event monitoring capabilities include NVIDIA's nvprof and Qualcomm's Adreno profilers [11, 14]. While manufacturers have provided an exhaustive list of event types to monitor, the issue is that microprocessors usually provide two to six performance counters for a given architecture, which restricts the number of events that can be monitored simultaneously.

Calculating performance metrics involves n low-level hardware events, while modern microprocessors provide m physical counters (two to six), making scheduling multiple performance events impractical when n > m. A single counter can monitor only one event at a time, which means that two or more events assigned to the same register cannot be counted simultaneously (conflicting events) [9].

Monitoring more events than available counters can be achieved with time interpolation techniques, such as multiplexing and trace alignment. Multiplexing consists of scheduling events for a fraction of the execution and extrapolating the full behavior of each metric from its samples. Trace alignment, on the other hand, involves collecting separate traces for each event run and combining the independent runs into a single trace.

Current approximation techniques for reconstructing event traces yield estimation errors, which provide inaccurate measurements for performance analysts [10]. The estimation error increases with multiplexing because each event timeshares the PMC with the other events, which results in loss of information when the event is not being monitored at a sampled interval. Trace alignment may not be feasible in certain situations, where taking multiple runs of the same application for performance monitoring might take days or weeks to complete. In addition, the authors have shown that between-runs variability affects the correlation between the sampled counts for monitored events, due to hardware interrupts, cache contention, and system calls [16]. Current implementations schedule monitoring events in a round-robin fashion, ignoring any information about the program task. Opportunities for better event scheduling exist if information about the behavior of the task is taken into account, an area we address in this paper.

This paper is organized as follows. Section 2 discusses previous work. Our multiplexing methodology is presented in Section 3. Section 4 evaluates the experimental results. Lastly, Section 5 concludes with future work.

2 Previous Work

To the best of our knowledge, there has not been any prior work similar to our methodology for multiplexing n performance events onto m hardware counters. The next subsections discuss several performance monitoring tools and their respective multiplexing strategies.

2.1 Performance Monitoring Tools

Performance monitoring tools provide access to hardware performance counters either through user space or kernel space.

The Performance Application Programming Interface (PAPI) is an architecture-independent framework that provides access to generalized high-level hardware events for modern processors, and low-level native events for a specific processor [2]. PAPI incorporates MPX and a high resolution interval timer to perform counter multiplexing [9]. The TAU Performance System, which integrates PAPI, is a probe-based instrumentation framework that profiles applications, libraries, and system codes, where execution of probes becomes part of the normal control flow of the program [15]. PAPI's ease of use and feature-rich capabilities make the framework a top choice in systems running UNIX/Linux, ranging from traditional microprocessors to high-performance heterogeneous architectures.

Perfmon2, a generic kernel-level performance monitoring interface, provides access to the hardware performance monitoring unit (PMU) and supports a variety of architectures, including Cray X2, Intel, and IBM PowerPC [4]. Working at the kernel level provides fine granularity and less system perturbation when accessing hardware performance counters, compared to user space access [17]. Scheduling multiple events in Perfmon2 is handled via round-robin, where the order of event declaration determines its initial position in the queue. The Linux Perf Event subsystem is a kernel-level monitoring platform that also provides multi-architectural support (x86, PowerPC, ARM, etc.) [13]. Perf has been mainlined in the Linux kernel, making the Perf monitoring tool available in all Linux distributions. Our proposed methodology was implemented in Perf Event.

2.2 Perf Event in Linux

Perf Event samples monitoring events asynchronously, where users set a period (at every ith interval) or a frequency (the number of occurrences of events). Users declare an event to monitor by creating a file descriptor, which provides access to the performance monitoring unit (PMU). The PMU state is loaded onto the counter register with a perf_install_in_context call. Similar to Perfmon2, the current criterion for multiplexing events is round-robin.

A monitored Perf Event can be affected in the following three scenarios: hrtimer, scheduler tick, and interrupt context.
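Before looking at those three paths, the listing below is a minimal user-space sketch of declaring one generalized hardware event through the perf_event_open system call and reading its count. It is our own illustration, not code from the Perf Event subsystem; the event choice and sample period are arbitrary, and the kernel-side perf_install_in_context step happens behind this interface.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <linux/perf_event.h>

    /* glibc provides no wrapper for this system call. */
    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;           /* generalized event    */
        attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* retired instructions */
        attr.sample_period = 4000000;             /* illustrative period  */
        attr.disabled = 1;

        /* One file descriptor per monitored event for the calling task. */
        fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... workload to be monitored would run here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("instructions: %llu\n", (unsigned long long)count);

        close(fd);
        return 0;
    }

When more events are opened than there are physical counters, the kernel falls back to the multiplexing paths described next.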

2.2.1 hrtimer

hrtimer [5] is a high resolution timer that is triggered when the PMU is overcommitted [7]. hrtimer invokes rotate_ctx, which performs the actual multiplexing of events on the hardware performance counters and is where our rate-of-change algorithm is implemented.

2.2.2 Scheduler tick

Performance monitoring events and their count values are removed and reprogrammed on the PMU registers during each scheduler tick, usually set at HZ times per second.

2.2.3 Interrupt context

A non-maskable interrupt (NMI) triggers a PMU interrupt handler during a hardware overflow, usually when a period declared by the user has been reached. perf_rotate_context completes the interrupt context by multiplexing the events on the PMC.

Our methodology uses hrtimer to perform time-division multiplexing. That is, each time the hrtimer fires, Perf Event timeshares the events across the performance counters. During the intervals in which an event is not monitored, its counts are estimated by linear interpolation [9].

2.3 Linear interpolation

To define linear interpolation for asynchronous event sampling, we first define a sample, and then use a pair of samples to construct a linear interpolation.

A sample s_i = (t_i, c_{t_i}) is the i-th sample of a PMC counting the occurrences of an arbitrary event. The sample s_i occurs at time t_i and has a value c_{t_i}. We define

    k_i = c_{t_i} - c_{t_{i-1}}    (1)
    n_i = t_i - t_{i-1}            (2)

as the increments between samples s_{i-1} and s_i for an event's count and time, respectively. The slope of the linear interpolation between the two samples is defined as follows:

    m_i = k_i / n_i                (3)

Since all performance counters store non-negative integers, 0 ≤ k_i, 0 ≤ n_i, and 0 ≤ m_i for all i. An event sample represents a point in an integer lattice. Figure 1 displays sampled values and the variables defined above.

Figure 1: An example of a sequence of sampled values. The sampling times t_1, ..., t_4 and counts c_{t_0} = 0, ..., c_{t_4} are known.

Figure 2: Triangle representing the rate-of-change calculation for the recent three observations A, B, and C.

3 Multiplexing Methodology

Time interpolation techniques for performance monitoring events have shown large accuracy errors when reconstructing event traces [10]. Although increasing the number of observations correlates with more accurate event traces, taking too many samples may adversely affect the quality of the monitored behavior, since each sample perturbs the system. Linear interpolation techniques have demonstrated their effectiveness in reconstructing unobserved event traces [8, 9].

Our rate-of-change algorithm increases the amount of monitoring time for events that do not behave linearly. Our intuition is that higher errors will occur for non-linearly behaved events when reconstructing event traces with linear interpolation techniques. The current round-robin scheme does not detect varying behavior since it schedules each event indiscriminately, missing the opportunity to reduce scheduling time for events where linear interpolation would have been sufficient for reconstruction.
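As a concrete illustration of the linear interpolation of Section 2.3 that this reconstruction relies on, the sketch below estimates a count between two samples using the increments and slope of Equations (1)-(3). It is our own example with made-up sample values, not code taken from the kernel.

    #include <stdio.h>
    #include <stdint.h>

    /* A sample s_i = (t_i, c_{t_i}): sampling time and accumulated count. */
    struct sample {
        uint64_t t;
        uint64_t c;
    };

    /*
     * Estimate the count at time t, with prev.t <= t <= next.t, from the
     * slope m_i = k_i / n_i of Equation (3).  Counts and times are
     * non-negative integers, so the estimate is monotone between samples.
     */
    static uint64_t interpolate(struct sample prev, struct sample next,
                                uint64_t t)
    {
        uint64_t k = next.c - prev.c;   /* count increment, Eq. (1) */
        uint64_t n = next.t - prev.t;   /* time increment,  Eq. (2) */

        if (n == 0)                     /* degenerate interval      */
            return prev.c;
        return prev.c + k * (t - prev.t) / n;
    }

    int main(void)
    {
        struct sample a = { .t = 10, .c = 1000 };
        struct sample b = { .t = 18, .c = 5000 };

        /* Estimated count at t = 14, halfway between the two samples. */
        printf("estimated count: %llu\n",
               (unsigned long long)interpolate(a, b, 14));
        return 0;
    }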

Figure 3: Triangle cost function for events e1, e2, and e3.

3.1 Rate-of-Change Definition

To motivate our proposed scheduling algorithm, we define rate-of-change as follows. Let A_{x,y}, B_{x,y}, and C_{x,y} be the last three observations for a given event in the current program run, where x and y represent the observed time and count, respectively (Sec. 2.3). A triangle can be constructed in an integer lattice using the three observations, where the triangle's area represents the loss of information if B_{x,y} were skipped (Fig. 2). Based on the principle of locality in applications [3], we hypothesize that the loss of information due to an unknown B_{x,y} is similar to the potential loss of information due to skipped future observations. In Figure 2, the dashed line represents an extracted trace from a linear interpolator, provided that A_{x,y} and C_{x,y} are known, and serves as our scheduling criterion.

3.2 Triangle Cost Function

The triangle cost function is calculated as follows. For any three observations A_{x,y}, B_{x,y}, and C_{x,y}, we have

    δ_y = (C_y - A_y) / (C_x - A_x) · (B_x - A_x)
    D_x   = B_x,    D_y   = δ_y + A_y
    D'_x  = B_x,    D'_y  = A_y
    D''_x = B_x,    D''_y = C_y

Definition 1. The rate-of-change in observations A_{x,y}, B_{x,y}, and C_{x,y} is given by the sum of the areas △ABD and △BCD.

△ABD is calculated as follows:

    △ABD = (B_x - A_x) / 2 · (B_y - A_y - δ_y)    (4)

The scheduling cost, C_ABD, is determined as follows:

    C_ABD = |B_y - A_y - δ_y| / 2 · δ_t    (5)

Calculations for △BCD and C_BCD are similar to (4) and (5), respectively:

    △BCD = (C_x - B_x) / 2 · (B_y - A_y - δ_y)    (6)

    C_BCD = |B_y - A_y - δ_y| / 2 · δ_t    (7)

3.3 Rate-of-Change as Scheduling Criteria

The rate-of-change algorithm calculates a cost function based on the recent event observations to determine whether a monitored event should be scheduled next. A smaller triangle area implies less variability, since the observations are close to a linear slope and are easier to extrapolate. Conversely, a greater triangle area reflects sudden changes in the observations, or higher variability, which means that those events should be prioritized over others.

Figure 3 illustrates how the rate-of-change algorithm calculates the scheduling cost function for each event e1, e2, and e3 with respect to the linear interpolator (red dashed line). The triangle area shaded in green represents the cost of scheduling a particular e_i. Event e2 exhibits the most nearly linear behavior, and will be placed at the rear of the scheduling queue. Note that the constructed triangle for e3 reflects across the linear interpolator, which is addressed with the absolute value sign in Equation 5. The objective of our rate-of-change algorithm is to "punish" non-linearly behaved events by scheduling those events at the head of the queue. After running our algorithm, the scheduling queue will be arranged as follows: {e1, e3, e2}.
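The following self-contained sketch applies the cost function of Equations (4)-(7) to the three-event example of Figure 3: each event's cost is computed from its last three observations, and the queue is then ordered by decreasing cost so that the least linear event is scheduled first. The observation values and δ_t are invented for illustration, and floating point is used here for readability; the in-kernel version relies on integer arithmetic only.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* One observation (x = time, y = count), as in Section 3.1. */
    struct obs { double x, y; };

    struct event {
        const char *name;
        struct obs A, B, C;   /* last three observations       */
        double delta_t;       /* time since last scheduled     */
        double cost;          /* scheduling cost, Equation (5) */
    };

    /*
     * C_ABD = |B_y - A_y - delta_y| / 2 * delta_t, where delta_y is the
     * height of the A-C linear interpolator at B_x (Section 3.2).
     */
    static double roc_cost(const struct event *e)
    {
        double dx = e->C.x - e->A.x;
        double delta_y = (dx == 0.0) ? 0.0
                       : (e->C.y - e->A.y) / dx * (e->B.x - e->A.x);
        return fabs(e->B.y - e->A.y - delta_y) / 2.0 * e->delta_t;
    }

    /* Sort descending by cost: most costly (least linear) event first. */
    static int by_cost_desc(const void *p, const void *q)
    {
        const struct event *a = p, *b = q;
        return (a->cost < b->cost) - (a->cost > b->cost);
    }

    int main(void)
    {
        struct event ev[3] = {
            { "e1", {0, 0}, {5, 60}, {10, 50}, 1.0, 0.0 }, /* bursty      */
            { "e2", {0, 0}, {5, 25}, {10, 50}, 1.0, 0.0 }, /* near-linear */
            { "e3", {0, 0}, {5,  5}, {10, 50}, 1.0, 0.0 }, /* lagging     */
        };

        for (int i = 0; i < 3; i++)
            ev[i].cost = roc_cost(&ev[i]);
        qsort(ev, 3, sizeof(ev[0]), by_cost_desc);

        /* Prints e1, e3, e2: the near-linear event goes to the rear. */
        for (int i = 0; i < 3; i++)
            printf("%s (cost %.1f)\n", ev[i].name, ev[i].cost);
        return 0;
    }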

    Hardware                              Cache                            Cache
    Event                      Period     Event                Period     Event                 Period
    instructions               4,000,000  L1-dCache-hits       1,000,000  dTLB-Cache-hits       1,100,000
    cache-references           9,500      L1-dCache-misses     7,000      dTLB-Cache-misses     750
    cache-misses               150        L1-dCache-prefetch   7,000      iTLB-Cache-hits       4,200,000
    branch-instructions        850,000    L1-iCache-hits       1,800,000  iTLB-Cache-misses     500
    branch-misses              10,000     L1-iCache-misses     50,000     bpu-Cache-hits        900,000
    stalled-cycles (frontend)  15,000     LL-Cache-hits        3,100      bpu-Cache-misses      600,000
    stalled-cycles (backend)   1,500,000  LL-Cache-misses      75         node-Cache-hits       70
    ref-cpu-cycles             2,000,000  LL-Cache-prefetch    500        node-Cache-misses     70
                                                                          node-Cache-prefetch   50

Table 1: Generalized Perf Events and their period settings.

Event starvation prevented

Our rate-of-change scheduling criterion prevents event starvation because the cost of scheduling an event increases as the time since it was last scheduled (δ_t) increases, which keeps any event from going unscheduled indefinitely (Eq. 5).

Computationally efficient

Our rate-of-change algorithm is computationally efficient because for every hrtimer trigger, only two integer multiplications and two integer divisions take place. Below is a snippet of C code for our computation:

    delta_y = ((Cy-Ay)/(Cx-Ax)) * (Bx-Ax);
    cost = ((By-Ay-delta_y) / 2) * d_time;

In cases where Cx − Ax = 0, δ_y is set to 0, which implies Cx = Ax, i.e., no changes have occurred since observation A_x.

4 Experiments and Results

4.1 Experiments

In order to verify our proposed rate-of-change methodology, we ran a subset of the PARSEC benchmarks [1] and SPLASH.2x benchmarks [18], listed in Table 2, while sampling the events from the PMC periodically. Note that SPLASH.2x is the SPLASH-2 benchmark suite with bigger inputs drawn from actual production workloads. Our goal was to make use of hrtimer in our multiplexing methodology, which was released in kernel version 3.11.rc3 [7].

The generalized Perf Events for this experiment, listed in Table 1, were periodically sampled for each benchmark run. Each of the periods was determined by trial-and-error using the following formula:

    samples/msec = μ_nr.samples / μ_t.elapsed    (8)

The ideal rate is 1 sample per millisecond, where the number of samples taken for an event is proportional to the time spent monitoring the event.

To account for between-runs variability, each of the events listed in Table 1 was executed ten times for each benchmark package with simlarge as input, which ran on average 15 seconds each. Each execution consisted of programming one single event in the PMC to disable multiplexing, which serves as the baseline comparison. In addition, the events were multiplexed with the performance counters in the same run under the current round-robin scheme and with our rate-of-change methodology. Each of the ten runs was sorted in ascending order, based on counts for each scheduling strategy, and the fourth lowest value was selected as our experimental result for the single event baseline comparison and for the two multiplexing strategies (round-robin, rate-of-change).

Our experiments ran on an Intel i7 Nehalem processor with four hardware performance counters. Table 3 lists the machine configuration for this experiment. CPU frequency scaling, which helps manage CPU power consumption, was disabled to make all CPUs run consistently at the same speed (four CPUs running at 2.8 GHz, in our case).
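For completeness, the sketch below spells out the samples-per-millisecond check of Equation (8) with hypothetical numbers: if a candidate period yields more (or fewer) than the target of 1 sample per millisecond, the period is scaled proportionally. This only illustrates the trial-and-error tuning used for Table 1; it is not a tool from the paper.

    #include <stdio.h>

    /* Equation (8): samples/msec = mean(nr_samples) / mean(elapsed msec). */
    static double samples_per_msec(double mean_nr_samples,
                                   double mean_elapsed_msec)
    {
        return mean_nr_samples / mean_elapsed_msec;
    }

    /* Scale the sample period proportionally toward 1 sample/msec:
     * a rate of 2 samples/msec doubles the period, halving the rate. */
    static unsigned long long retune_period(unsigned long long period,
                                            double rate)
    {
        return (unsigned long long)(period * rate);
    }

    int main(void)
    {
        unsigned long long period = 1000000;               /* candidate      */
        double rate = samples_per_msec(30000.0, 15000.0);  /* 2 samples/msec */

        printf("rate           = %.2f samples/msec\n", rate);
        printf("retuned period = %llu\n", retune_period(period, rate));
        return 0;
    }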

    Platform    Package        Application Domain   Description
    PARSEC      blackscholes   Financial Analysis   Calculates portfolio price using Black-Scholes PDE.
                bodytrack      Computer Vision      Tracks a 3D pose of a markerless human body.
                canneal        Engineering          Minimizes routing costs for synthetic chip design.
                vips           Media Processing     Applies a series of transforms to images.
    SPLASH.2x   cholesky       HPC                  Factors a sparse matrix into the product of lower triangular matrices.
                fmm            HPC                  Simulates interaction in 2D using the Fast Multipole method.
                ocean_cp       HPC                  Large scale movements based on eddy and boundary currents.
                water_spatial  HPC                  Evaluates forces and potentials in a system of molecules.

Table 2: PARSEC and SPLASH.2x Benchmarks.

The rate-of-change algorithm hooked into rotate_ctx (Sec. 2.2.1) and the final counts were exported as a .csv file from the Perf Event subsystem.

    Type                     Events
    Architecture             Intel Core i7, M640, 2.80 GHz (Nehalem)
    Performance counters     IA32_PMC0, IA32_PMC1, IA32_PMC2, IA32_PMC3
    Operating system         Linux kernel 3.11.3-rc3
    CPU frequency scaling    Disabled
    CPU speed                2.8 GHz

Table 3: Machine configuration.

    Benchmark       Improvement
    blackscholes    7.46
    bodytrack       5.81
    canneal         3.02
    vips            1.48
    cholesky        4.45
    fmm             1.02
    ocean_cp        4.67
    water_spatial   2.69

Table 4: Improvement (%) per benchmark, averaged over event types.

4.1.1 MSE on final counts

We compared our multiplexing technique with the existing round-robin approach by applying a statistical estimator to the mean μ calculated from the single event runs. We used the mean-squared error (MSE) to compare each mean μ_roc and μ_rr from the multiplexed runs against the mean μ_base from the non-multiplexed run. MSE is defined as follows:

    MSE = E[(θ̂ − θ)²] = B[θ̂]² + var[θ̂]    (9)

4.2 Results

Results show significant improvements for all benchmarks and all events for our rate-of-change approach, when compared with round-robin. Figure 4 compares the accuracy improvement for the two multiplexing strategies. For instructions, performance improved with blackscholes (22.1%) when compared with round-robin (8.5%). cache-misses had the biggest gain, with close to 95.5% accuracy for both the blackscholes and water_spatial benchmarks. ref-cpu-cycles also performed well with rate-of-change (Fig. 4b). Figure 5 shows accuracy rates for L1-data-cache events, including hits, misses, and prefetches. For all three events, our rate-of-change approach outperformed round-robin.

Table 4 shows the percentage improvement for each benchmark when averaged across event types. The positive improvement rates indicate that our rate-of-change methodology profiles these applications better than the round-robin scheme.

Table 5 shows the improvement for each event type when averaged across benchmarks. Some of the top performers include instructions, cache-misses, and node-prefetches. However, there were also some poor performers. For instance, branch-misses had -10.23%, while iL1-misses had -8.99%. System perturbation across different runs may have skewed these numbers. It is possible that round-robin may have been sufficient for scheduling those events, and that our algorithm may have had an adverse effect on those results. These averages only provide an overview of how certain events might behave on a particular benchmark.
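Before turning to the event-level breakdown in Figure 6, the sketch below spells out how Equation (9) is applied: the mean-squared error of a set of per-run means under a multiplexing scheme is computed against the non-multiplexed baseline mean, as squared bias plus variance. The numbers are hypothetical; this illustrates the comparison, not the scripts used for the paper's analysis.

    #include <stdio.h>

    /*
     * Equation (9): MSE(theta_hat) = E[(theta_hat - theta)^2]
     *                              = B[theta_hat]^2 + var[theta_hat],
     * where theta is the baseline (non-multiplexed) mean count and the
     * theta_hat values are per-run means under a multiplexing scheme.
     */
    static double mse(const double *est, int n, double baseline)
    {
        double mean = 0.0, var = 0.0, bias;
        int i;

        for (i = 0; i < n; i++)
            mean += est[i];
        mean /= n;

        for (i = 0; i < n; i++)
            var += (est[i] - mean) * (est[i] - mean);
        var /= n;

        bias = mean - baseline;
        return bias * bias + var;
    }

    int main(void)
    {
        /* Hypothetical per-run mean counts for the two schemes. */
        double rr[]  = { 950.0, 1020.0, 900.0, 1080.0 };
        double roc[] = { 985.0, 1010.0, 990.0, 1015.0 };
        double baseline = 1000.0;

        printf("MSE round-robin    : %.1f\n", mse(rr, 4, baseline));
        printf("MSE rate-of-change : %.1f\n", mse(roc, 4, baseline));
        return 0;
    }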

Figure 6: Decrease in MSE for all 25 multiplexed events, when comparing rate-of-change over the round-robin scheduling scheme. Each event set displays improvements for each of the eight benchmarks. (Y-axis: decrease in mean-squared error, %; x-axis: event type.)

Figure 4: Accuracy of the multiplexing strategies (round-robin, rate-of-change) with respect to the baseline trace for select hardware events — hw.instructions, hw.ref-cpu-cycles, and hw.cache-misses — across the eight benchmarks (more is better).

Figure 5: Accuracy of the multiplexing strategies (round-robin, rate-of-change) with respect to the baseline trace for L1-data-cache events — hits, misses, and prefetch — across the eight benchmarks (more is better).

    instr     c.ref     c.miss     br.in      br.mi
    10.96     11.35     53.04      7.48       -10.23
    iTLB.h    iTLB.m    st.cyc.f   st.cyc.b   ref.cyc
    -1.11     -0.82     -7.19      -6.44      4.4
    dTLB.h    dTLB.m    LLC.h      LLC.m      LLC.p
    -1.46     -1.59     -8.48      -0.95      -0.95
    iL1.h     iL1.m     dL1.h      dL1.m      dL1.p
    -0.26     -8.99     7.09       7.46       -4.01
    bpu.h     bpu.m     node.h     node.m     node.p
    -1.6      -1.58     2.71       -0.62      47.04

Table 5: Improvement (%) per event type, averaged over benchmarks.

Figure 6 shows the proposed methodology's improvement as a decrease in mean-squared error for round-robin versus rate-of-change with respect to the baseline trace. Each event set shows performance for each of the eight benchmarks. The red horizontal line indicates that on average, performance gains of 22% were witnessed when comparing our rate-of-change approach with round-robin. With the exception of vips (cycles), cholesky (iL1-misses), and fmm (dL1-misses), most of the events profiled have shown substantial improvements. Some notable standouts include cache-misses: blackscholes (96%), vips (65%), cholesky (32%), fmm (59%); cache-references: blackscholes (24%), vips (12%), cholesky (7%), fmm (17%); and node-prefetches: blackscholes (89%), vips (40%), cholesky (51%), fmm (37%).

5 Future Work and Conclusion

5.1 Future Work

Our scheduling approach has demonstrated that performance measurement accuracy can increase when incorporating information about the behavior of a program task. Since architectural events are highly correlated in the same run (e.g. hardware interrupts, system calls), one extension to our multiplexing technique would be to incorporate correlated information into calculating the scheduling cost. Information about execution phases, in addition to temporal locality, can facilitate creating an even more robust scheduler, which can lead to improved profiling accuracy. In addition, our multiplexing methodology can serve as an alternative to round-robin scheduling in other areas that utilize real-time decision-making, including task scheduling and decision-support systems.

5.2 Conclusion

Multiplexing multiple performance events is necessary because event counts from the same run are more correlated to each other than counts from different runs. In addition, the limited number of m hardware counters provided by architecture manufacturers makes scheduling n performance events impractical, where n > m. Our rate-of-change scheduling algorithm has provided an increase in accuracy over the existing round-robin method, improving the precision of the total counts while allowing multiple events to be monitored.

References

[1] C. Bienia, "Benchmarking Modern Microprocessors," Princeton University, Jan. 2011.

[2] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, "A Portable Programming Interface for Performance Evaluation on Modern Processors," International Journal of High Performance Computing Applications, 14(3):189-204, June 2000.

[3] L. Campos and I. Scherson, "Rate of Change Load Balancing in Distributed and Parallel Systems," IEEE Symposium on Parallel and Distributed Processing, Apr. 1999, doi:10.1109/IPPS.1999.760552.

[4] S. Eranian, "Perfmon2: A Flexible Performance Monitoring Interface For Linux," Proceedings of the Linux Symposium, vol. 1, July 2006.

[5] T. Gleixner and D. Niehaus, "Hrtimers and Beyond: Transforming the Linux Time Subsystems," Proceedings of the Linux Symposium, vol. 1, July 2006.

[6] Intel 64 and IA-32 Architectures Software Developer's Manual. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf. Intel Corporation, Sept. 2013.

[7] hrtimer for Event Multiplexing. https://lkml.org/lkml/2012/9/7/365. Linux Kernel Mailing List, Sept. 2012.

[8] W. Mathur and J. Cook, "Improved Estimation for Software Multiplexing of Performance Counters," 13th IEEE Intl. Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS'05), 2005.

[9] J. May, "MPX: Software for Multiplexing Hardware Performance Counters in Multithreaded Programs," IEEE International Parallel and Distributed Processing Symposium, 2001.

[10] T. Mytkowicz, "Time Interpolation: So Many Metrics, So Few Registers," 40th IEEE/ACM International Symposium on Microarchitecture, 2007, doi:10.1109/MICRO.2007.27.

[11] NVIDIA nvprof Profiler. http://docs.nvidia.com/cuda/profiler-users-guide/#axzz33xUbGBmO. NVIDIA CUDA Toolkit v6.0.

[12] AMD Athlon Events. http://oprofile.sourceforge.net/docs/amd-athlon-events.php. OProfile website.

[13] Linux Kernel Profiling with perf. https://perf.wiki.kernel.org/index.php/Tutorial. Creative Commons 3.0, 2010.

[14] Qualcomm Adreno Profiler. https://developer.qualcomm.com/mobile-development/maximize-hardware/mobile-gaming-graphics-adreno/tools-and-resources. Qualcomm Developer Network.

[15] S. Shende and A. D. Malony, "The TAU Parallel Performance System," International Journal of High Performance Computing Applications, 20(2):287-311, Summer 2006.

[16] V. Weaver, D. Terpstra, and S. Moore, "Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations," ISPASS Workshop, April 2013.

[17] V. Weaver, "Linux perf_event Features and Overhead," 2013.

[18] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," International Symposium on Computer Architecture, 1995, doi:10.1145/223982.223990.