
Hardware Counter Statistics As A First-Class System Resource

Xiao Zhang    Sandhya Dwarkadas    Girts Folkmanis    Kai Shen
Department of Computer Science, University of Rochester

Abstract

Today's processors provide a rich source of statistical information on program execution characteristics through hardware counters. However, traditionally, operating system (OS) support for and utilization of the hardware counter statistics have been limited and ad hoc. In this paper, we make the case for direct OS management of hardware counter statistics. First, we show the utility of processor counter statistics in CPU scheduling (for improved performance and fairness) and in online workload modeling, both of which require online continuous statistics (as opposed to ad hoc infrequent uses). Second, we show that simultaneous system and user use of hardware counters is possible via time-division multiplexing. Finally, we highlight potential counter misuses to indicate that the OS should address potential security issues in utilizing processor counter statistics.

∗ This work was supported in part by NSF grants CNS-0411127, CCF-0448413, CNS-0509270, CNS-0615045, CNS-0615139, and CCF-0621472.

1 Introduction

Hardware counters are commonplace on modern processors, providing detailed information such as instruction mix, rate of execution (instructions per cycle), branch (control flow) prediction accuracy, and memory access behaviors (including miss rates at each level of the memory hierarchy as well as bus activity). These counters were originally provided for hardware verification and debugging purposes. Recently, they have also been used to support a variety of tasks concerning systems and applications, including adaptive CPU scheduling [2, 6, 11, 15, 18], performance monitoring/debugging [1, 5, 19], workload pattern identification [4, 7], and adaptive application self-management [8].

Except for guiding CPU scheduling, so far the operating system's involvement in managing the processor counter statistics has been limited. Typically the OS does little more than expose the statistics to user applications. Additional efforts mostly concern the presentation of counter statistics. For instance, the PAPI project [5] proposed a portable cross-platform interface that applications can use to access hardware events of interest, which hides the differences and details of each hardware platform from the user.

In this paper, we argue that processor hardware counters are a first-class resource, warranting general OS utilization and requiring direct OS management. Our discussion is within the context of the increasing ubiquity and variety of hardware resource-sharing multiprocessors. Examples are memory bus-sharing symmetric multiprocessors (SMPs), L2 cache-sharing chip multiprocessors (CMPs), and simultaneous multithreading (SMTs), where many hardware resources, including even the processor counter registers, are shared.

Processor metrics can identify hardware resource contention on resource-sharing multiprocessors in addition to providing useful information on application execution behavior. We reinforce existing results to demonstrate multiple uses of counter statistics in an online continuous fashion. We show (via modification of the scheduler) that simple heuristics based on online processor hardware metrics may improve both the performance and the fairness of CPU scheduling. We also demonstrate the effectiveness of using hardware metrics for application-level online workload modeling.

A processor usually has a limited number of counter registers to which a much larger number of hardware metrics can map. Different uses such as system-level functions (e.g., CPU scheduling) and user-level tasks (e.g., workload profiling) may desire conflicting sets of processor counter statistics at the same time. We demonstrate that such simultaneous use is possible via time-division multiplexing.

Finally, the utilization of processor counter statistics may bring security risks. For instance, a non-privileged user application may learn execution characteristics of other applications when processor counters report combined hardware metrics of two resource-sharing sibling processors. We argue that the OS should be aware of such risks and minimize them when needed.

2 Counter Statistics Usage Case Studies

We present two usage case studies of processor hardware counter statistics: efficient and fair CPU scheduling, and online workload modeling. In both cases, the processor counter statistics are utilized in a continuous online fashion (as opposed to ad hoc infrequent uses).

2.1 Efficient and Fair CPU Scheduling

It is well known that different pairings of tasks on resource-sharing multiprocessors may result in different levels of resource contention and thus differences in performance. Resource contention also affects fairness, since a task may make less progress under higher resource contention (given the same amount of CPU time). A fair scheduler should therefore go beyond allocating equal CPU time to tasks. A number of previous studies [2, 6, 10, 11, 15, 18] have explored adaptive CPU scheduling to improve system performance, and some have utilized processor hardware statistics. The case for utilizing processor counter statistics in general CPU scheduling can be strengthened if a simple counter-based heuristic improves both scheduling performance and fairness.

In this case study, we consider two simple scheduling policies using hardware counter statistics. The first (proposed by Fedorova et al. [11]) uses instructions per cycle (IPC) as an indicator of whether a task is CPU-intensive (high IPC) or memory-access-intensive (low IPC). The IPC scheduler tries to pair high-IPC tasks with low-IPC tasks to reduce resource contention. The second is a new policy that directly measures the usage of bottleneck resources and then matches each high resource-usage task with a low resource-usage task on resource-sharing sibling processors. In the simple case of SMPs, a single resource — the memory bus — is the bottleneck.

Our implementation, based on the Linux 2.6.10 kernel, requires only a small change to the existing CPU scheduler. We monitor the bus utilization (or IPC) of each task using hardware counter statistics. At each scheduling decision, we try to choose one ready task whose monitored bus utilization (IPC) is complementary to the task or tasks currently running on the other CPU or CPUs (we use last-value prediction as a simple yet reasonable predictor, although other more sophisticated prediction schemes [9] could easily be incorporated). Note that our implementation does not change the underlying Linux scheduler's assignment of equal CPU time to CPU-bound tasks within each scheduling epoch. By smoothing out overall resource utilization over time, however, the scheduler may improve both fairness and performance, since with lower variation in resource contention, tasks tend to make more deterministic progress.

Experimental results: We present results on CPU scheduling in terms of both performance and fairness using two sets of workloads — sequential applications (SPEC-CPU2000) and server applications (a web server workload and TPC-H). The test is performed on an SMP system consisting of two Intel Xeon 3.0GHz CPUs with Hyper-Threading disabled.

For experiments on SPEC-CPU2000 applications, we run gzip, parser, and swim (low, medium, and high bus utilization, respectively) on one CPU, and mesa, wupwise, and art (again, low, medium, and high bus utilization, respectively) on the other CPU. In this scenario, ideally, complementary tasks (high-low, medium-medium) should be executed simultaneously in order to smooth out resource demand. We define the normalized application performance as "execution time under ideal condition / execution time under current condition". The ideal execution time is that achieved when the application runs alone (with no processor hardware resource contention). Figure 1 shows the normalized performance of SPEC-CPU2000 applications under different schedulers.

Figure 1: Normalized performance of individual SPEC-CPU2000 applications (gzip, wupwise, swim, mesa, art, parser) under different scheduling schemes: ideal (no slowdown due to resource contention), the default Linux scheduler, the IPC scheduler, and the bus-utilization scheduler.

We define two metrics to quantify the overall system performance and fairness. The system normalized performance metric is defined as the geometric mean of each application's normalized performance. The unfairness factor metric is defined as the coefficient of variation (standard deviation divided by the mean) of all application performance. Ideally, the system normalized performance should be 1 (i.e., no slowdown due to resource contention) and the unfairness factor 0 (i.e., all applications are affected by exactly the same amount). Compared to the default Linux scheduler, the bus-utilization scheduler improves system performance by 7.9% (from 0.818 to 0.883) and reduces unfairness by 58% (from 0.178 to 0.074). Compared to the IPC scheduler, it improves system performance by 6.5% and reduces unfairness by 55%. The IPC scheduler is inferior to the bus-utilization scheduler because IPC does not always accurately reflect the utilization of the shared bus resource.
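To make the heuristic and the two metrics concrete, the following is a minimal user-level sketch (our own illustration, not the paper's kernel code; the function names and the distance-to-capacity scoring rule are our assumptions — the paper states only that the chosen task should be "complementary" to the sibling's task, using last-value prediction):

```python
from math import prod
from statistics import mean, pstdev

def predict_usage(history):
    """Last-value prediction: a task's next-quantum bus utilization is
    predicted to equal its most recently observed value."""
    return history[-1]

def pick_complementary(ready, sibling_usage, bus_capacity=1.0):
    """Pick the ready task whose predicted bus utilization, combined with
    that of the task on the sibling CPU, comes closest to the shared bus
    capacity (so high-usage tasks get paired with low-usage tasks).
    `ready` maps task name -> list of observed utilizations."""
    return min(ready,
               key=lambda t: abs(sibling_usage + predict_usage(ready[t])
                                 - bus_capacity))

def system_normalized_performance(perfs):
    """Geometric mean of per-application normalized performance."""
    return prod(perfs) ** (1.0 / len(perfs))

def unfairness_factor(perfs):
    """Coefficient of variation (population std dev / mean) of the
    per-application normalized performance."""
    return pstdev(perfs) / mean(perfs)
```

With a high-usage task (0.9) on the sibling CPU, the heuristic selects the lowest-usage ready task, matching the high-low pairing described above.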

We also experimented with counter statistics-assisted CPU scheduling using two server applications. The first is the Apache 2.0.44 web server hosting a set of video clips, synthetically generated following the file size and access popularity distribution of the 1998 World Cup workload [3]. We call this workload web-clip. The second application is the TPC-H benchmark running on the MySQL 5.0.17 database. We choose a subset of 17 relatively short TPC-H queries appropriate for an interactive server workload. The datasets we generated for the two workloads are 316MB and 362MB respectively. For our experiments, we first warm up the server memory so that no disk I/O is performed during performance measurement. Figure 2 presents server throughput normalized to that when running alone without resource contention (ideal). Compared to the default Linux scheduler, the bus-utilization scheduler improves system throughput by 6.4% (from 0.894 to 0.952) and reduces unfairness by 80% (from 0.118 to 0.024). In this case, the IPC scheduler's performance is close to that of the bus-utilization scheduler, with system throughput at 0.951 and an unfairness factor of 0.025.

Figure 2: Normalized throughput of two server applications (web-clip and TPC-H) under different scheduling schemes.

2.2 Online Workload Modeling

In a server system, online continuous collection of per-request information can help construct workload models, classify workload patterns, and support performance projections (as shown in Magpie [4]). Past work has mostly focused on the collection of software metrics like request execution time. We argue that better server request modeling may be attained by adding processor hardware counter statistics. We support this argument using a simple experiment. We ran our TPC-H workload for 10 minutes and monitored around 1000 requests. For each request, we collected its execution time and memory bus utilization (measured using processor hardware counters). The per-request plot of these two metrics is presented in Figure 3.

Figure 3: TPC-H request modeling using per-request execution time (in seconds) and memory access intensity (memory bus event count per CPU cycle, measured by the Intel Xeon processor counter metric FSB_DATAREADY per CPU cycle); arrows mark TPC-H queries Q3, Q6, and Q12.

As can be seen from our result, the counter statistics can assist request classification. For instance, while TPC-H queries Q3, Q6, and Q12 all exhibit similar execution time, they vary in their need for memory bandwidth, making it easy to distinguish them if this statistic is used. Further, hardware statistics can also help project performance on new computing platforms. For example, after migrating to a machine with a faster processor but an identical memory system, a memory-access-heavy request (TPC-H Q3) would show less performance improvement than one with lower bandwidth requirements (TPC-H Q6). Software metrics alone may not provide such differentiation.

3 Managing Resource Competition

Although many hardware metrics can be configured for observation, a processor usually has a limited number of counter registers to which the hardware metrics must map. Additionally, the configurations of some counter metrics are in conflict with each other and thus these metrics cannot be observed together. Existing mechanisms for processor counter statistics collection [1, 17, 20] do not support competing statistics collections simultaneously (i.e., during the same period of program execution).

Competition for counter statistics can result from several scenarios. Since system functions typically require continuous collection of a specific set of hardware metrics over all program executions, any user-initiated counter statistics collection may conflict with them. In addition, on some resource-sharing hardware processors (particularly SMTs), sibling processors share the same set of counter registers.
Possible resource competition may arise when programs on sibling processors desire conflicting counter statistics simultaneously. Finally, a single application may request the use of conflicting counter statistics, potentially for different purposes.

In such contexts, processor counter statistics should be multiplexed for simultaneous competing uses and carefully managed for effective and fair utilization. The OS's basic mechanism to resolve resource competition is time-division multiplexing (alternating different counter register setups over time) according to certain allocation shares. Time-division multiplexing of counter statistics has already been employed in SGI IRIX to resolve internal conflicts among multiple hardware metrics requested by a single user. Our focus here is to manage the competition between system-level functions and application-level tasks from multiple users.

The time-division multiplexing of hardware counter statistics is viable only when uses of counter statistics can still be effective (or sustain only a slight loss of effectiveness) with partial-time sampled program execution statistics. Our allocation policy consists of two parts. First, the fraction of counter access time apportioned to system-level functions must be large enough to fulfill intended objectives but as small as possible to allow sufficient access to user-level tasks. Since the system-level functions are deterministic, we can statically provision a small but sufficiently effective allocation share for them. Second, when multiple user applications desire conflicting hardware counter statistics simultaneously (e.g., when they run on sibling processors that share hardware counter registers), they divide the remaining counter statistics access time using any fair-share scheduler. We recognize that in order to be effective, certain applications might require a complete monopoly of the hardware counters (e.g., when a performance debugger must find out exactly where each cycle of the debugged program goes [1]). The superuser may change the default allocation policy in such cases.

Experimental results: We experimentally validate that, as an important system-level function, the counter statistics-assisted CPU scheduler can still be effective with partial-time sampled counter statistics. Our CPU scheduler, proposed in Section 2.1, uses processor counter statistics to estimate each task's memory bus utilization. We install a once-per-millisecond timer interrupt, which allows us to change the counter register setup for each millisecond of program execution. In this experiment, the scheduling quantum is 100 milliseconds and we collect processor counter statistics only during selected millisecond periods so that the aggregate collection time accounts for the desired share for CPU scheduling.

Figure 4: CPU scheduling performance and fairness of six SPEC-CPU2000 applications under partial-time sampled processor counter statistics. The "system normalized performance" and "unfairness factor" metrics were defined in Section 2.1.

Figures 4 and 5 show the scheduling results for SPEC-CPU2000 applications and server applications respectively, as the share of counter statistics collection time is varied. The experimental setup, metrics used, and applications are the same as in Section 2.1. In general, scheduling effectiveness depends on application behavior variability. Applications with high behavior variability may be difficult to predict even with more sophisticated non-linear table-based predictors [9]. However, for SPEC-CPU2000 applications, our results suggest that with only a 5% share of counter statistics collection time, the CPU scheduling performance and fairness approximate those attained with full counter statistics collection. For server applications, a 33% share is needed since the behavior of individual requests fluctuates and is more variable. These results indicate that the effectiveness of a counter statistics-assisted CPU scheduler may be fully realized with only a fraction of the counter statistics collection time.
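The allocation mechanism can be sketched in a few lines. This is our own toy model, not the paper's implementation: the evenly spaced placement of system ticks and the round-robin rotation among user event sets are our assumptions (the paper requires only a fixed system share plus some fair-share division of the remainder):

```python
from itertools import cycle

def multiplex(quantum_ms, system_share, user_sets):
    """Assign a counter event set to each 1 ms tick of a scheduling
    quantum: the system-level event set gets roughly `system_share` of
    the ticks at evenly spaced points, and the remaining ticks rotate
    round-robin (a trivial fair-share policy) among the conflicting
    user-requested event sets."""
    period = max(1, round(1 / system_share))  # e.g., a 5% share -> every 20th tick
    users = cycle(user_sets)
    return ["system" if t % period == 0 else next(users)
            for t in range(quantum_ms)]
```

With a 100 ms quantum and a 5% system share (the SPEC-CPU2000 setting above), the system event set occupies 5 ticks and two competing user profilers split the remaining 95 nearly evenly.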

Figure 5: CPU scheduling performance and fairness of web-clip and TPC-H applications under partial-time sampled processor counter statistics.

4 Security Issues

Exposing processor counter statistics to non-privileged users and utilizing them in system functions may introduce new security risks. We examine two such issues.

Information leaks: On resource-sharing multiprocessors, a program's execution characteristics are affected by the behaviors of programs running on sibling processors. As a result, knowledge of a program's own hardware execution characteristics may facilitate undesirable covert channel attacks [13] and side channel attacks for cryptanalysis [12].

Here we show a simple example of how the private RSA key in OpenSSL [14] may be stolen. One vital step in the RSA algorithm is to calculate "X^d mod p", where d is the private key and X is the input. In OpenSSL, modular exponentiation is decomposed into a series of modular squares and modular multiplications. By knowing the sequence of squares and multiplications, one can infer the private key d with high probability. For example, X^11 is decomposed into ((X^2)^2 * X)^2 * X. If one knows the execution order "sqr, sqr, mul, sqr, mul", one can easily infer that the key is 11.
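The X^11 example above can be checked mechanically. The sketch below is our illustration of plain left-to-right square-and-multiply (real OpenSSL uses more elaborate windowed exponentiation and works modulo p); it decomposes an exponent into its square/multiply sequence and shows that the observed sequence alone determines the key:

```python
def decompose(d):
    """Square/multiply operation sequence of left-to-right binary
    exponentiation for X**d (d >= 1)."""
    ops = []
    for bit in bin(d)[3:]:       # skip the '0b' prefix and the leading 1 bit
        ops.append("sqr")        # every remaining bit costs a square ...
        if bit == "1":
            ops.append("mul")    # ... and a 1 bit additionally costs a multiply
    return ops

def infer_key(ops):
    """What an observer recovers from the square/multiply sequence:
    a square doubles the running exponent, a multiply adds one."""
    d = 1
    for op in ops:
        d = 2 * d if op == "sqr" else d + 1
    return d
```

For d = 11 (binary 1011) the sequence is exactly "sqr, sqr, mul, sqr, mul", and replaying any such sequence recovers the exponent uniquely.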
Percival [16] showed that when a carefully constructed microbenchmark runs together with the RSA operation on the Intel Hyper-Threading platform, it may distinguish an RSA square from an RSA multiplication based on observed cache miss patterns. We show that this can be done more easily if certain hardware counter statistics are available. To demonstrate this, we run X^d exponentiation for many random private keys d and record the processor counter values for each individual multiplication and square operation. As we can see from Figure 6, the metric of branch instruction count divided by total instruction count (branches per instruction) provides a clear differentiation between the two types of operations.

Figure 6: X^d is decomposed into a series of multiplications and squares in OpenSSL. Here X=1 and all d's are randomly generated with bit length no more than 6. Each point represents either a square (square symbol) or a multiplication (star symbol). The Y axis represents branches per instruction.

The direct exposure of hardware counter statistics to non-privileged user applications may exacerbate existing risks in the following ways.

• Some statistics reported by hardware counters are combined event counts from multiple sibling processors (typically when the statistics concern shared hardware resources). Such statistics provide a direct way for a malicious program to learn information about other applications.

• Among those counter statistics that do not include event counts from sibling processors, some concern program execution characteristics that are affected by the behaviors of programs running on sibling processors. An example is the L2 cache miss rate on L2 cache-sharing multiprocessors. These statistics can still be used to infer behavior patterns of other applications. Although such information may be learned at the software level (e.g., through software sampling or probing), processor hardware counters expose the information with significantly higher accuracy while requiring almost no runtime overhead.

To prevent such undesirable information leaks, the system must be judicious in exposing hardware statistics to non-privileged user applications. Relevant hardware statistics can be withheld due to the above security concerns. Note that a fundamental tradeoff exists between security and versatility in information disclosure. In particular, withholding contention-dependent hardware statistics may impair an application's ability to employ adaptive self-management. Such tradeoffs should be carefully weighed to develop a balanced policy for hardware statistics exposure.

Manipulation of system adaptation: When system functions utilize program execution characteristics for adaptive control, an application may manipulate such adaptation to its advantage by skillfully changing its execution behavior. For example, consider an adaptive CPU scheduler (like our bus-utilization scheduler proposed in Section 2.1) that uses partial-time sampled program execution characteristics to determine task resource usage levels and subsequently to run complementary tasks on resource-sharing sibling processors. A program may increase its resource usage only during the execution statistics sampling periods in order to be unfairly classified as a high resource-usage task. A possible countermeasure for the adaptive scheduler is to collect program execution characteristics using randomized sample periods. In general, the OS should be aware of possible manipulations of system adaptation and design countermeasures against them.
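The effect of this countermeasure can be simulated in a few lines. This is our own toy model: the 100 ms quantum and 1 ms sampling granularity follow Section 3, while the gaming task's behavior and the 5-slot sampling budget are our assumptions:

```python
import random

QUANTUM_MS = 100
PERIODIC_SLOTS = set(range(0, QUANTUM_MS, 20))   # 5 predictable sample points

def gaming_task_usage(ms):
    """A gaming task spikes to full bus usage only in the milliseconds it
    expects to be sampled, and idles otherwise (true average usage: 5%)."""
    return 1.0 if ms in PERIODIC_SLOTS else 0.0

def estimate(sample_slots):
    """Bus utilization as seen through the sampled milliseconds only."""
    return sum(gaming_task_usage(ms) for ms in sample_slots) / len(sample_slots)

# Predictable periodic sampling is completely fooled: the task appears
# fully bus-bound even though its true average usage is 5%.
periodic_estimate = estimate(PERIODIC_SLOTS)

# Randomized sampling (5 random slots per quantum, averaged over many
# quanta) converges to the true 5% average instead.
rng = random.Random(0)
randomized_estimate = sum(
    estimate(rng.sample(range(QUANTUM_MS), 5)) for _ in range(1000)) / 1000
```

The periodic estimate comes out at 1.0 while the randomized estimate stays near the true 0.05, illustrating why unpredictable sample periods defeat this particular gaming strategy.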
5 Conclusion

With the increasing ubiquity of multiprocessor, multithreaded, and multicore machines, resource-aware policies at both the operating system and user level are becoming imperative for improved performance, fairness, and scalability. Hardware counter statistics provide a simple and efficient mechanism to learn about resource requirements and conflicts without application involvement. In this paper, we have made the case for direct OS management of hardware counter resource competition and security risks through a demonstration of its utility both within the operating system and at user level. Ongoing work includes development of the API and policies for hardware counter resource management within the kernel.

References

[1] J.M. Anderson, L.M. Berc, J. Dean, S. Ghemawat, M.R. Henzinger, S.A. Leung, R.L. Sites, M.T. Vandevoorde, C.A. Waldspurger, and W.E. Weihl. Continuous profiling: Where have all the cycles gone? ACM Trans. on Computer Systems, 15(4):357–390, November 1997.

[2] C.D. Antonopoulos, D.S. Nikolopoulos, and T.S. Papatheodorou. Scheduling algorithms with bus bandwidth considerations for SMPs. In Proc. of the 32nd Int'l Conf. on Parallel Processing, October 2003.

[3] M. Arlitt and T. Jin. Workload characterization of the 1998 World Cup web site. Technical Report HPL-1999-35, HP Laboratories Palo Alto, 1999.

[4] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for request extraction and workload modeling. In Proc. of the 6th USENIX Symp. on Operating Systems Design and Implementation, pages 259–272, San Francisco, CA, December 2004.

[5] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proc. of the IEEE/ACM SC2000 Conf., Dallas, TX, November 2000.

[6] J.B. Bulpin and I.A. Pratt. Hyper-threading aware scheduling heuristics. In Proc. of the USENIX Annual Technical Conf., pages 103–106, Anaheim, CA, April 2005.

[7] I. Cohen, J.S. Chase, M. Goldszmidt, T. Kelly, and J. Symons. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In Proc. of the 6th USENIX Symp. on Operating Systems Design and Implementation, pages 231–244, San Francisco, CA, December 2004.

[8] M. Curtis-Maury, J. Dzierwa, C.D. Antonopoulos, and D.S. Nikolopoulos. Online power-performance adaptation of multithreaded programs using hardware event-based prediction. In Proc. of the 20th Int'l Conf. on Supercomputing, Cairns, Australia, June 2006.

[9] E. Duesterwald, C. Cascaval, and S. Dwarkadas. Characterizing and predicting program behavior and its variability. In Int'l Conf. on Parallel Architectures and Compilation Techniques, New Orleans, LA, September 2003.

[10] A. Fedorova, M. Seltzer, C. Small, and D. Nussbaum. Performance of multithreaded chip multiprocessors and implications for operating system design. In Proc. of the USENIX Annual Technical Conf., pages 395–398, Anaheim, CA, April 2005.

[11] A. Fedorova, C. Small, D. Nussbaum, and M. Seltzer. Chip multithreading systems need a new operating system scheduler. In Proc. of the SIGOPS European Workshop, Leuven, Belgium, September 2004.

[12] J. Kelsey, B. Schneier, D. Wagner, and C. Hall. Side channel cryptanalysis of product ciphers. Journal of Computer Security, 8(2-3):141–158, 2000.

[13] B.W. Lampson. A note on the confinement problem. Communications of the ACM, 16(10):613–615, 1973.

[14] OpenSSL: The open source toolkit for SSL/TLS. http://www.openssl.org.

[15] S. Parekh, S. Eggers, and H. Levy. Thread-sensitive scheduling for SMT processors. Technical report, Department of Computer Science and Engineering, University of Washington, May 2000.

[16] C. Percival. Cache missing for fun and profit. In BSDCan 2005, Ottawa, Canada, May 2005. http://www.daemonology.net/papers/htt.pdf.

[17] M. Pettersson. Linux performance counters driver. http://sourceforge.net/projects/perfctr/.

[18] A. Snavely and D. Tullsen. Symbiotic job scheduling for a simultaneous multithreading processor. In Proc. of the 9th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, pages 234–244, Cambridge, MA, November 2000.

[19] P.F. Sweeney, M. Hauswirth, B. Cahoon, P. Cheng, A. Diwan, D. Grove, and M. Hind. Using hardware performance monitors to understand the behavior of Java applications. In Proc. of the Third USENIX Virtual Machine Research and Technology Symp., pages 57–72, San Jose, CA, May 2004.

[20] Intel VTune performance analyzers. http://www.intel.com/software/products/vtune.