HaPPy: Hyperthread-Aware Power Profiling Dynamically
HaPPy: Hyperthread-aware Power Profiling Dynamically

Yan Zhai, University of Wisconsin; Xiao Zhang and Stephane Eranian, Google Inc.; Lingjia Tang and Jason Mars, University of Michigan
{xiaozhang,eranian}@google.com   {lingjia,profmars}@eecs.umich.edu

https://www.usenix.org/conference/atc14/technical-sessions/presentation/zhai

This paper is included in the Proceedings of USENIX ATC '14: 2014 USENIX Annual Technical Conference, June 19–20, 2014, Philadelphia, PA. ISBN 978-1-931971-10-2. Open access to the proceedings is sponsored by USENIX.

Abstract

Quantifying the power consumption of individual applications co-running on a single server is a critical component for software-based power capping, scheduling, and provisioning techniques in modern datacenters. However, with the proliferation of hyperthreading in the last few generations of server-grade processor designs, the challenge of accurately and dynamically performing this power attribution to individual threads has been significantly exacerbated. Due to the sharing of core-level resources such as functional units, prior techniques are not suitable for attributing power consumption between hyperthreads sharing a physical core.

In this paper, we present a runtime mechanism that quantifies and attributes power consumption to individual jobs at fine granularity. Specifically, we introduce a hyperthread-aware power model that differentiates between the state in which both hardware threads of a core are in use and the state in which only one thread is in use. By capturing these two different states, we are able to accurately attribute power to each logical CPU in modern servers. We conducted experiments with several Google production workloads on an Intel Sandy Bridge server. Compared to a prior hyperthread-oblivious model, HaPPy is substantially more accurate, reducing the prediction error from 20.5% to 7.5% on average and from 31.5% to 9.4% in the worst case.

1 Introduction

As more of the world's computation moves into large-scale datacenter infrastructures, power management and provisioning become increasingly important. In fact, prior work [4] shows that the cost of powering the servers housed in these infrastructures comprises about 30% of the total cost of ownership (TCO) of modern datacenter infrastructures. As we are reaching the limits of current power delivery systems, many datacenter infrastructures house more machines than can be powered by the supply infrastructure [17]. In tandem with these trends, datacenter designers and operators have been investigating techniques to manage the available power resources via software techniques such as power capping [13, 11], scheduling [12], and energy accounting/pricing [28], among others. Software power capping and provisioning techniques ensure that servers do not use more than a specified power threshold by suspending a subset of jobs. Scheduling can also be used to limit processor utilization to reach energy consumption goals. Beyond power budgeting, pricing the power consumed by jobs in datacenters is also important in multi-tenant environments.

One capability that proves critical in enabling software to monitor and manage power resources in large-scale datacenter infrastructures is the attribution of power consumption to the individual applications co-running on a single server. This ability allows software to control power consumption at the level of individual applications on a single machine, as well as across entire clusters. However, accurate attribution on real-world commodity hardware has proven challenging for modern server designs, particularly because simultaneous multithreading (or hyperthreading [14]) is now commonplace in current server designs.
Processors that are hyperthreaded allow two or more hardware thread contexts to share a single physical core. Although the OS views each hardware thread context as a logical CPU, a number of core-level resources are shared across contexts, such as functional units, alias registers, and cache resources, among others. Modern processors do not provide dedicated power monitors for each hardware thread context, and thus attributing the power consumption of individual processes and threads across logical CPUs has proven particularly challenging.

In this work, we present HaPPy (Hyperthread-aware Power Profiling Dynamically). HaPPy is able to dynamically and near-instantaneously attribute the power consumed (in watts) to individual processes or threads. To the best of our knowledge, this is the first such hyperthread-aware power estimation approach. Central to HaPPy is an estimation model that uses the Intel Running Average Power Limit (RAPL) power/performance monitoring interface [14], which is widely available on current commodity servers (Sandy Bridge, Ivy Bridge, etc.). Although RAPL provides no power monitoring information for individual cores or hardware thread contexts, HaPPy uses a novel execution isolation technique, implemented on top of an existing performance counter tool, to predict the power consumed by individual threads.

We evaluate HaPPy on six data-intensive Google production workloads using real commodity server configurations found in datacenters. Compared to prior work, HaPPy is substantially more accurate and reduces the prediction error from 20.5% to 7.5% on average and from 31.5% to 9.4% in the worst case.

2 Background

The primary goal of this work is to provide a technique that enables the attribution of power consumption to individual threads. In this section, we first describe the need for power estimation and the most closely related work. Then, we describe the underlying hardware monitoring interface that underpins our HaPPy approach.

2.1 Need for Power Estimation

Power estimation for individual jobs is critical for power management systems in datacenters [13, 11, 17, 12, 28]. To lower the total cost of ownership, particularly the cost of power infrastructure, modern datacenters are often designed to house more servers than can be powered by the underlying power supply infrastructure. At peak times, the power demand of the datacenter may surpass the supply of the power infrastructure, in which case power capping techniques are applied to lower the demand to below the provisioning threshold. There are various types of power capping techniques, including suspending jobs or limiting the processor utilization of certain jobs. These approaches require accurate power estimation for individual jobs. Power estimation allows us to accurately identify the minimum set of power-hungry jobs the system needs to suspend given the target power demand threshold. A more conservative power capping system without power estimation might need to suspend all low-priority jobs, which is much less cost-effective. In addition to power capping, power estimation is also critical for facilitating accurate pricing and accounting in multi-tenant cloud infrastructures. Accurate power usage estimation for individual applications on a shared server allows us to design power-based billing and pricing for cloud infrastructure users.
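As a concrete illustration of why per-job estimates make capping more cost-effective, the sketch below (not part of the paper's mechanism) suspends jobs in decreasing order of estimated power until the server's projected draw falls under the cap, so as few jobs as possible are affected. The job names, power estimates, and 300 W cap are hypothetical.

```python
def jobs_to_suspend(job_power_watts, current_draw_watts, cap_watts):
    """Greedily pick the fewest suspendable jobs whose removal brings the
    server's projected power draw under the cap.

    job_power_watts: {job_name: estimated_watts} for suspendable jobs.
    Returns the list of job names to suspend (possibly empty).
    """
    to_suspend = []
    projected = current_draw_watts
    # Suspend the most power-hungry jobs first, so fewer jobs are affected.
    for job, watts in sorted(job_power_watts.items(), key=lambda kv: -kv[1]):
        if projected <= cap_watts:
            break
        to_suspend.append(job)
        projected -= watts
    return to_suspend

# Hypothetical example: three co-running jobs and a 300 W cap.
print(jobs_to_suspend({"batch-A": 80.0, "web-B": 45.0, "batch-C": 60.0},
                      current_draw_watts=340.0, cap_watts=300.0))
# -> ['batch-A']  (suspending the single largest job is enough here)
```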
2.2 State of Power Estimation in Datacenters

Power constraints are well recognized as one of the primary limiting factors in datacenter design, and there is a significant body of work [19, 13, 11, 26] targeting advanced power estimation and management in datacenters. Two works emerge as most closely related. The work by Fan et al. presents power provisioning designs for datacenters [11]. The model presented in that work focuses on coarse-granularity prediction, which is suitable for its goal. However, it is hyperthread-oblivious and incurs substantial error when hyperthreads share a physical core; our hyperthread-aware CPU power modeling, in contrast, is applicable to tasks concurrently running on hyperthreaded machines.

2.3 The RAPL Interface

Intel recently released the RAPL model-specific registers (MSRs). These performance counters enable software to read processor energy consumption at run time on newer Intel processors such as Sandy Bridge and Ivy Bridge. RAPL MSRs separate processor energy consumption into three parts: pp0, package, and dram. pp0 counts the total energy consumed by all cores of a processor (note that RAPL does not provide per-core measurements on Sandy Bridge or Ivy Bridge); package includes both core and uncore (e.g., last-level cache) energy consumption; dram here refers to the on-chip dram channels, not the commonly referenced off-chip memory DIMMs. Total processor energy consumption can be calculated by aggregating the package and dram readings. The energy reported during a given time window can then be converted to the average power.
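To make the energy-to-power conversion concrete, the following minimal sketch samples one RAPL domain twice and divides the energy delta by the elapsed time. It reads the Linux powercap sysfs files rather than the perf extension described next; the domain path (intel-rapl:0 is assumed to be the package domain of socket 0), the presence of a dram sub-domain, and the privileges needed to read the counter all depend on the machine and kernel.

```python
import time

def read_energy_uj(domain="/sys/class/powercap/intel-rapl:0"):
    """Read the cumulative energy counter (microjoules) of one RAPL domain.

    intel-rapl:0 is assumed to be the package domain of socket 0; on server
    parts the dram reading is exposed as a sub-domain (check its 'name' file).
    """
    with open(domain + "/energy_uj") as f:
        return int(f.read())

def average_power_watts(interval_s=1.0, domain="/sys/class/powercap/intel-rapl:0"):
    """Convert two energy readings over a time window into average power."""
    with open(domain + "/max_energy_range_uj") as f:
        max_range = int(f.read())        # the counter wraps around at this value
    e0 = read_energy_uj(domain)
    time.sleep(interval_s)
    e1 = read_energy_uj(domain)
    delta_uj = (e1 - e0) % max_range     # tolerate a single counter wraparound
    return delta_uj / 1e6 / interval_s   # uJ -> J, then J per second = W

if __name__ == "__main__":
    # Total processor power would aggregate the package and dram domains,
    # as described above for the RAPL readings.
    print("package power: %.1f W" % average_power_watts())
```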
The Linux kernel provides a powerful open-source tool, called perf [1], to configure and monitor hardware performance counters. We have extended this interface to enable access to Intel's RAPL counters. The extension is implemented as a separate socket-level performance monitoring unit (PMU). To monitor the energy consumption of a multi-socket system, it is only necessary to monitor the RAPL events from one CPU on each socket. Our perf patch has been open-sourced [2] and will appear in upstream kernels (Linux 3.14 and newer).

3 Power Modeling

In this section, we first present a hyperthread-oblivious model commonly used in prior work. We then discuss why it is insufficient and inaccurate on modern servers with hyperthreading. Finally, we present our hyperthread-aware model, which can accurately attribute power consumption across co-running tasks.

3.1 Hyperthread-oblivious Model

We first present a hyperthread-oblivious (HT-oblivious) model, which is used in prior work for event-driven power accounting [6, 26]. The model is based on the hypothesis that the power consumption of a task is proportional to the amount of computation the CPUs perform for that task, and that one can estimate the amount of CPU computation using hardware events such as CPU cycles and instructions.
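A minimal sketch of this hypothesis, not the exact formulation used by any particular prior system: the measured processor power is split among co-running tasks in proportion to a per-task hardware event count (CPU cycles here), where the event counts and the measured power are assumed to come from a tool such as perf and from RAPL, respectively.

```python
def ht_oblivious_attribution(total_power_watts, task_cycles):
    """Split measured processor power across tasks in proportion to the
    CPU cycles each task consumed during the measurement window
    (the HT-oblivious hypothesis: power is proportional to computation).

    task_cycles: {task_name: cycles counted over the window}.
    Returns {task_name: estimated watts}.
    """
    total_cycles = sum(task_cycles.values())
    if total_cycles == 0:
        return {task: 0.0 for task in task_cycles}
    return {task: total_power_watts * cycles / total_cycles
            for task, cycles in task_cycles.items()}

# Hypothetical example: 60 W of package+dram power split across two tasks.
print(ht_oblivious_attribution(60.0, {"task-A": 3.0e9, "task-B": 1.0e9}))
# -> {'task-A': 45.0, 'task-B': 15.0}
```

Such a split ignores whether the two hardware threads of a physical core are both busy, which is precisely the case the hyperthread-aware model introduced later distinguishes.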