Per-Application Power Delivery
Akhil Guliani, University of Wisconsin-Madison, USA
Michael M. Swift, University of Wisconsin-Madison, USA

Abstract

Datacenter servers are often under-provisioned for peak power consumption due to the substantial cost of providing power. When there is insufficient power for the workload, servers can lower voltage and frequency levels to reduce consumption, but at the cost of performance. Current processors provide power limiting mechanisms, but they generally apply uniformly to all CPUs on a chip. For servers running heterogeneous jobs, though, it is necessary to differentiate the power provided to different jobs. This prevents interference, where one job is throttled because another job hits a power limit. While some recent CPUs support per-CPU power management, there are no clear policies on how to distribute power between applications. Current hardware power limiters, such as Intel's RAPL, throttle the fastest core first, which harms high-priority applications.

In this work, we propose and evaluate priority-based and share-based policies to deliver differential power to applications executing on a single socket in a server. For share-based policies, we design and evaluate policies using shares of power, shares of frequency, and shares of performance. These variations have different hardware and software requirements, and different results. Our results show that power shares have the worst performance isolation, and that frequency shares are both simpler and generally perform better than performance shares.

CCS Concepts: • Hardware → Platform power issues; Enterprise level and data centers power issues

Keywords: Power Management, DVFS, Proportional Shares

ACM Reference Format: Akhil Guliani and Michael M. Swift. 2019. Per-Application Power Delivery. In Fourteenth EuroSys Conference 2019 (EuroSys '19), March 25–28, 2019, Dresden, Germany. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3302424.3303981

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-6281-8/19/03...$15.00

Figure 1. Performance interference between applications with RAPL, normalized to standalone execution at 85W. Application gcc is low demand and cam4 is high demand.

1 Introduction

Power has become a first-class citizen in modern data centers: it is a primary cost in provisioning data centers [16], which emphasize improving overall efficiency. It is a major concern for server designers [21], who have focused on efficiency and power proportionality [5]. It is also a major concern for data-center operators, who must ensure their systems stay within provisioned power limits [18, 28, 56].

Across systems, it is currently possible to provision more power to some hosts than others, such as using more power on hosts executing more applications. The distribution of power is accomplished explicitly through power-aware job placement [56] and implicitly by normal job scheduling.

Within a single host, modern processors provide a rich toolset for managing available power. This toolset is only getting richer as power management rises in importance. For example, some Intel and AMD processors provide per-core dynamic voltage and frequency scaling (DVFS) in hardware, automatic enforcement and reporting of power limits (running average power limit or RAPL [22]), and hardware performance hints (hardware-managed P-states or HWP [1]).

Currently, much of the software leveraging these hardware capabilities targets mobile systems, where managing battery lifetime (energy) and thermals (temperature) is of prime importance. For example, Linux's cpufreq and power governors are largely used by mobile systems. However, in cloud and internet-service data centers there is a large class of heterogeneous workload machines that operate under a power budget due to the high cost of power provisioning [16]. Tools that use these mechanisms to control application power consumption can benefit such systems.

As a motivating example, Figure 1 shows the impact of Intel processors' current power limiting mechanism, RAPL. This figure shows the relative performance of two different applications running concurrently on different cores of a Skylake processor when subjected to a power limit with RAPL. Further, gcc is low demand (i.e., uses less power at a given frequency), while cam4 is high demand (i.e., uses more). Also, cam4 uses Intel AVX instructions, which limits it to a lower maximum frequency of 1667 MHz [29], as compared to 2360 MHz for gcc.

When these two applications run together under progressively lower power limits, the processor starts throttling the frequency of gcc, even though it consumes less power than cam4. At 50W, gcc is throttled to 1975 MHz (12% reduction) while cam4 is throttled only to 1570 MHz (5%). At 40W both are throttled to the same frequency of 1240 MHz, which represents a 48% reduction in frequency for gcc but only a 25% reduction for cam4. These results show that current power-limiting mechanisms do not address the difference in power usage and frequency limits across cores. Furthermore, any administrator-provided priorities, such as a preference to run gcc faster while the power-hungry cam4 sacrifices performance at lower power levels, are ignored. This makes co-locating latency-sensitive and batch tasks difficult, as batch workloads may exceed power caps and affect the other workloads [33].

To address this, we focus on the problem of differential power delivery: how can and should a system allocate different amounts of power to different applications to effect prioritization and isolation? We study the mechanisms available in modern processors and develop two classes of differential power-delivery policies. First, we investigate a simple two-level priority model with foreground and background tasks, with the goal that foreground applications get peak performance subject to a power limit. Background applications receive the rest.

Second, we look at a proportional share model, where power is distributed according to shares [55]. Within this class of policies, we investigate allocating shares of three different resources: power, performance, and frequency. These three share types have different hardware and software demands, and result in different, yet useful behavior.

2 Background

In this section we describe the various power management and capping mechanisms implemented by recent microprocessors. We focus on power (energy per unit time), which is of greater importance to datacenters, as opposed to energy efficiency, which is often the focus for mobile systems.

2.1 Power Management

The following mechanisms are provided by microprocessors to control power consumption.

Dynamic voltage and frequency scaling (DVFS): DVFS varies the frequency of a processor. With reduced frequency, voltage can also be lowered, leading to a cubic drop in power and higher energy efficiency. This occurs because dynamic power ($P_{dyn}$) is related to voltage ($V$) and frequency ($f$) according to the equation $P_{dyn} \propto V^2 f$. Current implementations of DVFS are fast (taking 1-30 µs [35]). DVFS has been adopted by almost all current processors [2, 10, 20] and is the preferred mechanism for power management. We see two implementations in practice.

Global frequency and voltage scaling: The most common implementation of DVFS scales voltage and frequency for all cores simultaneously. It requires a single voltage controller. Hence, all cores run at the same frequency, so it cannot be used for differential power delivery.

Individual frequency and voltage scaling: Introduced with the Haswell-EP [15] and AMD Ryzen [4] processors, individual DVFS allows each core to have its own voltage and frequency levels. However, this approach increases hardware complexity by requiring on-chip per-core voltage regulators (e.g., Intel's FIVR [11]). In between global and individual frequencies, a processor may also support N distinct levels, where 1 < N < #cores. For example, the Ryzen 1700x allows only 3 unique voltage/frequency combinations across 8 cores, although the frequencies and voltages are configurable [4, 31, 38]. We discuss this limitation and its workaround in the next section. This workaround leads to an optimization problem of determining which three frequencies are optimal for a set of workloads. For finer-grained control, processors may allow setting voltage and frequency separately.
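To make the cubic relationship from the DVFS discussion concrete, the following minimal sketch computes dynamic power relative to a baseline frequency. It relies on the relation $P_{dyn} \propto V^2 f$ given in the text and on one simplifying assumption that is not taken from any specific processor: voltage scales linearly with frequency. Real parts have a minimum operating voltage and also draw static power, so actual savings can be smaller. The function name and the example frequencies are hypothetical.

```python
def relative_dynamic_power(freq_mhz: float, base_freq_mhz: float) -> float:
    """Return dynamic power at freq_mhz as a fraction of power at base_freq_mhz.

    Uses P_dyn proportional to V^2 * f. Illustrative assumption: voltage
    scales linearly with frequency, so P_dyn scales with f cubed.
    """
    if freq_mhz <= 0 or base_freq_mhz <= 0:
        raise ValueError("frequencies must be positive")
    freq_ratio = freq_mhz / base_freq_mhz
    voltage_ratio = freq_ratio  # assumption: V is proportional to f
    return voltage_ratio ** 2 * freq_ratio


if __name__ == "__main__":
    base = 2000.0  # hypothetical baseline frequency in MHz
    for f in (2000.0, 1600.0, 1000.0):
        rel = relative_dynamic_power(f, base)
        print(f"{f:.0f} MHz: {rel:.3f} of dynamic power at {base:.0f} MHz")
```

Under these assumptions, halving the frequency reduces dynamic power to one eighth of the baseline, which is why DVFS is attractive for staying under a power limit even though performance drops at most linearly with frequency.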
For example, distributing power directly requires per-core power measurements, which are not available on Intel platforms. Distributing performance requires a reasonable estimate of

Performance states (P-States): The standard way to communicate DVFS states from supervisory software is performance states (P-States), which represent different power/performance levels. This interface is presented to the