Unlocking Energy Babak Falsafi, Rachid Guerraoui, Javier Picorel, and Vasileios Trigonakis, École Polytechnique Fédérale de Lausanne (EPFL) https://www.usenix.org/conference/atc16/technical-sessions/presentation/falsafi

This paper is included in the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16). June 22–24, 2016 • Denver, CO, USA 978-1-931971-30-0

Open access to the Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC ’16) is sponsored by USENIX. ncn ner

Babak Falsafi achid uerraoui avier Picorel Vasileios Trigonakis ∗ first.last @epfl.ch { EcoCloud,} EPFL

Power Consumption Energy Efficiency src (lower is better) (higher is better) 2 Locks are a natural place for improving the energy ef- mutex 1.5 ficiency of software systems. First, concurrent systems 1 are mainstream and when their threads synchronize, they 0.5 typically do it with locks. Second, locks are well-defined 0 abstractions, hence changing the algorithm implement- Relative to mutex (a) 10 20 (b) 10 20 ing them can be achieved without modifying the system. # Threads # Threads Third, some locking strategies consume more power than Figure 1: Power consumption and energy efficiency of others, thus the strategy choice can have a real effect. CopyOnWriteArrayList with mutex and spinlock. Last but not least, as we show in this paper, improving the energy efficiency of locks goes hand in hand with schedulers [48, 55, 59], and energy-oriented compil- improving their throughput. It is a win-win situation. ers [63, 64]. Basically, those techniues reuire changes We make our case for this throughput/energy- in hardware, installing new schedulers or runtime sys- efficiency correlation through a series of observations tems, or even rebuilding the entire system. obtained from an exhaustive analysis of the energy ef- We argue that there is an effective, complementary ficiency of locks on two modern processors and sixsoft- approach to saving energy: Optimizing synchroniation, ware systems: Memcached, MySL, SLite, RocksDB, specifically its most popular form, namely locking. The HamsterDB, and Kyoto Kabinet. We propose simple rationale is the following. First, concurrent systems are -based techniues for improving the energy effi- now mainstream and need to synchronize their activi- ciency of these systems by 33 on average, driven by ties. In most cases, synchronization is achieved through higher throughput, and without modifying the systems. locking. Hence, designing locking schemes that reduce nrcn energy consumption can affect many software systems. Second, the lock abstraction is well defined and one can For several decades, the main metric to measure the effi- usually replace the algorithm implementing it without ciency of computing systems has been throughput. This any modification to the rest of the system. Third, the state of affairs started changing in the past few years as choice of the locking scheme can have a significant effect energy has become a very important factor [17]. Reduc- on energy consumption. Indeed, the main conseuence ing the power consumption of systems is considered cru- of synchronization is having some threads wait for one cial today [26, 30]. Various studies estimate that datacen- another–an opportunity for saving energy. ters have contributed over 2 of the total US electricity To illustrate this opportunity, consider the av- usage in 2010 [36], and proect that the energy footprint erage power consumption of two versions of a of datacenters will double by 2020 [1]. java.util.concurrent.CopyOnWriteArrayList [6] Hardware techniues for reducing energy consump- stress test over a long-running execution–Figure 1(a). tion include [38], power gating [52], as The two versions differ in how the lock handles con- well as voltage and freuency scaling [28, 56]. Soft- tention: Mutexes use sleeping, while employ ware techniues include approximation [16, 27, 57], con- busy waiting. With sleeping, the waiting is put solidation [18, 21], energy-efficient data structures [23, to sleep by the OS until the lock is released. 
With busy 32], fast active-to-idle switching [44, 45], power-aware waiting, the thread remains active, polling the lock until

∗Author names appear in alphabetical order. the lock is finally released. Choosing sleeping as the

USENIX Association 2016 USENIX Annual Technical Conference 393 waiting strategy brings up to 33 benefits on power. power consumption of busy waiting can simply not be Hence, as we pointed out, the choice of locking strategy reduced. The only way is to look into sleeping. can have a significant effect on power consumption. Sleeping can indeed sae power. Our Xeon server has Accordingly, privileging sleeping with mutex locks approximately 55 Watts idle power and a max total power seems like the systematic way to go. This choice, how- consumption of 206 Watts. Once a hardware context is ever, is not as simple as it looks. What really matters is active, it draws power, regardless of the type of work it not only the power consumption, but the amount of en- executes. We can save this power if threads are put to ergy consumed for performing some work, namely en- sleep while waiting behind a busy lock. The OS can then ergy efficiency. In the Figure 1 example, although the put the core(s) in one of the low-power idle states [5]. spinlock version consumes 50 more power than mutex, Furthermore, when there are more software threads than it delivers 25 higher energy efficiency (Figure 1(b)) for hardware contexts in a system, sleeping is the only way it achieves twice the throughput. Hence, indeed, locking to go in locks, because busy waiting kills throughput. is a natural place to look for saving energy. Yet, choosing the best lock algorithm is not straightforward. oweer going to sleep hurts energy efficiency. The To finalize the argument that optimizing locks isa futex system call implements sleeping in Linux and is good approach to improve the energy efficiency of sys- used by pthread mutex locks. In most realistic scenar- tems, we need locks that not only reduce power, but also ios, the futex-call overheads offset the energy benefits do not hurt throughput. Is that even possible of sleeping over busy waiting, if any, resulting in worse We show that the answer to this uestion is positive. energy efficiency. Additionally, the spin-then-sleep pol- We argue for the POLY1 conecture: Energy efficiency icy of mutex is not tuned to account for these overheads. and throughput go hand in hand in the context of lock The mutex spins for up to a few hundred cycles before algorithms. POLY suggests that we can optimize locks to employing futex, while waking up a sleeping thread improve energy efficiency without degrading throughput takes at least 7000 cycles. As a result, it is common that the two go hand in hand. Conseuently, we can apply a thread makes the costly futex call to sleep, only to prior throughput-oriented research on lock algorithms al- be immediately woken up, thus wasting both time and most as is in the design of energy-efficient locks as well. energy. We design MUTEXEE, an optimized version of We argue for our POLY conecture through a thorough mutex that takes the futex overheads into account. analysis of the energy efficiency of locks on two mod- hus some threads hae to go to sleep for long. An ern Intel processors and six software systems (i.e., Mem- unfair lock can put threads to sleep for long periods of cached, MySL, SLite, RocksDB, HamsterDB, and time in the presence of high contention. Doing so results Kyoto Kabinet). We conduct our analysis in three layers. in lower power consumption, as fewer threads (hardware We start by analyzing the hardware and software artifacts contexts) are active during the execution. 
In addition, available for synchronization (e.g., pausing instructions, lower fairness brings (i) better throughput, as the con- the Linux futex system calls). Then, we evaluate opti- tention on the lock is decreased, and (ii) higher tail la- mized variants of lock algorithms in terms of throughput tencies, as the latency distribution of acuiring the lock and energy efficiency. Finally, we apply our results to might include some large values. the six software systems. We derive from our analysis the following observations that underlie POLY: Overall, on current hardware, every power trade-off is also a throughput and a latency trade-off (motivating the Busy waiting inherently hurts power consumption. name POLY1): (i) sleeping vs. busy waiting, (ii) busy With busy waiting, the underlying hardware context re- waiting with vs. without DVFS or monitor/mwait, and mains active. On Intel machines, for example, it is not (iii) low vs. high fairness. practically feasible to reduce the power consumption of busy waiting. First, there is no power-friendly pause in- Interestingly, in our uest to substantiate POLY, we struction to be used in busy-wait loops. The conventional optimize state-of-the-art locking techniues to increase way of reducing the cost of these loops, namely the x86 the energy efficiency of our considered systems. Weim- pause instruction, actually increases power consump- prove the systems by 33 on average, driven by a 31 tion. Second, the monitor/mwait instructions reuire increase in throughput. These improvements are either kernel-level privileges, thus using them in user space due to completely avoiding sleeping using spinlocks, or incurs high overheads. Third, traditional DVFS tech- due to reducing the freuency of sleep/wake-up invoca- niues for decreasing the voltage and freuency of the tions using our new MUTEXEE scheme. cores (hence lowering their power consumption) are too We conduct our analysis on two modern Intel plat- coarse-grained and too slow to use. Conseuently, the forms as they provide tools (i.e., RAPL interface [4]) for accurately measuring the energy consumption of the pro- 1POLY stands for Pareto optimality in locks for energy efficiency. cessor. Still, we believe that POLY holds on most modern

2 394 2016 USENIX Annual Technical Conference USENIX Association multi-cores. On the one hand, without explicit hardware The locks which use busy waiting are called spin- support, busy waiting on any multi-core exhibits simi- locks. There are several spinlock algorithms, such as test- lar behavior. On the other hand, futex implementations and-set (TAS), test-and-test-and-set (TTAS), ticket-lock are alike regardless of the underlying platform, thus the (TICKET) [46], MCS (MCS) [46], and CLH (CLH) [22]. overheads of sleeping will always be significant. How- Spinlocks mostly differ in their busy-waiting implemen- ever, should the hardware provide adeuate tools for fine- tation. For example, TAS spins with an atomic operation, grained energy optimizations in software, POLY might continuously trying to acuire the lock (global spinning). need to be revised. We discuss the topic further in 8. In contrast, all other spinlocks spin with a load until the § In summary, the main contributions of this paper are: lock becomes free and only then try to acuire the lock with an atomic operation (local spinning). The POLY conecture, stating that we can simply, yet • effectively optimize lock-based synchronization to nergy fficiency of Software. Energy efficiency rep- improve the energy efficiency of software systems. resents the amount of work produced for a fixed amount An extensive analysis of the energy efficiency of of energy and can be defined as throughput per power • locks. The results of this analysis can be used to (TPP, operation/oule). Higher TPP represents a more optimize lock algorithms for energy efficiency. energy-efficient execution. We use the terms energy efficiency and TPP interchangeably. Alternatively, en- Our lock libraries and benchmarks, available at: • ergy efficiency can be defined as the energy spent on https://lpd.epfl.ch/site/lockin. a single operation, namely energy per operation (EPO, MUTEXEE, an improved variant of pthread mutex • oule/operation). Note that TPP 1/EPO. lock. MUTEXEE delivers on average 28 higher en- ergy efficiency than mutex on six modern systems. perimental ethodology. We prefer TPP over EPO because both throughput and TPP are higher-is- It is worth noting that POLY might not seem surpris- better metrics. Recent Intel processors include the ing to a portion of the multi-core community. Yet, we RAPL [4] interface for accurately measuring energy con- believe it is important to clearly state POLY and uantify sumption. RAPL provides counters for measuring the through a thorough analysis the reasons why it is valid on cores, package, and DRAM energy. We use these en- current hardware. As we discuss in 8, our results have ergy measurements to calculate average power. Our mi- § important software and hardware ramifications. crobenchmark results are the median of 11 repetitions of The rest of the paper is organized as follows. In 2, 10 seconds. When we vary the number of threads, we § we recall background notions regarding synchronization first use the cores within a socket, then the cores ofthe and energy efficiency. We describe our target platforms second socket, and finally, the hyper-threads. in 3 and explore techniues for reducing the power of § synchronization in 4. We analyze in 5 the energy effi- arget latforms § § ciency of locks and we use our results to improve various We describe our two target platforms and then estimate software systems in 6. We discuss related work in 7, § § their maximum power consumption. and we conclude the paper in 8. § latforms. 
We use two modern Intel processors: Background and ethodology Locks ensure mutual ame ype Cores 1 C em ockased Synchronization. Xeon server 10 32KB 256KB 25MB 128GB 115W exclusion only the holder of the lock can proceed with Core-i7 desktop 4 32KB 256KB 8MB 16GB 77W its execution. The remaining threads wait until the holder releases the lock. This waiting is implemented with ei- In the interest of space and clarity of explanation, we fo- ther sleeping (blocking), or busy waiting (spinning) [49]. cus in the paper on the results of our server. Note that With sleeping, the thread is put in a per-lock wait the results on Core-i7 are in accordance with the ones on ueue and the hardware context is released to the OS. Xeon. Our server is a two-socket Intel Ivy Bridge (E5- When the lock is released, the OS might wake up the 2680 v2), henceforth called Xeon. Xeon runs on freuen- thread. With busy waiting, threads remain active, polling cies scaling from 1.2 to 2.8 GHz due to DVFS and uses the lock in a spin-wait loop. the Linux kernel 3.13 and glibc 2.13. Our desktop is an Sleeping is employed by the pthread mutex lock Intel Core i7 (Ivy Bridge–3770K) processor, henceforth (MUTEX). MUTEX builds on top of futex system calls, called Core-i. Core-i7 runs on freuencies scaling from which allow a thread to wait for a value change on an ad- 1.6 to 3.5 GHz due to DVFS and runs the Linux kernel dress. MUTEX might first perform busy waiting for a lim- 3.2 and glibc 2.15. Both platforms have two hardware ited amount of time and if the lock cannot be acuired, threads per-core (hyper-threads in Intels terminology). the thread makes the futex call. is disabled for all the experiments.

3 USENIX Association 2016 USENIX Annual Technical Conference 395 total package cores DRAM sleeping global spinning local spinning Minimum Frequency Maximum Frequency Power Cycles Per Instruction 200 200 140 60 120 140 160 160 138 45 120 120 136 100 30 80 80 134 CPI 80 30 35 40 40 40 15 60 Power (Watts) 0 Power (Watts) 0 Power (Watts) 0 0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40 1 5 10 15 20 25 30 35 40 1 5 10 15 20 25 30 35 40 # Hyper-threads # Hyper-threads # Threads # Threads Figure 2: Power-consumption breakdown on Xeon. Figure 3: Power consumption and CPI while waiting.

sn er nsn of the type of work performed. Consequently, the op- portunity for reducing power consumption in software is We estimate the maximum power that Xeon can con- relatively low and mostly has to do with (i) using fewer sume, using a memory-intensive benchmark that con- cores, by, for example, putting threads to sleep, or (ii) re- sists of threads sequentially accessing large chunks of ducing the frequency of a core. memory from their local node. Figure 2 depicts the total power and the power of different components on Xeon, Recn er n ncrnn depending on the number of active hyper-threads and the In this section, we evaluate the costs of busy waiting and voltage-frequency (VF) setting. sleeping, and examine different ways of reducing them. e er nsn The 0-thread points repre- er e rce f s n sent the idle power consumption, which accounts for the We measure the total power consumption of the three static power in cores and caches, and DRAM background main waiting techniques (i.e., sleeping, global spinning, power, and is the power that is consumed when all cores and local spinning–see 2) when all threads are waiting are inactive.2 In both min and max frequency settings § for a lock that is never released. Figure 3 shows the the total idle power is 55.5 Watts as the VF setting only power consumption and the cycles per instruction (CPI). affects the active power. CPI represents the average number of CPU cycles that an er f ce res Activating the first core of a instruction takes to execute. socket is more expensive than activating any subsequent Two main points stand out. First, in this extreme due to the activation of the uncore package components. scenario, sleeping is very efficient because the waiting In particular, it costs 6.4 and 13.6 Watts in package power threads do not consume any CPU resources. Second, on the min and max VF settings, respectively. The sec- local spinning consumes up to 3 more power than ond core costs 2.3 and 5.6 Watts. We perform more ex- global spinning. This behavior is explained by the CPI periments (not shown in the graphs) with data sets that fit graph: Global spinning performs atomic operations on in L1, L2, and LLC. The results show that the package the shared memory address of the lock, resulting in a power is not vastly reduced on any of these workloads, very high CPI (up to 530 cycles). In local spinning, every indicating that once a core is active, the core consumes a thread executes an L1 load each cycle, whereas, in global certain amount of power that cannot be avoided. spinning, storing over coherence occurs once the atomic rn f er res ce n er operation is performed, each 530 cycles on average. Notice the breakdown of total power to package/core3 Recn e rce f s n and DRAM power. DRAM power has a smaller dynamic range than package and core power. On the max VF set- We reduce the power consumption of busy waiting in dif- ting, DRAM power ranges from 25 to 74 Watts, while ferent ways: (i) we examine various ways of pausing in the range of package power is from 30 to 132 Watts, and spin-wait loops, (ii) we employ DVFS, and (iii) we use monitor/mwait core power from 4 up to 96 Watts. to block the waiting threads. sn ecnes Busy waiting with local spin- cns The power consumption of Xeon ranges ning is power hungry, because threads execute with low from 55 up to 206 Watts. Out of the 206 Watts, 74 Watts CPI. Hence, to reduce power, we must increase the loop’s are spent on the DRAM memory. Locks are typically CPI. 
We take several approaches to this end (Figure 4). transferred within the processor by the cache-coherence Any instruction, such as a nop, that the out-of-order protocol, thus limiting the opportunities for reducing core can hide, cannot reduce the power of the spin loop. power to package power (30-132 Watts). Additionally, According to Intel’s Software Developer’s Manual [4], once a core is active, the core draws power, regardless Inserting a pause instruction in a spinwait loop greatly 2Still, the OS briefly enables a few cores during the measurements. reduces the processors power consumption. A pause 3The package power includes the core power. (localpause) increases CPI to 4.6. However, not only

4 396 2016 USENIX Annual Technical Conference USENIX Association global local local-pause local-mbar Power Cycles Per Instruction 60 140 VF-max 140 120 VF-min 120 45 DVFS-normal 144 100 monitor/mwait 100 140 30

136 CPI 80 80 132 15 60 60 30 35 40 Power (Watts) Power (Watts) 0 1 5 10 15 20 25 30 35 40 1 5 10 15 20 25 30 35 40 1 5 10 15 20 25 30 35 40 # Threads # Threads # Threads Figure 5: Power consumption of busy waiting using Figure 4: Power consumption and CPI while spinning. DVFS and monitor/mwait. does it not greatly reduce power, but it even increases setting. Conseuently, using DVFS with hyper-threading the power consumption by up to 4.4 is tricky, and as the DVFS-normal line shows, the power In general, the reason behind the very low CPI of lo- consumption drops only when both hyper-threads lower cal spinning is the aggressive execution mechanisms of their VF points. modern processors that allow instructions to execute both nr The monitor/mwait [4] instructions speculatively and out of the program order. This results allow a hardware context to block and enter an in one out of three of the retired operations being a mem- implementation-dependent optimized state while waiting ory load (the other two are a test and a conditional ump). for a store event. In detail, monitor allows a thread Without appropriate pausing, the spin loop retires one to declare a memory range to monitor. The hardware memory load per cycle. thread then uses mwait to enter an optimized state until a A way to avoid the speculative execution of the load is store is performed within the address range. Essentially, to insert a full, or a load, memory barrier. That way, the mwait implements sleeping in hardware and can be used loads only execute once the previous load retires and the in spin-wait loops: The hardware sleeps, yet the thread instructions that depend on it, test and ump, are stalled does not release its context. as well. The results (local-mbar) show that the barrier These instructions reuire kernel privileges. We de- reduces the power consumption of local spinning to the velop a virtual device and overload its file operations point that becomes less expensive than global spinning functions to allow a user program to declare and wait on (global). Additionally, local-mbar consumes up to 7 an address, similar to [14]. A thread can wake up others less power than local-pause. It is worth noting that local- with a user-level store. However, threads pay the user-to- mbar consumes less power than local-pause even for low kernel switch and system-call overheads for waiting. thread counts (e.g., 5 on 10 threads). In the rest of the Figure 5 includes the power of busy waiting with paper, we use a memory barrier for pausing in spin loops. monitor/mwait. These instructions can reduce power nc e n reenc cn consumption over conventional spinning up to 1.5x. An intuitive way of lowering the power consumption of However, similarly to DVFS, using monitor/mwait has an active core is to reduce the voltage-freuency (VF) two shortcomings. First, monitor/mwait can be only point via DVFS (see 3). Figure 5 shows that spinning used in kernel space. The overloaded file operation takes § on VF-min consumes up to 1.7x less power than on the roughly 700 cycles. The best case wake-up latency from VF-max setting. Still, DVFS is currently impractical for mwait, with ust one core sleeping, is 1600 cycles. dynamically reducing power in busy waiting. In comparison, waking up a locally-spinning thread First, to trigger the VF change with DVFS, we need takes two cache-line transfers (i.e., 280 cycles). 
Second, to write on a certain per-hardware context file of the programming with monitor/mwait on Intel processors /sys/devices directory (more details about DVFS can can be elaborate and limiting. The mwait instruction be found in [62]). Hence, the VF-switch operation is blocks the hardware context until the thread is awaken. slow: We measure that it takes 5300 cycles on Xeon. If In oversubscribed environments (i.e., more threads than DVFS is used while busy waiting, this overhead will be hardware contexts), monitor/mwait will likely exacer- on the critical path when the lock is acuired and the bate the livelock issues of spinlocks (see 6). Blocked § thread must switch back to the maximum VF point. threads might occupy most hardware contexts, thus pre- Second, both hyper-threads of a physical core share venting other threads from performing useful work. the same VF setting–the higher of the two. If a hyper- cns Busy waiting drains a lot of power be- thread lowers its VF setting, the action will have no effect cause cores execute at full speed. Neither of the two plat- unless the second hyper-thread has the same or lower VF forms provides sufficient tools for reducing power con- 4We speculate that one of the reasons for this increase in power is that sumption in a systematic way. Pausing techniues, such pause gives a hint to the core to prioritize the other hyper-thread. as pause, can even increase the power of busy waiting.

5 USENIX Association 2016 USENIX Annual Technical Conference 397 ) 100 Techniues that can significantly reduce power, such as s 12 wake-up call

l e turnaround c 80 8

DVFS and monitor/mwait, are not designed for user- y 60 4 th K c

( 95 percentile space usage as they reuire expensive kernel operations. 0 y 40 c 102 103 104 Hence, sleeping is currently the only practical way of re- n e

t 20 a

ducing the power consumption in locks. L 0 102 103 104 105 106 107 enc e rce f een Delay between futex sleep and wake-up calls (cycles, log10) In Linux, sleeping is implemented with futex system Figure 6: Latency of different futex operations. calls. A futex-sleep call puts the thread to sleep on a given memory address. A futex wake-up call awakes thus represent the best-case latencies, with minimal or the first N threads sleeping on an address (N = 1 in no contention at the kernel level. With more threads, a locks). The futex calls are protected by kernel locks. In wake-up invocation is likely to contend with futex sleep particular, the kernel holds a hash table (array) of locks calls, all serialized using a single kernel lock. and futex operations calculate the particular lock to use by hashing the address. Given that the array is large (ap- cns futex operations have high latencies proximately 256 cores locks), the probability of false and consume energy, as a non-negligible number of in- ∗ contention is low. However, operations on the same ad- structions are executed. Handing over a lock with a dress (same MUTEX) do contend on kernel level. futex wake-up call reuires at least 7000 cycles. Even We use a microbenchmark where two threads run in on rather lengthy critical sections (e.g., 10000 cycles), lock-step execution (synchronized at each round with this latency is prohibitive it almost doubles the execution barriers). One makes futex-sleep calls and the second time of the critical section. In this case, the energy ben- makes wake-up calls on the same futex, after waiting efits of sleeping will not easily compensate the perfor- for some time. A futex-sleep call (i.e., enueuing be- mance losses. In short critical sections, invoking futex hind the lock and descheduling the thread) takes around calls will have detrimental effects on performance. 5 2100 cycles. This sleep latency is not necessarily on the Recn e rce f een critical path: The thread sleeps because the lock is oc- cupied. However, the latency to wake up a thread and Sleeping can save energy on long waiting durations. We the one for the woken-up thread to be ready to execute estimate when sleeping reduces power consumption with are on the critical path. Figure 6 contains the wake-up two threads: call and the turnaround latencies, depending on the delay er eeen e cs cces 1024 2048 4096 8192 between the invocation of the sleep call and the wake-up er s 72.03 69.18 68.75 68.02 call. The turnaround latency is the time from the wake- The first thread sleeps on a location, while the second pe- up invocation until the woken-up thread is running.6 riodically wakes up the first thread. We vary the period The turnaround time is at least 7000 cycles and is between the wake-up invocations, which essentially rep- higher than the wake-up call latency. Apart from the ap- resents the critical-section duration in locks. The results proximately 2700 cycles of the wake-up call, the woken- confirm that if a thread is woken up more freuently than up thread reuires at least 4000 more cycles before ex- the futex-sleep latency, power consumption is not re- ecuting. Concretely, once the wake-up call finishes, the duced. The thread goes to sleep only to be immediately woken-up thread pays the cost of idle-to-active switching woken up by a concurrent wake-up call. When these and the cost of scheduling.7 sleep misses happen, we lose performance without any Figure 6 further includes two interesting points. First, power reduction. 
Once the delay becomes larger than the for low delays between the two calls, the wake-up call sleep latency (i.e., approximately 2100 cycles on Xeon), is more expensive as it waits behind a kernel lock for we start observing power reductions. the completion of the sleep call. Second, when the de- Recn rness We show two problems with lay between the calls is very large (>600K cycles), the futex-based sleeping: (i) high turnaround latencies, and turnaround latency explodes, because the hardware con- (ii) freuent sleeps and wake ups do not reduce power text sleeps in a deeper idle state [41]. consumption. To fix both problems simultaneously, we Finally, the results in Figure 6 use ust two threads and recognize the following trade-off: We can let some 5Estimated as the reuired delay between sleep and wake-up calls for threads sleep for long periods, while the rest coordinate the wake-up calls to almost always find the other thread sleeping. with busy waiting. If the communication is mostly done 6The wake-up call latency is directly measured in our microbench- via busy waiting, we almost remove the futex wake-up mark, while the turnaround time is estimated as the duration of the calls from the critical path. Additionally, we let threads sleep call, reduced by the delay between the sleep and wake-up calls. 7When the core is constantly active due to multiprogramming, the sleep for long periods, a reuirement for reducing power turnaround latency only includes the scheduling delays. consumption in sleeping.

6 398 2016 USENIX Annual Technical Conference USENIX Association sleep spin ss-1 ss-10 ss-100 ss-1000 Power Communication for up to 1000 cycles for up to 000 cycles 140 14 ∼ ∼ 130 12 spin with pause spin with mfence 120 10 lock if still busy, sleep with futex 110 8 100 6 release in user space (lock->locked = 0) 90 4 wait in user space

Power (Watts) 80 2 unlock wake up a thread with futex 70 0 1 10 20 30 40 Throughput (Mops/s) 1 10 20 30 40 # Threads # Threads Table 1: Differences between MUTEX and MUTEXEE. Figure 7: Power and communication throughput of sleeping, spinning, and spin-then-sleep for various Ts.8 In particular, MUTEX by default attempts to acuire the lock once before employing futex. MUTEX can This optimization comes at the expense of fairness. be configured (with the PTHREAD MUTEX ADAPTIVE NP The longer a thread sleeps while some others progress, initialization attribute) to perform up to 100 acuire at- the more unfair the lock becomes. We experiment with tempts before sleeping with futex.9 Still, threads spin the extreme case where only two threads communicate up to a few hundred cycles on the lock before sleep- via busy waiting, while the rest sleep. Each active thread ing with futex (the exact duration depends on the con- has a uota T of busy-waiting repetitions, after which tention on the cache line of the lock). This behavior can it wakes up another thread to take its turn. Figure 7 shows result in very poor performance for critical sections of up the power and the communication rate (similar to a lock to 4000 cycles. In brief, threads are put to sleep, although handover) of sleeping, busy waiting, and spin-then-sleep the ueuing time behind the lock is less than the futex- (ssT) with various Ts on a single futex. T is the ratio sleep latency. Additionally, to release a lock, MUTEX of busy-waiting over futex handovers. first sets the lock to free in user space and then wakes Figure 7 clearly shows that the more unfair an execu- up one sleeping thread (if any). However, a third concur- tion (i.e., for large Ts), the better the energy efficiency. rent thread can acuire the lock before the newly awaken First, larger values of T result in lower power, because thread T is ready to execute. T will then find the lock the sleep and wake-up futex calls become infreuent, aw aw occupied and sleep again, thus wasting energy, creating hence the sleeping threads sleep for a long duration. For unnecessary contention, and breaking lock fairness. example, on 10 threads with T = 1000, threads sleep for about 2M cycles. In comparison, with only sleeping, the To fix these two shortcomings, we design an optimized sleep duration is less than 90000 cycles. Second, spin version of MUTEX, called MUTEXEE. Table 1 details how handovers face minimal contention, as only two threads MUTEXEE differs from the traditional MUTEX. The wait attempt to acuire the cache line. Conseuently, be- in user space step of unlock reuires further explana- cause most handovers (99.9) happen with spinning, the tion. MUTEXEE, after releasing the lock in user space, futex latency is very low, resulting in high throughput. but before invoking , waits for a short period to detect whether the lock is acuired by another thread in mplications. Freuent futex calls will hurt the en- user space. In such case, the unlock operation returns ergy efficiency of a lock. A way around this problem without invoking futex. The waiting duration must be is to reduce lock fairness in the face of high contention, proportional to the maximum coherence latency of the by letting only a few threads use the lock as a spinlock, processor (e.g., 384 cycles on Xeon). while the remaining threads are asleep. 
Moreover, MUTEXEE operates in one of two modes: (i) spin, with 8000 cycles of spinning in the lock func- 5 nergy fficiency of ocks ∼ tion and 384 in unlock, and (ii) mutex, with 256 ∼ ∼ We evaluate the behavior of various locks in terms of cycles in lock and 128 in unlock (used to avoid useless energy efficiency and throughput, heavily relying on the ∼ spinning). MUTEXEE keeps track of statistics regarding results of 4. We first introduce MUTEXEE, an optimized § how many handovers occur with busy waiting and with version of MUTEX. futex. Based on those statistics, MUTEXEE periodically 5.1 : An Optimized ock decides on which mode to operate in: If the futex-to- busy-waiting handovers ratio is high (>30), MUTEXEE In 4, we analyze the overhead of futex calls. Addi- § uses mutex, otherwise it remains in spin mode. tionally, we show how we can trade fairness for energy efficiency. MUTEX does not explicitly take these trade- 9For brevity, in our graphs we show the default MUTEX configuration offs into account, although it is an unfair lock. (i.e., without PTHREAD MUTEX ADAPTIVE NP). We choose the default MUTEX version because: (i) it is the default in our systems ( 6), and § 8The performance collapse of spin is due to contention, while of ss10 (ii) we thoroughly compare the two versions and conclude that for most and ss100 due to the high idle-to-active switching costs (see Figure 6). configurations MUTEX is slightly faster without the adaptive attribute.

7 USENIX Association 2016 USENIX Annual Technical Conference 399 Throughput TPP Throughput TPP 16 3 16 6 512M 5+ 512M 5+ 5 12 12 32M 4 32M 4 2 4 8 8 3 2M 3 2M 3 1 2 4 4 128K 2 128K 2 1 0 0 0 0 8K 1 8K 1

10 20 30 40 50 60 10 20 30 40 50 60 MUTEXEE Timeout (ns) 10 20 30 40 50 60 10 20 30 40 50 60 Critical section (Kcycles) # Threads # Threads # Threads # Threads Figure 8: Throughput and TTP ratios of MUTEXEE over Figure 10: Throughput and TTP ratios of MUTEXEE MUTEX on various configurations with a single lock. without over with timeouts depending on the timeout.

Our design sensitivity analysis for MUTEXEE (not dovers are fast with busy waiting. However, the price of shown in the graphs) highlights three main points. First, this behavior is a few extremely high latencies as shown spinning for more than 4000 cycles is crucial for through- in the 99.99th percentile graph. These values are caused put: MUTEXEE with 500 cycles spin behaves similarly by the long-sleeping threads and represent the trade-off to MUTEX. Second, the wait in user space function- between lock fairness and energy efficiency. As the crit- ality is crucial for power consumption (and improves ical section size increases, the behavior of the two locks throughput): If we remove it, MUTEXEE consumes simi- converges: Both locks are highly unfair. lar power to MUTEX. Finally, the spin and mutex modes Recn s ences MUTEXEE of MUTEXEE can save power on lengthy critical sections. purposefully tries to reduce the number of futex invo- nenn The default configuration pa- cations by handing the lock over in user space whenever rameters of MUTEXEE should be suitable for most x86 possible. Therefore, it might let some threads sleep while processors. Still, these parameters are based on the la- the rest keep the lock busy, resulting in high tail laten- tencies of the various events that happen in a futex- cies. A straightforward way to limit the tail latencies of based lock, such as the latency of sleeping or waking MUTEXEE, so that threads are not allowed to remain in- up. Accordingly, in order to allow developers to fine- definitely asleep, is to use a timeout for the futex sleep tune MUTEXEE for a platform, we provide a script which call. Once a thread is woken up due to a timeout, the runs the necessary microbenchmarks and reports the con- thread spins until it acuires the lock, without the pos- 10 figuration parameters that can be used for that platform. sibility to sleep again. Controlling this timeout essen- tially controls the maximum latency of the lock (given rn Figure 8 depicts that the sleep duration is significantly larger than the crit- the ratios of throughput and energy efficiency of ical sections protected by that lock). MUTEXEE over MUTEX on various configurations on a Figure 10 depicts the relative performance of single lock. MUTEXEE indeed fixes the problematic be- MUTEXEE without over with timeouts for a single lock havior of MUTEX for critical sections of up to 4000 cy- with 2000 cycles critical sections. For an 8 µs timeout, cles. While MUTEX continuously puts threads to sleep MUTEXEE delivers up to 14x lower throughput and 24x and wakes them up shortly after, MUTEXEE lets the lower TPP than without timeouts. In general, for time- threads sleep for larger periods and keeps most lock han- outs shorter than 16-32 ms, both throughput and TPP suf- dovers futex free. Of course, the latter behavior of fer, representing the clear trade-off between fairness and MUTEXEE results in lower fairness as shown in Figure 9. performance. For example, with 20 threads, MUTEXEE Up to 4000 cycles, MUTEXEE achieves much lower 95th with a 4 ms timeout compares to the rest as follows: percentile latencies than MUTEX, because most lock han- c r enc Kac/s Kac/Joule Mcycles MUTEX 317 4.0 2.0 MUTEX MUTEXEE MUTEXEE 855 10.9 206.5 95th percentile 99.99th percentile 4 2000 MUTEXEE timeout 474 6.5 12.0 3 1500 Depending on the application, the developer can decide 2 1000 whether to use timeouts and choose the timeout duration 1 500 for MUTEXEE. For brevity, in the rest of the paper, we 0 0

Latency (Mcycles) Latency (Mcycles) use MUTEXEE without timeouts. As we show in 6, we 0 4 8 12 16 0 4 8 12 16 § Critical-section duration (Kcycles) Critical-section duration (Kcycles) do not observe significant tail-latency increases due to MUTEXEE in real systems. Figure 9: 95th and 99.99th percentile latency of a single MUTEX and MUTEXEE on various configurations. 10Of course, one can design more elaborate variants of this protocol.

8 400 2016 USENIX Annual Technical Conference USENIX Association 1 linear r 11.88 16.88 16.98 16.97 12.04 13.32 0.8 MUTEX 174.31 248.14 249.41 249.24 176.72 195.48 TAS 0.6 TTAS 0.4 TICKET Table 2: Single-threaded lock throughput and TPP. MCS 0.2 MUTEXEE Normalized TPP 0 n c rs 0 0.2 0.4 0.6 0.8 1 We evaluate various lock algorithms under different con- Normalized Throughput tention levels in terms of throughput and TPP. Figure 12: Correlation of throughput with energy effi- ciency (TPP) on various contention levels. ncnese cn It is common in systems that a lock is mostly used by a single thread and both the ac- uire and the release operations are almost always un- TPP because neither contention, nor the number of active contested. Table 2 includes the throughput (Mac/s) and hardware contexts increases with the number of threads. the TPP (Kac/Joule) of various lock algorithms when a This behavior comes at the expense of high tail latency: thread continuously acuires and releases a single lock. On 40 threads, MUTEXEE has an 80x higher 99.9th per- We use short critical sections of 100 cycles. centile latency than MUTEX. The trends in throughput and TPP are identical as Regarding spinlocks, TAS is the worst in this work- there is no contention. The locks perform inversely to load. This behavior is due to the stress on the lock, which their complexity. The simple spinlocks (TAS, TTAS, and makes the release of TAS very expensive. Moreover, for TICKET) acuire and release the lock with ust a few in- up to 40 threads, the ueue-based lock (MCS) delivers structions. MUTEX performs several sanity checks and the best throughput and TPP. ueue-based locks are de- also has to handle the case of some threads sleeping signed to avoid the burst of reuests on a single cache line when a lock is released. MUTEXEE is also more complex when the lock is released. On more than 40 threads, fair- than simple spinlocks due to its periodic adaptation. The ness shows its teeth. As Xeon has 40 hardware threads, ueue-based lock, MCS, is even more complex, because there is oversubscription of threads to cores. TICKET and threads must find and access per-thread ueue nodes. MCS, the two fair locks, suffer the most: If the thread that nennne c We experiment is the next to acuire the lock is not scheduled, the lock with a single lock accessed by a varying number of remains free until that thread is scheduled. threads. This experiment captures the behavior of Finally, throughput and TPP are directly correlated: highly-contended coarse-grained locks. We use a fixed The higher the throughput, the higher the energy effi- critical section of 1000 cycles. ciency. Still, MUTEXEE delivers higher TPP by achieving Figure 11 contains the throughput and the TPP results. both better throughput and lower power than the rest. On 40 threads, MUTEX delivers 73 lower TPP than re nenn Figure 12 plots the correlation TICKET: 63 less throughput and 5.8 more power. of throughput with TPP on a diverse set of configura- The throughput difference is due to (i) the global spin- tions. We vary the number of threads from 1 to 16, the MUTEX futex ning of , and (ii) the calls, even if they are size of critical section from 0 to 8000 cycles, and the infreuent. The power difference is mainly because of number of locks from 1 to 512. At every iteration within MUTEX pause the pausing techniue. spins with , while a configuration, each thread selects one of the locks at TICKET pause uses a memory barrier. With instead of a random. 
The results are normalized to the overall maxi- barrier, TICKET consumes 4 Watts more. mum throughput and TPP, respectively. Moreover, MUTEXEE maintains the contention levels Most data points fall on, or very close to, the linear and the freuency of futex calls low, regardless of the line. In other words, most executions have almost one- number of threads. This results in stable throughput and to-one correlation of throughput with TPP. The bottom- left cluster of values represents highly-contended points. MUTEX TAS TTAS TICKET MCS MUTEXEE On high contention, there is a trend below the linear line, Throughput Throughput per Power 3 45 which represents executions where throughput is rela- tively higher compared to energy efficiency. These re- 2 30 sults are expected, as on very high contention sleeping can save power compared to busy waiting, but still, busy 1 15 waiting might result in higher throughput.

0 TPP (Kacq/Joule) 0 If we zoom into the per-configuration best throughput Throughput (Macq/s) 1 10 20 30 40 50 60 1 10 20 30 40 50 60 and TPP, the correlation of the two is even more pro- # Threads # Threads found. On 85 of the 2084 configurations, the lock with Figure 11: Using a single (global) lock. the best throughput achieves the best energy efficiency

9 USENIX Association 2016 USENIX Annual Technical Conference 401 2.5 1.85 1.78 1.73 1.71 1.55 2 1.52 1.44 1.43 1.42 1.38 1.38 1.33 1.26 1.26 1.25 1.17 1.17 1.17 1.14 1.12 1.12 1.11 1.10 1.07 1.06 1.03 1.03 1.02 1.02 1.00

1.5 0.98 0.90 0.80 1 MUTEX 0.25 0.16

0.5 0.01 0 TICKET RD RD WT WT SET SSD GET MUTEXEE MEM HT DB CACHE B-TREE WT/RD WT/RD 16 CON CON 32 64 CON SET/GET Throughput (Normalized)Throughput HamsterDB Kyoto Memcached MySQL RocksDB SQLite Avg Figure 13: Normalized (to MUTEX) throughput of various systems with different locks. (Higher is better)

2.5 1.84 1.75 1.73 1.69 1.69

2.0 1.57 1.47 1.46 1.42 1.37 1.31 1.29 1.28 1.26 1.25 1.19 1.16 1.16 1.14 1.13 1.12 1.11 1.10 1.10 1.07 1.06 1.05 1.03 1.02 1.02

1.5 0.99 0.86 0.82 1.0 MUTEX 0.26 0.11

0.5 0.02 0.0 TICKET TPP TPP (Normalized) RD RD WT WT SET SSD GET MUTEXEE MEM HT DB CACHE B-TREE WT/RD WT/RD 32 CON 64 CON 16 CON SET/GET HamsterDB Kyoto Memcached MySQL RocksDB SQLite Avg Figure 14: Normalized (to MUTEX) energy efficiency (TPP) of various systems with different locks. (Higher is better)

as well. On the remaining 15, the highest throughput of systems so that they use the pthread library in di- is on average 8 better than the throughput of the high- verse ways, such as using mutexes or reader-writer locks, est TPP lock, while the highest TPP is 5 better than the building on top of mutexes, or relying on conditionals. TPP of the highest throughput lock. Note that we do not modify anything else other than the Finally, MUTEXEE delivers much higher throughput pthread locks and conditionals in these systems. and TPP than MUTEX on average, 25 and 32 higher Table 3 contains the description and the different con- throughput and TPP, respectively. MUTEX is better than figurations of the six systems that we evaluate. All MUTEXEE in ust 4 of the configurations (by 9 on benchmarks use a dataset size of approximately 10 GB average, both in terms of throughput and TPP). (in memory), except for the MySL SSD configuration 5. mplications that uses 100 GB. We set the number of threads for each system according to its throughput scalability. The POLY conecture states that energy efficiency and throughput go hand in hand in locks. Our evaluation .1 Results of POLY with six state-of-the-art locks on various con- tention levels shows that, with a few exceptions, POLY is Figures 13-14 show the throughput and the energy effi- indeed valid. The exceptions to POLY are high contention ciency (TPP) of the target systems with different locks. scenarios, where sleeping is able to reduce power, but still results in slightly lower throughput than busy wait- amsterB An embedded key-value store. We run three tests with Version: 2.1.7 random reads and writes, varying the read-to-write ing on the contended locks. Threads: 4 ratio from 10 (WT), 50 (WT/RD), to 90 (RD). For low contention levels, energy efficiency depends yoto An embedded NoSL store. We stress Kyoto with a only on throughput, as there are no opportunities for sav- Version: 1.2.76 mix of operations for three database versions ing energy. In these scenarios, even infreuent futex Threads: 4 (CACHE, HT DB, B-TREE). emcached An in-memory cache. We evaluate Memcached using calls reduce both throughput and energy efficiency. Version: 1.4.22 a Twitter-like workload [40]. We vary the get-to-set For high contention, sleeping can reduce power con- Threads: 8 ratio from 10 (WT), 50 (WT/RD), to 90 (RD). sumption. However, the freuent futex calls of MUTEX The server and the clients run on separate sockets. hinder the potential energy-efficiency benefits due to yS 9 An RDBMS. We use Facebooks LinkBench and Version: 5.6.19 tuning guidelines [2] for an in-memory (MEM) and an throughput degradation. MUTEXEE is able to reduce the SSD-drive (SSD) configurations. freuency of futex calls either by avoiding the ones RocksB 10 A persistent embedded store. We use the benchmark that are purposeless, or by reducing fairness. MUTEXEE Version: 3.3.0 suite and guidelines of Facebook for an in-memory achieves both higher throughput and lower power than Threads: 12 configuration [11]. We run 3 tests with random reads and writes, varying the read-to-write ratio from 10 spinlocks or MUTEX for high contention levels. (WT), 50 (WT/RD), to 90 (RD). Site 1 A relational DB engine. We use TPC-C with 100 nergy fficiency of ockased Systems Version: 3.8.5 warehouses varying the number of concurrent We modify the locks of various concurrent systems to connections (i.e., 8, 32, and 64). improve their energy efficiency. We choose the set Table 3: Software systems and configurations.

10 402 2016 USENIX Annual Technical Conference USENIX Association MUTEX TICKET MUTEXEE For brevity, we show results with MUTEX, TICKET, and 2.5

MUTEXEE 5.94

. The remaining local-spinning locks are sim- 2.0 1.62 18.96 22.08 1.34 1.23 1.5 1.22

TICKET TAS 1.05 ilar to ( is less efficient–see 5). 1.04 0.96 0.94 0.91 0.89 0.87 0.86 0.76

§ 0.70 0.65 1.0 0.64

hroughput and nergy fficiency. In 16 out of the 0.5 0.19 0.04 0.01 17 experiments, avoiding the overheads of MUTEX im- 0.0 RD WT SET SSD proves energy efficiency from 2 to 184. On aver- GET MEM 99th Latency (Normalized) Latency 99th WT/RD 16 CON16 CON32 CON64

age, changing MUTEX for either TICKET or MUTEXEE SET/GET improves throughput by 31 and TPP by 33. The re- HamsterDB Memcached MySQL SQLite sults include three distinct trends. Figure 15: Normalized (to MUTEX) tail latency of vari- First, in some systems/configurations (i.e., Mem- ous systems with different locks. (Lower is better) cached and HamsterDB) sleeping can kill through- put. For instance, on the SET workload on Memcached, MUTEXEE allows for a few sleep invocations, resulting in lower throughput than TICKET. put and lower power than MUTEX, without increasing Second, in some systems/configurations (i.e., MySL tail latencies. TPC-C transactions on SLite have laten- and RocksDB) MUTEX is less of a problem. Both of cies in the scale of tens of ms. Each transaction consists these systems build more complex synchronization pat- of multiple accesses to shared data protected by various terns on top of MUTEX. MySL handles most low-level locks. MUTEXEE does indeed increase the tail latency synchronization with customly-designed locks. Simi- of individual locks, but these latencies are in the scale of larly, RocksDB employs a write ueue where threads en- hundreds of µs and do not appear in the transaction laten- ueue their operations and mostly relies on a conditional cies. However, this low-level unfairness brings huge con- variable. Therefore, altering MUTEX with another algo- tention reductions. For instance, on 64 CON, the SLite rithm does not make a big difference. server with MUTEX puts threads to sleep for 472 µs on Finally, in MySL and SLite sleeping is necessary. average, compared to 913 µs with MUTEXEE. The result Both these systems oversubscribe threads to cores, thus is that with MUTEX, SLite spends more than 40 of the spinlocks, such as TICKET, result in very low through- CPU time on the raw spin lock function of the kernel put. A spinning thread can occupy the context of a thread due to contention on futex calls. In contrast, MUTEXEE that could do useful work. Additionally, on the SSD, spends ust 4 of the time on kernel locks, and 21 on TICKET consumes 40 more power than the other two, the user-space lock functions. as it keeps all cores active. The fairness of TICKET exac- erbates the problems of busy waiting in the presence of mplications. Changing MUTEX in six modern sys- thread oversubscription: TTAS (not shown in the graph) tems results in 33 higher energy efficiency, driven by has roughly 6x higher throughput than TICKET, but it is a 31 increase in throughput on average. Clearly, the still much slower than MUTEX and MUTEXEE. POLY conecture (i.e., throughput and energy efficiency Overall, in five out of the six systems, the energy- go hand in hand in locks) holds in software systems and efficiency improvements are mostly driven by the in- implies that we can continue business as usual: To opti- creased throughput. SLite is the only system where the mize a system for energy efficiency, we can still optimize lock plays a significant role in terms of both throughput the systems locks for throughput. and power consumption. With MUTEXEE, SLite con- Additionally, we show that MUTEX locks must be re- sumes 15 and 18 less power than with MUTEX with designed to take the latency overheads of futex calls 32 and 64 connections, respectively. into account. MUTEXEE, our optimized implementation ail atency. MUTEXEE can become more unfair than of MUTEX, achieves 26 higher throughput and 28 MUTEX (see 5). Figure 15 includes the oS of four sys- better energy efficiency than MUTEX. 
Furthermore, the § tems in terms of tail latency. For most configurations, unfairness of MUTEXEE might not be a maor issue in the results are intuitive: Better throughput comes with a real systems: MUTEXEE can lead to high tail latencies lower tail latency. However, there are a few configura- only under extreme contention scenarios, that must be tions that are worth analyzing. avoided in well engineered systems. First, MUTEXEEs unfairness appears in the RD con- In conclusion, we see that optimizing lock-based syn- figuration of HamsterDB, resulting in almost 20x higher chronization is a good candidate for improving the en- tail latency than MUTEX, but also in 46 higher TPP. ergy efficiency of real systems. We can modify the locks Second, TICKET has high tail latencies on all oversub- with minimal effort, without affecting the behavior of scribed executions as a result of low performance. other system components, and, more importantly, with- Finally, MUTEXEE on SLite achieves better through- out degrading throughput.

11 USENIX Association 2016 USENIX Annual Technical Conference 403 Related Work sitioning between active-to-idle power states allows for low idle power [44, 45]. Psaroudakis et al. [53] achieve ockased Synchronization. Lock-based synchro- up to 4x energy-efficiency improvements in database an- nization has been thoroughly analyzed in the past. For alytical workloads, using hardware models for power- instance, many studies [13, 15, 34, 42, 46] point out scal- aware scheduling. Similarly, Tsirogiannis et al. [61] an- ability problems due to excessive coherence with tradi- alyze a DB system and conclude that the most energy- tional spinlocks and propose alternatives, such as hier- efficient point is also the best performing one. Our POLY archical spin- and ueue-based locks [34, 43, 54]. Prior conecture is a similar result for locks. Nevertheless, work [20, 25] sacrifices short-term fairness for perfor- while they evaluate various DB configurations, we study mance, however, it does not consider sleeping or energy the spin vs. sleep trade-off. To the best of our knowledge, efficiency. Similarly to MUTEXEE, Solaris mutex locks this is the first paper to consider the energy trade-offs of offer the option of adaptive unlock, where the lock synchronization on modern multi-cores. owner does not wake up any threads if the lock can be handed over in user space [60]. David et al. [24] analyze Concluding Remarks several locks on different platforms and conclude that In this paper, we thoroughly analyzed the power/perfor- scalability is mainly a property of the hardware. More- mance trade-offs in lock-based synchronization in order shet et al. [47] share some preliminary results suggesting to improve the energy efficiency of systems. Our re- that transactional memory can be more energy efficient sults support the POLY conecture: Energy efficiency and than locks. Wamhoff et al. [62] evaluate the overheads throughput go hand in hand in lock algorithms. POLY has of using DVFS in locks and show how to improve per- important software and hardware ramifications. formance by boosting the lock owner. Our work extends For software, POLY conveys the ability to improve the prior synchronization work with a complete study of the energy efficiency of systems in an simple and systematic energy efficiency of lock-based synchronization. way, without hindering throughput. We indeed improved Spinthensleep radeoff. The spin-then-sleep strat- the energy efficiency of six popular software systems by egy was first proposed by Ousterhout [49]. Various stud- 33 on average, driven by a 31 increase in throughput, ies [19, 35, 39] analyze this trade-off and show that ust These improvements are mainly due to MUTEXEE, our spinning or sleeping is typically suboptimal. Franke redesigned version of pthread mutex lock, that builds on et al. [29] recognize the need for fast user-space lock- the results of our analysis. ing and describe the first implementation of futex in For hardware, POLY highlights the lack of adeuate Linux. Johnson et al. [33] advocate for decoupling the tools for reducing the power consumption of synchro- lock-contention strategy from thread scheduling. At nization, without significantly degrading throughput. We first glance, MUTEXEE might look similar to their load- performed our analysis on two modern Intel platforms control TP-MCS [31] lock (LC). However, the two have that are representative of a large portion of the processors some notable differences. LC relies on a global view of used nowadays. 
Concluding Remarks

In this paper, we thoroughly analyzed the power/performance trade-offs in lock-based synchronization in order to improve the energy efficiency of systems. Our results support the POLY conjecture: Energy efficiency and throughput go hand in hand in lock algorithms. POLY has important software and hardware ramifications.

For software, POLY conveys the ability to improve the energy efficiency of systems in a simple and systematic way, without hindering throughput. We indeed improved the energy efficiency of six popular software systems by 33% on average, driven by a 31% increase in throughput. These improvements are mainly due to MUTEXEE, our redesigned version of the pthread mutex lock, which builds on the results of our analysis.

For hardware, POLY highlights the lack of adequate tools for reducing the power consumption of synchronization without significantly degrading throughput. We performed our analysis on two modern Intel platforms that are representative of a large portion of the processors used nowadays. We argue that our results apply to most multi-core processors, because without explicit hardware support for synchronization, the power behavior of both busy waiting and sleeping will be similar regardless of the underlying hardware.

Our analysis further points out potential hardware tools that could reduce the power of synchronization. In brief, a truly energy-friendly pause instruction, fast per-core DVFS, and user-level monitor/mwait can make the difference. In fact, industry has already started heading in these directions. Intel includes on-chip voltage regulators on its latest processors, but all the cores share the same frequency. Similarly, the recent Oracle SPARC M7 processor includes a variable-length pause instruction and user-level monitor/mwait [51].

Acknowledgments. We wish to thank our shepherd, Leonid Ryzhyk, and the anonymous reviewers for their fruitful comments. We also want to thank Dave Dice for his useful feedback on an early draft of this paper. This work has been partially supported by the European ERC Grant 339539-AOC.
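As a side note on methodology, package-level power figures of the kind summarized above are commonly obtained on Intel machines from the RAPL energy counters. The sketch below is only an illustration of that generic approach and not this paper's measurement setup: it reads the standard Linux powercap sysfs counter for package 0 around a fixed interval and reports average power; the sysfs path and the one-second window are assumptions for the example.

/* Sketch: estimate average package power from the Intel RAPL energy counter. */
#include <stdio.h>
#include <unistd.h>

static long long read_energy_uj(void)
{
    long long uj = -1;
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void)
{
    const double interval = 1.0;            /* seconds; window to profile */
    long long e0 = read_energy_uj();
    sleep((unsigned)interval);               /* run or wait for the workload of interest */
    long long e1 = read_energy_uj();
    if (e0 >= 0 && e1 >= e0)                 /* counter wrap-around ignored for brevity */
        printf("average package power: %.2f W\n", (e1 - e0) / 1e6 / interval);
    return 0;
}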

References

[1] America's Data Centers Consuming and Wasting Growing Amounts of Energy. http://www.nrdc.org/energy/data-center-efficiency-assessment.asp.
[2] Facebook LinkBench Benchmark. https://github.com/facebook/linkbench.
[3] HamsterDB. http://hamsterdb.com.
[4] Intel 64 and IA-32 Architectures Software Developer Manuals. http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html.
[5] Intel Xeon Processor E5-1600/E5-2600/E5-4600 Product Families. http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-1600-2600-vol-1-datasheet.pdf.
[6] Java CopyOnWriteArrayList Data Structure. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CopyOnWriteArrayList.html.
[7] Kyoto Cabinet. http://fallabs.com/kyotocabinet.
[8] Memcached. http://memcached.org.
[9] MySQL. http://www.mysql.com.
[10] RocksDB. http://rocksdb.org.
[11] RocksDB In-memory Workload Performance Benchmarks. https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
[12] SQLite. http://sqlite.org.
[13] AGARWAL, A., AND CHERIAN, M. Adaptive Backoff Synchronization Techniques. ISCA '89.
[14] ANASTOPOULOS, N., AND KOZIRIS, N. Facilitating Efficient Synchronization of Asymmetric Threads on Hyper-threaded Processors. IPDPS '08.
[15] ANDERSON, T. The Performance of Spin Lock Alternatives for Shared-memory Multiprocessors. IEEE TPDS (1990).
[16] BAEK, W., AND CHILIMBI, T. Green: A Framework for Supporting Energy-conscious Programming Using Controlled Approximation. PLDI '10.
[17] BARROSO, L., AND HÖLZLE, U. The Case for Energy-proportional Computing. IEEE Computer (2007).
[18] BENINI, L., KANDEMIR, M., AND RAMANUJAM, J. Compilers and Operating Systems for Low Power. Kluwer Academic Publishers, 2003.
[19] BOGUSLAVSKY, L., HARZALLAH, K., KREINEN, A., SEVCIK, K., AND VAINSHTEIN, A. Optimal Strategies for Spinning and Blocking. J. Parallel Distrib. Comput. (1994).
[20] CALCIU, I., DICE, D., LEV, Y., LUCHANGCO, V., MARATHE, V. J., AND SHAVIT, N. NUMA-Aware Reader-Writer Locks. PPoPP '13.
[21] CHASE, J., ANDERSON, D., THAKAR, P., VAHDAT, A., AND DOYLE, R. Managing Energy and Server Resources in Hosting Centers. SOSP '01.
[22] CRAIG, T. Building FIFO and Priority-Queuing Spin Locks From Atomic Swap. Tech. rep., 1993.
[23] DAVID, T., GUERRAOUI, R., AND TRIGONAKIS, V. Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures. ASPLOS '15.
[24] DAVID, T., GUERRAOUI, R., AND TRIGONAKIS, V. Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask. SOSP '13.
[25] DICE, D., MARATHE, V., AND SHAVIT, N. Lock Cohorting: A General Technique for Designing NUMA Locks. PPoPP '12.
[26] ESMAEILZADEH, H., BLEM, E., ST. AMANT, R., SANKARALINGAM, K., AND BURGER, D. Dark Silicon and the End of Multicore Scaling. ISCA '11.
[27] ESMAEILZADEH, H., SAMPSON, A., CEZE, L., AND BURGER, D. Architecture Support for Disciplined Approximate Programming. ASPLOS '12.
[28] FLEISCHMANN, M. LongRun Power Management. Tech. rep., Transmeta Corporation, 2001.
[29] FRANKE, H., RUSSELL, R., AND KIRKWOOD, M. Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux. OLS '02.
[30] HARDAVELLAS, N., FERDMAN, M., FALSAFI, B., AND AILAMAKI, A. Toward Dark Silicon in Servers. IEEE Micro (2011).
[31] HE, B., SCHERER III, W. N., AND SCOTT, M. L. Preemption Adaptivity in Time-Published Queue-Based Spin Locks. HiPC '05.
[32] HUNT, N., SANDHU, P., AND CEZE, L. Characterizing the Performance and Energy Efficiency of Lock-Free Data Structures. INTERACT '11.
[33] JOHNSON, F., STOICA, R., AILAMAKI, A., AND MOWRY, T. Decoupling Contention Management from Scheduling. ASPLOS '10.
[34] KÄGI, A., BURGER, D., AND GOODMAN, J. R. Efficient Synchronization: Let Them Eat QOLB. ISCA '97.
[35] KARLIN, A. R., LI, K., MANASSE, M. S., AND OWICKI, S. Empirical Studies of Competitive Spinning for a Shared-memory Multiprocessor. SOSP '91.
[36] KOOMEY, J. Growth in Data Center Electricity Use 2005 to 2010. Analytics Press Report.
[37] KOUKOS, K., BLACK-SCHAFFER, D., SPILIOPOULOS, V., AND KAXIRAS, S. Towards More Efficient Execution: A Decoupled Access-execute Approach. ICS '13.
[38] LI, H., BHUNIA, S., CHEN, Y., ROY, K., AND VIJAYKUMAR, T. DCG: Deterministic Clock-Gating for Low-Power Microprocessor Design. IEEE TVLSI (2004).
[39] LIM, B.-H., AND AGARWAL, A. Waiting Algorithms for Synchronization in Large-scale Multiprocessors. ACM TOCS (1993).
[40] LIM, K., MEISNER, D., SAIDI, A., RANGANATHAN, P., AND WENISCH, T. Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached. ISCA '13.
[41] LO, D., CHENG, L., GOVINDARAJU, R., BARROSO, L. A., AND KOZYRAKIS, C. Towards Energy Proportionality for Large-Scale Latency-Critical Workloads. ISCA '14.
[42] LOZI, J., DAVID, F., THOMAS, G., LAWALL, J., AND MULLER, G. Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications. ATC '12.
[43] LUCHANGCO, V., NUSSBAUM, D., AND SHAVIT, N. A Hierarchical CLH Queue Lock. ICPP '06.
[44] MEISNER, D., GOLD, B., AND WENISCH, T. PowerNap: Eliminating Server Idle Power. ASPLOS '09.
[45] MEISNER, D., AND WENISCH, T. DreamWeaver: Architectural Support for Deep Sleep. ASPLOS '12.
[46] MELLOR-CRUMMEY, J., AND SCOTT, M. Algorithms for Scalable Synchronization on Shared-memory Multiprocessors. ACM TOCS (1991).
[47] MORESHET, T., BAHAR, R. I., AND HERLIHY, M. Energy Implications of Multiprocessor Synchronization. SPAA '06.
[48] MUTHUKARUPPAN, T., PATHANIA, A., AND MITRA, T. Price Theory Based Power Management for Heterogeneous Multi-Cores. ASPLOS '14.
[49] OUSTERHOUT, J. Scheduling Techniques for Concurrent Systems. ICDCS '82.
[50] PALLIPADI, V., AND STARIKOVSKIY, A. The Ondemand Governor. OLS '06.
[51] PHILLIPS, S. M7: Next Generation SPARC. Hot Chips '14.

[52] POWELL, M., YANG, S., FALSAFI, B., ROY, K., AND VIJAYKUMAR, T. Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories. ISLPED '00.
[53] PSAROUDAKIS, I., KISSINGER, T., POROBIC, D., ILSCHE, T., LIAROU, E., TÖZÜN, P., AILAMAKI, A., AND LEHNER, W. Dynamic Fine-Grained Scheduling for Energy-Efficient Main-Memory Queries. DaMoN '14.
[54] RADOVIC, Z., AND HAGERSTEN, E. Hierarchical Backoff Locks for Nonuniform Communication Architectures. HPCA '03.
[55] RIBIC, H., AND LIU, Y. Energy-Efficient Work-Stealing Language Runtimes. ASPLOS '14.
[56] ROTEM, E., NAVEH, A., MOFFIE, M., AND MENDELSON, A. Analysis of Thermal Monitor Features of the Intel Pentium M Processor. TACS Workshop at ISCA '04.
[57] SAMPSON, A., DIETL, W., FORTUNA, E., GNANAPRAGASAM, D., CEZE, L., AND GROSSMAN, D. EnerJ: Approximate Data Types for Safe and General Low-Power Computation. PLDI '11.
[58] SHEN, K., SHRIRAMAN, A., DWARKADAS, S., ZHANG, X., AND CHEN, Z. Power Containers: An OS Facility for Fine-Grained Power and Energy Management on Multicore Servers. ASPLOS '13.
[59] SINGH, K., BHADAURIA, M., AND MCKEE, S. Real Time Power Estimation and Thread Scheduling via Performance Counters. SIGARCH CAN (2009).
[60] SUN MICROSYSTEMS. Multithreading in the Solaris Operating Environment. http://home.mit.bme.hu/~meszaros/edu/oprendszerek/segedlet/unix/2_folyamatok_es_utemezes/solaris_multithread.pdf.
[61] TSIROGIANNIS, D., HARIZOPOULOS, S., AND SHAH, M. Analyzing the Energy Efficiency of a Database Server. SIGMOD '10.
[62] WAMHOFF, J., DIESTELHORST, S., FETZER, C., MARLIER, P., FELBER, P., AND DICE, D. The TURBO Diaries: Application-controlled Frequency Scaling Explained. ATC '14.
[63] WU, Q., MARTONOSI, M., CLARK, D., REDDI, V., CONNORS, D., WU, Y., LEE, J., AND BROOKS, D. A Dynamic Compilation Framework for Controlling Microprocessor Energy and Performance. MICRO '05.
[64] XIE, F., MARTONOSI, M., AND MALIK, S. Compile-Time Dynamic Voltage Scaling Settings: Opportunities and Limits. PLDI '03.
[65] XU, C., LIN, F., WANG, Y., AND ZHONG, L. Automated OS-Level Device Runtime Power Management. ASPLOS '15.
[66] ZHAI, Y., ZHANG, X., ERANIAN, S., TANG, L., AND MARS, J. HaPPy: Hyperthread-Aware Power Profiling Dynamically. ATC '14.
