IBM Linux Technology Center
Tweaking Linux for a Green Datacenter
Vaidyanathan Srinivasan
© 2009 IBM Corporation IBM Linux Technology Center
Agenda
. Platform features and Linux exploitation
. Tuning scheduler and cpufreq
. Saving power in an idle system
. Saving power in an underutilized system
. NUMA constraints for power management
. Can virtualization save power?
. Power trending and power capping
Energy Saving features in hardware
. Dynamic frequency and voltage scaling
    Predefined set of frequencies and corresponding voltage states
    Dual-core and quad-core CPUs may share power domains and clock distribution
. Sleep states at idle
    Sleep states with low latency
    Deep sleep states (more power savings, higher latencies)
    Choice of sleep states (wakeup latency vs power savings)
    Package-level sleep states
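Why lowering voltage together with frequency pays off can be sketched with the standard CMOS approximation that dynamic power scales as C * V^2 * f. The frequencies below are taken from the PowerTOP P-state listing later in this deck; the voltages are hypothetical placeholders, not measured values.

```python
def relative_dynamic_power(freq_ghz, volts, ref_freq_ghz, ref_volts):
    """Dynamic power relative to a reference state, using P ~ C * V^2 * f.

    The capacitance C cancels out because both states run on the same chip.
    """
    return (volts ** 2 * freq_ghz) / (ref_volts ** 2 * ref_freq_ghz)


# (frequency in GHz, core voltage in V); the voltages are assumed values.
P_STATES = [(2.93, 1.20), (2.53, 1.10), (1.60, 0.90)]

if __name__ == "__main__":
    ref_f, ref_v = P_STATES[0]
    for f, v in P_STATES:
        ratio = relative_dynamic_power(f, v, ref_f, ref_v)
        print(f"{f:.2f} GHz @ {v:.2f} V -> {ratio:.2f}x of top-state dynamic power")
```

This model covers dynamic power only; leakage and the rest of the platform are ignored.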
Energy Saving features in Linux kernel
. OnDemand frequency scaling (cpufreq)
. Tickless kernel (NO_HZ)
. Deferrable timers (for kernel code)
. Scheduler domains, CPU affinity, cpusets
. Multi-core power saving heuristics (sched_mc_power_savings)
. Support for deep sleep states
. Latency-based selection of various low power deep sleep states (cpuidle governor)
. Device power management infrastructure (USB, PCI, ...)
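A minimal sketch of checking whether the OnDemand governor is active, assuming the usual cpufreq sysfs layout under /sys/devices/system/cpu. The helper names and the `root` parameter are illustrative only; `root` exists so the code can be pointed at a test directory.

```python
from pathlib import Path


def read_cpufreq_policy(cpu=0, root="/sys/devices/system/cpu"):
    """Return selected cpufreq attributes for one CPU (None if unavailable)."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    policy = {}
    for name in ("scaling_governor",
                 "scaling_available_governors",
                 "scaling_cur_freq"):
        try:
            policy[name] = (base / name).read_text().strip()
        except OSError:
            # cpufreq driver not loaded, or attribute not exported
            policy[name] = None
    return policy


def uses_ondemand(policy):
    """True when the active governor is 'ondemand'."""
    return policy.get("scaling_governor") == "ondemand"
```

Calling `uses_ondemand(read_cpufreq_policy(0))` on a live system reports the state of CPU 0; other CPUs may be governed by a separate policy.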
Idle CPU Power Management
[Figure: at low system utilization, processes P0 to P4 are spread across CPU 0 to CPU 3. Moving processes, timers and interrupts consolidates the workloads onto fewer CPUs, so the remaining CPUs can sleep (zzZ).]
. Tickless kernel helps idle CPUs sleep longer
Power saving enhancements
Across the stack: hardware, firmware and Linux kernel
[Figure: two bar charts, "Power savings at Idle" and "Power savings at 50% load". Vertical axis: Feature. Horizontal axis: Normalised Average Power (82 to 100). Each chart has one bar per configuration:]
    No Powersaving
    Sleep state
    Dynamic Voltage and Frequency Scaling (DVFS) + Sleep
    DVFS + Sleep + Tickless
    DVFS + Sleep + Tickless + sched_mc=1
    DVFS + Sleep + Tickless + sched_mc=2
Approximate power savings percentages obtained across different experiments and hardware platforms
OnDemand CPU Frequency switching
CPU Task consolidation -- Kernbench
CPU Task consolidation -- SPECPower
CPU Task consolidation – Use sibling threads
Saving power in Idle system
. Optimize applications to reduce wake-ups at idle
. Increase low power sleep state residency
PowerTOP 1.11 (C) 2007, 2008 Intel Corporation
    Cn                  Avg residency        P-states (frequencies)
    C0 (cpu running)           ( 0.8%)         2.93 Ghz     0.0%
    polling             0.0ms  ( 0.0%)         2.80 Ghz     0.0%
    C1 mwait           13.3ms  ( 1.1%)         2.67 Ghz     0.0%
    C3 mwait           46.1ms  (98.0%)         2.53 Ghz     0.0%
                                               1.60 Ghz   100.0%
    Wakeups-from-idle per second : 22.1    interval: 235.0s
    no ACPI power usage estimate available
    Top causes for wakeups:
      49.0% ( 67.4)
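The average residency figures that PowerTOP reports can be approximated from the cpuidle sysfs counters. This is a minimal sketch, not PowerTOP's own method. It assumes the cpuidle layout /sys/devices/system/cpu/cpuN/cpuidle/stateM with `name`, `usage` (entry count) and `time` (microseconds) attributes. The counters are cumulative since boot, so an interval measurement needs two samples and a difference.

```python
from pathlib import Path


def cpuidle_residency(cpu=0, root="/sys/devices/system/cpu"):
    """Return [(state name, entry count, average residency in ms), ...]."""
    base = Path(root) / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        return []  # cpuidle not available for this CPU
    state_dirs = [p for p in base.glob("state*") if p.name[5:].isdigit()]
    states = []
    for state_dir in sorted(state_dirs, key=lambda p: int(p.name[5:])):
        try:
            name = (state_dir / "name").read_text().strip()
            usage = int((state_dir / "usage").read_text())
            time_us = int((state_dir / "time").read_text())
        except (OSError, ValueError):
            continue  # skip states with unreadable counters
        avg_ms = time_us / usage / 1000.0 if usage else 0.0
        states.append((name, usage, avg_ms))
    return states
```

A longer average residency in the deepest state, together with fewer wakeups, is the trend to look for after tuning.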
Optimised idle load balancer
. One CPU among the idle CPUs runs the sched tick and watches for overload from busy CPUs.
. Choosing this idle load balancer from a semi-idle CPU package allows the other CPU packages to go completely idle.
[Figure: Core 0 to Core 7 across CPU packages. One busy CPU is running a task and the idle CPUs are in deep sleep; the idle load balancer CPU runs load balancing and the sched tick. Moving the idle load balancer to the semi-idle CPU package lets the other package enter the package deep sleep state (zzZ).]
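The selection policy can be illustrated with a small simulation. This is a sketch of the idea only, not the kernel's implementation. The package layout (cores 0 to 3 and 4 to 7) and the function name are assumptions made for the example.

```python
def pick_idle_load_balancer(packages, busy_cpus):
    """Pick an idle CPU to act as idle load balancer.

    packages:  dict mapping package id -> list of CPU ids in that package
    busy_cpus: set of CPU ids currently running tasks

    Prefer an idle CPU in a semi-idle package (one that already has a busy
    CPU), so that fully idle packages are left undisturbed. Fall back to any
    idle CPU; return None when every CPU is busy.
    """
    fallback = None
    for pkg_id in sorted(packages):
        cpus = packages[pkg_id]
        idle = [c for c in cpus if c not in busy_cpus]
        if not idle:
            continue  # package fully busy, nothing to pick here
        if len(idle) < len(cpus):
            return idle[0]  # semi-idle package: the preferred choice
        if fallback is None:
            fallback = idle[0]  # fully idle package: use only as last resort
    return fallback


if __name__ == "__main__":
    layout = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}  # assumed two-package layout
    print(pick_idle_load_balancer(layout, busy_cpus={0}))
```

With core 0 busy, the balancer lands in package 0 and package 1 receives no periodic tick, which is the precondition for package-level deep sleep.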
Timer migration
. Tasks can be consolidated using the sched_mc framework
. Interrupts can be consolidated using the user-space irqbalance daemon
. Migrating timers from idle CPUs to the idle load balancer CPU, coupled with optimized selection of the idle load balancer CPU, provides good results for package evacuation and increases deep sleep residency time
. Consolidating timers onto a single CPU enables good overlap of timer expiry times and reduces the total number of wakeups from idle
. The range timer framework (from Arjan) will further help reduce wakeups from the idle deep sleep state
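How overlapping expiry windows reduce wakeups once timers sit on one CPU can be shown with a small model. It is an illustration of the principle only and does not reproduce the kernel's timer code. Each timer is given a hypothetical (earliest, latest) expiry window in milliseconds, and a greedy pass finds the fewest wakeups that serve all of them.

```python
def plan_wakeups(timers):
    """Return the wakeup times needed to serve all timers.

    timers: list of (earliest, latest) expiry windows. A timer may fire at
    any wakeup that falls inside its window. Sorting by the latest allowed
    time and waking as late as possible gives the minimum number of wakeups.
    """
    wakeups = []
    for earliest, latest in sorted(timers, key=lambda t: t[1]):
        if not wakeups or earliest > wakeups[-1]:
            # The last planned wakeup is too early for this timer.
            wakeups.append(latest)
    return wakeups


if __name__ == "__main__":
    exact = [(10, 10), (12, 12), (15, 15)]    # no slack: one wakeup each
    ranged = [(10, 15), (12, 17), (15, 20)]   # 5 ms of slack per timer
    print(len(plan_wakeups(exact)), len(plan_wakeups(ranged)))
```

The same three timers need three wakeups without slack and a single wakeup with 5 ms of slack, which is the effect the range timer work aims for.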
Saving power in underutilized system
. Understanding workloads:
    Number of software threads and their CPU utilization
    Relation between threads: amount of data sharing
    Latency-sensitive vs throughput-oriented workloads
. Knobs available:
    CPU frequency governors and policies
    Scheduler tunables: sched_mc_power_savings and sched_smt_power_savings
    CPU idle governor and PM QoS framework
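A minimal sketch for inspecting the scheduler tunables, assuming they are exported under /sys/devices/system/cpu as on 2.6.2x kernels. These files were removed from later kernels, so a missing file is reported as None rather than treated as an error. The level descriptions are a paraphrase of the kernel documentation of that period and should be checked against the running kernel.

```python
from pathlib import Path

# Paraphrased meanings of the tunable values (an assumption to verify).
LEVELS = {
    0: "no power-saving load balance",
    1: "fill one thread/core/package first for long-running threads",
    2: "also bias task wakeups towards a semi-idle package",
}


def read_sched_power_knobs(root="/sys/devices/system/cpu"):
    """Return the current sched_mc/sched_smt settings (None if absent)."""
    knobs = {}
    for name in ("sched_mc_power_savings", "sched_smt_power_savings"):
        try:
            knobs[name] = int((Path(root) / name).read_text())
        except (OSError, ValueError):
            knobs[name] = None  # tunable not present on this kernel
    return knobs


def describe(level):
    """Human-readable description of a tunable value."""
    return LEVELS.get(level, "unavailable or unknown")
```

Writing a new value requires root privileges and is typically done with a plain write of the digit to the same file.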
NUMA Constraints for power management
. On-chip memory controllers make each package a NUMA node
    Task consolidation may increase memory latency for tasks
    Constraints in package-level power management
    Performance tradeoffs are very sensitive to workloads
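Whether a consolidation move would cross a NUMA node boundary can be checked from the node topology. The sketch below assumes the standard /sys/devices/system/node/nodeN/cpulist files; the function names are illustrative.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs cpulist such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(root="/sys/devices/system/node"):
    """Return {node id: [cpu ids]} read from sysfs (empty if unavailable)."""
    nodes = {}
    base = Path(root)
    if not base.is_dir():
        return nodes
    for node_dir in base.glob("node*"):
        if not node_dir.name[4:].isdigit():
            continue
        try:
            cpus = parse_cpulist((node_dir / "cpulist").read_text())
        except (OSError, ValueError):
            continue
        nodes[int(node_dir.name[4:])] = cpus
    return nodes


def crosses_node(nodes, src_cpu, dst_cpu):
    """True when moving a task from src_cpu to dst_cpu changes NUMA node."""
    node_of = {cpu: node for node, cpus in nodes.items() for cpu in cpus}
    return node_of.get(src_cpu) != node_of.get(dst_cpu)
```

A move that crosses nodes leaves the task's memory behind on the old node, which is the latency cost noted above.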
Can Virtualization save power?
. Virtualization can improve system utilization
    Operate system at better power efficiency
    VM guest configurations and resource allocations can be optimized
. Power saving optimizations within a guest are limited
    Hypervisor needs to coordinate policies across guests
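The utilization argument can be made concrete with a deliberately simple model: server power rises linearly from an idle floor to a peak. The wattages and utilizations below are hypothetical, and hypervisor overhead is ignored.

```python
def server_power(utilization, idle_watts, peak_watts):
    """Linear power model: idle floor plus a utilization-proportional part."""
    return idle_watts + (peak_watts - idle_watts) * utilization


def fleet_power(utilizations, idle_watts, peak_watts):
    """Total power of all powered-on servers at the given utilizations."""
    return sum(server_power(u, idle_watts, peak_watts) for u in utilizations)


if __name__ == "__main__":
    IDLE_W, PEAK_W = 150.0, 250.0  # assumed values for illustration
    spread = fleet_power([0.2, 0.2, 0.2, 0.2], IDLE_W, PEAK_W)
    consolidated = fleet_power([0.8], IDLE_W, PEAK_W)  # other hosts off
    print(f"spread: {spread:.0f} W, consolidated: {consolidated:.0f} W")
```

Most of the gain in this model comes from removing idle floors, which is why consolidation matters most on hardware with high idle power.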
Idle Virtual machines:
[Figure: two PowerTOP screenshots side by side, "PowerTOP 1.9 (C) 2007 Intel Corporation" and "PowerTOP 1.10 (C) 2007, 2008 Intel Corporation", each collecting data for 15 seconds. Readings recoverable from the screenshots:]
    < Detailed C-state information is only available on Mobile CPUs (laptops) >
    C0 (cpu running) (48.2%)
    C0 0.0ms ( 0.0%)
    Wakeups-from-idle per second : 256.4    interval: 15.0s
    Top causes for wakeups: 97.5% (996.7)
Power trending and capping tools: IBM Active Energy Manager
Power trending and capping tools: pwrkap
http://pwrkap.sourceforge.net/
Questions ?
Acknowledgments
. Arun R Bhardwaj
. Darrick J Wong
. Gautham Shenoy
. Jeffery J Heroux
. Naren Devaiah
. Premalatha M Nair
. Susanne Libischer
Reference
. OLS 2008: Energy aware task and interrupt management
    http://ols.fedoraproject.org/OLS/Reprints-2008/srinivasan1-reprint.pdf
. OLS PM Mini summit
    http://lwn.net/Articles/292447/
. sched_mc=2 framework - 2.6.29
    http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=afb8a9b70b86866a60e08b2956ae4e1406390336
. sched_mc=2 for Nehalem
    http://lkml.org/lkml/2009/3/3/109
. Optimised Idle Load balancer
    http://lkml.org/lkml/2008/9/23/82
. Timer migration
    http://lwn.net/Articles/320152/
    http://lkml.org/lkml/2009/3/4/130
. Pwrkap
    http://pwrkap.sourceforge.net/
. Active Energy Manager
    http://www-03.ibm.com/systems/power/hardware/whitepapers/energyscale.html
Thank You
Legal Statements
● Copyright International Business Machines Corporation 2009.
● Permission to redistribute in accordance with Linux Foundation Collaboration Summit submission guidelines is granted; all other rights reserved.
● This work represents the view of the authors and does not necessarily represent the view of IBM or Intel.
● IBM, IBM logo, ibm.com are trademarks of International Business Machines Corporation in the United States, other countries, or both.
● Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
● Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
● Other company, product, and service names may be trademarks or service marks of others.
● References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.
● INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.