IBM Technology Center

Tweaking Linux for a Green Datacenter

Vaidyanathan Srinivasan, Jenifer Hopper

© 2009 IBM Corporation IBM Linux Technology Center

Agenda

● Platform features and Linux exploitation
● Tuning scheduler and cpufreq
● Saving power in an idle system
● Saving power in an underutilized system
● NUMA constraints for power management
● Can virtualization save power?
● Power trending and power capping

Energy Saving features in hardware

● Dynamic frequency and voltage scaling
  - Predefined set of frequency and corresponding voltage states
  - Dual core and quad core CPUs may share power domains and clock distribution
● Sleep states at idle
  - Sleep states with low latency
  - Deep sleep states (more power savings, higher latencies)
  - Choice of sleep states (wakeup latency vs power savings)
  - Package level sleep states

Energy Saving features in Linux kernel

● OnDemand frequency scaling (cpufreq)
● Tickless kernel (NO_HZ)
● Deferrable timers (for kernel code)
● Scheduler domains, CPU affinity, cpusets
● Multi core power saving heuristics (sched_mc_power_savings)
● Support for deep sleep states
● Latency based selection of various low power deep sleep states (cpuidle governor)
● Device power management infrastructure (USB, PCI, ...)
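
As a rough illustration of the cpufreq item above, the sketch below reads the active frequency governor of each CPU from sysfs. It assumes the usual `cpuN/cpufreq/scaling_governor` layout under `/sys/devices/system/cpu`; the function names and the `cpu_root` parameter are made up for this example and are not part of the original slides.

```python
from pathlib import Path


def read_cpufreq_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq.

    cpu_root is a parameter (an assumption of this sketch) so the function
    can be pointed at a copy of the sysfs tree.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def cpus_not_using(governor, cpu_root="/sys/devices/system/cpu"):
    """List the CPUs whose current governor differs from the wanted one."""
    current = read_cpufreq_governors(cpu_root)
    return [cpu for cpu, active in current.items() if active != governor]
```

Calling `cpus_not_using("ondemand")` on a machine with cpufreq support would list the CPUs that still need their governor changed; on a machine without cpufreq it simply returns an empty list.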

Idle CPU Power Management

[Diagram: at low system utilization, processes P0 to P4 are spread across CPU 0 to CPU 3. Moving processes, timers and interrupts consolidates the workloads onto fewer CPUs, and the CPUs left idle go to sleep (zzZ).]

Tickless kernel helps idle CPU to sleep longer
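
The consolidation idea can be sketched as a small packing problem. The toy model below is not the kernel's algorithm: it is a first-fit-decreasing heuristic, and the task loads and the `cpu_capacity` value are hypothetical numbers chosen only to show how lightly loaded tasks fit on fewer CPUs.

```python
def consolidate(task_loads, cpu_capacity=100):
    """Pack tasks onto as few CPUs as a first-fit-decreasing pass finds.

    task_loads maps a task name to its load as a percentage of one CPU.
    Returns a list of CPUs, each a list of (task, load) tuples.
    """
    cpus = []
    # Place the heaviest tasks first; ties keep their original order.
    ordered = sorted(task_loads.items(), key=lambda kv: kv[1], reverse=True)
    for task, load in ordered:
        for placed in cpus:
            used = sum(l for _, l in placed)
            if used + load <= cpu_capacity:
                placed.append((task, load))
                break
        else:
            # No existing CPU has room, so wake up one more CPU.
            cpus.append([(task, load)])
    return cpus


# Hypothetical loads for the five processes named in the diagram.
loads = {"P0": 30, "P1": 40, "P2": 30, "P3": 40, "P4": 30}
placement = consolidate(loads)
print(len(placement), "CPUs in use")
```

With these made-up loads the five tasks fit on two CPUs, leaving the other two CPUs of a four-CPU system free to sleep.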

Power saving enhancements

Across the stack: hardware, firmware and Linux kernel

[Bar chart: Power savings at Idle. Normalised Average Power (axis 82 to 100) per feature set: No Powersaving; Sleep state; Dynamic Voltage and Frequency Scaling (DVFS) + Sleep; DVFS + Sleep + Tickless; DVFS + Sleep + Tickless + sched_mc=1; DVFS + Sleep + Tickless + sched_mc=2]

[Bar chart: Power savings at 50% load. Normalised Average Power (axis 82 to 100) for the same feature sets]

Approximate power savings percentages obtained across different experiments and hardware platforms

OnDemand CPU Frequency switching

CPU Task consolidation -- Kernbench

CPU Task consolidation -- SPECPower

CPU Task consolidation -- Use sibling threads

Saving power in Idle system

● Optimize applications to reduce wake-ups at idle
● Increase low power sleep state residency

PowerTOP 1.11 (C) 2007, 2008 Intel Corporation

Cn                  Avg residency
C0 (cpu running)          ( 0.8%)
polling             0.0ms ( 0.0%)
C1 mwait           13.3ms ( 1.1%)
C3 mwait           46.1ms (98.0%)

P-states (frequencies)
  2.93 Ghz     0.0%
  2.80 Ghz     0.0%
  2.67 Ghz     0.0%
  2.53 Ghz     0.0%
  1.60 Ghz   100.0%

Wakeups-from-idle per second : 22.1    interval: 235.0s
no ACPI power usage estimate available

Top causes for wakeups:
  49.0% ( 67.4)        : extra timer interrupt
  20.7% ( 28.5)        : Rescheduling interrupts
   6.5% (  8.9)   java : sk_reset_timer (tcp_write_timer)
   5.8% (  8.0)        : cpucache_init (delayed_work_timer_fn)
   4.3% (  5.9)   java : sk_reset_timer (tcp_delack_timer)
   3.6% (  5.0)   java : futex_wait (hrtimer_wakeup)
   3.3% (  4.6)        : eth0
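
To find which application to optimize first, the "Top causes for wakeups" lines can be grouped per process. The sketch below parses lines in the format shown on this slide; the regular expression, the function names and the "(no process)" label are assumptions of this example, not part of PowerTOP.

```python
import re

# Matches lines such as "6.5% ( 8.9) java : sk_reset_timer (tcp_write_timer)".
WAKEUP_LINE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)%\s*\(\s*(\d+(?:\.\d+)?)\)\s*(.*?)\s*:\s*(.+?)\s*$"
)

SAMPLE = """\
49.0% ( 67.4) : extra timer interrupt
20.7% ( 28.5) : Rescheduling interrupts
6.5% ( 8.9) java : sk_reset_timer (tcp_write_timer)
5.8% ( 8.0) : cpucache_init (delayed_work_timer_fn)
4.3% ( 5.9) java : sk_reset_timer (tcp_delack_timer)
3.6% ( 5.0) java : futex_wait (hrtimer_wakeup)
3.3% ( 4.6) : eth0
"""


def parse_wakeups(text):
    """Return one dict per wakeup line; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = WAKEUP_LINE.match(line)
        if match:
            share, rate, process, cause = match.groups()
            rows.append({
                "share": float(share),
                "rate": float(rate),
                "process": process or None,  # None when no process is named
                "cause": cause,
            })
    return rows


def rate_by_process(rows):
    """Sum the wakeup rate column per process name."""
    totals = {}
    for row in rows:
        key = row["process"] or "(no process)"
        totals[key] = totals.get(key, 0.0) + row["rate"]
    return totals


if __name__ == "__main__":
    for name, rate in sorted(rate_by_process(parse_wakeups(SAMPLE)).items()):
        print(f"{name}: {rate:.1f}")
```

On the sample above, the three `java` entries add up to a single per-process total, which makes the application-level share of wakeups easier to compare against the interrupt entries.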

Optimised idle load balancer

● One CPU among the idle CPUs runs the sched tick and watches for overload on the busy CPUs.
● Choosing this idle load balancer from a semi-idle CPU package allows the other CPU packages to go completely idle.

[Diagram: Core 0 to Core 7. A busy CPU runs a task, the idle load balancer runs load balancing and the sched tick, and the remaining idle CPUs are in deep sleep. Moving the idle load balancer to the semi-idle CPU package lets the other package reach the package deep sleep state (zzZ).]
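
The selection rule can be sketched in a few lines. This is a toy model rather than the scheduler code: the `packages` data structure, the function name and the example topology are all hypothetical.

```python
def pick_idle_load_balancer(packages):
    """Pick the CPU that should act as idle load balancer.

    packages maps a package id to {cpu_id: is_busy}. An idle CPU from a
    semi-idle package (one that already has a busy CPU) is preferred, so
    that fully idle packages are left undisturbed. Returns None when no
    CPU is idle.
    """
    fallback = None
    for pkg_id in sorted(packages):
        cpus = packages[pkg_id]
        idle = sorted(cpu for cpu, busy in cpus.items() if not busy)
        if not idle:
            continue  # every CPU in this package is busy
        if any(cpus.values()):
            return idle[0]  # semi-idle package: best choice
        if fallback is None:
            fallback = idle[0]  # fully idle package: use only if needed
    return fallback


# Hypothetical two-package system: package 1 has one busy CPU.
topology = {
    0: {0: False, 1: False, 2: False, 3: False},
    1: {4: True, 5: False, 6: False, 7: False},
}
print(pick_idle_load_balancer(topology))
```

In this example the balancer lands on an idle CPU of package 1, so package 0 never needs to wake up for the sched tick.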

Timer migration

● Tasks can be consolidated using the sched_mc framework
● Interrupts can be consolidated using the user space irqbalance daemon
● Migrating timers from idle CPUs to the idle load balancer CPU, coupled with optimized selection of the idle load balancer CPU, provides good results for package evacuation and increased deep sleep residency time
● Consolidating timers onto a single CPU enables good overlap of timer expiry times and reduces the total number of wakeups from idle
● Range timer framework (from Arjan) will further help reduce wakeups from idle deep sleep state
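
The effect of consolidating timers and giving them an expiry range can be illustrated with a toy model. It is not the kernel's hrtimer code: each timer is a hypothetical (soft expiry, hard expiry) pair, and one wakeup is assumed to service every timer whose soft expiry has already passed.

```python
def count_wakeups(timers):
    """Return the wakeup times needed to service all timers on one CPU.

    timers is an iterable of (soft_expiry, hard_expiry) pairs. The CPU
    wakes at the hard expiry of the earliest unserviced timer and, while
    awake, also runs every timer whose soft expiry has been reached.
    """
    wakeups = []
    for soft, hard in sorted(timers, key=lambda t: t[1]):
        if not wakeups or soft > wakeups[-1]:
            wakeups.append(hard)  # not covered by the last wakeup
    return wakeups


# Hypothetical timers queued on two different idle CPUs.
cpu_a = [(0, 2), (10, 12)]
cpu_b = [(1, 3), (11, 13)]

separate = len(count_wakeups(cpu_a)) + len(count_wakeups(cpu_b))
consolidated = len(count_wakeups(cpu_a + cpu_b))
print(separate, consolidated)
```

With the timers left on their own CPUs each CPU wakes twice; once they are migrated to a single CPU, the overlapping ranges are serviced together and the second CPU does not wake at all.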

Saving power in underutilized system

● Understanding workloads:
  - Number of software threads and their CPU utilization
  - Relation between threads: amount of data sharing
  - Latency sensitivity vs throughput of workloads
● Knobs available
  - CPU frequency governors and policies
  - Scheduler tunables: sched_mc_power_savings and sched_smt_power_savings
  - CPU idle governor and PM QoS framework
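
A minimal sketch of setting the scheduler tunables, assuming they are exposed as `sched_mc_power_savings` and `sched_smt_power_savings` under `/sys/devices/system/cpu`, as on 2.6-era kernels, and accept the levels 0, 1 and 2. The function name and the `cpu_root` parameter are inventions of this example. Writing the real files needs root privileges, and a kernel may not provide one or both knobs, in which case they are skipped.

```python
from pathlib import Path

# Knob file names assumed from 2.6-era kernels.
KNOBS = ("sched_mc_power_savings", "sched_smt_power_savings")


def set_power_savings(level, knobs=KNOBS, cpu_root="/sys/devices/system/cpu"):
    """Write the power savings level to every knob that exists.

    Returns the names of the knobs that were written. Knobs missing on
    this kernel are skipped rather than treated as an error.
    """
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1 or 2")
    written = []
    for knob in knobs:
        path = Path(cpu_root) / knob
        if path.exists():
            path.write_text(f"{level}\n")
            written.append(knob)
    return written
```

For example, `set_power_savings(2)` corresponds to the sched_mc=2 configuration measured earlier in the deck, on kernels where the knob is present.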

NUMA Constraints for power management

● On-chip memory controllers make each package a NUMA node
  - Task consolidation may increase memory latency for tasks
  - Constraints in package level power management
  - Performance tradeoffs are very sensitive to workloads
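
The latency cost of consolidation can be estimated with simple arithmetic: the average access latency is the weighted mean of local and remote latencies. The latency figures in the sketch below are hypothetical placeholders, not measurements from the talk.

```python
def average_memory_latency(local_fraction, local_ns, remote_ns):
    """Weighted mean latency for a task with a given share of local accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies: 60 ns to the local node, 100 ns to a remote node.
before = average_memory_latency(1.0, 60, 100)  # task runs next to its memory
after = average_memory_latency(0.5, 60, 100)   # half its accesses go remote
print(before, after)
```

A task that is moved away from its memory to empty a package pays the remote latency on a larger share of its accesses, which is why the tradeoff depends so heavily on the workload.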

Can Virtualization save power?

● Virtualization can improve system utilization
  - Operate system at better power efficiency
  - VM guest configurations and resource allocations can be optimized
● Power saving optimizations within a guest are limited
  - Hypervisor needs to coordinate policies across guests

Idle Virtual machines

[Two PowerTOP screens, captured with PowerTOP 1.9 (C) 2007 Intel Corporation and PowerTOP 1.10 (C) 2007, 2008 Intel Corporation, each collecting data for 15 seconds]

HOST

Cn                  Avg residency
C0 (cpu running)          (48.2%)
C0                  0.0ms ( 0.0%)
C1 halt             0.0ms ( 0.0%)
C2                  0.1ms ( 0.5%)
C3                  0.3ms (51.3%)

P-states (frequencies)
  2.17 Ghz    24.1%
  1.67 Ghz     0.9%
  1333 Mhz     6.1%
  1000 Mhz    69.0%

Wakeups-from-idle per second : 1834.7    interval: 15.0s
no ACPI power usage estimate available

Top causes for wakeups:
  25.4% (1268.1)           : function call interrupts
  20.0% (1000.1)       kvm : __kvm_migrate_pit_timer (pit_timer_fn)
  19.8% (990.9)        kvm : __kvm_migrate_apic_timer (apic_timer_fn)
  15.1% (752.1)            : Rescheduling interrupts
   5.7% (285.1)            : PS/2 keyboard/mouse/touchpad
   3.7% (183.9)            : iwl3945
   3.6% (178.5)    firefox : futex_wait (hrtimer_wakeup)
   1.7% ( 86.1)      opera : schedule_timeout (process_timeout)

KVM Guest

< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)

Wakeups-from-idle per second : 256.4    interval: 15.0s

Top causes for wakeups:
  97.5% (996.7)                    : extra timer interrupt
   1.3% ( 12.9)                    : ata_piix
   0.2% (  2.2)                    : PS/2 keyboard/mouse/touchpad
   0.2% (  1.7)     gnome-terminal : schedule_timeout (process_timeout)
   0.1% (  1.1)    setroubleshootd : schedule_timeout (process_timeout)
   0.1% (  1.0)     im-info-daemon : do_nanosleep (hrtimer_wakeup)
   0.1% (  0.5)                    : eth0
   0.1% (  0.5)                    : e1000_intr (e1000_watchdog)
   0.1% (  0.5)    hald-addon-stor : schedule_timeout (process_timeout)
   0.1% (  0.5)            firefox : futex_wait (hrtimer_wakeup)

Power trending and capping tools: IBM Active Energy Manager

Power trending and capping tools: pwrkap

http://pwrkap.sourceforge.net/

Questions?

Acknowledgments

● Arun R Bhardwaj
● Darrick J Wong
● Gautham Shenoy
● Jeffery J Heroux
● Naren Devaiah
● Premalatha M Nair
● Susanne Libischer

Reference

● OLS 2008: Energy aware task and interrupt management
  http://ols.fedoraproject.org/OLS/Reprints-2008/srinivasan1-reprint.pdf
● OLS PM Mini summit
  http://lwn.net/Articles/292447/
● sched_mc=2 framework - 2.6.29
  http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=afb8a9b70b86866a60e08b2956ae4e1406390336
● sched_mc=2 for Nehalem
  http://lkml.org/lkml/2009/3/3/109
● Optimised Idle Load balancer
  http://lkml.org/lkml/2008/9/23/82
● Timer migration
  http://lwn.net/Articles/320152/
  http://lkml.org/lkml/2009/3/4/130
● Pwrkap
  http://pwrkap.sourceforge.net/
● Active Energy Manager
  http://www-03.ibm.com/systems/power/hardware/whitepapers/energyscale.html

Thank You

Legal Statements

● Copyright International Business Machines Corporation 2009.
● Permission to redistribute in accordance with Linux Foundation Collaboration Summit submission guidelines is granted; all other rights reserved.
● This work represents the view of the authors and does not necessarily represent the view of IBM or Intel.
● IBM, IBM logo, ibm.com are trademarks of International Business Machines Corporation in the United States, other countries, or both.
● Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
● Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
● Other company, product, and service names may be trademarks or service marks of others.
● References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

Legal Statements

● INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
