
IBM Linux Technology Center

Tweaking Linux for a Green Datacenter

Vaidyanathan Srinivasan <[email protected]>
Jenifer Hopper <[email protected]>

© 2009 IBM Corporation

Agenda
● Platform features and Linux exploitation
● Tuning scheduler and cpufreq
● Saving power in an idle system
● Saving power in an underutilized system
● NUMA constraints for power management
● Can virtualization save power?
● Power trending and power capping

Energy Saving features in hardware
● Dynamic frequency and voltage scaling
    - Predefined set of frequency and corresponding voltage states
    - Dual core and quad core CPUs may share power domains and clock distribution
● Sleep states at idle
    - Sleep states with low latency
    - Deep sleep states (more power savings, higher latencies)
    - Choice of sleep states (wakeup latency vs power savings)
    - Package level sleep states

Energy Saving features in Linux kernel
● OnDemand frequency scaling (cpufreq)
● Tickless kernel (NO_HZ)
● Deferrable timers (for kernel code)
● Scheduler domains, CPU affinity, cpusets
● Multi core power saving heuristics: sched_mc_power_savings
● Support for deep sleep states
● Latency based selection of various low power deep sleep states (cpuidle governor)
● Device power management infrastructure (USB, PCI, ...)
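Several of the kernel features listed above are exposed as sysfs files. The following is a minimal sketch, not part of the original slides, of how the current settings might be inspected. The helper names `read_knob` and `power_settings` are hypothetical. The sketch assumes the cpufreq governor is exposed as `cpuN/cpufreq/scaling_governor` and that the `sched_mc_power_savings` and `sched_smt_power_savings` files sit under `/sys/devices/system/cpu`, as on kernels of the 2.6.29 era. These scheduler files may be absent on other kernels, in which case the sketch reports `None`.

```python
from pathlib import Path

# Assumed location of the CPU sysfs tree; pass another root to inspect a copy.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def read_knob(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def power_settings(root=CPU_SYSFS):
    """Collect scheduler power-saving knobs and the per-CPU cpufreq governor."""
    root = Path(root)
    settings = {
        # May be missing on kernels that do not provide these tunables.
        "sched_mc_power_savings": read_knob(root / "sched_mc_power_savings"),
        "sched_smt_power_savings": read_knob(root / "sched_smt_power_savings"),
        "governors": {},
    }
    # cpu[0-9]* matches cpu0, cpu1, ... but not cpufreq/ or cpuidle/.
    for cpu_dir in sorted(root.glob("cpu[0-9]*")):
        governor = read_knob(cpu_dir / "cpufreq" / "scaling_governor")
        if governor is not None:
            settings["governors"][cpu_dir.name] = governor
    return settings


if __name__ == "__main__":
    for key, value in power_settings().items():
        print(f"{key}: {value}")
```

The sketch only reads values. Changing a governor or a scheduler tunable means writing to the same files as root, which is deliberately left out here.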

Idle CPU Power Management
[Figure: at low system utilization, processes P0 to P4 are spread across CPU 0 to CPU 3. Moving processes, timers and interrupts consolidates the workloads onto fewer CPUs, so the remaining CPUs can sleep (zzZ).]
● Tickless kernel helps idle CPU to sleep longer

Power saving enhancements
Across the stack: hardware, firmware and Linux kernel
[Figure: two bar charts, "Power savings at Idle" and "Power savings at 50% load". Horizontal axis: Normalised Average Power, 82 to 100. Each chart compares the following configurations:]
    - No Powersaving
    - Sleep state
    - Dynamic Voltage and Frequency Scaling (DVFS) + Sleep
    - DVFS + Sleep + Tickless Feature
    - DVFS + Sleep + Tickless + sched_mc=1
    - DVFS + Sleep + Tickless + sched_mc=2
Approximate power savings percentages obtained across different experiments and hardware platforms

OnDemand CPU Frequency switching
[Figure]

CPU Task consolidation -- Kernbench
[Figure]

CPU Task consolidation -- SPECPower
[Figure]

CPU Task consolidation -- Use sibling threads
[Figure]

Saving power in Idle system
● Optimize applications to reduce wake-ups at idle
● Increase low power sleep state residency

    PowerTOP 1.11 (C) 2007, 2008 Intel Corporation

    Cn                  Avg residency
    C0 (cpu running)            ( 0.8%)
    polling              0.0ms  ( 0.0%)
    C1 mwait            13.3ms  ( 1.1%)
    C3 mwait            46.1ms  (98.0%)

    P-states (frequencies)
      2.93 Ghz     0.0%
      2.80 Ghz     0.0%
      2.67 Ghz     0.0%
      2.53 Ghz     0.0%
      1.60 Ghz   100.0%

    Wakeups-from-idle per second : 22.1     interval: 235.0s
    no ACPI power usage estimate available

    Top causes for wakeups:
      49.0% ( 67.4)     <interrupt> : extra timer interrupt
      20.7% ( 28.5)    <kernel IPI> : Rescheduling interrupts
       6.5% (  8.9)            java : sk_reset_timer (tcp_write_timer)
       5.8% (  8.0)   <kernel core> : cpucache_init (delayed_work_timer_fn)
       4.3% (  5.9)            java : sk_reset_timer (tcp_delack_timer)
       3.6% (  5.0)            java : futex_wait (hrtimer_wakeup)
       3.3% (  4.6)     <interrupt> : eth0

Optimised idle load balancer
● One CPU among the idle CPUs runs the sched tick and watches for overload on the busy CPUs. Choosing this idle load balancer from a semi-idle CPU package allows the other CPU packages to go completely idle.
[Figure: two CPU packages. In one package (Core 0, Core 1, Core 4, Core 5) a busy CPU is running a task and the idle load balancer is running loadbalance and the sched tick. In the other package (Core 2, Core 3, Core 6, Core 7) the idle CPUs are in deep sleep (zzZ), so the package reaches the package deep sleep state. Caption: Move idle load balancer to semi-idle CPU package.]

Timer migration
● Tasks can be consolidated using the sched_mc framework
● Interrupts can be consolidated using the user space irqbalance daemon
● Migrating timers from idle CPUs to the idle-load-balancer CPU, coupled with optimized selection of the idle-load-balancer CPU, provides good results for package evacuation and increased deep sleep residency time
● Consolidating timers onto a single CPU gives good overlap of timer expiry times and reduces the total number of wakeups from idle
● The range timer framework (from Arjan) will further help reduce wakeups from the idle deep sleep state

Saving power in underutilized system
● Understanding workloads:
    - Number of software threads and their CPU utilization
    - Relation between threads (amount of data sharing)
    - Latency sensitivity vs throughput of workloads
● Knobs available
    - CPU frequency governors and policies
    - Scheduler tunables, sched_mc and sched_smt_power_savings
    - CPU idle governor and PM QoS framework

NUMA Constraints for power management
● On-chip memory controllers make each package a NUMA node
    - Task consolidation may increase memory latency for tasks
    - Constraints in package level power management
    - Performance tradeoffs are very sensitive to workloads

Can Virtualization save power?
● Virtualization can improve system utilization
    - Operate system at better power efficiency
    - VM guest configurations and resource allocations can be optimized
● Power saving optimizations within a guest are limited
    - Hypervisor needs to coordinate policies across guests

Idle Virtual machines:
[Two PowerTOP panels shown side by side on the slide, one from the KVM host and one from the KVM guest. The panel headers read "PowerTOP 1.9 (C) 2007 Intel Corporation" and "PowerTOP 1.10 (C) 2007, 2008 Intel Corporation"; each panel was collecting data for 15 seconds.]

    KVM Host

    Cn                  Avg residency
    C0 (cpu running)            (48.2%)
    C0                   0.0ms  ( 0.0%)
    C1 halt              0.0ms  ( 0.0%)
    C2                   0.1ms  ( 0.5%)
    C3                   0.3ms  (51.3%)

    P-states (frequencies)
      2.17 Ghz    24.1%
      1.67 Ghz     0.9%
      1333 Mhz     6.1%
      1000 Mhz    69.0%

    Wakeups-from-idle per second : 1834.7   interval: 15.0s
    no ACPI power usage estimate available

    Top causes for wakeups:
      25.4% (1268.1)   <kernel IPI> : function call interrupts
      20.0% (1000.1)            kvm : __kvm_migrate_pit_timer (pit_timer_fn)
      19.8% ( 990.9)            kvm : __kvm_migrate_apic_timer (apic_timer_fn)
      15.1% ( 752.1)   <kernel IPI> : Rescheduling interrupts
       5.7% ( 285.1)    <interrupt> : PS/2 keyboard/mouse/touchpad
       3.7% ( 183.9)    <interrupt> : iwl3945
       3.6% ( 178.5)        firefox : futex_wait (hrtimer_wakeup)
       1.7% (  86.1)          opera : schedule_timeout (process_timeout)

    KVM Guest

    < Detailed C-state information is only available on Mobile CPUs (laptops) >
    P-states (frequencies)

    Wakeups-from-idle per second : 256.4    interval: 15.0s

    Top causes for wakeups:
      97.5% (996.7)      <interrupt> : extra timer interrupt
       1.3% ( 12.9)      <interrupt> : ata_piix
       0.2% (  2.2)      <interrupt> : PS/2 keyboard/mouse/touchpad
       0.2% (  1.7)   gnome-terminal : schedule_timeout (process_timeout)
       0.1% (  1.1)  setroubleshootd : schedule_timeout (process_timeout)
       0.1% (  1.0)   im-info-daemon : do_nanosleep (hrtimer_wakeup)
       0.1% (  0.5)      <interrupt> : eth0
       0.1% (  0.5)    <kernel core> : e1000_intr (e1000_watchdog)
       0.1% (  0.5)  hald-addon-stor : schedule_timeout (process_timeout)
       0.1% (  0.5)          firefox : futex_wait (hrtimer_wakeup)

Power trending and capping tools: IBM Active Energy Manager
[Figure]

Power trending and capping tools: pwrkap
http://pwrkap.sourceforge.net/
[Figure]

Questions?

Acknowledgments
● Arun R Bhardwaj
● Darrick J Wong
● Gautham Shenoy
● Jeffery J Heroux
● Naren Devaiah
● Premalatha M Nair
● Susanne Libischer

Reference
● OLS 2008: Energy aware task and interrupt management
  http://ols.fedoraproject.org/OLS/Reprints-2008/srinivasan1-reprint.pdf
● OLS PM Mini summit
  http://lwn.net/Articles/292447/
● sched_mc=2 framework - 2.6.29
  http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=afb8a9b70b86866a60e08b2956ae4e1406390336
● sched_mc=2 for Nehalem
  http://lkml.org/lkml/2009/3/3/109
● Optimised Idle Load balancer
  http://lkml.org/lkml/2008/9/23/82
● Timer migration
  http://lwn.net/Articles/320152/, http://lkml.org/lkml/2009/3/4/130
● Pwrkap
  http://pwrkap.sourceforge.net/
● Active Energy Manager
  http://www-03.ibm.com/systems/power/hardware/whitepapers/energyscale.html

Thank You

Legal Statements
● Copyright International Business Machines Corporation 2009.
● Permission to redistribute in accordance with Linux Foundation Collaboration Summit submission guidelines is granted; all other rights reserved.
● This work represents the view of the authors and does not necessarily represent the view of IBM or Intel.
● IBM, IBM logo, ibm.com are trademarks of International Business Machines Corporation in the United States, other countries, or both.
● Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
● Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
● Other company, product,