Tweaking Linux for a Green Datacenter

IBM Linux Technology Center

Vaidyanathan Srinivasan <[email protected]>
Jenifer Hopper <[email protected]>

© 2009 IBM Corporation

Agenda

  • Platform features and Linux exploitation
  • Tuning scheduler and cpufreq
  • Saving power in an idle system
  • Saving power in an underutilized system
  • NUMA constraints for power management
  • Can virtualization save power?
  • Power trending and power capping

Energy Saving features in hardware

  • Dynamic frequency and voltage scaling
      • Predefined set of frequency and corresponding voltage states
      • Dual core and quad core CPUs may share power domains and clock distribution
  • Sleep states at idle
      • Sleep states with low latency
      • Deep sleep states (more power savings, higher latencies)
      • Choice of sleep states (wakeup latency vs power savings)
      • Package level sleep states

Energy Saving features in Linux kernel

  • OnDemand frequency scaling (cpufreq)
  • Tickless kernel (NO_HZ)
  • Deferrable timers (for kernel code)
  • Scheduler domains, CPU affinity, cpusets
  • Multi core power saving heuristics: sched_mc_power_savings
  • Support for deep sleep states
  • Latency based selection of various low power deep sleep states (cpuidle governor)
  • Device power management infrastructure (USB, PCI, ...)

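The kernel features listed above are mostly controlled through sysfs. The following is a minimal sketch, not a definitive tool, for inspecting the current settings. It assumes the sysfs layout of 2.6-era kernels: a cpufreq/scaling_governor file per CPU under /sys/devices/system/cpu, and the sched_mc_power_savings and sched_smt_power_savings files in that same directory. The helper names read_text and power_tunables are hypothetical. Any file the running kernel does not expose is reported as None.

```python
import glob
import os


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def power_tunables(cpu_root="/sys/devices/system/cpu"):
    """Collect the cpufreq governor of each CPU and the scheduler power knobs.

    cpu_root is a parameter so the function can be pointed at a copy of the
    directory tree; the default is the assumed sysfs location.
    """
    governors = {}
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for gov_path in sorted(glob.glob(pattern)):
        # .../cpuN/cpufreq/scaling_governor -> "cpuN"
        cpu = os.path.basename(os.path.dirname(os.path.dirname(gov_path)))
        governors[cpu] = read_text(gov_path)

    # Assumed file names; kernels without these knobs yield None.
    sched = {
        name: read_text(os.path.join(cpu_root, name))
        for name in ("sched_mc_power_savings", "sched_smt_power_savings")
    }
    return {"governors": governors, "sched": sched}


if __name__ == "__main__":
    settings = power_tunables()
    for cpu, governor in settings["governors"].items():
        print(cpu, "governor:", governor)
    for name, value in settings["sched"].items():
        print(name, "=", value)
```

Reading the files needs no special privileges. Changing a governor or a scheduler knob means writing to the same files as root, which this sketch deliberately leaves out.
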
Idle CPU Power Management

[Diagram: at low system utilization, processes P0 to P4 are spread across CPU 0 to CPU 3. Moving processes, timers and interrupts consolidates the workloads onto fewer CPUs, so the remaining CPUs can sleep (zzZ).]

  • Tickless kernel helps idle CPUs sleep longer

Power saving enhancements

Across the stack: hardware, firmware and Linux kernel

[Figure: two bar charts, "Power savings at Idle" and "Power savings at 50% load". Axis: Normalised Average Power, 82 to 100. Each chart compares the configurations listed below.]

  • No Powersaving
  • Sleep state
  • Dynamic Voltage and Frequency Scaling (DVFS) + Sleep
  • DVFS + Sleep + Tickless Feature
  • DVFS + Sleep + Tickless + sched_mc=1
  • DVFS + Sleep + Tickless + sched_mc=2

Approximate power savings percentages obtained across different experiments and hardware platforms

[Figure: OnDemand CPU Frequency switching]

[Figure: CPU Task consolidation -- Kernbench]

[Figure: CPU Task consolidation -- SPECPower]

[Figure: CPU Task consolidation -- Use sibling threads]

Saving power in Idle system

  • Optimize applications to reduce wake-ups at idle
  • Increase low power sleep state residency

    PowerTOP 1.11   (C) 2007, 2008 Intel Corporation

    Cn                Avg residency
    C0 (cpu running)        ( 0.8%)
    polling           0.0ms ( 0.0%)
    C1 mwait         13.3ms ( 1.1%)
    C3 mwait         46.1ms (98.0%)

    P-states (frequencies)
      2.93 Ghz     0.0%
      2.80 Ghz     0.0%
      2.67 Ghz     0.0%
      2.53 Ghz     0.0%
      1.60 Ghz   100.0%

    Wakeups-from-idle per second : 22.1   interval: 235.0s
    no ACPI power usage estimate available

    Top causes for wakeups:
      49.0% ( 67.4)   <interrupt> : extra timer interrupt
      20.7% ( 28.5)   <kernel IPI> : Rescheduling interrupts
       6.5% (  8.9)   java : sk_reset_timer (tcp_write_timer)
       5.8% (  8.0)   <kernel core> : cpucache_init (delayed_work_timer_fn)
       4.3% (  5.9)   java : sk_reset_timer (tcp_delack_timer)
       3.6% (  5.0)   java : futex_wait (hrtimer_wakeup)
       3.3% (  4.6)   <interrupt> : eth0

Optimised idle load balancer

  • One CPU among the idle CPUs runs the sched tick and watches for overload from busy CPUs
  • Choosing this idle load balancer from a semi-idle CPU package allows other CPU packages to go completely idle

[Diagram: Core 0 to Core 7 across CPU packages. Legend: busy CPU running a task; idle CPUs in deep sleep (zzZ); idle load balancer running load balance and sched tick. Moving the idle load balancer to the semi-idle CPU package lets the other package enter the package deep sleep state.]

Timer migration

  • Tasks can be consolidated using the sched_mc framework
  • Interrupts can be consolidated using the user space irqbalance daemon
  • Migrating timers from idle CPUs to the idle load balancer CPU, coupled with optimized selection of the idle load balancer CPU, provides good results for package evacuation and increased deep sleep residency time
  • Consolidating timers onto a single CPU enables good overlap of timer expiry times and reduces the total number of wakeups from idle
  • Range timer framework (from Arjan) will further help reduce wakeups from the idle deep sleep state

Saving power in underutilized system

  • Understanding workloads
      • Number of software threads and their CPU utilization
      • Relation between threads: amount of data sharing
      • Latency-sensitive vs throughput-oriented workloads
  • Knobs available
      • CPU frequency governors and policies
      • Scheduler tunables: sched_mc_power_savings and sched_smt_power_savings
      • CPU idle governor and PM QoS framework

NUMA Constraints for power management

  • On-chip memory controllers make each package a NUMA node
      • Task consolidation may increase memory latency for tasks
      • Constraints in package level power management
      • Performance tradeoffs are very sensitive to workloads

Can Virtualization save power?

  • Virtualization can improve system utilization
      • Operate system at better power efficiency
      • VM guest configurations and resource allocations can be optimized
  • Power saving optimizations within a guest are limited
      • Hypervisor needs to coordinate policies across guests

Idle Virtual machines

Two PowerTOP screens, each collecting data for 15 seconds:

    PowerTOP 1.9    (C) 2007 Intel Corporation
    PowerTOP 1.10   (C) 2007, 2008 Intel Corporation

KVM host

    Cn                Avg residency
    C0 (cpu running)        (48.2%)
    C0                0.0ms ( 0.0%)
    C1 halt           0.0ms ( 0.0%)
    C2                0.1ms ( 0.5%)
    C3                0.3ms (51.3%)

    P-states (frequencies)
      2.17 Ghz    24.1%
      1.67 Ghz     0.9%
      1333 Mhz     6.1%
      1000 Mhz    69.0%

    Wakeups-from-idle per second : 1834.7   interval: 15.0s
    no ACPI power usage estimate available

    Top causes for wakeups:
      25.4% (1268.1)   <kernel IPI> : function call interrupts
      20.0% (1000.1)   kvm : __kvm_migrate_pit_timer (pit_timer_fn)
      19.8% ( 990.9)   kvm : __kvm_migrate_apic_timer (apic_timer_fn)
      15.1% ( 752.1)   <kernel IPI> : Rescheduling interrupts
       5.7% ( 285.1)   <interrupt> : PS/2 keyboard/mouse/touchpad
       3.7% ( 183.9)   <interrupt> : iwl3945
       3.6% ( 178.5)   firefox : futex_wait (hrtimer_wakeup)
       1.7% (  86.1)   opera : schedule_timeout (process_timeout)

KVM guest

    < Detailed C-state information is only available on Mobile CPUs (laptops) >
    P-states (frequencies)

    Wakeups-from-idle per second : 256.4   interval: 15.0s

    Top causes for wakeups:
      97.5% (996.7)   <interrupt> : extra timer interrupt
       1.3% ( 12.9)   <interrupt> : ata_piix
       0.2% (  2.2)   <interrupt> : PS/2 keyboard/mouse/touchpad
       0.2% (  1.7)   gnome-terminal : schedule_timeout (process_timeout)
       0.1% (  1.1)   setroubleshootd : schedule_timeout (process_timeout)
       0.1% (  1.0)   im-info-daemon : do_nanosleep (hrtimer_wakeup)
       0.1% (  0.5)   <interrupt> : eth0
       0.1% (  0.5)   <kernel core> : e1000_intr (e1000_watchdog)
       0.1% (  0.5)   hald-addon-stor : schedule_timeout (process_timeout)
       0.1% (  0.5)   firefox : futex_wait (hrtimer_wakeup)

Power trending and capping tools: IBM Active Energy Manager

Power trending and capping tools: pwrkap

  http://pwrkap.sourceforge.net/

Questions?

Acknowledgments

  • Arun R Bhardwaj
  • Darrick J Wong
  • Gautham Shenoy
  • Jeffery J Heroux
  • Naren Devaiah
  • Premalatha M Nair
  • Susanne Libischer

Reference

  • OLS 2008: Energy aware task and interrupt management
    http://ols.fedoraproject.org/OLS/Reprints-2008/srinivasan1-reprint.pdf
  • OLS PM Mini summit
    http://lwn.net/Articles/292447/
  • sched_mc=2 framework - 2.6.29
    http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=afb8a9b70b86866a60e08b2956ae4e1406390336
  • sched_mc=2 for Nehalem
    http://lkml.org/lkml/2009/3/3/109
  • Optimised Idle Load balancer
    http://lkml.org/lkml/2008/9/23/82
  • Timer migration
    http://lwn.net/Articles/320152/
    http://lkml.org/lkml/2009/3/4/130
  • Pwrkap
    http://pwrkap.sourceforge.net/
  • Active Energy Manager
    http://www-03.ibm.com/systems/power/hardware/whitepapers/energyscale.html

Thank You

Legal Statements

  • Copyright International Business Machines Corporation 2009.
  • Permission to redistribute in accordance with Linux Foundation Collaboration Summit submission guidelines is granted; all other rights reserved.
  • This work represents the view of the authors and does not necessarily represent the view of IBM or Intel.
  • IBM, IBM logo, ibm.com are trademarks of International Business Machines Corporation in the United States, other countries, or both.
  • Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
  • Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
  • Other company, product,
