Energy Aware Computing

Luigi Brochard, Lenovo Distinguished Engineer, WW HPC & AI
HPC Knowledge, June 15, 2017

Agenda
• Different metrics for energy efficiency
• Lenovo cooling solutions
• Lenovo software for energy aware computing

How to Measure Power Efficiency

• PUE = Total Facility Power / IT Equipment Power
  – Power Usage Effectiveness (PUE) is a measure of how efficiently a computer data center uses its power.
  – PUE is the ratio of the total power used by a computer facility to the power delivered to the computing equipment.
  – The ideal value is 1.0.
  – It does not take into account how the IT power itself can be optimized.

• ITUE = (IT Power + VR + PSU + Fan) / IT Power
  – IT Power Usage Effectiveness (ITUE) measures how well the node power can be optimized.
  – The ideal value is 1.0.

• ERE = (Total Facility Power - Treuse) / IT Equipment Power
  – Energy Reuse Effectiveness (ERE) measures how efficiently a data center reuses the heat dissipated by the computer.
  – ERE is the ratio of the total power used by a computer facility, less the reused power (Treuse), to the power delivered to the computing equipment.
  – The ideal value is 0.0.
  – ERE = PUE - Treuse / IT Equipment Power; if there is no reuse, ERE = PUE.

(A worked sketch of these three metrics follows the hardware overview below.)

Choice of Cooling

Air cooled:
• Standard air flow with internal fans
• Fits in any datacenter; maximum flexibility
• Broadest choice of configurable options supported
• PUE ~2 to 1.5, ERE ~2 to 1.5
• Choose for the broadest choice of customizable options.

Air cooled with rear door heat exchangers (RDHX):
• Air cooled, supplemented with an RDHX door on the rack
• Uses chilled water with an economizer (18°C water)
• Enables extremely tight rack placement
• PUE ~1.4 to 1.2, ERE ~1.4 to 1.2
• Choose for a balance between configuration flexibility and energy efficiency.

Direct water cooled (DWC):
• Direct water cooling with no internal fans; higher performance per watt
• Free cooling (45°C water) and energy re-use options
• Densest footprint
• Supports Native Expansion nodes (Storage NeX, PCI NeX) and the highest wattage processors
• Ideal for geos with high electricity costs and for new data centers
• PUE ~1.1, ERE << 1 with hot water
• Choose for the highest performance and energy efficiency.

TCO: Payback Period for DWC vs RDHx

[Chart: payback period of DWC vs RDHx for a new data center and for existing data centers at electricity rates of $0.06/kWh, $0.12/kWh and $0.20/kWh]

• New data centers: water cooling has an immediate payback.
• Existing air-cooled data centers: the payback period depends strongly on the electricity rate.

iDataPlex dx360 M4 (2010-2013)
• iDataPlex rack with 84 dx360 M4 servers
• dx360 M4 nodes: 2x CPUs (130 W, 115 W), 16x DIMMs (4 GB/8 GB), 1 HDD or 2 SSDs, network card
• 85% heat recovery; water 18°C to 45°C; 0.5 lpm per node
[Photos: dx360 M4 server and iDataPlex rack]

NeXtScale nx360 M5 WCT (2013-2016)
• NeXtScale chassis: 6U, 12 nodes, 2 nodes per tray
• nx360 M5 WCT: 2x CPUs (up to 165 W), 16x DIMMs (8 GB/16 GB/32 GB), 1 HDD or 2 SSDs, 1 ML2 or PCIe network card
• 85% heat recovery; water 18°C to 45°C (and even up to 50°C); 0.5 lpm per node
[Photos: copper water loops, two nx360 M5 WCT nodes in a tray, NeXtScale chassis, scalable manifold, rack configuration, nx360 M5 with 2 SSDs]
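The three metrics defined at the start of the deck are plain ratios, so they are easy to sanity-check numerically. Below is a minimal Python sketch of the definitions; the function names are illustrative (not from any Lenovo tool), and the sample figures are the CooLMUC-2 numbers that appear later in this deck.

```python
# Minimal sketch of the efficiency metrics defined earlier.
# All arguments are powers in the same unit (e.g. kW); function
# names are illustrative, not from any Lenovo tool.

def pue(total_facility_power, it_equipment_power):
    """Power Usage Effectiveness: ideal value 1.0."""
    return total_facility_power / it_equipment_power

def itue(it_power, vr_power, psu_power, fan_power):
    """IT power Usage Effectiveness: ideal value 1.0."""
    return (it_power + vr_power + psu_power + fan_power) / it_power

def ere(total_facility_power, reused_power, it_equipment_power):
    """Energy Reuse Effectiveness: ideal value 0.0.

    With no reuse (reused_power == 0), ERE equals PUE.
    """
    return (total_facility_power - reused_power) / it_equipment_power

# CooLMUC-2 figures from later in the deck: 120 kW total facility power,
# 87 kW of reused heat, 104 kW of IT equipment power.
print(round(ere(120.0, 87.0, 104.0), 2))  # -> 0.32
```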
SuperMUC Systems at LRZ: Phase 1 and Phase 2
(Phases 1 and 2 ranked 28 and 29 in the Top500, June 2016)

Phase 1
• Fastest computer in Europe on the Top500, June 2012
  – 9,324 nodes with 2 Intel Sandy Bridge EP CPUs
  – HPL = 2.9 PetaFLOP/s
  – Infiniband FDR10 interconnect
  – Large file space for multiple purposes: 10 PetaByte based on IBM GPFS, with 200 GigaByte/s I/O bandwidth
• Innovative technology for energy-effective computing
  – Hot water cooling
  – Energy aware scheduling
• Most energy-efficient high-end HPC system
  – PUE 1.1
  – Total power consumption over 5 years to be reduced by ~37%, from 27.6 M€ to 17.4 M€

Phase 2 (acceptance completed)
  – 3,096 nx360 M5 compute nodes with Haswell EP CPUs
  – HPL = 2.8 PetaFLOP/s
  – Direct hot water cooling, energy aware scheduling
  – Infiniband FDR14
  – GPFS, 10x GSS26, 7.5 PB capacity, 100 GB/s I/O bandwidth

Lenovo Water Cooling Added Value

Classic water cooling:
  – Direct water cooling of the CPU only: only 60% of the heat goes to water, so 40% still needs to be air cooled.
  – Inlet water temperature up to 35°C: no free cooling all year long in all geos.
  – Heat from the water is wasted.
  – Unproven technology.
  – Power of the server is not managed.

Lenovo water cooling:
  – Direct water cooling of CPUs, DIMMs and VRs: 80 to 85% of the heat goes to water, so just 10% still needs to be air cooled.
  – Inlet water temperature up to 45-50°C: free cooling all year long in all geos.
  – Water is hot enough to be efficiently reused, for example with an absorption chiller, so ERE << 1.
  – 3rd generation water cooling, with more than 10,000 nodes installed.
  – Power and energy are managed and optimized.

DWC Reduces Processor Temperature on Xeon 2697 v4

[Chart: processor temperature and power, air cooled vs direct water cooled; NeXtScale with 2-socket Xeon 2697 v4, 128 GB of 2400 MHz DIMMs, 28°C inlet water]

Conclusion: direct water cooling lowers processor power consumption by about 5% and allows a higher processor frequency.

Air and DWC Performance and DC Power on Xeon 2697 v4

[Chart: performance and DC power, air cooled vs direct water cooled; DC energy measured through the AEM DC energy accumulator]

Conclusion: with Turbo OFF, direct water cooling reduces power by 5%; with Turbo ON, it increases performance by 3% and still reduces power by 1%.

Savings from Lenovo Direct Water Cooling
• Higher TDP processors supported.
• Reduced server power consumption:
  – Lower processor power consumption (~5%)
  – No fans in the node (~4%)
• Reduced cooling power consumption:
  – With DWC at 45°C, free cooling can be assumed all year long (~25%)
• Additional savings with energy aware software.
Total savings = ~35-40%
• Free cooling all year long also means fewer chillers, hence CAPEX savings.

Re-Use of Waste Heat
• New buildings in Germany are very well thermally insulated, with a standard heat requirement of only 50 W/m².
• SuperMUC's waste heat would be sufficient to heat 40,000 m² of office space (~10x).
• What to do with the waste heat during summer?
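As a back-of-envelope check of the sizing above, the slide's figures imply roughly 2 MW of usable waste heat. A short sketch, under the stated assumption (the 2 MW value is derived from "40,000 m² at 50 W/m²", not given explicitly in the deck):

```python
# Back-of-envelope check of the waste-heat sizing above.
# ASSUMPTION: ~2 MW of usable waste heat, implied by the slide's
# "40,000 m2 at 50 W/m2" figures rather than stated explicitly.
waste_heat_w = 2_000_000       # assumed usable waste heat [W]
heat_demand_w_per_m2 = 50      # standard heat requirement, new German buildings [W/m2]
print(waste_heat_w / heat_demand_w_per_m2, "m2")  # -> 40000.0 m2
```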
CooLMUC-2: Waste Heat Re-Use for Chilled Water Production (ERE = 0.3)

• Lenovo NeXtScale Water Cool (WCT) system
  – 384 compute nodes, 466 TFlop/s peak performance
  – Water inlet temperature 50°C; all-season chiller-less cooling
• SorTech adsorption chillers
  – Technology based on zeolite-coated metal fiber heat exchangers, a factor of 3 better than current chillers based on silica gel
  – COP = 60%
• Total electricity reduced by 50+%

Energy Reuse Effectiveness (ERE) measures how efficiently a data center reuses the power dissipated by the computer:

  ERE = (Total Facility Power - Treuse) / IT Equipment Power

CooLMUC-2: ERE = 0.3

  ERE = (Total Facility Power - Treuse) / IT Equipment Power = (120 - 87) / 104 ≈ 0.32

[Chart: CooLMUC-2 power consumption, CooLMUC-2 heat output into the warm water cooling loop, and cold water generated by the adsorption chillers (COP ~0.5 to 0.6). Source: Leibniz Supercomputing Centre]

Savings from Direct Water Cooling with Lenovo
• Server power consumption:
  – Lower processor power consumption (~5%)
  – No fans in the node (~4%)
• Cooling power consumption:
  – With DWC at 45°C, free cooling can be assumed all year long (~25%)
Total savings = ~35-40%
• Additional savings with energy aware software.
• Heat reuse:
  – With DWC at 50°C, an additional 30% savings, as free chilled water is generated.
With heat reuse, total savings reach 50+%.

Lenovo References with DWC (2012-2016)

Site              Nodes   Country    Install date   Max. inlet water
LRZ SuperMUC      9216    Germany    2012           45°C
LRZ SuperMUC 2    4096    Germany    2012           45°C
LRZ SuperCool2    400     Germany    2015           50°C
NTU               40      Singapore  2012           45°C
Enercon           72      Germany    2013           45°C
US Army           756     Hawaii     2013           45°C
Exxon Research    504     NA         2014           45°C
NASA Goddard      80      NA         2014           45°C
PIK               312     Germany    2015           45°C
KIT               1152    Germany    2015           45°C
Birmingham U ph1  28      UK         2015           45°C
Birmingham U ph2  132     UK         2016           45°C
MMD               296     Malaysia   2016           45°C
UNINET            964     Norway     2016           45°C
Peking U          204     China      2017           45°C

More than 18,000 nodes are up and running with Lenovo DWC technology.

How to Manage/Control Power and Energy
• Report
  – Temperature and power consumption per node / per chassis
  – Power consumption and energy per job
• Optimize
  – Reduce power of inactive nodes
  – Reduce power of active nodes

Power Management on NeXtScale
• IMM = Integrated Management Module (node-level systems management)
  – Monitors DC power consumed by the node as a whole and by the CPU and memory subsystems
  – Monitors the inlet air temperature for the node
  – Caps the DC power consumed by the node as a whole
  – Monitors CPU and memory subsystem throttling caused by node-level throttling
  – Enables or disables power savings for the node
• FPC = Fan/Power Controller (chassis-level systems management)
  – Monitors AC and DC power consumed by individual power supplies and aggregates to chassis level
  – Monitors DC power consumed by individual fans and aggregates to chassis level
• PCH = Platform Controller Hub (i.e., the south bridge)
• ME = Management Engine (embedded in the PCH, runs Intel Node Manager firmware)
• HSC = Hot Swap Controller (provides power readings)

DC Power Sampling and Reporting Frequency

[Diagram: sampling and reporting rates along the DC power measurement chain: high-level software and IMM/BMC at 1 Hz; RAPL CPU/memory energy MSRs at 200 Hz; NM/ME at 1 Hz, with a 500 Hz meter; HSC at 10 Hz, with a 1 kHz sensor]
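The RAPL energy MSRs shown in the diagram above can also be read from user space on Linux through the powercap sysfs interface. The sketch below is a generic, host-side example of deriving average package power from the cumulative energy counter; it is not the Lenovo IMM/FPC path, and it assumes an Intel CPU with the intel_rapl driver loaded (package 0 only; reading the counter may require root).

```python
# Sample the cumulative RAPL package-energy counter exposed by
# Linux's powercap interface and derive an average power in watts.
import time

RAPL = "/sys/class/powercap/intel-rapl:0"  # package 0

def read_uj(name):
    """Read one powercap attribute as an integer (microjoules)."""
    with open(f"{RAPL}/{name}") as f:
        return int(f.read())

def average_power_watts(interval_s=1.0):
    max_range = read_uj("max_energy_range_uj")  # counter wraparound point
    e0 = read_uj("energy_uj")
    time.sleep(interval_s)
    e1 = read_uj("energy_uj")
    delta = e1 - e0
    if delta < 0:          # counter wrapped during the interval
        delta += max_range
    return delta / 1e6 / interval_s  # uJ -> J, then J/s = W

if __name__ == "__main__":
    print(f"package-0 average power: {average_power_watts():.1f} W")
```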
