Energy Aware Computing
ENERGY AWARE COMPUTING
Luigi Brochard, Lenovo Distinguished Engineer, WW HPC & AI
HPC Knowledge, June 15, 2017

Agenda
• Different metrics for energy efficiency
• Lenovo cooling solutions
• Lenovo software for energy aware computing

How to measure Power Efficiency
• PUE = Total Facility Power / IT Equipment Power
  – Power Usage Effectiveness (PUE) is a measure of how efficiently a computer data center uses its power.
  – PUE is the ratio of the total power used by a computer facility to the power delivered to the computing equipment.
  – Ideal value is 1.0.
  – It does not take into account how IT power can be optimised.
• ITUE = (IT power + VR + PSU + Fan) / IT Power
  – IT Power Usage Effectiveness (ITUE) measures how the node power can be optimised.
  – Ideal value is 1.0.
• ERE = (Total Facility Power – Treuse) / IT Equipment Power
  – Energy Reuse Effectiveness (ERE) measures how efficiently a data center reuses the power dissipated by the computer.
  – ERE is the ratio of the total power used by a computer facility, minus the reused energy, to the power delivered to the computing equipment.
  – An ideal ERE is 0.0.
  – ERE = PUE – Treuse / IT Equipment Power; if there is no reuse, ERE = PUE.
  (A small worked sketch of these three metrics follows the hardware overview below.)

Choice of Cooling
• Air Cooled
  – Standard air flow with internal fans
  – Fits in any datacenter
  – Maximum flexibility
  – Broadest choice of configurable options supported
  – Supports Native Expansion nodes (Storage NeX, PCI NeX)
  – PUE ~2 – 1.5, ERE ~2 – 1.5
  – Choose for the broadest choice of customizable options
• Air Cooled with Rear Door Heat Exchangers
  – Air cooling, supplemented with an RDHX door on the rack
  – Uses chilled water with an economizer (18°C water)
  – Enables extremely tight rack placement
  – PUE ~1.4 – 1.2, ERE ~1.4 – 1.2
  – Choose for a balance between configuration flexibility and energy efficiency
• Direct Water Cooled
  – Direct water cooling with no internal fans
  – Higher performance per watt
  – Free cooling (45°C water)
  – Energy re-use options
  – Densest footprint
  – Ideal for geos with high electricity costs and for new data centers
  – Supports the highest-wattage processors
  – PUE ~1.1, ERE << 1 with hot water
  – Choose for the highest performance and energy efficiency

TCO: payback period for DWC vs RDHx
[Chart: payback period of DWC vs RDHx for a new data center and for existing data centers at $0.06/kWh, $0.12/kWh and $0.20/kWh]
• New data centers: water cooling has immediate payback.
• For an existing air-cooled data center, the payback period strongly depends on the electricity rate.

iDataplex dx360M4 (2010-2013)
• iDataplex rack with 84 dx360M4 servers
• dx360M4 nodes: 2 CPUs (130 W, 115 W), 16 DIMMs (4 GB/8 GB), 1 HDD / 2 SSDs, network card
• 85% heat recovery, water 18°C-45°C, 0.5 lpm per node
[Images: dx360M4 server, iDataplex rack]

NeXtScale nx360M5 WCT (2013-2016)
• NeXtScale chassis, 6U / 12 nodes, 2 nodes per tray
• nx360M5 WCT: 2 CPUs (up to 165 W), 16 DIMMs (8/16/32 GB), 1 HDD / 2 SSDs, 1 ML2 or PCIe network card
• 85% heat recovery, water 18°C-45°C (and even up to 50°C), 0.5 lpm per node
[Images: copper water loops, 2 nx360M5 WCT nodes in a tray, NeXtScale chassis, scalable manifold, rack configuration, nx360M5 with 2 SSDs]
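Referring back to the PUE, ITUE and ERE definitions above, the following is a minimal sketch of how the three metrics are computed. The numeric values are hypothetical and only illustrate the ranges quoted on the cooling-choice slide; they are not measurements of any system in this deck.

```python
def pue(total_facility_power, it_power):
    """Power Usage Effectiveness = total facility power / IT equipment power (ideal 1.0)."""
    return total_facility_power / it_power

def itue(it_power, vr_power, psu_loss, fan_power):
    """IT Power Usage Effectiveness = (IT + VR + PSU + fan) / IT power (ideal 1.0)."""
    return (it_power + vr_power + psu_loss + fan_power) / it_power

def ere(total_facility_power, reused_power, it_power):
    """Energy Reuse Effectiveness = (facility power - reused power) / IT power (ideal 0.0)."""
    return (total_facility_power - reused_power) / it_power

# Hypothetical example values in kW.
print(pue(1500.0, 1000.0))          # 1.5: typical of an air-cooled room
print(pue(1100.0, 1000.0))          # 1.1: the figure quoted for direct water cooling
print(ere(1100.0, 850.0, 1000.0))   # 0.25: heat reuse pushes ERE well below 1
```

With no heat reuse the reused term is zero and ERE collapses to PUE, matching the note on the metrics slide.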
SuperMUC systems at LRZ: Phase 1 and Phase 2
Ranked 28 and 29 in the Top500, June 2016

Phase 1
• Fastest computer in Europe on the Top500, June 2012
  – 9,324 nodes with 2 Intel Sandy Bridge EP CPUs
  – HPL = 2.9 PetaFLOP/s
  – Infiniband FDR10 interconnect
  – Large file space for multiple purposes: 10 PetaByte file space based on IBM GPFS with 200 GigaByte/s I/O bandwidth
• Innovative technology for energy efficient computing
  – Hot water cooling
  – Energy aware scheduling
• Most energy efficient high-end HPC system
  – PUE 1.1
  – Total power consumption over 5 years to be reduced by ~37%, from 27.6 M€ to 17.4 M€

Phase 2
• Acceptance completed
  – 3,096 nx360M5 compute nodes, Haswell EP CPUs
  – HPL = 2.8 PetaFLOP/s
  – Direct hot water cooled, energy aware scheduling
  – Infiniband FDR14
  – GPFS, 10 x GSS26, 7.5 PB capacity, 100 GB/s I/O bandwidth

Lenovo Water Cooling added value
• Classic water cooling
  – Direct water cooling of the CPU only: only 60% of the heat goes to the water, so 40% still needs to be air cooled
  – Inlet water temperature up to 35°C: no free cooling all year long in all geos
  – Heat from the water is wasted
  – Unproven technology
  – Power of the server is not managed
• Lenovo water cooling
  – Direct water cooling of CPU/DIMMs/VRs: 80 to 85% of the heat goes to the water, so just 10% still needs to be air cooled
  – Inlet water temperature up to 45-50°C: free cooling all year long in all geos
  – Water is hot enough to be efficiently reused, e.g. with an absorption chiller, giving ERE << 1
  – 3rd generation water cooling, with more than 10,000 nodes installed
  – Power and energy are managed and optimized

DWC reduces processor temperature on Xeon 2697 v4
• Conclusion: direct water cooling lowers processor power consumption by about 5% and allows higher processor frequency.
• NeXtScale node with 2-socket Xeon 2697 v4, 128 GB of 2400 MHz DIMMs; inlet water temperature is 28°C.

Air and DWC performance and DC power on Xeon 2697 v4
• Conclusion: with Turbo OFF, direct water cooling reduces power by 5%; with Turbo ON, it increases performance by 3% and still reduces power by 1%.
• DC energy is measured through the AEM DC energy accumulator.

Savings from Lenovo Direct Water Cooling
• Higher TDP processors
• Reduced server power consumption
  – Lower processor power consumption (~5%)
  – No fan per node (~4%)
• Reduced cooling power consumption
  – With DWC at 45°C, we assume free cooling all year long (~25%)
• Additional savings with energy aware software
Total savings = ~35-40%
• Free cooling all year long means fewer chillers, hence CAPEX savings

Re-use of waste heat
• New buildings in Germany are very well thermally insulated: a standard heat requirement of only 50 W/m².
• SuperMUC's waste heat would be sufficient to heat 40,000 m² of office space (~10x).
• What to do with the waste heat during summer?
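The two figures on the waste-heat slide imply the amount of reusable heat directly. A quick back-of-the-envelope check is shown below; the 2 MW result is only what those two numbers multiply out to, not a figure stated in the deck.

```python
heat_demand_w_per_m2 = 50    # standard heat requirement of new German buildings (W/m^2)
heated_area_m2 = 40_000      # office area SuperMUC's waste heat could supply (m^2)

# Heating power implied by supplying 50 W/m^2 over 40,000 m^2.
reusable_heat_mw = heat_demand_w_per_m2 * heated_area_m2 / 1e6
print(f"Implied reusable waste heat: {reusable_heat_mw:.1f} MW")   # 2.0 MW
```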
CooLMUC-2: waste heat re-use for chilled water production (ERE = 0.3)
• Lenovo NeXtScale Water Cool (WCT) system
  – 384 compute nodes
  – 466 TFlop/s peak performance
  – Water inlet temperatures of 50°C
  – All-season chiller-less cooling
  – Total electricity reduced by 50+%
• SorTech adsorption chillers
  – Technology based on zeolite-coated metal fiber heat exchangers, a factor 3 higher than current chillers based on silica gel
  – COP = 60%
• Energy Reuse Effectiveness (ERE) measures how efficiently a data center reuses the power dissipated by the computer:
  ERE = (Total Facility Power – Treuse) / IT Equipment Power

CooLMUC-2: ERE = 0.3
• ERE = (Total Facility Power – Treuse) / IT Equipment Power = (120 – 87) / 104 = 0.32
[Chart: CooLMUC-2 power consumption, CooLMUC-2 heat output into the warm water cooling loop, and cold water generated by the absorption chillers (COP ~0.5 – 0.6), Leibniz Supercomputing Centre]

Savings from Direct Water Cooling with Lenovo
• Server power consumption
  – Lower processor power consumption (~5%)
  – No fan per node (~4%)
• Cooling power consumption
  – With DWC at 45°C, we assume free cooling all year long (~25%)
Total savings = ~35-40%
• Additional savings with energy aware software
• Heat reuse
  – With DWC at 50°C, an additional 30% savings as free chilled water is generated
With heat reuse, total savings reach 50+%

Lenovo references with DWC (2012-2016)
Site                 Nodes   Country     Install date   Max. inlet water
LRZ SuperMUC         9216    Germany     2012           45°C
LRZ SuperMUC 2       4096    Germany     2012           45°C
LRZ SuperCool2        400    Germany     2015           50°C
NTU                    40    Singapore   2012           45°C
Enercon                72    Germany     2013           45°C
US Army               756    Hawaii      2013           45°C
Exxon Research        504    NA          2014           45°C
NASA Goddard           80    NA          2014           45°C
PIK                   312    Germany     2015           45°C
KIT                  1152    Germany     2015           45°C
Birmingham U ph1       28    UK          2015           45°C
Birmingham U ph2      132    UK          2016           45°C
MMD                   296    Malaysia    2016           45°C
UNINET                964    Norway      2016           45°C
Peking U              204    China       2017           45°C
More than 18,000 nodes up and running with Lenovo DWC technology.

How to manage/control power and energy
• Report
  – Temperature and power consumption per node / per chassis
  – Power consumption and energy per job
• Optimize
  – Reduce the power of inactive nodes
  – Reduce the power of active nodes

Power Management on NeXtScale
• IMM = Integrated Management Module (node-level systems management)
  – Monitors DC power consumed by the node as a whole and by the CPU and memory subsystems
  – Monitors the inlet air temperature for the node
  – Caps the DC power consumed by the node as a whole
  – Monitors CPU and memory subsystem throttling caused by node-level throttling
  – Enables or disables power savings for the node
• FPC = Fan/Power Controller (chassis-level systems management)
  – Monitors AC and DC power consumed by individual power supplies and aggregates to chassis level
  – Monitors DC power consumed by individual fans and aggregates to chassis level
• PCH = Platform Controller Hub (i.e., south bridge)
• ME = Management Engine (embedded in the PCH, runs the Intel NM firmware)
• HSC = Hot Swap Controller (provides power readings)

DC power sampling and reporting frequency
[Diagram: reporting chain from sensor-level measurements up to high-level software — High Level Software, IMM/BMC, RAPL CPU/memory energy MSRs, NM/ME, meter, HSC and sensor, with sampling/reporting rates of 1 Hz, 200 Hz, 1 Hz, 500 Hz, 10 Hz and 1 kHz along the chain]
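The RAPL energy counters at the bottom of the sampling chain can also be read directly from Linux user space through the powercap sysfs interface. The sketch below assumes an Intel node with the intel_rapl driver loaded and sufficient privileges; it is a host-side alternative for illustration, not the IMM/BMC path described above.

```python
import time
from pathlib import Path

# Package 0 RAPL domain exposed by the Linux powercap framework
# (path assumes the intel_rapl driver is loaded; it may differ per kernel).
domain = Path("/sys/class/powercap/intel-rapl:0")

def energy_uj():
    """Read the cumulative package energy counter in microjoules."""
    return int((domain / "energy_uj").read_text())

def average_power(interval_s=1.0):
    """Average package power (W) over an interval, handling counter wrap-around."""
    max_range = int((domain / "max_energy_range_uj").read_text())
    start = energy_uj()
    time.sleep(interval_s)
    end = energy_uj()
    delta = (end - start) % (max_range + 1)   # counter wraps at max_energy_range_uj
    return delta / 1e6 / interval_s

if __name__ == "__main__":
    name = (domain / "name").read_text().strip()
    print(f"{name}: {average_power():.1f} W")
```

Sampling in a loop at 1 Hz mirrors the rate at which high-level software polls the IMM/BMC in the diagram, while the underlying counters are updated far faster.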