ENERGY AWARE COMPUTING

Luigi Brochard, Distinguished Engineer, WW HPC & AI

HPC Knowledge, June 15 2017

Agenda

. Different metrics for energy efficiency
. Lenovo cooling solutions
. Lenovo software for energy aware computing

2017 Lenovo. All rights reserved.

How to measure Power Efficiency

• PUE = Total Facility Power / IT Equipment Power
– Power usage effectiveness (PUE) is a measure of how efficiently a computer data center uses its power
– PUE is the ratio of the total power used by a computer facility to the power delivered to computing equipment
– Ideal value is 1.0
– It does not take into account how IT power can be optimised

• ITUE = (IT power + VR + PSU + Fan) / IT Power
– IT power usage effectiveness (ITUE) measures how the node power can be optimised
– Ideal value is 1.0

• ERE = (Total Facility Power – Treuse) / IT Equipment Power
– Energy Reuse Effectiveness (ERE) measures how efficiently a data center reuses the power dissipated by the computer
– ERE is the ratio of the total power used by a computer facility, minus the reused power Treuse, to the power delivered to computing equipment
– An ideal ERE is 0.0
– ERE = PUE – Treuse / IT Equipment Power; if there is no reuse, ERE = PUE
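The three metrics above are simple ratios, so a small Python sketch may help when comparing sites. The function and parameter names are illustrative, not part of any Lenovo tool; the sample figures (120, 87, 104) are the CooLMUC-2 values quoted later in this deck, in whatever power unit that slide uses.

```python
def pue(total_facility_power, it_equipment_power):
    """Power usage effectiveness: total facility power over IT equipment power."""
    return total_facility_power / it_equipment_power


def itue(it_power, vr_power, psu_power, fan_power):
    """IT power usage effectiveness: node power including VR, PSU and fan losses over IT power."""
    return (it_power + vr_power + psu_power + fan_power) / it_power


def ere(total_facility_power, reused_power, it_equipment_power):
    """Energy reuse effectiveness: facility power minus reused power, over IT equipment power."""
    return (total_facility_power - reused_power) / it_equipment_power


if __name__ == "__main__":
    # CooLMUC-2 figures from a later slide: facility 120, reused 87, IT 104.
    print(round(pue(120, 104), 2))
    print(round(ere(120, 87, 104), 2))
```

With no reuse (reused_power = 0) the `ere` function returns the same value as `pue`, which matches the last bullet above.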

Choice of Cooling

Air Cooled
. Standard air flow with internal fans
. Fits in any datacenter
. Maximum flexibility
. Broadest choice of configurable options supported
. Supports Native Expansion nodes (Storage NeX, PCI NeX)
. PUE ~2 – 1.5
. ERE ~2 – 1.5
. Choose for broadest choice of customizable options

Air Cooled with Rear Door Heat Exchangers
. Air cool, supplemented with RDHX door on rack
. Uses chilled water with economizer (18°C water)
. Enables extremely tight rack placement
. PUE ~1.4 – 1.2
. ERE ~1.4 – 1.2
. Choose for balance between configuration flexibility and energy efficiency

Direct Water Cooled
. Direct water cooling with no internal fans
. Higher performance per watt
. Free cooling (45°C water)
. Energy re-use
. Densest footprint
. Ideal for geos with high electricity costs and new data centers
. Supports highest wattage processors
. PUE ~1.1
. ERE << 1 with hot water
. Choose for highest performance and energy efficiency

TCO: payback period for DWC vs RDHx

[Chart: payback period of DWC vs RDHx for a new data center and for existing data centers at $0.06/kWh, $0.12/kWh and $0.20/kWh]

. New data centers: water cooling has immediate payback.
. Existing air-cooled data centers: payback period strongly depends on electricity rate.

iDataplex dx360M4 (2010-2013)

. iDataplex rack with 84 dx360M4 servers
. dx360M4 nodes: 2 x CPUs (130W, 115W), 16 x DIMMs (4GB/8GB), 1 HDD / 2 SSDs, network card
. 85% heat recovery, water 18°C-45°C, 0.5 lpm / node

[Photos: dx360M4 server; iDataplex rack]

NextScale nx360M5 WCT (2013-2016)

• NextScale chassis: 6U / 12 nodes, 2 nodes / tray
• nx360M5 WCT: 2 x CPUs (up to 165W), 16 x DIMMs (8GB/16GB/32GB), 1 HDD / 2 SSDs, 1 ML2 or PCIe network card
• 85% heat recovery, water 18°C-45°C (and even up to 50°C), 0.5 lpm / node

[Photos: copper water loops; 2 nodes of nx360M5 WCT in a tray; NextScale chassis; scalable manifold; rack configuration]

[Photo: nx360M5 with 2 SSDs]

SuperMUC systems at LRZ: Phase 1 and Phase 2

Ranked 28 and 29 in Top500 June 2016

Phase 1
• Fastest Computer in Europe on Top 500, June 2012
– 9324 nodes with 2 Intel Sandy Bridge EP CPUs
– HPL = 2.9 PetaFLOP/s
– Infiniband FDR10 interconnect
– Large file space for multiple purposes
- 10 PetaByte file space based on IBM GPFS with 200 GigaByte/s I/O bw
• Innovative Technology for Energy Effective Computing
– Hot Water Cooling
– Energy Aware Scheduling
• Most Energy Efficient high End HPC System
– PUE 1.1
– Total power consumption cost over 5 years to be reduced by ~37%, from 27.6 M€ to 17.4 M€

Phase 2
. Acceptance completed
– 3096 nx360m5 compute nodes, Haswell EP CPUs
– HPL = 2.8 PetaFLOP/s
– Direct Hot Water Cooled, Energy Aware Scheduling
– Infiniband FDR14
– GPFS, 10 x GSS26, 7.5 PB capacity, 100 GB/s IO bw

Lenovo Water Cooling added value

• Classic Water Cooling
– Direct water cooling CPU only
- Only 60% of heat goes to water
- => 40% still needs to be air cooled
– Inlet water temperature
- Up to 35°C
- => No free cooling all year long / all geo
– Heat from water is wasted
– Unproven technology
– Power of server is not managed

• Lenovo Water Cooling
– Direct water cooling CPU/DIMMs/VRs
- 80 to 85% of heat goes to water
- => just 10% still needs to be air cooled
– Inlet water temperature
- Up to 45-50°C
- => Free cooling all year long in all geo
– Water is hot enough to be efficiently reused
- like with Absorption chiller => ERE << 1
– 3rd generation Water Cooling
- More than 10000 nodes installed
– Power and energy are managed & optimized

DWC reduces Processor Temperature on Xeon 2697 v4

Conclusion: Direct Water Cooling lowers processor power consumption by about 5% and allows higher processor frequency.

NXT with 2-socket 2697v4, 128 GB 2400 MHz DIMMs. Inlet water temperature is 28°C.

Air and DWC performance and DC power on Xeon 2697v4

Conclusion: With Turbo OFF, Direct Water Cooling reduces power by 5%. With Turbo ON, it increases performance by 3% and still reduces power by 1%.

DC energy is measured through the aem DC energy accumulator.

Savings from Lenovo Direct Water Cooling

• Higher TDP processors

• Reduced server power consumption
– Lower processor power consumption (~ 5%)
– No fan per node (~ 4%)

• Reduced cooling power consumption
– With DWC at 45°C, we assume free cooling all year long (~ 25%)

Total savings = ~35-40%

• Additional savings with Energy Aware SW

• Free cooling all year long => Fewer chillers => CAPEX savings


• Re-Use of Waste Heat

● New buildings in Germany are very well thermally insulated: standard heat requirement of only 50 W/m2

● SuperMUC's waste heat would be sufficient to heat 40.000 m2 of office space (~10 x)

● What to do with the waste heat during summer?
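The heating figures above imply a rough size for the reusable heat stream. The following Python sketch just multiplies the two numbers quoted on the slide; the function name is illustrative, and the result is an implied value, not a figure stated in the deck.

```python
def implied_heat_mw(heat_requirement_w_per_m2, heated_area_m2):
    """Heat needed to warm a given floor area, in megawatts."""
    return heat_requirement_w_per_m2 * heated_area_m2 / 1e6


if __name__ == "__main__":
    # 50 W/m2 over 40,000 m2 of office space, as quoted on the slide.
    print(implied_heat_mw(50, 40_000), "MW")
```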

CooLMUC-2: Waste Heat Re-Use for Chilled Water Production

ERE = 0.3

● Lenovo NeXtScale Water Cool (WCT) system technology
– Water inlet temperatures 50°C
– All season chiller-less cooling
– 384 compute nodes
– 466 TFlop/s peak performance

● SorTech Adsorption Chillers
– based on zeolite coated metal fiber heat exchangers
– a factor 3 higher than current chillers based on silica gel
– COP = 60%
– Total electricity reduced by 50+%

Energy Reuse Effectiveness (ERE) measures how efficiently a data center reuses the power dissipated by the computer:

ERE = (Total Facility Power – Treuse) / IT Equipment Power

CooLMUC-2: ERE = (120 – 87) / 104 = 0.32

[Charts, Leibniz Supercomputing Centre: CooLMUC-2 power consumption; CooLMUC-2 heat output into warm water cooling loop; cold water generated by absorption chillers (COP ~ 0.5 – 0.6)]

Savings from Direct Water Cooling with Lenovo

• Server power consumption
– Lower processor power consumption (~ 5%)
– No fan per node (~ 4%)

• Cooling power consumption
– With DWC at 45°C, we assume free cooling all year long (~ 25%)

Total savings = ~35-40%

• Additional savings with energy aware SW

• Heat Reuse
– With DWC at 50°C, additional 30% savings as free chilled water is generated

With heat reuse, total savings => 50+%


Lenovo references with DWC (2012-2016)

Site                 Nodes   Country     Install date   Max. inlet water
LRZ SuperMUC         9216    Germany     2012           45°C
LRZ SuperMUC 2       4096    Germany     2012           45°C
LRZ SuperCool2       400     Germany     2015           50°C
NTU                  40      Singapore   2012           45°C
Enercon              72      Germany     2013           45°C
US Army              756     Hawaii      2013           45°C
Exxon Research       504     NA          2014           45°C
NASA Goddard         80      NA          2014           45°C
PIK                  312     Germany     2015           45°C
KIT                  1152    Germany     2015           45°C
Birmingham U ph1     28      UK          2015           45°C
Birmingham U ph2     132     UK          2016           45°C
MMD                  296     Malaysia    2016           45°C
UNINET               964     Norway      2016           45°C
Peking U             204     China       2017           45°C

More than 18.000 nodes up and running with DWC Lenovo technology

How to manage/control power and energy

• Report
– temperature and power consumption per node / per chassis
– power consumption and energy per job
• Optimize
– Reduce power of inactive nodes
– Reduce power of active nodes

Power Management on NeXtScale

• IMM = Integrated Management Module (Node-Level Systems Management)
– Monitors DC power consumed by node as a whole and by CPU and memory subsystems
– Monitors inlet air temperature for node
– Caps DC power consumed by node as a whole
– Monitors CPU and memory subsystem throttling caused by node-level throttling
– Enables or disables power savings for node

• FPC = Fan/Power Controller (Chassis-Level Systems Mgmt)
– Monitors AC and DC power consumed by individual power supplies and aggregates to chassis level
– Monitors DC power consumed by individual fans and aggregates to chassis level

PCH = Platform Controller Hub (i.e., south bridge)
ME = Management Engine (embedded in PCH, runs Intel NM firmware)
HSC = Hot Swap Controller (provides power readings)

DC power sampling and reporting frequency

[Diagram: chain of power readings from the Sensor through the HSC, NM/ME and IMM/BMC up to High Level Software, alongside RAPL (CPU/memory energy MSRs) and the meter; rates shown on the links: 1 kHz, 500 Hz, 200 Hz, 10 Hz and 1 Hz]

Power reading improvements on next gen. NeXtScale

Targets
• Better than or equal to +/-3% power reading accuracy down to the node's minimum active power (~40-50W DC)
• Power granularity <= 100 mW
• At least 100 Hz update rate for node power readings
– 100 readings are returned en masse, together with a beginning timestamp
– Time between successive readings is maintained at 10 ms

[Diagram: node power metering path, with Rsense between Bulk 12V and Node 12V. Labelled blocks: INA226, SN1405006, FPGA (FIFO); annotations: "used for IMM metering", "used for ME (Node Manager) capping"; readings retrieved with an ipmi raw oem cmd]
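To illustrate how a batch of 100 readings with a starting timestamp and a fixed 10 ms spacing could be turned into energy, here is a minimal Python sketch. The `PowerBatch` structure and function names are hypothetical; the deck does not describe the actual format returned by the ipmi raw oem command.

```python
from dataclasses import dataclass
from typing import List

SAMPLE_INTERVAL_S = 0.010  # 10 ms between successive readings, per the targets above


@dataclass
class PowerBatch:
    """Hypothetical container for one batch of node power readings."""
    start_timestamp_s: float   # timestamp of the first reading
    readings_w: List[float]    # power samples in watts, 10 ms apart


def batch_energy_joules(batch: PowerBatch) -> float:
    """Approximate energy by treating each sample as constant over its 10 ms slot."""
    return sum(batch.readings_w) * SAMPLE_INTERVAL_S


def batch_average_power_w(batch: PowerBatch) -> float:
    """Mean power over the batch."""
    return sum(batch.readings_w) / len(batch.readings_w)


def sample_timestamps(batch: PowerBatch) -> List[float]:
    """Reconstruct the timestamp of every sample from the starting timestamp."""
    return [batch.start_timestamp_s + i * SAMPLE_INTERVAL_S
            for i in range(len(batch.readings_w))]


if __name__ == "__main__":
    batch = PowerBatch(start_timestamp_s=0.0, readings_w=[200.0] * 100)
    print(batch_energy_joules(batch), "J over one second")
```

A batch of 100 samples covers one second, so a node drawing a steady 200 W yields about 200 J per batch.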

Lenovo HPC Software Stack (demo at ISC'17)

Lenovo OpenSource HPC Stack

An OpenSource IaaS suite to run and manage optimally and transparently HPC, Big Data and Workflows on a virtualized infrastructure, adjusting dynamically to user and datacenter needs through energy policies

• Built on top of OpenHPC with xCAT
• Enhanced with Lenovo configuration, plugins and scripts
• Add in IBM Spectrum Scale, Singularity*
• Add in Lenovo Antilles and Energy Aware run time**
• Integrated and supported

A ready to use HPC Stack

[Stack diagram, top to bottom:
 Customer Applications
 User / Operator Web Portal: Antilles - Simplified User Job View / Customized
 Admin.: xCAT (Extreme Cloud Admin. Toolkit), Confluent
 Energy Awareness: Lenovo EAR
 Containers: Singularity
 Parallel File Systems: IBM Spectrum Scale & NFS, Lustre
 OS, OFED
 Lenovo System x (Virtual, Physical, Desktop, Server): Compute, Storage, Network (OmniPath, Infiniband)
 Side bar: Services (Professional, Enterprise): installation and custom services, may not include service support for third party software]

* collaborating w/ Greg Kurtzer from Lawrence Berkeley National Lab
** in collaboration with BSC

Confluent: HeatMap

How to manage/control power and energy

• Report
– temperature and power consumption per node / per chassis (xCAT/Confluent)
– power consumption and energy per job
• Optimize
– Reduce power of inactive nodes
– Reduce frequency of active nodes

IBM Platform LSF, SLURM, PBSpro

Energy Aware Run time: Motivation (demo at ISC'17)

• Power and Energy have become a critical constraint for HPC systems
• Performance and Power consumption of parallel applications depend on:
– Architectural parameters
– Runtime node configuration
– Application characteristics
– Input data
• Manual "best" frequency
– Difficult to select manually; it is a time consuming process (resources and then power) and not reusable
– It may change over time
– It may change between nodes

[Diagram: manual tuning loop. Configure application for Architecture X; execute with N frequencies: calculate time and energy; select optimal frequency; repeat for each new architecture and each new input set]

Energy Aware Run time (EAR) Goals

• Offer a dynamic and transparent solution to energy awareness:
– Avoiding having to re-execute applications again and again
– Easy to use
- Without source code modifications
- Without historic application information
- Supporting standard programming models: MPI, MPI+OpenMP
- Using standard libraries and tools as much as possible to be easily portable
- Open Source
– Frequency change based on simple Energy Policies with performance thresholds
– Minimizing the overhead introduced
- Remember we are HPC and we want to save energy!

EAR proposal

• Automatic and dynamic frequency selection based on:
– Architecture characterization
– Application characterization
- Outer loop detection (DPD)
- Application signature computation (CPI, GBS, POWER, TIME)
– Performance and power projection
– Users/System policy definition for frequency selection (configured with thresholds)
- MINIMIZE_ENERGY_TO_SOLUTION
  - Goal: to save energy by reducing frequency (with potential performance degradation)
  - We limit the performance degradation with a MAX_PERFORMANCE_DEGRADATION threshold
- MINIMIZE_TIME_TO_SOLUTION
  - Goal: to reduce time by increasing frequency (with potential energy increase)
  - We use a MIN_PERFORMANCE_EFFICIENCY_GAIN threshold so that applications that do not scale with frequency do not consume more energy for nothing

BQCD use case

• Berlin quantum chromodynamics program (BQCD)
– https://www.rrz.uni-hamburg.de/services/hpc/bqcd
– Two input sets generate two different use cases
- Input 1 generates a CPU intensive use case, much more sensitive to node frequency changes
- Input 2 generates a memory intensive use case, less sensitive to node frequency changes

BQCD use case

• MPI+OpenMP code
• Configured with 16 MPI processes (4 nodes, 4 ppn)
– Number of threads adapted to number of cores
• Architecture tested
– Lenox cluster
– Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.3GHz, 18c
• Making this comparison consumed 96 hours of computation x 4 nodes!

[Charts: "BQCD at Lenox", execution time (sec.) and average power (W) at 2.3GHz, 2.2GHz, 2.1GHz, 2.0GHz, 1.9GHz and 1.8GHz, for the CPU and MEM use cases]

BQCD static analysis

• CPU intensive @ 1.8GHz vs. 2.3GHz
– Power saving is 34%
– Performance degradation is 31%
– Energy variation is 14%
– Minimal energy is at 2.0GHz
• Memory intensive @ 1.8GHz vs. 2.3GHz
– Power saving is 14%
– Performance degradation is 4%
– Energy variation is 10%
– Minimal energy is at 1.8GHz
• If we set a maximum performance penalty (e.g. 20%)
– CPU version will run at 2.0GHz
– Mem. version will run at 1.8GHz

[Charts: "BQCD_CPU: Fi vs. 2.3GHz" and "BQCD_MEM: Fi vs. 2.3GHz", showing performance degradation, power saving and energy saving at 2.2GHz to 1.8GHz]

Energy Aware Runtime

EAR overall overview

Learning Phase: at EAR installation (*)
[Diagram: Kernels execution -> Coefficients computation -> Store -> Coefficients DB]

EAR execution (loaded with application)
[Diagram: Read Coefficients; DPD detects outer loop -> Compute power and performance metrics for outer loop -> Optimal frequency calculation, with the Energy policy as input]

(*) or every time cluster configuration is modified (more memory per node, new processors ...)

Learning Phase

Performance and Power models

• We use linear models to predict Power and Time at any available frequency fn from a reference frequency Rf, based on:
– HW coefficients (A(Rf,fn), B(Rf,fn), C(Rf,fn), D(Rf,fn), E(fn) and F(Rf,fn)), which represent the parameters of each node in the system, measured in the Learning Phase
- At EAR installation or every time cluster configuration is modified
– Application signature (Power, CPI and GBS), computed while the application is running
- Power, CPI and GBS are the basic metrics that characterize an application: we define them as the Application signature
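The slide names the coefficients and the signature but does not give the equations. The Python sketch below shows one plausible linear form purely as an assumption: power and CPI at fn are linear in the reference signature, and time is rescaled by the CPI ratio and the frequency ratio. The class names, field names and the way the coefficients are combined are all hypothetical and should not be read as EAR's actual model.

```python
from dataclasses import dataclass


@dataclass
class Coefficients:
    """Hypothetical node coefficients for one (Rf, fn) pair, from the Learning Phase."""
    A: float
    B: float
    C: float
    D: float
    E: float
    F: float


@dataclass
class Signature:
    """Application signature measured at the reference frequency Rf."""
    power_w: float
    cpi: float
    gbs: float
    time_s: float


def project(sig: Signature, coef: Coefficients,
            ref_freq_ghz: float, target_freq_ghz: float):
    """Assumed linear projection of (power, time) from Rf to fn."""
    power = coef.A * sig.power_w + coef.B * sig.gbs + coef.C
    cpi = coef.D * sig.cpi + coef.E * sig.gbs + coef.F
    # Same instruction count: time scales with cycles per instruction
    # and inversely with clock frequency.
    time_s = sig.time_s * (cpi / sig.cpi) * (ref_freq_ghz / target_freq_ghz)
    return power, time_s


if __name__ == "__main__":
    sig = Signature(power_w=250.0, cpi=0.8, gbs=40.0, time_s=100.0)
    identity = Coefficients(A=1.0, B=0.0, C=0.0, D=1.0, E=0.0, F=0.0)
    print(project(sig, identity, 2.3, 2.0))
```

With identity coefficients the sketch degenerates to pure frequency scaling of time, which is a convenient sanity check before plugging in fitted values.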

[Diagram, Learning Phase steps:
 1. Execution of selected kernels with available frequencies on each node of the system
 2. Metrics collection: execution time, Avg. Power, CPI and GBS
 3. Computation of node coefficients to be used in performance and power models]

Dynamic Application Characterization by Dynamic Periodicity Detection

Dynamic characterization is driven by DPD

[Diagram: Dynamic Pattern Detection -> IF new pattern: Signature computation (CPI, GBS, POWER, TIME) -> IF signature changes: Performance/Power projection models applied -> Optimal frequency selection based on policy and arguments]

Dynamic Pattern Detection

• DPD algorithm detects repetitive sequences of events
• EAR generates an event identifier per dynamic MPI call
– Using MPI arguments such as type of call, sendbuff, recvbuff, etc.
• HPC applications usually present this repetitive behaviour since they exploit loops
• DPD receives a new event as input and reports the dpd_status

[Diagram: New event -> DPD -> NEW_LOOP / NEW_ITERATION]

DPD example

Events sequence (over time):

0,4,8,1,2,3,1,2,3,1,2,3,1,2,3,4,8,1,2,3,1,2,3,1,2,3,1,2,3,4,8……

Statuses reported along the sequence:
– NEW_LOOP (Loop_id=1, Loop_size=3) when the inner 1,2,3 pattern is detected
– NEW_ITERATION (Loop_id=1, Loop_size=3) on each further repetition
– END_LOOP when the pattern is broken
– NEW_LOOP (Loop_id=4, Loop_size=14) when the outer loop is detected
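The following Python sketch illustrates the general idea of periodicity detection on an event stream. It is a deliberately simplified detector written for this example, not EAR's DPD algorithm: it only tracks the smallest repeating period, does not assign loop identifiers, and does not detect nested outer loops such as the size-14 loop above. The status names are borrowed from the slide; everything else is an assumption.

```python
def detect_loops(events, max_period=32):
    """Return a list of (event_index, status, loop_size) for a stream of event ids.

    A loop is declared when the last p events repeat the p events before them
    (smallest such p wins). While in a loop, every completed period reports
    NEW_ITERATION; the first event that breaks the pattern reports END_LOOP.
    """
    history = []
    period = 0   # size of the loop currently being followed, 0 if none
    pos = 0      # number of events matched in the current iteration
    statuses = []
    for i, event in enumerate(events):
        history.append(event)
        if period:
            # Still periodic if this event equals the one a full period earlier.
            if event == history[-1 - period]:
                pos += 1
                if pos == period:
                    statuses.append((i, "NEW_ITERATION", period))
                    pos = 0
                continue
            statuses.append((i, "END_LOOP", period))
            period = 0
        # Not in a loop: search for the smallest period that just repeated.
        for p in range(1, max_period + 1):
            if len(history) >= 2 * p and history[-p:] == history[-2 * p:-p]:
                period, pos = p, 0
                statuses.append((i, "NEW_LOOP", p))
                break
    return statuses


if __name__ == "__main__":
    sequence = [0, 4, 8, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 8, 1, 2, 3]
    for index, status, size in detect_loops(sequence):
        print(index, status, size)
```

On the first part of the slide's sequence this sketch reports a loop of size 3, two further iterations, and an END_LOOP when event 4 arrives.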

EAR at work on BQCD_CPU

[Figure: four pairs of plots, for mpi rank 0 and mpi rank 8, all against accumulated time:
 – BQCD_CPU: Outer loop size detected (annotation: Big loop detected)
 – BQCD_CPU: Frequency (annotation: Policy is applied, F: 2.6GHz -> 2.4GHz)
 – BQCD_CPU: Measured POWER, Avg. Power (W) (annotation: Power is reduced)
 – BQCD_CPU: Measured iteration time (usecs) (annotation: Iteration time is similar)]

EAR policies

• MINIMIZE_ENERGY_TO_SOLUTION
– Goal: minimize the energy consumed with a limit to the performance penalty
– User/Sysadmin defines a MAX_PERFORMANCE_PENALTY
– Based on Signature and node coefficients, EAR computes a performance projection for each available frequency
– EAR selects the optimal frequency that minimizes Energy with performance_penalty <= MAX_PERFORMANCE_PENALTY

• MINIMIZE_TIME_TO_SOLUTION
– Goal: improve the execution time while guaranteeing a minimum performance improvement that justifies more energy consumption
– User/Sysadmin defines a MIN_PERFORMANCE_EFFICIENCY_GAIN
– Based on Signature and node coefficients, EAR computes a performance projection for each available frequency
– EAR selects the optimal frequency that improves application performance with performance_improvement >= MIN_PERFORMANCE_EFFICIENCY_GAIN
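The MINIMIZE_ENERGY_TO_SOLUTION selection step can be sketched in a few lines of Python. This is an illustration of the rule as stated on the slide, not EAR's code: the function name is hypothetical, the projections are assumed to be already available as (time, power) pairs per frequency, and the numbers in the example table are made up for the demonstration.

```python
def min_energy_frequency(projections, default_freq, max_performance_penalty):
    """Pick the frequency with the lowest projected energy whose projected
    slowdown, relative to the default frequency, stays within the threshold.

    projections: dict mapping frequency (GHz) -> (projected time in s, projected power in W)
    """
    ref_time, ref_power = projections[default_freq]
    best_freq = default_freq
    best_energy = ref_time * ref_power
    for freq, (time_s, power_w) in projections.items():
        penalty = time_s / ref_time - 1.0   # relative increase in execution time
        energy = time_s * power_w           # energy = time x average power
        if penalty <= max_performance_penalty and energy < best_energy:
            best_freq, best_energy = freq, energy
    return best_freq


if __name__ == "__main__":
    # Made-up projections, for illustration only.
    table = {
        2.3: (100, 250),
        2.2: (103, 238),
        2.1: (108, 226),
        2.0: (113, 214),
        1.9: (121, 205),
        1.8: (131, 198),
    }
    for threshold in (0.05, 0.10, 0.20):
        print(threshold, min_energy_frequency(table, 2.3, threshold))
```

The MINIMIZE_TIME_TO_SOLUTION policy would follow the same pattern with the comparison reversed; it is left out here because the slide does not spell out how the efficiency gain is computed.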

EAR evaluation

Performance and power projection models: BQCD real vs. projected values

[Figure: real vs. projected execution time (sec.) and average power (W) for BQCD_CPU and BQCD_MEM at 2.3GHz to 1.8GHz, plus a chart of the projection error (projected vs. real metrics) for time and power at 2.2GHz to 1.8GHz, on a 0-10% axis]

Average error is 5%

BQCD "best" static frequency selection

• CPU intensive @ 1.8GHz vs. 2.3GHz
– Power saving is 34%
– Performance degradation is 31%
– Energy variation is 14%
– Minimal energy is at 2.0GHz
• Memory intensive @ 1.8GHz vs. 2.3GHz
– Power saving is 14%
– Performance degradation is 4%
– Energy variation is 10%
– Minimal energy is at 1.8GHz

[Charts: "BQCD_CPU: Fi vs. 2.3GHz" and "BQCD_MEM: Fi vs. 2.3GHz", showing performance degradation, power saving and energy saving at 2.2GHz to 1.8GHz]
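The energy figures in the static analysis follow from the power and time figures, since energy is average power multiplied by execution time. A short Python check, assuming "performance degradation" means the relative increase in execution time (the function name is illustrative):

```python
def energy_saving(power_saving, performance_degradation):
    """Relative energy saving from a relative power saving and a relative slowdown.

    Energy ratio = (1 - power_saving) * (1 + performance_degradation).
    """
    return 1.0 - (1.0 - power_saving) * (1.0 + performance_degradation)


if __name__ == "__main__":
    print("CPU intensive:", round(energy_saving(0.34, 0.31), 3))
    print("Memory intensive:", round(energy_saving(0.14, 0.04), 3))
```

The results land close to the 14% and 10% quoted on the slide; small differences are expected because the inputs are themselves rounded percentages.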

BQCD: EAR MINIMIZE_ENERGY_TO_SOLUTION

• Default frequency 2.3GHz
• MAX_PERFORMANCE_DEGRADATION
– Set to threshold TH = [5%, 10%, 15%]
• CPU use case
– TH = 5%: [2.2GHz – 2.3GHz] selected
– TH = 10%: 2.2GHz selected
– TH = 15%: [2.0GHz – 2.2GHz] selected
• MEM use case
– TH = 5%: 1.8GHz selected
– TH = 10%: 1.8GHz selected
– TH = 15%: 1.8GHz selected

[Charts: "BQCD_CPU: EAR MIN_ENERGY_TO_SOLUTION" and "BQCD_MEM: EAR MIN_ENERGY_TO_SOLUTION", showing performance degradation, power saving and energy saving for MAX_PERFORMANCE_DEGRADATION of 5%, 10% and 15%]

BQCD: EAR MINIMIZE_TIME_TO_SOLUTION

• Default frequency 2.0GHz
• MIN_PERFORMANCE_EFFICIENCY_GAIN
– Set to 70% - 80%
• BQCD_CPU
– One node is executed at 2.3GHz
– The rest remain at 2.0GHz
• BQCD_MEM
– All the nodes execute at 2.0GHz

[Charts: "BQCD_CPU: MIN_TIME_TO_SOLUTION" and "BQCD_MEM: EAR MIN_TIME_TO_SOLUTION", showing performance gain, power increment and energy increment for MIN_PERFORMANCE_EFFICIENCY_GAIN of 70% and 80%]

EAR distributed design

• EAR offers a distributed design where each node works locally => no centralized information
• One MPI process per node takes care of:
– Dynamic Pattern detection
– EAR metrics collection
– EAR node state: collecting metrics, evaluating signature, etc.
– EAR policy implementation
– Per node logging messages: one summary file per node is generated

EAR overhead evaluation

• Main EAR overhead comes from DPD
• Current EAR version invokes DPD at each MPI call
• DPD consumes 2 usecs per invocation
• BQCD CPU intensive use case executes 16 million DPD invocations in 500 seconds
• Profile information reported by mpi ranks 0, 4, 8 and 12
• We are working on optimizations to reduce the number of DPD calls

Mpi rank   DPD calls    usecs/DPD algorithm
0          16.486.517   2,12 usecs
4          16.486.287   2,02 usecs
8          16.484.781   2,07 usecs
12         16.484.559   2,04 usecs
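The profile above is enough to estimate the total DPD cost per rank. A small Python sketch, using the call counts and per-call times from the table and the 500 second run time quoted in the bullets (function names are illustrative):

```python
def dpd_overhead_seconds(dpd_calls, usecs_per_call):
    """Total time spent in DPD, in seconds."""
    return dpd_calls * usecs_per_call * 1e-6


def dpd_overhead_fraction(dpd_calls, usecs_per_call, run_time_s):
    """DPD time as a fraction of the application run time."""
    return dpd_overhead_seconds(dpd_calls, usecs_per_call) / run_time_s


if __name__ == "__main__":
    profile = {0: (16_486_517, 2.12), 4: (16_486_287, 2.02),
               8: (16_484_781, 2.07), 12: (16_484_559, 2.04)}
    for rank, (calls, usecs) in profile.items():
        seconds = dpd_overhead_seconds(calls, usecs)
        fraction = dpd_overhead_fraction(calls, usecs, 500.0)
        print(f"rank {rank}: {seconds:.1f} s, {fraction:.1%} of the run")
```

For rank 0 this works out to roughly 35 seconds, around 7% of a 500 second run, which is why reducing the number of DPD calls is listed as ongoing work.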

EAR Goals: where are we?

• Our main goal was to offer a dynamic and transparent solution to energy awareness
– Avoiding having to re-execute applications again and again ✔
– Easy to use
- Without source code modifications ✔
- Without historic application information ✔
- Supporting standard programming models: MPI, MPI+OpenMP ✔
- Using standard libraries and tools as much as possible to be easily portable ✔
  - PAPI for hardware counters (core and uncore) when available and machine configuration
  - Libcpupower
  - ibmaem kernel module for node energy (linux distro)
- Open Source (in progress)
– Frequency change based on simple Energy Policies with performance thresholds ✔
– Minimizing the overhead introduced: overhead has been significantly reduced, but there is still margin for improvement

Conclusion

. Lenovo has been a leader for years in Energy Aware Computing
. With end to end solutions to:
  . Reduce CAPEX, OPEX and TCO
  . Maximize heat removal by water
  . Maximize temperature for
    . free cooling all year round
    . efficient use of adsorption chillers
  . Manage systems and infrastructure
    . with open source and partners solutions
. And going beyond:
  . Dynamically adjust node frequency in operation
  . Minimize TCO by working on all metrics
    . PUE + ITUE + ERE

Lenovo ISC17 Booth

[Booth floor plan: EAR demo; AI demo; NXS WCT demo; rack with various HW; Table 1 (OpenHPC software stack); Table 2; Table 3; HW Excelero; Stark HW; interactive demo; Show & Tell]