<<

Summer 2019 - Assessing the Facilities and Systems Supporting HPC at ORNL

Jim Rogers Director, Computing and Facilities National Center for Computational Sciences Oak Ridge National Laboratory

ORNL is managed by UT-Battelle, LLC for the US Department of Energy ORNL Leadership Class Systems 2004 - 2018

27 PF 2012 2.5 XK7 1 PF PF 263 2009 TF 2008 Cray XT5 62 Cray XT5 54 2008 25 TF TF Jaguar 18.5 TF Cray XT4 TF 2006 2007 Jaguar Cray XT3 Cray XT4 2004 2005 Jaguar From 2004 – 2018, HPC systems relied Cray X1E Cray XT3 Jaguar Phoenix Jaguar on chiller-based cooling (5.5° supply) with annualized PUEs to ~1.4 2 ORNL’s Transition to Warmer Facility Supply Temperatures Frontier: Custom mechanical : A combination of packaging is >95% room- Titan: Refrigerant-based direct on-package cooling neutral with a 30°C supply. per-rack cooling with and RDHX with 21°C supply • ~100% Evaporative Cooling, direct rejection of heat to is > 95% room-neutral. with supplemental HVAC for cold 5.5°C water • Above dewpoint parasitic loads • Below dewpoint • Contribution by chillers • 100% use of chillers ~20% of the hours of the year

~1.5 EF 200 2021/2022 PF Frontier 27 PF 2018/2019 2012 IBM Cray XK7 Summit Titan

3 GTC'154 Session S5566 GPU Errors on HPC Systems Oak Ridge National Laboratory’s Cray XK7 Titan

Operational from November 2012 through August 1 2019 18,688 compute nodes – 1 AMD + 1 Kepler/node 27PF (peak); 17.59PF (HPL); 9.5MW (peak) Delivered > 27B compute hours over its life

Titan entered as the #1 supercomputer in the world in November 2012, and was still #12 on the Top500 list at the time of its decommissioning 4 Perspective on Titan as an Air-cooled Supercomputer

ORNL’s Cray XK7 “Titan” Wait. What? • 200 cabinets, each with a 3,000 CFM fan • 600,000 CFM (air volume) Titan moves as much air as a • Dry air has a density of 13.076 cubic feet per pound long-range Airbus A340 at • Titan moves 600,000/13.076/60 -> 765 lbs/sec cruising altitude.

Airbus A340 • At takeoff, each of (4) engines generates 140kN of force and consumes 1000 lbs/sec of air • At cruise, the A340’s engines each produce ~29kN and consumes ~200 lbs/sec of air.

5 Motivation for “Warmer” Cooling Solutions Serving HPC Centers

• Reduced Cost, Both CAPEX and OPEX – Reduce or eliminate the need for traditional chillers. • No chillers, no ozone-depleting refrigerants (GREEN) – Oak Ridge calculates an annualized PUE for air--cooled devices of no better than 1.4 Oak Ridge, TN (ASHRAE Zone 4A – Mixed Humid) – HPC power budgets continue to grow – Summit has a design point for >12MW (HPC-only). Minimizing PUE/ITUE is critical to the budget.

• Easier, more reliable design – Design is reduced to pumps, evaporative cooling, heat exchangers. – Traditional chilled water may not be necessary at all (NREL, NCAR, et al)

6 OLCF Facilities Supporting Summit

• Titan – 9MW @ heavy load • Sitting on 250 pounds/ft2 raised floor • Uses 42F water and special CDUs (XDPs)

• Summit – 256 compute cabinets on-slab • 100% room-neutral design uses RDHX • 20MW warm-water cooling plant using centralized CDU/secondary loop

7 21°C/20MW/7700 ton Facility System Design System

• Summit – Demand: 3-10MW; • Secondary Loop – Supply 3300GPM (12,500 liters/min) @ 21°C; Return @ 29-33°C – CPUs and GPUs use cold plates – DIMMs and parasitic loads use RDHX – Storage and Network use RDHX

8 Cooling Design Existing chilled water cooling loop

Primary Loop uses Evaporative Cooling Towers (~80% of the hours of the year)

When the MTW RETURN is above the 21C setpoint, use a second set of Trim HX (with 5.5C on the other side) to drive MTW to the 21C setpoint.

The need for the trim-loop is about 20% of the hours in the year, and can ramp 0-100% to meet the setpoint back to Summit

New primary cooling loop

9 Benefits of Warm Water + Operating Dashboards

• Warm Water allows annualized PUE of 1.1 – ~$1M cost per MW-year for consumption on Summit; – ~$100k cost per MW-year for waste heat management • Integration with PLC allows us to tune water flow – Better delta(t); less pumping energy • Integration with IBM’s OpenBMC allows us to protect these 40k components from inadequate flow across the cold plates • Integration with the scheduler allows us to correlate power and temperature data with individual applications.

10 • Additional data streams to be added- most from the Facility PLC Comparing Energy Performance - Titan to Summit

• Demonstrated Performance on Titan (Oct 2012) • 17.59 PF 8.9MW peak 8.3MW average

• Demonstrated Performance on Summit (Oct 2018)

• 143.5 PF 11,065kW peak 9,783kW average 8.16x performance increase 17.9% avg. power increase Titan: ~2.1 GFLOPs/Watt Summit: 14.66 GFLOPs/Watt >7x increase in energy efficiency

11 12 13 14 Summit MTW Cooling Loads and Temperatures HPL Run 5/24/19 Duration: 5:48

PUE during HPL Run = 1.081 100 14000 95 90 12000 85 10000 80 IT Load follows F) ° 75 a traditional HPL profile 8000 70 Supply temperature to

Summit stayed kW 65 OAWBT remained Total power load rock-solid at 70F. 6000 60 at/above the on the energy supply target, Temperature( 55 plant reflects affecting ECT storage and other 4000 50 items A small portion of the total load 45 required the use of 2000 CHW (trim RDHX) 40 35 0

MTW kW (cooling) CHW kW (cooling) MTW Return Temp MTW Supply Temp

23:51:00 K10023:58:30 00:06:00Avg00:13:30 Space00:21:00 00:28:30 00:36:00 Temp00:43:30 00:51:00 00:58:30 01:06:00 01:13:30 01:21:00 01:28:30 Outdoor01:36:00 01:43:30 01:51:00 Air01:58:30 Wet02:06:00 02:13:30Bulb02:21:00 Temp02:28:30 02:36:00 02:43:30 02:51:00 02:58:30 03:06:00 IT kW03:13:30 03:21:00 03:28:30 03:36:00 03:43:30 03:51:00 03:58:30 04:06:00 04:13:30 04:21:00 04:28:30 04:36:00 04:43:30 04:51:00 04:58:30 05:06:00 05:13:30 05:21:00 05:28:30 05:36:00 15 Time OLCF-4 Summit HPL Power Measurements 10-25-18 - 03:31:48 to 09:32:00 Max power – Average power – 12000 Max power – 6000050000 11,065 kW 9,783 kW 11,065 kW

45000

10000 50000 40000

35000 8000 Average power – 40000

9,783 kW 30000 hoursAxis Title hoursAxis Title hoursAxis 6000 3000025000 - - kW kW

21,612 seconds; 20000 151,284 measurements kW Accumulated kW Accumulated 4000 20000 15000 Idle: 2.97MW

Total IT Load 58,730 kW-hours 10000 2000 Total Mech Load 1,449 kW-hours 10000

PUE 1.0246 5000

0 0 1 1 215 239 429 477 643 715 857 953 1071 1191 1285 1429 1499 1667 1713 1905 1927 2141 2143 2355 2381 2569 2619 2783 2857 2997 3095 3211 3333 3425 3571 3639 3809 3853 4047 4067 4281 4285 4495 4523 4709 4761 4923 4999 5137 5237 5351 5475 5565 5713 5779 5951 5993 6189 6207 6421 6427 6635 6665 6849 6903 7063 7141 7277 7379 7491 7617 7705 7855 7919 8093 8133 8331 8347 8561 8569 8775 8807 8989 9045 9203 9283 9417 9521 9631 9759 9845 9997 10059 10235 10273 10473 10487 10701 10711 10915 10949 11129 11187 11343 11425 11557 11663 11771 11901 11985 12139 12199 12377 12413 12615 12627 12841 12853 13055 13091 13269 13329 13483 13567 13697 13805 13911 14043 14125 14281 14339 14519 14553 14757 14767 14981 14995 15195 15233 15409 15471 15623 15709 15837 15947 16051 16185 16265 16423 16479 16661 16693 16907 16899 17121 17137 17335 17375 17549 17613 17763 17851 17977 18089 18191 18327 18405 18565 18619 18803 18833 19047 19041 19261 19279 19475 19517 19689 19755 19903 19993 20117 20231 20331 20469 20545 20707 20759 20945 20973 21187 21183 21401 21421 16 Measurement (1/second) OLCF-4 Summit HPL Power Measurements 10-25-18 - 03:31:48 to 09:32:00

12000

10000

8000

Oct 2018 143.5PF

6000 12000kW

4000 10000

2000 8000

0 1

6000 215 429 643 857 June 2018 1071 1285 1499 1713 1927 2141 2355 2569 2783 2997 3211 3425 3639 3853 4067 4281 4495 4709 4923 5137 5351 5565 5779 5993 6207 6421 6635 6849 7063 7277 7491 7705 7919 8133 8347 8561 8775 8989 9203 9417 9631 9845 10059 10273 10487 10701 10915 11129 11343 11557 11771 11985 12199 12413 12627 12841 13055 13269 13483 13697 13911 14125 14339 14553 14767 14981 15195 15409 15623 15837 16051 16265 16479 16693 16907 17121 17335 17549 17763 17977 18191 18405 18619 18833 19047 19261 19475 19689 19903 20117 20331 20545 20759 20973 21187 21401 122.3PF Measurement (1/second)

4000

17

2000

0 1 214 427 640 853 1066 1279 1492 1705 1918 2131 2344 2557 2770 2983 3196 3409 3622 3835 4048 4261 4474 4687 4900 5113 5326 5539 5752 5965 6178 6391 6604 6817 7030 7243 7456 7669 7882 8095 8308 8521 8734 8947 9160 9373 9586 9799 10012 10225 10438 10651 10864 11077 11290 11503 11716 11929 12142 12355 12568 12781 12994 13207 13420 13633 13846 14059 14272 14485 14698 14911 15124 15337 15550 15763 15976 16189 Cooling System Performance - PUE

18 20°C 21°C 22°C 24°C 25°C 26°C 27°C 19 23°C 26°C

19°C

20 Caution – Secondary (Closed) Loop Concerns Worsen as Temperatures Rise…

Allowable Wetted Materials Accelerated Corrosion • Lead-free copper alloys with less than 15% zinc. System Blockages • Stainless steels Reduced System Efficiency • EPDM • Polypropylene IBM’s Water Quality Requirements • All metals less than or equal to 0.10 ppm Water Chemistry - Biocides • Calcium less than or equal to 1.0 ppm • The choice of biocide depends on whether you are chasing anaerobic • Magnesium less than or equal to 1.0 ppm bacteria, aerobic bacteria, fungi, and/or algae • Manganese less than or equal to 0.10 ppm • Phosphorus less than or equal to 0.50 ppm • Silica less than or equal to 1.0 ppm • Sodium less than or equal to 0.10 ppm • Bromide less than or equal to 0.10 ppm • Nitrite less than or equal to 0.50 ppm • Chloride less than or equal to 0.50 ppm • Nitrate less than or equal to 0.50 ppm • Sulfate less than or equal to 0.50 ppm Fungi might not be • Conductivity less than or equal to 10.0 μS/cm. detected in the water, • pH 6.5 – 8.0 even though it can • Turbidity (NTU) less than or equal to 1 grow and cause EEHPCWG is looking for blockage of cooling common ground among the channels in cold plates 21 HPC suppliers for water quality 22