The Sierra Supercomputer: A POWER Collaboration Success Story
Adam Bertsch, Sierra Integration Project Lead, Lawrence Livermore National Laboratory

OpenPOWER Summit 2018, MGM Grand, Las Vegas, March 19, 2018

LLNL-PRES-746388. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Overview

§ The Sierra Supercomputer
§ Collaboration — CORAL and OpenPOWER

Where did Sierra start for me?

§ On a spring bicycle ride in Livermore five years ago.

Sierra will be the next ASC ATS platform

[Roadmap figure: ASC Advanced Technology Systems (ATS) and Commodity Technology Systems (CTS) delivery timeline, FY13 through FY23]

§ Sequoia (LLNL)
§ ATS 1 – Trinity (LANL/SNL)
§ ATS 2 – Sierra (LLNL)
§ ATS 3 – Crossroads (LANL/SNL)
§ ATS 4 – El Capitan (LLNL)
§ ATS 5 – (LANL/SNL)
§ Tri-lab Capacity Cluster II (TLCC II), CTS 1, CTS 2 (each procured and deployed, used, then retired)

Sequoia and Sierra are the current and next-generation Advanced Technology Systems at LLNL

The Sierra system that will replace Sequoia features a GPU-accelerated architecture

Compute Node
• 2 IBM POWER9 CPUs
• 4 NVIDIA Volta GPUs
• 256 GiB DDR4
• 16 GiB globally addressable HBM2 associated with each GPU
• Coherent shared memory
• NVMe-compatible PCIe 1.6 TB SSD

Compute Rack
• Standard 19”
• Warm water cooling

Compute System
• 4,320 nodes
• 240 compute racks
• 1.29 PB memory
• 125 PFLOPS
• ~12 MW

IBM POWER9
• Gen2 NVLink

NVIDIA Volta
• 7 TFLOP/s
• HBM2
• Gen2 NVLink

Mellanox Interconnect
• Single-plane EDR InfiniBand
• 2-to-1 tapered fat tree

Spectrum Scale File System
• 154 PB usable storage
• 1.54 TB/s R/W bandwidth

Sierra Represents a Major Shift for Our Mission-Critical Applications

2004 – present
• IBM BlueGene (BG/L, BG/P, BG/Q), along with copious amounts of Linux-based cluster technology, dominates the HPC landscape at LLNL
• Apps scale to an unprecedented 1.6M MPI ranks, but are still largely MPI-everywhere on “lean” cores

2014
• Heterogeneous GPGPU-based computing begins to look attractive with NVLink and Unified Memory (P9/Volta)
• Apps must undertake a 180° turn from BlueGene toward heterogeneity and large-memory nodes

2015 – 20??
• LLNL believes heterogeneous computing is here to stay, and Sierra gives us a solid leg up in preparing for exascale and beyond

Cost-neutral architectural tradeoffs

§ Full bandwidth from dual-ported Mellanox EDR HCAs to TOR switches
§ Half bandwidth from TOR switches to director switches
§ An economic trade-off that provides approximately 5% more nodes
§ DRAM cost came in at double the expectation
§ Evaluated two options to close the cost gap
— Buy half as much memory
— Taper the network
§ Selected both options!

This decision, counter to prevailing wisdom for system design, benefits Sierra’s planned UQ workload: 5% more UQ simulations at a performance loss of < 1%
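A back-of-the-envelope check of that claim (assuming the UQ ensemble throughput scales linearly with node count): roughly 5% more nodes at a per-simulation slowdown of under 1% nets out to about

$$1.05 \times 0.99 \approx 1.04,$$

i.e., on the order of 4% more UQ simulations per unit time for the same system cost.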

Sierra system architecture details

System                             Sierra         uSierra        rzSierra
Nodes                              4,320          684            54
POWER9 processors per node         2              2              2
GV100 (Volta) GPUs per node        4              4              4
Node peak (TFLOP/s)                29.1           29.1           29.1
System peak (PFLOP/s)              125            19.9           1.57
Node memory (GiB)                  320 (256+64)   320 (256+64)   320 (256+64)
System memory (PiB)                1.29           0.209          0.017
Interconnect                       2x IB EDR      2x IB EDR      2x IB EDR
Off-node aggregate b/w (GB/s)      45.5           45.5           45.5
Compute racks                      240            38             3
Network and infrastructure racks   13             4              0.5
Storage racks                      24             4              0.5
Total racks                        277            46             4
Peak power (MW)                    ~12            ~1.8           ~0.14
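A quick consistency check on the peak figures (assuming the node peak is dominated by the four GPUs at the 7 TFLOP/s quoted earlier, with the small remainder coming from the two POWER9 CPUs):

$$4 \times 7\ \mathrm{TFLOP/s} = 28\ \mathrm{TFLOP/s\ (GPUs)} \;\Rightarrow\; \approx 29.1\ \mathrm{TFLOP/s\ per\ node},$$
$$4{,}320 \times 29.1\ \mathrm{TFLOP/s} \approx 125.7\ \mathrm{PFLOP/s} \approx 125\ \mathrm{PFLOP/s\ system\ peak}.$$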

CORAL Targets compared to Sierra

§ Target speedup over current systems of 4x on Scalable benchmarks …
§ … and 6x on Throughput benchmarks
§ Aggregate memory of 4 PB and ≥ 1 GB per MPI task (2 GB preferred)
§ Max power of system and peripherals ≤ 20 MW
§ Mean Time Between Application Failure that requires human intervention ≥ 6 days
§ Architectural Diversity
§ Delivery in 2017 with acceptance in 2018
§ Peak Performance ≥ 100 PF
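Two of these targets can be compared directly against the Sierra figures on the architecture slide (a rough check using only the numbers quoted in this deck):

$$125\ \mathrm{PF} \geq 100\ \mathrm{PF\ (peak)}, \qquad {\sim}12\ \mathrm{MW} \leq 20\ \mathrm{MW\ (power)}.$$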

Collaboration

Sierra is part of CORAL, the Collaboration of Oak Ridge, Argonne and Livermore

Modeled on the successful LLNL/ANL/IBM Blue Gene partnership (Sequoia/Mira)
LLNL’s IBM Blue Gene systems: BG/L, BG/P (Dawn), BG/Q (Sequoia)

One RFP led to a long-term contractual partnership with 2 vendors:
§ 2 awardees for 3 platform acquisition contracts: ORNL Summit contract, LLNL Sierra contract, ANL Aurora contract
§ 2 nonrecurring engineering (NRE) contracts

CORAL is the next major phase in the U.S. Department of Energy’s scientific computing roadmap and path to exascale computing

OpenPOWER collaboration enabled technology behind Sierra

CORAL NRE will provide significant benefit to the final systems

§ Motherboard design
§ System diagnostics
§ Water-cooled compute nodes
§ System scheduling
§ GPU resilience
§ Burst buffer
§ GPFS performance and scalability
§ Switch-based collectives, FlashDirect
§ Cluster management infrastructure
§ Open source compiler
§ Open source tools
§ Center of Excellence

Eight joint lab/contractor working groups are actively working on all the above and more

CORAL NRE drove Spectrum Scale advances that will vastly improve Sierra file system

Original Plan of Record
§ 120 PB delivered capacity
§ Conservative drive capacity estimates
§ 1 TB/s write, 1.2 TB/s read
§ Concern about metadata targets

Modified (Cost Neutral), after NRE + hardware advancements
§ 154 PB delivered capacity
§ Substantially increased drive capacities
§ 1.54 TB/s write/read
§ Enhanced Spectrum Scale metadata performance (many optimizations already released)

Close partnerships lead to ecosystem improvements
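For scale, an illustrative figure (assuming the full 1.29 PB of Sierra system memory were streamed to the file system at the quoted 1.54 TB/s write rate, with no compression or burst-buffer staging):

$$\frac{1.29 \times 10^{3}\ \mathrm{TB}}{1.54\ \mathrm{TB/s}} \approx 840\ \mathrm{s} \approx 14\ \mathrm{minutes}$$

for a full-memory write.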

Outstanding benchmark analysis by IBM and NVIDIA demonstrates the system’s usability

CORAL APPLICATION PERFORMANCE PROJECTIONS

[Bar charts: relative performance (0x–14x), CPU-only vs. CPU + GPU, for the Scalable Science Benchmarks and Throughput Benchmarks: CAM-SE, UMT2013, AMG2013, MCB, QBOX, LSMS, HACC, NEKbone]

Figure 5: CORAL benchmark projections show the GPU-accelerated system is expected to deliver substantially higher performance at the system level compared to a CPU-only configuration. Projections included code changes that showed a tractable annotation-based approach (i.e., OpenMP) will be competitive.
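The “annotation-based approach” refers to directive-based GPU offload, where pragmas annotate otherwise ordinary CPU code. The sketch below is a minimal, hypothetical example of that style (it is not taken from the CORAL benchmark codes): a DAXPY-style loop offloaded with OpenMP target directives, which an OpenMP 4.5-capable compiler such as clang or IBM XL can map onto the GPUs.

```cpp
// Minimal sketch of annotation-based GPU offload with OpenMP target directives.
// Hypothetical example, not from the CORAL benchmarks. Build with an
// offload-capable compiler, e.g. clang -fopenmp -fopenmp-targets=nvptx64.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double a = 3.0;
    double* xp = x.data();
    double* yp = y.data();

    // The annotation: copy x to the device, copy y both ways, and spread the
    // loop across GPU teams/threads. Removing the pragma leaves valid serial
    // CPU code, which is the key to keeping one portable source base.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];  // y = a*x + y
    }

    std::printf("y[0] = %f\n", yp[0]);  // expect 5.0
    return 0;
}
```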

The demonstration of compelling, scalable performance at the system level across a wide range of applications proved to be one of the key factors in the U.S. DoE’s decision to build Summit and Sierra on the GPU-accelerated OpenPOWER platform.

Conclusion

Summit and Sierra are historic milestones in HPC’s efforts to reach exascale computing. With these new pre-exascale systems, the U.S. DoE maintains its leadership position, trailblazing the next generation of supercomputers while allowing the nation to stay ahead in scientific discoveries and economic competitiveness.

The future of large-scale systems will inevitably be accelerated with throughput-oriented processors. Latency-optimized CPU-based systems have hit a power wall and no longer deliver year-on-year performance increases. So while the question of “accelerator or not” is no longer in debate, other questions remain, such as CPU architecture, accelerator architecture, inter-node interconnect, intra-node interconnect, and heterogeneous versus self-hosted computing models.

With those questions in mind, the technological building blocks of these systems were carefully chosen with the focused goal of eventually deploying exascale supercomputers. The key building blocks that allow Summit and Sierra to meet this goal are:

• The heterogeneous computing model
• NVIDIA NVLink high-speed interconnect
• NVIDIA GPU accelerator platform
• IBM OpenPOWER platform

With the unveiling of the Summit and Sierra supercomputers, Oak Ridge National Laboratory and Lawrence Livermore National Laboratory have spoken loud and clear about the technologies that they believe will best carry the industry to exascale.

Early Access systems provide critical pre-Sierra generation resources

Early Access Compute Node
• “Minsky” HPC server
• 2x POWER8+ processors
• 4x NVIDIA GP100 GPUs; NVLink
• 256 GB SMP (32x 8 GB DDR4)
• Air-cooled

Early Access Compute System
• 18 compute nodes
• EDR IB switch fabric
• Management nodes

Initial Early Access GPFS
• GPFS management servers
• 2 POWER8 storage controllers
• 1 GL-2 (2x58 6 TB SAS drives) or 1 GL-4 (4x58 6 TB SAS drives)
• EDR InfiniBand switch fabric
• Enhanced to GL-6 (6x58 6 TB SAS drives)

Early Access to CORAL Software Technology
Based on IBM’s current technical programming offerings with enhancements for new function and scalability. Includes:
• Initial messaging stack implementation
• Initial compute node kernel
• Compilers: open source LLVM, PGI compiler, XL compiler

Three LLNL Early Access systems support ASC and institutional preparations for Sierra

[System diagrams: three Early Access clusters, each comprising compute, infrastructure, and storage (ESS) racks]

§ Shark: 597.6 TF
§ Manta: 597.6 TF
§ Ray: 896.4 TF (compute nodes on Ray have 1.6 TB NVMe drives)

Sierra and its EA systems are beginning an accelerator-based computing era at LLNL

§ The advantages that led us to select Sierra generalize
— Power efficiency
— Network advantages of “fat nodes”
— Clear path for performance portability
— Balancing capabilities/costs implies complex memory hierarchies
§ Broad collaboration is essential to successful deployment of increasingly complex HPC systems
— Computing centers collaborate on requirements
— Vendors collaborate on solutions

Heterogeneous computing on Sierra represents a viable path to exascale
