
Effectively Targeting the Oak Ridge Leadership Computing Facility

Bronson Messer, Scientific Computing (OLCF) & Theoretical Physics Groups, Oak Ridge National Laboratory

ORNL is managed by UT-Battelle for the US Department of Energy

Outline

• Mission of the Oak Ridge Leadership Computing Facility (OLCF)
• … to Titan – Why GPUs?
• What's next? (and Cori & Aurora)
  – A little about portability
• Some nuts & bolts about how to really compute at OLCF

A Little About ORNL…

ORNL is DOE's largest science and energy laboratory:
• 1,888 journal articles published in CY14
• 198 invention disclosures in FY15
• 66 patents issued in FY15
• 4,500 employees
• 3,200 research guests annually
• $1.5B budget
• Nation's largest materials research portfolio
• Nation's most diverse energy portfolio
• World's most intense neutron source
• World-class research reactor
• Forefront scientific computing facilities
• Managing the billion-dollar US ITER project
• $750M modernization investment

What is the Leadership Computing Facility (LCF)?

• Collaborative DOE Office of Science user facility program at ORNL and ANL
• Mission: Provide the computational and data resources required to solve the most challenging problems
• Two centers / two architectures to address the diverse and growing computational needs of the scientific community
• Highly competitive user allocation programs (INCITE, ALCC)
• Projects receive 10x to 100x more resource than at other generally available centers
• LCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts)

ORNL increased system performance by 1,000x (2004-2010)

• Hardware scaled from single-core through dual-core to quad-core and dual-socket, 12-core SMP nodes
• Scaling applications and system software was the biggest challenge

[Chart: OLCF system performance, 2005-2009 – Cray X1, 3 TF; Cray XT3 single-core, 26 TF; Cray XT3 dual-core, 54 TF; Cray XT4 dual-core, 119 TF; Cray XT4 quad-core, 263 TF; Cray XT5 (12-core, dual-socket SMP), 2.3 PF]

Architectural Trends – No more free lunch

• CPU clock rates quit increasing in 2003
• P = C·V²·f: power consumed is proportional to the frequency and to the square of the voltage
• Voltage can't go any lower, so frequency can't go higher without increasing power
• Power is capped by heat dissipation and $$$
• Performance increases have therefore been coming through increased parallelism (a worked example follows below)

Herb Sutter, Dr. Dobb's Journal: http://www.gotw.ca/publications/concurrency-ddj.htm
astro-ph/9912202
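To make the trade-off concrete, here is a small worked example that is not on the original slide. It assumes the dynamic-power relation P ≈ C·V²·f quoted above, plus the common rule of thumb that the maximum stable clock frequency scales roughly with voltage (f ∝ V).

```latex
% Illustrative only: one fast core versus two slower cores,
% assuming P \approx C V^2 f and f_{\max} \propto V.
\begin{align*}
\text{One core at } 2f \ (\text{requires} \approx 2V): \quad
  P_{\text{fast}} &= C\,(2V)^2\,(2f) = 8\,CV^2 f \\
\text{Two cores at } f,\ V: \quad
  P_{\text{parallel}} &= 2\,CV^2 f
\end{align*}
```

Roughly the same peak throughput, but the parallel configuration draws about a quarter of the power, which is why performance growth shifted from clock rate to core (and now accelerator) counts.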

GPUs provided a path forward using hierarchical parallelism

• Expose more parallelism through code refactoring and source-code directives – doubles the performance (relative to CPUs) of many codes
• Use the right type of processor for each task: the CPU is optimized for latency and sequential multitasking, while the GPU accelerator is optimized for throughput and many simultaneous tasks – roughly 10× performance per socket and 5× more energy-efficient systems
• Data locality: keep data near the processing – the GPU has high bandwidth to its local memory for rapid access
• Explicit data management: explicitly manage data movement between CPU and GPU memories (see the sketch below)
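As a concrete illustration of the directive and explicit-data-management points above, here is a minimal OpenACC sketch in C++. It is not code from any of the applications discussed in this talk; the routine and variable names are invented for illustration, and the pragmas are simply ignored when the code is built without OpenACC support.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: a simple vector update showing the two ideas above,
// (1) exposing loop parallelism with directives and (2) explicit data movement.
void scale_and_shift(std::vector<double>& a, const std::vector<double>& b,
                     double alpha)
{
    const std::size_t n  = a.size();
    double*           pa = a.data();
    const double*     pb = b.data();

    // Explicit data management: copy b to the GPU, copy a back when done.
    #pragma acc data copyin(pb[0:n]) copy(pa[0:n])
    {
        // Expose parallelism: the compiler maps these iterations onto GPU threads.
        #pragma acc parallel loop
        for (std::size_t i = 0; i < n; ++i) {
            pa[i] = alpha * pa[i] + pb[i];
        }
    }
}
```

The `acc data` region makes the CPU-GPU transfers explicit, while `parallel loop` exposes the loop iterations as fine-grained parallelism for the accelerator.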

ORNL's "Titan" hybrid system: Cray XK7 with AMD and NVIDIA Tesla

SYSTEM SPECIFICATIONS:
• Peak performance of 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
• 18,688 compute nodes, each with:
  – 16-core AMD Opteron CPU
  – NVIDIA Tesla "K20x" GPU
  – 32 + 6 GB memory
• 512 service and I/O nodes
• 200 cabinets occupying 4,352 ft² (404 m²)
• 710 TB total system memory
• Cray Gemini 3D torus interconnect
• 8.8 MW peak power

Impact on applications

• We began planning for Titan in 2009
• At the time, there were no large-scale GPU systems deployed anywhere
• Furthermore, OLCF had little institutional knowledge of GPUs beyond scattered individuals
• However, the consensus was that codes would require restructuring for memory locality, threading, and heterogeneity to get to exascale, so we decided to do it now
• Additionally, we didn't want a machine delivered with no functioning application software
• Therefore, we selected a small set of applications for early porting, to spearhead the effort to move codes to Titan
• We went through a process to select a diverse set of codes, giving broad coverage of our users' use cases
• At the same time, we wanted to capture institutional knowledge as lessons learned going forward

Center for Accelerated Application Readiness (CAAR)

• LAMMPS – A molecular description of membrane fusion, one of the most common ways for molecules to enter or exit living cells
• WL-LSMS – Illuminating the role of material disorder, statistics, and fluctuations in nanoscale materials and systems
• CAM-SE – Answering questions about specific climate-change adaptation and mitigation scenarios; realistically representing features like precipitation patterns/statistics and tropical storms
• S3D – Understanding turbulent combustion through direct numerical simulation with complex chemistry
• Denovo – Discrete ordinates radiation transport calculations that can be used in a variety of nuclear energy and technology applications
• NRDF – Radiation transport, important in astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging, computed on AMR grids

Performance results

Cray XK6: Fermi GPU plus Interlagos CPU.  Cray XE6: dual Interlagos, no GPU.

Application   XK6 (w/ GPU) vs.   XK6 (w/ GPU)   Comment
              XK6 (w/o GPU)      vs. XE6
S3D           1.5                1.4            Turbulent combustion; 6% of Jaguar workload
Denovo        3.5                3.3            3D neutron transport for nuclear reactors; 2% of Jaguar workload
LAMMPS        6.5                3.2            High-performance molecular dynamics; 1% of Jaguar workload
WL-LSMS       3.1                1.6            Statistical mechanics of magnetic materials; 2% of Jaguar workload; 2009 Gordon Bell winner
CAM-SE        2.6                1.5            Community atmosphere model; 1% of Jaguar workload

Some lessons learned

• Up to 1-3 person-years were required to port each code
  – Likely shorter today, given better tools, etc.
  – Also pays off for other systems: the ported codes often run significantly faster CPU-only (Denovo 2x, CAM-SE >1.7x)
• We estimate 70-80% of developer time was spent in code restructuring, regardless of whether CUDA, OpenCL, OpenACC, or something else was used
• Each code team must make its own choice of CUDA vs. OpenCL vs. OpenACC based on its specific case; the conclusion may be different for each code

More lessons learned

• Code changes that have global impact on the code are difficult to manage, e.g., data structure changes. An abstraction layer may help, e.g., C++ objects/templates (see the sketch after this list)
• Tools (compilers, debuggers, profilers) were lacking early in the project but are becoming more available and are improving in quality
• Debugging and profiling tools were useful in some cases (Allinea DDT, CrayPat, Vampir, CUDA profiler)
• Science codes are under active development; porting to the GPU can mean pursuing a "moving target" and is challenging to manage
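To make the abstraction-layer idea concrete, here is a minimal sketch assuming an OpenACC backend. `DualArray` and its members are hypothetical names invented for this example, not code from any CAAR application; the point is that device allocation and host-device transfers live in one class, so changing the programming model later means editing this class rather than every kernel call site.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical abstraction layer: application code only ever sees DualArray<T>.
// How (and whether) the data lives on the accelerator is decided in one place.
template <typename T>
class DualArray {
public:
    explicit DualArray(std::size_t n) : host_(n) {
        T* p = host_.data();
        const std::size_t len = host_.size();
        // Create (but do not fill) a matching device allocation; no-op without OpenACC.
        #pragma acc enter data create(p[0:len])
    }

    ~DualArray() {
        T* p = host_.data();
        const std::size_t len = host_.size();
        #pragma acc exit data delete(p[0:len])
    }

    T*          data()       { return host_.data(); }
    const T*    data() const { return host_.data(); }
    std::size_t size() const { return host_.size(); }

    // Named, explicit transfer points: a CUDA or OpenMP-target backend would
    // reimplement these two members without touching the rest of the code.
    void to_device() {
        T* p = host_.data();
        const std::size_t len = host_.size();
        #pragma acc update device(p[0:len])
    }

    void from_device() {
        T* p = host_.data();
        const std::size_t len = host_.size();
        #pragma acc update self(p[0:len])
    }

private:
    std::vector<T> host_;
};
```

A data-structure change of this kind still touches the whole code once, but afterwards the global impact of a backend change is contained in one place.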

2017 OLCF leadership system: hybrid CPU/GPU architecture

Vendor: IBM (prime) / NVIDIA / Mellanox Technologies
5x-10x Titan's application performance
Approximately 4,500 nodes, each with:
• Multiple IBM POWER9 CPUs and multiple NVIDIA Tesla GPUs using the NVIDIA Volta architecture
• CPUs and GPUs connected via high-speed NVLink
• Large coherent memory: over 512 GB/node (HBM + DDR4), all directly addressable from the CPUs and GPUs
• An additional 800 GB of NVRAM, which can be configured as either a burst buffer or as extended memory
• Over 40 TF peak performance
Dual-rail Mellanox EDR-IB full, non-blocking fat-tree interconnect
IBM Elastic Storage (GPFS): 1 TB/s I/O and 120 PB disk capacity

CAAR II

• New application readiness activity for Summit
• Changes in management, etc. from CAAR I
  – Call for proposals
    • Better coordination with code teams
    • Code teams have skin in the game
  – CSEEN postdocs work with OLCF SciComp members
    • Training in code development
    • Science projects using new code
• Increased emphasis on platform portability

CAAR (II) Applications

Domain             Application   Methods        PI                Institution
Astrophysics       FLASH         Grid, AMR      Bronson Messer    ORNL
Chemistry          DIRAC         Particle, LA   Lucas Visscher    VUA
Climate Science    ACME (N)      Unstr Mesh     David Bader       LLNL
Engineering        RAPTOR        Kokkos         Joseph Oefelein   SNL
Materials Science  QMCPACK       MC             Paul Kent         ORNL
Nuclear Physics    NUCCOR        Particle       Gaute Hagen       ORNL
Plasma Physics     XGC (N)       PIC, PETSc     CS Chang          PPPL
Seismic Science    SPECFEM       Unstr Mesh     Jeroen Tromp      Princeton
Astrophysics       HACC (N,A)    Grid           Salman Habib      ANL
Biophysics         NAMD (N)      Particle                         UIUC
Chemistry          NWCHEM (N)    Particle, LA   Karol Kowalski    PNNL
Chemistry          LSDALTON      Particle, LA   Poul Jørgensen    Aarhus
Plasma Physics     GTC (N)       PIC            Zhihong Lin       UCI

Selected via a call for proposals in Feb. 2015

Early-access system being installed at OLCF now

IBM Power System S822LC for High Performance Computing, each system with:
• 2x POWER8 CPUs with NVLink, 32 DIMM sockets delivered by 8 memory daughter cards for up to 1 TB of memory
• 4 NVIDIA Tesla P100 GPUs
• NVLink CPU-to-GPU and GPU-to-GPU links at 80 GB/s, delivering about 2.8x more bandwidth to the GPUs than PCI-E-attached GPUs and 2.5x faster links between adjacent P100s on the same socket
• Optional NVMe storage for exceptionally fast storage I/O
• CAPI-attached I/O and the robust features of a POWER8 server platform

IBM-quoted results for NVLink plus Tesla P100 compared to x86 with Tesla K80 GPUs: 2x "base pairs aligned" per second running SOAP3-dp (2 instances per device), 2.3x better CPMD performance (57 percent reduction in execution time), 1.7x better performance on the High Performance Conjugate Gradients (HPCG) benchmark, 2.5x more Kinetica "filter by geographic area" queries per hour, and 1.9x more GFLOPS running LatticeQCD.



IBM Power System S822LC for High Performance Computing at a glance

System configurations (8335-GTB)

Microprocessors Two 8-core 3.25 GHz POWER8 processor cards or two 10-core 2.86 GHz POWER8 processor cards

Level 2 (L2) cache 512 KB L2 cache per core

Level 3 (L3) cache 8 MB L3 cache per core

Level 4 (L4) cache Up to 64 MB per socket

Memory Min/Max 4 GB, 8 GB, 16 GB, 32 GB DDR4 modules, 128 GB to 1 TB total memory

Processor to Memory Bandwidth 115 GB/sec per socket, 230 GB/sec per system (max sustained memory bandwidth to L4 cache from SCM); 170 GB/sec per socket, 340 GB/sec per system (max peak memory bandwidth to DIMMs from L4 cache)

What's next? Two architecture paths for today and future leadership systems

Power concerns are driving the largest systems to either hybrid or many-core architectures.

Hybrid Multi-Core (like Titan/Summit)
• CPU/GPU hybrid systems
• Multiple CPUs and GPUs per node
• Smaller number of very powerful nodes
• Data movement issues expected to be much easier than on previous systems – NVLink coherent shared memory within a node
• Multiple levels of memory – on package, DDR, and persistent

Many Core (Mira/Aurora)
• 10's of thousands of nodes with millions of cores
• Homogeneous cores
• Multiple levels of memory – on package, DDR, and non-volatile
• Intel OmniPath interconnect

Next-gen systems: Cori, Aurora, Summit

Attribute       2016 Cori (NERSC)                      2018 Aurora (Argonne)            2018 Summit (Oak Ridge)
Peak PF         >30                                    180                              200
Power (MW)      <3.7                                   13                               13.5
Processors      Intel Xeon Phi (KNL) and Haswell       Intel Xeon Phi (KNH)             IBM POWER9 + NVIDIA Volta
System memory   ~1 PB DDR4 + HBM + 1.5 PB persistent   >7 PB HBM + local + persistent   >1.7 PB HBM + local + 2.8 PB persistent
Nodes           9,300 compute + 1,900 data             >50,000                          ~4,500
File system     28 PB @ 744 GB/s                       150 PB @ 1 TB/s Lustre           250 PB @ 2.5 TB/s GPFS

Architecture and performance portability

• Portability is important for leadership-class computational science teams
  – PIs often have access to multiple architectures at any given time
  – Code development teams often target users on multiple platforms
  – Applications have much longer lifespans than computer architectures
• Portability can help applications transition more smoothly to future architectures
  – Porting to modern parallel platforms is labor intensive
  – Exposing parallelism through refactoring is the primary task (a small directive-based sketch follows the API table below)
• Approach: ASCR facilities coordinate application readiness activities
  – Points of contact at ALCF, NERSC, and OLCF for applications
  – Formulation of a joint performance-portability plan
  – Collaboration on porting of common applications to reduce duplication of effort
  – Common resources for application readiness programs

API standardization: Past experience

1990s, shared-memory threads:   Autotasking, Microtasking, Compass, Pthreads, Windows threads, …   →  OpenMP
1990s, distributed memory:      PVM, p4, Zipcode, NX, Express                                      →  MPI
2010s, manycore directives:     CUDA, OpenACC, AMP, OpenMP 4, OpenCL, Intel CILK                   →  ???
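As a hedged illustration of where the "???" in the last row might land, the sketch below shows a single loop carrying both OpenACC and OpenMP 4.x target directives, chosen at compile time. The macro names (USE_OPENACC, USE_OMP_TARGET) are invented for this example and are not from the talk.

```cpp
// Illustrative portability pattern: the same loop can be offloaded with either
// OpenACC (-DUSE_OPENACC) or OpenMP 4.x target offload (-DUSE_OMP_TARGET);
// with neither macro defined it compiles as a plain serial CPU loop.
void daxpy(double* y, const double* x, double alpha, int n)
{
#if defined(USE_OPENACC)
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
#elif defined(USE_OMP_TARGET)
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
#endif
    for (int i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Directive-based approaches like these keep a single source tree, but as the lessons-learned slides note, the data-layout restructuring behind the loop is the expensive, model-independent part of the port.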

Three primary user programs for access to LCF

Distribution of allocable hours:
• 60% INCITE
• 30% ALCC (ASCR Leadership Computing Challenge)
• 10% Director's Discretionary

LCF support models

• The "two-pronged" support model is shared across the LCF centers
• Specific organizational implementations differ slightly, but the user perspective is virtually identical

[Diagram: parallel support organizations at the two centers – scientific computing Liaisons / Catalysts, advanced data and workflows (data analytics, workflows, performance engineering), and user assistance and outreach / user services and outreach]

Support basics

• The User Assistance group provides "front-line" support for day-to-day computing issues
• SciComp Liaisons provide advanced algorithmic and implementation assistance
• Assistance with data analytics and workflow management, visualization, and performance engineering can also be provided for each project (both tasks are "housed" in the ADW group at OLCF)

BigPanda, ATLAS, and Titan

• ATLAS production is integrated with Titan
• Titan has already contributed a noticeable fraction of the computing resources for ATLAS Monte Carlo simulations
• NYU and Manhattan College groups calculated the vector boson fusion channel for Higgs production and delivered more than 15 million fully simulated events on Titan, leading to an early physics results publication

[Figures: invariant mass distribution of Z bosons in the pp -> H -> ZZ -> 4l process calculated on Titan; Titan's contribution to the ATLAS scientific program – monthly usage of ATLAS Monte Carlo jobs in backfill mode (Jan-Aug)]


Alexei Klimentov

Summary

• Jumping into the deep end of the pool with Titan was jarring, but it is making the transition to Summit much easier (even pleasant!)
  – Exposing as much parallelism as possible, regardless of the new platform, is essential
• Programming models are still evolving, but one has to start somewhere
• Several allocation programs are available
• OLCF is a bespoke shop: we tailor solutions via collaborations with the SciComp and ADW groups
• "On demand" computing is not easy for LCFs, but sheer scale can lead to some possibilities

Questions? [email protected]