The US DOE Exascale Computing Project (ECP) Perspective for the HEP Community

Douglas B. Kothe (ORNL), ECP Director; Lori Diachin (LLNL), ECP Deputy Director; Erik Draeger (LLNL), ECP Deputy Director of Application Development; Tom Evans (ORNL), ECP Energy Applications Lead

Blueprint Workshop on A Coordinated Ecosystem for HL-LHC Computing R&D, Washington, DC, October 23, 2019

DOE Exascale Program: The Exascale Computing Initiative (ECI)

Three Major Components of the ECI

• ECI partners: US DOE Office of Science (SC) and National Nuclear Security Administration (NNSA)

• ECI mission: Accelerate R&D, acquisition, and deployment to deliver exascale computing capability to DOE national labs by the early- to mid-2020s

• ECI focus: Delivery of an enduring and capable exascale computing capability for use by a wide range of applications of importance to DOE and the US

The three components:
• Selected program office application development (BER, BES, NNSA)
• Exascale Computing Project (ECP)
• Exascale system procurement projects and facilities: ALCF-3 (Aurora), OLCF-5 (Frontier), ASC ATS-4 (El Capitan)

ECP Mission and Vision: Enable US revolutions in technology development; scientific discovery; healthcare; energy, economic, and national security

ECP mission: Develop exascale-ready applications and solutions that address currently intractable problems of strategic importance and national interest. Create and deploy an expanded and vertically integrated software stack on DOE HPC exascale and pre-exascale systems, defining the enduring US exascale ecosystem. Deliver US HPC vendor technology advances and deploy ECP products to DOE HPC pre-exascale and exascale systems.

ECP vision: Deliver exascale simulation and data science innovations and solutions to national problems that enhance US economic competitiveness, change our quality of life, and strengthen our national security.

Vision: Exascale Computing Project (ECP) Lifts all U.S. High Performance Computing to a New Trajectory

[Chart: U.S. HPC capability vs. time, 2016–2027, with 5X and 10X capability markers illustrating the new trajectory.]

Relevant US DOE Pre-Exascale and Exascale Systems for ECP

The three technical areas in ECP have the necessary components to meet national goals

Performant mission and science applications at scale, built on an aggressive RD&D project, mission apps with an integrated software stack, deployment to DOE HPC facilities, and hardware technology advances.

• Application Development (AD): Develop and enhance the predictive capability of applications critical to the DOE. 24 applications spanning national security, energy, earth systems, economic security, materials, and data.

• Software Technology (ST): Deliver an expanded and vertically integrated software stack to achieve the full potential of exascale computing. 70 unique software products spanning programming models and runtimes, math libraries, and data and visualization.

• Hardware and Integration (HI): Integrated delivery of ECP products on targeted systems at leading DOE HPC facilities. 6 US HPC vendors focused on exascale node and system design; application integration and software deployment to facilities.

ECP is a large, complex project: effective project management with three technical focus areas designed to deliver a capable exascale ecosystem

• Hardware and Integration (HI): A capable exascale computing ecosystem made possible by integrating ECP applications, software, and hardware innovations within DOE facilities
• Software Technology (ST): Build a comprehensive, coherent software stack that enables the productive development of highly parallel applications that effectively target diverse exascale architectures
• Application Development (AD): Develop and enhance the predictive capability of applications critical to DOE across the science, energy, and national security mission space
• Project Management (PM): Measure progress and ensure execution within scope, schedule, and budget

Distinctive characteristics:
• RD&D and software development in nature
• Two sponsoring DOE programs
• Numerous participating institutions
• Decentralized cost system
• External project dependence
• Broad and qualitative mission need requirements
• Outcomes: products and solutions
• Key performance parameters require innovation
• Application of scope contingency
• End-of-project transition

ECP by the Numbers

• 7 years: a seven-year, $1.8B R&D effort that launched in 2016
• 6 core DOE national laboratories: Argonne, Lawrence Berkeley, Lawrence Livermore, Oak Ridge, Sandia, and Los Alamos; staff from most of the 17 DOE national laboratories take part in the project
• 4 focus areas: Hardware and Integration, Software Technology, Application Development, Project Management
• 81 R&D teams: more than 80 top-notch R&D teams
• ~1000 researchers
• Hundreds of consequential milestones delivered on schedule and within budget since project inception

ECP Organization

• DOE program managers: Barb Helland (ASCR Program Manager), Thuc Hoang (ASC Program Manager)
• Board of Directors: Bill Goldstein, Chair (Director, LLNL); Thomas Zacharia, Vice Chair (Director, ORNL)
• Federal Project Director: Dan Hoag
• Industry Council: Dave Kepczynski, GE, Chair
• DOE HPC Facilities, Laboratory Operations Task Force (LOTF), and the ECP core laboratories
• Project Director: Doug Kothe, ORNL; Deputy Project Director: Lori Diachin, LLNL; Chief Technology Officer: Al Geist, ORNL
• Technical Operations: Julia White, ORNL; Communications: Mike Bernhardt, ORNL
• Project Management: Kathlyn Boudwin, ORNL, Director; Manuel Vigil, LANL, Deputy Director; Doug Collins, ORNL, Associate Director; Monty Middlebrook, Project Controls & Risk; Doug Collins, IT & Quality
• Project Office Support: Megan Fielden, Human Resources; Willy Besancenez, Procurement; Sam Howard, Export Control Analyst; Mike Hulsey, Business Management; Kim Milburn, Finance Officer; Susan Ochs, Partnerships; Michael Johnson, Legal; and points of contact at the core laboratories
• Application Development: Andrew Siegel, ANL, Director; Erik Draeger, LLNL, Deputy Director
• Software Technology: Mike Heroux, SNL, Director; Jonathan Carter, LBNL, Deputy Director
• Hardware & Integration: Terri Quinn, LLNL, Director; Susan Coghlan, ANL, Deputy Director

ECP Work Breakdown Structure (WBS): key leaders at WBS Levels 1, 2, and 3. The 81 WBS L4 subprojects have set their FY20–23 performance baseline, with scope and technical plans to execute on RD&D objectives in ECP's Final Design.

2.0 Exascale Computing Project: Kothe (ORNL)

2.1 Project Management: Boudwin (ORNL)
  2.1.1 Project Planning and Management: Boudwin (ORNL)
  2.1.2 Project Controls and Risk Management: Middlebrook (ORNL)
  2.1.3 Business Management: Hulsey (ORNL)
  2.1.4 Procurement Management: Besancenez (ORNL)
  2.1.5 Information Technology and Quality Management: Collins (ORNL)
  2.1.6 Communications and Outreach: Bernhardt (ORNL)

2.2 Application Development: Siegel (ANL)
  2.2.1 Chemistry and Materials Applications: Deslippe (LBL)
  2.2.2 Energy Applications: Evans (ORNL)
  2.2.3 Earth and Space Science Applications: Dubey (ANL)
  2.2.4 Data Analytics and Optimization Applications: Hart (SNL)
  2.2.5 National Security Applications: Francois (LANL)
  2.2.6 Co-Design: Germann (LANL)

2.3 Software Technology: Heroux (SNL)
  2.3.1 Programming Models and Runtimes: Thakur (ANL)
  2.3.2 Development Tools: Vetter (ORNL)
  2.3.3 Mathematical Libraries: McInnes (ANL)
  2.3.4 Data and Visualization: Ahrens (LANL)
  2.3.5 Software Ecosystem and Delivery: Munson (ANL)
  2.3.6 NNSA Software Technologies: Neely (LLNL)

2.4 Hardware and Integration: Quinn (LLNL)
  2.4.1 PathForward: de Supinski (LLNL)
  2.4.2 Hardware Evaluation: Pakin (LANL)
  2.4.3 Application Integration at Facilities: Hill (ORNL)
  2.4.4 Software Deployment at Facilities: Adamson (ORNL)
  2.4.5 Facility Resource Utilization: White (ORNL)
  2.4.6 Training and Productivity: Barker (ORNL)

ECP High-Level Schedule and Access to Systems

ECP applications target national problems in DOE mission areas

• National security: next-generation stockpile stewardship codes; reentry-vehicle-environment simulation; multi-physics science simulations of high-energy-density physics conditions

• Energy security: turbine wind plant efficiency; design and commercialization of SMRs; nuclear fission and fusion reactor materials design; subsurface use for carbon capture, petroleum extraction, and waste disposal; high-efficiency, low-emission combustion engine and gas turbine design; scale-up of clean fossil fuel combustion; biofuel catalyst design

• Economic security: additive manufacturing of qualifiable metal parts; reliable and efficient planning of the power grid; seismic hazard risk assessment

• Scientific discovery: cosmological probe of the standard model of particle physics; validation of fundamental laws of nature; plasma wakefield accelerator design; light-source-enabled analysis of protein and molecular structure and design; finding, predicting, and controlling materials and properties; prediction and control of magnetically confined fusion plasmas; demystifying the origin of the chemical elements

• Earth system: accurate regional impact assessments in Earth system models; stress-resistant crop analysis and catalytic conversion of biomass-derived alcohols; metagenomics for analysis of biogeochemical cycles, climate change, and environmental remediation

• Health care: accelerate and translate cancer research (partnership with NIH)

Co-design Subprojects

• Co-design centers address computational motifs common to multiple application projects

• Co-design helps to ensure that applications effectively utilize exascale systems: it pulls software and hardware developments into applications and pushes application requirements into software and hardware RD&D. Co-design has evolved from a best practice to an essential element of the development cycle.

• Co-design centers focus on a unique collection of algorithmic motifs invoked by ECP applications. A motif is an algorithmic method that drives a common pattern of computation and communication. The centers must address all high-priority motifs used by ECP applications, including the new motifs associated with data science applications.

• Co-design is an efficient mechanism for delivering next-generation community products with broad application impact: evaluate, deploy, and integrate exascale hardware-aware software designs and technologies for key crosscutting algorithmic motifs into ECP applications.

Co-design centers and their motifs: CODAR (data and workflows), CoPA (particles/mesh methods), AMReX (block-structured AMR), CEED (finite element discretization), ExaGraph (graph-based algorithms), ExaLearn (machine learning)

Department of Energy (DOE) Roadmap to Exascale Systems: an impressive, productive lineup of accelerated-node systems supporting DOE's mission

Pre-exascale systems (2012–2020) and first U.S. exascale systems (2021–2023):

• ORNL: Titan (#12, Cray/AMD/NVIDIA, 2012) → Summit (#1, IBM/NVIDIA, 2018) → Frontier (ORNL*, Cray/AMD)
• ANL: Mira (#24, IBM BG/Q) → Theta (#28, Cray/Intel KNL) → Aurora (ANL*, Cray/Intel)
• LBNL: Cori (#14, Cray/Intel Xeon/KNL) → Perlmutter (NERSC-9, Cray/AMD/NVIDIA)
• LLNL and LANL/SNL: Sequoia (#13, LLNL, IBM BG/Q) → Trinity (#7, LANL/SNL, Cray/Intel Xeon/KNL) → Sierra (#2, LLNL, IBM/NVIDIA) → El Capitan (LLNL*, vendor TBD) and a LANL/SNL system (vendor TBD)

To date, only NVIDIA GPUs have been deployed; the exascale systems will bring three different types of accelerators.

New hardware requires fully re-examining approaches

Goal: Ensure exascale hardware impacts DOE science/mission

Approach: Significant investment in scientific applications well in advance of exascale machines

Code porting → Code restructuring → New numerical/algorithmic approaches → Alternate choice of physical models

This is not just a porting exercise: codes are being redesigned with heterogeneous computing and portability in mind

ECP Software Technology: Software Ecosystem

Collaborators: ECP applications, facilities, vendors, and the broader HPC community

ECP Software Technology areas: Programming Models & Runtimes, Development Tools, Mathematical Libraries, Data & Visualization, and Software Ecosystem & Delivery

Details available publicly at https://www.exascaleproject.org/wp-content/uploads/2019/02/ECP-ST-CAR.pdf

ECP software technologies are a fundamental underpinning in delivering on DOE's exascale mission

• Programming Models & Runtimes: enhance and prepare OpenMP and MPI programming models (hybrid programming models, deep memory copies) for exascale; development of performance portability tools (e.g., Kokkos and RAJA); support for alternate models for potential benefits and risk mitigation, including PGAS (UPC++/GASNet) and task-based models (Legion, PaRSEC); libraries for deep memory hierarchy and power management

• Development Tools: continued, multifaceted capabilities in the portable, open-source LLVM compiler ecosystem to support expected ECP architectures, including support for F18; performance analysis tools that accommodate new architectures and programming models, e.g., PAPI and TAU

• Math Libraries: linear algebra, iterative and direct linear solvers, integrators and nonlinear solvers, optimization, FFTs, etc.; performance on new node architectures; extreme strong scalability; advanced algorithms for multi-physics, multiscale simulation and outer-loop analysis; increasing quality, interoperability, and complementarity of math libraries

• Data and Visualization: I/O libraries (HDF5, ADIOS, PnetCDF) and I/O via the HDF5 API; insightful, memory-efficient in-situ visualization and analysis, with data reduction via scientific data compression; checkpoint/restart; filesystem support for emerging solid-state technologies

• Software Ecosystem: develop the features in Spack necessary to support all ST products in E4S and the AD projects that adopt it; development of Spack stacks for reproducible, turnkey deployment of large collections of software; optimization and interoperability of containers on HPC systems; regular E4S releases of the ST software stack and SDKs, with regular integration of new ST products

• NNSA ST: projects that have both a mission role and an open science role; major technical areas include new programming abstractions, math libraries, and data and viz libraries; cover most ST technology areas; open-source NNSA software projects; subject to the same planning, reporting, and review processes

Software Development Kits (SDKs): key delivery vehicle for ECP. A collection of related software products (packages) where coordination across package teams improves usability and practices, and fosters community growth among teams that develop similar and complementary capabilities.

SDK aspects ("Unity in essentials, otherwise diversity"):
• Domain scope: the collection makes functional sense for collaboration and usability
• Interaction model: how packages interact; compatible, complementary, interoperable
• Community policies: value statements that serve as criteria for membership
• Meta-infrastructure: invokes the build of all packages (Spack), shared test suites
• Coordinated plans: inter-package planning that augments autonomous package planning
• Community outreach: coordinated, combined tutorials, documentation, best practices

ECP ST SDKs (grouping similar products): Programming Models & Runtimes Core; Tools & Technologies; Compilers & Support; Math Libraries (xSDK); Viz Analysis and Reduction; Data Management, I/O Services & Checkpoint/Restart

ECP ST SDKs will span all technology areas

• PMR Core (17): QUO, Papyrus, SICM, Legion, Kokkos (support), RAJA, CHAI, PaRSEC*, DARMA, GASNet-EX, Qthreads, BOLT, UPC++, MPICH, Open MPI, Umpire, AML
• Tools and Technology (11): TAU, HPCToolkit, Dyninst Binary Tools, Gotcha, Caliper, PAPI, Program Database Toolkit, Search (random forests), Siboka, C2C, Sonar
• Compilers and Support (7): openarc, Kitsune, LLVM, CHiLL autotuning comp, LLVM OpenMP comp, OpenMP V&V, Flang/LLVM Fortran comp
• xSDK (16): hypre, FleCSI, MFEM, KokkosKernels, Trilinos, SUNDIALS, PETSc/TAO, libEnsemble, STRUMPACK, SuperLU, ForTrilinos, SLATE, MAGMA, DTK, Tasmanian, TuckerMPI
• Visualization Analysis and Reduction (9): ParaView, Catalyst, VTK-m, SZ, zfp, VisIt, ASCENT, Cinema, ROVER
• Data Management, I/O Services, Checkpoint/Restart (12): SCR, FAODEL, ROMIO, Mercury (Mochi suite), HDF5, Parallel netCDF, ADIOS, Darshan, UnifyCR, VeloC, IOSS, HXHIM
• Ecosystem/E4S at-large (12): mpiFileUtils, TriBITS, MarFS, GUFI, Intel GEOPM, BEE, FSEFI, Kitten Lightweight Kernel, COOLR, NRM, ArgoContainers, Spack

Legend: each column above is an SDK as defined in the initial breakdown process, using criteria developed for choosing logical and effective groupings based on experience with the xSDK. The ST technical areas represented are PMR, Tools, Math Libraries, Data and Vis, and Ecosystems and Delivery.

ST Ecosystem: from products to SDKs to an integrated stack

Levels of integration, with product source and delivery:

• ECP ST individual products (standard workflow; existed before ECP). Source: ECP L4 teams, non-ECP developers, standards groups. Delivery: apps directly, Spack, vendor stack, facility stack.

• SDKs (group similar products; make them interoperable; assure policy compliance; include external products). Source: ECP SDK teams, non-ECP products (policy compliant, Spackified). Delivery: apps directly, spack install sdk; in the future, vendor/facility stacks.

• E4S (build all SDKs; build the complete stack; containerize binaries). Source: ECP E4S team, non-ECP products (all dependencies). Delivery: spack install e4s, containers, CI testing.

ECP ST Open Product Integration Architecture

Extreme-scale Scientific Software Stack (E4S): a Spack-based distribution of ECP ST products and related and dependent software, tested for interoperability and portability to multiple architectures. Lead: Sameer Shende, University of Oregon

• Provides distinction between SDK usability / general quality / community and deployment / testing goals

• Will leverage and enhance SDK interoperability thrust

• Releases:
  – Oct: E4S 0.1 (24 full, 24 partial release products)
  – Jan: E4S 0.2 (37 full, 10 partial release products)

• Current primary focus: Facilities deployment

http://e4s.io

Monte Carlo Transport on Accelerated Node Architectures: Recent Efforts in the ECP ExaSMR Subproject

Thomas M. Evans
A Coordinated Ecosystem for HL-LHC Computing R&D, Catholic University, Oct 23, 2019

ExaSMR: Modeling and Simulation of Small Modular Reactors

• Small modular nuclear reactors present significant simulation challenges
  – Small size invalidates existing low-order models
  – Natural circulation flow requires high-fidelity fluid flow simulation
• ExaSMR will couple the most accurate available methods to perform "virtual experiment" simulations
  – Monte Carlo neutronics
  – CFD with turbulence models

MC Neutronics
• Petascale: system-integrated responses; single physics; constant temperature; isotopic depletion on assemblies
• Exascale: pin-resolved (and sub-pin) responses; coupled with T/H; variable temperatures; isotopic depletion on the full core

CFD
• Petascale: single fuel assembly; RANS; within-core flow; reactor startup
• Exascale: full reactor core; hybrid LES/RANS; entire coolant loop; full-cycle modeling

[Figure: fuel assembly mixing vane]

Physical Problem Characteristics and Problem Parameters

• Core characteristics
  – Full-core representative SMR model containing 37 assemblies, with 17 × 17 pin positions and 264 fuel pins per assembly
  – 10^10 particles per eigenvalue iteration
  – Pin-resolved reaction rates with 3 radial tally regions and 50–100 axial levels
  – O(150) nuclides and O(8) reactions per nuclide in each tally region
• Geometry size: N_cells = 1.9 × 10^6 – 8.8 × 10^6
• Tally sizes: N_t,cells = 4.8 × 10^5 – 5.9 × 10^6; N_t,bins = 1.5 × 10^9 – 1.8 × 10^10
• Pin geometry: r_fuel = 0.406 cm, r_gap = 0.414 cm, r_clad = 0.475 cm; pin pitch = 1.26 cm; assembly pitch = 21.5 cm; height = 227.56 cm; fuel (UO2), clad (Zr), gap (He)

Monte Carlo Neutron Transport Challenges

• MC neutronics is a stochastic method
  – Independent random walks are not readily amenable to SIMT algorithms (on-node concurrency)
  – Sampling data is randomly accessed
  – Sampling data is characterized by detailed structure
  – Large variability in transport distributions, both within and between particle histories

Developing GPU Continuous Energy Monte Carlo – Intra-Node

• Focus on high-level thread divergence
• Optimize for device occupancy
  – Separate geometry and physics kernels to increase occupancy: boundary crossings (geometry) and collisions (physics)
  – Smaller kernels help address the variability in particle transport distributions
• Partition macroscopic cross section calculations between fuel and non-fuel regions, with separate kernels for each
• Use of hardware atomics for tallies and direct sort addressing of source particles
• Judicious use of texture memory (__ldg on data interpolation bounds)

Simple event-based transport algorithm:

  get vector of source particles
  while any particles are alive do
    for each living particle do
      move particle (dist-to-collision, dist-to-boundary, move-to-next)
    end for
    for each living particle do
      process particle collision
    end for
    sort/consolidate surviving particles
  end while
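As a concrete illustration of the kernel split above, here is a minimal CUDA C++ sketch of an event-based loop. The Particle layout, kernel names, and placeholder bodies are hypothetical stand-ins, not the actual Shift implementation.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/partition.h>

// Hypothetical particle state; the actual Shift data layout is different.
struct Particle {
  double pos[3], dir[3], energy;
  int    cell;
  bool   alive;
};

struct IsAlive {
  __host__ __device__ bool operator()(const Particle& p) const { return p.alive; }
};

// Geometry kernel: move each live particle to its next event (no physics here).
__global__ void move_particles(Particle* p, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n || !p[i].alive) return;
  // ... sample distance-to-collision, compute distance-to-boundary,
  //     advance the particle to the nearer of the two ...
}

// Physics kernel: process collisions and accumulate tallies with hardware atomics.
__global__ void collide_particles(Particle* p, int n, double* tally) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n || !p[i].alive) return;
  // ... sample the reaction, update energy/direction ...
  atomicAdd(&tally[p[i].cell], 1.0);  // double atomicAdd needs compute capability 6.0+
  p[i].alive = false;                 // placeholder termination; real code samples absorption, leakage, etc.
}

// Event loop: alternate geometry and physics kernels, then compact the survivors.
void transport_event_based(Particle* d_particles, int n, double* d_tally) {
  const int threads = 256;
  while (n > 0) {
    const int blocks = (n + threads - 1) / threads;
    move_particles<<<blocks, threads>>>(d_particles, n);
    collide_particles<<<blocks, threads>>>(d_particles, n, d_tally);
    // Sort/consolidate: keep the surviving particles at the front of the array.
    thrust::device_ptr<Particle> first(d_particles);
    n = static_cast<int>(thrust::partition(first, first + n, IsAlive()) - first);
  }
}
```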

Production continuous-energy Monte Carlo transport solver on GPUs

• The petascale implementation did not use GPU hardware
• Enables three-dimensional, fully depleted SMR core models simulated using continuous-energy physics and pin-resolved reaction rates with temperature dependence
• Algorithmic improvements offer a 10x speedup relative to the initial implementation and nearly a 60x per-node speedup over Titan
• Nearly perfect parallel scaling efficiency on ORNL's Summit supercomputer
• The GPU algorithm executes more than 20x faster than the CPU algorithm on Summit (per full node)
• The paper below describes the first production MC solver implementation on GPUs

[Figure: increase in particle tracking rate across GPU computing architectures]

Hamilton, S.P., Evans, T.M., 2019. Continuous-energy Monte Carlo neutron transport on GPUs in the Shift code. Annals of Nuclear Energy 128, 236–247. https://doi.org/10.1016/j.anucene.2019.01.012

[Figure: total reaction rate in the SMR core]

Cross section calculations

• Computing transport cross sections requires contributions from the various constituents:

  $$\Sigma(E) = \sum_{m=1}^{M} N_m\,\sigma_m(E)$$

• Fuel compositions contain substantially more nuclides than non-fuel compositions

• Partition mixtures into fuel and non-fuel
  – Evaluate cross sections in separate kernels to reduce divergence
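As an illustration of the fuel/non-fuel partition, the sketch below evaluates the sum above with one thread per particle. The data layout, the micro_xs stub, and all names are hypothetical, not the ExaSMR/Shift code.

```cuda
// Stub microscopic cross section lookup; real code interpolates pointwise CE data.
__device__ double micro_xs(int nuclide, double E) {
  return 1.0 / (E + nuclide + 1.0);  // placeholder value only
}

// Sum Sigma(E) = sum_m N_m * sigma_m(E) over the nuclides of each particle's mixture.
// `particle_idx` selects which particles this launch handles (fuel OR non-fuel),
// so warps never mix the long fuel loops (~O(300) nuclides) with short non-fuel loops.
__global__ void macro_xs_kernel(const int* mix_start, const int* mix_count,
                                const int* nuclide_ids, const double* number_density,
                                const double* energy, const int* particle_mix,
                                const int* particle_idx, int n, double* sigma_t) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;
  if (t >= n) return;
  const int p   = particle_idx[t];
  const int mix = particle_mix[p];
  double sum = 0.0;
  for (int j = 0; j < mix_count[mix]; ++j) {
    const int k = mix_start[mix] + j;
    sum += number_density[k] * micro_xs(nuclide_ids[k], energy[p]);
  }
  sigma_t[p] = sum;
}
```

The same kernel would then be launched once over the fuel-particle index list and once over the non-fuel list, so that each warp executes loops of similar length.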

Occupancy

• The flattened algorithm allows small, focused kernels
  – Split geometry/physics components to reduce register usage
  – Smaller kernels = higher occupancy

MC type             Algorithm       Registers   Occupancy
Multigroup          History-based   85          25%
Multigroup          Event-based     83          25%
Continuous-energy   History-based   168         12.5%
Continuous-energy   Event-based     62          50%

Effect of varying occupancy

• Artificially limit occupancy by allocating unused dynamic shared memory at launch, i.e., kernel<<<grid, block, shared_bytes>>>(…)

Occupancy (%)   History-based   Event-based   Flattened event-based
12.5            3.7             3.4           8.2
25.0            –               5.8           13.3
37.5            –               –             14.5
50.0            –               –             16.9
62.5❊           –               –             18.0

❊ Only applied to the "distance to collision" kernel
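The occupancy-limiting experiment above can be reproduced by padding the launch with an unused dynamic shared-memory request so that fewer blocks fit on each SM. The sketch below uses a hypothetical, empty dist_to_collision_kernel and queries the resulting residency with the CUDA occupancy API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty stand-in for the "distance to collision" kernel; the dynamic shared-memory
// request exists only to limit how many blocks can be resident per SM.
__global__ void dist_to_collision_kernel() {
  extern __shared__ char occupancy_pad[];
  (void)occupancy_pad;
}

int main() {
  const int block = 256;
  const size_t pads[] = {0, 8 * 1024, 16 * 1024, 32 * 1024};
  for (size_t pad_bytes : pads) {
    int blocks_per_sm = 0;
    // Query how many blocks can be resident per SM for this launch configuration.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  dist_to_collision_kernel,
                                                  block, pad_bytes);
    std::printf("pad = %zu bytes -> %d resident blocks/SM\n", pad_bytes, blocks_per_sm);
    dist_to_collision_kernel<<<1024, block, pad_bytes>>>();
  }
  cudaDeviceSynchronize();
  return 0;
}
```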

CPU vs. GPU performance

[Figure: CPU tracking rate per core and GPU core equivalent]

GPU performance increases have outpaced corresponding CPU improvements

Device saturation

Newest architectures remain unsaturated at 1M particles per GPU

[Figure: depleted SMR core]

Inter-node Scaling

• Domain replication parallelism: with four replicated domains, each replica tracks N/4 of the N particles
• Multi-set domain decomposition topology (in development for the GPU), with intra-set non-uniform block out to address load balancing
• Weak scaling demonstrated on Summit with 1 GPU per MPI rank
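For reference, here is a minimal MPI sketch of the domain-replication pattern described above: each rank holds the full problem, tracks N/size particles, and the replicated tallies are summed. The names and the track_history_batch stub are illustrative, not the Shift/ExaSMR API.

```cpp
#include <mpi.h>
#include <vector>

// Stub standing in for the GPU transport of one batch of histories; the real
// ExaSMR/Shift tally mesh and tracking code are far more involved.
static void track_history_batch(std::vector<double>& tally, long n_histories) {
  tally[0] += static_cast<double>(n_histories);  // placeholder deposition
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const long n_total = 1L << 30;            // N particles per cycle (illustrative)
  const long n_local = n_total / size;      // "Num Particles = N / size" per replica

  // Domain replication: every rank holds the full geometry and full tally mesh.
  std::vector<double> tally(1 << 20, 0.0);
  track_history_batch(tally, n_local);      // independent random walks on this rank

  // Combine the replicated tallies across all replicas.
  MPI_Allreduce(MPI_IN_PLACE, tally.data(), static_cast<int>(tally.size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0) {
    // tally now holds the globally combined result; write output here.
  }
  MPI_Finalize();
  return 0;
}
```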

Ellis, J.A., Evans, T.M., Hamilton, S.P., Kelley, C.T., Pandya, T.M., 2019. Optimization of processor allocation for domain decomposed Monte Carlo calculations. Parallel Computing 87, 77–86. https://doi.org/10.1016/j.parco.2019.06.001

On-the-Fly Doppler Broadening

• Cross section resonances significantly broaden due to thermal motion of nuclei

• The cross section (σ) at any energy (E) and temperature (T) can be expressed as a summation over contributions from poles (p_j) and corresponding residues (r_j):

  $$\sigma(E,T) = \frac{1}{E}\sqrt{\frac{A\pi}{k_B T}}\,\sum_j \Re\!\left[\,r_j\,W\!\left(\left(\sqrt{E}-p_j\right)\sqrt{\frac{A}{k_B T}}\right)\right]$$

• A polynomial approximation can be used to reduce the number of W(·) evaluations:

  $$\sigma(E,T) = \frac{1}{E}\sqrt{\frac{A\pi}{k_B T}}\,\sum_j \Re\!\left[\,r_j\,W\!\left(\left(\sqrt{E}-p_j\right)\sqrt{\frac{A}{k_B T}}\right)\right] + \sum_{n=0}^{N-1} a_{w,n}\,\mathfrak{D}_n$$
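Below is a minimal CUDA C++ sketch of evaluating the pole term of the expression above, one particle per thread. The pole/residue layout is hypothetical, the faddeeva_w placeholder uses only a crude large-|z| asymptotic form where a production code would use a full approximation of W(z), and the polynomial correction term is omitted.

```cuda
#include <thrust/complex.h>

using cplx = thrust::complex<double>;

// Placeholder Faddeeva function: crude large-|z| asymptotic form i/(sqrt(pi) z).
__device__ cplx faddeeva_w(cplx z) {
  const double SQRT_PI = 1.7724538509055160273;
  return cplx(0.0, 1.0) / (SQRT_PI * z);
}

// Evaluate sigma(E, T) for one particle per thread from a nuclide's poles/residues.
// A is the nuclide mass ratio and kB the Boltzmann constant, as in the formula above.
__global__ void doppler_xs_kernel(const cplx* poles, const cplx* residues, int n_poles,
                                  const double* energy, double T, double A, double kB,
                                  double* sigma, int n_particles) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n_particles) return;

  const double PI    = 3.14159265358979323846;
  const double sqrtE = sqrt(energy[i]);
  const double invKT = A / (kB * T);            // A / (k_B T)
  double sum = 0.0;
  for (int j = 0; j < n_poles; ++j) {
    const cplx z = (sqrtE - poles[j]) * sqrt(invKT);
    sum += (residues[j] * faddeeva_w(z)).real();  // Re[ r_j W(z) ]
  }
  sigma[i] = sum * sqrt(A * PI / (kB * T)) / energy[i];
}
```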

GPU Performance

• Performance testing with a quarter-core model of the NuScale Small Modular Reactor (SMR)

• No significant sacrifice of accuracy compared to standard continuous energy (CE) data

• Each GPU thread does individual Faddeeva evaluations (no vectorization over nuclides)

• Factor of 2-3 performance penalty on both the CPU and GPU for arbitrary temperature resolution

Hardware: 2× IBM Power8+, 4× NVIDIA Tesla P100

Geant-based proxy pilot

Goals
• Research and develop design patterns suitable for HEP transport on GPUs
• Produce a proxy app with limited but representative physics processes
• Execute and profile the proxy app at the scale needed by next-generation HEP experiments

Challenges
• Choosing a scope small enough to digest but able to emulate the level of complexity of a real simulation
• Reconciling the static (build-time) preference of GPU code with dynamic user requirements
• Effectively utilizing the GPU with a very broad, flat call graph (dozens of independent physics processes)

Geant-based proxy pilot

Complete
• Developed requirements document for the proxy app
• Constructed development framework (CMake/Docker/CUDA)
• Integrated CUDA-enabled VecGeom geometry

In progress
• Iterating on high-level code architecture and event loop
• Implementing physics kernels in CUDA
• Awaiting onboarding of postdoc...

Future work
• Explore HIP in preparation for Frontier
• Evaluate ClangJIT for GPU-friendly dynamicism

Questions? https://www.exascaleproject.org/
