Early application experiences on Summit

Wayne Joubert, Scientific Computing Group, Oak Ridge Leadership Computing Facility

3rd OpenPOWER Academia Discussion Group Workshop Nov. 10, 2018

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

Summit – background
• Officially launched June 8, 2018
• World's fastest: peak speed 200 PF
• #1 on TOP500 @ 122.3 PF, June 2018
• #1 level-3 measured system
• #1 on HPCG benchmark
• Used by 5 out of 6 Gordon Bell Finalist teams
• Achieved world's first ExaOp calculation by an application, @ 2.36 ExaOps (ExaFlops16)
• Not yet officially accepted, but already achieving impressive results on conventional science and machine learning applications

[Slides courtesy Jack Wells]

Summit early users
• 1,080 compute nodes have been available to users since December 2017; the system was then built up to the present 4,608 nodes
• Used by 13 CAAR teams (Center for Accelerated Application Readiness)
• 65 Letters of Intent for the Summit Early Science program – these teams were allowed on Summit for application readiness
• Gordon Bell teams (5)
• System Acceptance Test team – preparations for final system acceptance testing

Graphic courtesy Tjerk Straatsma

Summit early science applicants
• Received 65 LOIs in January, 47 full proposals in June
• Awardees will be among the first users to get access to Summit after acceptance
• Notably, 12 of the 65 LOIs (~20%) had a machine learning component – remarkable growth in a short period of time
• Tremendous interest in running on Summit, from Early Science as well as 2019 INCITE projects (announcement Monday)

Summit Gordon Bell Teams

[Slide courtesy Jack Wells]

Summit Gordon Bell Finalist Projects
• CoMet team used Tensor Cores to achieve 2.36 ExaOps performance on a comparative genomics application
• Prabhat's LBL team, deep learning application, 1.13 ExaOps peak, 0.999 ExaOps sustained performance for identification of extreme weather patterns from high-resolution climate simulation data
• University of Tokyo team used AI and transprecision computing for earthquake simulation
• ORNL / Robert Patton team, MENNDL code, 152 PetaOps analyzing atomic-level materials properties from electron microscopy data
• LBL-led team using LQCD code with mixed precision multigrid solver to study the physics of subatomic particles

• Full presentations @ SC18 sessions, Wed. 3:30-5:00, Thu. 10:30-12:00

Summit: first impressions
• Our early experience with Summit is that it is an extremely powerful system
  – Very strong GPUs
  – Apps are often getting a higher fraction of peak than on Titan – improvements to GPU hardware and software over time
  – New features useful for some apps, e.g., Tensor Cores, NVMe devices
  – Low-congestion fat tree interconnect with adaptive routing
• Many apps have gotten impressive results already
• The early system was somewhat rough around the edges, with a number of issues we have had to work through with the vendors
• The system has progressively gotten better as all parties have been working through the issues

[Figure: CAAR application performance to date – number of nodes scaled to (out of 4,608 Summit nodes) and performance vs. CPU-only. From "Early Application Results on Summit," T.P. Straatsma, Smoky Mountain Conference 2018]

Summit node performance
• Summit nodes are achieving a high percentage of theoretical peak performance characteristics
• For details see Vazhkudai et al., "The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems," @ SC18, Wed. 3:30PM

Summit node performance: CPU memory subsystem
• Using the Stream benchmark to measure CPU memory bandwidth
• Theoretical peak 340 GB/sec, actual ~275 GB/sec, ~82% of peak
• Significant boost over the previous Titan and JaguarPF nodes

Stream bandwidth comparison:
• Summit: peak 170 x 2 = 340 GB/sec, actual ~275 GB/sec (~82% of peak)
• Titan: peak 25.6 x 2 = 51.2 GB/sec, actual ~34 GB/sec (~67% of peak)
• JaguarPF: peak 25.6 GB/sec, actual ~19 GB/sec (~75% of peak)

Summit node performance: GPU HBM memory
• Theoretical peak bandwidth of 900 GB/sec
• Measured performance from GPU Stream benchmark: 789 (Copy), 788 (Mul), 831 (Add and Triad) GB/sec, representing 88%-92% of peak
• Compares extremely well to Titan K20X, ~181 GB/sec out of 250 GB/sec peak (72%)
• Innovations were made in the GPU memory to achieve a higher fraction of peak performance
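For context, a minimal sketch (not the actual GPU Stream benchmark cited above) of how a triad-style HBM bandwidth measurement can be done in CUDA; array size, kernel shape, and timing method are assumptions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Triad kernel: a[i] = b[i] + scalar * c[i], the bandwidth-bound pattern
// used by Stream-style benchmarks (3 arrays touched per element).
__global__ void triad(double* a, const double* b, const double* c,
                      double scalar, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) a[i] = b[i] + scalar * c[i];
}

int main() {
  const size_t n = 1 << 28;                 // ~268M doubles per array (~2 GB each)
  const size_t bytes = n * sizeof(double);
  double *a, *b, *c;
  cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);
  cudaMemset(b, 0, bytes); cudaMemset(c, 0, bytes);

  cudaEvent_t t0, t1;
  cudaEventCreate(&t0); cudaEventCreate(&t1);
  const int reps = 20;
  cudaEventRecord(t0);
  for (int r = 0; r < reps; ++r)
    triad<<<(unsigned)((n + 255) / 256), 256>>>(a, b, c, 3.0, n);
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);

  float ms = 0;
  cudaEventElapsedTime(&ms, t0, t1);
  // 3 arrays moved per repetition: read b, read c, write a.
  double gbps = 3.0 * bytes * reps / (ms * 1e-3) / 1e9;
  printf("Triad bandwidth: %.1f GB/s\n", gbps);
  return 0;
}
```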

Summit node performance: CPU-GPU NVLink
• Previously relied on the much slower PCIe-2 connection on Titan
• On-socket transfer rates are 92% and 86% of the 50 GB/sec (unidirectional) and 100 GB/sec (bidirectional) peaks
• Off-socket transfers go through the X-Bus and are slower

Summit node performance: InfiniBand interconnect
• Node-to-node bandwidth and latency measured using the IMB benchmark
• Achieving 22.36 and 44.29 GB/sec out of peak 25 and 50 GB/sec unidirectional / bidirectional for sufficiently large messages
• ~89% of theoretical peak
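A minimal sketch (assuming MPI; this is not the IMB benchmark itself) of a two-rank ping-pong bandwidth measurement like the one summarized above:

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Two-rank ping-pong: rank 0 sends a large buffer to rank 1 and receives it
// back; bandwidth = bytes moved per unit time over many repetitions.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const size_t bytes = 1 << 26;            // 64 MB message (assumed size)
  std::vector<char> buf(bytes);
  const int reps = 100;

  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  for (int r = 0; r < reps; ++r) {
    if (rank == 0) {
      MPI_Send(buf.data(), (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(buf.data(), (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      MPI_Recv(buf.data(), (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send(buf.data(), (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
  }
  double elapsed = MPI_Wtime() - t0;
  if (rank == 0) {
    // Each iteration moves the message twice (there and back).
    double gbps = 2.0 * bytes * reps / elapsed / 1e9;
    printf("Ping-pong bandwidth: %.2f GB/s\n", gbps);
  }
  MPI_Finalize();
  return 0;
}
```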

GTC application (CAAR; acceptance test code)
• GTC (Gyrokinetic Toroidal Code) is a particle-in-cell (PIC) application to simulate magnetically confined plasmas in fusion reactors such as ITER
• Written in Fortran 90
• Accelerated primarily with OpenACC
• OpenMP acceleration of CPU-only parts; also an OpenMP code version for Intel Xeon Phi
• Project personnel: Zhihong Lin (co-PI), William Tang (co-PI), Jian Bao, Wayne Joubert, Matthew Niemerg, Lei Shi, Sam Taimourzadeh, Bei Wang, Peng Wang, Wenlu Zhang
• http://phoenix.ps.uci.edu/gtc_group

GTC application: experiences porting to Summit GPUs
• Expensive particle push and charge loops are mapped to the GPU using OpenACC, with persistent data on the GPUs (see the sketch after this list)
• (Aside: a number of codes since 2012 have used OpenACC on Titan and Summit, including several Summit CAAR codes. Some codes are now starting to use OpenMP 4, e.g., PSDNS on Titan)
• "Shift" operation — moving particles to different MPI ranks — uses highly optimized custom CUDA code to get high performance, taking advantage of OpenACC/CUDA interoperability
• Poisson field solver
  – original code used the PETSc ILU(0)+GMRES sparse solver (CPU-only)
  – now uses NVIDIA's AMGX algebraic multigrid solver on GPUs, > 20X faster
  – also an option to use the Hypre algebraic multigrid solver (GPU support in development)
• A build option exists to use GPU Unified Memory; originally this was much slower than explicit transfers, but performance is now near parity thanks to PGI compiler improvements
• GTC has significant I/O requirements. The GTC I/O behaviors uncovered some issues with the Summit GPFS file system, which were addressed
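A minimal sketch of the OpenACC pattern described above — a particle-push-style loop with particle arrays kept resident on the GPU across timesteps. This is C++ rather than GTC's Fortran, and the field and push details are simplified placeholders, not GTC's actual kernels:

```cpp
#include <vector>

// Simplified particle push: arrays stay resident on the GPU for the lifetime
// of a structured data region; only the push loop runs as a kernel per step.
void push_particles(int nsteps, int np, float dt) {
  std::vector<float> x(np, 0.0f), v(np, 1.0f), e(np, 0.5f);  // placeholder data
  float* xp = x.data(); float* vp = v.data(); float* ep = e.data();

  // Persistent device copies; no per-step host-device traffic.
  #pragma acc data copy(xp[0:np], vp[0:np]) copyin(ep[0:np])
  {
    for (int step = 0; step < nsteps; ++step) {
      #pragma acc parallel loop present(xp[0:np], vp[0:np], ep[0:np])
      for (int i = 0; i < np; ++i) {
        vp[i] += ep[i] * dt;    // accelerate by (placeholder) field value
        xp[i] += vp[i] * dt;    // advance position
      }
    }
  }  // x and v copied back to the host here
}
```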

GTC results: weak scaling to 4,500 nodes of Summit

CoMet application (INCITE, Early Science, Gordon Bell)
• CoMet = Combinatorial Metrics code
• A new biosciences application used to find genomic features within a population
• Not a "traditional" modeling and simulation code (e.g., continuum PDE solver, PIC, Monte Carlo, etc.)
• Also not a deep learning app per se, though it is part of an AI workflow
• Best described as a data analytics application used in comparative genomics studies
• Gordon Bell Finalist – see talk Thurs. 11:30 AM

CoMet application
• The primary computation is an all-to-all comparison of vectors
• Computationally similar to a distributed DGEMM operation, as in the ScaLAPACK library and PBLAS — very computationally intensive, but also requires communication of very large matrices
• Written in C++; uses CUDA, cuBLAS and modified MAGMA calls
• Uses explicit calls for both asynchronous MPI point-to-point messages and asynchronous CPU/GPU transfers, with pipelining to overlap compute and transfer (see the sketch below)
• OpenMP threading is used for CPU work, done concurrently with the GPU work
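A minimal sketch of the overlap pattern described above — double-buffered asynchronous host-to-device transfers overlapped with GPU compute using CUDA streams. The kernel, buffer names, and block layout are placeholders, not CoMet's actual code; the MPI side is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Placeholder compute kernel (stands in for the real per-block work).
__global__ void process(const float* block, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0f * block[i];
}

// Double-buffered pipeline: two streams, each owning one device buffer.
// The copy for one block overlaps with compute on the previous block in the
// other stream (host_blocks must be pinned memory for true overlap).
// d_out is assumed to be sized nblocks * n.
void pipeline(const float* host_blocks, float* d_out, int nblocks, int n) {
  float* d_buf[2];
  cudaStream_t stream[2];
  for (int b = 0; b < 2; ++b) {
    cudaMalloc(&d_buf[b], (size_t)n * sizeof(float));
    cudaStreamCreate(&stream[b]);
  }

  for (int k = 0; k < nblocks; ++k) {
    int b = k % 2;
    // Stream ordering guarantees this copy waits for the kernel that last
    // used d_buf[b] (block k-2), while the other stream's work overlaps.
    cudaMemcpyAsync(d_buf[b], host_blocks + (size_t)k * n,
                    (size_t)n * sizeof(float), cudaMemcpyHostToDevice, stream[b]);
    process<<<(n + 255) / 256, 256, 0, stream[b]>>>(d_buf[b],
                                                    d_out + (size_t)k * n, n);
  }
  cudaDeviceSynchronize();
  for (int b = 0; b < 2; ++b) { cudaStreamDestroy(stream[b]); cudaFree(d_buf[b]); }
}
```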

CoMet algorithm: Custom Correlation Coefficient (CCC)
• Used to analyze allele data from a genome, encoded as 2-bit vector elements
• Base implementation uses bitwise operations (AND, OR, NOT, shift, mask, __popcll, etc.) to operate on this binary allelic data

[Figure: two example vectors v1 and v2 composed of 2-bit entries; all combinations of bits from the left and right vector elements are taken, and the results are tallied into a table representing how the two vectors are related]
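A minimal sketch of the kind of bitwise tally described above. The packing convention (2-bit alleles packed 32 per 64-bit word) and the host __builtin_popcountll intrinsic are illustrative assumptions, not CoMet's actual CCC kernel:

```cpp
#include <cstdint>

// Tally how two vectors of 2-bit allele codes are related: for every element
// position, all four combinations of one bit from v1 and one bit from v2 are
// counted into a 2x2 table.  (Assumes vector length is a multiple of 32.)
void ccc_tally(const uint64_t* v1, const uint64_t* v2, int nwords,
               long long table[2][2]) {
  const uint64_t lo = 0x5555555555555555ULL;   // low bit of each 2-bit slot
  for (int w = 0; w < nwords; ++w) {
    for (int a = 0; a < 2; ++a) {              // which bit of the v1 element
      for (int b = 0; b < 2; ++b) {            // which bit of the v2 element
        uint64_t x = (v1[w] >> a) & lo;
        uint64_t y = (v2[w] >> b) & lo;
        table[1][1] += __builtin_popcountll(x & y);
        table[1][0] += __builtin_popcountll(x & ~y & lo);
        table[0][1] += __builtin_popcountll(~x & y & lo);
        table[0][0] += __builtin_popcountll(~x & ~y & lo);
      }
    }
  }
}
```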

CCC method: mapping to Tensor Cores

• Each vector is replaced by two FP16 vectors, each containing the number of 0s and 1s of each element of the original vector, forming a new matrix of vectors V
• Then taking the dense matrix-matrix product V^T V generates all 2x2 tables for all vector pairs
• HGEMM is applied using a call to cublasGemmEx in the cuBLAS library, giving an identical result to the original method

[Figure: an original 2-bit vector and the corresponding pair of FP16 count vectors (# 0s, # 1s)]
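A minimal sketch of an FP16 Tensor Core GEMM via cublasGemmEx of the kind described above (computing V^T V with FP32 accumulation). Matrix sizes, layout, and error handling are assumed or omitted; this is not CoMet's actual call:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Compute C = V^T * V where V is k x n in FP16 (column-major, leading
// dimension k), accumulating in FP32 on the Tensor Cores.  C is n x n.
void vt_v_hgemm(cublasHandle_t handle, const __half* d_V, float* d_C,
                int n, int k) {
  const float alpha = 1.0f, beta = 0.0f;
  // Allow Tensor Core (mixed-precision) math for this handle.
  cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
  cublasGemmEx(handle,
               CUBLAS_OP_T, CUBLAS_OP_N,        // C = V^T * V
               n, n, k,
               &alpha,
               d_V, CUDA_R_16F, k,              // A = V (k x n), lda = k
               d_V, CUDA_R_16F, k,              // B = V (k x n), ldb = k
               &beta,
               d_C, CUDA_R_32F, n,              // C (n x n) in FP32, ldc = n
               CUDA_R_32F,                      // accumulate in FP32
               CUBLAS_GEMM_ALGO4_TENSOR_OP);    // algorithm found fastest here
}
```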

CoMet performance
• Achieved 2.36 ExaOps (mixed precision ExaFlops) at 4,560 nodes (99% of Summit) using the Tensor Cores
• Near-perfect scaling made possible by Summit's Mellanox InfiniBand fat tree network with adaptive routing
• Equivalent to 86.4 TF per GPU for the whole computation (including communications and transfers)
• > 4X faster than the original bitwise implementation on Summit GPUs

References:
• W. Joubert, J. Nance, D. Weighill, D. Jacobson, "Parallel Accelerated Vector Similarity Calculations for Genomics Applications," Parallel Computing, vol. 75, July 2018, pp. 130-145, https://www.sciencedirect.com/science/article/pii/S016781911830084X
• W. Joubert, J. Nance, S. Climer, D. Weighill, D. Jacobson, "Parallel Accelerated Custom Correlation Coefficient Calculations for Genomics Applications," arXiv 1705.08213 [cs], Parallel Computing, accepted.
• W. Joubert, D. Weighill, D. Kainer, S. Climer, A. Justice, K. Fagnan, D. Jacobson, "Attacking the Opioid Epidemic: Determining the Epistatic and Pleiotropic Genetic Architectures for Chronic Pain and Opioid Addiction," SC18, Gordon Bell finalist, to appear.

Summit Power Consumption

• 2-way CCC/sp/tc @ 4,560 nodes
• Summit power usage for 1 out of 4 phases of the run, duration ~50 sec.
• Avg. power: 11.45 MW (20% higher than HPL)
• 206 GigaOps / Watt

Issues / challenges of using Tensor Cores
• Matrices for this problem are tall and skinny
  – axis order had to be reversed to give a shorter leading matrix dimension for better Tensor Core performance (about 2X faster) (thanks to Sean Treichler of NVIDIA for the suggestion)
• HGEMM performance as a function of matrix size is irregular and hard to precisely predict
  – performed extensive timing tests with the Baidu DeepBench benchmark to try to understand it
  – advisable to pad dimensions up to a multiple of a small power of 2 (e.g., 8, 16, 32) — see the sketch after this list — however too much padding is wasteful
• There are many tuning options for HGEMM (~16 choices for the algorithm setting)
  – determined CUBLAS_GEMM_ALGO4_TENSOR_OP was the best
  – would prefer the default setting to give this performance (hoping for improvements with CUDA 10)
• TC/HGEMM has surprising data-dependent performance: 125 TF theoretical peak, 113 TF achievable on zero-filled matrices, 105 TF peak on random CCC matrices, ~95 TF peak on matrices with fully random FP16 entries
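A minimal sketch of the padding heuristic mentioned above — rounding a GEMM dimension up to a multiple of a small power of 2 before allocating and calling HGEMM (the default multiple of 8 here is an assumption for illustration):

```cpp
// Round a matrix dimension up to the next multiple of `align` (a power of 2),
// e.g. pad_dim(1001, 8) == 1008, pad_dim(1000, 8) == 1000.  Padded rows or
// columns are filled with zeros so they do not change the GEMM result.
inline int pad_dim(int dim, int align = 8) {
  return (dim + align - 1) & ~(align - 1);
}
```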

Issues
• Measurements on 1 Summit GPU using nvidia-smi
• Data-dependent performance of the Tensor Cores is due to 300W power/frequency throttling of the Voltas on Summit
• The Baidu DeepBench GEMM benchmark has a bug (reported): it incorrectly fills FP16 matrices with zeros instead of the intended random values, and thus miscomputes GPU performance

Reduced precision: other possible opportunities
• We are starting to look at other opportunities for using reduced precision for science calculations on Summit
• In the past, scientists have had accuracy concerns and usually required double precision
  – e.g., S3D combustion code: a 2009 paper found single precision not adequate
• New hardware (16X faster HGEMM than DGEMM) may call for a second look (see the sketch after this list)
  – ICL/Dongarra group already developing an iterative refinement dense solver using Tensor Cores (see talk @ SC18, Wed. 4PM)
  – Deep learning projects already seeing high rates, e.g., peak 1.13 ExaOps
  – Previous work on reduced precision iterative solvers, e.g., Turner/Walker 1992 paper on a reduced precision GMRES sparse solver
  – Need to carefully evaluate on a case-by-case basis
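To illustrate the general idea (this is not the ICL/Dongarra Tensor Core solver), a minimal sketch of mixed-precision iterative refinement: the expensive solve is done in single precision, while residuals and the solution update are kept in double precision. The unpivoted Gaussian elimination is purely a placeholder low-precision solver; in practice the low-precision factorization would be computed once and reused:

```cpp
#include <vector>
#include <cmath>

// Placeholder "low precision solver": Gaussian elimination without pivoting
// in FP32.  M (n x n, row-major) and b are copied and overwritten.
static std::vector<float> solve_fp32(std::vector<float> M, std::vector<float> b, int n) {
  for (int k = 0; k < n; ++k)
    for (int i = k + 1; i < n; ++i) {
      float f = M[i*n + k] / M[k*n + k];
      for (int j = k; j < n; ++j) M[i*n + j] -= f * M[k*n + j];
      b[i] -= f * b[k];
    }
  std::vector<float> x(n);
  for (int i = n - 1; i >= 0; --i) {
    float s = b[i];
    for (int j = i + 1; j < n; ++j) s -= M[i*n + j] * x[j];
    x[i] = s / M[i*n + i];
  }
  return x;
}

// Mixed-precision iterative refinement: corrections solved in FP32,
// residual and solution accumulated in FP64 (for well-conditioned A).
std::vector<double> refine(const std::vector<double>& A, const std::vector<double>& b, int n) {
  std::vector<float> Af(A.begin(), A.end());           // demote A once
  std::vector<double> x(n, 0.0), r(b);                 // initial residual r = b
  for (int it = 0; it < 10; ++it) {
    std::vector<float> rf(r.begin(), r.end());
    std::vector<float> d = solve_fp32(Af, rf, n);      // correction in FP32
    for (int i = 0; i < n; ++i) x[i] += d[i];          // update in FP64
    double norm = 0.0;
    for (int i = 0; i < n; ++i) {                      // r = b - A x in FP64
      double s = b[i];
      for (int j = 0; j < n; ++j) s -= A[i*n + j] * x[j];
      r[i] = s;
      norm += s * s;
    }
    if (std::sqrt(norm) < 1e-12) break;                // converged
  }
  return x;
}
```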

Summit: general comments on user experiences
• The most common execution configuration on Summit is 1 MPI rank owning 1 GPU and some CPU cores (like Titan), though some codes are using other configurations, and no doubt users will experiment with still others
• We have requested jsrun options that would allow arbitrary execution configurations on nodes — some users absolutely need this flexibility, e.g., 2 apps need nonuniform resource sets for master/slave execution
• Earlier we saw long jsrun/MPI init times on Summit, especially at large node/rank counts. This has improved considerably.
• The earlier Spectrum MPI beta versions we received had never been run at such high node counts — various issues were encountered and bugs filed — IBM has worked to address them

Summit: general comments
• We would prefer more vendor tuning of third party libraries, as we have had in the past. IBM does give us some optimized build instructions for third party libraries.
• A more general concern regarding the broader industry: every new HPC system we get has a more complex node hardware and software stack. We hope HPC vendors very carefully manage this complexity. Users want and need advanced hardware features, but they also need reliable, coherent software to access them.
• Similarly, users mix programming models, e.g., MPI, OpenMP, OpenACC, Kokkos, etc., sometimes in complex ways. We need interoperability and coherency between these (example: can an OpenMP thread launch an OpenACC kernel?)

Summit: general comments
• The GPU high-speed atomic update operations of Volta (and Pascal) have made a huge impact on some applications
• Unified memory, with automatic migration of data to the GPU, is very helpful for some codes, e.g., codes with deep data structures. However, some users will prefer manual control of the exact timing of transfers for performance.
• Most codes that run at our center also run at other sites. Use of vendor-specific libraries or features that give higher performance may be avoided by some users to maintain portability. We prefer standards-based approaches when possible.
• MPS will be used by some, but can add complexity, e.g., the need to manage CUDA memory handles. MPS also adds to the myriad of complexities to manage (resource sets, ranks per node, SMT mode, NUMA domains, thread binding, GPU binding, etc.).

Summit: general comments
• We like having multiple compilers for risk mitigation, but there may not be any single compiler satisfying all requirements for a project, e.g., OpenACC, fast CPU code generation, etc. Also, Fortran is important, used by slightly under half of projects (INCITE 2014).
• Features like RDMA and GPUDirect are important to users. RDMA is needed by at least one library (ExaTensor) used by 3 CAAR projects.
• Because of Titan, we have already optimized many of our codes for CPU-GPU interconnect bandwidth (overlapped transfers, data persistence on GPUs) and latency (large transfers, longer-running kernels). However, some users still need to run many kernels, e.g., QMCPACK, and thus still rely on low-latency kernel launch.
• Inter-node messages come in many possible sizes depending on the app, e.g., halo exchanges (e.g., S3D-Legion), large (~1 GB) messages (ScaLAPACK, CoMet, SLATE), and small latency-limited messages (climate codes) — teams will work to optimize each of these cases.

Conclusions
• Summit has shown itself to be a very powerful system for multiple applications so far
• We have worked with IBM and other partners to resolve issues
• We are looking forward to the new science that Summit will make possible in the near future

Questions?
Wayne Joubert, [email protected]

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
