Towards Exascale Computing with the Atmospheric Model NUMA

Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo
Department of Applied Mathematics, Naval Postgraduate School, Monterey (California), USA

Tim Warburton
Virginia Tech, Blacksburg (Virginia), USA

• NUMA = Non-hydrostatic Unified Model of the Atmosphere
• dynamical core inside the Navy's next-generation weather prediction system NEPTUNE (Navy's Environment Prediction sysTem Using the Numa Engine)
• developed by Prof. Francis X. Giraldo and generations of postdocs

Goal

• NOAA HIWPP project plan, goal for 2020: ~3 km – 3.5 km global resolution within operational requirements
• Achieved with NUMA: baroclinic wave test case at 3.0 km within 4.15 minutes runtime per one-day forecast on the supercomputer Mira (double precision, no shallow-atmosphere approximation, arbitrary terrain, IMEX in the vertical)
• Expected: 2 km with further optimizations

Communication between processors

[Figure: domain split between CPU 1 and CPU 2, comparing the communication footprint of a 4th order finite difference stencil with that of compact stencil methods (like in NUMA)]

• a 4th order finite difference stencil reaches two grid points across the processor boundary, so multiple layers of halo data must be exchanged
• compact stencil methods (like in NUMA) only exchange data on the shared element faces
• NUMA should scale extremely well
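To make the communication pattern concrete, here is a minimal, hypothetical sketch (plain C with MPI, not NUMA's actual code) of a one-dimensional halo exchange. For a 4th order finite difference the ghost-layer width would be 2; for a compact-stencil method it is 1, i.e. only the shared face data:

#include <mpi.h>

/* Layout of f: [width ghost | nx interior | width ghost] slabs,
 * each slab holding ny values. 'left' and 'right' are neighbor ranks
 * (MPI_PROC_NULL at a physical boundary). */
void halo_exchange(double *f, int nx, int ny, int width,
                   int left, int right, MPI_Comm comm)
{
  const int count = width * ny;   /* values per halo block */

  /* send leftmost interior block to the left neighbor,
   * receive the right ghost block from the right neighbor */
  MPI_Sendrecv(f + count,             count, MPI_DOUBLE, left,  0,
               f + (width + nx) * ny, count, MPI_DOUBLE, right, 0,
               comm, MPI_STATUS_IGNORE);

  /* send rightmost interior block to the right neighbor,
   * receive the left ghost block from the left neighbor */
  MPI_Sendrecv(f + nx * ny,           count, MPI_DOUBLE, right, 1,
               f,                     count, MPI_DOUBLE, left,  1,
               comm, MPI_STATUS_IGNORE);
}

The amount of data exchanged scales with the ghost-layer width, which is why the compact stencils in NUMA keep the communication volume small.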

Fastest Supercomputers of the World according to top500.org

#  NAME        COUNTRY  TYPE            PEAK (PFLOPS)
1  TaihuLight  China    Sunway          125.4
2  Tianhe-2    China    Intel Xeon Phi   54.9
3  Titan       USA      NVIDIA GPU       27.1
4  Sequoia     USA      IBM BlueGene     20.1
5  K computer  Japan    Fujitsu CPU      11.2
6  Mira        USA      IBM BlueGene     10.0

1 PFlops = 10^15 floating point operations per second

Overview

• Mira (Blue Gene), strategy: optimize one setup for one specific computer
• Titan (GPUs), strategy: keep all options, portability on CPUs and GPUs
• Intel Knights Landing

Mira: Blue Gene
optimize one setup for this specific computer

Optimization of main computations
runtime of the computational kernel on 30,720 CPUs:

• 1 MPI process per core, p=6: 11.8 sec.
• p=3: 9.4 sec.
• 4 MPI processes per core: 4.0 sec.
• LAPACK: 3.9 sec.
• rewritten for optimized compiler vectorization: 2.1 sec.
• BG/Q vector intrinsics: 1.0 sec.
• OpenMP: 1.1 sec.
• current version: 0.7 sec.
• optimized intensity (theoretical estimate): 0.5 sec.

more details: see Müller et al., https://arxiv.org/abs/1511.01561
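As an illustration of the "rewritten for optimized compiler vectorization" step, the sketch below (plain C, not NUMA's actual Fortran; names and sizes are hypothetical) shows the kind of loop structure the compiler can auto-vectorize: a fixed trip count and unit-stride access over the nodal points of each element.

#define NP 64   /* nodal points per element, e.g. (p+1)^3 = 4^3 for p=3 */

/* apply an NP x NP derivative matrix D to field q, element by element */
void apply_derivative(const double *restrict D,
                      const double *restrict q,
                      double *restrict dq,
                      int nelem)
{
  for (int e = 0; e < nelem; ++e) {
    const double *qe  = q  + e * NP;
    double       *dqe = dq + e * NP;
    for (int i = 0; i < NP; ++i) {
      double sum = 0.0;
      /* inner loop: unit stride, fixed length -> candidate for 4-wide vectorization on BG/Q */
      for (int j = 0; j < NP; ++j)
        sum += D[i * NP + j] * qe[j];
      dqe[i] = sum;
    }
  }
}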

Measurements: roofline plot
rising bubble test case on Mira

[Figure: roofline plot of GFlops/s per node vs. operational intensity (Flops/Byte), with attainable performance bounded by peak GFlops and peak bandwidth; data points show the different optimization stages; blue: entire timeloop, red: main computational kernel (createrhs); improvements of 5.5x and 9.8x bring both close to the theoretical optimum]
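For reference, the attainable performance shown by the roofline in these plots is the standard roofline bound

\[ P_{\text{attainable}} = \min\left( P_{\text{peak}},\; B_{\text{peak}} \cdot I \right), \]

where $I$ is the operational intensity in Flops/Byte, $P_{\text{peak}}$ the peak floating point rate, and $B_{\text{peak}}$ the peak memory bandwidth of the node.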

Strong scaling with NUMA
1.8 billion grid points (3.0 km horizontal, 31 levels vertical)

[Figure: strong scaling for the baroclinic instability (p=3, 3.0 km horizontal resolution, 31 dof in the vertical); model days per wallclock day for total runtime and computation time vs. number of threads; 99.1% strong scaling efficiency at 3.14 million threads]
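The strong scaling efficiency quoted here is the usual ratio of measured to ideal speedup for a fixed problem size,

\[ E(N) = \frac{T(N_0)\, N_0}{T(N)\, N}, \]

where $T(N)$ is the runtime on $N$ threads and $N_0$ is the smallest thread count used as reference.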

12.1% of theoretical peak (flops)
baroclinic instability, p=3, 3.0 km horizontal resolution

[Figure: percentage of theoretical peak vs. number of threads (10^4 to 10^7) for total, filter, createrhs and IMEX; 1.21 PFlops sustained on the entire Mira system]
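As a consistency check against the top500 table above: 1.21 PFlops sustained out of Mira's 10.0 PFlops peak gives 1.21 / 10.0 ≈ 12.1% of theoretical peak.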

Where are we heading?
dynamics within 4.5 minutes runtime per one-day forecast

[Figure: number of threads required vs. resolution in km (baroclinic instability, p=3, 31 vertical layers, 128 columns per node); curves for the current version and a fully optimized version, with the size of Mira and of the expected Aurora (2018?) system marked]

Titan: GPUs
keep all options, portability on CPUs and GPUs

Titan

• 18,688 CPU nodes (16 cores each)
• 18,688 NVIDIA GPUs (2,688 CUDA cores each)
• nodes connected by the network

OCCA2: unified threading model
(slide courtesy of Lucas Wilcox and Tim Warburton)

Portability & extensibility: device-independent kernel language (or OpenCL / CUDA) and native host APIs.

[Figure: OCCA2 architecture; host language APIs and the kernel languages (OCCA OKL, Floopy) map onto back ends such as pThreads, Intel COI, and CUDA-x86, targeting CPU processors, Xeon Phi, AMD GPUs, NVIDIA GPUs, and FPGAs]

Available at: https://github.com/tcew/OCCA2

Floopy / OCCA2 code
(slide courtesy of Lucas Wilcox and Tim Warburton)

[Figure: example of Floopy / OCCA2 kernel code]
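The kernel code on the original slide is not recoverable here; as a purely illustrative stand-in (not NUMA code, and written in the newer OCCA OKL syntax, which may differ from the OCCA2 version shown on the slide), a device-independent kernel in the C-like OKL language looks roughly like this:

// Hypothetical sketch, not NUMA code: OCCA translates the @outer/@inner
// loop annotations into OpenMP threads, CUDA thread blocks, or OpenCL
// work-groups at runtime, so the same kernel source runs on CPUs and GPUs.
@kernel void scaleAdd(const int N,
                      const double alpha,
                      const double *x,
                      double *y) {
  for (int i = 0; i < N; ++i; @tile(128, @outer, @inner)) {
    y[i] = alpha * x[i] + y[i];   // y <- alpha*x + y, one entry per thread
  }
}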

NUMA - Scalability on GPU cluster

90% weak scaling efficiency up to 16,384 K20x GPUs (88% of the supercomputer Titan), using 900 elements per GPU and polynomial order 8 (up to 10.7 billion grid points)

[Figure: weak scaling efficiency (%) vs. number of GPUs (1 to 16,384) for CG without overlap, DG without overlap, and DG with overlap; 90% weak scaling efficiency using DG with communication overlap]

Abdi et al.: Acceleration of the Implicit-Explicit Non-hydrostatic Unified Model of the Atmosphere (NUMA) on Manycore Processors, IJHPCA (submitted, 2016)
pre-print and more information: http://frankgiraldo.wixsite.com/mysite/numa

roofline plot: single precision
on the NVIDIA K20X GPU on Titan

[Figure: single precision roofline (3520 GFLOPS/s peak, 208 GB/s peak bandwidth) and achieved GFLOPS/s and GB/s vs. polynomial order N_p for the horizontal volume, vertical volume, update, and project kernels]

From the figure caption in Abdi et al.: kernel efficiency is tested on a mini-app developed for this purpose, with FLOPS and bytes counted manually; the volume kernel, split into horizontal and vertical parts, has the highest FLOPS/s rate, while the time-step update kernel has the highest bandwidth usage at 208 GB/s. Single precision (SP) and double precision (DP) performance of the main CG and DG kernels is shown in terms of GFLOPS/s, GB/s, and roofline plots (the mini-app measurements were done on a Tesla K20c).

roofline plot: double precision
on the NVIDIA K20X GPU on Titan

[Figure: double precision roofline (1170 GFLOPS/s peak, 208 GB/s peak bandwidth) and achieved GFLOPS/s and GB/s vs. polynomial order N_p for the horizontal volume, vertical volume, update, and project kernels]

some more results from Titan

• GPU: up to 15 times faster than the original version on one CPU node for single precision (CAM-SE and other models: less than 5x speed-up)
• CPU: single precision about 30% faster than double
• GPU: single precision about 2x faster than double

                         CPU node (16-core AMD Opteron 6274)   GPU (NVIDIA K20X)
peak double performance  141 GFlops/s                          1.31 TFlops/s
peak bandwidth           32 GB/s                               208 GB/s
peak power               115 W                                 235 W

Intel Knights Landing

Knights Landing: roofline plot
OCCA version of NUMA

[Figure: roofline plot (2100 GFLOPS/s peak) of achieved GFLOPS/s vs. operational intensity (GFLOPS/GB) for the volume kernel, diffusion kernel, update kernel of ARK, create_rhs_gmres_no_schur, create_lhs_gmres_no_schur_set2c, and extract_q_gmres_no_schur]

Knights Landing: weak scaling on KNL cluster
preliminary scalability results, work in progress

[Figure: weak scaling efficiency (%) vs. number of KNL nodes (2 to 16)]

Summary

• Mira:
  – 3.0 km baroclinic instability within 4.15 minutes runtime per one-day forecast
  – 99.1% strong scaling efficiency on Mira, 1.21 PFlops
• Titan:
  – 90% weak scaling efficiency on 16,384 GPUs
  – GPU: runs up to 15x faster than on one CPU node
• Intel Knights Landing:
  – looks good but still room for improvement

Thank you for your attention!