Towards Exascale Computing with the Atmospheric Model NUMA

Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo
Department of Applied Mathematics, Naval Postgraduate School, Monterey (California), USA
Tim Warburton
Virginia Tech, Blacksburg (Virginia), USA

• NUMA = Non-hydrostatic Unified Model of the Atmosphere
• dynamical core inside the Navy's next-generation weather prediction system NEPTUNE (Navy's Environment Prediction sysTem Using the Numa Engine)
• developed by Prof. Francis X. Giraldo and generations of postdocs

Goal
• NOAA HIWPP project plan, goal for 2020: ~3 km – 3.5 km global resolution within operational requirements
• Achieved with NUMA: baroclinic wave test case at 3.0 km within 4.15 minutes per one-day forecast on the supercomputer Mira (double precision, no shallow-atmosphere approximation, arbitrary terrain, IMEX in the vertical)
• Expected: 2 km with further optimizations

Communication between processors
• 4th-order finite difference: the stencil reaches two grid points across the boundary between CPU 1 and CPU 2, so a halo that is two points deep must be exchanged
• compact-stencil methods (like in NUMA): only the values on the shared element faces between neighboring processors need to be exchanged
• NUMA should scale extremely well
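A toy count of the data exchanged per interface illustrates the point (illustrative sizes and a simplified model, not NUMA's actual communication code):

```python
def halo_doubles(face_points, halo_depth):
    """Values exchanged across one processor interface per variable:
    one face worth of points for each halo layer."""
    return face_points * halo_depth

face = 100 * 100  # hypothetical 100x100 interface between two ranks

# 4th-order centered finite difference: the stencil reaches two grid
# points across the interface, so the halo is two layers deep.
fd = halo_doubles(face, halo_depth=2)

# Compact-stencil element methods (like NUMA): neighbors exchange only
# the values on the shared element faces (one layer).
numa_like = halo_doubles(face, halo_depth=1)

print(fd // numa_like)  # → 2
```

The factor grows with stencil order, while compact-stencil methods keep the exchange at a single face layer regardless of polynomial order.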
Fastest Supercomputers of the World according to top500.org
 #  NAME        COUNTRY  TYPE            PEAK (PFLOPS)
 1  TaihuLight  China    Sunway          125.4
 2  Tianhe-2    China    Intel Xeon Phi   54.9
 3  Titan       USA      NVIDIA GPU       27.1
 4  Sequoia     USA      IBM BlueGene     20.1
 5  K computer  Japan    Fujitsu CPU      11.2
 6  Mira        USA      IBM BlueGene     10.0

1 PFlops = 10^15 floating point operations per second
overview
• Mira (IBM Blue Gene): strategy: optimize one setup for one specific computer
• Titan (NVIDIA GPUs): strategy: keep all options, portability on CPUs and GPUs
• Intel Knights Landing

Mira: Blue Gene: optimize one setup for this specific computer
Optimization of main computations
runtime of the computational kernel on 30,720 CPUs

• 1 MPI process per core, p=6:                    11.8 sec.
• p=3:                                             9.4 sec.
• 4 MPI processes per core:                        4.0 sec.
• Lapack:                                          3.9 sec.
• rewritten for optimized compiler vectorization:  2.1 sec.
• BG/Q vector intrinsics:                          1.0 sec.
• OpenMP:                                          1.1 sec.
• current version:                                 0.7 sec.
• optimized intensity (theoretical estimate):      0.5 sec.

more details: see Müller et al., https://arxiv.org/abs/1511.01561
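The step from 3.9 to 2.1 seconds came from rewriting the kernel so the compiler could vectorize it. A hypothetical illustration of the same loop-to-matrix-product restructuring (NUMA itself is Fortran; this NumPy sketch only shows the idea, with made-up sizes):

```python
import numpy as np

def derivative_loop(D, q):
    """Naive per-point loops: hard for a compiler/runtime to vectorize."""
    ne, n = q.shape
    out = np.zeros_like(q)
    for e in range(ne):
        for i in range(n):
            for j in range(n):
                out[e, i] += D[i, j] * q[e, j]
    return out

def derivative_vectorized(D, q):
    """Same computation restructured as one dense matrix product
    applied to all elements at once."""
    return q @ D.T

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 4))   # 1-D derivative matrix for order p=3
q = rng.standard_normal((10, 4))  # 10 elements, 4 nodes each

assert np.allclose(derivative_loop(D, q), derivative_vectorized(D, q))
```

Batching all elements into one contiguous operation is the same design choice that lets the BG/Q compiler (and later the vector intrinsics) keep the floating point units busy.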
Measurements: roofline plot (rising bubble test case on Mira)

[figure: GFlops/s per node vs. operational intensity (Flops/Byte); roofline given by peak GFlops and peak bandwidth; data points mark the different optimization stages]
• blue: entire timeloop, improved 5.5x across the optimization stages
• red: main computational kernel (createrhs), improved 9.8x, reaching the theoretical optimum
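The roofline bound itself is simple to state; a minimal sketch (the 204.8 GFlops/s node peak follows from Mira's BG/Q specs, while the ~28.5 GB/s sustained bandwidth is an assumed illustrative value):

```python
def roofline(peak_gflops, peak_gb_s, intensity):
    """Attainable GFlops/s at a given operational intensity
    (Flops/Byte): min(compute roof, bandwidth * intensity)."""
    return min(peak_gflops, peak_gb_s * intensity)

PEAK = 204.8  # GFlops/s per BG/Q node: 16 cores x 8 flops/cycle x 1.6 GHz
BW = 28.5     # GB/s sustained memory bandwidth (assumed value)

for oi in (0.25, 1.0, 8.0, 32.0):
    print(f"OI {oi:5.2f} Flops/Byte -> {roofline(PEAK, BW, oi):.1f} GFlops/s")
```

Below the ridge point a kernel is bandwidth-bound, which is why the optimizations above targeted operational intensity as well as raw flops.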
Strong scaling with NUMA: 1.8 billion grid points (3.0 km horizontal, 31 levels vertical)
baroclinic instability, p=3, 3.0 km horizontal resolution, 31 degrees of freedom in the vertical

[figure: total runtime and computation time (seconds), plus model days per wallclock day, vs. number of threads up to 3.14 million]
• 99.1% strong scaling efficiency at 3.14M threads
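Strong-scaling efficiency compares the achieved speedup with the ideal speedup at a fixed problem size; a sketch with hypothetical runtimes (not the measured NUMA data):

```python
def strong_scaling_efficiency(t_ref, n_ref, t, n):
    """Strong scaling: fixed problem size. Efficiency is the achieved
    speedup t_ref/t divided by the ideal speedup n/n_ref."""
    return (t_ref / t) / (n / n_ref)

# Hypothetical runtimes: doubling the thread count from 1.57M to 3.14M
# should ideally halve the runtime.
eff = strong_scaling_efficiency(t_ref=400.0, n_ref=1.57e6, t=202.0, n=3.14e6)
print(f"{eff:.1%}")  # → 99.0%
```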
12.1% of theoretical peak (flops)
baroclinic instability, p=3, 3.0 km horizontal resolution

[figure: percentage of peak flops vs. number of threads (10^4 to 10^7) for the total, filter, createrhs, and IMEX parts]
• 1.21 PFlops sustained on the entire Mira system
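The headline number is consistent with Mira's peak from the top500 table earlier in the talk:

```python
# 1.21 PFlops sustained vs. 10.0 PFlops theoretical peak for Mira
sustained, peak = 1.21, 10.0  # PFlops
print(f"{sustained / peak:.1%} of theoretical peak")  # → 12.1% of theoretical peak
```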
Where are we heading? dynamics within 4.5 minutes runtime per one-day forecast
baroclinic instability, p=3, 31 vertical layers, 128 columns per node

[figure: number of threads required (10^3 to 10^7) vs. resolution in km (1 to 100), for the current version and a fully optimized version on Mira, with an extrapolation to Aurora (2018?)]
Titan: GPUs
strategy: keep all options, portability on CPUs and GPUs

Titan
• 18,688 CPU nodes (16 cores each), connected by a network
• 18,688 NVIDIA GPUs (2,688 CUDA cores each)
OCCA2: unified threading model (slide courtesy of Lucas Wilcox and Tim Warburton)

Portability & extensibility: device-independent kernel language (or OpenCL / CUDA) and native host APIs.

[diagram: host languages (C, C++, C#, ...) and the OCCA OKL kernel language (fed from Floopy) map through back ends (pThreads, OpenCL, CUDA, Intel COI, CUDA-x86) to devices (CPU processors, Xeon Phi, AMD GPUs, NVIDIA GPUs, FPGAs)]

Available at: https://github.com/tcew/OCCA2
Floopy / OCCA2 Code
[code example slide, courtesy of Lucas Wilcox and Tim Warburton]
NUMA: scalability on a GPU cluster

• 90% weak scaling efficiency up to 16,384 K20X GPUs (88% of the supercomputer Titan)
• 900 elements per GPU, polynomial order 8 (up to 10.7 billion grid points)
• achieved using DG with communication overlap

[figure: weak scaling efficiency (%) vs. number of GPUs (1 to 16,384), for CG without overlap, DG without overlap, and DG with overlap]

Abdi et al.: Acceleration of the Implicit-Explicit Non-hydrostatic Unified Model of the Atmosphere (NUMA) on Manycore Processors, IJHPCA (submitted, 2016)
pre-print and more information: http://frankgiraldo.wixsite.com/mysite/numa
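The 10.7 billion figure is consistent with the per-GPU workload, assuming hexahedral elements with (p+1)^3 nodal points each (the standard count for an element-based discretization like NUMA's):

```python
# Sanity check of the grid-point count quoted above: 900 hexahedral
# elements per GPU at polynomial order p = 8, i.e. (p+1)^3 nodes each.
p = 8
points_per_element = (p + 1) ** 3            # 729 nodes per element
points_per_gpu = 900 * points_per_element    # 656,100 points per GPU
total = 16384 * points_per_gpu
print(f"{total / 1e9:.1f} billion grid points")  # → 10.7 billion grid points
```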
Roofline plot: single precision on the NVIDIA K20X GPU on Titan
[figure: GFlops/s and GB/s for the horizontal volume, vertical volume, update, and project kernels at polynomial orders N = 1 to 11, with the corresponding roofline (3,520 GFlops/s single-precision peak, 208 GB/s peak bandwidth); the volume kernel, split into horizontal and vertical parts, has the highest flop rate, and the time-step update kernel reaches the highest bandwidth usage at 208 GB/s; flops and bytes counted manually in a dedicated mini-app]
Roofline plot: double precision on the NVIDIA K20X GPU on Titan

[figure: same kernels as in the single-precision plot; roofline with 1,170 GFlops/s double-precision peak and 208 GB/s peak bandwidth]

some more results from Titan
• GPU: up to 15 times faster than the original version on one CPU node for single precision (CAM-SE and other models: less than 5x speed-up)
• CPU: single precision about 30% faster than double precision
• GPU: single precision about 2x faster than double precision

                   CPU node: 16-core       GPU: NVIDIA K20X
                   AMD Opteron 6274
peak performance   141 GFlops/s (double)   1.31 TFlops/s (double)
peak bandwidth     32 GB/s                 208 GB/s
peak power         115 W                   235 W
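The table's peak numbers put the observed speed-ups in context (simple ratios computed from the values above, not measurements):

```python
# Hardware ratios implied by the CPU-vs-GPU table
cpu_gflops, gpu_gflops = 141.0, 1310.0  # double-precision peak, GFlops/s
cpu_bw, gpu_bw = 32.0, 208.0            # peak bandwidth, GB/s
cpu_watt, gpu_watt = 115.0, 235.0       # peak power, W

print(f"peak flops ratio:     {gpu_gflops / cpu_gflops:.1f}x")  # → 9.3x
print(f"peak bandwidth ratio: {gpu_bw / cpu_bw:.1f}x")          # → 6.5x
print(f"flops/W ratio:        {(gpu_gflops / gpu_watt) / (cpu_gflops / cpu_watt):.1f}x")  # → 4.5x
```

For bandwidth-bound kernels the 6.5x bandwidth ratio is the tighter bound; the observed 15x single-precision speed-up exceeds it in part because the baseline is the original, less optimized CPU version.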
Intel Knights Landing

Knights Landing: roofline plot (OCCA version of NUMA)

[figure: GFlops/s vs. operational intensity (GFlops/GB) for the volume, diffusion, and ARK update kernels and for create_rhs_gmres_no_schur, create_lhs_gmres_no_schur_set2c, and extract_q_gmres_no_schur; roofline indicating 2,100 GFlops/s peak performance and 440 GB/s peak bandwidth]
Knights Landing: preliminary weak scaling results on a KNL cluster (work in progress)

[figure: scaling efficiency (%) vs. number of KNL nodes (2 to 16)]
Summary

• Mira:
  – 3.0 km baroclinic instability within 4.15 minutes runtime per one-day forecast
  – 99.1% strong scaling efficiency on Mira, 1.21 PFlops
• Titan:
  – 90% weak scaling efficiency on 16,384 GPUs
  – GPU runs up to 15x faster than one CPU node
• Intel Knights Landing:
  – looks good, but still room for improvement
Thank you for your attention!