Towards Exascale Computing with the Atmospheric Model NUMA

Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo
Department of Applied Mathematics, Naval Postgraduate School, Monterey (California), USA

Tim Warburton
Virginia Tech, Blacksburg (Virginia), USA

• NUMA = Non-hydrostatic Unified Model of the Atmosphere
• dynamical core inside the Navy's next-generation weather prediction system NEPTUNE (Navy's Environment Prediction sysTem Using the Numa Engine)
• developed by Prof. Francis X. Giraldo and generations of postdocs

Goal

• NOAA HIWPP project plan, goal for 2020: ~3 km – 3.5 km global resolution within operational requirements
• Achieved with NUMA: baroclinic wave test case at 3.0 km within 4.15 minutes runtime per one-day forecast on the supercomputer Mira (double precision, no shallow-atmosphere approximation, arbitrary terrain, IMEX in the vertical)
• Expected: 2 km with further optimizations

Communication between processors

[Figure: domain split between CPU 1 and CPU 2, comparing the communication footprint of a 4th order finite difference stencil with that of compact stencil methods (like in NUMA)]

• a 4th order finite difference stencil reaches two grid points across the processor boundary, so multiple layers of halo data must be exchanged
• compact stencil methods (like in NUMA) only exchange data on the shared element faces
• NUMA should scale extremely well
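To make the communication pattern concrete, here is a minimal, hypothetical sketch (plain C with MPI, not NUMA's actual code) of a one-dimensional halo exchange. For a 4th order finite difference the ghost-layer width would be 2; for a compact-stencil method it is 1, i.e. only the shared face data:

#include <mpi.h>

/* Layout of f: [width ghost | nx interior | width ghost] slabs,
 * each slab holding ny values. 'left' and 'right' are neighbor ranks
 * (MPI_PROC_NULL at a physical boundary). */
void halo_exchange(double *f, int nx, int ny, int width,
                   int left, int right, MPI_Comm comm)
{
  const int count = width * ny;   /* values per halo block */

  /* send leftmost interior block to the left neighbor,
   * receive the right ghost block from the right neighbor */
  MPI_Sendrecv(f + count,             count, MPI_DOUBLE, left,  0,
               f + (width + nx) * ny, count, MPI_DOUBLE, right, 0,
               comm, MPI_STATUS_IGNORE);

  /* send rightmost interior block to the right neighbor,
   * receive the left ghost block from the left neighbor */
  MPI_Sendrecv(f + nx * ny,           count, MPI_DOUBLE, right, 1,
               f,                     count, MPI_DOUBLE, left,  1,
               comm, MPI_STATUS_IGNORE);
}

The amount of data exchanged scales with the ghost-layer width, which is why the compact stencils in NUMA keep the communication volume small.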

Fastest Supercomputers of the World according to top500.org

#  NAME        COUNTRY  TYPE            PEAK (PFLOPS)
1  TaihuLight  China    Sunway          125.4
2  Tianhe-2    China    Intel Xeon Phi   54.9
3  Titan       USA      NVIDIA GPU       27.1
4  Sequoia     USA      IBM BlueGene     20.1
5  K computer  Japan    Fujitsu CPU      11.2
6  Mira        USA      IBM BlueGene     10.0

1 PFlops = 10^15 floating point operations per second

Overview

• Mira (Blue Gene), strategy: optimize one setup for one specific computer
• Titan (GPUs), strategy: keep all options, portability on CPUs and GPUs
• Intel Knights Landing

Mira: Blue Gene
optimize one setup for this specific computer

Optimization of main computations
runtime of the computational kernel on 30,720 CPUs:

• 1 MPI process per core, p=6: 11.8 sec.
• p=3: 9.4 sec.
• 4 MPI processes per core: 4.0 sec.
• LAPACK: 3.9 sec.
• rewritten for optimized compiler vectorization: 2.1 sec.
• BG/Q vector intrinsics: 1.0 sec.
• OpenMP: 1.1 sec.
• current version: 0.7 sec.
• optimized intensity (theoretical estimate): 0.5 sec.

more details: see Müller et al., https://arxiv.org/abs/1511.01561
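As an illustration of the "rewritten for optimized compiler vectorization" step, the sketch below (plain C, not NUMA's actual Fortran; names and sizes are hypothetical) shows the kind of loop structure the compiler can auto-vectorize: a fixed trip count and unit-stride access over the nodal points of each element.

#define NP 64   /* nodal points per element, e.g. (p+1)^3 = 4^3 for p=3 */

/* apply an NP x NP derivative matrix D to field q, element by element */
void apply_derivative(const double *restrict D,
                      const double *restrict q,
                      double *restrict dq,
                      int nelem)
{
  for (int e = 0; e < nelem; ++e) {
    const double *qe  = q  + e * NP;
    double       *dqe = dq + e * NP;
    for (int i = 0; i < NP; ++i) {
      double sum = 0.0;
      /* inner loop: unit stride, fixed length -> candidate for 4-wide vectorization on BG/Q */
      for (int j = 0; j < NP; ++j)
        sum += D[i * NP + j] * qe[j];
      dqe[i] = sum;
    }
  }
}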

Measurements: roofline plot
rising bubble test case on Mira

[Figure: roofline plot of GFlops/s per node vs. operational intensity (Flops/Byte), with attainable performance bounded by peak GFlops and peak bandwidth; data points show the different optimization stages; blue: entire timeloop, red: main computational kernel (createrhs); improvements of 5.5x and 9.8x bring both close to the theoretical optimum]
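For reference, the attainable performance shown by the roofline in these plots is the standard roofline bound

\[ P_{\text{attainable}} = \min\left( P_{\text{peak}},\; B_{\text{peak}} \cdot I \right), \]

where $I$ is the operational intensity in Flops/Byte, $P_{\text{peak}}$ the peak floating point rate, and $B_{\text{peak}}$ the peak memory bandwidth of the node.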

Strong scaling with NUMA
1.8 billion grid points (3.0 km horizontal, 31 levels vertical)

[Figure: strong scaling for the baroclinic instability (p=3, 3.0 km horizontal resolution, 31 dof in the vertical); model days per wallclock day for total runtime and computation time vs. number of threads; 99.1% strong scaling efficiency at 3.14 million threads]
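The strong scaling efficiency quoted here is the usual ratio of measured to ideal speedup for a fixed problem size,

\[ E(N) = \frac{T(N_0)\, N_0}{T(N)\, N}, \]

where $T(N)$ is the runtime on $N$ threads and $N_0$ is the smallest thread count used as reference.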

12.1% of theoretical peak (flops)
baroclinic instability, p=3, 3.0 km horizontal resolution

[Figure: percentage of theoretical peak vs. number of threads (10^4 to 10^7) for total, filter, createrhs and IMEX; 1.21 PFlops sustained on the entire Mira system]
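As a consistency check against the top500 table above: 1.21 PFlops sustained out of Mira's 10.0 PFlops peak gives 1.21 / 10.0 ≈ 12.1% of theoretical peak.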

Where are we heading?
dynamics within 4.5 minutes runtime per one-day forecast

[Figure: number of threads required vs. resolution in km (baroclinic instability, p=3, 31 vertical layers, 128 columns per node); curves for the current version and a fully optimized version, with the size of Mira and of the expected Aurora (2018?) system marked]

Titan: GPUs
keep all options, portability on CPUs and GPUs

Titan

• 18,688 CPU nodes (16 cores each)
• 18,688 NVIDIA GPUs (2,688 CUDA cores each)
• nodes connected by the network

OCCA2: unified threading model
(slide courtesy of Lucas Wilcox and Tim Warburton)

Portability & extensibility: device-independent kernel language (or OpenCL / CUDA) and native host APIs.

[Figure: OCCA2 architecture; host language APIs and the kernel languages (OCCA OKL, Floopy) map onto back ends such as pThreads, Intel COI, and CUDA-x86, targeting CPU processors, Xeon Phi, AMD GPUs, NVIDIA GPUs, and FPGAs]

Available at: https://github.com/tcew/OCCA2

Floopy / OCCA2 code
(slide courtesy of Lucas Wilcox and Tim Warburton)

[Figure: example of Floopy / OCCA2 kernel code]
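The kernel code on the original slide is not recoverable here; as a purely illustrative stand-in (not NUMA code, and written in the newer OCCA OKL syntax, which may differ from the OCCA2 version shown on the slide), a device-independent kernel in the C-like OKL language looks roughly like this:

// Hypothetical sketch, not NUMA code: OCCA translates the @outer/@inner
// loop annotations into OpenMP threads, CUDA thread blocks, or OpenCL
// work-groups at runtime, so the same kernel source runs on CPUs and GPUs.
@kernel void scaleAdd(const int N,
                      const double alpha,
                      const double *x,
                      double *y) {
  for (int i = 0; i < N; ++i; @tile(128, @outer, @inner)) {
    y[i] = alpha * x[i] + y[i];   // y <- alpha*x + y, one entry per thread
  }
}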

NUMA - Scalability on GPU cluster

90% weak scaling efficiency up to 16,384 K20x GPUs (88% of the supercomputer Titan), using 900 elements per GPU and polynomial order 8 (up to 10.7 billion grid points)

[Figure: weak scaling efficiency (%) vs. number of GPUs (1 to 16,384) for CG without overlap, DG without overlap, and DG with overlap; 90% weak scaling efficiency using DG with communication overlap]

Abdi et al.: Acceleration of the Implicit-Explicit Non-hydrostatic Unified Model of the Atmosphere (NUMA) on Manycore Processors, IJHPCA (submitted, 2016)
pre-print and more information: http://frankgiraldo.wixsite.com/mysite/numa

roofline plot: single precision
on the NVIDIA K20X GPU on Titan

[Figure: single precision roofline (3520 GFLOPS/s peak, 208 GB/s peak bandwidth) and achieved GFLOPS/s and GB/s vs. polynomial order N_p for the horizontal volume, vertical volume, update, and project kernels]

From the figure caption in Abdi et al.: kernel efficiency is tested on a mini-app developed for this purpose, with FLOPS and bytes counted manually; the volume kernel, split into horizontal and vertical parts, has the highest FLOPS/s rate, while the time-step update kernel has the highest bandwidth usage at 208 GB/s. Single precision (SP) and double precision (DP) performance of the main CG and DG kernels is shown in terms of GFLOPS/s, GB/s, and roofline plots (the mini-app measurements were done on a Tesla K20c).

roofline plot: double precision
on the NVIDIA K20X GPU on Titan

[Figure: double precision roofline (1170 GFLOPS/s peak, 208 GB/s peak bandwidth) and achieved GFLOPS/s and GB/s vs. polynomial order N_p for the horizontal volume, vertical volume, update, and project kernels]

some more results from Titan

• GPU: up to 15 times faster than the original version on one CPU node for single precision (CAM-SE and other models: less than 5x speed-up)
• CPU: single precision about 30% faster than double
• GPU: single precision about 2x faster than double

                         CPU node (16-core AMD Opteron 6274)   GPU (NVIDIA K20X)
peak double performance  141 GFlops/s                          1.31 TFlops/s
peak bandwidth           32 GB/s                               208 GB/s
peak power               115 W                                 235 W

Intel Knights Landing

Knights Landing: roofline plot
OCCA version of NUMA

[Figure: roofline plot (2100 GFLOPS/s peak) of achieved GFLOPS/s vs. operational intensity (GFLOPS/GB) for the volume kernel, diffusion kernel, update kernel of ARK, create_rhs_gmres_no_schur, create_lhs_gmres_no_schur_set2c, and extract_q_gmres_no_schur]

Knights Landing: weak scaling on KNL cluster
preliminary scalability results, work in progress

[Figure: weak scaling efficiency (%) vs. number of KNL nodes (2 to 16)]

Summary

• Mira:
  – 3.0 km baroclinic instability within 4.15 minutes runtime per one-day forecast
  – 99.1% strong scaling efficiency on Mira, 1.21 PFlops
• Titan:
  – 90% weak scaling efficiency on 16,384 GPUs
  – GPU: runs up to 15x faster than on one CPU node
• Intel Knights Landing:
  – looks good but still room for improvement

Thank you for your attention!