Scientific Simulations on Thousands of GPUs with Performance Portability

Alan Gray and Kevin Stratford
EPCC, The University of Edinburgh

CORAL procurement

• Three “pre-exascale” machines have been announced in the US, each in the region of 100-300 petaflops
• Summit at ORNL and Sierra at LLNL will use NVIDIA GPUs (with IBM CPUs)
• Aurora at Argonne will use Intel Xeon Phi many-core CPUs (Intel/Cray system)
• Performance portability is the key issue for the programmer

Outline

• Applications: Ludwig and MILC

• Performance portability with targetDP

• Performance results on GPU, CPU and Xeon Phi
  - Using the same source code for each

• Scaling to many nodes with MPI+targetDP

Ludwig Application

• Soft matter substances, or complex fluids, are all around us
• Ludwig uses lattice Boltzmann and finite difference methods to simulate a wide range of such systems
• Improving the understanding of, and ability to manipulate, liquid crystals is a very active research area
• But the required simulations can be extremely computationally demanding, due to the range of scales involved
• targetDP was developed in co-design with Ludwig

Gray, A., Hart, A., Henrich, O. & Stratford, K., "Scaling soft matter physics to thousands of graphics processing units in parallel", IJHPCA (2015)

Stratford, K., Gray, A. & Lintuvuori, J. S., "Large Colloids in Cholesteric Liquid Crystals", Journal of Statistical Physics 161.6 (2015): 1496-1507

MILC Application

• Lattice QCD simulations provide numerical studies to help understand how quarks and gluons interact to form protons, neutrons and other elementary particles
• The Unified European Application Benchmark Suite (UEABS) is a set of 12 application codes designed to be representative of EU HPC usage
  - including a Lattice QCD component, derived from the MILC codebase
  - http://www.prace-ri.eu/ueabs/
• targetDP was applied to this application benchmark to enable it for GPU and Xeon Phi

Multi-valued data

• For most scientific simulations the bottleneck is memory bandwidth
• Simulation data consists of multiple values at each site
• In memory, we have a choice of how to store this
  - |rgb|rgb|rgb|rgb| (Array of Structs, AoS)
  - |rrrr|gggg|bbbb| (Struct of Arrays, SoA)
  - The most general case is Array of Structs of (short) Arrays (AoSoA)
  - E.g. ||rr|gg|bb|||rr|gg|bb|| has a short-array length of 2
  - Layout has a major effect on bandwidth, and the best layout is architecture-specific
• Solution:
  - De-couple memory layout from the application source code
  - This can simply be done with a macro, e.g. field[INDEX(iDim,iSite)] (see the sketch below)
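As a minimal sketch of this idea (hypothetical macro definitions, not the exact ones used in Ludwig; NDIM, NSITES and VVL are assumed compile-time constants), the layout can be selected where INDEX is defined:

/* Hypothetical layout-switching macro: application code always writes
 * field[INDEX(iDim, iSite)]; the layout is fixed at compile time.     */

#define NDIM   3                 /* values per lattice site (e.g. r,g,b) */
#define NSITES (128*128*128)     /* total number of lattice sites        */
#define VVL    4                 /* short-array length for AoSoA         */

#if defined LAYOUT_SOA
/* |rrrr...|gggg...|bbbb...| : contiguous in iSite for a fixed iDim */
#define INDEX(iDim, iSite) ((iDim)*NSITES + (iSite))
#elif defined LAYOUT_AOSOA
/* ||rr..|gg..|bb..||rr..|gg..|bb..|| : blocks of VVL sites */
#define INDEX(iDim, iSite) \
  (((iSite)/VVL)*NDIM*VVL + (iDim)*VVL + (iSite)%VVL)
#else
/* |rgb|rgb|rgb|... : Array of Structs (default in this sketch) */
#define INDEX(iDim, iSite) ((iSite)*NDIM + (iDim))
#endif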

targetDP

• Simple serial code example: loop over N grid points
  - With some operation ... at each point

int iSite;
for (iSite = 0; iSite < N; iSite++) {
  ...
}

• OpenMP:

int iSite;
#pragma omp parallel for
for (iSite = 0; iSite < N; iSite++) {
  ...
}

• targetDP:

__targetEntry__ void scale(double* field) {
  int iSite;
  __targetTLP__(iSite, N) {
    ...
  }
  return;
}

• CUDA:

__global__ void scale(double* field) {
  int iSite;
  iSite = blockIdx.x*blockDim.x + threadIdx.x;
  if (iSite < N) {
    ...
  }
  return;
}
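To make the single-source approach concrete, the sketch below shows one plausible way such macros could be defined so that the targetDP version compiles to either of the other two. This is illustrative only, not the actual targetDP header, and the VVL striding discussed later is omitted for brevity.

/* Illustrative macro definitions: NOT the real targetDP header.
 * The same application source compiles as CUDA or as OpenMP C. */

#ifdef __NVCC__                       /* CUDA build */

#define __targetEntry__ __global__
#define __targetTLP__(iSite, N) \
  iSite = blockIdx.x*blockDim.x + threadIdx.x; \
  if (iSite < (N))

#else                                 /* CPU build: OpenMP threads */

#define __targetEntry__
#define __targetTLP__(iSite, N) \
  _Pragma("omp parallel for") \
  for (iSite = 0; iSite < (N); iSite++)

#endif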

__targetEntry__ void scale(double* t_field) {
  int iSite;
  __targetTLP__(iSite, N) {
    int iDim;
    for (iDim = 0; iDim < 3; iDim++) {
      t_field[INDEX(iDim,iSite)] = t_a*t_field[INDEX(iDim,iSite)];
    }
  }
  return;
}

• PROBLEM: to fully utilise modern CPUs, the compiler must vectorize innermost loops to create vector instructions
• SOLUTION: TLP can be strided, such that each thread operates on a chunk of VVL lattice sites
  - VVL must be 1 for the above example to work
  - But we can set VVL>1, and add a new innermost loop:

__targetEntry__ void scale(double* t_field) {
  int baseIndex;
  __targetTLP__(baseIndex, N) {
    int iDim, vecIndex;
    for (iDim = 0; iDim < 3; iDim++) {
      __targetILP__(vecIndex)
        t_field[INDEX(iDim,baseIndex+vecIndex)] =
          t_a*t_field[INDEX(iDim,baseIndex+vecIndex)];
    }
  }
  return;
}

• ILP can map to a loop over a chunk of lattice sites, with an OpenMP SIMD directive
• Easily vectorizable by the compiler
• VVL can be tuned specifically for the hardware, e.g. VVL=8 will create a single IMCI instruction for the 8-way DP vector unit on Xeon Phi
  - Without this, performance is several times worse on Xeon Phi
• We can just map to an empty macro when we don't want ILP (see the sketch below)
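One plausible way __targetILP__ could be realised is sketched here, under the assumption that the TLP loop advances baseIndex in strides of VVL; this is not the library's actual definition.

/* Illustrative only: VVL is the vector length chosen for the target. */
#if VVL == 1
/* No ILP wanted: the construct collapses to a trivial single-iteration
 * loop (equivalently, vecIndex could simply be pinned to 0).          */
#define __targetILP__(vecIndex) \
  for (vecIndex = 0; vecIndex < 1; vecIndex++)
#else
/* Short innermost loop over a chunk of VVL lattice sites, with a SIMD
 * hint so the compiler emits vector instructions (e.g. IMCI on Xeon Phi). */
#define __targetILP__(vecIndex) \
  _Pragma("omp simd") \
  for (vecIndex = 0; vecIndex < VVL; vecIndex++)
#endif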

• The function is called from host code using wrappers to the CUDA API
  - These can alternatively map to regular CPU operations (malloc, memcpy etc.), as sketched below

targetMalloc((void **) &t_field, datasize);

copyToTarget(t_field, field, datasize);
copyConstDoubleToTarget(&t_a, &a, sizeof(double));

scale __targetLaunch__(N) (t_field);
targetSynchronize();

copyFromTarget(field, t_field, datasize);
targetFree(t_field);
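These wrappers can be thin switches over the CUDA runtime or plain C library calls; the sketch below shows two of them under that assumption. It is illustrative, not the actual targetDP source.

#include <stddef.h>

#ifdef __NVCC__                      /* GPU build: wrap the CUDA runtime */
#include <cuda_runtime.h>

void targetMalloc(void **ptr, size_t size) {
  cudaMalloc(ptr, size);                                 /* device memory  */
}
void copyToTarget(void *t_ptr, const void *ptr, size_t size) {
  cudaMemcpy(t_ptr, ptr, size, cudaMemcpyHostToDevice);  /* host -> device */
}

#else                                /* CPU build: "target" is host memory */
#include <stdlib.h>
#include <string.h>

void targetMalloc(void **ptr, size_t size) {
  *ptr = malloc(size);
}
void copyToTarget(void *t_ptr, const void *ptr, size_t size) {
  memcpy(t_ptr, ptr, size);
}

#endif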

Results

[Images: CPU, Xeon Phi and GPU hardware]

• Same performance-portable targetDP source code on all architectures

[Figure: Full Ludwig Liquid Crystal 128x128x128 test case. Time (s) per kernel (Propagation, Collision, Order Par. Grad., Chemical Stress, LC Update, Advection, Advect. Bound., Ludwig remainder) on Intel Ivy-bridge 12-core CPU, Intel Haswell 8-core CPU, AMD Interlagos 16-core CPU, Intel Xeon Phi, NVIDIA K20X GPU and NVIDIA K40 GPU. Best configurations: AoSoA VVL=4; AoS VVL=1; AoS VVL=1; AoSoA VVL=8; SoA VVL=1; SoA VVL=1 respectively.]

[Figure: Full MILC Conjugate Gradient 64x64x32x8 test case. Time (s) per kernel (Shift, Scalar Mult. Add, Insert, Insert & Mult., Extract & Mult., Extract, MILC remainder) on the same architectures, with the same best configurations.]

Comparing with capability of hardware

• Use the "Roofline" model

• It can be shown that all our kernels are memory-bandwidth bound
  - Compare kernel bandwidth with the STREAM benchmark, as in the worked example below
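As a concrete illustration of that comparison (the numbers below are hypothetical placeholders, not measurements from this work): the scale kernel above reads and writes 3 doubles per site, so its achieved bandwidth and the corresponding fraction of STREAM follow directly from the measured kernel time.

#include <stdio.h>

int main(void) {
  /* Hypothetical example values, purely to illustrate the calculation. */
  double nsites    = 128.0*128.0*128.0;  /* lattice sites                 */
  double kernel_t  = 1.0e-3;             /* measured kernel time (s)      */
  double stream_bw = 180.0e9;            /* STREAM bandwidth (bytes/s)    */

  /* scale kernel: 3 doubles read + 3 doubles written per site. */
  double bytes     = nsites * 3.0 * 2.0 * sizeof(double);
  double kernel_bw = bytes / kernel_t;

  printf("achieved %.1f GB/s = %.1f%% of STREAM\n",
         kernel_bw/1.0e9, 100.0*kernel_bw/stream_bw);
  return 0;
}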

140" Intel"IvyPbridge"(Es.mated)" Intel"Xeon"Phi"(Es.mated)" NVIDIA"K40"GPU"(Actual)" 120"

100"

80"

60"

40" Percentage)of)STREAM)

20"

0"

ShiN"(0.00)" Extract"(0.07)" Insert"(0.10)" Collision"(1.08)" LC"Update"(0.79)"Advec.on"(0.13)" Propaga.on"(0.00)" Chemical"Stress"(2.97)" Advect."Bound."(0.05)" Order"Par."Grad."(0.15)" Extract"and"Mult."(0.38)"Insert"and"Mult."(0.38)" Scalar"Mult."Add"(0.07)" 15 Ludwig" MILC" MPI+targetDP Scaling

[Figures: strong scaling of time (s) against number of nodes on log-log axes, comparing Titan CPU (16-core Interlagos nodes), Archer CPU (12-core Ivy-bridge nodes) and Titan GPU (one K20X per node):
  - Ludwig Liquid Crystal 128x128x128, 1 to 1000 nodes
  - Ludwig Liquid Crystal 1024x1024x512, 100 to 10000 nodes
  - MILC Conjugate Gradient 64x64x32x8, 1 to 1000 nodes
  - MILC Conjugate Gradient 64x64x64x192, 10 to 10000 nodes]

1" 1" 1" 10" 100" 1000" 10" 100" nodes$1000" 10000" nodes$ 20 Summary • targetDP is a simplisc framework that allows grid-based codes to perform well on modern mul/many-core CPUs as well as GPUs ¡ By abstracng parallelism and memory spaces ¡ Express TLP and ILP. We can see that exposing ILP is crucial on Xeon Phi today, and vector units will connue to get wider on future CPUs ¡ It is also crucial to de-couple memory layout by abstracng memory accesses. • We demonstrated performance portability across mulple modern architectures • GPUs and Xeon Phi are significantly faster than CPUs, because they offer higher memory bandwidth ¡ GPUs have advantage over Xeon Phi • MPI+targetDP is suitable for large-scale supercompung ¡ NVLINK should help with strong mul-GPU scaling • We have been concentrang on structured grid-based applicaons, but similar thinking may be fruiul for other areas • targetDP is freely available ¡ hp://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README


Acknowledgements
