Scientific Simulations on Thousands of GPUs with Performance Portability
Alan Gray and Kevin Stratford, EPCC, The University of Edinburgh

CORAL procurement
• Three "pre-exascale" machines have been announced in the US, each in the region of 100-300 petaflops
• Summit at ORNL and Sierra at LLNL will use NVIDIA GPUs (with IBM CPUs)
• Aurora at Argonne will use Intel Xeon Phi many-core CPUs (Cray system)
• Performance portability is the key issue for the programmer
Outline
• Applications: Ludwig and MILC
• Performance Portability with targetDP
• Performance results on GPU, CPU and Xeon Phi
  § Using the same source code for each
• Scaling to many nodes with MPI+targetDP
Ludwig Application
• Soft matter substances, or complex fluids, are all around us
• Ludwig uses lattice Boltzmann and finite difference methods to simulate a wide range of such systems
• Improving the understanding of, and ability to manipulate, liquid crystals is a very active research area
• But the required simulations can be extremely computationally demanding, due to the range of scales involved
• targetDP was developed in co-design with Ludwig
Gray, A., Hart, A., Henrich, O. & Stratford, K., "Scaling soft matter physics to thousands of graphics processing units in parallel", IJHPCA (2015)
Stratford, K., Gray, A. & Lintuvuori, J. S., "Large Colloids in Cholesteric Liquid Crystals", Journal of Statistical Physics 161.6 (2015): 1496-1507

MILC application
• Lattice QCD simulations provide numerical studies to help understand how quarks and gluons interact to form protons, neutrons and other elementary particles
• The Unified European Application Benchmark Suite (UEABS) is a set of 12 application codes designed to be representative of EU HPC usage
  ¡ including a Lattice QCD component, derived from the MILC codebase
  ¡ http://www.prace-ri.eu/ueabs/
• targetDP was applied to this application benchmark to enable it for GPU and Xeon Phi

Multi-valued data
• For most scientific simulations the bottleneck is memory bandwidth
• Simulation data consists of multiple values at each site
• In memory, we have a choice of how to store this
  ¡ |rgb|rgb|rgb|rgb| (Array of Structs, AoS)
  ¡ |rrrr|gggg|bbbb| (Struct of Arrays, SoA)
  ¡ Most general case is Array of Structs of (short) Arrays (AoSoA)
  ¡ E.g. ||rr|gg|bb|||rr|gg|bb|| has SA length of 2
  ¡ Major effect on bandwidth. Best layout is architecture-specific
• Solution:
  ¡ De-couple memory layout from application source code
  ¡ Can simply be done with a macro, e.g. field[INDEX(iDim,iSite)]
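As a minimal sketch of this idea (illustrative only; not the actual Ludwig/targetDP macro, and NSITES/NDIM are made-up compile-time sizes), a single INDEX macro can switch a 3-component field between AoS and SoA layouts without touching the application loop:

```c
#include <assert.h>

/* Hypothetical layout-switching macro, for illustration.
 * The application only ever writes field[INDEX(iDim, iSite)]. */
#define NSITES 4
#define NDIM   3

#ifdef LAYOUT_AOS
/* |rgb|rgb|rgb|rgb| : components of one site are contiguous (AoS) */
#define INDEX(iDim, iSite) ((iSite)*NDIM + (iDim))
#else
/* |rrrr|gggg|bbbb| : all sites of one component are contiguous (SoA) */
#define INDEX(iDim, iSite) ((iDim)*NSITES + (iSite))
#endif

/* Scale every component at every site; identical source for either layout. */
void scale(double *field, double a) {
  for (int iSite = 0; iSite < NSITES; iSite++) {
    for (int iDim = 0; iDim < NDIM; iDim++) {
      field[INDEX(iDim, iSite)] *= a;
    }
  }
}
```

Recompiling with -DLAYOUT_AOS changes only the addressing, so the best layout for each architecture can be selected at build time.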
targetDP
• Simple serial code example: loop over N grid points
  ¡ With some operation ... at each point

    int iSite;
    for (iSite = 0; iSite < N; iSite++) {
      ...
    }
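targetDP keeps loops in essentially this serial form and hides how they are parallelised behind macros. As a rough sketch of the mechanism (the macro names follow the deck, but these particular definitions are hypothetical, not targetDP's actual headers), a CPU build might expand the thread-level-parallelism macro to an OpenMP loop:

```c
#include <assert.h>

/* Hypothetical CPU-target definitions, for illustration only:
 * targetDP's real headers map the same macros onto OpenMP or CUDA. */
#define __targetEntry__  /* nothing on the host; __global__ under CUDA */
#define __targetTLP__(iSite, n) \
  _Pragma("omp parallel for")   \
  for (iSite = 0; iSite < (n); iSite++)

enum { N = 8 };
static double t_a = 2.0;

/* The kernel source is unchanged whichever target the macros select. */
__targetEntry__ void scale(double *t_field) {
  int iSite;
  __targetTLP__(iSite, N) {
    t_field[iSite] = t_a * t_field[iSite];
  }
}
```

Without -fopenmp the pragma is simply ignored and the loop runs serially, which is what makes this style of abstraction cheap to adopt.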
• OpenMP:

    int iSite;
    #pragma omp parallel for
    for (iSite = 0; iSite < N; iSite++) {
      ...
    }

• CUDA:

    __global__ void scale(double* field) {
      int iSite;
      iSite = blockIdx.x*blockDim.x + threadIdx.x;
      if (iSite < N) {
        ...
      }
      return;
    }

• targetDP:

    __targetEntry__ void scale(double* field) {
      int iSite;
      __targetTLP__(iSite, N) {
        ...
      }
      return;
    }

    __targetEntry__ void scale(double* t_field) {
      int iSite;
      __targetTLP__(iSite, N) {
        int iDim;
        for (iDim = 0; iDim < 3; iDim++) {
          t_field[INDEX(iDim,iSite)] = t_a*t_field[INDEX(iDim,iSite)];
        }
      }
      return;
    }

• PROBLEM: to fully utilise modern CPUs, the compiler must vectorize innermost loops to create vector instructions
• SOLUTION: TLP can be strided, such that each thread operates on a chunk of VVL lattice sites
  ¡ VVL must be 1 for the above example to work
  ¡ But we can set VVL>1, and add a new innermost loop

    __targetEntry__ void scale(double* t_field) {
      int baseIndex;
      __targetTLP__(baseIndex, N) {
        int iDim, vecIndex;
        for (iDim = 0; iDim < 3; iDim++) {
          __targetILP__(vecIndex) \
            t_field[INDEX(iDim,baseIndex+vecIndex)] = \
              t_a*t_field[INDEX(iDim,baseIndex+vecIndex)];
        }
      }
      return;
    }

• ILP can map to a loop over the chunk of lattice sites, with an OpenMP SIMD directive
• Easily vectorizable by the compiler
• VVL can be tuned specifically for the hardware, e.g. VVL=8 will create a single IMCI instruction for the 8-way DP vector unit on Xeon Phi
  ¡ Without this, performance is several times worse on Xeon Phi
• We can just map ILP to an empty macro when we don't want it

• Functions are called from host code using wrappers to the CUDA API
  ¡ These can alternatively map to regular CPU calls (malloc, memcpy etc.)

    targetMalloc((void **) &t_field, datasize);
    copyToTarget(t_field, field, datasize);
    copyConstDoubleToTarget(&t_a, &a, sizeof(double));
    scale __targetLaunch__(N) (t_field);
    targetSynchronize();
    copyFromTarget(field, t_field, datasize);
    targetFree(t_field);

Results: CPU, Xeon Phi, GPU
• Same performance-portable targetDP source code on all architectures

[Bar chart: Full Ludwig Liquid Crystal 128x128x128 Test Case. Time (s) broken down by kernel (Propagation, Collision, Order Par. Grad., Chemical Stress, LC Update, Advection, Advect. Bound., Ludwig Remainder) on Intel Ivy-bridge 12-core CPU, Intel Haswell 8-core CPU, AMD Interlagos 16-core CPU, Intel Xeon Phi, NVIDIA K20X GPU and NVIDIA K40 GPU. Best configs: AoSoA VVL=4; AoS VVL=1; AoS VVL=1; AoSoA VVL=8; SoA VVL=1; SoA VVL=1.]

[Bar chart: Full MILC Conjugate Gradient 64x64x32x8 Test Case. Time (s) broken down by kernel (Extract, Extract & Mult., Insert & Mult., Insert, Scalar Mult. Add, Shift, MILC Remainder) on the same six architectures, with the same best configurations.]

Comparing with capability of hardware
• Use the "Roofline" model
• It can be shown that all our kernels are memory-bandwidth bound
  ¡ Compare kernel bandwidth with the STREAM benchmark

[Bar chart: percentage of STREAM bandwidth achieved by each Ludwig and MILC kernel, with arithmetic intensity in parentheses: Propagation (0.00), Collision (1.08), Order Par. Grad. (0.15), Chemical Stress (2.97), LC Update (0.79), Advection (0.13), Advect. Bound. (0.05), Shift (0.00), Extract (0.07), Insert (0.10), Extract and Mult. (0.38), Insert and Mult. (0.38), Scalar Mult. Add (0.07); on Intel Ivy-bridge (estimated), Intel Xeon Phi (estimated) and NVIDIA K40 GPU (actual).]

MPI+targetDP Supercomputer Scaling

[Strong-scaling plots: time (s) vs nodes (log-log), comparing Titan CPU (one 16-core Interlagos per node), Archer CPU (two 12-core Ivy-bridge per node) and Titan GPU (one K20X per node), for Ludwig Liquid Crystal 128x128x128 (1-1000 nodes), Ludwig Liquid Crystal 1024x1024x512 (100-10000 nodes), MILC Conjugate Gradient 64x64x32x8 (1-1000 nodes) and MILC Conjugate Gradient 64x64x64x192 (10-10000 nodes).]

Summary
• targetDP is a simple framework that allows grid-based codes to perform well on modern multi/many-core CPUs as well as GPUs
  ¡ By abstracting parallelism and memory spaces
  ¡ It expresses TLP and ILP. We can see that exposing ILP is crucial on Xeon Phi today, and vector units will continue to get wider on future CPUs
  ¡ It is also crucial to de-couple memory layout by abstracting memory accesses
• We demonstrated performance portability across multiple modern architectures
• GPUs and Xeon Phi are significantly faster than CPUs, because they offer higher memory bandwidth
  ¡ GPUs have the advantage over Xeon Phi
• MPI+targetDP is suitable for large-scale supercomputing
  ¡ NVLINK should help with strong multi-GPU scaling
• We have been concentrating on structured grid-based applications, but similar thinking may be fruitful for other areas
• targetDP is freely available
  ¡ http://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README

Acknowledgements