Scientific Simulations on Thousands of GPUs with Performance Portability

Alan Gray and Kevin Stratford
EPCC, The University of Edinburgh

CORAL procurement

• Three “pre-exascale” machines have been announced in the US, each in the region of 100-300 petaflops
• Summit at ORNL and Sierra at LLNL will use NVIDIA GPUs (with IBM CPUs)
• Aurora at Argonne will use Intel Xeon Phi many-core CPUs (Intel/Cray system)
• Performance portability is the key issue for the programmer

Outline

• Applications: Ludwig and MILC

• Performance portability with targetDP

• Performance results on GPU, CPU and Xeon Phi
  - Using the same source code for each

• Scaling to many nodes with MPI+targetDP

Ludwig Application

• Soft matter substances, or complex fluids, are all around us
• Ludwig uses lattice Boltzmann and finite difference methods to simulate a wide range of such systems
• Improving the understanding of, and ability to manipulate, liquid crystals is a very active research area
• But the required simulations can be extremely computationally demanding, due to the range of scales involved
• targetDP was developed in co-design with Ludwig

Gray, A., Hart, A., Henrich, O. & Stratford, K., "Scaling soft matter physics to thousands of graphics processing units in parallel", IJHPCA (2015)

Stratford, K., Gray, A. & Lintuvuori, J. S., "Large Colloids in Cholesteric Liquid Crystals", Journal of Statistical Physics 161.6 (2015): 1496-1507

MILC Application

• Lattice QCD simulations provide numerical studies to help understand how quarks and gluons interact to form protons, neutrons and other elementary particles
• The Unified European Application Benchmark Suite (UEABS) is a set of 12 application codes designed to be representative of EU HPC usage
  - including a Lattice QCD component, derived from the MILC codebase
  - http://www.prace-ri.eu/ueabs/
• targetDP was applied to this application benchmark to enable it for GPU and Xeon Phi

Multi-valued data

• For most scientific simulations the bottleneck is memory bandwidth
• Simulation data consists of multiple values at each site
• In memory, we have a choice of how to store this
  - |rgb|rgb|rgb|rgb| (Array of Structs, AoS)
  - |rrrr|gggg|bbbb| (Struct of Arrays, SoA)
  - The most general case is Array of Structs of (short) Arrays (AoSoA)
  - E.g. ||rr|gg|bb|||rr|gg|bb|| has a short-array length of 2
  - Layout has a major effect on bandwidth, and the best layout is architecture-specific
• Solution:
  - De-couple memory layout from the application source code
  - This can simply be done with a macro, e.g. field[INDEX(iDim,iSite)] (see the sketch below)
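As a minimal sketch of this idea (hypothetical macro definitions, not the exact ones used in Ludwig; NDIM, NSITES and VVL are assumed compile-time constants), the layout can be selected where INDEX is defined:

/* Hypothetical layout-switching macro: application code always writes
 * field[INDEX(iDim, iSite)]; the layout is fixed at compile time.     */

#define NDIM   3                 /* values per lattice site (e.g. r,g,b) */
#define NSITES (128*128*128)     /* total number of lattice sites        */
#define VVL    4                 /* short-array length for AoSoA         */

#if defined LAYOUT_SOA
/* |rrrr...|gggg...|bbbb...| : contiguous in iSite for a fixed iDim */
#define INDEX(iDim, iSite) ((iDim)*NSITES + (iSite))
#elif defined LAYOUT_AOSOA
/* ||rr..|gg..|bb..||rr..|gg..|bb..|| : blocks of VVL sites */
#define INDEX(iDim, iSite) \
  (((iSite)/VVL)*NDIM*VVL + (iDim)*VVL + (iSite)%VVL)
#else
/* |rgb|rgb|rgb|... : Array of Structs (default in this sketch) */
#define INDEX(iDim, iSite) ((iSite)*NDIM + (iDim))
#endif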

targetDP

• Simple serial code example: loop over N grid points
  - With some operation ... at each point

int iSite;
for (iSite = 0; iSite < N; iSite++) {
  ...
}

• OpenMP:

int iSite;
#pragma omp parallel for
for (iSite = 0; iSite < N; iSite++) {
  ...
}

• targetDP:

__targetEntry__ void scale(double* field) {
  int iSite;
  __targetTLP__(iSite, N) {
    ...
  }
  return;
}

• CUDA:

__global__ void scale(double* field) {
  int iSite;
  iSite = blockIdx.x*blockDim.x + threadIdx.x;
  if (iSite < N) {
    ...
  }
  return;
}
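To make the single-source approach concrete, the sketch below shows one plausible way such macros could be defined so that the targetDP version compiles to either of the other two. This is illustrative only, not the actual targetDP header, and the VVL striding discussed later is omitted for brevity.

/* Illustrative macro definitions: NOT the real targetDP header.
 * The same application source compiles as CUDA or as OpenMP C. */

#ifdef __NVCC__                       /* CUDA build */

#define __targetEntry__ __global__
#define __targetTLP__(iSite, N) \
  iSite = blockIdx.x*blockDim.x + threadIdx.x; \
  if (iSite < (N))

#else                                 /* CPU build: OpenMP threads */

#define __targetEntry__
#define __targetTLP__(iSite, N) \
  _Pragma("omp parallel for") \
  for (iSite = 0; iSite < (N); iSite++)

#endif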

__targetEntry__ void scale(double* t_field) {
  int iSite;
  __targetTLP__(iSite, N) {
    int iDim;
    for (iDim = 0; iDim < 3; iDim++) {
      t_field[INDEX(iDim,iSite)] = t_a*t_field[INDEX(iDim,iSite)];
    }
  }
  return;
}

• PROBLEM: to fully utilise modern CPUs, the compiler must vectorize innermost loops to create vector instructions
• SOLUTION: TLP can be strided, such that each thread operates on a chunk of VVL lattice sites
  - VVL must be 1 for the above example to work
  - But we can set VVL>1, and add a new innermost loop:

__targetEntry__ void scale(double* t_field) {
  int baseIndex;
  __targetTLP__(baseIndex, N) {
    int iDim, vecIndex;
    for (iDim = 0; iDim < 3; iDim++) {
      __targetILP__(vecIndex)
        t_field[INDEX(iDim,baseIndex+vecIndex)] =
          t_a*t_field[INDEX(iDim,baseIndex+vecIndex)];
    }
  }
  return;
}

• ILP can map to a loop over a chunk of lattice sites, with an OpenMP SIMD directive
• Easily vectorizable by the compiler
• VVL can be tuned specifically for the hardware, e.g. VVL=8 will create a single IMCI instruction for the 8-way DP vector unit on Xeon Phi
  - Without this, performance is several times worse on Xeon Phi
• We can just map to an empty macro when we don't want ILP (see the sketch below)
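One plausible way __targetILP__ could be realised is sketched here, under the assumption that the TLP loop advances baseIndex in strides of VVL; this is not the library's actual definition.

/* Illustrative only: VVL is the vector length chosen for the target. */
#if VVL == 1
/* No ILP wanted: the construct collapses to a trivial single-iteration
 * loop (equivalently, vecIndex could simply be pinned to 0).          */
#define __targetILP__(vecIndex) \
  for (vecIndex = 0; vecIndex < 1; vecIndex++)
#else
/* Short innermost loop over a chunk of VVL lattice sites, with a SIMD
 * hint so the compiler emits vector instructions (e.g. IMCI on Xeon Phi). */
#define __targetILP__(vecIndex) \
  _Pragma("omp simd") \
  for (vecIndex = 0; vecIndex < VVL; vecIndex++)
#endif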

• The function is called from host code using wrappers to the CUDA API
  - These can alternatively map to regular CPU operations (malloc, memcpy etc.), as sketched below

targetMalloc((void **) &t_field, datasize);

copyToTarget(t_field, field, datasize);
copyConstDoubleToTarget(&t_a, &a, sizeof(double));

scale __targetLaunch__(N) (t_field);
targetSynchronize();

copyFromTarget(field, t_field, datasize);
targetFree(t_field);
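These wrappers can be thin switches over the CUDA runtime or plain C library calls; the sketch below shows two of them under that assumption. It is illustrative, not the actual targetDP source.

#include <stddef.h>

#ifdef __NVCC__                      /* GPU build: wrap the CUDA runtime */
#include <cuda_runtime.h>

void targetMalloc(void **ptr, size_t size) {
  cudaMalloc(ptr, size);                                 /* device memory  */
}
void copyToTarget(void *t_ptr, const void *ptr, size_t size) {
  cudaMemcpy(t_ptr, ptr, size, cudaMemcpyHostToDevice);  /* host -> device */
}

#else                                /* CPU build: "target" is host memory */
#include <stdlib.h>
#include <string.h>

void targetMalloc(void **ptr, size_t size) {
  *ptr = malloc(size);
}
void copyToTarget(void *t_ptr, const void *ptr, size_t size) {
  memcpy(t_ptr, ptr, size);
}

#endif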

Results

[Images: CPU, Xeon Phi and GPU hardware]

• Same performance-portable targetDP source code on all architectures

[Figure: Full Ludwig Liquid Crystal 128x128x128 test case. Time (s) per kernel (Propagation, Collision, Order Par. Grad., Chemical Stress, LC Update, Advection, Advect. Bound., Ludwig remainder) on Intel Ivy-bridge 12-core CPU, Intel Haswell 8-core CPU, AMD Interlagos 16-core CPU, Intel Xeon Phi, NVIDIA K20X GPU and NVIDIA K40 GPU. Best configurations: AoSoA VVL=4; AoS VVL=1; AoS VVL=1; AoSoA VVL=8; SoA VVL=1; SoA VVL=1 respectively.]

[Figure: Full MILC Conjugate Gradient 64x64x32x8 test case. Time (s) per kernel (Shift, Scalar Mult. Add, Insert, Insert & Mult., Extract & Mult., Extract, MILC remainder) on the same architectures, with the same best configurations.]

Comparing with capability of hardware

• Use the "Roofline" model

• It can be shown that all our kernels are memory-bandwidth bound
  - Compare kernel bandwidth with the STREAM benchmark, as in the worked example below
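As a concrete illustration of that comparison (the numbers below are hypothetical placeholders, not measurements from this work): the scale kernel above reads and writes 3 doubles per site, so its achieved bandwidth and the corresponding fraction of STREAM follow directly from the measured kernel time.

#include <stdio.h>

int main(void) {
  /* Hypothetical example values, purely to illustrate the calculation. */
  double nsites    = 128.0*128.0*128.0;  /* lattice sites                 */
  double kernel_t  = 1.0e-3;             /* measured kernel time (s)      */
  double stream_bw = 180.0e9;            /* STREAM bandwidth (bytes/s)    */

  /* scale kernel: 3 doubles read + 3 doubles written per site. */
  double bytes     = nsites * 3.0 * 2.0 * sizeof(double);
  double kernel_bw = bytes / kernel_t;

  printf("achieved %.1f GB/s = %.1f%% of STREAM\n",
         kernel_bw/1.0e9, 100.0*kernel_bw/stream_bw);
  return 0;
}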

140" Intel"IvyPbridge"(Es.mated)" Intel"Xeon"Phi"(Es.mated)" NVIDIA"K40"GPU"(Actual)" 120"

100"

80"

60"

40" Percentage)of)STREAM)

20"

0"

ShiN"(0.00)" Extract"(0.07)" Insert"(0.10)" Collision"(1.08)" LC"Update"(0.79)"Advec.on"(0.13)" Propaga.on"(0.00)" Chemical"Stress"(2.97)" Advect."Bound."(0.05)" Order"Par."Grad."(0.15)" Extract"and"Mult."(0.38)"Insert"and"Mult."(0.38)" Scalar"Mult."Add"(0.07)" 15 Ludwig" MILC" MPI+targetDP Scaling

[Figures: strong scaling of time (s) against number of nodes on log-log axes, comparing Titan CPU (16-core Interlagos nodes), Archer CPU (12-core Ivy-bridge nodes) and Titan GPU (one K20X per node):
  - Ludwig Liquid Crystal 128x128x128, 1 to 1000 nodes
  - Ludwig Liquid Crystal 1024x1024x512, 100 to 10000 nodes
  - MILC Conjugate Gradient 64x64x32x8, 1 to 1000 nodes
  - MILC Conjugate Gradient 64x64x64x192, 10 to 10000 nodes]

1" 1" 1" 10" 100" 1000" 10" 100" nodes$1000" 10000" nodes$ 20 Summary • targetDP is a simplisc framework that allows grid-based codes to perform well on modern mul/many-core CPUs as well as GPUs ¡ By abstracng parallelism and memory spaces ¡ Express TLP and ILP. We can see that exposing ILP is crucial on Xeon Phi today, and vector units will connue to get wider on future CPUs ¡ It is also crucial to de-couple memory layout by abstracng memory accesses. • We demonstrated performance portability across mulple modern architectures • GPUs and Xeon Phi are significantly faster than CPUs, because they offer higher memory bandwidth ¡ GPUs have advantage over Xeon Phi • MPI+targetDP is suitable for large-scale supercompung ¡ NVLINK should help with strong mul-GPU scaling • We have been concentrang on structured grid-based applicaons, but similar thinking may be fruiul for other areas • targetDP is freely available ¡ hp://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README


Acknowledgements
