S9708 - STRONG SCALING HPC APPLICATIONS: BEST PRACTICES WITH A LATTICE QCD CASE STUDY
Kate Clark, Mathias Wagner

AGENDA

Lattice Quantum Chromodynamics

QUDA

Bandwidth Optimization

Latency Optimization

Multi-node scaling

NVSHMEM

Summary

QUANTUM CHROMODYNAMICS

The strong force is one of the basic forces of nature (along with gravity, electromagnetism, and the weak force)

It’s what binds together the quarks and gluons in the proton and the neutron (as well as hundreds of other particles seen in accelerator experiments)

Responsible for the particle zoo seen at sub-nuclear scales (mass, decay rate, etc.)

QCD is the theory of the strong force

It's a beautiful theory…

$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$

…but

LATTICE QUANTUM CHROMODYNAMICS

Theory is highly non-linear ⇒ cannot solve directly Must resort to numerical methods to make predictions

Lattice QCD
Discretize spacetime ⇒ 4-dimensional lattice of size Lx x Ly x Lz x Lt
Finite spacetime ⇒ periodic boundary conditions
PDEs ⇒ finite difference equations
Consumer of 10-20% of public supercomputer cycles
Traditionally highly optimized on every HPC platform for the past 30 years

Andre Walker-Loud, S91010: Accelerating our Understanding of Nuclear Physics and the Early Universe
Jiqun Tu, S9330: Lattice QCD with Tensor Cores

STEPS IN AN LQCD CALCULATION

$D_{ij}(x, y; U)\, \psi_j^{\alpha}(y) = \eta_i^{\alpha}(x)$

1. Generate an ensemble of gluon field configurations
"Gauge generation" or Ax = b
Produced in sequence, with hundreds needed per ensemble
Strong scaling required with 100-1000 TFLOPS sustained for several months
Simulation cost ~ a⁻⁶ V^(5/4)
50-90% of the runtime is in the linear solver
O(1) solve per linear system
Target 16⁴ per GPU

2. "Analyze" the configurations
Can be farmed out, assuming ~10 TFLOPS per job
Task parallelism means that clusters reign supreme here
80-99% of the runtime is in the linear solver
Many solves per system, e.g., O(10⁶)
Target 24⁴-32⁴ per GPU

LATTICE QCD IN A NUTSHELL

$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$

[Figure: face exchange and wrap-around across GPUs; comparison of experiment and theory (Davies et al.); Large Hadron Collider; Brookhaven National Laboratory]

POWERS WORLD'S FASTEST SUPERCOMPUTER
Summit Becomes First System To Scale The 100 Petaflops Milestone

122 PF HPC | 3 EF AI | 27,648 Volta Tensor Core GPUs

STRONG SCALING

Multiple meanings
Same problem size, more nodes, more GPUs
Same problem, next generation GPUs
Multigrid - strong scaling within the same run (not discussed here)

To tame strong scaling we have to understand the limiters
Bandwidth limiters
Latency limiters

QUDA

• “QCD on CUDA” – http://lattice.github.com/quda (open source, BSD license)
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc.
• Provides:
– Various solvers for all major fermionic discretizations, with multi-GPU support
– Additional performance-critical routines needed for gauge-field generation
• Maximize performance
– Exploit physical symmetries to minimize memory traffic
– Mixed-precision methods
– Autotuning for high performance on all CUDA-capable architectures
– Domain-decomposed (Schwarz) preconditioners for strong scaling
– Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
– Multi-source solvers
– Multigrid solvers for optimal convergence
• A research tool for how to reach the exascale

QUDA CONTRIBUTORS
10 years - lots of contributors

§ Ron Babich (NVIDIA)
§ Dean Howarth (BU)
§ Simone Bacchio (Cyprus)
§ Bálint Joó (Jlab)
§ Michael Baldhauf (Regensburg)
§ Hyung-Jin Kim (BNL -> Samsung)
§ Kip Barros (LANL)
§ Bartek Kostrzewa (Bonn)
§ Rich Brower (Boston University)
§ Claudio Rebbi (Boston University)
§ Nuno Cardoso (NCSA)
§ Hauke Sandmeyer (Bielefeld)
§ Kate Clark (NVIDIA)*
§ Guochun Shi (NCSA -> Google)
§ Michael Cheng (Boston University)
§ Mario Schröck (INFN)
§ Carleton DeTar (Utah University)
§ Alexei Strelchenko (FNAL)
§ Justin Foley (Utah -> NIH)
§ Jiqun Tu (Columbia)
§ Joel Giedt (Rensselaer Polytechnic Institute)
§ Alejandro Vaquero (Utah University)
§ Arjun Gambhir (William and Mary)
§ Mathias Wagner (NVIDIA)*
§ Steve Gottlieb (Indiana University)
§ Evan Weinberg (NVIDIA)*
§ Kyriakos Hadjiyiannakou (Cyprus)
§ Frank Winter (Jlab)

*this work

LINEAR SOLVERS

QUDA supports a wide range of linear solvers: CG, BiCGstab, GCR, multi-shift solvers, etc.

Conjugate gradient:
while (|r_k| > ε) {
  β_k = (r_k, r_k) / (r_{k-1}, r_{k-1})
  p_{k+1} = r_k + β_k p_k
  q_{k+1} = A p_{k+1}
  α = (r_k, r_k) / (p_{k+1}, q_{k+1})
  r_{k+1} = r_k - α q_{k+1}
  x_{k+1} = x_k + α p_{k+1}
  k = k + 1
}

Condition number inversely proportional to mass
Light (realistic) masses are highly singular
Naive Krylov solvers suffer from critical slowing down at decreasing mass

Entire solver algorithm must run on GPUs
Time-critical kernel is the stencil application
Also require BLAS level-1 type operations
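To make the solver structure concrete (one stencil application plus a handful of BLAS level-1 operations per iteration), here is a minimal, self-contained CG sketch. It runs on the CPU with std::vector and a toy 1-d stencil standing in for Dslash; in QUDA the operator application and every BLAS-1 call are CUDA kernels. All names here are illustrative, not QUDA's API.

```cpp
// Minimal CG sketch: the loop structure mirrors the pseudocode above.
// applyOp stands in for the Dslash stencil; dot/axpy stand in for BLAS-1 kernels.
#include <cstdio>
#include <vector>
#include <cmath>

using Vec = std::vector<double>;

// Toy operator: 1-d Laplacian plus a mass term with periodic boundaries,
// symmetric positive definite so plain CG converges.  Smaller mass => worse
// conditioning, mirroring the "condition number ~ 1/mass" statement above.
static void applyOp(Vec &out, const Vec &in, double mass) {
  const int n = static_cast<int>(in.size());
  for (int i = 0; i < n; ++i)
    out[i] = (2.0 + mass) * in[i] - in[(i - 1 + n) % n] - in[(i + 1) % n];
}

static double dot(const Vec &a, const Vec &b) {
  double s = 0.0;
  for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

static void axpy(double a, const Vec &x, Vec &y) {  // y += a*x
  for (size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}

int main() {
  const int n = 1 << 12;
  const double mass = 0.01, tol = 1e-10;
  Vec x(n, 0.0), b(n, 1.0), r = b, p = r, q(n, 0.0);  // x = 0 so r = b

  double rr = dot(r, r);
  const double stop = tol * tol * dot(b, b);

  for (int k = 0; k < 10000 && rr > stop; ++k) {
    applyOp(q, p, mass);                        // q = A p   (stencil kernel)
    const double alpha = rr / dot(p, q);
    axpy( alpha, p, x);                         // x = x + alpha p
    axpy(-alpha, q, r);                         // r = r - alpha q
    const double rrNew = dot(r, r);
    const double beta = rrNew / rr;             // beta = (r_k,r_k)/(r_{k-1},r_{k-1})
    for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];  // p = r + beta p
    rr = rrNew;
  }
  std::printf("final |r| = %g\n", std::sqrt(rr));
  return 0;
}
```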

MAPPING THE DIRAC OPERATOR TO CUDA

• Finite difference operator in LQCD is known as Dslash
• Assign a single space-time point to each thread
  V = XYZT threads, e.g., V = 24⁴ => 3.3x10⁶ threads
• Looping over direction each thread must
  – Load the neighboring spinor (24 numbers x8)
  – Load the color matrix connecting the sites (18 numbers x8)
  – Do the computation
  – Save the result (24 numbers)
• Each thread has 0.92 naive arithmetic intensity (Wilson Dslash)
• QUDA reduces memory traffic
  – Exact SU(3) matrix compression (18 => 12 or 8 real numbers)
  – Use 16-bit fixed-point representation with mixed-precision solver

[Figure: nearest-neighbor stencil of the lattice Dirac operator D in the µ-ν plane; color-spinor fields live on the sites, SU(3) link matrices U on the links; the nearest-neighbor structure suggests a natural even-odd (red-black) coloring of the sites]
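To make the thread mapping concrete, here is a schematic CUDA sketch: one thread per site loops over the 8 neighbor directions, loading a 24-real spinor and an 18-real link matrix per direction and writing a 24-real result. The data layout, the neighborIndex helper, the duplicated per-site link storage, and the omission of spin projection, even-odd preconditioning, and gauge compression are all simplifications for illustration; this is not QUDA's kernel.

```cpp
// Schematic "Dslash-like" stencil kernel illustrating the memory traffic
// described above: 8 x 24 spinor reals + 8 x 18 link reals in, 24 reals out.
#include <cuda_runtime.h>

constexpr int kSpinorReals = 24;  // 4 spins x 3 colors x complex
constexpr int kLinkReals   = 18;  // 3x3 complex matrix

// Neighbor of 'site' in direction 'dir' (0..7 = +/- x,y,z,t) with periodic wrap.
__device__ int neighborIndex(int site, int dir, const int dims[4]) {
  int coord[4], tmp = site;
  for (int d = 0; d < 4; ++d) { coord[d] = tmp % dims[d]; tmp /= dims[d]; }
  const int axis = dir / 2;
  const int step = (dir % 2 == 0) ? +1 : -1;
  coord[axis] = (coord[axis] + step + dims[axis]) % dims[axis];
  int idx = 0;
  for (int d = 3; d >= 0; --d) idx = idx * dims[d] + coord[d];
  return idx;
}

__global__ void dslashLike(float *out, const float *in, const float *links,
                           int volume, const int *dims) {
  const int site = blockIdx.x * blockDim.x + threadIdx.x;
  if (site >= volume) return;

  float acc[kSpinorReals] = {0.0f};

  for (int dir = 0; dir < 8; ++dir) {
    const int nbr = neighborIndex(site, dir, dims);

    // load neighboring spinor: 24 numbers
    const float *spinor = in + (size_t)nbr * kSpinorReals;
    // load the link matrix "connecting" the sites: 18 numbers
    // (here stored per site and direction; real code uses the adjoint for
    // backward directions and can compress to 12 or 8 reals)
    const float *U = links + ((size_t)site * 8 + dir) * kLinkReals;

    // multiply each color vector by the 3x3 complex link matrix and accumulate
    for (int s = 0; s < 4; ++s) {          // spin
      for (int c = 0; c < 3; ++c) {        // output color
        float re = 0.0f, im = 0.0f;
        for (int k = 0; k < 3; ++k) {      // input color
          const float ur = U[(c * 3 + k) * 2 + 0];
          const float ui = U[(c * 3 + k) * 2 + 1];
          const float vr = spinor[(s * 3 + k) * 2 + 0];
          const float vi = spinor[(s * 3 + k) * 2 + 1];
          re += ur * vr - ui * vi;
          im += ur * vi + ui * vr;
        }
        acc[(s * 3 + c) * 2 + 0] += re;
        acc[(s * 3 + c) * 2 + 1] += im;
      }
    }
  }

  // save the result: 24 numbers
  float *res = out + (size_t)site * kSpinorReals;
  for (int i = 0; i < kSpinorReals; ++i) res[i] = acc[i];
}
```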
SINGLE GPU PERFORMANCE
“Wilson Dslash” stencil

[Chart: Wilson Dslash GFLOPS vs lattice length (32 down to 8) on a single Tesla V100 for half, single, and double precision; achieved memory bandwidth 1013-1119 GB/s, cf. STREAM 850 GB/s; CUDA 10.1, GCC 7.3; decreasing lattice length mimics "strong scaling"]

BANDWIDTH OPTIMIZATION

GENERATIONAL COMPARISON
Fµν kernel - batched 3x3 multiplication

[Chart: Fµν kernel GFLOPS on K80, P100, and V100; performance spans 4-21% of peak (K80), 9-35% (P100), and 6-37% (V100)]

QUDA'S AUTOTUNER

QUDA includes an autotuner for ensuring optimal kernel performance
Virtual C++ class “Tunable” that is derived for each kernel you want to autotune

By default, Tunable classes will autotune 1-d CTA size, shared memory size, and grid size
Derived specializations do 2-d and 3-d CTA tuning

Tuned parameters are stored in a std::map and dumped to disk for later reuse

Built in performance metrics and profiling

The user just needs to:
State resource requirements: shared memory per thread and/or per CTA, total number of threads
Specify a tuneKey, which gives each kernel a unique entry and breaks any degeneracy
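As a rough illustration of the mechanism (not QUDA's actual Tunable interface), the sketch below benchmarks a kernel over candidate 1-d block sizes and caches the winner in a std::map keyed by a tuneKey-style string, so later calls reuse the tuned launch configuration. Real autotuners also have to save and restore any state the benchmark launches modify.

```cpp
// Toy autotuner sketch: try several CTA sizes, remember the fastest per tuneKey.
// Names are illustrative, not QUDA's API.
#include <cuda_runtime.h>
#include <map>
#include <string>

__global__ void saxpyKernel(float a, const float *x, float *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

static std::map<std::string, int> tuneCache;  // tuneKey -> best block size

template <typename Launch>
int tuneBlockSize(const std::string &tuneKey, Launch launch) {
  auto hit = tuneCache.find(tuneKey);
  if (hit != tuneCache.end()) return hit->second;   // already tuned

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  int best = 32;
  float bestMs = 1e30f;
  for (int block = 32; block <= 1024; block *= 2) {  // candidate CTA sizes
    cudaEventRecord(start);
    launch(block);                                   // note: benchmark launches have side effects
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < bestMs) { bestMs = ms; best = block; }
  }
  cudaEventDestroy(start);
  cudaEventDestroy(stop);

  tuneCache[tuneKey] = best;   // QUDA also dumps its tune cache to disk for reuse
  return best;
}

void saxpy(float a, const float *x, float *y, int n) {
  const int block = tuneBlockSize("saxpy_n" + std::to_string(n), [&](int b) {
    saxpyKernel<<<(n + b - 1) / b, b>>>(a, x, y, n);
  });
  saxpyKernel<<<(n + block - 1) / block, block>>>(a, x, y, n);
}
```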

SINGLE GPU PERFORMANCE
“Wilson Dslash” stencil

[Chart: Wilson Dslash GFLOPS vs lattice length (32 down to 8) on a single Tesla V100, fixed block size 32 vs autotuned launch parameters, half/single/double precision; tuned kernels reach 1180-1312 GB/s, cf. a perfect L2 roofline of ~1700 GB/s; CUDA 10.1, GCC 7.3; decreasing lattice length mimics "strong scaling"]

RECOMPILE AND RUN
Autotuning provides performance portability

Code from 2008 runs unchanged

[Chart: GFlop/s achieved by the same code recompiled on GPU generations from 2007 through 2017]

MULTI GPU BUILDING BLOCKS
Multi GPU Parallelization

[Diagram: face exchange and wrap-around between neighboring GPUs; each operator application is split into halo packing kernels, interior kernels, halo communication, and halo update kernels]

Multi-dimensional Kernel Computation

2-d example

• Checkerboard updating scheme employed, so only half of the sites are updated per application

• Green: source sites

• Purple: sites to be updated

• Orange: site update complete

Step 1

• Gather boundary sites into contiguous buffers to be shipped off to neighboring GPUs, one direction at a time.

Step 2

• An “interior kernel” updates all local sites to the extent possible. Sites along the boundary receive contributions from local neighbors.

Step 3

• Boundary sites are updated by a series of kernels - one per direction.

• A given boundary kernel must wait for its ghost zone to arrive

• Note in higher dimensions corner sites have a race condition - serialization of kernels required
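A hedged sketch of how Steps 1-3 can be scheduled with CUDA streams and MPI so halo communication overlaps the interior kernel. The kernel names, buffer layout, single exchanged direction, and host-staged transfer are illustrative simplifications, not QUDA's implementation; the host buffers are assumed to be pinned (cudaMallocHost) for true asynchronous copies.

```cpp
// Schematic packing / interior / halo pipeline for one direction.
#include <cuda_runtime.h>
#include <mpi.h>

// Placeholder kernels standing in for the real packing / interior / halo kernels.
__global__ void packFace(float *faceBuf, const float *field, int faceSites) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < faceSites) faceBuf[i] = field[i];      // real code gathers strided boundary sites
}
__global__ void interiorKernel(float *out, const float *field, int volume) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < volume) out[i] = field[i];             // real code applies the stencil to interior sites
}
__global__ void haloKernel(float *out, const float *ghost, int faceSites) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < faceSites) out[i] += ghost[i];         // real code completes the boundary-site updates
}

void applyStencil(float *out, const float *field,
                  float *d_sendFace, float *d_recvGhost,
                  float *h_sendFace, float *h_recvGhost,
                  int volume, int faceSites, size_t faceBytes,
                  int fwdRank, int bwdRank, MPI_Comm comm,
                  cudaStream_t packStream, cudaStream_t computeStream) {
  // Step 1: gather boundary sites into a contiguous buffer and stage to the host
  packFace<<<(faceSites + 127) / 128, 128, 0, packStream>>>(d_sendFace, field, faceSites);
  cudaMemcpyAsync(h_sendFace, d_sendFace, faceBytes, cudaMemcpyDeviceToHost, packStream);

  // Step 2: interior kernel runs concurrently on a different stream
  interiorKernel<<<(volume + 127) / 128, 128, 0, computeStream>>>(out, field, volume);

  // Host staging: exchange faces with the neighbors while the interior runs
  cudaStreamSynchronize(packStream);             // the face must be on the host first
  MPI_Request reqs[2];
  MPI_Irecv(h_recvGhost, (int)faceBytes, MPI_BYTE, bwdRank, 0, comm, &reqs[0]);
  MPI_Isend(h_sendFace,  (int)faceBytes, MPI_BYTE, fwdRank, 0, comm, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

  // Step 3: move the received ghost zone to the device and update the boundary
  cudaMemcpyAsync(d_recvGhost, h_recvGhost, faceBytes, cudaMemcpyHostToDevice, computeStream);
  haloKernel<<<(faceSites + 127) / 128, 128, 0, computeStream>>>(out, d_recvGhost, faceSites);
  cudaStreamSynchronize(computeStream);
}
```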

BENCHMARKING TESTBED
NVIDIA Prometheus Cluster

DGX-1V
8x V100 GPUs
Hypercube-mesh NVLink
4x EDR InfiniBand for inter-node communication
Optimal placement of GPUs and NICs for GDR

CUDA 10.1, GCC 7.3, OpenMPI 3.1

METHODOLOGY

Gain insight from multi-GPU single node performance Simulate strong scaling, with 1x2x2x2 topology on 8 GPUs

Use “Wilson Dslash” stencil

Then move to multi-node…

nvidia-smi topo -m:

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  mlx5_2  mlx5_1  mlx5_3  CPU Affinity
GPU0    X     NV1   NV1   NV2   NV2   SYS   SYS   SYS   PIX     SYS     PHB     SYS     0-19,40-59
GPU1    NV1   X     NV2   NV1   SYS   NV2   SYS   SYS   PIX     SYS     PHB     SYS     0-19,40-59
GPU2    NV1   NV2   X     NV2   SYS   SYS   NV1   SYS   PHB     SYS     PIX     SYS     0-19,40-59
GPU3    NV2   NV1   NV2   X     SYS   SYS   SYS   NV1   PHB     SYS     PIX     SYS     0-19,40-59
GPU4    NV2   SYS   SYS   SYS   X     NV1   NV1   NV2   SYS     PIX     SYS     PHB     20-39,60-79
GPU5    SYS   NV2   SYS   SYS   NV1   X     NV2   NV1   SYS     PIX     SYS     PHB     20-39,60-79
GPU6    SYS   SYS   NV1   SYS   NV1   NV2   X     NV2   SYS     PHB     SYS     PIX     20-39,60-79
GPU7    SYS   SYS   SYS   NV1   NV2   NV1   NV2   X     SYS     PHB     SYS     PIX     20-39,60-79
mlx5_0  PIX   PIX   PHB   PHB   SYS   SYS   SYS   SYS   X       SYS     PHB     SYS
mlx5_2  SYS   SYS   SYS   SYS   PIX   PIX   PHB   PHB   SYS     X       SYS     PHB
mlx5_1  PHB   PHB   PIX   PIX   SYS   SYS   SYS   SYS   PHB     SYS     X       SYS
mlx5_3  SYS   SYS   SYS   SYS   PHB   PHB   PIX   PIX   SYS     PHB     SYS     X

Legend (nvidia-smi topo -m):
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

Binding script with explicit NUMA binding and NIC assignment https://github.com/lattice/quda/wiki/Multi-GPU-Support#maximizing-gdr-performance

LEGACY IMPLEMENTATION (2011)

Early CUDA had no interoperability with MPI
Stage MPI buffers in CPU memory

Early large-scale machines: ~1 GPU / node
GPUs relatively slower

Host staging was a reasonable approach

[Figure: spinor ghost zones and communication steps; faces are gathered into contiguous buffers on the sending device, copied to the host, exchanged via QMP/MPI, copied to the receiving host, and placed into the ghost zones on the receiving device]

SINGLE NODE PERFORMANCE
Host Staging

Host-Device transfers and MPI exchange dominate the execution

24⁴ half precision timeline

[Profiler timeline: packing kernel, interior kernel, halo kernels (t, z, y)]

SINGLE NODE PERFORMANCE
Host Staging

[Chart: Wilson Dslash GFLOPS vs lattice length (32 down to 8) with host staging on DGX-1V, 1x2x2x2 partitioning, half/single/double precision, compared against single-GPU performance]

ENABLING NVLINK COMMUNICATION

Three possible ways to utilize NVLink inter-GPU connections
1. Use CUDA-aware MPI: easiest, can just work out of the box
2. Copy engines: bandwidth
3. Direct reading / writing from kernels: least latency, since a single kernel can write to multiple GPUs
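A minimal sketch of options 2 and 3: enable peer access between two GPUs, move a halo buffer with the copy engines via cudaMemcpyPeerAsync, and let a kernel store directly into the neighbor GPU's memory. Buffer and kernel names are illustrative.

```cpp
// Copy-engine transfers and direct remote stores between two peer GPUs.
#include <cuda_runtime.h>

__global__ void packToNeighbor(float *remoteBuf, const float *field, int faceSites) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < faceSites) remoteBuf[i] = field[i];  // store lands directly in the peer GPU's memory
}

void nvlinkExample(int faceSites) {
  const size_t bytes = faceSites * sizeof(float);
  float *buf0 = nullptr, *field0 = nullptr, *buf1 = nullptr;

  // enable peer access in both directions (requires NVLink or PCIe peer capability)
  cudaSetDevice(0);
  cudaDeviceEnablePeerAccess(1, 0);
  cudaMalloc(&buf0, bytes);
  cudaMalloc(&field0, bytes);

  cudaSetDevice(1);
  cudaDeviceEnablePeerAccess(0, 0);
  cudaMalloc(&buf1, bytes);

  cudaSetDevice(0);

  // Option 2: copy engines move the face from GPU 0 to GPU 1 over NVLink
  cudaMemcpyPeerAsync(buf1, 1, buf0, 0, bytes, 0);

  // Option 3: a kernel on GPU 0 writes straight into GPU 1's buffer;
  // with peer access enabled, buf1 is directly dereferenceable from GPU 0
  packToNeighbor<<<(faceSites + 127) / 128, 128>>>(buf1, field0, faceSites);
  cudaDeviceSynchronize();

  cudaFree(buf0); cudaFree(field0);
  cudaSetDevice(1); cudaFree(buf1);
}
```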

SINGLE NODE PERFORMANCE
Copy Engines

Communication is completely hidden when using NVLink at larger volumes

24⁴ half precision timeline

[Profiler timeline: packing kernel, interior kernel, halo kernels (t, z, y)]

SINGLE NODE PERFORMANCE
Copy Engines

[Chart: Wilson Dslash GFLOPS vs lattice length (32 down to 8) on DGX-1V, 1x2x2x2 partitioning, copy engines vs host staging for half/single/double precision; large local volumes are bandwidth limited, small local volumes are latency limited]

LATENCY OPTIMIZATION

STRONG SCALING

Multiple meanings
Same problem size, more nodes, more GPUs
Same problem, next generation GPUs
Multigrid - strong scaling within the same run (not discussed here)

To tame strong scaling we have to understand the limiters
Bandwidth limiters
Latency limiters

Look at scaling of a half precision Dslash with 16⁴ local volume on one DGX-1

WHAT IS LIMITING STRONG SCALING
Classical host staging

Staging MPI transfers through host memory: 297 µs

[Profiler timeline: packing kernel, interior kernel, D2H copies, H2D copies (t, y, z), halo kernels (t, z, y)]

DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning

API OVERHEADS
CPU overheads and synchronization are expensive

[Profiler timeline: pack and interior kernels, D2H copies, H2D copies, sync calls, halo kernels (t, z, y); gaps come from CPU overheads and synchronization]

DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning

P2P TRANSFERS
Use NVLink: only 1 copy instead of a D2H + H2D pair, higher bandwidth

160 µs (vs 297 µs baseline with host staging)

[Profiler timeline: packing kernel, interior kernel, P2P copies, halo kernels (t, y, z)]

DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning

FUSING KERNELS
Halo kernels do not saturate the GPU

129 µs (vs 297 µs baseline with host staging)

[Profiler timeline: packing kernel, interior kernel, P2P copies, fused halo kernel]

DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning

REMOTE WRITE
Packing kernel writes to the remote GPU using CUDA IPC

89 µs (vs 297 µs baseline with host staging)

[Profiler timeline: packing kernel, sync, interior kernel, sync, fused halo kernel]

DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning
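Since the remote-write path relies on CUDA IPC to map a neighbor process's buffer, here is a minimal sketch of that mechanism: each rank exports a handle for its receive buffer, the neighbor opens it, and the packing kernel then writes straight into the mapped pointer. The buffer and kernel names and the single-neighbor exchange are illustrative, not QUDA's implementation.

```cpp
// CUDA IPC remote-write sketch: one GPU per rank, intra-node only.
#include <cuda_runtime.h>
#include <mpi.h>

__global__ void packToRemote(float *remoteRecvBuf, const float *field, int faceSites) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < faceSites) remoteRecvBuf[i] = field[i];   // writes land in the neighbor's memory
}

void remoteWriteHalo(const float *field, int faceSites, int neighbor, MPI_Comm comm) {
  const size_t bytes = faceSites * sizeof(float);

  // each rank owns a receive buffer and exports an IPC handle for it
  float *myRecvBuf = nullptr;
  cudaMalloc(&myRecvBuf, bytes);
  cudaIpcMemHandle_t myHandle, peerHandle;
  cudaIpcGetMemHandle(&myHandle, myRecvBuf);

  // swap handles with the neighbor (handles are plain bytes, safe to send via MPI)
  MPI_Sendrecv(&myHandle, sizeof(myHandle), MPI_BYTE, neighbor, 0,
               &peerHandle, sizeof(peerHandle), MPI_BYTE, neighbor, 0,
               comm, MPI_STATUS_IGNORE);

  // map the neighbor's receive buffer into this process's address space
  void *peerRecvBuf = nullptr;
  cudaIpcOpenMemHandle(&peerRecvBuf, peerHandle, cudaIpcMemLazyEnablePeerAccess);

  // the packing kernel now writes the halo directly to the neighbor GPU
  packToRemote<<<(faceSites + 127) / 128, 128>>>((float *)peerRecvBuf, field, faceSites);
  cudaDeviceSynchronize();
  MPI_Barrier(comm);   // crude stand-in for the lightweight sync used in practice

  cudaIpcCloseMemHandle(peerRecvBuf);
  cudaFree(myRecvBuf);
}
```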

MERGING KERNELS
Packing and interior merged with remote write (OK for intra-node)

73 µs (vs 297 µs baseline with host staging)

[Profiler timeline: fused packing/interior kernel, sync, fused halo kernel]

DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning

LATENCY OPTIMIZATIONS
Different strategies implemented

DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning

a) baseline
b) use P2P copies
c) fuse halo kernels
d) use remote write to neighbor GPU
e) fuse packing and interior

Reduces overhead through fewer API calls and fewer kernel launches
Still CPU synchronization and API overheads remain

[Chart: GFlop/s achieved for strategies a through e]

POLICY AUTOTUNING
Extended the autotuner to go beyond kernel tuning

What policy to use? (CE vs remote write) ⊗ (zero copy vs GDR vs staging) ⊗ kernel fusion

No single optimal policy

[Chart: Dslash scaling on DGX-1V, performance relative to copy engines vs lattice length per GPU (8 to 32), for half and double precision with remote write or fused pack]

CUDA AWARE MPI
Hit or miss for strong scaling

Preferred over manual host staging
Can use CUDA IPC for intra-node; heuristics for transfer protocol
Performance is implementation dependent

Great for inter-node: use GPUDirect RDMA (used in QUDA)

[Chart: Dslash GFlop/s vs lattice length L (volume L⁴) for half/single/double precision; solid: CUDA IPC, dashed: CUDA-aware MPI]
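For reference, "CUDA-aware" here simply means handing device pointers straight to MPI, as in the sketch below; whether the transfer goes through CUDA IPC, GPUDirect RDMA, or host staging underneath depends on the MPI implementation and how it is configured.

```cpp
// CUDA-aware MPI in its simplest form: the halo buffers live in device memory
// and are passed to MPI directly, with no manual host staging.
#include <cuda_runtime.h>
#include <mpi.h>

void exchangeHalo(float *d_send, float *d_recv, int count,
                  int fwdRank, int bwdRank, MPI_Comm comm) {
  // d_send / d_recv are cudaMalloc'd pointers; a CUDA-aware MPI accepts them
  // and may use CUDA IPC (intra-node) or GPUDirect RDMA (inter-node) underneath
  MPI_Request reqs[2];
  MPI_Irecv(d_recv, count, MPI_FLOAT, bwdRank, 0, comm, &reqs[0]);
  MPI_Isend(d_send, count, MPI_FLOAT, fwdRank, 0, comm, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```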

MULTI-NODE SCALING
Autotuner will pick the detailed policy

[Chart: Dslash GFlop/s vs number of GPUs (8 to 256) for a 64³x128 global volume, half/single/double precision, comparing host staging, intra-node CUDA IPC, and CUDA IPC + GPUDirect RDMA]

NVSHMEM

NVSHMEM
GPU-centric communication

Implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) library
Defines API to (symmetrically) allocate memory that is remotely accessible
Defines API to access remote data
One-sided: e.g. shmem_putmem, shmem_getmem
Collectives: e.g. shmem_broadcast

NVSHMEM features
Symmetric memory allocations in device memory
Communication API calls on CPU (standard and stream-ordered)
Allows kernel-side communication (API and LD/ST) between GPUs
Interoperable with MPI

NVSHMEM STATUS

Research vehicle for designing and evaluating GPU-centric workloads
Early access (EA2) available – please reach out to [email protected]

Main features
NVLink and PCIe support
InfiniBand support (new)
x86 and Power9 (new) support
Interoperability with MPI and OpenSHMEM (new) libraries

Limitation: current version requires device linking (see also S9677)

DSLASH NVSHMEM IMPLEMENTATION
First exploration

Keep general structure of packing, interior and exterior Dslash

Use nvshmem_ptr for intra-node remote writes (fine-grained)
Packing buffer is located on the remote device
Use nvshmem_putmem_nbi to write to the remote GPU over IB (1 RDMA transfer)

Need to make sure writes are visible: nvshmem_barrier_all_on_stream or barrier kernel that only waits for writes from neighbors
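Roughly, the intra-node path just described looks like the sketch below: the ghost buffer comes from the symmetric heap, nvshmem_ptr hands the packing kernel a directly writable pointer into the neighbor's buffer when that peer is reachable over NVLink/PCIe, and a stream-ordered barrier makes the writes visible. This is a hedged illustration, not QUDA's implementation, and it assumes NVSHMEM has already been initialized; buffer and kernel names are invented for the example.

```cpp
// Sketch of the intra-node NVSHMEM remote-write path described above.
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void packToPeer(float *peerGhost, const float *field, int faceSites) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < faceSites) peerGhost[i] = field[i];   // store goes straight to the peer GPU
}

// assumes nvshmem_init() / nvshmemx_init_attr() has already been called
void nvshmemHalo(const float *d_field, int faceSites, cudaStream_t stream) {
  const int mype = nvshmem_my_pe();
  const int npes = nvshmem_n_pes();
  const int fwd  = (mype + 1) % npes;

  // ghost buffer on the symmetric heap: every PE has one at the same symmetric address
  float *ghost = (float *)nvshmem_malloc(faceSites * sizeof(float));

  // for a peer reachable over NVLink/PCIe, nvshmem_ptr returns a pointer that a
  // kernel can use for fine-grained remote writes; otherwise it returns NULL
  float *peerGhost = (float *)nvshmem_ptr(ghost, fwd);
  if (peerGhost != nullptr) {
    packToPeer<<<(faceSites + 127) / 128, 128, 0, stream>>>(peerGhost, d_field, faceSites);
  }
  // (for an InfiniBand neighbor, the nvshmem_putmem_nbi path mentioned above
  //  would be used here instead of a direct store)

  // make remote writes visible before the halo-update kernel consumes them
  nvshmemx_barrier_all_on_stream(stream);

  // ... launch the halo-update kernel on 'stream' using 'ghost' ...
  cudaStreamSynchronize(stream);
  nvshmem_free(ghost);
}
```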

Disclaimer: Results from a first implementation in QUDA with a pre-release version of NVSHMEM

NVSHMEM DSLASH

61 µs (vs 297 µs baseline with host staging)

[Profiler timeline: packing kernel, interior kernel, barrier kernel, fused halo kernel]

DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning

FUSING KERNELS
Fewer kernel launches

56 µs (vs 297 µs baseline with host staging)

[Profiler timeline: fused packing + barrier kernel, interior kernel, fused halo kernel]

DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning

LATENCY OPTIMIZATIONS
Different strategies implemented

a) baseline
b) use P2P copies
c) fuse halo kernels
d) use remote write to neighbor GPU
e) fuse packing and interior
f) Shmem
g) Shmem fused packing + barrier

[Chart: GFlop/s (half precision) achieved for strategies a through g]

NVSHMEM OUTLOOK
Intra-kernel synchronization and communication

[Profiler timeline: a single kernel handling packing, interior, and halo updates; host-staging baseline shown for comparison]

One kernel to rule them all! Communication is handled in the kernel and latencies are hidden.

SUMMARY

STRONG SCALING LATTICE QCD

Optimize for latency and bandwidth
Autotuning ensures optimal kernel performance and policy selection
Overlap communication and compute for optimal bandwidth

API overheads and CPU/GPU synchronization are costly
They prevent overlapping communication and compute
Reduce kernel launches and API synchronization calls (fused kernels)

GPU-centric communication with NVSHMEM takes CPU limitations out of the picture
GPUDirect RDMA techniques for writing data directly to the network

[Chart: multi-node scaling, GFlop/s vs number of GPUs (8 to 256), half/single/double precision]