QE-GPU: between performance, correctness and sustainability. Filippo Spiga
«What I cannot compute, I do not understand.» (adapted from Richard P. Feynman)

Outline
• Overview
• The Quantum ESPRESSO project
• PWSCF GPU porting
• Performance evaluation
• Final considerations: performance, correctness and sustainability
GPU momentum continues to grow
[Chart: growth of GPU adoption, 2008 vs. 2013; source: NVIDIA investors day 2013]

The Accelerator model
CPU and GPU memories are connected through PCI-E:
1. First, COPY input data from CPU memory to GPU memory
2. Load the GPU program and EXECUTE it (caching data on chip for performance)
3. When results are ready, COPY data back from GPU memory to CPU memory
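A minimal CUDA sketch of these three steps; the kernel, array names and sizes are illustrative only, not taken from QE-GPU:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: scale a vector on the GPU.
__global__ void scale(double *x, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void offload_example(double *h_x, int n) {
    double *d_x;
    cudaMalloc(&d_x, n * sizeof(double));

    // 1. COPY input data from CPU memory to GPU memory
    cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);

    // 2. EXECUTE the GPU program (data stays cached on the device)
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0, n);

    // 3. COPY results back from GPU memory to CPU memory
    cudaMemcpy(h_x, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
}
```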
Collision or Convergence?
[Diagram: CPU and GPU placed on programmability vs. parallelism axes; the CPU is a general-purpose system, the GPU is evolving from a few limited functions into a fast-growing eco-system]

3 ways to accelerate codes using GPU
• Libraries: "drop-in" acceleration (see the sketch below)
• Directives (OpenACC): quickly accelerate existing applications; modest speed-up, 2x ~ 10x
• Direct programming (CUDA): maximum performance; best speed-up, up to 100x
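As a concrete example of the "libraries" path, a hedged sketch of a drop-in DGEMM on the GPU through cuBLAS; the wrapper name and the simple synchronous copies are illustrative, not taken from any of the codes discussed here:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Drop-in GEMM on the GPU: C = A*B (column-major, as in BLAS).
void gpu_dgemm(int m, int n, int k,
               const double *A, const double *B, double *C) {
    double alpha = 1.0, beta = 0.0;
    double *dA, *dB, *dC;
    cudaMalloc(&dA, (size_t)m * k * sizeof(double));
    cudaMalloc(&dB, (size_t)k * n * sizeof(double));
    cudaMalloc(&dC, (size_t)m * n * sizeof(double));
    cudaMemcpy(dA, A, (size_t)m * k * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, (size_t)k * n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);
    cublasDestroy(h);

    cudaMemcpy(C, dC, (size_t)m * n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```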
QUANTUM ESPRESSO

What is QUANTUM ESPRESSO?
• QUANTUM ESPRESSO is an integrated suite of computer codes for atomistic simulations based on DFT, pseudo-potentials, and plane waves
• "ESPRESSO" stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization
• QUANTUM ESPRESSO is an initiative of SISSA, EPFL, and ICTP, with many partners in Europe and worldwide
• QUANTUM ESPRESSO is free software that can be freely downloaded. Everybody is free to use it and welcome to contribute to its development
What can QUANTUM ESPRESSO do?
• ground-state calculations
  – Kohn-Sham orbitals and energies, total energies and atomic forces
  – finite as well as infinite systems
  – any crystal structure or supercell
  – insulators and metals (different schemes of BZ integration)
  – structural optimization (many minimization schemes available)
  – transition states and minimum-energy paths (via NEB or string dynamics)
  – electronic polarization via Berry's phase
  – finite electric fields via saw-tooth potential or electric enthalpy
• norm-conserving as well as ultra-soft and PAW pseudo-potentials
• many different energy functionals, including meta-GGA, DFT+U, and hybrids (van der Waals soon to be available)
• scalar-relativistic as well as fully relativistic (spin-orbit) calculations
• magnetic systems, including non-collinear magnetism
• ab-initio molecular dynamics
  – Car-Parrinello (many ensembles and flavors)
  – Born-Oppenheimer (many ensembles and flavors)
  – QM-MM (interface with LAMMPS)
• linear response and vibrational dynamics
  – phonon dispersions, real-space interatomic force constants
  – electron-phonon interactions and superconductivity
  – effective charges and dielectric tensors
  – third-order anharmonicities and phonon lifetimes
  – infrared and (off-resonance) Raman cross sections
  – thermal properties via the quasi-harmonic approximation
• electronic excited states
  – TDDFT for very large systems (both real-time and "turbo-Lanczos")
  – MBPT for very large systems (GW, BSE)
• Wannier interpolations
plus several post-processing tools!
QUANTUM ESPRESSO in numbers
• 350,000+ lines of FORTRAN/C code
• 46 registered developers
• 1600+ registered users
• 5700+ downloads of the latest 5.x.x version
• 2 web-sites (QUANTUM-ESPRESSO.ORG & QE-FORGE.ORG)
• 1 user mailing-list, 1 developer mailing-list
• 24 international schools and training courses (1000+ participants)
QUANTUM ESPRESSO: a global community
The development model
• QUANTUM ESPRESSO is not a monolithic application, but an integrated ecosystem thriving around a small number of core components developed and maintained by a small number of developers
• the ecosystem is designed so as to be alien-friendly: a number of third-party QE-compatible applications and add-ons, often designed to be code-agnostic, are distributed with QE (notable examples include wannier90, Yambo, EPW, WanT, XCrysDen, ...)
• the environment that allows the ecosystem to prosper is provided by the QE-FORGE.ORG platform, freely available to researchers and developers from all over the world
Quantum ESPRESSO package portfolio
[Diagram: add-on packages (WanT, Wannier90, PLUMED, SaX, Yambo, TDDFPT, Xspectra, GIPAW, GWL, NEB, PHonon, PWCOND) built on top of the main codes (PWscf, CP, Atomic) and the CORE modules]
QE Scalability (2010)
(source: J. Phys.: Condens. Matter 21 (2009) 395502, P. Giannozzi et al.)

CP (Aβ-peptide in water: 838 atoms and 2311 electrons, gamma point)
[Figure 3: scalability for medium-size calculations (CP code). CPU time per electronic time step and speedup with respect to 32 processors, as a function of the number of processors and of the number of task groups, on an IBM BlueGene/P and on an SGI Altix. The system is a fragment of an Aβ-peptide in water, 838 atoms, 2311 electrons, 22.1 × 22.9 × 19.9 Å cell, ultrasoft pseudopotentials, Γ point, 25 and 250 Ry cutoffs for orbitals and charge density.]

PWscf (PSIWAT: 587 atoms, 2552 electrons, 4 k-points; CNT: 1532 atoms, 5232 electrons, gamma point)
[Figure 4: scalability for large-scale calculations. Wall time and speedup as a function of the number of processors. PSIWAT: PWscf code, npool = 4, ntask = 4, on a Cray XT4; gold surface covered by thiols in interaction with water, 4 k-points, 10.59 × 20.53 × 32.66 Å cell, 587 atoms, 2552 electrons. CNT: PWscf code, ntask = 4, on a Cray XT4; porphyrin-functionalized nanotube, Γ point, 1532 atoms, 5232 electrons.]

Excellent scalability on up to 4800 processors has been demonstrated with MPI parallelization alone; mixing OpenMP with MPI extends scalability further, with preliminary tests on realistic systems reaching 65,536 cores.
PWSCF GPU PORTING

Plane-Wave Self-Consistent Field (PWSCF)
The solution of the Kohn-Sham equations requires the diagonalization of the matrix H_KS, whose matrix elements are
$$\langle \mathbf{k}+\mathbf{G} \,|\, T \,|\, \mathbf{k}+\mathbf{G}' \rangle = \frac{\hbar^2}{2m}\,(\mathbf{k}+\mathbf{G}')^2 \,\delta_{\mathbf{G},\mathbf{G}'} \qquad \text{(kinetic energy)}$$

$$\langle \mathbf{k}+\mathbf{G} \,|\, V_H \,|\, \mathbf{k}+\mathbf{G}' \rangle = V_H(\mathbf{G}-\mathbf{G}') = 4\pi e^2\, \frac{n(\mathbf{G}-\mathbf{G}')}{|\mathbf{G}-\mathbf{G}'|^2} \qquad \text{(Hartree term)}$$

$$\langle \mathbf{k}+\mathbf{G} \,|\, V_{xc} \,|\, \mathbf{k}+\mathbf{G}' \rangle = \mathrm{FT}\!\left[V_{xc}(\mathbf{r})\right] \qquad \text{(exchange-correlation)}$$
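As an illustration, a minimal CUDA kernel applying the diagonal kinetic term to a wavefunction stored in G space. The array names (g2kin holding |k+G|², psi, hpsi) and the choice of Rydberg atomic units (so that ħ²/2m = 1) are assumptions for this sketch, not taken from the QE-GPU sources.

```cuda
#include <cuComplex.h>

// Kinetic term in G space: hpsi(G) += |k+G|^2 * psi(G)
// (Rydberg atomic units assumed; g2kin holds |k+G|^2 for the npw plane waves.)
__global__ void add_kinetic(const double *g2kin,
                            const cuDoubleComplex *psi,
                            cuDoubleComplex *hpsi, int npw) {
    int ig = blockIdx.x * blockDim.x + threadIdx.x;
    if (ig < npw) {
        hpsi[ig] = cuCadd(hpsi[ig],
                          make_cuDoubleComplex(g2kin[ig] * cuCreal(psi[ig]),
                                               g2kin[ig] * cuCimag(psi[ig])));
    }
}
```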
PWscf parallelization hierarchy

Images
• Only for Nudged Elastic Band (NEB) calculations

K-points
• Distribution over k-points (if more than one)
• Scalability over k-points (if more than one)
• No memory scaling

BGRP_COMM

Plane-waves
• Distribution of wave-function coefficients
• Distribution of real-grid points
• Good memory scaling, good overall scalability, load-balancing

Linear algebra & task groups
• Iterative diagonalization (fully-parallel or serial)
• Smart grouping of 3D-FFTs to reduce compulsory MPI communications

Multi-threaded kernels
• OpenMP handled explicitly or implicitly
• Extend the scaling on multi-core machines with "limited" memory
Simplified PWSCF life-cycle
[Flow chart:]
1. Setup of the initial physical quantities (forces and stresses equal to zero)
2. Calculation of wavefunctions and diagonalization of the Hamiltonian matrix, for each k point (3D-FFT + GEMM + LAPACK)
3. Wavefunction orthonormalization
4. Calculation of the charge density (3D-FFT)
5. Calculation of the new potentials (3D-FFT + GEMM) and of the new non-local potentials
6. Self-consistency check: if not achieved, go back to step 2; once self-consistency is achieved, calculation of new forces (and stresses)
7. Structure optimization: calculation of new atomic positions (and lattice parameters), then a new SCF cycle
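In host-code terms, the cycle above can be sketched as follows; every routine name is a placeholder stub introduced for illustration, not an actual QUANTUM ESPRESSO routine.

```cuda
// Schematic SCF driver mirroring the flow chart above; all routines are stubs.
static void setup_initial_quantities(void) {}
static void diagonalize_hamiltonian(int ik) { (void)ik; } // 3D-FFT + GEMM + LAPACK
static void orthonormalize_wavefunctions(void) {}
static void compute_charge_density(void) {}               // 3D-FFT
static void compute_new_potentials(void) {}               // 3D-FFT + GEMM
static void compute_nonlocal_potentials(void) {}
static int  check_self_consistency(void) { return 1; }
static void compute_forces_and_stresses(void) {}

void scf_cycle(int nks) {
    setup_initial_quantities();
    int converged = 0;
    while (!converged) {
        for (int ik = 0; ik < nks; ++ik)   // one diagonalization per k-point
            diagonalize_hamiltonian(ik);
        orthonormalize_wavefunctions();
        compute_charge_density();
        compute_new_potentials();
        compute_nonlocal_potentials();
        converged = check_self_consistency();
    }
    compute_forces_and_stresses();          // feeds the structure optimization
}
```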
QE-GPU and PRACE

PRACE - Partnership for Advanced Computing in Europe
• PRACE Project 1-IP (2010 – 2012)
  – GPU development: PHIGEMM library and PWSCF
  – Key role in leveraging Quantum ESPRESSO as an EU community code
  – (Better) parallelization of the GIPAW code (over bands)
• PRACE Project 2-IP (2011 – 2013)
  – Extend the multi-threading support with OpenMP
  – Exploratory adoption of OpenACC (GPU with directives)
  – Improvements in the linear algebra and the diagonalization
19 A "lucky" star ng point AUSURF, 112 Au atoms, 2 k-point
1 SCF itera on 1 full SCF cycle
GPU developments
• MPI-GPU binding & GPU memory management
• NEWD → CUDA NEWD (multiple kernels combined)
• ADDUSDENS → CUDA ADDUSDENS
• VLOC_PSI → CUDA VLOC_PSI (CUDA kernels + CUFFT)
• BLAS3 *GEMM → PHIGEMM library (CUBLAS)
• (serial) LAPACK → MAGMA library
VLOC_PSI acts over distributed data; NEWD/ADDUSDENS act over local data
phiGEMM
• Inspired by M. Fatica's LINPACK work
• Independent open-source library, BSD license
• GPU+CPU BLAS 3 *GEMM routine (see the sketch below)
• Manual or "semi-automatic" (SELF-TUNE) split
• Special-K for rectangular matrices
• GEMM → GEMV fallback
• Detailed call-by-call profiling
• Pinned/non-pinned memory, sync/async transfers
• Support of multi-GPU
• web: http://qe-forge.org/projects/phigemm/

[Chart: phiGEMM single- and multi-GPU performance, GFlops vs. matrix dimension (M = N = K, 1024 to 10240); series: CUBLAS thunking, CUBLAS peak, phiGEMM (1 GPU), phiGEMM (2 GPU), MKL]
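A hedged sketch of the underlying idea, not phiGEMM's actual implementation: the output matrix is split by columns, one part computed with cuBLAS on the GPU while the remainder runs on the host BLAS (a CBLAS interface is assumed here). phiGEMM additionally tunes the split and overlaps PCI-E transfers with computation.

```cuda
#include <cblas.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Split C = A*B (column-major, m x n) by columns: the first n_gpu columns of C
// are computed on the GPU, the rest on the host BLAS while the GPU kernel runs.
void split_dgemm(int m, int n, int k, const double *A, const double *B,
                 double *C, double split /* fraction of columns on the GPU */) {
    int n_gpu = (int)(n * split);
    double alpha = 1.0, beta = 0.0;

    double *dA, *dB, *dC;
    cudaMalloc(&dA, (size_t)m * k     * sizeof(double));
    cudaMalloc(&dB, (size_t)k * n_gpu * sizeof(double));
    cudaMalloc(&dC, (size_t)m * n_gpu * sizeof(double));
    cudaMemcpy(dA, A, (size_t)m * k * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, (size_t)k * n_gpu * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    // GPU part: first n_gpu columns of C (launch is asynchronous w.r.t. the host)
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n_gpu, k,
                &alpha, dA, m, dB, k, &beta, dC, m);

    // CPU part: remaining n - n_gpu columns of C, computed while the GPU works
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n - n_gpu, k, alpha, A, m, B + (size_t)k * n_gpu, k,
                beta, C + (size_t)m * n_gpu, m);

    cudaMemcpy(C, dC, (size_t)m * n_gpu * sizeof(double), cudaMemcpyDeviceToHost);
    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```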
MAGMA: LAPACK for GPU

MAGMA uses a HYBRIDIZATION methodology based on:
• Representing linear algebra algorithms as collections of TASKS and DATA DEPENDENCIES among them
• Properly SCHEDULING the tasks' execution over the multicore and the GPU hardware components

What does HYBRIDIZATION mean? (sketched below)
• Panels (Level 2 BLAS) are factored on the CPU using LAPACK
• Trailing matrix updates (Level 3 BLAS) are done on the GPU using "look-ahead"
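A simplified sketch of this hybridization for a blocked Cholesky factorization, with look-ahead omitted: each diagonal panel is factored on the CPU with LAPACK, while the Level 3 BLAS trailing updates run on the GPU through cuBLAS. This illustrates the scheme only; it is not MAGMA source code, and the matrix layout and host buffer handling are assumptions.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <lapacke.h>

// Hybrid blocked Cholesky (lower triangular, column-major), no look-ahead.
// dA is the n x n matrix already resident in GPU memory, leading dimension ldda;
// panel is a host buffer of at least nb*nb doubles.
void hybrid_dpotrf(cublasHandle_t h, int n, double *dA, int ldda, int nb,
                   double *panel) {
    const double one = 1.0, neg_one = -1.0;
    for (int k = 0; k < n; k += nb) {
        int jb = (nb < n - k) ? nb : n - k;

        // Panel: copy the diagonal block to the host and factor it with LAPACK.
        cublasGetMatrix(jb, jb, sizeof(double),
                        dA + k + (size_t)k * ldda, ldda, panel, jb);
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, panel, jb);
        cublasSetMatrix(jb, jb, sizeof(double),
                        panel, jb, dA + k + (size_t)k * ldda, ldda);

        if (k + jb < n) {
            // Trailing updates stay on the GPU (Level 3 BLAS).
            cublasDtrsm(h, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                        n - k - jb, jb, &one,
                        dA + k + (size_t)k * ldda, ldda,
                        dA + (k + jb) + (size_t)k * ldda, ldda);
            cublasDsyrk(h, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                        n - k - jb, jb, &neg_one,
                        dA + (k + jb) + (size_t)k * ldda, ldda, &one,
                        dA + (k + jb) + (size_t)(k + jb) * ldda, ldda);
        }
    }
}
```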
CUDA kernels
• ADDUSDENS: compute-bound kernel
  – Best performance measured: 20x* (realistic? 9x~10x)
• NEWD: compute-bound kernels
  – Best performance measured: 7.2x* (realistic? 3x~4x)
• VLOC_PSI: memory-bound kernels (see the sketch after this list)
  – Best (serial) performance measured: 9x (realistic? ...)
• All the data is moved to GPU memory at once
• External loops over atomic species are kept on the CPU side
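A hedged sketch of the structure of a VLOC_PSI-like operation, not the actual QE-GPU code: inverse-transform the wavefunction to real space with CUFFT, multiply point-by-point by the local potential, then transform back. The 3D plan creation and the scatter of the sparse G-space coefficients onto the dense grid are assumed to happen elsewhere.

```cuda
#include <cufft.h>

// Point-wise product in real space: psi_r(r) *= vloc(r)
__global__ void mul_vloc(cufftDoubleComplex *psi_r, const double *vloc, int nrxx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nrxx) {
        psi_r[i].x *= vloc[i];
        psi_r[i].y *= vloc[i];
    }
}

// Apply the local potential to one band: G space -> real space -> G space.
// psi_r must already hold the wavefunction scattered onto the dense FFT grid.
void apply_vloc(cufftHandle plan3d, cufftDoubleComplex *psi_r,
                const double *vloc, int nr1, int nr2, int nr3) {
    int nrxx = nr1 * nr2 * nr3;
    cufftExecZ2Z(plan3d, psi_r, psi_r, CUFFT_INVERSE);   // to real space
    mul_vloc<<<(nrxx + 255) / 256, 256>>>(psi_r, vloc, nrxx);
    cufftExecZ2Z(plan3d, psi_r, psi_r, CUFFT_FORWARD);   // back to G space
    // NOTE: CUFFT does not normalize; a 1/nrxx scaling is still required.
}
```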
The parallel FFT "issue"

There are two "FFT grid" representations in reciprocal space: wave functions (Ecut) and charge density (4Ecut). A single 3D-FFT is divided into independent 1D-FFTs (see the sketch after the diagram below).
[Diagram: parallel distribution of a 3D-FFT across PE 0..3. Stage 1: ~Nx·Ny/5 1D-FFTs along z on the data owned by each processor; a parallel transpose / data exchange per PE; stage 2: Nx·Nz/2 1D-FFTs along y; stage 3: Ny·Nz 1D-FFTs along x.]

Data are not contiguous and not "trivially" distributed across processors. Zeros are not transformed; different cut-offs preserve accuracy.
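A minimal sketch of one stage of this decomposition using CUFFT's batched interface: the 1D transforms along z for the "sticks" owned by one processor. The data layout (contiguous sticks of length nz) and the omission of the parallel transposes between stages are simplifying assumptions of this sketch.

```cuda
#include <cufft.h>

// One stage of the decomposition: a batch of independent 1D FFTs along z,
// one per stick of the distributed grid owned by this processor.
// data holds nsticks contiguous sticks of length nz in GPU memory.
void fft_sticks_z(cufftDoubleComplex *data, int nz, int nsticks) {
    cufftHandle plan;
    int n[1] = { nz };
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, nz,        // input layout: contiguous sticks
                  NULL, 1, nz,        // output layout: same
                  CUFFT_Z2Z, nsticks);
    cufftExecZ2Z(plan, data, data, CUFFT_INVERSE);  // G -> r along z
    cufftDestroy(plan);
}
```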
H * psi
• compute/update H * psi: compute kinetic and non-local terms (in G space)