GPU acceleration of plane-wave codes using the SIRIUS library
MATERIAL SCIENCE CODES ON INNOVATIVE HPC ARCHITECTURES: TARGETING EXASCALE
Anton Kozhevnikov, CSCS
December 05, 2017

Introduction

Hybrid supercomputers at CSCS

Tödi: AMD Opteron + NVIDIA K20X, 393 Teraflops (October 2011)
Piz Daint: Intel Sandy Bridge + NVIDIA K20X, 7.787 Petaflops (November 2013)
Piz Daint upgrade: Intel Haswell + NVIDIA P100, 25.326 Petaflops (November 2016)

Piz Daint: #3 supercomputer in the world

Cray XC50, 5320 nodes
Intel Xeon E5-2690 v3 12C, 2.6 GHz, 64 GB + NVIDIA Tesla P100 16 GB
4.761 Teraflops / node

Piz Daint node layout

▪ CPU: ~500 Gigaflops, 64 GB of DDR4 host memory at ~60 GB/s
▪ GPU: ~4.2 Teraflops, 16 GB of high-bandwidth memory at 732 GB/s
▪ CPU-GPU link: PCIe x16, 32 GB/s bidirectional

Porting codes to GPUs

No magic “silver bullet” exists!

Usual steps in porting codes to GPUs:
▪ cleanup and refactor the code
▪ (possibly) change the data layout
▪ fully utilize CPU threads and prepare code for node-level parallelization
▪ move compute-intensive kernels to GPUs

Programming models:
▪ CUDA (C / C++ / Fortran)
▪ OpenCL
▪ OpenACC
▪ OpenMP 4.0

Electronic-structure codes

Codes classified by the basis functions for the KS states (periodic Bloch functions such as plane-waves, or localized orbitals) and by the treatment of the atomic potential (full-potential or pseudo-potential):

▪ Full-potential, periodic Bloch functions: FLEUR, Wien2K, Exciting, Elk
▪ Full-potential, localized orbitals: FHI-aims, FPLO
▪ Pseudo-potential, periodic Bloch functions: VASP, CPMD, Quantum ESPRESSO, Abinit, Qbox
▪ Pseudo-potential, localized orbitals: CP2K, SIESTA, OpenMX

Atomic total energies with LAPW

Total energies (Hartree) for light atoms:

Element (configuration)   NIST          LAPW          MADNESS       NWCHEM
Hydrogen  (1s1)           -0.478671     -0.478671     -0.478671
Helium    (1s2)           -2.834836     -2.834835     -2.834836
Lithium   (2s1)           -7.343957     -7.343958     -7.343957
Beryllium (2s2)           -14.447209    -14.447209    -14.447209
Boron     (2s2p1)                       -24.356062    -24.356065    -24.356064
Carbon    (2s2p2)                       -37.470324    -37.470329    -37.470328
Nitrogen  (2s2p3)         -54.136799    -54.136792    -54.136798    -54.136799
Oxygen    (2s2p4)                       -74.531333    -74.531345    -74.531344
Fluorine  (2s2p5)                       -99.114324    -99.114902    -99.114901
Neon      (2s2p6)         -128.233481   -128.233477   -128.233481

Linearized augmented plane-wave method (LAPW):
▪ provides a very high accuracy of the DFT total energy
▪ designed for crystalline solids
▪ considered the gold standard for electronic-structure simulations (see the Delta DFT codes effort)

Full-potential linearized augmented plane-wave method

▪ Unit cell is partitioned into “muffin-tin” spheres and the interstitial region
▪ Inside the MT spheres a spherical harmonic expansion is used
▪ In the interstitial region functions are expanded in plane-waves

[Figure: unit cell with muffin-tin spheres around atom #1 and atom #2, surrounded by the interstitial region]

Basis functions:

$$\varphi_{\mathbf{G}+\mathbf{k}}(\mathbf{r}) =
\begin{cases}
\displaystyle\sum_{\ell m}\sum_{\nu=1}^{O^{\alpha}_{\ell}} A^{\alpha}_{\ell m\nu}(\mathbf{G}+\mathbf{k})\, u^{\alpha}_{\ell\nu}(r)\, Y_{\ell m}(\hat{\mathbf{r}}) & \mathbf{r} \in \mathrm{MT}_{\alpha} \\
\dfrac{1}{\sqrt{\Omega}}\, e^{i(\mathbf{G}+\mathbf{k})\mathbf{r}} & \mathbf{r} \in \mathrm{I}
\end{cases}$$

Potential and density:

$$V(\mathbf{r}) =
\begin{cases}
\displaystyle\sum_{\ell m} V^{\alpha}_{\ell m}(r)\, Y_{\ell m}(\hat{\mathbf{r}}) & \mathbf{r} \in \mathrm{MT}_{\alpha} \\
\displaystyle\sum_{\mathbf{G}} V(\mathbf{G})\, e^{i\mathbf{G}\mathbf{r}} & \mathbf{r} \in \mathrm{I}
\end{cases}
\qquad
\rho(\mathbf{r}) =
\begin{cases}
\displaystyle\sum_{\ell m} \rho^{\alpha}_{\ell m}(r)\, Y_{\ell m}(\hat{\mathbf{r}}) & \mathbf{r} \in \mathrm{MT}_{\alpha} \\
\displaystyle\sum_{\mathbf{G}} \rho(\mathbf{G})\, e^{i\mathbf{G}\mathbf{r}} & \mathbf{r} \in \mathrm{I}
\end{cases}$$

Full-potential linearized augmented plane-wave method

▪ No approximation to atomic potential
▪ Core states are included
▪ Number of basis functions: ~100 / atom
▪ Number of valence states: ~15-20% of the total basis size
▪ Large condition number of the overlap matrix
▪ Full diagonalization of a dense matrix is required (iterative subspace diagonalization schemes are not efficient)
▪ Atomic forces can be easily computed
▪ Stress tensor can’t be easily computed (N-point numerical scheme is required)

Pseudopotential plane-wave method

▪ Unit cell is mapped to a regular grid
▪ All functions are expanded in plane-waves
▪ Atomic potential is replaced by a pseudopotential:

$$\hat{V}_{PS} = V_{loc}(\mathbf{r}) + \sum_{\alpha}\sum_{\xi\xi'} D^{\alpha}_{\xi\xi'}\, |\beta^{\alpha}_{\xi}\rangle\langle\beta^{\alpha}_{\xi'}|$$

Basis functions:

$$\varphi_{\mathbf{G}+\mathbf{k}}(\mathbf{r}) = \frac{1}{\sqrt{\Omega}}\, e^{i(\mathbf{G}+\mathbf{k})\mathbf{r}}$$

Potential and density:

$$V(\mathbf{r}) = \sum_{\mathbf{G}} V(\mathbf{G})\, e^{i\mathbf{G}\mathbf{r}}, \qquad \rho(\mathbf{r}) = \sum_{\mathbf{G}} \rho(\mathbf{G})\, e^{i\mathbf{G}\mathbf{r}}$$

Pseudopotential plane-wave method

▪ Approximation to atomic potential
▪ Core states are excluded
▪ Number of basis functions: ~1000 / atom
▪ Number of valence states: ~0.001 - 0.01% of the total basis size
▪ Efficient iterative subspace diagonalization schemes exist
▪ Atomic forces can be easily computed
▪ Stress tensor can be easily computed

Common features of the FP-LAPW and PP-PW methods

▪ Definition of the unit cell (atoms, atom types, lattice vectors, symmetry operations, etc.)
▪ Definition of the reciprocal lattice, plane-wave cutoffs, G vectors, G+k vectors
▪ Definition of the wave-functions
▪ FFT driver
▪ Generation of the charge density on the regular grid (see the sketch after this list)
▪ Generation of the XC-potential
▪ Symmetrization of the density, potential and occupancy matrices
▪ Low-level numerics (spherical harmonics, Bessel functions, Gaunt coefficients, spline interpolation, Wigner D-matrix, linear algebra wrappers, etc.)
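To make one of these shared building blocks concrete, here is a minimal C++ sketch (not SIRIUS code) of the charge-density generation on the regular grid, ρ(r) = Σ_j f_j |ψ_j(r)|². It assumes the wave-functions have already been transformed to the real-space grid (e.g. by the FFT driver) and that the band occupations f_j are known.

// Minimal sketch (not SIRIUS code): accumulate the charge density on the
// regular real-space grid from the occupied Kohn-Sham states.
// psi_r[j] holds psi_j(r) on the grid, f[j] are band occupations.
#include <complex>
#include <vector>

std::vector<double> generate_density(
    std::vector<std::vector<std::complex<double>>> const& psi_r, // psi_j(r), one vector per band
    std::vector<double> const& f)                                // band occupations f_j
{
    std::size_t const ngrid = psi_r.empty() ? 0 : psi_r[0].size();
    std::vector<double> rho(ngrid, 0.0);
    for (std::size_t j = 0; j < psi_r.size(); j++) {       // loop over occupied bands
        for (std::size_t ir = 0; ir < ngrid; ir++) {       // loop over grid points
            rho[ir] += f[j] * std::norm(psi_r[j][ir]);      // f_j |psi_j(r)|^2
        }
    }
    return rho;
}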

SIRIUS library

Motivation for a common domain specific library

[Diagram: users, computational scientists and code developers all interact with the code, which runs on the supercomputer]

Extend the legacy Fortran codes with API calls to a domain-specific library which runs on GPUs and other novel architectures.

Today:
  Quantum ESPRESSO (inherent PW / PAW implementation) and Exciting / Elk (inherent LAPW implementation)
  on top of BLAS, PBLAS, LAPACK, ScaLAPACK, FFT
  running on the CPU

With SIRIUS:
  Quantum ESPRESSO (inherent PW / PAW implementation) and Exciting / Elk (inherent LAPW implementation)
  on top of the SIRIUS domain specific library (LAPW / PW / PAW implementation)
  on top of BLAS, PBLAS, LAPACK, ScaLAPACK, FFT, cuBLAS, MAGMA, PLASMA, cuFFT
  running on CPU + GPU

Where to draw the line?

SIRIUS domain specific library: LAPW / PW / PAW implementation of the DFT self-consistency cycle.

Eigen-value problem:
$$\Big(-\tfrac{1}{2}\nabla^2 + v_{eff}(\mathbf{r})\Big)\,\psi_j(\mathbf{r}) = \varepsilon_j \psi_j(\mathbf{r})$$

Effective potential construction:
$$v_{eff}(\mathbf{r}) = \int \frac{\rho(\mathbf{r}')}{|\mathbf{r}' - \mathbf{r}|}\, d\mathbf{r}' + v_{XC}[\rho](\mathbf{r}) + v_{ext}(\mathbf{r})$$

Density generation:
$$\rho^{new}(\mathbf{r}) = \sum_j |\psi_j(\mathbf{r})|^2$$

Density mixing:
$$\rho(\mathbf{r}) = \alpha\,\rho^{new}(\mathbf{r}) + (1-\alpha)\,\rho^{old}(\mathbf{r})$$
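A schematic view of this self-consistency loop with simple linear mixing. This is a hand-written sketch with hypothetical helper functions (generate_effective_potential, solve_ks_and_generate_density), not the SIRIUS DFT_ground_state implementation; the two helpers are left as declarations.

// Schematic SCF cycle with linear density mixing (hypothetical helpers,
// not the SIRIUS API).
#include <cmath>
#include <vector>

using Density = std::vector<double>;

Density solve_ks_and_generate_density(Density const& veff); // KS solve + rho_new from v_eff
Density generate_effective_potential(Density const& rho);   // v_eff[rho]

double distance(Density const& a, Density const& b)
{
    double d = 0;
    for (std::size_t i = 0; i < a.size(); i++) d += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(d);
}

void scf_loop(Density rho, double alpha, double tol, int max_iter)
{
    for (int iter = 0; iter < max_iter; iter++) {
        auto veff    = generate_effective_potential(rho);    // effective potential construction
        auto rho_new = solve_ks_and_generate_density(veff);  // eigen-value problem + density generation
        if (distance(rho_new, rho) < tol) break;             // self-consistency reached
        for (std::size_t i = 0; i < rho.size(); i++) {       // linear density mixing
            rho[i] = alpha * rho_new[i] + (1.0 - alpha) * rho[i];
        }
    }
}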

Output:

▪ wave-functions ψ_j(r) and eigen-energies ε_j
▪ charge density ρ(r) and magnetization m(r)
▪ total energy E_tot, atomic forces F_α and stress tensor σ_αβ

SIRIUS library

Full-potential part:
▪ full-potential (L)APW+lo
▪ non-magnetic, collinear and non-collinear magnetic ground states
▪ non-relativistic, ZORA and IORA valence solvers
▪ solver for core states

Pseudopotential part:
▪ norm-conserving, ultrasoft and PAW pseudopotentials
▪ non-magnetic, collinear and non-collinear magnetic ground states
▪ spin-orbit correction
▪ atomic forces
▪ stress tensor
▪ Gamma-point case

SIRIUS library
https://github.com/electronic-structure/SIRIUS

SIRIUS is a collection of classes that abstract away the different building blocks of PW and LAPW codes. The class composition hierarchy starts from the most primitive classes (Communicator, mdarray, etc.) and progresses towards several high-level classes (DFT_ground_state, Band, Potential, etc.). The code is written in C++11 with MPI, OpenMP and CUDA programming models.

Class composition (high-level to low-level):
DFT_ground_state, Band, Local_operator, Potential, Density, K_point_set, K_point, Non_local_operator, Beta_projectors, Periodic_function, Matching_coefficients, Simulation_context, Unit_cell, Radial_integrals, Augmentation_operator, Step_function, Atom_type, Radial_grid, Atom, Spline, Eigensolver, Wave_functions, linalg, dmatrix, BLACS_grid, FFT3D, MPI_grid, Gvec, Communicator, mdarray, splindex, matrix3d, vector3d

Doxygen documentation: https://electronic-structure.github.io/SIRIUS-doc/

Potential class

▪ Generate LDA / GGA exchange-correlation potential from the density:
$$v_{XC}(\mathbf{r}) = \frac{\delta E_{XC}[\rho(\mathbf{r}), \mathbf{m}(\mathbf{r})]}{\delta \rho(\mathbf{r})}, \qquad
\mathbf{B}_{XC}(\mathbf{r}) = \frac{\delta E_{XC}[\rho(\mathbf{r}), \mathbf{m}(\mathbf{r})]}{\delta \mathbf{m}(\mathbf{r})}$$

▪ Generate Hartree potential (see the reciprocal-space sketch after this list):
$$v^{H}(\mathbf{r}) = \int \frac{\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d\mathbf{r}'$$

▪ Generate local part of pseudo-potential:
$$V_{loc}(\mathbf{G}) = \frac{1}{\Omega} \sum_{\mathbf{T},\alpha} \int e^{-i\mathbf{G}\mathbf{r}}\, V^{\alpha}_{loc}(\mathbf{r} - \mathbf{T} - \boldsymbol{\tau}_{\alpha})\, d\mathbf{r}$$

▪ Generate D-operator matrix:
$$D^{\alpha}_{\xi\xi'} = \int V(\mathbf{r})\, Q^{\alpha}_{\xi\xi'}(\mathbf{r})\, d\mathbf{r}$$
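In the plane-wave representation the Hartree integral is most conveniently evaluated in reciprocal space, where the Poisson equation is diagonal: v_H(G) = 4π ρ(G)/G² in Hartree atomic units, with the divergent G = 0 term treated separately. A minimal sketch of that step (not the SIRIUS Potential class):

// Minimal sketch: Hartree potential in reciprocal space,
// v_H(G) = 4*pi*rho(G)/G^2 (Hartree atomic units).
// gvec_len2[ig] holds |G|^2; the G = 0 component is conventionally
// set to zero for a charge-neutral cell.
#include <complex>
#include <vector>

std::vector<std::complex<double>> hartree_potential(
    std::vector<std::complex<double>> const& rho_g,  // rho(G)
    std::vector<double> const& gvec_len2)            // |G|^2
{
    double const fourpi = 4.0 * 3.14159265358979323846;
    std::vector<std::complex<double>> vh_g(rho_g.size());
    for (std::size_t ig = 0; ig < rho_g.size(); ig++) {
        if (gvec_len2[ig] > 1e-12) {                 // skip G = 0
            vh_g[ig] = fourpi * rho_g[ig] / gvec_len2[ig];
        }
    }
    return vh_g;
}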

Density class

▪ Generate charge density and magnetization from the valence wave-functions:
$$\boldsymbol{\rho}(\mathbf{r}) = \frac{1}{2}\Big(I\rho(\mathbf{r}) + \boldsymbol{\sigma}\cdot\mathbf{m}(\mathbf{r})\Big)
= \frac{1}{2}\begin{pmatrix} \rho(\mathbf{r}) + m_z(\mathbf{r}) & m_x(\mathbf{r}) - i m_y(\mathbf{r}) \\ m_x(\mathbf{r}) + i m_y(\mathbf{r}) & \rho(\mathbf{r}) - m_z(\mathbf{r}) \end{pmatrix}
= \sum_j^{occ} \begin{pmatrix} \psi^{\uparrow*}_{j}(\mathbf{r})\psi^{\uparrow}_{j}(\mathbf{r}) & \psi^{\downarrow*}_{j}(\mathbf{r})\psi^{\uparrow}_{j}(\mathbf{r}) \\ \psi^{\uparrow*}_{j}(\mathbf{r})\psi^{\downarrow}_{j}(\mathbf{r}) & \psi^{\downarrow*}_{j}(\mathbf{r})\psi^{\downarrow}_{j}(\mathbf{r}) \end{pmatrix}$$

▪ Generate core charge density (full-potential case)

▪ Generate density matrix (see the sketch after this list):
$$d^{\alpha}_{\xi\xi'} = \langle \beta^{\alpha}_{\xi} | \hat{N} | \beta^{\alpha}_{\xi'} \rangle = \sum_{jk} \langle \beta^{\alpha}_{\xi} | \psi_{jk} \rangle\, n_{jk}\, \langle \psi_{jk} | \beta^{\alpha}_{\xi'} \rangle$$

▪ Augment charge density:
$$\tilde{\rho}(\mathbf{G}) = \sum_{\alpha} \sum_{\xi\xi'} d^{\alpha}_{\xi\xi'}\, Q^{\alpha}_{\xi'\xi}(\mathbf{G})$$

▪ Symmetrize density and magnetization
▪ Mix density and magnetization
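The density-matrix generation above reduces to sums over the beta-projector inner products ⟨β_ξ|ψ_j⟩. A minimal sketch with plain loops for a single atom and a single k-point (in production code the sums over bands become zgemm calls); this is illustrative, not the SIRIUS Density class:

// Minimal sketch: d_{xi,xi'} = sum_j <beta_xi|psi_j> n_j <psi_j|beta_xi'>
// for one atom and one k-point. beta_psi(xi, j) = <beta_xi|psi_j> are the
// beta-projector inner products; n[j] are band occupations.
#include <complex>
#include <vector>

using cdouble = std::complex<double>;

std::vector<cdouble> density_matrix(
    std::vector<cdouble> const& beta_psi, // row-major (nbeta x nbands): <beta_xi|psi_j>
    std::vector<double> const& n,         // band occupations n_j
    int nbeta, int nbands)
{
    std::vector<cdouble> d(nbeta * nbeta, 0.0);
    for (int xi = 0; xi < nbeta; xi++) {
        for (int xj = 0; xj < nbeta; xj++) {
            for (int j = 0; j < nbands; j++) {
                // <psi_j|beta_xi'> is the complex conjugate of <beta_xi'|psi_j>
                d[xi * nbeta + xj] += beta_psi[xi * nbands + j] * n[j] *
                                      std::conj(beta_psi[xj * nbands + j]);
            }
        }
    }
    return d;
}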

Band class

▪ Sets up and solves the Kohn-Sham eigen-value problem $\hat{H}|\psi\rangle = \varepsilon|\psi\rangle$
▪ Full-potential LAPW case: Hx = εSx, direct diagonalization of a dense matrix
▪ Pseudopotential PW case: Hx = εx (norm-conserving pseudo) or Hx = εSx (ultrasoft pseudo), iterative subspace diagonalization:
  ▪ Conjugate gradient
  ▪ LOBPCG
  ▪ RMM-DIIS
  ▪ Chebyshev filtering method
  ▪ Davidson algorithm
All solvers were tried; the Davidson method works the best.

Davidson iterative solver

▪ We want to solve the eigen-value problem $\hat{H}\tilde\psi_j = \varepsilon_j \tilde\psi_j$

▪ We know how to apply the Hamiltonian to the wave-functions:

$$\tilde{h}_j(\mathbf{G}) = \widetilde{H\psi_j}(\mathbf{G}) = \int e^{-i\mathbf{G}\mathbf{r}}\, \hat{H}\psi_j(\mathbf{r})\, d\mathbf{r}, \qquad
\hat{H} = -\tfrac{1}{2}\nabla^2 + v_{eff}(\mathbf{r}) + \sum_{\alpha}\sum_{\xi\xi'} D^{\alpha}_{\xi\xi'}\, |\beta^{\alpha}_{\xi}\rangle\langle\beta^{\alpha}_{\xi'}|$$

▪ Key idea of the Davidson iterative solver: start with a subspace spanned by a guess to ψ_j and expand it with preconditioned residuals.

Davidson iterative solver: application of the Hamiltonian

▪ Application of the Laplace operator (kinetic energy):
$$-\frac{1}{2}\nabla^2 \psi_j(\mathbf{r}) = -\frac{1}{2}\nabla^2 \sum_{\mathbf{G}} e^{i\mathbf{G}\mathbf{r}}\, \tilde\psi_j(\mathbf{G}) = \sum_{\mathbf{G}} \frac{G^2}{2}\, e^{i\mathbf{G}\mathbf{r}}\, \tilde\psi_j(\mathbf{G})$$

▪ Application of the local part of the potential (FFT round trip per band):
$$\tilde\psi_j(\mathbf{G}) \xrightarrow{\ \mathrm{FFT}^{-1}\ } \psi_j(\mathbf{r}) \ \rightarrow\ v_{eff}(\mathbf{r})\psi_j(\mathbf{r}) \xrightarrow{\ \mathrm{FFT}\ } \tilde{h}_j(\mathbf{G})$$

▪ Application of the non-local part of the potential (three zgemm calls):
$$\sum_{\alpha\xi} \beta^{\alpha}_{\xi}(\mathbf{G}) \sum_{\xi'} D^{\alpha}_{\xi\xi'} \sum_{\mathbf{G}'} \beta^{\alpha*}_{\xi'}(\mathbf{G}')\, \tilde\psi_j(\mathbf{G}') \ \rightarrow\ \tilde{h}_j(\mathbf{G})$$
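A schematic H|ψ⟩ application for a single band that mirrors the three steps above: kinetic term as a diagonal scaling in G-space, local potential via an FFT round trip, and the non-local term via a (here single) beta-projector. The fft_backward/fft_forward helpers are hypothetical stand-ins for the FFT driver, and the mapping between the plane-wave sphere and the FFT box is glossed over; this is not the SIRIUS implementation.

// Schematic H|psi> application for a single band (hypothetical helpers,
// not the SIRIUS implementation).
#include <complex>
#include <vector>

using cdouble = std::complex<double>;

// Hypothetical FFT driver: G-space <-> regular real-space grid.
std::vector<cdouble> fft_backward(std::vector<cdouble> const& f_g); // f(G) -> f(r)
std::vector<cdouble> fft_forward(std::vector<cdouble> const& f_r);  // f(r) -> f(G)

std::vector<cdouble> apply_h(
    std::vector<cdouble> const& psi_g,   // psi(G+k)
    std::vector<double> const& gk_len2,  // |G+k|^2
    std::vector<double> const& veff_r,   // v_eff(r) on the regular grid
    std::vector<cdouble> const& beta_g,  // beta(G), a single projector for brevity
    double d)                            // D matrix element for that projector
{
    std::size_t const npw = psi_g.size();
    std::vector<cdouble> h_g(npw);

    // 1) kinetic energy: 0.5 * |G+k|^2 * psi(G)
    for (std::size_t ig = 0; ig < npw; ig++) h_g[ig] = 0.5 * gk_len2[ig] * psi_g[ig];

    // 2) local potential: FFT^-1 -> multiply by v_eff(r) -> FFT
    auto psi_r = fft_backward(psi_g);
    for (std::size_t ir = 0; ir < psi_r.size(); ir++) psi_r[ir] *= veff_r[ir];
    auto vpsi_g = fft_forward(psi_r);
    for (std::size_t ig = 0; ig < npw; ig++) h_g[ig] += vpsi_g[ig];

    // 3) non-local potential: |beta> D <beta|psi>  (zgemm calls for many projectors/bands)
    cdouble bp = 0.0;                                        // <beta|psi>
    for (std::size_t ig = 0; ig < npw; ig++) bp += std::conj(beta_g[ig]) * psi_g[ig];
    for (std::size_t ig = 0; ig < npw; ig++) h_g[ig] += beta_g[ig] * (d * bp);

    return h_g;
}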

Davidson iterative solver: algorithm

• Initialize the trial basis set: $\tilde\phi^{0}_{j} \leftarrow \tilde\psi_{j}$
• Apply the Hamiltonian to the basis functions: $\tilde h^{m}_{j} = \widetilde{H\phi^{m}_{j}}$
• Compute the reduced Hamiltonian matrix: $h^{m}_{jj'} = \sum_{\mathbf{G}} \tilde\phi^{m*}_{j}(\mathbf{G})\, \tilde h^{m}_{j'}(\mathbf{G})$
• Diagonalize the reduced Hamiltonian matrix and get the N lowest eigen-pairs: $h^{m} Z^{m} = \epsilon_j Z^{m}$
• Compute residuals ($R_j = \hat H \psi_j - \epsilon_j \psi_j$): $\tilde R^{m}_{j} = \tilde h^{m} Z^{m}_{j} - \epsilon_j \tilde\phi^{m} Z^{m}_{j}$
• Apply the preconditioner to the unconverged residuals, orthogonalize and add the resulting functions to the basis: $\{\tilde\phi^{m+1}_{j}\} = \{\tilde\phi^{m}_{j}\} \oplus \{P\tilde R^{m}_{j}\}$
• Recompute the wave-functions: $\tilde\psi_{j} = \tilde\phi^{m} Z^{m}_{j}$

=== QE output ===
Dynamical RAM for wfc:   2.50 MB
Dynamical RAM for <psi|beta>: 2.81 MB
Dynamical RAM for psi:  10.02 MB
Dynamical RAM for hpsi: 10.02 MB
Dynamical RAM for spsi: 10.02 MB
=================
(in this output, "psi" corresponds to the trial basis $\tilde\phi_j$)
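Putting the steps together, a compact sketch of the Davidson loop for a single k-point, without the bookkeeping of converged/unconverged bands. All helpers (apply_h, diagonalize, residuals, precondition, ...) are hypothetical declarations, not the SIRIUS Band class API.

// Compact sketch of the Davidson iterative solver (hypothetical helpers,
// not the SIRIUS Band class). Wave-functions are stored as matrices
// phi(G, i) with npw rows and a growing number of columns.
#include <complex>
#include <vector>

using cdouble = std::complex<double>;
struct Matrix { int nrow = 0, ncol = 0; std::vector<cdouble> data; };

// Hypothetical building blocks.
Matrix apply_h(Matrix const& phi);                           // h = H phi (see previous sketch)
Matrix gemm_ah_b(Matrix const& a, Matrix const& b);          // a^H * b (reduced matrices)
Matrix gemm(Matrix const& a, Matrix const& b);               // a * b
void   diagonalize(Matrix const& h_sub, int nev,
                   std::vector<double>& eval, Matrix& evec); // lowest nev eigen-pairs
Matrix residuals(Matrix const& h, Matrix const& phi,
                 Matrix const& evec, std::vector<double> const& eval); // R = h Z - eps phi Z
double max_norm(Matrix const& r);                            // convergence measure
Matrix precondition(Matrix const& r, std::vector<double> const& eval); // (H_GG - eps_j)^-1 R_j
Matrix append_and_orthogonalize(Matrix const& phi, Matrix const& extra);

void davidson(Matrix& psi, std::vector<double>& eval, int num_bands,
              double tol, int max_iter)
{
    Matrix phi = psi;                        // trial basis, starts from the guess psi
    Matrix evec;
    for (int m = 0; m < max_iter; m++) {
        Matrix h     = apply_h(phi);         // apply Hamiltonian to the basis functions
        Matrix h_sub = gemm_ah_b(phi, h);    // reduced Hamiltonian  phi^H H phi
        diagonalize(h_sub, num_bands, eval, evec);   // N lowest eigen-pairs of h_sub
        Matrix r = residuals(h, phi, evec, eval);    // R_j = h Z_j - eps_j phi Z_j
        if (max_norm(r) < tol) break;        // all residuals (nearly) zero -> converged
        Matrix pr = precondition(r, eval);   // P R_j(G) = (H_GG - eps_j)^-1 R_j(G)
        phi = append_and_orthogonalize(phi, pr);     // expand the variational space
    }
    psi = gemm(phi, evec);                   // recompute wave-functions psi = phi Z
}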

Davidson iterative solver: pictorial walkthrough (iterations m = 0, 1, 2, ...)

• Initialize the subspace basis functions and apply the Hamiltonian: $\tilde h^{0}_{j}(\mathbf{G}) = \widetilde{H\phi^{0}_{j}}(\mathbf{G})$
• Compute the reduced Hamiltonian matrix: $h^{m}_{jj'} = \sum_{\mathbf{G}} \tilde\phi^{m*}_{j}(\mathbf{G})\, \tilde h^{m}_{j'}(\mathbf{G})$
• Solve the subspace eigen-value problem: $h^{m} Z^{m}_{j} = \varepsilon_j Z^{m}_{j}$
• Compute residuals: $\tilde R^{m}_{j} = \tilde h^{m} Z^{m}_{j} - \varepsilon_j \tilde\phi^{m} Z^{m}_{j}$
• Apply the preconditioner: $P\tilde R^{m}_{j}(\mathbf{G}) = (H_{\mathbf{GG}} - \varepsilon_j)^{-1}\, \tilde R^{m}_{j}(\mathbf{G})$
• Expand the variational space and apply the Hamiltonian to the new basis functions: $\tilde\phi^{m+1}_{j} = \tilde\phi^{m}_{j} \oplus P\tilde R^{m}_{j}$, $\tilde h^{m+1}_{j} = \widetilde{H\phi^{m+1}_{j}}$
• Iterate until convergence (all residuals are zero) is reached
• Recompute the wave-functions $\tilde\psi_{j} = \tilde\phi^{m} Z^{m}_{j}$ and take ε_j from the last subspace diagonalization

SIRIUS-enabled Quantum ESPRESSO

Development cycle

https://github.com/electronic-structure/q-e

Branches: QEF/q-e/master (upstream) ↔ electronic-structure/q-e/master ↔ electronic-structure/q-e/sirius, connected by pull requests.

Example of QE/SIRIUS interoperability

Initialization phase:
▪ QE: read input file, read pseudopotentials, create a list of k-points, initialize data structures, communicators, etc.
▪ QE → SIRIUS: set unit cell parameters (lattice vectors, atom types, atomic positions, etc.), cutoffs and other parameters
▪ SIRIUS: initialize simulation context
▪ QE → SIRIUS: set k-points
▪ SIRIUS: initialize K_point_set class
▪ SIRIUS: initialize Density class
▪ SIRIUS: initialize Potential class
▪ SIRIUS: initialize DFT_ground_state class
▪ SIRIUS: generate initial density
▪ QE ← SIRIUS: get rho(G) and mag(G)

SCF cycle:
▪ SIRIUS: solve band problem and find KS orbitals
▪ QE ← SIRIUS: get band energies
▪ QE: find band occupancies
▪ QE → SIRIUS: set band occupancies
▪ SIRIUS: generate unsymmetrized rho(G) and mag(G)
▪ QE ← SIRIUS: get rho(G) and mag(G)
▪ QE: symmetrize rho(G) and mag(G)
▪ QE: mix rho(G) and mag(G)
▪ QE: generate Veff(r) and Veff(G)
▪ QE → SIRIUS: set Veff(G)
▪ QE ← SIRIUS: get forces (SIRIUS: generate forces)
▪ QE ← SIRIUS: get stress tensor (SIRIUS: generate stress tensor)

Variable cell relaxation of Si63Ge

Performance benchmark of QE, the CUDA Fortran version of QE, and SIRIUS-enabled QE for the 64-atom unit cell of Si1-xGex. The runs were performed on hybrid nodes with a 12-core Intel Haswell @ 2.5 GHz + NVIDIA Tesla P100 card (QE-GPU, QE-SIRIUS-GPU) and on nodes with a 68-core Intel Xeon Phi (KNL) processor @ 1.4 GHz (QE-KNL). Time for the full 'vc-relax' calculation is reported.

[Bar chart: time to solution (sec) vs. number of nodes (1, 2, 5, 10) for QE-KNL (CINECA), QE-SIRIUS-KNL (CSCS), QE-SIRIUS-GPU (CSCS) and QE-GPU (NVIDIA)]

Ground state of Pt-cluster in water

Performance benchmark of QE and SIRIUS-enabled QE for the 288-atom unit cell of a Pt cluster embedded in water. The runs were performed on dual-socket 18-core Intel Broadwell @ 2.1 GHz nodes (BW), on hybrid nodes with a 12-core Intel Haswell @ 2.5 GHz + NVIDIA Tesla P100 card (GPU) and on nodes with a 64-core Intel Xeon Phi processor @ 1.3 GHz (KNL). The ELPA eigen-value solver was used for the CPU runs. Time for the SCF ground-state calculation is reported.

[Bar chart: time to solution (sec) vs. number of nodes (18, 32, 50) for QE-v6.2-BW, QE-sirius-BW, QE-sirius-KNL, QE-sirius-GPU and QE-NVIDIA-GPU]

Thank you for your attention.