Porting the DFT code CASTEP to GPGPUs

Toni Collis
[email protected]
EPCC, University of Edinburgh

Outline

• Brief introduction to CASTEP and Density Functional Theory codes
• Why are we interested in CASTEP and the underlying computational problems?
• OpenACC implementation
• CASTEP and GPGPUs

CASTEP: a DFT code

• CASTEP is a commercial and academic package, capable of Density Functional Theory (DFT) and plane wave calculations.
• Calculates the structure and motions of materials by the use of electronic structure calculations (atom positions are dictated by their electrons).
• Modern CASTEP is a re-write of the original serial code, developed by the Universities of York, Durham, St. Andrews, Cambridge and Rutherford Labs.

CASTEP: a DFT code

• DFT/ab initio software packages are one of the largest users of HECToR (the UK national supercomputing service, based at the University of Edinburgh).
• Codes such as CASTEP, VASP and CP2K all involve solving a Hamiltonian to explain the electronic structure.
• Uses of DFT codes are becoming more complex, with more functionality.

HECToR

• UK National HPC Service
• Cray XE6, currently a 30-cabinet system
  – 90,112 cores
  – Each node has 2× 16-core AMD Opteron Interlagos (2.3 GHz)
  – 32 GB memory per node
  – Peak of over 800 TF and 90 TB of memory

HECToR usage statistics

[Pie chart: HECToR Phase 3 usage by application (Nov 2011 - Apr 2013): VASP 19%, CP2K 8%, GROMACS 6%, with CASTEP, DL_POLY, MITgcm, UM, ChemShell, SENGA, GS2, NEMO and NWChem each between 2% and 5%; Others 34%. Ab initio codes: VASP, CP2K, CASTEP, ONETEP, Quantum Espresso, GAMESS-US, SIESTA, GAMESS-UK, MOLPRO.]

HECToR Chemistry usage statistics

[Pie chart: HECToR Chemistry usage, Phase 3 (Nov 2011 - Apr 2013): DFT 35%, Classical MD 14%, Other 3%.]

• 35% of the Chemistry software usage on Phase 3 is using DFT methods.

HECToR usage statistics

[Chart: node hours (0 to 2,000,000) against parallel task count (32 to 65,536) for VASP, CP2K, CASTEP, ONETEP, Quantum Espresso, SIESTA, GAMESS-US, GAMESS-UK, NWChem, MOLPRO, MITgcm and CPMD.]

HECToR usage statistics - CASTEP

[Chart: node hours (0 to 350,000) against parallel task count (32 to 65,536) for CASTEP.]

Usage statistics

• Simulations generally computed by chemists/physicists are of up to 100s of atoms only.
• Thousands-of-atom simulations are possible, but typically not used, as this requires many nodes, resulting in excessive synchronisation.
• On HECToR, ab initio codes most commonly use between 32 and 64 nodes (1024 to 2048 cores); the maximum used was 65,536 cores (2048 nodes), by CP2K (between November 2011 and April 2013).
• CASTEP simulation of 64 atoms (Ti-Al-V): on 64 cores, 13.5% of execution time is spent performing message passing. Typical usage is 1024 cores (32 nodes); the largest job used 32,768 cores (1024 nodes).
• Scaling to 128 cores causes MPI to dominate, using 50% of the execution time, mostly due to poor synchronisation in MPI_alltoallv for FFTs (a minimal sketch of this exchange follows below).
• Future technologies are likely to use accelerators – how will this affect DFT codes?
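As an illustration of the communication pattern at issue, here is a minimal, self-contained sketch of an MPI_Alltoallv exchange (illustrative only, not CASTEP code; the program name and block sizes are invented). Every rank exchanges a block with every other rank, which is why the cost grows so quickly with node count:

    program alltoallv_demo
      use mpi
      implicit none
      integer, parameter :: blk = 4          ! elements exchanged per rank pair
      integer :: ierr, rank, nprocs, i
      integer, allocatable :: sendcounts(:), recvcounts(:), sdispls(:), rdispls(:)
      complex(kind=kind(1.0d0)), allocatable :: sendbuf(:), recvbuf(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      allocate(sendcounts(nprocs), recvcounts(nprocs), sdispls(nprocs), rdispls(nprocs))
      sendcounts = blk                       ! equal blocks here; in a real FFT
      recvcounts = blk                       ! transpose they vary per rank
      do i = 1, nprocs
         sdispls(i) = (i - 1) * blk
         rdispls(i) = (i - 1) * blk
      end do
      allocate(sendbuf(blk * nprocs), recvbuf(blk * nprocs))
      sendbuf = cmplx(real(rank, kind(1.0d0)), 0.0d0, kind(1.0d0))

      ! Every rank exchanges a block with every other rank; this global
      ! synchronisation is what comes to dominate the runtime at scale.
      call MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE_COMPLEX, &
                         recvbuf, recvcounts, rdispls, MPI_DOUBLE_COMPLEX, &
                         MPI_COMM_WORLD, ierr)

      call MPI_Finalize(ierr)
    end program alltoallv_demo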

The issues with DFT calculations

• Communications include FFTs (using FFTW3) and BLAS routines (matrix manipulation).
• Codes scale with O(N^3) for N electrons.
• Typical benchmark results (percentage of execution time):
  – Benchmark 1: calculation of 8 Si atoms – total 32 electrons, on 8 CPUs.
  – Benchmark 2: 8 Ti, 64 Al and 2 V atoms, 596 electrons; no k-point parallelisation possible as there is only 1 k-point, on 64 CPUs.

                                      Benchmark 1   Benchmark 2
    MPI_alltoallv comms                  10.8%         13.5%
    non-local potential                  26.6%         23.1%
    wavefunction                         38.7%         19.2%
    exchange-correlation screening       12.1%         19.8%

• Investigation of how to use alternative architectures is key to improving use of large-scale systems for larger simulations in DFT codes.

The issues with DFT calculations

• First principles calculations of condensed matter systems.
• Interactions between atoms and molecules (chemical and molecular bonding) result from the properties of the charged particles.
• Accurate calculation of these properties results in the ability to calculate complex physical phenomena of condensed matter systems from calculations of the charged particle structure (positively charged nucleus and negatively charged electrons).
  – Electrons behave like point charge particles.
  – Modelled simply with quantum mechanics – but the numerical calculations are large (Schrödinger's equation scales exponentially; see below).
  – The complete system of N electrons and N_I nuclei is represented by the many-body Schrödinger equation.
  – The Born-Oppenheimer approximation assumes nuclei can be treated as classical particles.
  – Use the ground state calculation only, to reduce the complexity of the calculation.
  – Introduce an auxiliary fictitious system, to reduce the number of electrons that are dealt with explicitly.
  – This fictitious system is a set of particles whose properties are identical to those of electrons, except that the electron-electron repulsive interaction is switched off.
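The "scales exponentially" point can be made concrete. In standard notation (textbook quantum mechanics, not from the slides), the many-body equation referred to above is

\[ \hat{H}\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N) = E\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N) \]

where the wavefunction \(\Psi\) depends on the coordinates of all \(N\) electrons at once. Representing it on a grid of \(M\) points per coordinate requires of order \(M^{3N}\) values, so the cost grows exponentially with the number of electrons. This is the cost that the Born-Oppenheimer, ground-state and fictitious-system approximations are designed to avoid.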

The principles of DFT

• Use the Born-Oppenheimer approximation – only model the electronic structure.
• Calculate a static potential, V, within which the electrons move.
  – V then enables description of the electronic state as a wavefunction, which is set up to satisfy Schrödinger's equation.
  – This scales exponentially. To reduce the complexity, it needs to be simplified so that it can be calculated computationally.
• Define a local effective (Kohn-Sham) external potential, in which non-interacting particles move.
  – The Kohn-Sham equation is defined by this local effective (fictitious) external potential.
  – This results in single-particle wavefunctions that satisfy Schrödinger's equation, called Kohn-Sham orbitals: independent, stationary wavefunctions.
  – BUT they still need to satisfy exchange anti-symmetry.
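For reference, the textbook form of the Kohn-Sham equations being described (standard DFT notation in atomic units, not CASTEP-specific):

\[ \left[-\tfrac{1}{2}\nabla^{2} + V_{\mathrm{eff}}(\mathbf{r})\right]\phi_i(\mathbf{r}) = \varepsilon_i\,\phi_i(\mathbf{r}), \qquad V_{\mathrm{eff}}(\mathbf{r}) = V_{\mathrm{ext}}(\mathbf{r}) + \int \frac{n(\mathbf{r}')}{\lvert\mathbf{r}-\mathbf{r}'\rvert}\,\mathrm{d}\mathbf{r}' + V_{\mathrm{xc}}(\mathbf{r}), \qquad n(\mathbf{r}) = \sum_{i} \lvert\phi_i(\mathbf{r})\rvert^{2}. \]

Each \(\phi_i\) is one of the independent Kohn-Sham orbitals above; the fictitious non-interacting particles are constructed so that their density \(n(\mathbf{r})\) equals that of the real interacting electrons, with the exchange and correlation effects pushed into \(V_{\mathrm{xc}}\).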

The issues with DFT calculations

• Pseudopotential plane-wave DFT is nominally O(N^2 ln N) or O(N^3), depending on system size.
• Operations have fine-grain data parallelisation: FFT, BLAS, scatter-gather.
• May seem like an obvious candidate for GPUs because of the fine-grain parallelisation, except for sparse data sets and communication overheads due to large data structures.
• Includes non-standard DFT calculations, such as non-local exchange, making the calculations a lot more expensive in both memory and CPU time.
• Currently ported DFT software:
  – VASP (Vienna ab initio simulation package, https://www.vasp.at/): relies heavily on 3D FFTs (requires orthogonalisation, diagonalisation).
  – Quantum Espresso (http://www.quantum-espresso.org): an integrated suite of open source packages. GPU acceleration is currently available for the plane wave self-consistent field (PWscf) code and Car-Parrinello.
  – NWChem (http://www.nwchem-sw.org): a scalable code that uses atom-centred type orbitals. The Hartree-Fock calculation has been ported, but no results have been released and it is not in the main package release (http://www.nwchem-sw.org/index.php/Benchmarks#Current_developments_for_high_accuracy:_GPGPU_and_alternative_task_schedulers).
  – Porting coverage varies between codes: good coverage with no region bias in some; coverage where it is not needed (cubic scaling) in others.

CASTEP: initial accelerator investigation

• Small simulation, chosen to fit on one CPU, with no MPI calls: 4 Ti atoms, 2 O atoms, a total of 32 electrons.
• Replaced blas calls with cuda-blas calls (CULA library, http://www.culatools.com/) and fft calls with cufft calls:
  – No device calls: runtime = 14.6 s
  – cufft calls: runtime = 31.1 s
  – cula calls: runtime = 418 s
• The majority of the increased runtime was due to data transfer.
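The slowdown is consistent with per-call host-device traffic: each drop-in library call must move its operands to the GPU and the result back. A minimal OpenACC sketch (illustrative only, not CASTEP code) contrasting per-call copies with a resident data region, which is the strategy adopted below:

    program transfer_cost
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, parameter :: n = 2**20, nsteps = 100
      complex(kind=dp), allocatable :: psi(:)
      integer :: step

      allocate(psi(n))
      psi = (1.0_dp, 0.0_dp)

      ! Version A: the array is copied to the device and back on every
      ! kernel launch - the pattern of a per-call library replacement.
      do step = 1, nsteps
         !$acc kernels copy(psi)
         psi = psi * (0.99_dp, 0.01_dp)
         !$acc end kernels
      end do

      ! Version B: one transfer in, one transfer out; the kernels find
      ! the data already resident on the device.
      !$acc data copy(psi)
      do step = 1, nsteps
         !$acc kernels present(psi)
         psi = psi * (0.99_dp, 0.01_dp)
         !$acc end kernels
      end do
      !$acc end data
    end program transfer_cost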

GPU-ification of CASTEP

• Aim:
  – OpenACC GPU kernels, with cula blas and cufft optimisation of CASTEP.
  – Remove the data transfer problems by placing most of the large data structures on the GPU.
• Code structure:
  – 531,000 lines of modular Fortran 90/95.
  – 134 modules, over 1974 subroutines, 237 functions.
  – Parallelised with MPI over k-points, bands and plane waves.
  – Compiles in serial or with MPI. Runs on supercomputers and PCs.
  – Able to perform geometry optimisation and finite temperature calculations, and to compute a wide range of derived electronic structure properties.
  – The majority of the computation is in the self-consistent field loop, using ~50 subroutines.

Data structures on device: GPU-ification of CASTEP

• Wavefunctions:

    complex(kind=dp) :: Wavefunction%coeffs(:,:,:,:)
    complex(kind=dp) :: Wavefunction%beta_phi(:,:,:,:)
    real(kind=dp)    :: Wavefunction%beta_phi_at_gamma(:,:,:,:)
    logical          :: Wavefunction%have_beta_phi

    complex(kind=dp) :: Wavefunctionslice%coeffs(:,:)
    complex(kind=dp) :: Wavefunctionslice%realspace_coeffs(:,:)
    complex(kind=dp) :: Wavefunctionslice%beta_phi(:,:)
    real(kind=dp)    :: Wavefunctionslice%beta_phi_at_gamma(:,:)
    real(kind=dp)    :: Wavefunctionslice%realspace_coeffs_at_gamma(:,:)
    logical          :: Wavefunctionslice%have_realspace

• Bands:

    complex(kind=dp) :: coeffs(:,:)
    complex(kind=dp) :: beta_phi(:)
    real(kind=dp)    :: beta_phi_at_gamma(:)
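A minimal sketch of the intended pattern (invented module, array shape and routine names; CASTEP's real structures are the derived types listed above): allocate once, make the array resident on the device, and let every kernel find it there:

    module wave_device_demo
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      complex(kind=dp), allocatable :: coeffs(:,:)   ! stand-in for Wavefunction%coeffs
    contains
      subroutine scale_coeffs(s)
        real(kind=dp), intent(in) :: s
        ! present_or_copy: reuse the device copy if the caller made the
        ! array resident, otherwise fall back to copying per call
        !$acc kernels present_or_copy(coeffs)
        coeffs = coeffs * s
        !$acc end kernels
      end subroutine scale_coeffs
    end module wave_device_demo

    program demo
      use wave_device_demo
      implicit none
      allocate(coeffs(1024, 64))
      coeffs = (1.0_dp, 0.0_dp)
      !$acc data copy(coeffs)     ! coeffs stays on the device for both calls
      call scale_coeffs(0.5_dp)
      call scale_coeffs(2.0_dp)
      !$acc end data
    end program demo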

'Self Consistent Field' Loop

[Figure: flow chart of the SCF calculation and atomic-structure optimisation, from J.-I. Iwata et al., Journal of Computational Physics 229 (2010) 2339-2363: read initial atomic configuration; construct ionic pseudopotentials; generate initial orbitals and construct initial density and potentials; then loop: update orbitals by the Conjugate-Gradient method (CG), perform Gram-Schmidt ortho-normalisation (GS), perform subspace diagonalisation (SD), update density, update Hartree and exchange-correlation potentials, until the SCF is converged; finally compute the Hellmann-Feynman forces and update the atomic configuration.]

• The majority of the calculation is within the SCF loop.
• Place the central loop on the device.
• Aim to only remove the atomic configuration and key electronic properties from the device.
• Place OpenACC kernels around the key computational stages.

GPU-ification of CASTEP

Example: OpenACC kernels in the wavefunction copy routine.

    subroutine wave_copy_wv_wv_ks(wvfn_dst, wvfn_src, ...)
       ...
       !$acc kernels present_or_copy(wvfn_dst%coeffs)
       do nb = 1, nbands_to_copy
          ...
          ! copy rotation data
          wvfn_dst%rotation(nb, nk_d, ns_d) = wvfn_src%rotation(nb, nk_s, ns_s)
          ...
       enddo
       !$acc end kernels

       ! Map reduced representation of coefficients on k-point
       do nb2 = 1, nbands_to_copy
          recip_grid = cmplx_0
          call basis_recip_reduced_to_grid(wvfn_src%coeffs(:,                     &
               & wvfn_src%node_band_index(nb2, id_in_bnd_group), nk_src, ns_src), &
               & nk_src, recip_grid, 'STND')
          call basis_recip_grid_to_reduced(recip_grid, nk_dst,                    &
               & wvfn_dst%coeffs(:, wvfn_dst%node_band_index(nb2, id_in_bnd_group), &
               & nk_dst, ns_dst), 'STND')
          ...
       enddo
       ...
    end subroutine wave_copy_wv_wv_ks

The problems with porting: GPU-ification of CASTEP

• Serial only so far.
• CASTEP uses features that are not supported on devices, such as 'optional' arguments when passing data to subroutines, followed by 'if present' statements.
  – Resolved by creating copies of subroutines with and without the optional arguments (a reduced sketch of this workaround follows below).
• Specifying arrays with dimension(*) when passing to !$acc kernel regions.
  – Resolved by specifying the correct dimension.
• Sometimes the limitations of what is on and off the device result in multiple kernel regions very close together in the structure, rather than entire subroutines on the device, which is not necessarily very efficient. This will require a lot of fine tuning to improve performance.
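A reduced sketch of the optional-argument workaround (invented routine and variable names; the pattern, not CASTEP source). The original routine branches on present(), which cannot be evaluated inside a device region, so device-safe copies are created, one per call signature:

    ! Original host routine: present() cannot be evaluated inside a
    ! device region, so this form blocks GPU-ification.
    subroutine shift_bands(eigs, n, offset)
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, intent(in) :: n
      real(kind=dp), intent(inout) :: eigs(n)
      real(kind=dp), intent(in), optional :: offset
      if (present(offset)) eigs = eigs + offset
    end subroutine shift_bands

    ! Device-safe copy for the with-offset signature; a second copy
    ! without the argument covers the other call sites.
    subroutine shift_bands_offset(eigs, n, offset)
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, intent(in) :: n
      real(kind=dp), intent(inout) :: eigs(n)
      real(kind=dp), intent(in) :: offset
      !$acc kernels copy(eigs)
      eigs = eigs + offset
      !$acc end kernels
    end subroutine shift_bands_offset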

Still an ongoing project: GPU-ification of CASTEP

• Very close to having the first version of the OpenACC software ported to the device.
• Found a PGI compiler bug:
  – A bug in data transfer, due to the inclusion of too many nested case/select statements in a kernel (a reduced sketch follows below).
  – Waiting for a resolution.
• Next step: OpenACC+MPI implementation, GPU-optimised to minimise data transfer and improve performance.
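For context, a hypothetical reduction (invented names; the actual trigger is internal to CASTEP) of the kind of kernel that exposed the compiler bug, with select case blocks nested inside an accelerated loop:

    program nested_select_demo
      implicit none
      integer, parameter :: dp = kind(1.0d0), n = 1024
      real(kind=dp) :: x(n)
      integer :: op(n), sub_op(n), i

      x = 1.0_dp; op = 1; sub_op = 1

      !$acc kernels
      do i = 1, n
         select case (op(i))              ! outer dispatch
         case (1)
            select case (sub_op(i))       ! nested dispatch: kernels with
            case (1)                      ! several such levels triggered
               x(i) = x(i) + 1.0_dp       ! the compiler bug
            case default
               x(i) = 0.0_dp
            end select
         case default
            x(i) = -x(i)
         end select
      end do
      !$acc end kernels
    end program nested_select_demo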

Summary

• Porting CASTEP to GPGPUs using OpenACC and CUDA libraries.
• DFT software is one of the largest users of UK supercomputing facilities, so we need to consider whether these codes will work on accelerators.
• Close to having a finished version, which will require iterative improvement to investigate the performance of DFT software on accelerators.

Acknowledgements

Nu-FuSE is funded by the G8 Exascale Software Projects. The University of Edinburgh and EPSRC have provided additional funding towards porting CASTEP to GPGPUs.

Collaborators:
• Adrian Jackson (University of Edinburgh)
• Derek Hepburn (University of Edinburgh)
• Graeme Ackland (University of Edinburgh)
• Stewart Clark (University of Durham)
• Keith Refson (STFC/RAL)

http://www.nu-fuse.com