Summer School CEA-EDF-INRIA

Toward petaflop numerical simulation on parallel hybrid architectures

BigDFT and GPU INRIACENTREDE RECHERCHE SOPHIA ANTIPOLIS,FRANCE Atomistic Simulations

DFT Ab initio codes

BigDFT -Based DFT calculations on Massively

Properties BigDFT and GPUs Parallel Hybrid Architectures Code details

BigDFT and HPC GPU Luigi Genovese Practical cases

Discussion Messages L_Sim – CEA Grenoble

June 9, 2011 Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Outline

1 Review of Atomistic Simulations Density Functional Theory Ab initio codes

BigDFT and 2 The BigDFT project GPU Formalism and properties Atomistic Simulations The needs for hybrid DFT codes DFT Main operations, parallelisation Ab initio codes

BigDFT 3 Properties Performance evaluation BigDFT and GPUs Code details Evaluating GPU gain

BigDFT and Practical cases HPC

GPU Practical cases 4 Concrete examples Discussion Messages Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Review of Atomistic Simulations

A interdisciplinary domain Theory – Experiment – Simulation Hardware – Computers BigDFT and Algorithms GPU

Atomistic Simulations Different Atomistic Simulations DFT Ab initio codes Force fields (interatomic potentials) BigDFT

Properties Tight Binding Methods BigDFT and GPUs Code details Hartree-Fock BigDFT and HPC Density Functional Theory

GPU Practical cases Configuration interactions Discussion

Messages Quantum Monte-Carlo

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Quantum mechanics for many particle systems

Can we do quantum mechanics on systems of many atoms? Decoupling of the nuclei and dynamics Born-Oppenheimer approximation: The position of the nuclei can be considered as fixed,

BigDFT and obtaining the potential “felt” by the GPU n Za Atomistic Vext(r,{R1,··· ,Rn}) = − ∑ Simulations a=1 |r − Ra| DFT Ab initio codes BigDFT Electronic Schrödinger equation Properties BigDFT and GPUs The system properties are described by the ground state Code details

BigDFT and wavefunction ψ(r1,··· ,rN ), which solves Schrödinger HPC equation GPU Practical cases H [{R}]ψ = Eψ Discussion Messages The quantum hamiltonian depends on the set of the atomic positions {R}.

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Atomistic Simulations

Two intrinsic difficulties for numerical atomistic simulations, related to complexity: Interactions The way that atoms interact is known: ∂Ψ BigDFT and i = H Ψ H ψ = E0ψ GPU } ∂t Exploration of the configuration space Atomistic Simulations

DFT Ab initio codes

BigDFT

Properties BigDFT and GPUs Code details

BigDFT and HPC

GPU Practical cases

Discussion

Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Atomistic Simulations

Two intrinsic difficulties for numerical atomistic simulations, related to complexity: Interactions The way that atoms interact is known: ∂Ψ BigDFT and i = H Ψ H ψ = E0ψ GPU } ∂t Exploration of the configuration space Atomistic Simulations E DFT pot Ab initio codes

BigDFT

Properties BigDFT and GPUs Code details BigDFT and R1 HPC

GPU Practical cases

Discussion

Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Atomistic Simulations

Two intrinsic difficulties for numerical atomistic simulations, related to complexity: Interactions The way that atoms interact is known: ∂Ψ BigDFT and i = H Ψ H ψ = E0ψ GPU } ∂t Exploration of the configuration space Atomistic Simulations E DFT pot Ab initio codes

BigDFT

Properties BigDFT and GPUs Code details BigDFT and R1 HPC

GPU Practical cases R2

Discussion

Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Atomistic Simulations

Two intrinsic difficulties for numerical atomistic simulations, related to complexity: Interactions The way that atoms interact is known: ∂Ψ BigDFT and i = H Ψ H ψ = E0ψ GPU } ∂t Exploration of the configuration space Atomistic Simulations E DFT pot Ab initio codes

BigDFT

Properties BigDFT and GPUs Code details BigDFT and R1 HPC

GPU Practical cases R3 R2

Discussion

Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Atomistic Simulations

Two intrinsic difficulties for numerical atomistic simulations, related to complexity: Interactions The way that atoms interact is known: ∂Ψ BigDFT and i = H Ψ H ψ = E0ψ GPU } ∂t Exploration of the configuration space Atomistic Simulations E DFT pot Ab initio codes R1000 BigDFT

Properties BigDFT and GPUs Code details BigDFT and R1 HPC

GPU R41 R3 Practical cases Rn R2

Discussion

Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Choice of Atomistic Methods

3 criteria 1→ Generality (elements, alloys) G P 2→ Precision (∆r, ∆E) 3→ System size (N, ∆t) BigDFT and S GPU

Atomistic Chemistry and Simulations

DFT Ab initio codes Force Fields BigDFT G P G P G P Properties Tight Binding BigDFT and GPUs Code details S S S Hartree-Fock BigDFT and HPC DFT GPU G P G P G P Practical cases Conf. Inter. Discussion Messages S S S Quantum Monte-Carlo

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Choice of Atomistic Methods

3 criteria 1→ Generality (elements, alloys) G P 2→ Precision (∆r, ∆E) 3→ System size (N, ∆t) BigDFT and S GPU

Atomistic Chemistry and Physics Simulations

DFT Ab initio codes Force Fields BigDFT G P G P G P Properties Tight Binding BigDFT and GPUs Code details S S S Hartree-Fock BigDFT and HPC DFT GPU G P G P G P Practical cases Conf. Inter. Discussion Messages S S S Quantum Monte-Carlo

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Choice of Atomistic Methods

3 criteria 1→ Generality (elements, alloys) G P 2→ Precision (∆r, ∆E) 3→ System size (N, ∆t) BigDFT and S GPU

Atomistic Chemistry and Physics Simulations

DFT Ab initio codes Force Fields BigDFT G P G P G P Properties Tight Binding BigDFT and GPUs Code details S S S Hartree-Fock BigDFT and HPC DFT GPU G P G P G P Practical cases Conf. Inter. Discussion Messages S S S Quantum Monte-Carlo

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Choice of Atomistic Methods

3 criteria 1→ Generality (elements, alloys) G P 2→ Precision (∆r, ∆E) 3→ System size (N, ∆t) BigDFT and S GPU

Atomistic Chemistry and Physics Simulations

DFT Ab initio codes Force Fields BigDFT G P G P G P Properties Tight Binding BigDFT and GPUs Code details S S S Hartree-Fock BigDFT and HPC DFT GPU G P G P G P Practical cases Conf. Inter. Discussion Messages S S S Quantum Monte-Carlo

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Choice of Atomistic Methods

3 criteria 1→ Generality (elements, alloys) G P 2→ Precision (∆r, ∆E) 3→ System size (N, ∆t) BigDFT and S GPU

Atomistic Chemistry and Physics Simulations

DFT Ab initio codes Force Fields BigDFT G P G P G P Properties Tight Binding BigDFT and GPUs Code details S S S Hartree-Fock BigDFT and HPC DFT GPU G P G P G P Practical cases Conf. Inter. Discussion Messages S S S Quantum Monte-Carlo

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Choice of Atomistic Methods

3 criteria 1→ Generality (elements, alloys) G P 2→ Precision (∆r, ∆E) 3→ System size (N, ∆t) BigDFT and S GPU

Atomistic Chemistry and Physics Simulations

DFT Ab initio codes Force Fields BigDFT G P G P G P Properties Tight Binding BigDFT and GPUs Code details S S S Hartree-Fock BigDFT and HPC DFT GPU G P G P G P Practical cases Conf. Inter. Discussion Messages S S S Quantum Monte-Carlo

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese The Hohenberg-Kohn theorem

A tremendous numerical problem

N 1 1 1 H = − ∇2 + V (r ,{R}) + ∑ ri ext i ∑ i=1 2 2 i6=j |ri − rj |

BigDFT and The Schrödinger is very difficult to solve for more than two GPU electrons! Another approach is imperative

Atomistic Simulations The fundamental variable of the problem is however not the DFT Ab initio codes wavefunction, but the electronic density BigDFT Z ∗ Properties ρ(r) = N dr2 ···drN ψ (r,r2,··· ,rN )ψ(r,r2,··· ,rN ) BigDFT and GPUs Code details BigDFT and Hohenberg-Kohn theorem (1964) HPC

GPU Practical cases The ground state density ρ(r) of a many-electron system

Discussion uniquely determines (up to a constant) the external potential . Messages The external potential is a functional of the density

Vext = Vext[ρ]

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese The Kohn-Sham approximation

Given the H-K theorem, it turns out that the total electronic energy is an unknown functional of the density E = E[ρ] =⇒ Density Functional Theory DFT (Kohn-Sham approach)

BigDFT and Mapping of a interacting many-electron system into a system GPU with independent particles moving into an effective potential. Atomistic Simulations DFT Find a set of orthonormal orbitals Ψ (r) that minimizes: Ab initio codes i

BigDFT Properties N/2 1 Z 1 Z BigDFT and GPUs E = − Ψ∗(r)∇2Ψ (r)dr + ρ(r)V (r)dr Code details ∑ i i H 2 i=1 2 BigDFT and HPC Z GPU + Exc[ρ(r)]+ Vext (r)ρ(r)dr Practical cases

Discussion N/2 Messages ∗ 2 ρ(r) = 2 ∑ Ψi (r)Ψi (r) ∇ VH (r) = −4πρ(r) i=1

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Ab initio calculations with DFT

Several advantages Main limitations  Ab initio: No  Approximated approach adjustable parameters  Requires high computer  DFT: Quantum power, limited to few mechanical BigDFT and hundreds atoms in most GPU (fundamental) cases Atomistic treatment Simulations

DFT Ab initio codes Wide range of applications: nanoscience, biology, materials BigDFT

Properties BigDFT and GPUs Code details

BigDFT and HPC

GPU Practical cases

Discussion

Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Performing a DFT calculation

A self-consistent equation

Then ρ = 2∑i hψi |ψi i, where |ψi i satisfies   1 2 ∇ + VH [ρ] + Vxc[ρ] + Vext + Vpseudo |ψi i = ∑Λi,j |ψj i , 2 j BigDFT and GPU Now in practice: implementing a DFT code Atomistic Simulations (Kohn-Sham) DFT “Ingredients” DFT Ab initio codes An XC potential, functional of the density BigDFT

Properties several approximations exists (LDA,GGA,. . . ) BigDFT and GPUs Code details A choice of the (if not all-electrons) BigDFT and (norm conserving, ultrasoft, PAW,. . . ) HPC GPU An (iterative) algorithm for finding the wavefunctions | i Practical cases ψi

Discussion A for expressing the |ψi i Messages A (good) computer. . .

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese 2 2 2 Poisson Equation: ∆V = ρ (Laplacian: ∆ = ∂ + ∂ + ∂ ) Hartree ∂x2 ∂y2 ∂z2 Real Mesh (1003 = 106): 106 × 106 = 1012 evaluations !

KS Equations: Self-Consistent Field

Hamiltonian (H) z }| { Set of self-consistent equations:  2  1 ~ 2 − ∇ + Veff ψi = εi ψi 2 m BigDFT and e GPU with an effective potential:

Atomistic Z 0 Simulations 0 ρ(r ) δExc Veff (r) = Vext (r)+ dr + DFT |r − r 0| δρ(r) Ab initio codes V | {z } | {z } BigDFT Hartree exchange−correlation Properties BigDFT and GPUs Code details 2 BigDFT and and: ρ(r) = ∑i fi |ψi (r)| HPC

GPU Practical cases

Discussion

Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese KS Equations: Self-Consistent Field

Hamiltonian (H) z }| { Set of self-consistent equations:  2  1 ~ 2 − ∇ + Veff ψi = εi ψi 2 m BigDFT and e GPU with an effective potential:

Atomistic Z 0 Simulations 0 ρ(r ) δExc Veff (r) = Vext (r)+ dr + DFT |r − r 0| δρ(r) Ab initio codes V | {z } | {z } BigDFT Hartree exchange−correlation Properties BigDFT and GPUs Code details 2 BigDFT and and: ρ(r) = ∑i fi |ψi (r)| HPC

GPU Practical cases

Discussion ∂2 ∂2 ∂2 Messages Poisson Equation: ∆V = ρ (Laplacian: ∆ = + + ) Hartree ∂x2 ∂y2 ∂z2 Real Mesh (1003 = 106): 106 × 106 = 1012 evaluations !

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese KS Equations: Self-Consistent Field

Hamiltonian (H) z }| { Set of self-consistent equations:  2  1 ~ 2 − ∇ + Veff ψi = εi ψi 2 m BigDFT and e GPU with an effective potential:

Atomistic Z 0 Simulations 0 ρ(r ) δExc Veff (r) = Vext (r)+ dr + DFT |r − r 0| δρ(r) Ab initio codes V | {z } | {z } BigDFT Hartree exchange−correlation Properties BigDFT and GPUs Code details 2 BigDFT and and: ρ(r) = ∑i fi |ψi (r)| HPC

GPU Practical cases

Discussion ∂2 ∂2 ∂2 Messages Poisson Equation: ∆V = ρ (Laplacian: ∆ = + + ) Hartree ∂x2 ∂y2 ∂z2 Real Mesh (1003 = 106): 106 × 106 = 1012 evaluations !

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese KS Equations: Self-Consistent Field

Hamiltonian (H) z }| { Set of self-consistent equations:  2  1 ~ 2 − ∇ + Veff ψi = εi ψi 2 m BigDFT and e GPU with an effective potential:

Atomistic Z 0 Simulations 0 ρ(r ) δExc Veff (r) = Vext (r)+ dr + DFT |r − r 0| δρ(r) Ab initio codes V | {z } | {z } BigDFT Hartree exchange−correlation Properties BigDFT and GPUs Code details 2 BigDFT and and: ρ(r) = ∑i fi |ψi (r)| HPC

GPU Practical cases

Discussion ∂2 ∂2 ∂2 Messages Poisson Equation: ∆V = ρ (Laplacian: ∆ = + + ) Hartree ∂x2 ∂y2 ∂z2 Real Mesh (1003 = 106): 106 × 106 = 1012 evaluations !

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese KS Equations: Self-Consistent Field

Hamiltonian (H) z }| { Set of self-consistent equations:  2  1 ~ 2 − ∇ + Veff ψi = εi ψi 2 m BigDFT and e GPU with an effective potential:

Atomistic Z 0 Simulations 0 ρ(r ) δExc Veff (r) = Vext (r)+ dr + DFT |r − r 0| δρ(r) Ab initio codes V | {z } | {z } BigDFT Hartree exchange−correlation Properties BigDFT and GPUs Code details 2 BigDFT and and: ρ(r) = ∑i fi |ψi (r)| HPC

GPU Practical cases

Discussion ∂2 ∂2 ∂2 Messages Poisson Equation: ∆V = ρ (Laplacian: ∆ = + + ) Hartree ∂x2 ∂y2 ∂z2 Real Mesh (1003 = 106): 106 × 106 = 1012 evaluations !

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese KS Equations: Self-Consistent Field

Hamiltonian (H) z }| { Set of self-consistent equations:  2  1 ~ 2 − ∇ + Veff ψi = εi ψi 2 m BigDFT and e GPU with an effective potential:

Atomistic Z 0 Simulations 0 ρ(r ) δExc Veff (r) = Vext (r)+ dr + DFT |r − r 0| δρ(r) Ab initio codes V | {z } | {z } BigDFT Hartree exchange−correlation Properties BigDFT and GPUs Code details 2 BigDFT and and: ρ(r) = ∑i fi |ψi (r)| HPC

GPU Practical cases

Discussion ∂2 ∂2 ∂2 Messages Poisson Equation: ∆V = ρ (Laplacian: ∆ = + + ) Hartree ∂x2 ∂y2 ∂z2 Real Mesh (1003 = 106): 106 × 106 = 1012 evaluations !

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Performing a DFT calculation (KS formalism)

Find a set of orthonormal orbitals Ψi (r) that minimizes:

E = ∑hΨi |H[ρ]|Ψi i i with:

BigDFT and i = 1,··· ,N (one Ψ per electron) GPU ∗ ρ(r) = ∑Ψi (r)Ψi (r) Atomistic i Simulations

DFT Ab initio codes (Kohn-Sham) DFT “Actors” BigDFT Properties A set of wavefunctions |ψi i, one for each electron BigDFT and GPUs Code details A computational approach on a finite basis BigDFT and HPC ⇒ One array for each Ψi GPU Practical cases ⇒ A set of computational operations on these arrays which Discussion

Messages depend on the basis set A (even more) good computer. . .

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Basis sets for electronic structure calculation

Several basis sets exist, with different features:

Plane Waves Gaussians, Slater Orbitals ABINIT, CPMD, VASP,.. . CP2K,,AIMPRO,. . . Systematic convergence. Real space localized

BigDFT and GPU  Accuracy increases with  Small number of basis the number of basis functions for moderate Atomistic Simulations elements accuracy DFT Ab initio codes  Non-localised,  Well suited for BigDFT

Properties optimal for periodic, and other open BigDFT and GPUs homogeneous systems structures Code details BigDFT and  Non adaptive  Non systematic HPC

GPU Practical cases Analytic functions Discussion FFT Messages Kinetic and overlap matrices Robust, Easy to parallelise can be calculated analytically

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese List of ab initio Codes

Plane Waves ABINIT — Louvain-la-Neuve — http://www.abinit.org CPMD — Zurich, Lugano — http://www.cpmd.org PWSCF — Italy — http://www.pwscf.org VASP — Vienna — http://cms.mpi.univie.ac.at/vasp BigDFT and Gaussian GPU Gaussian — http://www.gaussian.com Atomistic Simulations DeMon — http://www.demon-software.com DFT CP2K — http://cp2k.berlios.de Ab initio codes

BigDFT Numerical-like basis sets Properties Siesta — Madrid — BigDFT and GPUs Code details http://www.uam.es/departamentos/ciencias/fismateriac/siesta BigDFT and Wien2K — Vienna — http://www.wien2k.at (FPLAPW, HPC

GPU all electrons) Practical cases Real space basis set Discussion

Messages ONETEP — http://www.onetep.soton.ac.uk BigDFT — http://inac.cea.fr/L_Sim/BigDFT

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese A basis for nanosciences: the BigDFT project

STREP European project: BigDFT(2005-2008) Four partners, 15 contributors: CEA-INAC Grenoble (T.Deutsch), U. Basel (S.Goedecker), U. Louvain-la-Neuve (X.Gonze), U. Kiel (R.Schneider)

BigDFT and GPU Aim: To develop an ab-initio DFT code Atomistic Simulations based on Daubechies , to be DFT Ab initio codes integrated in ABINIT. BigDFT BigDFT 1.0 −→ January 2008 Properties BigDFT and GPUs Code details

BigDFT and HPC . . . why have we done this? Was it worth it?

GPU Practical cases Test the potential advantages of a new formalism Discussion A lot of outcomes and interesting results Messages A lot can be done starting from present know-how

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese A DFT code based on Daubechies wavelets

BigDFT: a PSP Kohn-Sham code A Daubechies wavelets basis has unique properties for DFT usage

Systematic, Orthogonal BigDFT and GPU Localised, Adaptive Atomistic Simulations Kohn-Sham operators are analytic DFT Ab initio codes Daubechies Wavelets Short, Separable convolutions BigDFT 1.5 φ(x) Properties ˜ = a c ` ∑j j `−j 1 ψ(x) BigDFT and GPUs Code details 0.5 BigDFT and Peculiar numerical properties 0 HPC

GPU -0.5 Practical cases Real space based, highly flexible -1 Discussion Big & inhomogeneous systems Messages -1.5 -6 -4 -2 0 2 4 6 8 x

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Wavelet properties: multi-resolution

Example of two resolution levels: step function (Haar Wavelet) Scaling Functions: Multi-Resolution basis Low and High resolution functions related each other

BigDFT and GPU = + Atomistic Simulations

DFT Ab initio codes Wavelets: complete the low resolution description BigDFT Properties Defined on the same grid as the low resolution functions BigDFT and GPUs Code details

BigDFT and 1 1 HPC = 2 + 2 GPU Practical cases Discussion Scaling Function + Wavelet = High resolution Messages We increase the resolution without modifying the grid spacing

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Wavelet properties: adaptivity

Adaptivity Resolution can be refined following the grid point.

The grid is divided in BigDFT and Low (1 DoF) and High (8 GPU DoF) resolution points. Atomistic Simulations Points of different resolution DFT Ab initio codes belong to the same grid.

BigDFT Empty regions must not be Properties BigDFT and GPUs “filled” with basis functions. Code details

BigDFT and HPC

GPU Practical cases Localization property,real space description Discussion

Messages Optimal for big & inhomogeneous systems, highly flexible

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Basis set features

BigDFT features in a nutshell  Arbitrary absolute precision can be achieved Good convergence ratio for real-space approach (O(h14))  Optimal usage of the degrees of freedom (adaptivity) BigDFT and Optimal speed for a systematic approach (less memory) GPU  Hartree potential accurate for various boundary Atomistic Simulations conditions DFT Ab initio codes Free and Surfaces BC Poisson Solver

BigDFT (present also in CP2K, ABINIT, ) Properties BigDFT and GPUs Data repartition is suitable for optimal scalability Code details

BigDFT and Simple communications paradigm, multi-level parallelisation HPC possible (and implemented) GPU Practical cases Discussion Improve and develop know-how Messages Optimal for advanced DFT functionalities in HPC framework

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese BigDFT version 1.5.2: (ABINIT-related) capabilities http://inac.cea.fr/L_Sim/BigDFT Isolated, surfaces and 3D-periodic boundary conditions (k-points, symmetries) All XC functionals of the ABINIT package (libXC library)

BigDFT and Hybrid functionals, Fock exchange operator GPU Direct Minimisation and Mixing routines (metals) Atomistic Simulations Local geometry optimizations (with constraints) DFT Ab initio codes External electric fields (surfaces BC) BigDFT Properties Born-Oppenheimer MD, ESTF-IO BigDFT and GPUs Code details Vibrations BigDFT and HPC Unoccupied states

GPU Practical cases Empirical van der Waals interactions Discussion

Messages Saddle point searches (NEB, Granot & Bear) All these functionalities are GPU-compatible

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Operations performed

The SCF cycle Real Space Daub. Wavelets Orbital scheme: Hamiltonian Preconditioner BigDFT and GPU Coefficient Scheme:

Atomistic Simulations Overlap matrices

DFT Ab initio codes Orthogonalisation

BigDFT Properties Comput. operations BigDFT and GPUs Code details Convolutions BigDFT and HPC BLAS routines GPU Practical cases FFT (Poisson Solver) Discussion

Messages Why not GPUs?

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Hybrid Supercomputing nowadays

GPGPU on Supercomputers Traditional architectures are somehow saturating More cores/node, memories (slightly) larger but not faster Architectures of Supercomputers are becoming hybrid

BigDFT and 3 out to 4 Top Supercomputers are hybrid machines GPU Extrapolation: In 2015, No. 500 will become petafloppic Atomistic Simulations Likely it will be a hybrid machine DFT Ab initio codes

BigDFT Codes should be conceived differently Properties BigDFT and GPUs # MPI processes is limited for a fixed problem size Code details

BigDFT and Performances increase only by enhancing parallelism HPC GPU Further parallelisation levels should be added (OpenMP, Practical cases GPU) Discussion

Messages Does electronic structure calculations codes are suitable?

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese How far is petaflop (for DFT)?

At present, with traditional architectures Routinely used DFT calculations are: Few dozens (hundreds) of processors Parallel intensive operations (blocking communications, BigDFT and GPU 60-70 percent efficiency)

Atomistic Not freshly optimised (legacy codes, monster codes) Simulations DFT Optimistic estimation: 5 GFlop/s per core × 2000 cores × Ab initio codes 0.9 = 9 TFlop/s = 200 times less than Top 500’s #1! BigDFT

Properties BigDFT and GPUs Code details It is such as BigDFT and HPC Distance Earth-Moon = 384 Mm

GPU Practical cases Distance Earth-Mars = 78.4 Gm = 200 times more

Discussion Messages Are we able to go to Mars? (. . . in 2015?)

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese The vessel: Ambitions from BigDFT experience

Reliable formalism Systematic convergence properties Explicit environments, analytic operator expressions

BigDFT and GPU State-of-the-art computational technology

Atomistic Data locality optimal for operator applications Simulations DFT Massive parallel environements Ab initio codes BigDFT Material accelerators (GPU) Properties BigDFT and GPUs Code details New physics can be approached BigDFT and HPC Enhanced functionalities can be applied relatively easily GPU Practical cases Limitation of DFT approximations can be evidenced Discussion Messages A formalism of interest for Post-DFT treatments

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Separable convolutions

We must calculate L F(I ,I ,I ) = h h h G(I − j ,I − j ,I − j ) 1 2 3 ∑ j1 j2 j3 1 1 2 2 3 3 j1,j2,j3=0 L L L BigDFT and = hj1 hj2 hj3 G(i1 − j1,i2 − j2,i3 − j3) GPU ∑ ∑ ∑ j1=0 j2=0 j3=0 Atomistic Simulations DFT Application of three successive operations Ab initio codes 1 BigDFT A3(I3,i1,i2) = ∑j hj G(i1,i2,I3 − j) ∀i1,i2; Properties 2 BigDFT and GPUs A2(I2,I3,i1) = ∑j hj A3(I3,i1,I2 − j) ∀I3,i1; Code details 3 BigDFT and F(I1,I2,i3) = ∑j hj A2(I2,I3,I1 − j) ∀I2,I3. HPC

GPU Practical cases Main routine: Convolution + transposition Discussion

Messages F(I,a) = ∑hj G(a,I − j) ∀a ; j

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Basic Input-Output operation

BigDFT and GPU

Atomistic Simulations

DFT Ab initio codes

BigDFT

Properties BigDFT and GPUs Code details

BigDFT and HPC From sequential to GPU GPU Practical cases Same operation schedule for monocore, multithread, GPU Discussion

Messages Can be treated as a library

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese CPU performances of the convolutions Initially, naive routines do j=1,ndat U do i=0,n1 y(j,I) = h x(I + `,j) ∑ ` tt=0.d0 `=L do l=lowfil,lupfil BigDFT and Easy to write and GPU tt=tt+x(i+l,j)*h(l) debug enddo Atomistic Simulations Test the formalism y(j,i)=tt DFT Ab initio codes Define reference enddo BigDFT enddo Properties results BigDFT and GPUs Code details Optimisation can then start (Ex. X5550,2.67 GHz) BigDFT and HPC

GPU Method GFlop/s % of peak SpeedUp Practical cases Naive (FORTRAN) 0.54 5.1 1/(6.25) Discussion Messages Current (FORTRAN) 3.3 31 1 Best (C, SSE) 7.8 73 2.3

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese How to optimize?

A trade-off between benefit and effort

FORTRAN based  Relatively accessible (loop unrolling)

BigDFT and  Moderate optimisation can be achieved relatively fast GPU  Compilers fail to use vector engine efficiently Atomistic Simulations

DFT Ab initio codes Push optimisation at the best BigDFT Only one out of 3 convolution type has been Properties BigDFT and GPUs implemented Code details

BigDFT and About 20 different patterns have been studied for one HPC

GPU 1D convolution Practical cases Tedious work, huge code −→ Maintainability? Discussion

Messages Automatic code generation under study

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese MPI parallelization I: Orbital distribution scheme

Used for the application of the hamiltonian Operator approach: The hamiltonian (convolutions) is applied separately onto each wavefunction

BigDFT and GPU ψ1 Atomistic Simulations MPI 0 DFT Ab initio codes ψ2 BigDFT

Properties BigDFT and GPUs ψ3 Code details

BigDFT and MPI 1 HPC ψ4 GPU Practical cases

Discussion Messages ψ5 MPI 2

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese MPI parallelization II: Coefficient distribution scheme

Used for scalar products & orthonormalisation BLAS routines (level 3) are called, then result is reduced

MPI 0 MPI 1 MPI 2

BigDFT and GPU ψ1

Atomistic Simulations ψ2 DFT Ab initio codes BigDFT ψ Properties 3 BigDFT and GPUs Code details

BigDFT and ψ4 HPC

GPU Practical cases ψ5 Discussion

Messages At present, MPI_ALLTOALL(V) is used to switch

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese OpenMP parallelisation

Innermost parallelisation level (Almost) Any BigDFT operation is parallelised via OpenMP  Useful for memory demanding calculations

BigDFT and  Allows further increase of speedups GPU  Saves MPI processes and intra-node Message Passing Atomistic 4.5 Simulations 2OMP DFT 3OMP  Less efficient than 6OMP Ab initio codes 4 MPI BigDFT 3.5 Properties

BigDFT and GPUs  Compiler and 3 Code details

system dependent OMP Speedup BigDFT and 2.5 HPC  OMP sections GPU 2 Practical cases should be regularly 1.5 Discussion maintained 0 20 40 60 80 100 120 Messages No. of MPI procs

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Task repartition for a small system (ZnO, 128 atoms)

1 Th. OMP per core 100 1000 90 80 70 100 60 BigDFT and GPU 50

Percent 40 Atomistic 10 Efficiency (%) Simulations 30

Seconds (log. scale) Time (sec) DFT Other Ab initio codes 20 CPU Conv BigDFT 10 LinAlg Properties Comms BigDFT and GPUs 0 1 Code details 16 24 32 48 64 96 144 192 288 576 No. of cores BigDFT and HPC GPU What are the ideal conditions for GPU Practical cases

Discussion GPU-ported routines should take the majority of the time Messages What happens to parallel efficiency?

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Parallelisation and architectures

Same code, same runs. Which is the best?

1 Th. OMP per core 1 Th. OMP per core 100 1000 100 1000 90 90 80 80 70 70 100 100 60 60 50 50

Percent 40 Percent 40 10 Efficiency (%) 10 Efficiency (%) 30 30

BigDFT and Seconds (log. scale) Time (sec) Seconds (log. scale) Time (sec) Other Other GPU 20 CPU 20 CPU Conv Conv 10 LinAlg 10 LinAlg 0 1 Comms 0 1 Comms Atomistic 16 24 32 48 64 96 144 192 288 576 16 24 32 48 64 96 144 192 288 576 Simulations No. of cores No. of cores DFT CCRT Titane (Nehalem, Infiniband) CSCS Rosa (Opteron, Cray XT5) Ab initio codes

BigDFT Titane is 2.3 to 1.6 times faster than Rosa! Properties BigDFT and GPUs Code details Degradation of parallel performances: why?

BigDFT and 1 HPC Calculation power has increased more than networking

GPU 2 Practical cases Better libraries (MKL) Discussion Walltime reduced, but lower parallel efficiency Messages

This will always happen while using GPU!

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Architectures, libraries, networking

Same runs, same sources; different user conditions

6 Titane MpiBull2 no MKL 5.5 Jade MPT Titane MpiBull2, MKL 5 Titane OpenMPI, MKL Rosa Cray, istanbul 4.5

BigDFT and 4 GPU Differences 3.5 up to a Atomistic Cpu Hours 3 Simulations 2.5 factor of 3! DFT Ab initio codes 2

BigDFT 1.5 Properties 1 BigDFT and GPUs 0 100 200 300 400 500 600 Code details No. of cores

BigDFT and HPC

GPU A case-by-case study Practical cases Consideration are often system-dependent, a thumb rule not Discussion Messages always exists. Know your code!

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Using GPUs in a Big systematic DFT code

Nature of the operations Operators approach via convolutions Linear Algebra due to orthogonality of the basis Communications and calculations do not interfere BigDFT and GPU A number of operations which can be accelerated

Atomistic Simulations Evaluating GPU convenience DFT Ab initio codes Three levels of evaluation BigDFT Properties 1 Bare speedups: GPU kernels vs. CPU routines BigDFT and GPUs Code details Does the operations are suitable for GPU? BigDFT and 2 HPC Full code speedup on one process

GPU Practical cases Amdahl’s law: are there hot-spot operations? Discussion 3 Speedup in a (massively?) parallel environment Messages The MPI layer adds an extra level of complexity

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Convolutions and Transposition in GPU

BigDFT and GPU

Atomistic Simulations

DFT Ab initio codes

BigDFT

Properties BigDFT and GPUs Code details

BigDFT and HPC Blocking operations GPU Practical cases A work group performs convolutions plus transpositions Discussion

Messages Different boundary conditions can be implemented

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese OpenCL data management

Example of 4 × 4 block with a filter of length 5

BigDFT and GPU

Atomistic Simulations

DFT Ab initio codes

BigDFT

Properties BigDFT and GPUs Code details

BigDFT and HPC

GPU Practical cases SIMD implementation Discussion Each work item is associated to a single output element Messages Convolution buffers → data reuse

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Convolution kernel with OpenCL

Comparison with CPU work (Ex. X5550,2.67 GHz) Method GFlop/s % of peak SpeedUp Naive (FORTRAN) 0.54 5.1 1/(6.25) Current (FORTRAN) 3.3 31 1 BigDFT and GPU Best (C, SSE) 7.8 73 2.3

Atomistic OpenCL (Fermi) 97 20 29 (12.4) Simulations

DFT Ab initio codes Very good and promising results BigDFT Properties No need of data transfer for 3D case (chain of 1D BigDFT and GPUs Code details kernels) BigDFT and HPC Deeper optimisation still to be done (20% of peak) GPU Practical cases Require less manpower than deep CPU optimisation Discussion Messages Automatic generation will be considered

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese GPU-ported operations in BigDFT (double precision)

Convolutions Kernels (OpenCL (re)written)  Fully functional (all BC)  Based on the former BigDFT and CUDA version GPU  A 10 to 50 speedup Atomistic Simulations

DFT Ab initio codes 20

18 BigDFT Properties 16 GPU BigDFT sections BigDFT and GPUs 14 Code details 12 GPU speedups between 5

BigDFT and 10 HPC and 20, depending of: 8 GPU  Wavefunction size Practical cases 6 GPU speedup (Double prec.) locden Discussion 4 locham  CPU-GPU Architecture precond Messages 2 0 10 20 30 40 50 60 70 Wavefunction size (MB)

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese BigDFT in hybrid codes

Acceleration of the full BigDFT code Considerable gain may be achieved for suitable systems Amdahl’s law should always be considered Resources can be used concurrently (OpenCL queues)

BigDFT and More MPI processes may share the same card! GPU

Badiane, X5550 + Fermi S2070 , ZnO 64 at.: CPU vs. Hybrid Atomistic 100 Simulations 10 DFT Ab initio codes 80 BigDFT 8

Properties BigDFT and GPUs 60 6 Code details Percent Speedup BigDFT and 40 HPC 4

GPU

Practical cases 20 Speedup 2 Other CPU Discussion Conv Messages LinAlg Comms 0 CPU-mkl CPU-mkl-mpiCUDA CUDA-mkl OCL-cublasOCL-mkl CUDA-mpi CUDA-mkl-mpiOCL-cublas-mpiOCL-mkl-mpi 0

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese (Lower time-to-solution for a given architecture) The time-to-solution problem I: Efficiency

Good example: 4 C at, surface BC, 113 Kpts Parallel efficiency of 98%, convolutions largely dominate. Node: # GPU added 2 4 8 2× Fermi + 8 × SpeedUp (SU) 5.3 9.8 11.6 BigDFT and Westmere # MPI equiv. 44 80 96 GPU 8 MPI processes Acceler. Eff. 1 .94 .56 Atomistic Simulations 100

DFT Ab initio codes 80 BigDFT

Properties BigDFT and GPUs 60 Code details

BigDFT and SpeedUp 40 HPC

GPU Practical cases 20 ideal Discussion CPU+MPI Messages GPU 0 0 20 40 60 80 100 No. of MPI proc

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese The time-to-solution problem II:Robustness

Not so good example: A too small system

Titane, ZnO 64 at.: CPU vs. Hybrid 100 7

6 80

5

BigDFT and 60 GPU 4

Percent 3 Atomistic 40 Simulations Speedup with GPU

DFT 2 Speedup Ab initio codes 20 Efficiency (%) 1 Other BigDFT CPU Conv Properties LinAlg 0 0 Comms BigDFT and GPUs 1 2 4 6 8 12 24 32 48 64 96 144 1 2 4 6 8 12 24 32 48 64 96 144 CPU code Hybrid code (rel.) Code details No. of MPI proc BigDFT and  CPU efficiency is poor (calculation is too fast) HPC GPU  Amdahl’s law not favorable (5x SU at most) Practical cases

Discussion  GPU SU is almost independent of the size Messages  The hybrid code always goes faster

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Hybrid and Heterogeneous runs with OpenCL

NVidia S2070 Connected each ATI HD 6970 to a Nehalem Workstation

BigDFT may run BigDFT and GPU on both

Atomistic Simulations

DFT Ab initio codes Sample BigDFT run: Graphene, 4 C atoms, 52 kpts BigDFT 12 Properties No. of Flop: 8.053 · 10 BigDFT and GPUs MPI 1 1 4 1 4 8 Code details GPU NO NV NV ATI ATI NV+ ATI BigDFT and HPC Time (s) 6020 300 160 347 197 109 GPU Practical cases Speedup 1 20.07 37.62 17.35 30.55 55.23

Discussion GFlop/s 1.34 26.84 50.33 23.2 40.87 73.87

Messages Next Step: handling of Load (un)balancing

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Concrete examples with BigDFT code

MD simulation, 32 water molecules, 0.5 fs/step Mixed MPI/OpenMP BigDFT parallelisation vs. GPU case MPI•OMP 32•1 128•1 32•6 128•6 128+128 s/SCF 7.2 2.0 1.5 0.44 0.3 MD ps/day 0.461 1.661 2.215 7.552 11.02 BigDFT and GPU

Atomistic An example: challenging DFT for catalysis Simulations DFT Multi-scale study for OR mechanism on PEM fuel cells Ab initio codes

BigDFT Explicit model of H2O/Pt interface Properties BigDFT and GPUs Absorbtion properties, reaction Code details mechanisms BigDFT and HPC

GPU Outcomes from the understanding of Practical cases catalytic mechanism at atomic scale: Discussion Messages Conception of new active and selective materials Fuels cell ageing, more efficient and durable devices

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese The fuel: Scientific Topics

Scientific directions: Energy conversion A scientific domain which embodies a number of challenges:  The quantities are “complicated” (many body effects) Study/prediction of fundamental properties: band gaps, band BigDFT and offsets, excited state quantities,. . . GPU

Atomistic  Objects are “big” and the environment matters Simulations Systems react with the surroundings DFT Ab initio codes Building new modelisation paradigms BigDFT

Properties BigDFT and GPUs How to achieve these objectives? Code details

BigDFT and Short and medium term objectives HPC GPU State-of-the art functionalities of DFT for complex Practical cases environments Discussion Messages Explore new formalisms for Post DFT treatments

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese A look in near future: science with HPC DFT codes

A concerted set of actions Improve codes functionalities for present-day and next generation supercomputers Test and develop new formalisms BigDFT and Insert ab-initio codes in new scientific workflows GPU (Multiscale Modelling) Atomistic Simulations

DFT Ab initio codes The Mars mission

BigDFT Is Petaflop performance possible? Properties BigDFT and GPUs GPU acceleration → one order of magnitude Code details

BigDFT and Bigger systems, heavier methods → (more than) one HPC

GPU order of magnitude bigger Practical cases

Discussion

Messages BigDFT experience makes this feasible An opportunity to achieve important outcomes and know-how

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese General considerations

Optimisation effort Know the code behaviour and features Careful performance study of the complete algorithm

BigDFT and Identify and make modular critical sections GPU Fundamental for mainainability and architecture evolution

Atomistic Simulations Optimisation cost: consider end-user running conditions DFT Robustness is more important than best performance Ab initio codes

BigDFT

Properties BigDFT and GPUs Performance evaluation know-how Code details No general thumb-rule: what means High Performance? BigDFT and HPC A multi-criterion evaluation process GPU Practical cases Multi-level parallelisation always to be used Discussion Your code will not (anymore) become faster via hardware Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Conclusions

BigDFT code: a modern approach for nanosciences  Flexible, reliable formalism (wavelet properties)  Easily fit with massively parallel architecture  Open a path toward the diffusion of Hybrid architectures BigDFT and GPU Messages from GPU experience with BigDFT Atomistic Simulations  GPU allow a significant reduction of the time-to-solution DFT Ab initio codes  Require a well-structured underlying code which makes BigDFT

Properties multi-level parallelisation possible BigDFT and GPUs Code details  To be taken into account while evaluating performances BigDFT and Parallel efficiency ⇐ dimensioning of system wrt architecture HPC

GPU Practical cases CECAM BigDFT tutorial next October Discussion Messages A tutorial on BigDFT code is scheduled! Grenoble, 19-21 October 2011

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese Acknowledgments

CEA Grenoble – Group of Thierry Deutsch LG, D. Caliste, B. Videau, M. Ospici, I. Duchemin, P. Boulanger, E. Machado-Charry, F. Cinquini, B. Natarajan

BigDFT and GPU Basel University – Group of Stefan Goedecker Atomistic S. A. Ghazemi, A. Willand, M. Amsler, S. Mohr, A. Sadeghi, Simulations DFT N. Dugan, H. Tran Ab initio codes

BigDFT Properties And other groups BigDFT and GPUs Code details Montreal University: BigDFT and HPC L. Béland, N. Mousseau GPU Practical cases European Synchrotron Radiation Facility: Discussion Y. Kvashnin, A. Mirone Messages

Laboratoire de Simulation Atomistique http://inac.cea.fr/L_Sim Luigi Genovese