Daubechies Wavelets in Electronic Structure Calculations: BigDFT Code Tutorial

BigDFT approach to High Performance Computing

ORNL/NICS, Oak Ridge, Tennessee

Luigi Genovese
L_Sim – CEA Grenoble
Laboratoire de Simulation Atomistique, http://inac.cea.fr/L_Sim

February 8, 2012

Outline

1. The code
   - Formalism and properties
   - The need for hybrid DFT codes
   - Main operations, parallelisation

2. Code details
   - Performance evaluation
   - Evaluating the GPU gain
   - Practical cases

3. Discussion

A DFT code based on Daubechies wavelets

BigDFT: a pseudopotential (PSP) Kohn-Sham code.
A Daubechies wavelet basis has unique properties for DFT usage:
- Systematic, orthogonal
- Localised, adaptive
- Kohn-Sham operators are analytic
- Short, separable convolutions:  \tilde{c}_\ell = \sum_j a_{\ell-j}\, c_j

[Figure: the Daubechies scaling function φ(x) and wavelet ψ(x), plotted for x between -6 and 8.]

Peculiar numerical properties:
- Real-space based, highly flexible
- Suited to big and inhomogeneous systems

BigDFT features in a nutshell

- Arbitrary absolute precision can be achieved: good convergence ratio for a real-space approach (O(h^14))
- Optimal usage of the degrees of freedom (adaptivity): optimal speed for a systematic approach (less memory)
- Hartree potential accurate for various boundary conditions: free and surface BC Poisson solver (also present in CP2K, ABINIT, ...)
- Data repartition is suitable for optimal scalability: simple communication paradigm, multi-level parallelisation possible (and implemented)

Improve and develop know-how: optimal for advanced DFT functionalities in an HPC framework.

Moore's law

40 years of improvements: transistor counts double every two years... but how?

Power is the limiting factor
- Power ∝ Frequency^3, so the clock rate is limited
- Multiple slower devices are preferable to one superfast device
- More performance with less power → problem?

Hybrid supercomputing nowadays

GPGPU on supercomputers
- Traditional architectures are somehow saturating: more cores per node, memories (slightly) larger but not faster
- Supercomputer architectures are becoming hybrid: 3 out of the 5 top supercomputers are hybrid machines
- Extrapolation: in 2015 the No. 500 machine of the list will be at the petaflop level, and it will likely be a hybrid machine

Codes should be conceived differently
- The number of MPI processes is limited for a fixed problem size
- Performance increases only by enhancing parallelism
- Further parallelisation levels should be added (OpenMP, GPU)

Are electronic structure codes suitable?

How far is petaflop (for DFT)?

At present, with traditional architectures, routinely used DFT calculations involve:
- A few dozen (up to a few hundred) processors
- Communication-intensive parallel operations (blocking communications, 60-70 percent efficiency)
- Codes that are not freshly optimised (legacy codes, monster codes)

Optimistic estimate: 5 GFlop/s per core × 2000 cores × 0.9 ≈ 9 TFlop/s, about 200 times less than Top 500's #3.

It is as if:
- Distance Earth-Moon = 384 Mm
- Distance Earth-Mars = 78.4 Gm, about 200 times more

The Moon is reached... can we go to Mars? (... in 2015?)

Operations performed

The SCF cycle (real space ↔ Daubechies wavelets)
- Orbital scheme: Hamiltonian, preconditioner
- Coefficient scheme: overlap matrices, orthogonalisation

Computational operations
- Convolutions
- BLAS routines
- FFT (Poisson solver)

Why not GPUs?

Separable convolutions

We must calculate

   F(I_1,I_2,I_3) = \sum_{j_1,j_2,j_3=0}^{L} h_{j_1} h_{j_2} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)
                  = \sum_{j_1=0}^{L} h_{j_1} \sum_{j_2=0}^{L} h_{j_2} \sum_{j_3=0}^{L} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)

Application of three successive operations:
1. A_3(I_3,i_1,i_2) = \sum_j h_j\, G(i_1,i_2,I_3-j)      \forall i_1,i_2;
2. A_2(I_2,I_3,i_1) = \sum_j h_j\, A_3(I_3,i_1,I_2-j)    \forall I_3,i_1;
3. F(I_1,I_2,I_3)   = \sum_j h_j\, A_2(I_2,I_3,I_1-j)    \forall I_2,I_3.

Main routine: convolution + transposition

   F(I,a) = \sum_j h_j\, G(a, I-j)      \forall a
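To illustrate how the three successive operations map onto this single routine, here is a minimal, self-contained Fortran sketch (hypothetical names, a placeholder filter and periodic wrap-around, not the actual BigDFT kernels): one "convolve along the fastest index and transpose it to the slowest" pass is applied three times to sweep all directions.

   program separable_convolution_sketch
     ! Minimal sketch (hypothetical names, placeholder filter, periodic wrap-around):
     ! the 3D separable convolution is built from three successive passes of a single
     ! "convolve along the fastest index and transpose it to the slowest" routine.
     implicit none
     integer, parameter :: n1 = 8, n2 = 8, n3 = 8
     integer, parameter :: lowfil = -7, lupfil = 8
     double precision :: h(lowfil:lupfil), w1(n1*n2*n3), w2(n1*n2*n3)
     h = 1.d0 / dble(lupfil - lowfil + 1)   ! placeholder filter, not the Daubechies one
     call random_number(w1)                 ! placeholder input array G
     ! each pass convolves the current fastest index and rotates it to the slowest position:
     call conv_transpose(n1, n2*n3, h, w1, w2)   ! layout (n1,n2,n3) -> (n2,n3,n1)
     call conv_transpose(n2, n3*n1, h, w2, w1)   ! layout (n2,n3,n1) -> (n3,n1,n2)
     call conv_transpose(n3, n1*n2, h, w1, w2)   ! layout (n3,n1,n2) -> (n1,n2,n3)
     print *, 'sample of the result F:', w2(1)
   contains
     subroutine conv_transpose(n, ndat, h, x, y)
       integer, intent(in) :: n, ndat
       double precision, intent(in)  :: h(lowfil:lupfil)
       double precision, intent(in)  :: x(n*ndat)   ! input : convolved index is the fastest
       double precision, intent(out) :: y(ndat*n)   ! output: convolved index becomes the slowest
       integer :: i, j, l
       double precision :: tt
       do j = 1, ndat
          do i = 0, n-1
             tt = 0.d0
             do l = lowfil, lupfil
                tt = tt + x(modulo(i+l, n) + 1 + (j-1)*n) * h(l)   ! x(i+l, j), periodic wrap
             end do
             y(j + i*ndat) = tt                                    ! y(j, i): transposed storage
          end do
       end do
     end subroutine conv_transpose
   end program separable_convolution_sketch

Each pass rotates the array layout, so after three passes the original (n1, n2, n3) ordering is recovered with all three directions convolved.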

CPU performances of the convolutions

Initially, naive routines implementing  y(j,I) = \sum_{\ell=L}^{U} h_\ell\, x(I+\ell, j):

   do j=1,ndat
      do i=0,n1
         tt=0.d0
         do l=lowfil,lupfil
            tt=tt+x(i+l,j)*h(l)
         enddo
         y(j,i)=tt
      enddo
   enddo

- Easy to write and debug
- Defines the reference results

Optimisation can then start (example: X5550, 2.67 GHz)

   Method               GFlop/s   % of peak   SpeedUp
   Naive (FORTRAN)        0.54       5.1      1/6.25
   Current (FORTRAN)      3.3       31        1
   Best (C, SSE)          7.8       73        2.3
   OpenCL (Fermi)        97         20        29 (12.4)

How to optimize?

A trade-off between benefit and effort.

FORTRAN based
- Relatively accessible (loop unrolling; see the sketch below)
- Moderate optimisation can be achieved relatively fast
- Compilers fail to use the vector engine efficiently

Pushing optimisation to its limit
- Only one out of the three convolution types has been implemented
- About 20 different patterns have been studied for one 1D convolution
- Tedious work, huge code → maintainability?
- Automatic code generation is under study
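For illustration, a sketch of what "loop unrolling" can look like on the naive kernel above (a hypothetical routine, not the actual optimised BigDFT code): unrolling by four over the ndat dimension lets each filter coefficient h(l) be reused across four accumulators, which helps the compiler keep data in registers and fill vector lanes.

   subroutine conv_unrolled4(n1, ndat, lowfil, lupfil, h, x, y)
     ! Sketch of 4-way unrolling over ndat (hypothetical, not the BigDFT kernel).
     ! Assumes mod(ndat,4) == 0 and that x is padded so that i+l stays in bounds.
     implicit none
     integer, intent(in) :: n1, ndat, lowfil, lupfil
     double precision, intent(in)  :: h(lowfil:lupfil)
     double precision, intent(in)  :: x(lowfil:n1+lupfil, ndat)
     double precision, intent(out) :: y(ndat, 0:n1)
     integer :: i, j, l
     double precision :: tt0, tt1, tt2, tt3
     do j = 1, ndat, 4
        do i = 0, n1
           tt0 = 0.d0; tt1 = 0.d0; tt2 = 0.d0; tt3 = 0.d0
           do l = lowfil, lupfil
              ! each h(l) is loaded once and used for four output lines
              tt0 = tt0 + x(i+l, j  ) * h(l)
              tt1 = tt1 + x(i+l, j+1) * h(l)
              tt2 = tt2 + x(i+l, j+2) * h(l)
              tt3 = tt3 + x(i+l, j+3) * h(l)
           end do
           y(j  , i) = tt0
           y(j+1, i) = tt1
           y(j+2, i) = tt2
           y(j+3, i) = tt3
        end do
     end do
   end subroutine conv_unrolled4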

MPI parallelization I: Orbital distribution scheme

Used for the application of the Hamiltonian.
Operator approach: the Hamiltonian (convolutions) is applied separately onto each wavefunction (a minimal sketch follows below).

[Diagram: the orbitals ψ1 ... ψ5 are distributed among the MPI processes, e.g. ψ1, ψ2 on MPI 0, ψ3, ψ4 on MPI 1 and ψ5 on MPI 2.]
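A minimal sketch of the idea (hypothetical names and interfaces; apply_local_hamiltonian stands for the convolution-based application of the Hamiltonian, and the actual BigDFT data structures differ): each process applies the operator only to the orbitals it owns, with no communication at this stage.

   subroutine apply_hamiltonian_orbital_scheme(iproc, nproc, norb, npsidim, psi, hpsi)
     ! Sketch of the orbital distribution scheme (hypothetical names).
     ! Each MPI process owns a block of orbitals and applies the Hamiltonian
     ! to them independently.
     implicit none
     integer, intent(in) :: iproc, nproc, norb, npsidim
     double precision, intent(in)    :: psi(npsidim, *)    ! locally stored orbitals
     double precision, intent(inout) :: hpsi(npsidim, *)   ! H|psi> for the same orbitals
     integer :: norbp, iorb_start, iorb_end, iorb
     norbp = (norb + nproc - 1) / nproc                    ! block size per process
     iorb_start = iproc * norbp + 1
     iorb_end   = min(norb, (iproc + 1) * norbp)
     do iorb = iorb_start, iorb_end
        call apply_local_hamiltonian(psi(:, iorb - iorb_start + 1), &
                                     hpsi(:, iorb - iorb_start + 1))   ! hypothetical routine
     end do
   end subroutine apply_hamiltonian_orbital_scheme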

MPI parallelization II: Coefficient distribution scheme

Used for scalar products and orthonormalisation.
BLAS routines (level 3) are called, then the result is reduced.

[Diagram: in the coefficient scheme each orbital ψ1 ... ψ5 is split among the MPI processes (MPI 0, MPI 1, MPI 2), each process holding a slice of the coefficients of every orbital.]

At present, MPI_ALLTOALL(V) is used to switch between the two distribution schemes.
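A minimal sketch of such a switch expressed with MPI_ALLTOALLV (hypothetical names; the actual BigDFT routine also deals with padding, spin and k-points):

   subroutine switch_to_coefficient_scheme(psi_orb, psi_coef, sendcounts, sdispls, &
                                           recvcounts, rdispls)
     ! Sketch (hypothetical names): switch from the orbital to the coefficient
     ! distribution with MPI_ALLTOALLV.
     use mpi
     implicit none
     double precision, intent(in)  :: psi_orb(*)    ! my orbitals, orbital scheme
     double precision, intent(out) :: psi_coef(*)   ! my slice of all orbitals
     integer, intent(in) :: sendcounts(:), sdispls(:), recvcounts(:), rdispls(:)
     ! sendcounts(p): coefficients of my orbitals that belong to the slice of
     ! process p; recvcounts(p) is the converse.
     integer :: ierr
     call MPI_ALLTOALLV(psi_orb,  sendcounts, sdispls, MPI_DOUBLE_PRECISION, &
                        psi_coef, recvcounts, rdispls, MPI_DOUBLE_PRECISION, &
                        MPI_COMM_WORLD, ierr)
     ! level-3 BLAS (e.g. DGEMM for the overlap matrix) can now run on psi_coef,
     ! followed by an MPI_ALLREDUCE of the partial overlap matrices.
   end subroutine switch_to_coefficient_scheme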

OpenMP parallelisation

Innermost parallelisation level: (almost) any BigDFT operation is parallelised via OpenMP (a sketch of the convolution case follows below).
- Useful for memory-demanding calculations
- Allows a further increase of the speedup
- Saves MPI processes and intra-node message passing
- Less efficient than MPI
- Compiler and system dependent
- OMP sections should be regularly maintained

[Figure: OpenMP speedup (roughly 1.5 to 4.5) as a function of the number of MPI processes (up to about 120), for 2, 3 and 6 OpenMP threads per process.]
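As an illustration only (not the actual BigDFT source), the innermost OpenMP level on the convolution kernel shown earlier essentially amounts to sharing the independent ndat lines among the threads of each MPI process:

   ! Sketch of the OpenMP level on the 1D convolution (not the actual BigDFT code):
   ! the ndat independent lines are shared among the threads of each MPI process.
   !$omp parallel do default(shared) private(i,j,l,tt) schedule(static)
   do j = 1, ndat
      do i = 0, n1
         tt = 0.d0
         do l = lowfil, lupfil
            tt = tt + x(i+l, j) * h(l)
         end do
         y(j, i) = tt
      end do
   end do
   !$omp end parallel do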

Task repartition for a small system (ZnO, 128 atoms)

[Figure: with 1 OpenMP thread per core, the time breakdown (percentage spent in convolutions, linear algebra, communications and other), the parallel efficiency and the wall time (log scale), from 16 to 576 cores.]

What are the ideal conditions for the GPU?
- GPU-ported routines should take the majority of the time
- What happens to the parallel efficiency?

Parallelisation and architectures

Same code, same runs: which is the best?

[Figure: the same task-repartition plots (time breakdown, efficiency and wall time versus number of cores, 16 to 576) on two machines: CCRT Titane (Nehalem, Infiniband) and CSCS Rosa (Opteron, Cray XT5).]

Titane is 2.3 to 1.6 times faster than Rosa!

Degradation of parallel performance: why?
1. Calculation power has increased more than networking
2. Better libraries (MKL)

Walltime is reduced, but the parallel efficiency is lower. This will always happen when using GPUs!

Architectures, libraries, networking

Same runs, same sources; different user conditions.

[Figure: CPU hours versus number of cores (up to 600) for the same run under different conditions: Titane MpiBull2 without MKL, Jade MPT, Titane MpiBull2 with MKL, Titane OpenMPI with MKL, Rosa Cray (Istanbul). Differences of up to a factor of 3!]

A case-by-case study: considerations are often system-dependent, and a rule of thumb does not always exist. Know your code!

Intranode bandwidth problem

[Figure: B80 cluster, one Keeneland node (CPU only): time breakdown (convolutions, linear algebra, potential, communications, other), efficiency and speedup (up to about 9) for different MPI process / OpenMP thread combinations, from 1-1 up to 12-1.]

Scalability does not depend only on communication: Amdahl's law is an upper limit!

Using GPUs in a big systematic DFT code

Nature of the operations
- Operator approach via convolutions
- Linear algebra, due to the orthogonality of the basis
- Communications and calculations do not interfere
- A number of operations can be accelerated

Evaluating GPU convenience: three levels of evaluation
1. Bare speedups, GPU kernels vs. CPU routines: are the operations suitable for the GPU?
2. Full-code speedup on one process. Amdahl's law: are there hot-spot operations? (A worked example follows below.)
3. Speedup in a (massively?) parallel environment: the MPI layer adds an extra level of complexity.
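To make the Amdahl argument concrete, a worked example with illustrative numbers (not measurements from this talk): if the GPU-ported routines account for a fraction p of the run time and are accelerated by a factor s, the full-code speedup is bounded by

   S(p, s) = 1 / ((1 - p) + p/s).

With p = 0.9 and s = 20, S = 1/(0.1 + 0.045) ≈ 6.9; even for s → ∞ the speedup cannot exceed 1/(1 - p) = 10.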

BigDFT in hybrid codes

Acceleration of the full BigDFT code
- Considerable gain may be achieved for suitable systems
- Amdahl's law should always be considered
- Resources can be used concurrently (OpenCL queues)
- More MPI processes may share the same card!

[Figure: Badiane, X5550 + Fermi S2070, ZnO 64 atoms, CPU vs. hybrid runs: time breakdown and speedup (up to about 10) for the CPU-mkl, CPU-mkl-mpi, CUDA, CUDA-mkl, OCL-cublas, OCL-mkl, CUDA-mpi, CUDA-mkl-mpi, OCL-cublas-mpi and OCL-mkl-mpi configurations.]

(Lower time-to-solution for a given architecture.)

The time-to-solution problem I: Efficiency

A good example: 4 C atoms, surface BC, 113 k-points.
Parallel efficiency of 98%; convolutions largely dominate.
Node: 2× Fermi + 8× Westmere, 8 MPI processes.

   # GPUs added      2      4      8
   SpeedUp (SU)      5.3    9.8    11.6
   # MPI equiv.      44     80     96
   Acceler. Eff.     1      .94    .56

[Figure: speedup versus number of MPI processes (up to about 100): ideal scaling, CPU+MPI and GPU curves.]

The time-to-solution problem II: Robustness

A not-so-good example: a system that is too small.

[Figure: Titane, ZnO 64 atoms, CPU vs. hybrid runs: time breakdown, parallel efficiency and speedup with GPU for 1 to 144 MPI processes, for the CPU code and the hybrid code.]

- CPU efficiency is poor (the calculation is too fast)
- Amdahl's law is not favourable (5x SU at most)
- The GPU speedup is almost independent of the system size
- The hybrid code always goes faster

Hybrid and heterogeneous runs with OpenCL

An NVidia S2070 and an ATI HD 6970, each connected to a Nehalem workstation: BigDFT may run on both.

Sample BigDFT run: graphene, 4 C atoms, 52 k-points; No. of Flop: 8.053 · 10^12

   MPI         1       1       4       1       4       8
   GPU         NO      NV      NV      ATI     ATI     NV+ATI
   Time (s)    6020    300     160     347     197     109
   Speedup     1       20.07   37.62   17.35   30.55   55.23
   GFlop/s     1.34    26.84   50.33   23.2    40.87   73.87

Next step: handling of load (un)balancing.
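As a cross-check of the table above, the GFlop/s row is simply the total number of floating-point operations divided by the wall time: 8.053 · 10^12 Flop / 6020 s ≈ 1.34 GFlop/s for the CPU-only run, and 8.053 · 10^12 Flop / 109 s ≈ 74 GFlop/s for the heterogeneous NV+ATI run.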

A look into the near future: science with HPC DFT codes

A concerted set of actions
- Improve code functionalities for present-day and next-generation supercomputers
- Test and develop new formalisms
- Insert ab initio codes into new scientific workflows (multiscale modelling)

The Mars mission: is petaflop performance possible?
- GPU acceleration → one order of magnitude
- Bigger systems, heavier methods → (more than) one order of magnitude bigger

The BigDFT experience makes this feasible: an opportunity to achieve important outcomes and know-how.

Exercises of this afternoon

Profiling and interpreting BigDFT results
- Dimension the job size
- Interpret and distinguish intranode and internode behaviour

[Figure: B80 cluster, Keeneland, multi-node (CPU only): time breakdown and speedup (up to about 7) for various MPI process / OpenMP thread combinations using 12 to 120 cores.]

BigDFT usage of GPU resources
- Combine MPI, OpenMP and GPU acceleration

[Figure: ZnO 8 atoms, gamma point, Keeneland, intranode (GPU accelerated): time breakdown and speedup (up to about 11) for MPI process / OpenMP thread / acceleration combinations (CPU, CUDA, OpenCL).]
