Daubechies Wavelets in Electronic Structure Calculation: BigDFT Code Tutorial
ORNL/NICS, Oak Ridge, Tennessee

Luigi Genovese
L_Sim – CEA Grenoble
Laboratoire de Simulation Atomistique, http://inac.cea.fr/L_Sim

February 8, 2012

Outline
1 The code
  - Formalism and properties
  - The needs for hybrid DFT codes
  - Main operations, parallelisation
2 Code details
  - Performance evaluation
  - Evaluating GPU gain
  - Practical cases
3 Discussion
A DFT code based on Daubechies wavelets

BigDFT: a PSP Kohn-Sham code. A Daubechies wavelet basis has unique properties for DFT usage:
  - Systematic, orthogonal
  - Localised, adaptive
  - Kohn-Sham operators are analytic
  - Short, separable convolutions (see the relation sketched below):
    $\tilde{c}_\ell = \sum_j a_j \, c_{\ell - j}$

Peculiar numerical properties:
  - Real-space based, highly flexible
  - Big and inhomogeneous systems

[Figure: Daubechies scaling function phi(x) and wavelet psi(x), plotted on -6 <= x <= 8]
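For context, the filter relation above stems from the two-scale (refinement) equations of the scaling function, a standard result of wavelet theory not spelled out on the slide; here $h_j$ denotes the low-pass filter, related to the $a_j$ above:

  % Two-scale relations for the Daubechies scaling function \phi and wavelet \psi;
  % the high-pass filter is built from the low-pass one (standard construction).
  \phi(x) = \sqrt{2} \sum_j h_j \, \phi(2x - j), \qquad
  \psi(x) = \sqrt{2} \sum_j (-1)^j h_{1-j} \, \phi(2x - j)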
Basis set features

BigDFT features in a nutshell:
  - Arbitrary absolute precision can be achieved
  - Good convergence ratio for a real-space approach (O(h^14))
  - Optimal usage of the degrees of freedom (adaptivity)
  - Optimal speed for a systematic approach (less memory)
  - Hartree potential accurate for various boundary conditions:
    free and surface BC Poisson solver (present also in CP2K, ABINIT, OCTOPUS)
  - Data repartition is suitable for optimal scalability
  - Simple communication paradigm; multi-level parallelisation possible (and implemented)

Improve and develop know-how: optimal for advanced DFT functionalities in an HPC framework.
Moore's law

40 years of improvements: transistor counts double every two years... but how?

Power is the limiting factor:
  - Power ∝ Frequency^3 (worked example below)
  - Clock rate is limited
  - Multiple slower devices are preferable to one superfast device
  - More performance with less power → a software problem?
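A quick worked consequence of the cubic power law (the numbers are illustrative, not from the slide):

  % Replace one core at frequency f by two cores at f/2:
  % same nominal throughput, but only a quarter of the power.
  P(f) \propto f^3 \quad\Rightarrow\quad
  2\,P\!\left(\tfrac{f}{2}\right) \propto 2\left(\tfrac{f}{2}\right)^3 = \tfrac{f^3}{4}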
Hybrid supercomputing nowadays

GPGPU on supercomputers:
  - Traditional architectures are somehow saturating:
    more cores per node, memories (slightly) larger but not faster
  - Supercomputer architectures are becoming hybrid:
    3 out of the 5 top supercomputers are hybrid machines
  - Extrapolation: in 2015, No. 500 will be petafloppic, and likely a hybrid machine

Codes should be conceived differently:
  - The number of MPI processes is limited for a fixed problem size
  - Performance increases only by enhancing parallelism
  - Further parallelisation levels should be added (OpenMP, GPU)

Are electronic structure codes suitable for this?
How far is petaflop (for DFT)?

At present, with traditional architectures, routinely used DFT calculations involve:
  - A few dozen (up to hundreds of) processors
  - Parallel-intensive operations (blocking communications, 60-70 percent efficiency)
  - Codes that are not freshly optimised (legacy codes, monster codes)

Optimistic estimation: 5 GFlop/s per core × 2000 cores × 0.9 = 9 TFlop/s,
i.e. 200 times less than Top 500's No. 3!

It is as if:
  - Distance Earth-Moon = 384 Mm
  - Distance Earth-Mars = 78.4 Gm = 200 times more

The Moon is reached... can we go to Mars? (... in 2015?)
Operations performed

The SCF cycle:
  - Orbital scheme (real space ↔ Daubechies wavelets): Hamiltonian, preconditioner
  - Coefficient scheme: overlap matrices, orthogonalisation

Computational operations:
  - Convolutions
  - BLAS routines
  - FFT (Poisson solver)

Why not GPUs?
Separable convolutions

We must calculate

  F(I_1,I_2,I_3) = \sum_{j_1,j_2,j_3=0}^{L} h_{j_1} h_{j_2} h_{j_3} \, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)
                 = \sum_{j_1=0}^{L} h_{j_1} \sum_{j_2=0}^{L} h_{j_2} \sum_{j_3=0}^{L} h_{j_3} \, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)

Application of three successive operations:
  1  A_3(I_3,i_1,i_2) = \sum_j h_j \, G(i_1,i_2,I_3-j)   \forall\, i_1,i_2
  2  A_2(I_2,I_3,i_1) = \sum_j h_j \, A_3(I_3,i_1,I_2-j)  \forall\, I_3,i_1
  3  F(I_1,I_2,I_3)   = \sum_j h_j \, A_2(I_2,I_3,I_1-j)  \forall\, I_2,I_3

Main routine: convolution + transposition

  F(I,a) = \sum_j h_j \, G(a, I-j) \quad \forall\, a
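A minimal Fortran sketch of this scheme (my own illustration, not the BigDFT source; the periodic wrap-around and the assumed-size dummies are assumptions made to keep the example short). Each pass convolves along the leading dimension and sends it to the last position, so three passes restore the original index order:

  subroutine conv_transpose(n, ndat, lowfil, lupfil, h, x, y)
    ! 1D periodic convolution along the leading dimension, with transposition:
    ! y(a, I) = sum_{l=lowfil}^{lupfil} h(l) * x(I+l, a), indices taken modulo n.
    ! x is viewed as (n, ndat) and y as (ndat, n); assumed-size arrays let the
    ! caller pass the same flat work arrays for every pass.
    implicit none
    integer, intent(in) :: n, ndat, lowfil, lupfil
    real(kind=8), intent(in) :: h(lowfil:lupfil), x(*)
    real(kind=8), intent(out) :: y(*)
    integer :: i, j, l
    real(kind=8) :: tt
    do j = 1, ndat
       do i = 0, n - 1
          tt = 0.d0
          do l = lowfil, lupfil
             tt = tt + x(1 + modulo(i + l, n) + n*(j - 1)) * h(l)
          end do
          y(j + ndat*i) = tt
       end do
    end do
  end subroutine conv_transpose

  subroutine conv3d(n1, n2, n3, lowfil, lupfil, h, w1, w2)
    ! Full 3D separable filter as three convolution+transposition passes;
    ! w1 holds G on input, and the result F ends up in w2.
    implicit none
    integer, intent(in) :: n1, n2, n3, lowfil, lupfil
    real(kind=8), intent(in) :: h(lowfil:lupfil)
    real(kind=8), intent(inout) :: w1(n1*n2*n3), w2(n1*n2*n3)
    call conv_transpose(n1, n2*n3, lowfil, lupfil, h, w1, w2)  ! -> (i2,i3,I1)
    call conv_transpose(n2, n3*n1, lowfil, lupfil, h, w2, w1)  ! -> (i3,I1,I2)
    call conv_transpose(n3, n1*n2, lowfil, lupfil, h, w1, w2)  ! -> (I1,I2,I3)
  end subroutine conv3d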
CPU performances of the convolutions

Initially, naive routines (FORTRAN):

  y(j,I) = \sum_{\ell=L}^{U} h_\ell \, x(I+\ell,\, j)

  do j=1,ndat
     do i=0,n1
        tt=0.d0
        do l=lowfil,lupfil
           tt=tt+x(i+l,j)*h(l)
        enddo
        y(j,i)=tt
     enddo
  enddo

Easy to write and debug; defines reference results.

Optimisation can then start (example: X5550, 2.67 GHz):

  Method              GFlop/s   % of peak   SpeedUp
  Naive (FORTRAN)     0.54      5.1         1/(6.25)
  Current (FORTRAN)   3.3       31          1
  Best (C, SSE)       7.8       73          2.3
  OpenCL (Fermi)      97        20          29 (12.4)
How to optimize?

A trade-off between benefit and effort.

FORTRAN based:
  - Relatively accessible (loop unrolling; see the sketch after this list)
  - Moderate optimisation can be achieved relatively fast
  - Compilers fail to use the vector engine efficiently

Pushing optimisation to its limit:
  - Only one out of the 3 convolution types has been implemented
  - About 20 different patterns have been studied for one 1D convolution
  - Tedious work, huge code → maintainability?
  - Automatic code generation under study
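As an illustration of the FORTRAN-level unrolling mentioned above (a minimal sketch of the idea, not the actual BigDFT kernel; it reuses the variables of the naive routine and assumes ndat is a multiple of 4):

  ! The ndat loop is unrolled by 4: each filter element h(l) is loaded once
  ! and used for four columns, helping the compiler keep the four
  ! accumulators in registers.
  do j = 1, ndat, 4
     do i = 0, n1
        tt0 = 0.d0; tt1 = 0.d0; tt2 = 0.d0; tt3 = 0.d0
        do l = lowfil, lupfil
           tt0 = tt0 + x(i+l, j  ) * h(l)
           tt1 = tt1 + x(i+l, j+1) * h(l)
           tt2 = tt2 + x(i+l, j+2) * h(l)
           tt3 = tt3 + x(i+l, j+3) * h(l)
        enddo
        y(j  , i) = tt0
        y(j+1, i) = tt1
        y(j+2, i) = tt2
        y(j+3, i) = tt3
     enddo
  enddo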
MPI parallelization I: Orbital distribution scheme

Used for the application of the Hamiltonian.
Operator approach: the Hamiltonian (convolutions) is applied separately onto each wavefunction.

[Figure: orbitals psi1..psi5 distributed whole among processes MPI 0, MPI 1, MPI 2]
MPI parallelization II: Coefficient distribution scheme

Used for scalar products & orthonormalisation.
BLAS routines (level 3) are called, then the result is reduced.

[Figure: the coefficients of each orbital psi1..psi5 split among processes MPI 0, MPI 1, MPI 2]

At present, MPI_ALLTOALL(V) is used to switch between the two schemes.
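A minimal sketch of such a switch (my own illustration, not the BigDFT source; it assumes equal block sizes, ncoeff divisible by nproc, and a send buffer already packed per destination — MPI_ALLTOALLV handles the general, unequal case):

  subroutine switch_scheme(psi, psi_t, norb_loc, ncoeff, comm)
    ! Switch from the orbital distribution (norb_loc whole orbitals of ncoeff
    ! coefficients per process) to the coefficient distribution (ncoeff/nproc
    ! coefficients of every orbital per process).
    use mpi
    implicit none
    integer, intent(in) :: norb_loc, ncoeff, comm
    real(kind=8), intent(in)  :: psi(norb_loc*ncoeff)
    real(kind=8), intent(out) :: psi_t(norb_loc*ncoeff)
    integer :: nproc, chunk, ierr
    call MPI_Comm_size(comm, nproc, ierr)
    chunk = norb_loc * (ncoeff / nproc)
    ! every process exchanges one equal-size block with every other process
    call MPI_Alltoall(psi,   chunk, MPI_DOUBLE_PRECISION, &
                      psi_t, chunk, MPI_DOUBLE_PRECISION, comm, ierr)
  end subroutine switch_scheme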
OpenMP parallelisation

Innermost parallelisation level:
  - (Almost) any BigDFT operation is parallelised via OpenMP (a threaded fragment is sketched below)
  - Useful for memory-demanding calculations
  - Allows a further increase of speedups
  - Saves MPI processes and intra-node message passing

Caveats:
  - Less efficient than MPI
  - Compiler and system dependent
  - OMP sections should be regularly maintained

[Plot: OMP speedup (roughly 1.5 to 4.5) vs. number of MPI processes, up to ~120, for 2, 3 and 6 OMP threads]
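For instance, the naive convolution shown earlier can be threaded over the ndat dimension (a minimal sketch reusing the same variables; BigDFT's own OMP sections are more elaborate):

  ! Distribute the outer (ndat) loop over threads; each thread writes
  ! disjoint columns of y, so no synchronisation is needed inside the loop.
  !$omp parallel do default(shared) private(i, l, tt)
  do j = 1, ndat
     do i = 0, n1
        tt = 0.d0
        do l = lowfil, lupfil
           tt = tt + x(i+l, j) * h(l)
        enddo
        y(j, i) = tt
     enddo
  enddo
  !$omp end parallel do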
Task repartition for a small system (ZnO, 128 atoms)

[Plot: time repartition (Conv, LinAlg, Comms, Other), efficiency (%) and time (s, log scale) vs. number of cores from 16 to 576, with 1 OMP thread per core]

What are the ideal conditions for the GPU?
  - GPU-ported routines should take the majority of the time
  - What happens to parallel efficiency?
Parallelisation and architectures

Same code, same runs. Which is the best?

[Two plots: the same task repartition, efficiency and time vs. number of cores (16 to 576), on CCRT Titane (Nehalem, Infiniband) and on CSCS Rosa (Opteron, Cray XT5)]

Titane is 2.3 to 1.6 times faster than Rosa!

Degradation of parallel performance: why?
  1 Calculation power has increased more than networking
  2 Better libraries (MKL)

Walltime is reduced, but parallel efficiency is lower.
This will always happen while using GPUs!
Architectures, libraries, networking

Same runs, same sources; different user conditions.

[Plot: CPU hours vs. number of cores (up to 600) for Titane MpiBull2 no MKL, Jade MPT, Titane MpiBull2 + MKL, Titane OpenMPI + MKL, Rosa Cray istanbul]

Differences up to a factor of 3!

A case-by-case study: considerations are often system-dependent, and a rule of thumb does not always exist. Know your code!
Intranode bandwidth problem

B80 cluster, one Keeneland node (CPU only).

[Plot: time repartition (Potential, Conv, LinAlg, Comms, Other), efficiency and speedup for MPI-process/OMP-thread combinations from 1-1 up to 12-1]

Scalability does not depend only on communication: Amdahl's law is an upper limit! (See the worked example below.)
Using GPUs in a big systematic DFT code

Nature of the operations:
  - Operator approach via convolutions
  - Linear algebra, due to the orthogonality of the basis
  - Communications and calculations do not interfere
  - A number of operations which can be accelerated

Evaluating GPU convenience: three levels of evaluation
  1 Bare speedups: GPU kernels vs. CPU routines.
    Are the operations suitable for the GPU?
  2 Full code speedup on one process.
    Amdahl's law: are there hot-spot operations?
  3 Speedup in a (massively?) parallel environment.
    The MPI layer adds an extra level of complexity.
BigDFT in hybrid codes

Acceleration of the full BigDFT code:
  - Considerable gain may be achieved for suitable systems
  - Amdahl's law should always be considered
  - Resources can be used concurrently (OpenCL queues)
  - More MPI processes may share the same card!

[Plot: Badiane, X5550 + Fermi S2070, ZnO 64 at., CPU vs. hybrid: time repartition (Conv, LinAlg, Comms, Other) and speedup (up to ~10) for CPU-mkl, CPU-mkl-mpi, CUDA, CUDA-mkl, OCL-cublas, OCL-mkl, CUDA-mpi, CUDA-mkl-mpi, OCL-cublas-mpi, OCL-mkl-mpi]
The time-to-solution problem I: Efficiency
(Lower time-to-solution for a given architecture.)

Good example: 4 C atoms, surface BC, 113 k-points.
Parallel efficiency of 98%; convolutions largely dominate.
Node: 2× Fermi + 8× Westmere, 8 MPI processes.

  # GPU added      2     4     8
  SpeedUp (SU)     5.3   9.8   11.6
  # MPI equiv.     44    80    96
  Acceler. Eff.    1     .94   .56

(See the note below on reading this table.)

[Plot: SpeedUp vs. number of MPI processes, up to ~100: ideal, CPU+MPI, GPU]
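As a reading aid (my inference from the numbers, not stated on the slide), the "# MPI equiv." row is consistent with the measured speedup applied to the 8 MPI processes of the run:

  % # MPI equiv. \approx SU \times 8 MPI processes:
  5.3 \times 8 \approx 44, \qquad 9.8 \times 8 \approx 80, \qquad 11.6 \times 8 \approx 96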
The time-to-solution problem II: Robustness

Not so good example: a too small system.

[Plot: Titane, ZnO 64 at., CPU vs. hybrid: time repartition, efficiency and GPU speedup vs. number of MPI processes from 1 to 144]

  - CPU efficiency is poor (the calculation is too fast)
  - Amdahl's law is not favourable (5× SU at most)
  - The GPU SU is almost independent of the size
  - The hybrid code always goes faster
Hybrid and heterogeneous runs with OpenCL

An NVidia S2070 and an ATI HD 6970, each connected to a Nehalem workstation: BigDFT may run on both.

Sample BigDFT run: graphene, 4 C atoms, 52 k-points.
No. of Flop: 8.053 · 10^12

  MPI        1     1      4      1      4      8
  GPU        NO    NV     NV     ATI    ATI    NV+ATI
  Time (s)   6020  300    160    347    197    109
  Speedup    1     20.07  37.62  17.35  30.55  55.23
  GFlop/s    1.34  26.84  50.33  23.2   40.87  73.87
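As a consistency check (my own arithmetic from the rows above), the GFlop/s row is just the flop count divided by the runtime:

  % e.g. the heterogeneous 8-MPI NV+ATI run:
  \frac{8.053 \times 10^{12}\ \mathrm{Flop}}{109\ \mathrm{s}} \approx 73.9\ \mathrm{GFlop/s}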
Next Step: handling of Load (un)balancing
A look into the near future: science with HPC DFT codes

A concerted set of actions:
  - Improve code functionalities for present-day and next-generation supercomputers
  - Test and develop new formalisms
  - Insert ab initio codes into new scientific workflows (multiscale modelling)

The Mars mission: is petaflop performance possible?
  - GPU acceleration → one order of magnitude
  - Bigger systems, heavier methods → (more than) one order of magnitude bigger

The BigDFT experience makes this feasible: an opportunity to achieve important outcomes and know-how.
Exercises of this afternoon

Profiling and interpreting BigDFT results.

B80 cluster, Keeneland, multi-node (CPU only):
  - Dimension the job size
  - Interpret and distinguish intranode and internode behaviour

[Plot: time repartition (Potential, Conv, LinAlg, Comms, Other) and speedup for MPI-process/OMP-thread combinations from 12-1 up to 120-1]

BigDFT usage of GPU resources.

ZnO 8 atoms, gamma point, Keeneland, intranode (GPU accelerated):
  - Combine MPI, OpenMP and GPU acceleration

[Plot: time repartition and speedup for MPI-process/OMP-thread/acceleration combinations (CPU, CUDA, OCL), from 1-1-CPU up to 12-1-OCL]