Daubechies Wavelets in Electronic Structure Calculations: BigDFT Code Tutorial

BigDFT approach to High Performance Computing

ORNL/NICS, Oak Ridge, Tennessee

Luigi Genovese
L_Sim – CEA Grenoble
Laboratoire de Simulation Atomistique, http://inac.cea.fr/L_Sim

February 8, 2012

Outline

1. The code
   - Formalism and properties
   - The need for hybrid DFT codes
   - Main operations, parallelisation

2. Code details
   - Performance evaluation
   - Evaluating the GPU gain
   - Practical cases

3. Discussion

A DFT code based on Daubechies wavelets

BigDFT: a pseudopotential (PSP) Kohn-Sham code.
A Daubechies wavelet basis has unique properties for DFT usage:
- Systematic, orthogonal
- Localised, adaptive
- Kohn-Sham operators are analytic
- Short, separable convolutions:  \tilde{c}_\ell = \sum_j a_{\ell-j}\, c_j

[Figure: the Daubechies scaling function φ(x) and wavelet ψ(x), plotted for x between -6 and 8.]

Peculiar numerical properties:
- Real-space based, highly flexible
- Suited to big and inhomogeneous systems

BigDFT features in a nutshell

- Arbitrary absolute precision can be achieved: good convergence ratio for a real-space approach (O(h^14))
- Optimal usage of the degrees of freedom (adaptivity): optimal speed for a systematic approach (less memory)
- Hartree potential accurate for various boundary conditions: free and surface BC Poisson solver (also present in CP2K, ABINIT, ...)
- Data repartition is suitable for optimal scalability: simple communication paradigm, multi-level parallelisation possible (and implemented)

Improve and develop know-how: optimal for advanced DFT functionalities in an HPC framework.

Moore's law

40 years of improvements: transistor counts double every two years... but how?

Power is the limiting factor
- Power ∝ Frequency^3, so the clock rate is limited
- Multiple slower devices are preferable to one superfast device
- More performance with less power → problem?

Hybrid supercomputing nowadays

GPGPU on supercomputers
- Traditional architectures are somehow saturating: more cores per node, memories (slightly) larger but not faster
- Supercomputer architectures are becoming hybrid: 3 out of the 5 top supercomputers are hybrid machines
- Extrapolation: in 2015 the No. 500 machine of the list will be at the petaflop level, and it will likely be a hybrid machine

Codes should be conceived differently
- The number of MPI processes is limited for a fixed problem size
- Performance increases only by enhancing parallelism
- Further parallelisation levels should be added (OpenMP, GPU)

Are electronic structure codes suitable?

How far is petaflop (for DFT)?

At present, with traditional architectures, routinely used DFT calculations involve:
- A few dozen (up to a few hundred) processors
- Communication-intensive parallel operations (blocking communications, 60-70 percent efficiency)
- Codes that are not freshly optimised (legacy codes, monster codes)

Optimistic estimate: 5 GFlop/s per core × 2000 cores × 0.9 ≈ 9 TFlop/s, about 200 times less than Top 500's #3.

It is as if:
- Distance Earth-Moon = 384 Mm
- Distance Earth-Mars = 78.4 Gm, about 200 times more

The Moon is reached... can we go to Mars? (... in 2015?)

Operations performed

The SCF cycle (real space ↔ Daubechies wavelets)
- Orbital scheme: Hamiltonian, preconditioner
- Coefficient scheme: overlap matrices, orthogonalisation

Computational operations
- Convolutions
- BLAS routines
- FFT (Poisson solver)

Why not GPUs?

Separable convolutions

We must calculate

   F(I_1,I_2,I_3) = \sum_{j_1,j_2,j_3=0}^{L} h_{j_1} h_{j_2} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)
                  = \sum_{j_1=0}^{L} h_{j_1} \sum_{j_2=0}^{L} h_{j_2} \sum_{j_3=0}^{L} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)

Application of three successive operations:
1. A_3(I_3,i_1,i_2) = \sum_j h_j\, G(i_1,i_2,I_3-j)      \forall i_1,i_2;
2. A_2(I_2,I_3,i_1) = \sum_j h_j\, A_3(I_3,i_1,I_2-j)    \forall I_3,i_1;
3. F(I_1,I_2,I_3)   = \sum_j h_j\, A_2(I_2,I_3,I_1-j)    \forall I_2,I_3.

Main routine: convolution + transposition

   F(I,a) = \sum_j h_j\, G(a, I-j)      \forall a
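To illustrate how the three successive operations map onto this single routine, here is a minimal, self-contained Fortran sketch (hypothetical names, a placeholder filter and periodic wrap-around, not the actual BigDFT kernels): one "convolve along the fastest index and transpose it to the slowest" pass is applied three times to sweep all directions.

   program separable_convolution_sketch
     ! Minimal sketch (hypothetical names, placeholder filter, periodic wrap-around):
     ! the 3D separable convolution is built from three successive passes of a single
     ! "convolve along the fastest index and transpose it to the slowest" routine.
     implicit none
     integer, parameter :: n1 = 8, n2 = 8, n3 = 8
     integer, parameter :: lowfil = -7, lupfil = 8
     double precision :: h(lowfil:lupfil), w1(n1*n2*n3), w2(n1*n2*n3)
     h = 1.d0 / dble(lupfil - lowfil + 1)   ! placeholder filter, not the Daubechies one
     call random_number(w1)                 ! placeholder input array G
     ! each pass convolves the current fastest index and rotates it to the slowest position:
     call conv_transpose(n1, n2*n3, h, w1, w2)   ! layout (n1,n2,n3) -> (n2,n3,n1)
     call conv_transpose(n2, n3*n1, h, w2, w1)   ! layout (n2,n3,n1) -> (n3,n1,n2)
     call conv_transpose(n3, n1*n2, h, w1, w2)   ! layout (n3,n1,n2) -> (n1,n2,n3)
     print *, 'sample of the result F:', w2(1)
   contains
     subroutine conv_transpose(n, ndat, h, x, y)
       integer, intent(in) :: n, ndat
       double precision, intent(in)  :: h(lowfil:lupfil)
       double precision, intent(in)  :: x(n*ndat)   ! input : convolved index is the fastest
       double precision, intent(out) :: y(ndat*n)   ! output: convolved index becomes the slowest
       integer :: i, j, l
       double precision :: tt
       do j = 1, ndat
          do i = 0, n-1
             tt = 0.d0
             do l = lowfil, lupfil
                tt = tt + x(modulo(i+l, n) + 1 + (j-1)*n) * h(l)   ! x(i+l, j), periodic wrap
             end do
             y(j + i*ndat) = tt                                    ! y(j, i): transposed storage
          end do
       end do
     end subroutine conv_transpose
   end program separable_convolution_sketch

Each pass rotates the array layout, so after three passes the original (n1, n2, n3) ordering is recovered with all three directions convolved.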

CPU performances of the convolutions

Initially, naive routines implementing  y(j,I) = \sum_{\ell=L}^{U} h_\ell\, x(I+\ell, j):

   do j=1,ndat
      do i=0,n1
         tt=0.d0
         do l=lowfil,lupfil
            tt=tt+x(i+l,j)*h(l)
         enddo
         y(j,i)=tt
      enddo
   enddo

- Easy to write and debug
- Defines the reference results

Optimisation can then start (example: X5550, 2.67 GHz)

   Method               GFlop/s   % of peak   SpeedUp
   Naive (FORTRAN)        0.54       5.1      1/6.25
   Current (FORTRAN)      3.3       31        1
   Best (C, SSE)          7.8       73        2.3
   OpenCL (Fermi)        97         20        29 (12.4)

How to optimize?

A trade-off between benefit and effort.

FORTRAN based
- Relatively accessible (loop unrolling; see the sketch below)
- Moderate optimisation can be achieved relatively fast
- Compilers fail to use the vector engine efficiently

Pushing optimisation to its limit
- Only one out of the three convolution types has been implemented
- About 20 different patterns have been studied for one 1D convolution
- Tedious work, huge code → maintainability?
- Automatic code generation is under study
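For illustration, a sketch of what "loop unrolling" can look like on the naive kernel above (a hypothetical routine, not the actual optimised BigDFT code): unrolling by four over the ndat dimension lets each filter coefficient h(l) be reused across four accumulators, which helps the compiler keep data in registers and fill vector lanes.

   subroutine conv_unrolled4(n1, ndat, lowfil, lupfil, h, x, y)
     ! Sketch of 4-way unrolling over ndat (hypothetical, not the BigDFT kernel).
     ! Assumes mod(ndat,4) == 0 and that x is padded so that i+l stays in bounds.
     implicit none
     integer, intent(in) :: n1, ndat, lowfil, lupfil
     double precision, intent(in)  :: h(lowfil:lupfil)
     double precision, intent(in)  :: x(lowfil:n1+lupfil, ndat)
     double precision, intent(out) :: y(ndat, 0:n1)
     integer :: i, j, l
     double precision :: tt0, tt1, tt2, tt3
     do j = 1, ndat, 4
        do i = 0, n1
           tt0 = 0.d0; tt1 = 0.d0; tt2 = 0.d0; tt3 = 0.d0
           do l = lowfil, lupfil
              ! each h(l) is loaded once and used for four output lines
              tt0 = tt0 + x(i+l, j  ) * h(l)
              tt1 = tt1 + x(i+l, j+1) * h(l)
              tt2 = tt2 + x(i+l, j+2) * h(l)
              tt3 = tt3 + x(i+l, j+3) * h(l)
           end do
           y(j  , i) = tt0
           y(j+1, i) = tt1
           y(j+2, i) = tt2
           y(j+3, i) = tt3
        end do
     end do
   end subroutine conv_unrolled4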

MPI parallelization I: Orbital distribution scheme

Used for the application of the Hamiltonian.
Operator approach: the Hamiltonian (convolutions) is applied separately onto each wavefunction (a minimal sketch follows below).

[Diagram: the orbitals ψ1 ... ψ5 are distributed among the MPI processes, e.g. ψ1, ψ2 on MPI 0, ψ3, ψ4 on MPI 1 and ψ5 on MPI 2.]
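A minimal sketch of the idea (hypothetical names and interfaces; apply_local_hamiltonian stands for the convolution-based application of the Hamiltonian, and the actual BigDFT data structures differ): each process applies the operator only to the orbitals it owns, with no communication at this stage.

   subroutine apply_hamiltonian_orbital_scheme(iproc, nproc, norb, npsidim, psi, hpsi)
     ! Sketch of the orbital distribution scheme (hypothetical names).
     ! Each MPI process owns a block of orbitals and applies the Hamiltonian
     ! to them independently.
     implicit none
     integer, intent(in) :: iproc, nproc, norb, npsidim
     double precision, intent(in)    :: psi(npsidim, *)    ! locally stored orbitals
     double precision, intent(inout) :: hpsi(npsidim, *)   ! H|psi> for the same orbitals
     integer :: norbp, iorb_start, iorb_end, iorb
     norbp = (norb + nproc - 1) / nproc                    ! block size per process
     iorb_start = iproc * norbp + 1
     iorb_end   = min(norb, (iproc + 1) * norbp)
     do iorb = iorb_start, iorb_end
        call apply_local_hamiltonian(psi(:, iorb - iorb_start + 1), &
                                     hpsi(:, iorb - iorb_start + 1))   ! hypothetical routine
     end do
   end subroutine apply_hamiltonian_orbital_scheme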

MPI parallelization II: Coefficient distribution scheme

Used for scalar products and orthonormalisation.
BLAS routines (level 3) are called, then the result is reduced.

[Diagram: in the coefficient scheme each orbital ψ1 ... ψ5 is split among the MPI processes (MPI 0, MPI 1, MPI 2), each process holding a slice of the coefficients of every orbital.]

At present, MPI_ALLTOALL(V) is used to switch between the two distribution schemes.
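A minimal sketch of such a switch expressed with MPI_ALLTOALLV (hypothetical names; the actual BigDFT routine also deals with padding, spin and k-points):

   subroutine switch_to_coefficient_scheme(psi_orb, psi_coef, sendcounts, sdispls, &
                                           recvcounts, rdispls)
     ! Sketch (hypothetical names): switch from the orbital to the coefficient
     ! distribution with MPI_ALLTOALLV.
     use mpi
     implicit none
     double precision, intent(in)  :: psi_orb(*)    ! my orbitals, orbital scheme
     double precision, intent(out) :: psi_coef(*)   ! my slice of all orbitals
     integer, intent(in) :: sendcounts(:), sdispls(:), recvcounts(:), rdispls(:)
     ! sendcounts(p): coefficients of my orbitals that belong to the slice of
     ! process p; recvcounts(p) is the converse.
     integer :: ierr
     call MPI_ALLTOALLV(psi_orb,  sendcounts, sdispls, MPI_DOUBLE_PRECISION, &
                        psi_coef, recvcounts, rdispls, MPI_DOUBLE_PRECISION, &
                        MPI_COMM_WORLD, ierr)
     ! level-3 BLAS (e.g. DGEMM for the overlap matrix) can now run on psi_coef,
     ! followed by an MPI_ALLREDUCE of the partial overlap matrices.
   end subroutine switch_to_coefficient_scheme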

OpenMP parallelisation

Innermost parallelisation level: (almost) any BigDFT operation is parallelised via OpenMP (a sketch of the convolution case follows below).
- Useful for memory-demanding calculations
- Allows a further increase of the speedup
- Saves MPI processes and intra-node message passing
- Less efficient than MPI
- Compiler and system dependent
- OMP sections should be regularly maintained

[Figure: OpenMP speedup (roughly 1.5 to 4.5) as a function of the number of MPI processes (up to about 120), for 2, 3 and 6 OpenMP threads per process.]
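As an illustration only (not the actual BigDFT source), the innermost OpenMP level on the convolution kernel shown earlier essentially amounts to sharing the independent ndat lines among the threads of each MPI process:

   ! Sketch of the OpenMP level on the 1D convolution (not the actual BigDFT code):
   ! the ndat independent lines are shared among the threads of each MPI process.
   !$omp parallel do default(shared) private(i,j,l,tt) schedule(static)
   do j = 1, ndat
      do i = 0, n1
         tt = 0.d0
         do l = lowfil, lupfil
            tt = tt + x(i+l, j) * h(l)
         end do
         y(j, i) = tt
      end do
   end do
   !$omp end parallel do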

Task repartition for a small system (ZnO, 128 atoms)

[Figure: with 1 OpenMP thread per core, the time breakdown (percentage spent in convolutions, linear algebra, communications and other), the parallel efficiency and the wall time (log scale), from 16 to 576 cores.]

What are the ideal conditions for the GPU?
- GPU-ported routines should take the majority of the time
- What happens to the parallel efficiency?

Parallelisation and architectures

Same code, same runs: which is the best?

[Figure: the same task-repartition plots (time breakdown, efficiency and wall time versus number of cores, 16 to 576) on two machines: CCRT Titane (Nehalem, Infiniband) and CSCS Rosa (Opteron, Cray XT5).]

Titane is 2.3 to 1.6 times faster than Rosa!

Degradation of parallel performance: why?
1. Calculation power has increased more than networking
2. Better libraries (MKL)

Walltime is reduced, but the parallel efficiency is lower. This will always happen when using GPUs!

Architectures, libraries, networking

Same runs, same sources; different user conditions.

[Figure: CPU hours versus number of cores (up to 600) for the same run under different conditions: Titane MpiBull2 without MKL, Jade MPT, Titane MpiBull2 with MKL, Titane OpenMPI with MKL, Rosa Cray (Istanbul). Differences of up to a factor of 3!]

A case-by-case study: considerations are often system-dependent, and a rule of thumb does not always exist. Know your code!

Intranode bandwidth problem

[Figure: B80 cluster, one Keeneland node (CPU only): time breakdown (convolutions, linear algebra, potential, communications, other), efficiency and speedup (up to about 9) for different MPI process / OpenMP thread combinations, from 1-1 up to 12-1.]

Scalability does not depend only on communication: Amdahl's law is an upper limit!

Using GPUs in a big systematic DFT code

Nature of the operations
- Operator approach via convolutions
- Linear algebra, due to the orthogonality of the basis
- Communications and calculations do not interfere
- A number of operations can be accelerated

Evaluating GPU convenience: three levels of evaluation
1. Bare speedups, GPU kernels vs. CPU routines: are the operations suitable for the GPU?
2. Full-code speedup on one process. Amdahl's law: are there hot-spot operations? (A worked example follows below.)
3. Speedup in a (massively?) parallel environment: the MPI layer adds an extra level of complexity.
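To make the Amdahl argument concrete, a worked example with illustrative numbers (not measurements from this talk): if the GPU-ported routines account for a fraction p of the run time and are accelerated by a factor s, the full-code speedup is bounded by

   S(p, s) = 1 / ((1 - p) + p/s).

With p = 0.9 and s = 20, S = 1/(0.1 + 0.045) ≈ 6.9; even for s → ∞ the speedup cannot exceed 1/(1 - p) = 10.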

BigDFT in hybrid codes

Acceleration of the full BigDFT code
- Considerable gain may be achieved for suitable systems
- Amdahl's law should always be considered
- Resources can be used concurrently (OpenCL queues)
- More MPI processes may share the same card!

[Figure: Badiane, X5550 + Fermi S2070, ZnO 64 atoms, CPU vs. hybrid runs: time breakdown and speedup (up to about 10) for the CPU-mkl, CPU-mkl-mpi, CUDA, CUDA-mkl, OCL-cublas, OCL-mkl, CUDA-mpi, CUDA-mkl-mpi, OCL-cublas-mpi and OCL-mkl-mpi configurations.]

(Lower time-to-solution for a given architecture.)

The time-to-solution problem I: Efficiency

A good example: 4 C atoms, surface BC, 113 k-points.
Parallel efficiency of 98%; convolutions largely dominate.
Node: 2× Fermi + 8× Westmere, 8 MPI processes.

   # GPUs added      2      4      8
   SpeedUp (SU)      5.3    9.8    11.6
   # MPI equiv.      44     80     96
   Acceler. Eff.     1      .94    .56

[Figure: speedup versus number of MPI processes (up to about 100): ideal scaling, CPU+MPI and GPU curves.]

The time-to-solution problem II: Robustness

A not-so-good example: a system that is too small.

[Figure: Titane, ZnO 64 atoms, CPU vs. hybrid runs: time breakdown, parallel efficiency and speedup with GPU for 1 to 144 MPI processes, for the CPU code and the hybrid code.]

- CPU efficiency is poor (the calculation is too fast)
- Amdahl's law is not favourable (5x SU at most)
- The GPU speedup is almost independent of the system size
- The hybrid code always goes faster

Hybrid and heterogeneous runs with OpenCL

An NVidia S2070 and an ATI HD 6970, each connected to a Nehalem workstation: BigDFT may run on both.

Sample BigDFT run: graphene, 4 C atoms, 52 k-points; No. of Flop: 8.053 · 10^12

   MPI         1       1       4       1       4       8
   GPU         NO      NV      NV      ATI     ATI     NV+ATI
   Time (s)    6020    300     160     347     197     109
   Speedup     1       20.07   37.62   17.35   30.55   55.23
   GFlop/s     1.34    26.84   50.33   23.2    40.87   73.87

Next step: handling of load (un)balancing.
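As a cross-check of the table above, the GFlop/s row is simply the total number of floating-point operations divided by the wall time: 8.053 · 10^12 Flop / 6020 s ≈ 1.34 GFlop/s for the CPU-only run, and 8.053 · 10^12 Flop / 109 s ≈ 74 GFlop/s for the heterogeneous NV+ATI run.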

A look into the near future: science with HPC DFT codes

A concerted set of actions
- Improve code functionalities for present-day and next-generation supercomputers
- Test and develop new formalisms
- Insert ab initio codes into new scientific workflows (multiscale modelling)

The Mars mission: is petaflop performance possible?
- GPU acceleration → one order of magnitude
- Bigger systems, heavier methods → (more than) one order of magnitude bigger

The BigDFT experience makes this feasible: an opportunity to achieve important outcomes and know-how.

Exercises of this afternoon

Profiling and interpreting BigDFT results
- Dimension the job size
- Interpret and distinguish intranode and internode behaviour

[Figure: B80 cluster, Keeneland, multi-node (CPU only): time breakdown and speedup (up to about 7) for various MPI process / OpenMP thread combinations using 12 to 120 cores.]

BigDFT usage of GPU resources
- Combine MPI, OpenMP and GPU acceleration

[Figure: ZnO 8 atoms, gamma point, Keeneland, intranode (GPU accelerated): time breakdown and speedup (up to about 11) for MPI process / OpenMP thread / acceleration combinations (CPU, CUDA, OpenCL).]
