HPC Issues for DFT Calculations

Adrian Jackson, EPCC
http://www.nu-fuse.com

Scientific Simulation
• Simulation is fast becoming the 4th pillar of science
  – Observation, Theory, Experimentation, Simulation
• Reproduce the "real world" in computers
  – Test theories
  – Predict or validate experiments
  – Simulate "untestable" science
• Explore the universe through simulation rather than experimentation
• Generally simplified
  – Dimensions and timescales restricted
  – Simulation of a scientific problem or environment
  – Input of real data
  – Output of simulated data
  – Parameter space studies
  – Wide range of approaches

Reduce runtime
• Serial code optimisations
  – Reduce runtime through efficiencies
  – Unlikely to produce the required savings
• Upgrade hardware
  – 1965: Moore's law predicts growth in the complexity of processors
  – Doubling of CPU performance
  – Performance now often improved through on-chip parallelism

Parallel Background
• Why not just make a faster chip?
• Physical limitations to the size and speed of a single chip
  – Theoretical: speed of light, size of atoms, dissipation of heat
  – Practical: capacitance increases with complexity
• The power used by a CPU core is proportional to Clock Frequency x Voltage^2
  – Clock speed vs voltage reduction for power requirements
  – Voltages become too small for "digital" differences
• Developing new chips is incredibly expensive
• Must make maximum use of existing technology
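A back-of-the-envelope illustration of the power bullet above: with dynamic power roughly proportional to frequency x voltage^2, doubling the clock (which usually also needs a higher voltage) costs far more power than doubling the core count at a slightly reduced clock and voltage. This is a toy model; the scaling factors are illustrative assumptions, not measured figures.

#include <stdio.h>

/* Toy dynamic-power model: P ~ cores * frequency * voltage^2.
 * The scaling factors below are illustrative assumptions only.       */
static double power(double cores, double freq_ghz, double volts)
{
    return cores * freq_ghz * volts * volts;
}

int main(void)
{
    double base = power(4, 2.0, 1.0);            /* 4 cores @ 2.0 GHz */

    /* Option A: double the clock; assume voltage must rise ~20%.     */
    double faster = power(4, 4.0, 1.2);

    /* Option B: double the cores; run each slightly slower and lower. */
    double wider = power(8, 1.8, 0.9);

    printf("baseline          : %.2f (relative units)\n", base / base);
    printf("2x clock          : %.2f x baseline power\n", faster / base);
    printf("2x cores, lower f : %.2f x baseline power\n", wider / base);
    /* Roughly comparable peak arithmetic throughput in both cases,   */
    /* but the parallel option costs far less power.                  */
    return 0;
}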

Parallel Systems
• Different types of parallel systems
  – Shared memory
  – Distributed memory
• Shared memory
  – Each processor has access to a global memory
  – Communications via writes/reads to memory
  – Caches are automatically kept coherent or up-to-date
  – Simple to program (no explicit communications)
  – Scaling is difficult because of the memory access bottleneck
  – Usually modest numbers of processors
• Distributed memory
  – Each processor has its own local memory
  – Processors connected by some interconnect mechanism
  – Processors communicate via explicit message passing
  – Highly scalable architecture
• Shared memory: OpenMP
• Distributed memory: MPI (a minimal MPI sketch appears at the end of this block of background slides)
[Figure: shared-memory node (processors on a coherent memory bus) vs distributed-memory machine (processors with local memory joined by an interconnect)]

HPC Trends
• FLOPS
  – Kilo: 10^3
  – Mega: 10^6
  – Giga: 10^9
  – Tera: 10^12
  – Peta: 10^15
  – Exa: 10^18
  – Zetta: 10^21
  – Yotta: 10^24

Multi- and Many-core
• Now explicit parallelism in chip design
  – Beyond the implicit parallelism of pipelines, multi-issue and vector units
• Now possible (and economically desirable) to place multiple processors on a chip
• From a programming perspective, this is largely irrelevant
  – Simply a convenient way to build a small SMP
  – On-chip buses can have very high bandwidth
• Main difference is that processors may share caches
  – Typically, each core has its own Level 1 and Level 2 caches, but the Level 3 cache is shared between cores
• May also share other functional units, i.e. FPU

HECToR XE6 Compute Node
• Node architecture:
  – 2 x 16-core 2.3 GHz processors (really 4 x 8-core)
  – 32 GB DDR3 RAM, 85.3 GB/s (3.6 GB/s/core)
• Gemini interconnect:
  – 1 Gemini per 2 compute nodes
  – Interconnect connected directly through HyperTransport
  – Latency ~1-1.5 μs; bandwidth 5 GB/s
  – Hardware support for: MPI, single-sided communications, direct memory access

Shared memory clusters
• Dominant HPC architecture
  – Shared memory clusters
  – Multi-core nodes
• Dominant processor architecture
  – Multi-core processors
  – Small shared memory systems on chip
  – 4-16 cores available per processor
• Complex hierarchy of technologies
  – Combine to give 48-64 cores per server
  – Efficient utilisation more challenging
[Figure: cluster of shared-memory multi-core nodes joined by an interconnect]

Accelerators
• Need a chip which can perform many parallel operations on every clock cycle
  – Many cores and/or many operations per core
• Want to keep power/core as low as possible
• Much of the power expended by CPU cores is on functionality not generally useful for HPC
  – Branch prediction, out-of-order execution etc.
• However, still need to run complex programs (operating systems)
• Solution is to add specialised processing elements to standard systems: accelerators
  – GPGPU (Graphical Processing Units) most common

AMD 12-core CPU
• Not much space on the CPU is dedicated to compute
  (compute unit = core)
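As flagged on the Parallel Systems slide above, here is a minimal illustration of the distributed-memory model: each process has only its own local memory, and data is shared by explicit message passing. This is a generic sketch for illustration, not code from any of the packages discussed later.

#include <stdio.h>
#include <mpi.h>

/* Minimal distributed-memory example: rank 0 sends a value to rank 1.
 * Compile with: mpicc hello_mpi.c -o hello_mpi
 * Run with:     mpirun -np 2 ./hello_mpi                              */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?    */

    double local = 3.14 * rank;             /* lives in local memory only */

    if (rank == 0 && size > 1) {
        /* Explicit message passing: nothing is shared automatically. */
        MPI_Send(&local, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double remote;
        MPI_Recv(&remote, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f from rank 0\n", remote);
    }

    MPI_Finalize();
    return 0;
}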
NVIDIA Fermi GPU
• The GPU dedicates much more space to compute
  – At the expense of caches, controllers, sophistication etc.
  (compute unit = SM = 32 CUDA cores)

GPU Performance
• GPU vs CPU: theoretical peak capabilities

                            NVIDIA Fermi GPU    AMD Magny-Cours (6172)
  Cores                     448 (1.15 GHz)      12 (2.1 GHz)
  Operations/cycle          1                   4
  DP Performance (peak)     515 GFlops          101 GFlops
  Memory Bandwidth (peak)   144 GB/s            27.5 GB/s

• For these particular products, the theoretical GPU advantage is ~5x for both compute and main memory bandwidth
• Application performance depends very much on the application
  – Typically a small fraction of peak
  – Depends how well the application is suited to/tuned for the architecture

Intel Xeon Phi
• 60 cores, 4 threads per core
  – 1.053 GHz
  – 1 Tflop/s DP performance
• CPU-like:
  – x86 cores: same instruction set as standard CPUs
  – Relatively small number of cores/chip
  – Fully cache coherent
  – In principle could support an OS
• GPU-like:
  – Simple cores, lack sophistication, e.g. no out-of-order execution
  – Each core contains a 512-bit vector processing unit (16 SP or 8 DP numbers)
  – Supports multithreading
  – Not expected to run an OS, but used as an accelerator (at least initially)

Accelerators
• Challenge to fully exploit both processor and accelerator
• Small amounts of memory per core
• Cost of transfer of data
• Access to network
  – Currently through the host

Exascale challenges
• The exascale challenge exacerbates these issues
  – Very large shared memory nodes
  – Very expensive communications

                     2013          2017             2022
  System Perf.       20 PFlops     100-200 PFlops   1 EFlops
  Node Perf.         200 GFlops    400 GFlops       1-10 TFlops
  Concurrency        32            O(100)           O(1000)
  Interconnect BW    40 GB/s       100 GB/s         200-400 GB/s
  Memory             1 PB          5 PB             10 PB
  Nodes              100,000       500,000          O(Million)
  I/O                2 TB/s        10 TB/s          20 TB/s
  MTTI               Days          Days             O(1 Day)
  Power              10 MW         10 MW            20 MW

HECToR
• UK National HPC Service
• Cray XE6 system
  – Currently a 30-cabinet system
• Each node has
  – 2 x 16-core AMD Opterons (2.3 GHz Interlagos)
  – 32 GB memory
• 90,112 cores
• Peak of over 800 TF and 90 TB of memory

HECToR usage statistics
Phase 3 statistics (Nov 2011 - Apr 2013)
• Ab initio codes (VASP, CP2K, CASTEP, ONETEP, NWChem, Quantum Espresso, GAMESS-US, SIESTA, GAMESS-UK, MOLPRO)
[Pie chart of usage by code: VASP 19%, CP2K 8%, GROMACS 6%, DL_POLY 5%, CASTEP 4%, MITgcm 4%, UM 4%, ChemShell 3%, SENGA 2%, GS2 2%, NEMO 2%, Others 34%]

HECToR usage statistics
Phase 3 statistics (Nov 2011 - Apr 2013)
• 35% of the Chemistry on HECToR is using DFT methods
[Pie chart: DFT 35%, Classical MD 14%, Other Chemistry 3%]


HECToR usage statistics
[Chart: node hours vs number of parallel tasks (32 to 65,536) for VASP, CP2K, CASTEP, ONETEP, Quantum Espresso, SIESTA, MOLPRO, MITgcm, GAMESS-US, CPMD, GAMESS-UK and NWChem]

DFT Issues
• Pseudo-potentials
• 3D FFTs
• Diagonalisation
• Orthogonalisation

• Pseudo-potential plane-wave approach
  – Good coverage, no region bias
  – Coverage where it is not needed (cubic scaling)
  – Relies heavily on 3D FFTs
• CASTEP simulation of 64 atoms (Ti-Al-V):
  – 13.5% of execution time on 64 cores is spent performing MPI_alltoallv for the FFTs
  – Scaling to 128 cores causes MPI to dominate, using 50% of the execution time
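The MPI_alltoallv cost quoted above comes from the global transposes needed when the 3D grid is distributed across processes and the FFT must then be applied along the distributed direction. Below is a minimal sketch of that communication pattern (slab decomposition, N divisible by the process count, no actual FFT calls); it is illustrative only and is not taken from CASTEP.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Sketch of the "transpose" step behind a slab-decomposed parallel 3D
 * FFT.  Each rank owns N/P planes of an N*N*N grid along z; to
 * transform along z the data must be redistributed so each rank owns
 * N/P planes along y instead.  The blocks here happen to be equal
 * sized, but real codes use MPI_Alltoallv because N is rarely
 * divisible by the process count.  Assumes N is divisible by P.       */
#define N 128

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int nz = N / nproc;             /* local planes along z (before) */
    const int ny = N / nproc;             /* local planes along y (after)  */
    const size_t local_size = (size_t)nz * N * N;

    /* local[z][y][x] : slab of nz planes, filled with dummy data      */
    double *local   = malloc(local_size * sizeof(double));
    double *sendbuf = malloc(local_size * sizeof(double));
    double *recvbuf = malloc(local_size * sizeof(double));
    double *transp  = malloc(local_size * sizeof(double)); /* [y][z][x] */
    for (size_t i = 0; i < local_size; i++)
        local[i] = rank + 1e-3 * (double)i;

    int *scounts = malloc(nproc * sizeof(int));
    int *sdispls = malloc(nproc * sizeof(int));
    for (int p = 0; p < nproc; p++) {
        scounts[p] = nz * ny * N;         /* elements going to rank p  */
        sdispls[p] = p * nz * ny * N;
    }

    /* Pack: the chunk for rank p holds y-planes p*ny .. p*ny+ny-1.    */
    for (int p = 0; p < nproc; p++)
        for (int z = 0; z < nz; z++)
            for (int y = 0; y < ny; y++)
                for (int x = 0; x < N; x++)
                    sendbuf[sdispls[p] + ((size_t)z * ny + y) * N + x] =
                        local[((size_t)z * N + (p * ny + y)) * N + x];

    /* The collective that dominates CASTEP's profile at scale.        */
    MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE,
                  recvbuf, scounts, sdispls, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Unpack: the chunk from rank p supplies z-planes p*nz..p*nz+nz-1. */
    for (int p = 0; p < nproc; p++)
        for (int z = 0; z < nz; z++)
            for (int y = 0; y < ny; y++)
                for (int x = 0; x < N; x++)
                    transp[((size_t)y * N + (p * nz + z)) * N + x] =
                        recvbuf[sdispls[p] + ((size_t)z * ny + y) * N + x];

    if (rank == 0)
        printf("transposed %d^3 grid across %d ranks\n", N, nproc);

    free(local); free(sendbuf); free(recvbuf); free(transp);
    free(scounts); free(sdispls);
    MPI_Finalize();
    return 0;
}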

[Figure: scaling of N=128^3 3D FFTs - time (s) vs core count (up to ~10,000 cores) for 8, 16 and 32 cores per node]
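The figure above varies the number of cores used per node. One related optimisation, raised later in these slides under "Hybrid parallelisation", is to run fewer MPI processes per node (so the FFT all-to-alls involve fewer processes) and fill the remaining cores with OpenMP threads. A minimal, generic sketch of that structure, assuming an MPI library with MPI_THREAD_FUNNELED support; it is not taken from any of the DFT codes.

#include <stdio.h>
#include <omp.h>
#include <mpi.h>

/* Hybrid MPI+OpenMP skeleton: e.g. 1 MPI rank per node (or socket)
 * and OpenMP threads across the remaining cores, so collectives such
 * as the FFT all-to-alls involve far fewer processes.
 * Compile:  mpicc -fopenmp hybrid.c -o hybrid
 * Run e.g.: OMP_NUM_THREADS=16 mpirun -np 4 ./hybrid                  */
int main(int argc, char **argv)
{
    int provided, rank, nproc;

    /* FUNNELED: only the master thread makes MPI calls.               */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    double local_sum = 0.0;

    /* Threads share the rank's memory: no messages inside the node.   */
    #pragma omp parallel reduction(+:local_sum)
    {
        int nthreads = omp_get_num_threads();
        local_sum += 1.0 / nthreads;  /* stand-in for node-local work   */
    }

    /* Only nproc processes (not nproc * nthreads) take part here.     */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.3f over %d ranks\n", global_sum, nproc);

    MPI_Finalize();
    return 0;
}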

[Figure: scaling of N=1024^3 3D FFTs - time (s) vs core count for 8, 16 and 32 cores per node]

3 x 1D FFT and MPI, N=128^3
[Figure: performance of the 3 x 1D FFT plus MPI decomposition]

GPU 1D FFT
[Figure: GPU 1D FFT performance]
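The GPU FFT slides here, and the cu-FFT benchmark that follows, separate host-device transfer time from the transform itself. Below is a host-side sketch of that kind of measurement for a 3D transform, assuming a CUDA toolkit with cuFFT (built with nvcc and linked against -lcufft); the grid size and details are illustrative and are not the benchmark code behind the plots.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cufft.h>

/* Time a single-precision complex 3D FFT on the GPU, separating the
 * host<->device transfer time from the transform itself.
 * Build: nvcc fft3d.c -lcufft -o fft3d                                 */
int main(void)
{
    const int n = 128;                       /* n^3 grid, illustrative  */
    const size_t count = (size_t)n * n * n;
    const size_t bytes = count * sizeof(cufftComplex);

    cufftComplex *h_data = (cufftComplex *)malloc(bytes);
    for (size_t i = 0; i < count; i++) {     /* dummy input data        */
        h_data[i].x = (float)(i % 7);
        h_data[i].y = 0.0f;
    }

    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, bytes);

    cufftHandle plan;
    cufftPlan3d(&plan, n, n, n, CUFFT_C2C);  /* plan a C2C 3D FFT       */

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0, 0);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  /* in-place     */
    cudaEventRecord(t2, 0);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t3, 0);
    cudaEventSynchronize(t3);

    float t_in, t_fft, t_out;
    cudaEventElapsedTime(&t_in,  t0, t1);    /* milliseconds            */
    cudaEventElapsedTime(&t_fft, t1, t2);
    cudaEventElapsedTime(&t_out, t2, t3);
    printf("copy in %.3f ms, FFT %.3f ms, copy out %.3f ms\n",
           t_in, t_fft, t_out);

    cufftDestroy(plan);
    cudaFree(d_data);
    free(h_data);
    return 0;
}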

[Figure: GPGPU 3D FFT using cu-FFT - time (s) vs FFT length, showing total execution time, transfer time, and FFT time excluding transfers]

Optimisations
• Hybrid parallelisation
• Reduce FFT processes
• Work by Berkeley Labs
[Figure: Berkeley Labs results]
[Figure: Berkeley Labs results, FFT part]

Optimisations - linear scaling
• Localised basis sets
  – Not clear this is beneficial for metals
• Real-space-grid algorithms (high-order finite difference method to calculate derivatives such as the kinetic-energy operator, plus localised wave functions)
  – A small finite-difference sketch follows the Conclusions below

Conclusions
• DFT is generally performed through packages
  – This means the optimisations should be done for you
• Accelerators and complex hardware are becoming more common
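To illustrate the real-space-grid point above: on a grid, the kinetic-energy operator becomes a high-order finite-difference stencil applied to wave-function values. A minimal 1D sketch using a 4th-order central-difference second derivative follows; the grid spacing, coefficients and Gaussian test function are illustrative assumptions, not code from any real-space DFT package.

#include <stdio.h>
#include <math.h>

/* Real-space-grid illustration: apply the kinetic-energy operator
 * T = -1/2 d^2/dx^2 (atomic units) to a wave function sampled on a
 * uniform 1D grid, using a 4th-order central finite-difference stencil
 *   f''(x) ~ (-f[i-2] + 16 f[i-1] - 30 f[i] + 16 f[i+1] - f[i+2]) / (12 h^2)
 * Grid size, spacing and the Gaussian test function are illustrative. */
#define NPTS 401

int main(void)
{
    const double h = 0.05;               /* grid spacing (bohr)         */
    double psi[NPTS], tpsi[NPTS];

    /* Sample a Gaussian, orbital-like function centred on the grid.   */
    for (int i = 0; i < NPTS; i++) {
        double x = (i - NPTS / 2) * h;
        psi[i] = exp(-0.5 * x * x);
    }

    /* Apply the stencil away from the boundaries.                     */
    for (int i = 2; i < NPTS - 2; i++) {
        double d2 = (-psi[i - 2] + 16.0 * psi[i - 1] - 30.0 * psi[i]
                     + 16.0 * psi[i + 1] - psi[i + 2]) / (12.0 * h * h);
        tpsi[i] = -0.5 * d2;             /* kinetic-energy operator     */
    }

    /* For this Gaussian, -1/2 psi'' = 1/2 (1 - x^2) psi: check at x=0. */
    int mid = NPTS / 2;
    printf("T psi at x=0: %.6f (analytic 0.5)\n", tpsi[mid]);
    return 0;
}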