First principles modeling with Octopus: massive parallelization towards petaflop computing and more
A. Castro, J. Alberdi and A. Rubio

Outline
~ Theoretical Spectroscopy
~ The octopus code
~ Parallelization
Theoretical Spectroscopy
Electronic excitations:
~ Optical absorption
~ Photoemission
~ Electron energy loss
~ Inverse photoemission
~ Inelastic X-ray scattering
~ …
Goal: a first-principles (from the electronic structure) theoretical description of the various spectroscopies (“theoretical beamlines”).
Role: interpretation of (complex) experimental findings.
Example: theoretical atomistic structures, and the corresponding TEM images.
The European Theoretical Spectroscopy Facility (ETSF)
~ Networking
~ Integration of tools (formalism, software)
~ Maintenance of tools
~ Support, service, training
The octopus code is a member of a family of free software codes developed, to a large extent, within the ETSF:
~ abinit
~ octopus
~ dp
The octopus code
Targets:
~ Optical absorption spectra of molecules, clusters, nanostructures, solids.
~ Response to lasers (non-perturbative response to high-intensity fields).
~ Dichroic spectra, and other mixed (electric-magnetic) responses.
~ Adiabatic and non-adiabatic molecular dynamics (for, e.g., infrared and vibrational spectra, or photochemical reactions).
~ Quantum Optimal Control Theory for molecular processes.
Physical approximations and techniques:
~ Density-functional theory and time-dependent density-functional theory to describe the electronic structure.
  • A comprehensive set of functionals through the libxc library.
~ Mixed quantum-classical systems.
~ Both real-time and frequency-domain response (“Casida” and “Sternheimer” formulations).
Numerics:
~ Basic representation: real-space grid.
~ Usually regular and rectangular, occasionally curvilinear.
~ Plane waves for some procedures (especially for periodic systems).
~ Atomic orbitals for some procedures.
Derivative at a point: a weighted sum over neighboring points. The coefficients C_ij depend on the points used: the stencil. More points -> more precision. It is a semi-local operation.
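As an illustration, a minimal sketch of a one-dimensional finite-difference Laplacian with a 3-point stencil (the routine name, the grid spacing h and the stencil order are illustrative, not taken from the Octopus sources):

    #include <stddef.h>

    /* Apply a 3-point finite-difference Laplacian to f on a 1D grid with
     * spacing h; the result goes into lap.  Boundary points are skipped. */
    void laplacian_1d(const double *f, double *lap, size_t n, double h)
    {
        const double c = 1.0 / (h * h);          /* stencil coefficient */
        for (size_t i = 1; i + 1 < n; i++)
            lap[i] = c * (f[i - 1] - 2.0 * f[i] + f[i + 1]);
    }

Higher-order stencils simply extend the sum to more neighbors, with the corresponding coefficients C_ij.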
The key equations:
~ Ground-state DFT: the Kohn-Sham equations.
~ Time-dependent DFT: the time-dependent Kohn-Sham equations.
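The equations themselves were shown as images on the slides; in atomic units a standard form (not copied from the original) reads

\[
\Big[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{ext}}(\mathbf r) + v_{\mathrm H}[n](\mathbf r) + v_{\mathrm{xc}}[n](\mathbf r)\Big]\,\varphi_i(\mathbf r) = \varepsilon_i\,\varphi_i(\mathbf r),
\]
\[
i\,\frac{\partial}{\partial t}\,\varphi_i(\mathbf r,t) = \Big[-\tfrac{1}{2}\nabla^2 + v_{\mathrm{ext}}(\mathbf r,t) + v_{\mathrm H}[n](\mathbf r,t) + v_{\mathrm{xc}}[n](\mathbf r,t)\Big]\,\varphi_i(\mathbf r,t),
\qquad
n(\mathbf r,t) = \sum_i |\varphi_i(\mathbf r,t)|^2 .
\]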
Key numerical operations:
~ Linear systems with sparse matrices.
~ Eigenvalue problems with sparse matrices.
~ Non-linear eigenvalue problems.
~ Propagation of “Schrödinger-like” equations (see the propagator sketch below).
~ The dimension can go up to 10 million points.
~ The storage needs can go up to 10 GB.
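One common way to propagate such equations in real-time TDDFT (given here as a hedged example; the slides do not specify the propagator) is the exponential midpoint rule,

\[
\varphi_i(t+\Delta t) \;\approx\; \exp\!\big(-\,i\,\hat H_{\mathrm{KS}}(t+\Delta t/2)\,\Delta t\big)\,\varphi_i(t),
\]

with the exponential applied to the state through, e.g., a truncated Taylor or Krylov expansion, so that only sparse Hamiltonian-vector products are needed.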
Use of libraries:
~ BLAS, LAPACK
~ GNU GSL mathematical library
~ FFTW
~ NetCDF
~ ETSF input/output library
~ libxc exchange and correlation library
~ Other optional libraries
www.tddft.org/programs/octopus/
Parallelization
Objective
~ Reach petaflop computing with a scientific code.
~ Simulate the absorption of light by chlorophyll in photosynthesis.
Simulation objectives
~ Photovoltaic materials
~ Biomolecules
The Octopus code: http://www.tddft.org/programs/octopus/
~ Software package for electron dynamics
~ Developed at the UPV/EHU
~ Ground-state and excited-state properties
~ Real-time, Casida and Sternheimer TDDFT
~ Quantum transport and optimal control
~ Free software: GPL license
Octopus simulation strategy
~ Pseudopotential approximation
~ Real-space grids
~ Main operation: the finite-difference Laplacian
Libraries
Intensive use of libraries.
General libraries:
~ BLAS
~ LAPACK
~ FFT
Specific libraries:
~ libxc
~ ETSF_IO
~ Zoltan/METIS
~ ...
Multilevel parallelization
~ MPI: Kohn-Sham states
~ MPI: real-space domains
~ In node (CPU): OpenMP threads
~ In node (GPU): OpenCL tasks / vectorization
Target systems: a massive number of execution units
~ Multicore processors with vector FPUs
~ IBM Blue Gene architecture
~ Graphics processing units
High-level parallelization
MPI parallelization
Parallelization by states/orbitals
~ Assign each processor a group of states (see the sketch below).
~ Time propagation is independent for each state.
~ Little communication required.
~ Limited by the number of states in the system.
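A minimal sketch of such a distribution of states over MPI ranks (the block layout and names are illustrative, not Octopus's actual scheme):

    #include <mpi.h>

    /* Give each rank a contiguous block of the nst Kohn-Sham states. */
    void my_state_range(int nst, MPI_Comm comm, int *first, int *last)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int base = nst / size, rem = nst % size;   /* spread the remainder */
        *first = rank * base + (rank < rem ? rank : rem);
        *last  = *first + base + (rank < rem ? 1 : 0) - 1;
        /* Each rank propagates only states first..last; since the time
         * propagation of each state is independent, communication is
         * mostly limited to reductions needed to build the density. */
    }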
Domain parallelization
~ Assign each processor a set of grid points.
~ Partition libraries: Zoltan or METIS.
Main operations in domain parallelization
~ Laplacian: copy of boundary points between neighboring domains; computation on the local domain can be overlapped with the communication.
~ Integration: global sums (reductions); several reductions are grouped into a single operation.
Both patterns are sketched after this list.
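A hedged sketch of the two communication patterns in C with MPI (the one-dimensional decomposition, ghost-point layout and names are illustrative only):

    #include <mpi.h>

    /* f[1..nloc] is the local domain; f[0] and f[nloc+1] are ghost points.
     * Exchange one boundary point with the left/right neighbours (pass
     * MPI_PROC_NULL at the ends), then integrate f over the whole grid. */
    double halo_and_integrate(double *f, int nloc, double h,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request req[4];
        MPI_Irecv(&f[0],        1, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(&f[nloc + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(&f[1],        1, MPI_DOUBLE, left,  1, comm, &req[2]);
        MPI_Isend(&f[nloc],     1, MPI_DOUBLE, right, 0, comm, &req[3]);

        /* Inner points could be processed here, overlapping computation
         * with the boundary communication. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        double local = 0.0, total = 0.0;
        for (int i = 1; i <= nloc; i++)
            local += f[i] * h;                 /* local piece of the integral */
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
        return total;                          /* global sum (reduction) */
    }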
Low-level parallelization and vectorization
Two approaches: OpenMP (CPU) and OpenCL (GPU); an OpenMP sketch follows the comparison.

OpenMP:
~ Thread programming based on compiler directives
~ In-node parallelization
~ Little memory overhead compared to MPI
~ Scaling limited by memory bandwidth
~ Multithreaded BLAS and LAPACK

OpenCL:
~ Hundreds of execution units
~ High memory bandwidth, but with long latency
~ Behaves like a vector processor (length > 16)
~ Separate memory: copy from/to main memory
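For the OpenMP side, a minimal sketch of directive-based threading of a grid operation (the routine and array names are illustrative):

    /* Thread the Laplacian over grid points with a single compiler
     * directive; each thread works on its own chunk of the grid. */
    void laplacian_omp(const double *f, double *lap, int n, double h)
    {
        const double c = 1.0 / (h * h);
        #pragma omp parallel for
        for (int i = 1; i < n - 1; i++)
            lap[i] = c * (f[i - 1] - 2.0 * f[i] + f[i + 1]);
    }

Since all threads read and write the same arrays in shared memory, no extra copies are needed, which is what keeps the memory overhead low compared to MPI.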
Supercomputers
~ Corvo cluster: x86_64
~ VARGAS (at IDRIS): Power6, 67 teraflops
~ MareNostrum: PowerPC 970, 94 teraflops
~ Jugene: 1 petaflops
Test results
Laplacian operator
Comparison of the performance of the finite-difference Laplacian operator:
~ CPU uses 4 threads
~ GPU is 4 times faster
~ Cache effects are visible
Time propagation
Comparison of the performance for a time propagation (fullerene molecule):
~ The GPU is 3 times faster
~ Limited by copying and non-GPU code
Multilevel parallelization
~ Chlorophyll molecule: 650 atoms
~ Jugene (Blue Gene/P)
~ Sustained throughput: > 6.5 teraflops
~ Peak throughput: 55 teraflops
Scaling
Comparison of two atomic systems on Jugene.
Target system
Jugene, all nodes:
~ 294 912 processor cores = 73 728 nodes
~ Maximum theoretical performance of 1002 teraflops
Chlorophyll system of 5879 atoms:
~ The complete molecule from spinach
Test systems
Smaller molecules:
~ 180 atoms
~ 441 atoms
~ 650 atoms
~ 1365 atoms
Partition of machines:
~ Jugene and Corvo
Profiling
~ Profiled within the code
~ Profiled with the Paraver tool (www.bsc.es/paraver)
One TD iteration
Trace of one time-dependent iteration:
~ Several “inner” iterations, each with non-blocking point-to-point communication (Irecv, Isend, Wait)
~ Poisson solver: collective communication (Allgather, 2 x Alltoall, Allgather, Scatter)

Improvements
Memory improvements in the ground state:
~ Split the memory among the nodes
~ Use of ScaLAPACK
Improvements in the Poisson solver for TD:
~ Pipelined execution: run the Poisson solver while the propagation continues with an approximation
~ Use of new algorithms such as the FMM (fast multipole method)
~ Use of parallel FFTs
Conclusions
~ The Kohn-Sham scheme is inherently parallel.
~ This can be exploited for parallelization and vectorization.
~ Well suited to current and future computer architectures.
~ Theoretical improvements for large-system modeling.