Algorithms and Software Infrastructure for High-Performance Electronic Structure Based Simulations by Wenzhe Yu
Department of Mechanical Engineering and Materials Science
Duke University

Approved:
Volker Blum, Advisor
Olivier Delaire
Jianfeng Lu
Stefan Zauscher

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Mechanical Engineering and Materials Science in the Graduate School of Duke University

2020

Copyright © 2020 by Wenzhe Yu
All rights reserved

Abstract

Computer simulations based on electronic structure theory, particularly Kohn-Sham density-functional theory (KS-DFT), are facilitating scientific discoveries across a broad range of disciplines such as chemistry, physics, and materials science. The system size tractable with KS-DFT is often limited by an algebraic eigenproblem, the computational cost of which scales cubically with the problem size. There have been continuous efforts to improve the performance of eigensolvers and to develop alternative algorithms that bypass the explicit solution of the eigenproblem. As the number of algorithms grows, it becomes increasingly difficult to assess their relative computational cost and to implement them efficiently in electronic structure codes.

The research in this dissertation explores the feasibility of integrating different electronic structure algorithms into a single framework, combining their strengths, assessing their accuracy and computational cost relative to each other, and understanding their scope of applicability and optimal use regimes. This research has led to an open-source software infrastructure, ELSI, which provides the electronic structure community with access to a variety of high-performance solver libraries through a unified software interface. ELSI supports and enhances conventional cubic scaling eigensolvers, linear scaling density-matrix-based algorithms, and other reduced scaling methods in between, with reasonable default parameters for each of them. The flexible matrix formats and parallelization strategies adopted in ELSI fit the needs of most, if not all, electronic structure codes. ELSI has been connected to four electronic structure code projects, allowing us to rigorously benchmark the performance of the solvers on an equal footing. Based on the results of a comprehensive set of benchmarks, we identify factors that strongly affect the efficiency of the solvers and regimes where conventional cubic scaling eigensolvers are outperformed by lower scaling algorithms. We propose an automatic decision layer that assists with the algorithm selection process. The ELSI infrastructure is stimulating the optimization of existing algorithms and the development of new ones.

Following the worldwide trend of employing graphical processing units (GPUs) in high-performance computing, we have developed and optimized GPU acceleration in the two-stage tridiagonalization eigensolver ELPA2, targeting distributed-memory, hybrid CPU-GPU architectures. A significant performance boost over the CPU-only version of ELPA2 is achieved, as demonstrated in routine KS-DFT simulations comprising thousands of atoms, for which a few GPU-equipped supercomputer nodes reach the throughput of several tens of conventional CPU supercomputer nodes. The GPU-accelerated ELPA2 solver can be used through the ELSI interface, smoothly and transparently bringing GPU support to all the electronic structure codes connected to ELSI. To reduce the computational cost for systems containing heavy elements, we propose a frozen core approximation with proper orthonormalization of the wavefunctions. This method is tolerant of errors due to the finite precision of numerical integrations in electronic structure codes. A considerable saving in computational cost is achieved, with the electron density, energies, and forces all matching the accuracy of all-electron calculations.

This research shows that by integrating a broad range of electronic structure algorithms into one infrastructure, new algorithmic developments and optimizations can take place at a faster pace. The outcome is open and beneficial to the entire electronic structure community, instead of being restricted to one particular code project. The ELSI infrastructure has already been utilized to accelerate large-scale electronic structure simulations, some of which were not feasible before.
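For reference, the algebraic eigenproblem referred to in the abstract is the generalized Kohn-Sham eigenvalue problem that arises when the Kohn-Sham orbitals are expanded in a non-orthogonal basis set. The following is a minimal sketch in common notation; the symbols below are not defined in this front matter, and the formalism itself is introduced in Chapter 2:

\[
\psi_l(\mathbf{r}) = \sum_{j=1}^{N_{\mathrm{basis}}} c_{jl}\,\phi_j(\mathbf{r})
\quad\Longrightarrow\quad
\mathbf{H}\,\mathbf{c}_l = \epsilon_l\,\mathbf{S}\,\mathbf{c}_l ,
\qquad
H_{ij} = \langle \phi_i \,|\, \hat{h}^{\mathrm{KS}} \,|\, \phi_j \rangle ,
\quad
S_{ij} = \langle \phi_i \,|\, \phi_j \rangle .
\]

Solving this dense generalized eigenproblem by direct diagonalization costs \(O(N_{\mathrm{basis}}^{3})\) operations. The density-matrix-based solvers mentioned above instead target the density matrix

\[
P_{ij} = \sum_l f_l\, c_{il}\, c_{jl}^{*}
\]

(with occupation numbers \(f_l\)) without computing individual eigenpairs, which is what opens the door to the reduced and linear scaling algorithms benchmarked in this dissertation.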
Contents

Abstract
List of Figures
List of Tables
Acknowledgments

1 Introduction
1.1 Emergence of Software Libraries in Electronic Structure Theory
1.2 Cubic Scaling Wall in Kohn-Sham Density-Functional Theory
1.3 Overview of This Dissertation

2 Theoretical Background
2.1 Born-Oppenheimer Approximation
2.2 Kohn-Sham Density-Functional Theory
2.3 Practical Realization of KS-DFT
2.3.1 Basis Sets
2.3.2 Periodic Boundary Conditions
2.3.3 Self-Consistent Field
2.3.4 Computational Cost
2.4 Eigensolvers (Diagonalization)
2.4.1 Textbook Algorithm: One-Stage Tridiagonalization
2.4.2 ELPA: Two-Stage Tridiagonalization
2.4.3 EigenExa: Penta-Diagonalization
2.4.4 SLEPc-SIPs: Shift-and-Invert Transformation
2.5 Density Matrix Solvers
2.5.1 NTPoly: Density Matrix Purification
2.5.2 libOMM: Orbital Minimization Method
2.5.3 PEXSI: Pole Expansion and Selected Inversion

3 Software Infrastructure for Scalable Solutions of the Kohn-Sham Problem: ELSI
3.1 Overview of the ELSI Software Infrastructure
3.2 Programming Language
3.3 Distributed Matrix Storage
3.3.1 Distribution Scheme
3.3.2 Local Storage Format
3.4 Parallelization
3.5 Implementation of the ELSI Interface
3.5.1 Initialization, Reinitialization, and Finalization
3.5.2 Solving Eigenvalues and Eigenvectors
3.5.3 Computing Density Matrices
3.5.4 Customizing the ELSI Interface and the Solvers
3.5.5 Extrapolation of Wavefunctions and Density Matrices
3.5.6 Parallel Matrix I/O
3.6 Integration Into the Broader HPC Ecosystem
3.7 Summary of Algorithmic Features

4 Comparative Benchmark of Electronic Structure Solvers
4.1 Performance Benchmark With Varying System Size and Fixed Processor Count
4.1.1 Benchmark Set I: Carbon Allotropes
4.1.2 Benchmark Set II: Heavy Elements
4.1.3 Benchmark Set III: 3D Structures
4.2 Performance Benchmark with Varying Processor Count and Fixed System Size
4.3 Automatic Solver Selection
4.4 Parallel Solution for Periodic Systems
4.4.1 MULTI_PROC Parallelization Mode
4.4.2 SINGLE_PROC Parallelization Mode
4.5 Summary of Findings

5 GPU-Accelerated Two-Stage Dense Eigensolver
5.1 GPU Acceleration of ELPA2
5.1.1 GPU Offloading via cuBLAS
5.1.2 CUDA Kernel of Parallel Householder Transformations
5.2 Performance and Scalability
5.2.1 Overall Performance
5.2.2 Performance of Individual Computational Steps
5.2.3 Runtime Parameters
5.3 Application in KS-DFT Calculations

6 Normalized Frozen Core Approximation for All Electron Density-Functional Theory
6.1 Normalized Frozen Core Approximation
6.2 Accuracy Analysis
6.3 Performance Benchmark

7 Conclusions

Bibliography

List of Figures

1.1 The modular paradigm in electronic structure coding.
1.2 Simulation time of graphene supercells as a function of the number of atoms.
2.1 Key computational steps of Kohn-Sham density-functional theory (KS-DFT).
2.2 Computational steps of the two-stage tridiagonalization approach.
3.1 Schematic visualizations of 2D block-cyclic distribution, 1D block-cyclic distribution, and arbitrary distribution.
3.2 A 5 × 5 matrix stored in the dense format, the COO sparse format, and the CSC sparse format.
3.3 Parallel calculation of spin-polarized and periodic systems in ELSI.
3.4 Dependency graph of the ELSI interface.
4.1 Atomic structures of 1D carbon nanotube (CNT), 2D graphene, and 3D graphite.
4.2 Performance of key steps in ELPA, libOMM, and PEXSI for carbon nanotube models, graphene models, and graphite models.
4.3 Performance of key steps in ELPA and PEXSI for carbon nanotube models, graphene models, and graphite models.
4.4 Atomic structures of 1D Ge nanotube, 2D MoS2 monolayer, and 3D Cu2BaSnS4.
4.5 Performance of key steps in ELPA and PEXSI for Ge nanotube models, MoS2 monolayer models, and Cu2BaSnS4 models.
4.6 Atomic structures of water and silicon.
4.7 Performance of key steps in ELPA, PEXSI, and NTPoly for water models and silicon models.
4.8 Performance of key steps in ELPA, PEXSI, and NTPoly for the 41,472-atom water model and the 31,250-atom silicon model.
4.9 Performance comparison of two parallelization strategies in KS-DFT calculations of an 864-atom graphite model.
5.1 Visualization of the sweeps in the fourth stage of the bulge chasing procedure.
5.2 Visualization of the Householder vectors v(i,j) and eigenvectors X involved in the tridiagonal-to-banded back-transformation.
5.3 Workflow of the Householder transformation CUDA kernel.
5.4 Timings of CPU-ELPA1, CPU-ELPA2, GPU-ELPA1, and GPU-ELPA2 for randomly generated matrices.
5.5 Timings of CPU-ELPA2, GPU-ELPA1, and GPU-ELPA2 for randomly generated matrices.
5.6 Timings of GPU-ELPA1 and GPU-ELPA2 as a function of the number of eigenvectors computed, for randomly generated matrices.
5.7 Timing decomposition of CPU-ELPA2 and GPU-ELPA2 for randomly generated matrices.