WWU Münster

Hardware-Oriented Krylov Methods for High-Performance Computing

PhD Thesis
arXiv:2104.02494v1 [math.NA] 6 Apr 2021

Nils-Arne Dreier – 2020 –

WWU Münster

Fach: Mathematik

Hardware-Oriented Krylov Methods for High-Performance Computing

Inaugural Dissertation zur Erlangung des Doktorgrades der Naturwissenschaften

– Dr. rer. nat. –

im Fachbereich Mathematik und Informatik der Mathematisch-Naturwissenschaftlichen Fakultät der Westfälischen Wilhelms-Universität Münster

eingereicht von
Nils-Arne Dreier
aus Bünde
– 2020 –

Dekan: Prof. Dr. Xiaoyi Jiang, Westfälische Wilhelms-Universität Münster, Münster, DE

Erster Gutachter: Prof. Dr. Christian Engwer, Westfälische Wilhelms-Universität Münster, Münster, DE

Zweiter Gutachter: Laura Grigori, PhD, INRIA Paris, Paris, FR

Tag der mündlichen Prüfung: 08.03.2021

Tag der Promotion: 08.03.2021

Abstract

Krylov subspace methods are an essential building block in numerical simulation software. The efficient utilization of modern hardware is a challenging problem in the development of these methods. In this work, we develop methods to solve linear systems with multiple right-hand sides, tailored to modern hardware in high-performance computing. To this end, we analyze an innovative block Krylov subspace framework that allows us to balance the computational and data-transfer costs to the hardware. Based on this framework, we formulate commonly used Krylov methods. For the CG and BiCGStab methods, we introduce a novel stabilization approach as an alternative to a deflation strategy. This helps us to retain the block size, thus leading to a simpler and more efficient implementation. In addition, we optimize the methods further for distributed memory systems and the communication overhead. For the CG method, we analyze approaches to overlap communication and computation and present multiple variants of the CG method, which differ in their communication properties. Furthermore, we present optimizations of the orthogonalization procedure in the GMRes method. Besides introducing a pipelined Gram-Schmidt variant that overlaps the global communication with the computation of inner products, we present a novel orthonormalization method based on the TSQR algorithm, which is communication-optimal and stable. For all optimized methods, we present tests that show their superiority in a distributed setting.

Zusammenfassung

Krylovraummethoden stellen einen essentiellen Bestandteil numerischer Simulationssoftware dar. Die effiziente Nutzung moderner Hardware ist ein herausforderndes Problem bei der Entwicklung solcher Methoden. Gegenstand dieser Dissertation ist die Formulierung von Krylovraumverfahren zur Lösung von linearen Gleichungssystemen mit mehreren rechten Seiten, welche die Eigenschaften moderner Hardware berücksichtigen. Dazu untersuchen wir ein innovatives Blockkrylovraum-Framework, welches es ermöglicht, die Berechnungs- und Datentransferkosten der Blockkrylovraummethode an die Hardware anzupassen. Darauf aufbauend formulieren wir mehrere Krylovraummethoden. Für die CG- und BiCGStab-Methoden führen wir eine neuartige Stabilisierungsstrategie ein, die es ermöglicht, die Spaltenanzahl des Residuums beizubehalten. Diese ersetzt die bekannten Deflationsstrategien und ermöglicht eine einfachere und effizientere Implementierung der Methoden. Des Weiteren optimieren wir die Methoden bezüglich der Kommunikation auf Systemen mit verteiltem Speicher. Für die CG-Methode untersuchen wir Strategien, um die Kommunikation mit Berechnungen zu überlappen. Dazu stellen wir mehrere Varianten des Algorithmus vor, welche sich durch ihre Kommunikationseigenschaften unterscheiden. Außerdem werden für die GMRes-Methode optimierte Varianten der Orthonormalisierung entwickelt. Neben einem Gram-Schmidt-Verfahren, welches Berechnungen und Kommunikation überlappt, präsentieren wir eine neue Methode, welche auf dem TSQR-Algorithmus aufbaut und Stabilität sowie geringe Kommunikationskosten vereint. Für alle optimierten Varianten zeigen wir numerische Tests, welche die Verbesserungen auf Systemen mit verteiltem Speicher demonstrieren.

Acknowledgments

I would like to express my deep gratitude to all people who have supported me over the last years. First, I thank Prof. Dr. Christian Engwer for giving me the opportunity to work on this topic, all the creative discussions, motivation, great guidance and for being an excellent supervisor. I thank all my colleagues in our workgroup for the pleasant atmosphere; in particular, I thank Liesel Sommer and Marcel Koch for proof-reading this thesis and giving useful hints for improvements. Furthermore, I thank Prof. Dr. Robert Klöfkorn for giving me the opportunity to work for a few weeks in Bergen, collecting valuable experience and enjoying the Norwegian nature. All implementations of algorithms in this thesis are based on the Dune software framework; thus, I thank all the developers of Dune for making this project happen. Diese Arbeit wäre ohne die bedingungslose Unterstützung meiner Eltern Kirsten und Eckhard nicht möglich gewesen. Danke für all die finanzielle und moralische Unterstützung während meiner gesamten Studienzeit. Der größte Dank gilt meiner Frau Eileen. Danke dafür, dass es dich in meinem Leben gibt, und für all den Rückhalt, die Unterstützung und Liebe, die es mir sehr erleichtert haben, diese Arbeit zu verfassen.

Contents

1 Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Contributions and Outline

2 A Brief Introduction to Krylov Methods

Part 1 Block Krylov Methods

3 A General Block Krylov Framework
  3.1 Block Krylov Spaces
  3.2 Implementation
  3.3 Performance Analysis
  3.4 Numerical Experiments

4 Block Conjugate Gradients Method
  4.1 Formulation of the Block Conjugate Gradients Method
  4.2 Convergence
  4.3 Residual Re-Orthonormalization
  4.4 Numerical Experiments

5 Block GMRes Method
  5.1 Formulation of the Block GMRes Method
  5.2 Convergence
  5.3 Numerical Experiments

6 Block BiCGStab Method
  6.1 Formulation of the Block BiCGStab Method
  6.2 Residual Re-Orthonormalization
  6.3 Numerical Experiments

Part 2 Communication-aware Block Krylov Methods

7 Challenges on Large Scale Parallel Systems
  7.1 Implementation
  7.2 Collective Communication
  7.3 The TSQR Algorithm

8 Pipelined Block CG Method
  8.1 Fusing Inner Block Products
  8.2 Overlap Computation and Communication
  8.3 Numerical Experiments

9 Communication-optimized Block GMRes Method
  9.1 Classical Gram-Schmidt with Re-Orthogonalization
  9.2 Pipelined Gram-Schmidt
  9.3 Localized Arnoldi
  9.4 Numerical Experiments

10 Pipelined Block BiCGStab Method
  10.1 Numerical Experiments

11 Summary and Outlook

Bibliography

List of Figures

1.1 Performance development of the top 500 supercomputers
3.1 Schematic representation of different *-subalgebras
3.2 Microbenchmarks for kernels BOP, BDOT and BAXPY executed on one core
3.3 Microbenchmarks for kernels BOP, BDOT and BAXPY executed on 20 cores
4.1 Frobenius norm of the residual vs. iterations of BCG methods
4.2 Convergence of the BCG method for different re-orthonormalization parameters
5.1 Convergence of the block GMRes method
7.1 Collective communication benchmark: I_MPI_ASYNC_PROGRESS on vs. off
7.2 Time per global reduction vs. peak performance of the hypothetical machine
8.1 Schematic representation of the program flow in the different BCG variants
8.2 Strong scaling of the time per iteration for different CG variants
9.1 Schematic flow diagram of the pipelined Gram-Schmidt orthogonalization (Algorithm 9.3) for r = 3
9.2 Reduction operation of the localized Arnoldi method
9.3 Back-propagation operation of the localized Arnoldi method
9.4 Speedup of runtime for different BGMRes variants compared to the modified method
10.1 Speedup of runtime for different BiCGStab variants

List of Tables

2.1 Overview of commonly used Krylov methods, their properties and references
3.1 Performance-relevant characteristics for the BOP, BDOT and BAXPY kernels
4.1 Iteration counts and numbers of residual re-orthonormalizations for single and double precision
4.2 Iteration counts, runtime and number of re-orthonormalizations for the solution of several matrices from MatrixMarket with different p
6.1 Convergence results of the BBiCGStab method
6.2 Iterations of the BBiCGStab method with residual re-orthonormalization
8.1 Arithmetical and communication properties of different BCG algorithms
9.1 Comparison of the arithmetical complexity, number of messages and stability for different orthogonalization methods in the BGMRes method

List of Algorithms

4.1 Block Conjugate Gradients Method
4.2 BCG Method with Residual Re-Orthonormalization
4.3 BCG Method with Adaptive Residual Re-Orthonormalization
5.1 Block Arnoldi Method
5.2 Block GMRes Method
6.1 Block BiCG Method
6.2 Block BiCGStab Method
6.3 BBiCGStab with Adaptive Residual Re-Orthonormalization
8.1 BCG with Two Reductions (2R-BCG)
8.2 BCG with One Reduction (1R-BCG)
8.3 Gropp's BCG
8.4 Partially Pipelined BCG (PPBCG)
8.5 Ghysels' BCG
9.1 Modified Gram-Schmidt Orthogonalization
9.2 Classical (#it) Gram-Schmidt Orthogonalization
9.3 Pipelined Gram-Schmidt Orthogonalization
9.4 Reduction Process for the Localized Arnoldi Method
9.5 Back-Propagation for the Localized Arnoldi Method
9.6 Localized Arnoldi Method
10.1 Pipelined BBiCGStab with Adaptive Residual Re-Orthonormalization

List of Listings

3.1 Implementation of the operator+ of Dune::LoopSIMD
3.2 Setup of a linear operator type that operates on block vectors based on a sparse matrix
3.3 The Block interface used to represent elements in R^{s×s}
3.4 Implementation of the inner matrix-matrix products. Block vector rows are iterated in chunks of size ChunkSize to increase the arithmetical intensity
7.1 The concept of a Dune::Future
7.2 Example: How to use the Future interface of the ScalarProduct

1 Introduction

We can only see a short distance ahead, but we can see plenty there that needs to be done. Alan Turing

1.1 Motivation

In the last decades, High-Performance Computing (HPC) became an essential part of science, industry and our everyday life. Engineers use it to optimize the shape of cars and aircraft. Meteorologists use it to create the daily weather forecast. Physicists use it to simulate quantum mechanics, which helps to understand the elements of our universe. It is widely used to simulate the global climate on large supercomputers. Petroleum engineers design offshore platforms using HPC to make them more efficient. In medical research, scientists simulate an entire heart or brain to investigate the sources of strokes and heart attacks. Other fields of application include sociology, biology and astronomy. Even during the COVID-19 pandemic, HPC is used to search for medication that is effective in treating COVID-19 patients. In all these applications, HPC brings great improvements. As a consequence, the need for more and more computation power has grown enormously. Figure 1.1 shows the development of the performance of the fastest 500 supercomputers in the world. It shows that the available computation performance has increased by a factor of one million over the last 17 years. Until the early 2000s, the increase of performance was due to an increase of the frequency of the processors. Since then, the frequency has stagnated at approximately 2 GHz. Due to the higher power consumption and heat production at higher frequencies, it is not efficient to increase the frequency further. Hence, an increase of performance is only possible by an increase of parallelism.

Figure 1.1: Performance development of the top 500 supercomputers. Taken from https://top500.org [TOP500].

Another challenge in HPC is the power consumption of the overall system. Modern supercomputers consume power on the scale of megawatts; that is comparable to a whole offshore wind turbine. Furthermore, the fault tolerance of large systems is a problem as well. The more components are involved, the higher the probability that components fail during the computation. Due to all these challenges, it is very important to develop software that uses the hardware efficiently. Solving large sparse linear systems is a building block in many HPC codes that consumes large parts of the computation time. As direct solvers scale badly for large linear systems and consume far too much memory, iterative solvers are used on supercomputers to solve this kind of problem. Especially Krylov solvers have proved suitable for this task. For several reasons, Krylov solvers only utilize a fraction of the peak performance of supercomputers. This is shown in the HPCG benchmark list [TOP500]. For example, Fugaku, currently the fastest supercomputer in the world, only reaches 13.366 PFlop/s in the HPCG benchmark, whereas it reaches 415.53 PFlop/s in the LINPACK benchmark. This shows the potential for improvements. In this thesis, we consider three aspects of this issue. First, we consider the increasing parallelism of larger machines. This parallelism appears on three levels:

1. Instruction level: The instruction sets of modern CPUs contain instructions that perform multiple floating-point operations. Examples are Fused-Multiply-Add (FMA) instructions, where a multiplication is carried out together with an addition, and Single-Instruction-Multiple-Data (SIMD) instructions, where the same operation is applied to multiple data.
2. Shared memory level: Modern CPUs consist of multiple cores that work in parallel but operate on the same memory.
3. Distributed memory level: Supercomputers are built from multiple nodes that communicate over a network. Modern supercomputers have hundreds to many hundred thousand nodes. This number is expected to grow even further in the future.

All this parallelism must be exploited to use a supercomputer efficiently. Another aspect is the so-called memory wall. The bandwidth between the memory and the CPU is limited, which hinders the CPU from exploiting its full performance. This effect is often mitigated by a hierarchical cache. However, this only works if the loaded data is reused often enough. A quantity that measures the reuse of the data is the arithmetic intensity (flop per byte, Flop/B), which is a property of the used algorithm.

The last aspect is a consequence of the distributed memory parallelism. The communication costs grow as more nodes are involved in the computation. A typical communication pattern used in Krylov solvers is a collective communication, e.g. a global sum. This type of communication scales as $O(\log(P))$, where $P$ is the number of processors. This makes it essential to organize the communication well and to overlap the communication phase with other meaningful computations.

In the literature, the data transfer between the different cache levels as well as the data movement between nodes are referred to as communication. In the present thesis, we strictly separate the data movement between cache levels, which can be seen as intra-node communication, from the data movement between nodes, which can be seen as inter-node communication.

We consider large sparse linear systems that need to be solved for multiple right-hand sides. This is a very common problem that appears in applications like inverse problems or optimization. We will see that this type of problem is quite well suited to solve or mitigate all the mentioned issues.

1.2 Related Work

The first part of this thesis is strongly inspired by the work of Frommer, Lund and Szyld [FSL17; FLS19] and the PhD thesis by Lund [Lun18]. They recently presented the block Krylov framework on which this thesis is built. The second part of this thesis is related to the work of Cools and Vanroose [CV17a; Coo+18; CCV19]. They presented pipelined Krylov methods that overlap the collective communication of the inner products with computation. In the field of communication-avoiding methods, Demmel et al. [Dem+08; Dem+12] as well as Carson [Car15] and

Hoemmen [Hoe10] presented several methods and ideas to avoid communication in Krylov methods. These methods fuse the communication of multiple iterations into one communication to reduce the number of messages; therefore, they are known as s-step Krylov methods. Combinations of s-step Krylov methods and block Krylov methods have also been proposed [CK10]. Another approach is to use multiple search directions in a Krylov method. This can be found, for example, in the multi-preconditioning methods of Spillane [Spi16]. Another class of algorithms that falls into this category are the enlarged Krylov methods. The idea is to transfer the advantageous convergence properties of block Krylov methods for multiple right-hand sides to systems with a single right-hand side. They were presented by Grigori and Tissot [GT17]. In practice, several software frameworks provide optimized Krylov methods. For example, PETSc [Bal+97; Bal+20] provides communication-avoiding and pipelined Krylov methods, but lacks block Krylov methods. Trilinos [Tea] contains a linear algebra module that provides block Krylov methods and corresponding communication-avoiding methods. Other packages focus more on scalable preconditioners, e.g. hypre [FY02]. Software packages that focus on the solution of PDEs often only provide textbook Krylov methods that are not explicitly optimized for high-performance computing, for example Dune [Bas+20b; Bla+16; Bas+08b; Bas+08a], deal.II [Arn+20] and NGSolve [Sch97].

1.3 Contributions and Outline

As already mentioned, we distinguish intra- and inter-node communication. We pick up this difference to structure this thesis into two parts. Part 1 refers to the first two aspects mentioned in the motivation, i.e. the vectorization and the memory wall. In the second part, we optimize the methods further for inter-node communication, which refers to the last aspect in the description above. In Chapter 3, we review the block Krylov framework by Frommer, Szyld and Lund and analyze its building blocks with respect to their performance on modern CPU architectures. We provide a novel view onto the set of possible *-subalgebras, based on three elementary cases, and introduce a new class of *-subalgebras. Furthermore, based on the performance analysis, we provide a guideline for choosing an appropriate *-subalgebra. Based on this framework, we formulate block versions of the CG, GMRes and BiCGStab methods in Chapters 4, 5 and 6, respectively. For the CG and BiCGStab methods, we introduce a novel stabilization strategy that replaces the deflation process used in most methods in the literature. The new strategy is better suited in our context, as we depend on a fixed number of columns in the block vectors. In the second part, we optimize the methods with respect to inter-node communication.

For that, we adopt the approaches of Cools and Vanroose for our block CG method in Chapter 8. This yields a novel pipelined block Krylov method that combines the advantages of both approaches. In Chapter 9, we consider the orthogonalization procedure of the block GMRes method. We introduce a pipelined Gram-Schmidt orthogonalization and an innovative reduction-based orthogonalization and compare them with the classical Gram-Schmidt method, which is the standard in up-to-date methods. The new methods prove to perform better and to be more stable than the classical Gram-Schmidt method. All newly introduced methods are validated with numerical experiments, carried out on a modern Intel compute server or on the supercomputer PALMA II of the University of Münster.

2 A Brief Introduction to Krylov Methods

Krylov methods came up in the 1950s. Lanczos [Lan50] presented his method for solving eigenvalue problems in 1950. At the same time, Hestenes, Stiefel, et al. [HS+52] presented the Conjugate Gradient (CG) method for solving linear systems. Back then, the CG method was considered a direct method. Later, around 1975, with the development of vector computers and computers with large memories, the methods became more popular as iterative methods. The term Krylov method goes back to the Russian mathematician Aleksey Nikolaevich Krylov, who presented related work in 1931 [Kry31]. Since then, many Krylov methods have been developed and they have become an essential part of modern scientific computing. Golub and O'Leary [GO89] give a good overview of the early development of Krylov methods. Recommendable books about Krylov methods have been written by Greenbaum [Gre97], Saad [Saa03], Hackbusch [Hac94] and Trefethen and Bau [TB97]. We start with some basic definitions. For the rest of this chapter, we consider a linear system

$$Ax^* = b, \tag{2.1}$$
where $A \in L(\mathbb{R}^n, \mathbb{R}^n)$ is an invertible linear operator, $b \in \mathbb{R}^n$ is a given right-hand side and $x^* \in \mathbb{R}^n$ is the desired solution.

Definition 2.1 (Krylov space). For $k \in \mathbb{N}$ and $r \in \mathbb{R}^n$, the space
$$\mathcal{K}^k(A, r) = \operatorname{span}\left(r, Ar, \ldots, A^{k-1}r\right)$$

is called the order-$k$ Krylov space generated by $A$ and $r$. The quantity
$$\nu(A, r) = \max_{k \in \mathbb{N}} \dim \mathcal{K}^k(A, r)$$
is called the grade of $r$ with respect to $A$.

The following lemma summarizes the most important properties of the Krylov space.

Lemma 2.2 (Properties of the Krylov space). The following properties of the Krylov space hold:
• $\mathcal{K}^k(A, r) \subseteq \mathcal{K}^{k+1}(A, r)$,
• $\nu(A, r) \leq 1 + \operatorname{rank}(A)$ and $\nu(A, r) \leq n$,
• the vector space of polynomials $\mathcal{P}^{k-1}$ can be embedded into the Krylov space $\mathcal{K}^k(A, r)$ with the embedding
$$\iota : \mathcal{P}^{k-1} \to \mathcal{K}^k(A, r), \qquad p \mapsto p(A)r. \tag{2.2}$$
If $k \leq \nu(A, r)$, then $\iota$ is an isomorphism.

The objective of a Krylov method is to find an approximation $x^k \in \mathcal{K}^k(A, r^0)$ to the solution $A^{-1}b$, where $r^0 = b - Ax^0$ is the initial residual for an initial guess $x^0 \in \mathbb{R}^n$. For example, the CG method [HS+52] computes the best approximation with respect to the energy error $\|x^* - x^k\|_A$, and the GMRes method [SS86] computes the best approximation with respect to the residual norm $\|Ax^k - b\|$. To compute this approximation, it is often helpful to use an orthonormal basis of the Krylov space. This orthonormal basis can be computed with the Arnoldi process [Arn51], which is based on the Gram-Schmidt orthogonalization process. It computes an orthonormal basis $V^k \in \mathbb{R}^{n\times k}$ that satisfies the so-called Arnoldi relation

$$AV^k = V^k H^k + h_{k+1,k}\, v^{k+1} e_k^T, \tag{2.3}$$

where $H^k \in \mathbb{R}^{k\times k}$ is an upper Hessenberg matrix, $v^{k+1} \in \mathbb{R}^n$ is the subsequent basis vector and $e_k$ is the $k$th unit vector. For example, the GMRes method uses this relation to minimize the Euclidean norm of the residual. If $h_{k+1,k}$ is small, $H^k$ is a good approximation of the operator $A$ restricted to the Krylov space. In the case where $A$ is symmetric, it follows from (2.3) that $H^k$ is a tridiagonal matrix. This fact is used in the CG and MINRES [PS75] methods, such that the basis $V^k$ does not need to be stored explicitly; instead the approximation is updated during the iteration. This property is called a short recurrence. From the definition, it is clear that the solution of the system (2.1) is contained in the Krylov space $\mathcal{K}^{\nu(A,r^0)}(A, r^0)$. Therefore, Krylov space methods that compute a best

approximation in the Krylov space terminate after at most $\nu(A, r^0)$ steps. However, Krylov methods are usually used to compute a good approximation of the solution, which is achieved before $\nu(A, r^0)$ iterations are performed. In general, it is not possible to provide an error estimate that ensures convergence with fewer than $\nu(A, r^0)$ iterations, as the following example shows.

Example 2.3. Consider the following system

$$A = \begin{pmatrix}
0 & \cdots & \cdots & 0 & 1\\
1 & 0 & & & 0\\
0 & 1 & \ddots & & \vdots\\
\vdots & & \ddots & 0 & 0\\
0 & \cdots & 0 & 1 & 0
\end{pmatrix}, \qquad
b = e_1 = \begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix}.$$

With initial guess $x^0 = 0$, the initial residual is $r^0 = e_1$. For all $k < n$, all vectors in the Krylov space have last coefficient $0$, as the operator $A$ pushes the coefficients one place further. As the solution of the system is $e_n$, the best approximation in the Krylov space is $0$. Only for $k \geq n$ can the error norm be decreased.

Therefore, to show any results about convergence rates, additional assumptions are necessary. For example, there are results if the symmetric part $\frac{1}{2}(A + A^T)$ is positive definite, which can be found in the excellent books of Greenbaum [Gre97] or Saad [Saa03]. For the CG method there exists the following famous estimate of the energy error.

Theorem 2.4 (Convergence of the CG method). Let $A$ be symmetric positive definite and $e^k = x^* - x^k$ the error of the $k$th CG iteration. Then the energy error of $e^k$ can be estimated by
$$\|e^k\|_A \leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|e^0\|_A,$$
where $\kappa = \|A\|\,\|A^{-1}\|$ is the condition number of $A$.

The proof follows the one presented by Trefethen and Bau [TB97].

Proof. By Lemma 2.2, we can identify every element in the Krylov space $\mathcal{K}^k(A, r^0)$ with a polynomial of degree $k-1$. In particular, we write the $k$th error of the CG method as
$$e^k = x^* - x^k = x^* - x^0 - p_{k-1}(A)r^0 = x^* - x^0 - p_{k-1}(A)A(x^* - x^0) = q_k(A)e^0, \tag{2.4}$$

for a polynomial $p_{k-1} \in \mathcal{P}^{k-1}$ and $q_k(x) := 1 - p_{k-1}(x)x$. As the CG method finds the best approximation with respect to the energy norm, we conclude
$$\|e^k\|_A \leq \inf_{q_k} \|q_k(A)e^0\|_A.$$

Here the infimum is taken over all polynomials of degree $k$ with constant coefficient $1$.

For the smallest and largest eigenvalues $\lambda_{\min}$ and $\lambda_{\max}$, the polynomials that realize this infimum are given by the scaled Chebyshev polynomials
$$\widetilde{T}_k(x) = \left(T_k\!\left(\frac{-\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}}\right)\right)^{-1} T_k\!\left(\frac{2x - \lambda_{\max} - \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}}\right),$$
where the $k$th Chebyshev polynomial $T_k$ is defined by the recursion formula

$$T_0(x) = 1, \qquad T_1(x) = x, \qquad T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x), \tag{2.5}$$
or directly by
$$T_k(x) = \cos\left(k \arccos(x)\right).$$
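For instance, a quick consistency check of the two definitions (a worked step added here for illustration) for $k = 2$:
$$T_2(x) = 2x\,T_1(x) - T_0(x) = 2x^2 - 1 = \cos\left(2\arccos(x)\right),$$
which is the double-angle formula $\cos(2\theta) = 2\cos^2(\theta) - 1$ with $\theta = \arccos(x)$.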

One can show that the scaled Chebyshev polynomials minimize the maximum norm on the interval $[\lambda_{\min}, \lambda_{\max}]$ in the space of polynomials with constant coefficient $1$. This norm is bounded by
$$\|\widetilde{T}_k\|_{\infty} \leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k, \tag{2.6}$$
where $\kappa = \|A\|\,\|A^{-1}\| = \frac{\lambda_{\max}}{\lambda_{\min}}$ denotes the condition number of the operator $A$. As $A$ is symmetric positive definite, the eigenvectors $(u_i)_i$ corresponding to the eigenvalues $(\lambda_i)_i$ form an orthonormal basis of $\mathbb{R}^n$. We write the error $e^0$ in this basis as

$$e^0 = \sum_{i=0}^{n-1} a_i u_i.$$
Then the energy error of $e^0$ is given by
$$\|e^0\|_A^2 = \sum_{i=0}^{n-1} a_i^2 \lambda_i$$
and the energy error of $e^k$ is given by
$$\|e^k\|_A^2 = \|q_k(A)e^0\|_A^2 = \sum_{i=0}^{n-1} q_k(\lambda_i)^2 a_i^2 \lambda_i \leq \max_{i}|q_k(\lambda_i)|^2\, \|e^0\|_A^2 \leq 4\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2k} \|e^0\|_A^2,$$
where we used Equations (2.4) and (2.6).

The error bound given by Theorem 2.4 is sharp, i.e. there exist data $A$ and $b$ such that equality holds. But for fixed data, better convergence can occur. For example, if the initial residual $r^0$ is an eigenvector of $A$, then the method converges within one iteration, as the grade of $r^0$ with respect to $A$ is $1$. For the GMRes method, a representation of the residual similar to Equation (2.4) can be formulated as

$$r^k = b - Ax^k = b - Ax^0 - Ap_{k-1}(A)r^0 = r^0 - Ap_{k-1}(A)r^0 = q_k(A)r^0,$$
with $q_k(x) = 1 - p_{k-1}(x)x$. From this equality, an error estimate can be derived if the operator $A$ is normal, i.e. unitarily diagonalizable. We review this proof for the block variant of the GMRes method in Section 5.2. Theorem 2.4 and the theory of the proof show that the convergence behavior of Krylov space methods depends on the condition number of the operator. Therefore, it is common practice to use preconditioning. That means the Krylov method is applied to the system

$$M_L^{-1} A M_R^{-1} y = M_L^{-1} b$$
for some matrices $M_L, M_R$ whose inverses can be applied cheaply. Once $y$ has been found, the solution of $Ax = b$ is obtained by computing $x = M_R^{-1} y$.

The operators $M_L, M_R$ are chosen to improve the condition number of the operator $M_L^{-1} A M_R^{-1}$ and hence to improve the convergence of the Krylov method. Often, one of $M_L$ and $M_R$ is chosen to be the identity, resulting in so-called left or right preconditioning. Simple preconditioners are based on iterative splitting methods like the Jacobi or Gauß-Seidel iteration.

Table 2.1: Overview of commonly used Krylov methods, their properties and references.

Name                                        | Requirements                | Short recursion | Minimization | Reference
Conjugate Gradients (CG)                    | symmetric positive definite | yes             | $\|e^k\|_A$  | [HS+52]
General Minimal Residual (GMRes)            | none                        | no              | $\|r^k\|_2$  | [SS86]
Biconjugate Gradients Stabilized (BiCGStab) | none                        | yes             | none         | [Van92]
Minimum Residual (MINRes)                   | symmetric                   | yes             | $\|r^k\|_2$  | [PS75]
Conjugate Residual (CR)                     | symmetric                   | yes             | $\|r^k\|_2$  | [Sti55], [EES83]
Quasi Minimal Residual (QMR)                | none                        | yes             | none         | [FN91]

They split the operator into a sum of matrices
$$A = M + N,$$
where $N$ can be easily inverted. For example, the Jacobi method chooses $N$ as the diagonal of $A$ and the Gauß-Seidel method chooses $N$ as the lower triangular part of $A$. Then the preconditioner is given by a few iterations of the fixed-point iteration
$$x^{k+1} = N^{-1}\left(b - Mx^k\right) = x^k + N^{-1}r^k.$$
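To make the splitting concrete, the following minimal sketch (added for illustration; the CSR data structure and function names are assumptions and not taken from the thesis implementation) performs one Jacobi fixed-point step $x \leftarrow N^{-1}(b - Mx) = x + N^{-1}r$ with $N = \operatorname{diag}(A)$:

#include <cstddef>
#include <vector>

// Minimal CSR storage, assumed for this illustration only.
struct CSRMatrix {
  std::vector<double> val;         // non-zero values
  std::vector<std::size_t> col;    // column index of each value
  std::vector<std::size_t> rowptr; // start of each row in val/col (size n+1)
};

// One Jacobi fixed-point step: x <- x + D^{-1} (b - A x), D = diag(A).
void jacobi_step(const CSRMatrix& A, const std::vector<double>& b,
                 std::vector<double>& x) {
  const std::size_t n = b.size();
  std::vector<double> xnew(n);
  for (std::size_t i = 0; i < n; ++i) {
    double diag = 0.0, sum = 0.0;
    for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) {
      if (A.col[k] == i) diag = A.val[k];
      else               sum += A.val[k] * x[A.col[k]];
    }
    xnew[i] = (b[i] - sum) / diag;  // equals x[i] + (b - A x)_i / diag
  }
  x = xnew;
}

Applying a fixed small number of such steps realizes the simple splitting preconditioner described above.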

Other popular preconditioners compute incomplete factorizations of the operator $A$. These preconditioners often only affect the large eigenvalues of $A$. Especially on very large systems this does not reduce the condition number sufficiently, as the small eigenvalues are not affected. More sophisticated preconditioners are multigrid methods that use restrictions of the operator $A$ to coarser spaces and apply the simple preconditioners on those levels too. Thus, all ranges of eigenvalues are affected. An alternative is to compute coarse spaces that contain the eigenvectors of the small eigenvalues. The preconditioner is then chosen as the projection onto the orthogonal complement of these coarse spaces (e.g. GenEO [Spi+14]). Table 2.1 shows an overview of widely used Krylov methods for solving linear systems.

It shows the requirements on the operator and preconditioner as well as whether the method uses a short recursion. Furthermore, the norm in which the error is minimized is given, together with the reference in which the method was presented.

Part 1 Block Krylov Methods

La théorie est mère de la pratique. (Theory is the mother of practice.)
Louis Pasteur

3 A General Block Krylov Framework

Block Krylov methods were developed in the 1970s and 1980s to solve linear systems with multiple right-hand sides [OLe80] or to compute multiple eigenvectors [Und75]. Recently, they have been rediscovered in the context of high-performance computing to reduce the communication overhead. The term “block” is quite overloaded in the field of numerical linear algebra. In the context of matrix structures it means that the matrix is subdivided into smaller matrices. In the context of preconditioning it often refers to the block Jacobi method that only considers the diagonal blocks of the system matrix to parallelize the preconditioning, and in the context of Krylov methods it refers to the already mentioned methods that are based on the work of O'Leary [OLe80]. We consider block Krylov methods to solve linear systems with multiple right-hand sides. Let $A \in \mathbb{R}^{n\times n}$ be an invertible linear operator and $B \in \mathbb{R}^{n\times s}$ a block vector. A linear system with multiple right-hand sides, called a block system, for the solution $X^* \in \mathbb{R}^{n\times s}$ is given by

$$AX^* = B, \tag{3.1}$$
which is equivalent to
$$Ax_i^* = b_i \quad \forall i = 1, \ldots, s, \tag{3.2}$$
where $x_i^*$ and $b_i$ denote the $i$th column of $X^*$ and $B$, respectively. The basic idea of block Krylov methods is to make use of the sum of all Krylov spaces of the linear systems in Equation (3.2) to find a better approximation for the solution. O'Leary [OLe80] showed that the convergence of the block CG method is faster than that of the CG method and independent of the $s-1$ smallest eigenvalues.

We recall this result in Theorem 4.5. In the context of high-performance computing, block Krylov methods have another advantage. During one iteration the operator (and preconditioner) is applied to block vectors in $\mathbb{R}^{n\times s}$, which is beneficial if the matrix is explicitly stored. It leads to a higher arithmetical intensity, which is crucial on modern CPUs to achieve good performance. Furthermore, it is well suited for the use of SIMD instructions if the block vectors are stored in row-major format. Several approaches have been proposed to use the faster convergence of block Krylov methods for linear systems with a single right-hand side. Grigori and Tissot [GT17; Al +18] proposed a method where they decompose the right-hand side $b$, based on the domain decomposition, to obtain multiple right-hand sides which can be used to solve the original problem. This approach is also used in the PhD theses of Moufawad [Mou14], Al Daas [Al 18] and Tissot [Tis19]. Other approaches are to choose additional right-hand sides randomly (BRRHS-CG) [NY95] or to choose additional initial guesses randomly and solve all for the same right-hand side (CoopCG) [Bha+12]. In principle, these approaches are also applicable to the methods presented in this work. One iteration of a block Krylov method has costs of order $O(s^2 n + s^3)$, which can become a problem if many right-hand sides are used, i.e. if $s$ is large. To mitigate this effect, but still take advantage of the higher operational intensity of the operator and preconditioner application, we introduce a general framework of block Krylov spaces based on the work of Frommer, Szyld and Lund [FSL17; FLS19; Lun18]. It allows us to balance the information exchange between the different right-hand sides against the computational blocking overhead. Then, we provide a performance analysis of the building blocks, give details about our implementation and present numerical tests that confirm our theory and show the advantages of the block Krylov framework.

3.1 Block Krylov Spaces

Let us start with a review of the block Krylov framework presented by Frommer, Szyld and Lund [FSL17]. Originally, this framework was introduced to evaluate functions of matrices. Further development of the framework was done in the thesis by Lund [Lun18] and the paper by Frommer, Lund and Szyld [FLS19]. In the remainder of this work, all methods and algorithms are built upon this framework. The standard Krylov methods can be obtained by choosing $s = 1$; this is referred to as the non-block case. We start with the central definition of the block Krylov space.

Definition 3.1 (Block Krylov subspace). Let $\mathbb{S}$ be a *-subalgebra of $\mathbb{R}^{s\times s}$ and $R \in \mathbb{R}^{n\times s}$. The $k$th block Krylov space with respect to $A$, $R$ and $\mathbb{S}$ is defined by
$$\mathcal{K}^k_{\mathbb{S}}(A, R) = \left\{ \sum_{i=0}^{k-1} A^i R\, c_i \;\middle|\; c_0, \ldots, c_{k-1} \in \mathbb{S} \right\} \subset \mathbb{R}^{n\times s}.$$

Remarks.
• A *-algebra is a vector space $\mathbb{S}$ equipped with a product and a conjugation. In particular, this means that for all elements $s \in \mathbb{S}$ and polynomials $p \in \mathcal{P}$ the evaluation of the polynomial at that element, $p(s)$, is contained in the *-algebra.
• The Cayley-Hamilton theorem yields that every *-subalgebra of $\mathbb{R}^{s\times s}$ contains an identity, see for example [Bos14]. This identity does not necessarily coincide with the identity in $\mathbb{R}^{s\times s}$. However, for the *-subalgebras $\mathbb{S}$ we consider in this work, the identities coincide, $I_{\mathbb{S}} = I_{\mathbb{R}^{s\times s}}$. Other *-subalgebras would be pointless, as we will see later.
• We choose $\mathbb{R}^{n\times s}$ as a vector space here. In principle, every vector space over some field $F$ could be chosen; $\mathbb{S}$ is then a *-subalgebra of $F^{s\times s}$.
• For the rest of this thesis, $\mathbb{S}$ denotes a *-subalgebra of $\mathbb{R}^{s\times s}$.
• The classical block Krylov methods as described by O'Leary use $\mathbb{S} = \mathbb{R}^{s\times s}$.
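To see how the choice of $\mathbb{S}$ changes the search space, consider the following comparison (an example added here for illustration; the two *-subalgebras are the elementary cases introduced formally in Definition 3.9 below):
$$\mathcal{K}^2_{\mathbb{R}\cdot I}(A, R) = \{\gamma_0 R + \gamma_1 AR \mid \gamma_0, \gamma_1 \in \mathbb{R}\}, \qquad \mathcal{K}^2_{\mathbb{R}^{s\times s}}(A, R) = \{R c_0 + AR c_1 \mid c_0, c_1 \in \mathbb{R}^{s\times s}\},$$
so in the first case all columns share the same two scalar coefficients, whereas in the second case every column of the iterate may combine all columns of $R$ and $AR$.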

For the convergence theory of Krylov methods, polynomials play an important role, as we already saw in Theorem 2.4. For the convergence theory in this framework, we introduce the more generic S-valued polynomials.

Definition 3.2. A polynomial of the form
$$\mathcal{P}(x) = \sum_{i=0}^{k} x^i \gamma_i, \qquad \gamma_i \in \mathbb{S},$$
is called an $\mathbb{S}$-valued polynomial of degree $k$. We write $\mathcal{P}^k_{\mathbb{S}}$ for the space of $\mathbb{S}$-valued polynomials of degree $k$.

Inspired by the paper of El Guennouni, Jbilou and Sadok [EJS03] we denote the product
$$\mathcal{P}(A) \circ Y = \sum_{i=0}^{k} A^i Y \gamma_i,$$
where $Y \in \mathbb{R}^{n\times s}$. With this operation, the operator $\mathcal{P}(A)$ can be considered as a linear operator on the space $\mathbb{R}^{n\times s}$. Furthermore, we define the right-sided product $\mathcal{P}\sigma \in \mathcal{P}^k_{\mathbb{S}}$ of an $\mathbb{S}$-valued polynomial $\mathcal{P} \in \mathcal{P}^k_{\mathbb{S}}$, with $\mathcal{P}(x) = \sum_{i=0}^{k} x^i \gamma_i$, and $\sigma \in \mathbb{S}$ as
$$(\mathcal{P}\sigma)(x) = \sum_{i=0}^{k} x^i \gamma_i \sigma.$$

From Definition 3.1 we find the following two lemmas immediately. The first one is in analogy with (2.2).

Lemma 3.3. Every element $X \in \mathcal{K}^k_{\mathbb{S}}(A, R)$ in the block Krylov space can be represented by an $\mathbb{S}$-valued polynomial $\mathcal{P} \in \mathcal{P}^{k-1}_{\mathbb{S}}$ of degree $k-1$:
$$X = \mathcal{P}(A) \circ R = \sum_{i=0}^{k-1} A^i R\, c_i, \qquad c_0, \ldots, c_{k-1} \in \mathbb{S}.$$

Lemma 3.4. If $\mathbb{S}_1$ and $\mathbb{S}_2$ are two *-subalgebras of $\mathbb{R}^{s\times s}$ with $\mathbb{S}_1 \subseteq \mathbb{S}_2$, then
$$\mathcal{K}^k_{\mathbb{S}_1}(A, R) \subseteq \mathcal{K}^k_{\mathbb{S}_2}(A, R)$$
holds.

Analogous to the non-block case (s = 1) we define the block grade of a block vector in a block Krylov space. This definition is inspired by Gutknecht and Schmelzer [GS09; Gut07].

Definition 3.5 (Block grade). In the setting of Definition 3.1 we define the block grade of $R$ with respect to $A$ as
$$\nu_{\mathbb{S}}(A, R) := \min\left\{ k \in \mathbb{N} \;\middle|\; \dim \mathcal{K}^k_{\mathbb{S}}(A, R) = \dim \mathcal{K}^{k+1}_{\mathbb{S}}(A, R) \right\}.$$

As Gutknecht and Schmelzer show, the block grade defines the minimal k for which the solution is contained in the Krylov space. We review this result in our context.

Lemma 3.6. We have
$$X^* \in X^0 + \mathcal{K}^{\nu_{\mathbb{S}}(A,R^0)}_{\mathbb{S}}(A, R^0),$$
where $AX^* = B$ and $R^0 = B - AX^0$ is the residual for some initial guess $X^0 \in \mathbb{R}^{n\times s}$.

Proof. By definition of the block Krylov space we have for any $k \in \mathbb{N}$
$$\mathcal{K}^k_{\mathbb{S}}(A, R^0) \subseteq \mathcal{K}^{k+1}_{\mathbb{S}}(A, R^0).$$
From the definition of the block grade it follows that
$$\mathcal{K}^{\nu_{\mathbb{S}}(A,R^0)}_{\mathbb{S}}(A, R^0) = \mathcal{K}^{k}_{\mathbb{S}}(A, R^0)$$
for all $k \geq \nu_{\mathbb{S}}(A, R^0)$. As $A$ is invertible, $\mathcal{K}^{\nu_{\mathbb{S}}(A,R^0)}_{\mathbb{S}}(A, R^0)$ is an $A$-invariant subspace of $\mathbb{R}^{n\times s}$:
$$A\,\mathcal{K}^{\nu_{\mathbb{S}}(A,R^0)}_{\mathbb{S}}(A, R^0) = \mathcal{K}^{\nu_{\mathbb{S}}(A,R^0)}_{\mathbb{S}}(A, R^0).$$
By definition, $\mathcal{K}^{\nu_{\mathbb{S}}(A,R^0)}_{\mathbb{S}}(A, R^0)$ contains $R^0$. This yields
$$R^0 \in A\,\mathcal{K}^{\nu_{\mathbb{S}}(A,R^0)}_{\mathbb{S}}(A, R^0).$$
As $A$ is invertible, we can apply $A^{-1}$ to get
$$X^* - X^0 \in \mathcal{K}^{\nu_{\mathbb{S}}(A,R^0)}_{\mathbb{S}}(A, R^0).$$
Adding $X^0$ completes the proof.

This result is more of theoretical interest, as the block grade in real-world problems is usually quite high. In practice, often far fewer iterations are needed to reduce the residual norm sufficiently. As we will see later, a more practically relevant quantity is defined by
$$\xi_{\mathbb{S}}(A, R^0) = \min\left\{ k \in \mathbb{N} \;\middle|\; \dim \mathcal{K}^k_{\mathbb{S}}(A, R^0) < k \dim \mathbb{S} \right\}. \tag{3.3}$$
It is the first iteration in which the Krylov space does not grow by $\dim \mathbb{S}$ dimensions. This leads to a situation that must be treated numerically. Next, we define an inner product on the vector space of block vectors $\mathbb{R}^{n\times s}$ that is a generalization of the scalar product.

Definition 3.7 (block inner product). A mapping $\langle\cdot,\cdot\rangle_{\mathbb{S}} : \mathbb{R}^{n\times s} \times \mathbb{R}^{n\times s} \to \mathbb{S}$ is called a block inner product if the following conditions hold for all $X, Y, Z \in \mathbb{R}^{n\times s}$ and $\gamma \in \mathbb{S}$:
• $\mathbb{S}$-linearity: $\langle X + Y, Z\gamma\rangle_{\mathbb{S}} = \langle X, Z\rangle_{\mathbb{S}}\,\gamma + \langle Y, Z\rangle_{\mathbb{S}}\,\gamma$
• symmetry: $\langle X, Y\rangle_{\mathbb{S}} = \langle Y, X\rangle_{\mathbb{S}}^T$
• definiteness: $\langle X, X\rangle_{\mathbb{S}}$ is positive definite for all full-rank $X$, and $\langle X, X\rangle_{\mathbb{S}} = 0$ if and only if $X = 0$
• normality: $\operatorname{tr}\left(\langle X, Y\rangle_{\mathbb{S}}\right) = \langle X, Y\rangle_F$
Here $\langle X, Y\rangle_F = \operatorname{tr}\left(X^T Y\right)$ denotes the Frobenius scalar product.

Remarks. Note that the definiteness condition implies that a block inner product can only be defined on *-subalgebras that contain full-rank matrices. In particular, this yields that the identity of $\mathbb{S}$ is the same as the identity in $\mathbb{R}^{s\times s}$.

Definition 3.8 (normalizer). We call a map
$$\operatorname{Norm}_{\mathbb{S}} : \mathbb{R}^{n\times s} \to \mathbb{S}$$
a normalizer or scaling quotient if for all $X \in \mathbb{R}^{n\times s}$ there exists a $Y$ such that
$$X = Y \operatorname{Norm}_{\mathbb{S}}(X) \quad \text{and} \quad \langle Y, Y\rangle_{\mathbb{S}} = I.$$

Remarks.
• A normalizer can be computed by a QR factorization.
• In the work of Frommer, Lund and Szyld, the scaling quotient is only required to be defined for full-rank $X$. We use the more restrictive definition for our stabilization strategies and to resolve breakdowns in the block Krylov methods.
• We use the Householder algorithm to compute a normalizer in our code. The Gram-Schmidt orthogonalization process would fail for rank-deficient block vectors. However, there is work to mitigate this, for example by replacing linearly dependent columns with random vectors [Soo15].
• In algorithms we write the normalizer in Python-style syntax
$$Y, \sigma = \operatorname{Norm}_{\mathbb{S}}(X).$$
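The following minimal sketch (added for illustration; it is not the thesis implementation, which uses a Householder QR via LAPACK, see Section 3.2) computes a normalizer for the classic block case $\mathbb{S}_B = \mathbb{R}^{s\times s}$ from a Cholesky factorization of the Gram matrix $\langle X, X\rangle_{\mathbb{S}_B} = X^T X$. It assumes a full-rank $X$ in row-major storage:

#include <array>
#include <cmath>
#include <vector>

// Block vector with s columns, stored row-major: X[row][column].
template<int s>
using RowMajorBlockVector = std::vector<std::array<double, s>>;

// Returns the upper-triangular sigma with X^T X = sigma^T sigma and overwrites
// X with Y = X * sigma^{-1}, so that X_old = Y * sigma and Y^T Y = I.
template<int s>
std::array<std::array<double, s>, s> normalizer_block(RowMajorBlockVector<s>& X) {
  std::array<std::array<double, s>, s> G{};      // Gram matrix G = X^T X
  for (const auto& row : X)
    for (int i = 0; i < s; ++i)
      for (int j = 0; j < s; ++j)
        G[i][j] += row[i] * row[j];
  std::array<std::array<double, s>, s> sigma{};  // Cholesky factor, G = sigma^T sigma
  for (int i = 0; i < s; ++i)
    for (int j = i; j < s; ++j) {
      double sum = G[i][j];
      for (int k = 0; k < i; ++k) sum -= sigma[k][i] * sigma[k][j];
      sigma[i][j] = (i == j) ? std::sqrt(sum) : sum / sigma[i][i];
    }
  for (auto& row : X) {                          // normalize: solve y * sigma = row
    std::array<double, s> y{};
    for (int j = 0; j < s; ++j) {
      double sum = row[j];
      for (int k = 0; k < j; ++k) sum -= y[k] * sigma[k][j];
      y[j] = sum / sigma[j][j];
    }
    row = y;
  }
  return sigma;
}

As the remarks above note, such a Gram-matrix approach breaks down for rank-deficient block vectors, which is why the thesis relies on Householder transformations instead.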

Next, we take a look at different choices for the *-subalgebra $\mathbb{S}$. We first introduce three elementary cases, which were also considered by Frommer, Szyld and Lund [FSL17], while we use a different naming scheme.

Definition 3.9 (elementary *-subalgebras). We consider the following three elementary cases of how to treat a block system within the Krylov framework.

1. global: The block system is considered as one linear system. This linear system can be represented by the Kronecker system
$$(I_s \otimes A)\operatorname{vec}(X) = \operatorname{vec}(B)$$
or
$$\begin{pmatrix} A & & \\ & \ddots & \\ & & A \end{pmatrix} \begin{pmatrix} X_1 \\ \vdots \\ X_s \end{pmatrix} = \begin{pmatrix} B_1 \\ \vdots \\ B_s \end{pmatrix}.$$
This choice corresponds to the *-subalgebra of multiples of the identity, $\mathbb{S}_G = \mathbb{R}\cdot I$, and the Frobenius inner product
$$\langle X, Y\rangle_{\mathbb{S}_G} = \left(\sum_{i=1}^{s} X_i^T Y_i\right) I.$$
This method goes back to Jbilou, Messaoudi and Sadok [JMS99] for the FOM and GMRes method.
2. parallel: All columns of the block system are considered separately, but the iterations are carried out simultaneously. This corresponds to the *-subalgebra of diagonal matrices, $\mathbb{S}_P = \operatorname{diag}(\mathbb{R}^s)$, and the block inner product
$$\langle X, Y\rangle_{\mathbb{S}_P} = \operatorname{diag}\left(X^T Y\right).$$
3. block: The classic block Krylov case as presented by O'Leary. This corresponds to the *-subalgebra $\mathbb{S}_B = \mathbb{R}^{s\times s}$ and the inner product
$$\langle X, Y\rangle_{\mathbb{S}_B} = X^T Y.$$
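The following self-contained sketch (added for illustration; names and storage layout are assumptions, not the thesis code) computes the three elementary block inner products just defined for block vectors stored row-major:

#include <array>
#include <cstddef>
#include <vector>

// block case S_B: the full s x s matrix X^T Y
template<int s>
std::array<std::array<double, s>, s>
bdot_block(const std::vector<std::array<double, s>>& X,
           const std::vector<std::array<double, s>>& Y) {
  std::array<std::array<double, s>, s> G{};
  for (std::size_t r = 0; r < X.size(); ++r)
    for (int i = 0; i < s; ++i)
      for (int j = 0; j < s; ++j)
        G[i][j] += X[r][i] * Y[r][j];
  return G;
}

// parallel case S_P: only the diagonal of X^T Y
template<int s>
std::array<double, s>
bdot_parallel(const std::vector<std::array<double, s>>& X,
              const std::vector<std::array<double, s>>& Y) {
  std::array<double, s> d{};
  for (std::size_t r = 0; r < X.size(); ++r)
    for (int i = 0; i < s; ++i)
      d[i] += X[r][i] * Y[r][i];
  return d;
}

// global case S_G: the Frobenius inner product sum_i X_i^T Y_i (times the identity)
template<int s>
double bdot_global(const std::vector<std::array<double, s>>& X,
                   const std::vector<std::array<double, s>>& Y) {
  double g = 0.0;
  for (std::size_t r = 0; r < X.size(); ++r)
    for (int i = 0; i < s; ++i)
      g += X[r][i] * Y[r][i];
  return g;
}

In the row-major layout all three kernels stream both block vectors exactly once, which is the basis of the memory-traffic analysis in Section 3.3.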

From these three cases we compose more complex cases as the following definition shows.

Definition 3.10 (Relevant *-subalgebras). Let $p \in \mathbb{N}$ be a divisor of $s$, $q = \frac{s}{p}$ and $X, Y \in \mathbb{R}^{n\times s}$. We subdivide $X, Y$ column-wise into $\mathbb{R}^{n\times p}$ matrices
$$X = [X_1, \ldots, X_q], \qquad Y = [Y_1, \ldots, Y_q].$$
Then we define the following *-subalgebras and corresponding block inner products:
$$\text{block-parallel:}\quad \mathbb{S}^p_{BP} := \operatorname{diag}\left((\mathbb{R}^{p\times p})^q\right), \qquad \langle X, Y\rangle_{\mathbb{S}^p_{BP}} := \operatorname{diag}\left(X_1^T Y_1, \ldots, X_q^T Y_q\right),$$
$$\text{block-global:}\quad \mathbb{S}^p_{BG} := I_q \otimes \mathbb{R}^{p\times p}, \qquad \langle X, Y\rangle_{\mathbb{S}^p_{BG}} := I_q \otimes \frac{1}{q}\sum_{i=1}^{q} X_i^T Y_i,$$
where $I_q$ denotes the $q$-dimensional identity matrix and $\operatorname{diag}((\mathbb{R}^{p\times p})^q)$ denotes the set of $s\times s$ matrices in which only the $p\times p$ diagonal blocks have non-zero values.

In principle also a global-parallel combination would be possible. As we want to make use of the advantages of the block strategy, we do not consider this in the present thesis. It would also be possible to apply the elementary case in a different order and construct parallel-block or global-block methods. The resulting *-algebras would be isomorphic to the ones of the block-parallel and block-global methods. Therefore, we restrict ourselves to the two mentioned cases. Figure 3.1 shows a schematic representation of elements in the different *-subalgebras of R4×4. Same colors mean a coupling of the coefficients. White coefficients are restricted to be zero. In the work of Frommer, Szyld and Lund the block case is called classic, the parallel method is called loop-interchange and the block-parallel case is called hybrid. The block-global method is not considered in their work. We chose the new naming scheme because it feels more natural, as we derive the new cases from the three elementary ones. To get a feeling for the different *-subalgebras we look at the following example of computing the normalizer of a block vector X in the block-global case.

Example 3.11 (Normalizer in the block-global *-subalgebra). We can compute the normalizer of the block vector $[X_1, \ldots, X_q]$ in the block-global case by computing a QR decomposition

Figure 3.1: Schematic representation of different *-subalgebras: (a) global $\mathbb{S}_G$, (b) parallel $\mathbb{S}_P$, (c) classic $\mathbb{S}_B$, (d) block-parallel $\mathbb{S}^2_{BP}$, (e) block-global $\mathbb{S}^2_{BG}$. Same colors mean the coefficients are coupled. White means restricted to zero. The top row shows the elementary cases. The bottom row shows the combined *-subalgebras. Inspired by [Lun18, Table 3.1].

$$\begin{bmatrix} X_1 \\ \vdots \\ X_q \end{bmatrix} = \begin{bmatrix} \widetilde{Y}_1 \\ \vdots \\ \widetilde{Y}_q \end{bmatrix} \rho, \qquad \rho \in \mathbb{R}^{p\times p}.$$
To enforce the normalization we set the normalizer as
$$\operatorname{Norm}_{\mathbb{S}^p_{BG}}(X) = I_q \otimes \frac{1}{\sqrt{q}}\,\rho \;\in\; \mathbb{S}^p_{BG}.$$
Then, the block vector $Y = \sqrt{q}\,\widetilde{Y}$ is normalized, as
$$\langle Y, Y\rangle_{\mathbb{S}^p_{BG}} = I_q \otimes \sum_{i=1}^{q} \widetilde{Y}_i^T \widetilde{Y}_i = I_q \otimes I_p = I_s.$$

The following lemma follows directly from Definitions 3.9 and 3.10.

Lemma 3.12 (Embeddings of *-subalgebras). For p1, p2 ∈ N, where p1 is a divisor of p2 and p2 is a divisor of s, we have the following embedding:

$$\mathbb{S}_P \subseteq \mathbb{S}^{p_1}_{BP} \subseteq \mathbb{S}^{p_2}_{BP} \subseteq \mathbb{S}_B, \qquad \mathbb{S}_G \subseteq \mathbb{S}^{p_1}_{BG} \subseteq \mathbb{S}^{p_2}_{BG} \subseteq \mathbb{S}_B,$$
and in addition $\mathbb{S}_G \subseteq \mathbb{S}_P$ and $\mathbb{S}^{p_i}_{BG} \subseteq \mathbb{S}^{p_i}_{BP}$ for $i = 1, 2$.

Note that due to Lemma 3.4 we have the analogous embeddings for the corresponding Krylov spaces. To close this section, we consider the more generic case of a linear operator $A \in L(\mathbb{R}^{n\times s}, \mathbb{R}^{n\times s})$ and introduce a classification for this type of operator. This is in analogy to symmetry and definiteness in the scalar case ($s = 1$).

Definition 3.13. Let $A \in L(\mathbb{R}^{n\times s}, \mathbb{R}^{n\times s})$ be a linear operator on $\mathbb{R}^{n\times s}$ and $\mathbb{S}$ a *-subalgebra with a block inner product $\langle\cdot,\cdot\rangle_{\mathbb{S}}$. We call $A$
• block self-adjoint (BSA), if for all $X, Y \in \mathbb{R}^{n\times s}$
$$\langle AX, Y\rangle_{\mathbb{S}} = \langle X, AY\rangle_{\mathbb{S}}$$
holds;
• block positive definite (BPD), if
a) $A$ is BSA and for all $X \in \mathbb{R}^{n\times s}$ with full rank, $\langle X, AX\rangle_{\mathbb{S}}$ is self-adjoint and positive definite, and
b) for all rank-deficient $X \neq 0$, $\langle X, AX\rangle_{\mathbb{S}}$ is self-adjoint, positive semi-definite and non-zero.

Remarks. Consider the following representation of the operator $A$ that operates on the vectorization of $\mathbb{R}^{n\times s}$,
$$\hat{A} = \begin{bmatrix} A_{1,1} & \cdots & A_{1,s} \\ \vdots & & \vdots \\ A_{s,1} & \cdots & A_{s,s} \end{bmatrix}, \quad \text{that means} \quad (AX)_i = \sum_{j=1}^{s} A_{i,j} X_j.$$
Then we distinguish the following cases:
• global ($\mathbb{S}_G$):
  $A$ is BSA $\Leftrightarrow$ $\hat{A}$ is symmetric;
  $A$ is BPD $\Leftrightarrow$ $\hat{A}$ is symmetric positive definite.
• parallel ($\mathbb{S}_P$):
  $A$ is BSA $\Leftrightarrow$ $A_{i,j} = 0$ for all $j \neq i \in \{1,\ldots,s\}$ and $A_{i,i}$ is symmetric for all $i \in \{1,\ldots,s\}$;
  $A$ is BPD $\Leftrightarrow$ $A$ is BSA and $A_{i,i}$ is positive definite for all $i = 1,\ldots,s$.
• block ($\mathbb{S}_B$):
  $A$ is BSA $\Leftrightarrow$ $A_{i,j} = 0$ for all $j \neq i \in \{1,\ldots,s\}$, $A_{i,i} = A_{j,j}$ for all $i, j \in \{1,\ldots,s\}$ and $A_{i,i}$ is symmetric for all $i \in \{1,\ldots,s\}$;
  $A$ is BPD $\Leftrightarrow$ $A$ is BSA and $A_{i,i}$ is positive definite for all $i = 1,\ldots,s$.

As we only consider block linear systems as defined by Equation (3.1), the operator $A$ defined by
$$(AX)_i = AX_i$$
is BSA if $A$ is symmetric and BPD if $A$ is symmetric positive definite, for all the mentioned cases; the short check below illustrates this for the classic block case.
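The check (a one-line verification added for illustration) uses only the definition of $\langle\cdot,\cdot\rangle_{\mathbb{S}_B}$ and the symmetry of $A$:
$$\langle AX, Y\rangle_{\mathbb{S}_B} = (AX)^T Y = X^T A^T Y = X^T A Y = X^T (AY) = \langle X, AY\rangle_{\mathbb{S}_B} \quad \text{if } A = A^T.$$
The other cases follow analogously from the definitions above.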

Finally, we define orthogonality for the block inner product.

Definition 3.14 (block orthogonality). Let $X, Y \in \mathbb{R}^{n\times s}$ be two block vectors and $A$ a BPD operator. We call $X, Y$
• $\mathbb{S}$-orthogonal if $\langle X, Y\rangle_{\mathbb{S}} = 0$ holds,
• $\mathbb{S}$-$A$-orthogonal if $\langle AX, Y\rangle_{\mathbb{S}} = 0$ holds.

After reviewing the theoretical aspects of the block Krylov framework, we take a look at the practical parts in the next sections.

3.2 Implementation

Our implementation builds upon the SIMD interface in the C++ software framework Dune [Bas+20b; Bla+16; Bas+08b; Bas+08a]. This ensures that we make use of the SIMD capabilities of the hardware and offers a way to implement horizontal parallelism easily. SIMD data types behave like a numeric type (e.g. double) but process multiple values at once. The coefficients of a SIMD data type are called the lanes. The type of one lane is called the scalar type of the SIMD data type.

template<class T, std::size_t S, std::size_t A>
auto operator+(const LoopSIMD<T, S, A> &v, const LoopSIMD<T, S, A> &w) {
  LoopSIMD<T, S, A> out;
  for (std::size_t i = 0; i < S; ++i)
    out[i] = v[i] + w[i];
  return out;
}
Listing 3.1: Implementation of the operator+ of Dune::LoopSIMD.
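A hypothetical usage sketch of the SIMD interface (added for illustration; the header paths, the reference semantics of Simd::lane and the exact number of lanes are assumptions about the Dune API, not code from the thesis):

#include <iostream>
#include <dune/common/simd/loop.hh>   // Dune::LoopSIMD (assumed header)
#include <dune/common/simd/simd.hh>   // Dune::Simd::lane / lanes (assumed header)

int main() {
  using V = Dune::LoopSIMD<double, 4>;            // 4 lanes of double (assumed width)
  V x, y;
  for (std::size_t l = 0; l < Dune::Simd::lanes<V>(); ++l) {
    Dune::Simd::lane(l, x) = 1.0 + l;             // fill the lanes individually
    Dune::Simd::lane(l, y) = 10.0;
  }
  V z = x + y;                                    // lane-wise addition as in Listing 3.1
  for (std::size_t l = 0; l < Dune::Simd::lanes<V>(); ++l)
    std::cout << Dune::Simd::lane(l, z) << ' ';   // expected output: 11 12 13 14
  return 0;
}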

Supported SIMD data types include VCL [Fog] and Vc [KL12; Kre15]. Furthermore, Dune provides a simple fallback implementation, Dune::LoopSIMD, that is based on static loops and relies on compiler optimization for the exploitation of SIMD instructions. In the future it is planned to support the SIMD features of the C++ standard once the Parallelism TS v2¹ is merged into the standard. The SIMD interface of Dune unifies the usage of the SIMD-specific operations that differ between implementations. The essential components of the interface for a SIMD data type T are

• Simd::lane(size_t l, T x): provides access to a single lane,
• Simd::lanes<T>(): provides access to the number of lanes (SIMD width),
• Simd::Scalar<T>: the scalar data type (e.g. double).

Further features include the evaluation of conditional expressions and the implementation of the C++ math functions like min, max, sin etc. Dune::LoopSIMD is a fallback implementation in Dune. It inherits from std::array<T, S> and implements the SIMD interface by overloading all arithmetic operators. For example, the implementation of operator+ can be found in Listing 3.1. The performance gain of this data type depends on the optimizations of the compiler. In particular, one problem in which recent compilers fail is the use of FMA operations in expressions like a += alpha*b if many lanes are involved. Another feature of Dune::LoopSIMD is that it can be used to concatenate another SIMD type into a larger one. For example, VCL only implements types with the hardware SIMD width, i.e. 4 or 8 lanes. If we want to use larger SIMD types, we use Dune::LoopSIMD to concatenate multiple Vec8d or Vec4d; for example, Dune::LoopSIMD<Vec8d, 4> is a SIMD data type with 32 lanes. To use a SIMD data type in a solver, a vector type must be specified which represents the block vector space $\mathbb{R}^{n\times s}$. This can be achieved by using the SIMD data type as the field_type in the Dune::BlockVector template. Practically, this represents a row-major storage of the block vectors. For this vector type, a Dune::LinearOperator that represents the operator $A : \mathbb{R}^{n\times s} \to \mathbb{R}^{n\times s}$ can be implemented, for example, by a Dune::MatrixAdapter that takes a sparse matrix and turns it into a linear operator. A setup of the linear operator type can be seen in Listing 3.2. Preconditioners can be set up based on the same vector type.

¹ See for example https://en.cppreference.com/w/cpp/experimental/parallelism_2

typedef Dune::LoopSIMD<double, 8> field_type;  // SIMD type with 8 lanes, i.e. 8 right-hand sides
typedef Dune::BlockVector<field_type> vector_type;
typedef Dune::BCRSMatrix<double> matrix_type;
typedef Dune::MatrixAdapter<matrix_type, vector_type, vector_type> linear_operator_type;
Listing 3.2: Setup of a linear operator type that operates on block vectors based on a sparse matrix.

template<class X>
class Block {
  // the BAXPY operation x += alpha*Y*sigma
  void axpy(const scalar alpha, X& x, const X& y) const;
  Block& invert();                     // inverts the block
  Block& transpose();                  // transposes the block
  Block& add(const Block& other);      // add another block
  Block& scale(const scalar& factor);  // scale with a scalar
  // multiplication with another Block
  Block& leftmultiply(const Block& other);
  Block& rightmultiply(const Block& other);
};

Listing 3.3: The Block interface used to represent elements in $\mathbb{R}^{s\times s}$.

As we want to be flexible with the choice of the *-subalgebra, the block inner product is implemented in a generic fashion. The existing solvers in Dune-ISTL [BB07] are built upon an interface called Dune::ScalarProduct. We extend this concept of a scalar product, i.e. the return type of the scalar product is not a scalar but a Block, which is a type-erasure container that provides the functionalities shown in Listing 3.3. This design enables us to implement the kernels for different *-subalgebras in a specialized way. The fallback implementation is the parallel case $\mathbb{S}_P$, which was the default behavior in Dune before. We extend the interface further by a function inormalizer(X& x) that computes and returns the normalizer $\operatorname{Norm}_{\mathbb{S}}(X)$ and normalizes the block vector x with respect to the computed normalizer. The i prefix is inspired by the non-blocking MPI functions and indicates that the function returns a Future (see Section 7.1). In the sequential case, the normalizer is computed using the LAPACK [And+99] function xGEQRF, which uses Householder transformations to compute the QR decomposition. An elaborate discussion about how to compute the normalizer in the parallel case can be found in Section 7.3.

3.3 Performance Analysis

Now we look at the performance characteristics of the building blocks that are needed to build a block Krylov method. These are
• BOP: applying the operator A,
• BDOT: computing the block inner product,
• BAXPY: the block vector update.

As already mentioned, the implementation of the normalizer relies on LAPACK in the sequential case; therefore, we do not discuss its performance here. The performance of the preconditioner depends, of course, on its choice. For simplicity, we assume that the preconditioner behaves similarly to BOP. We assume in this section that the operator is an assembled sparse matrix in CSR format with $z$ non-zeros. We further assume that the column index in the CSR format needs as much space in memory as the coefficient (e.g. 64 bit for int and double); this is the unit in which we denote data size. The row indices of the sparse matrix are neglected. This leads to a total memory requirement for the matrix of $2z$. Together with the input and output block vector, $\beta_{\mathrm{BOP}} = 2z + 2sn$ values must be transferred from the main memory to the registers. For the BOP operation, $\omega_{\mathrm{BOP}} = 2sz$ floating-point operations are necessary. Hence, we get an operational intensity of $\frac{sz}{z+sn}$. This means the operational intensity is higher (better) for more right-hand sides $s$ or more non-zeros $z$.

The situation is a bit more sophisticated for the BDOT and BAXPY kernels. However, both kernels behave quite similarly. Both operate on two block vectors that must be loaded from the main memory, which yields $\beta_{\mathrm{BDOT}} = 2ns$. The difference between the kernels is that the BAXPY kernel writes one block vector back to the main memory. Therefore, we have $\beta_{\mathrm{BAXPY}} = 3ns$ memory transfers. We assume that the data of the *-subalgebra element can be cached and therefore does not need to be communicated through the memory hierarchy. The number of floating-point operations depends on the *-subalgebra. In this analysis we consider the cases $\mathbb{S}^p_{BP}$ and $\mathbb{S}^p_{BG}$ from Definition 3.10. For both BDOT and BAXPY, the number of floating-point operations increases quadratically with $p$ and we have $\omega_{\mathrm{BDOT}} = \omega_{\mathrm{BAXPY}} = 2np^2q$. This yields an arithmetic intensity of $p$ for BDOT and $\frac{2}{3}p$ for BAXPY. Table 3.1 summarizes the numerical characteristics of the kernels.

This is the great advantage of the presented framework: the parameter $p$ can be tuned such that the arithmetic intensity matches the properties of the hardware. For many right-hand sides $s$ and small $p$, all kernels are memory-bound and the costs are independent of $p$, as the amount of data that must be loaded does not depend on $p$. Therefore, the parameter $p$ can be chosen as large as possible such that the $p^2$ scaling of the kernels does not have an effect. Up to that $p$, the faster convergence of the block method comes for free and the better arithmetical intensity of the BOP kernel for large $s$ can be preserved.
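As a quick check of the intensities collected in Table 3.1 (a worked step added here for illustration), using $s = pq$ and measuring $\beta$ in transferred values:
$$\frac{\omega_{\mathrm{BOP}}}{\beta_{\mathrm{BOP}}} = \frac{2sz}{2z + 2sn} = \frac{sz}{z + sn}, \qquad \frac{\omega_{\mathrm{BDOT}}}{\beta_{\mathrm{BDOT}}} = \frac{2np^2q}{2ns} = \frac{p^2q}{pq} = p, \qquad \frac{\omega_{\mathrm{BAXPY}}}{\beta_{\mathrm{BAXPY}}} = \frac{2np^2q}{3ns} = \frac{2}{3}\,p.$$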

Table 3.1: Performance-relevant characteristics of the BOP, BDOT and BAXPY kernels. The columns denote the number of floating-point operations $\omega$, the amount of data loaded from main memory $\beta$, and the arithmetic intensity $\omega/\beta$. The number of non-zeros in $A$ is denoted by $z$.

Kernel | $\omega$ | $\beta$    | arith. intensity
BOP    | $2sz$    | $2z + 2sn$ | $sz/(z+sn)$
BDOT   | $2np^2q$ | $2ns$      | $p$
BAXPY  | $2np^2q$ | $3ns$      | $\frac{2}{3}p$

To achieve the best performance, the kernels must be implemented very carefully. In particular, one must ensure good data locality. For our implementation, we iterate over the rows of the block vectors in chunks of 4 rows. We found this number experimentally and suppose that the optimal number depends on the number of registers of the CPU and on how many cycles an FMA operation occupies the registers. Within these chunks we iterate over the rows and compute the corresponding matrix-matrix products. The implementations of the matrix-matrix products are shown in Listing 3.4. This approach was already presented by Stewart [Ste08].

3.4 Numerical Experiments

To compare the different methods in practice, we executed several tests. In this chapter, all tests are carried out on our compute server, which is an Intel Skylake-SP Xeon Gold 6148 with 377 GB main memory. To make the results as reproducible as possible, we deactivate the turbo mode. In the described setting the system has a theoretical peak performance of 76.8 GFlop/s (= 2.4 GHz · 32 Flop/cy) using one core. Measurements show a memory bandwidth of 13.34 GB/s, measured with the daxpy benchmark of the likwid-bench suite [THW10]. Multi-core tests are executed on 20 cores of the machine, which is one NUMA node. In this setting the frequency reduces to 2.2 GHz when using AVX-512 instructions, leading to a theoretical peak performance of 1408 GFlop/s (= 20 · 2.2 GHz · 32 Flop/cy). As the cores share the same memory connection, the memory bandwidth does not scale with the number of cores. The measured memory bandwidth with 20 cores is 98.47 GB/s, also measured with the daxpy benchmark.

In a first test series we compare the run-times of the building blocks BOP, BAXPY and BDOT for the block-parallel and block-global methods. Figure 3.2 shows the runtimes of the kernels per right-hand side in the single-core case. Figure 3.3 shows the runtimes of the same kernels in the multi-core case. We plot the execution time per right-hand side (t/s). For the BOP kernel we carried out the tests for different values of s. We tested two different matrix patterns. One is a very sparse one, resulting from a 2D finite differences discretization of a Poisson problem on a 1000 × 1000 grid with z = 5 non-zeros per row. The other one results from a 3D Q1 finite element discretization on a 100 × 100 × 100 grid with z = 27 non-zero coefficients per row.

// computes c += a^t b template < class SIMD, size_t ChunkSize> void mtm ( const std::array& a, const std::array& b, std::array()>& c){ for ( size_t i=0; i(); ++j){ c[j] += lane(j, a[i])*b[i]; } } }

// computes c += a b template < class SIMD, size_t ChunkSize> void mm( const std::array& a, const std::array()>& b, std::array& c){ for ( size_t i=0; i(); ++i){ for ( size_t j=0; j

0.05 T 0.010 Tcomp comp T Tmem mem 0.04 0.008

] 0.03 ]

0.006 s s [ [ t/s t/s 0.004 0.02

0.002 0.01

0.000 0.00 1 2 4 8 16 32 64 1 2 4 8 16 32 64 s s

(a) BOP with 2D Finite-Differences matrix. (b) BOP with 3D Q1-Finite-Elements matrix.

0.007 block-parallel block-parallel 0.006 block-global 0.005 block-global

Tcomp Tcomp 0.005 0.004 Tmem Tmem ] ] 0.004 s s [ [ 0.003

t/s 0.003 t/s 0.002 0.002

0.001 0.001

0.000 0.000 1 2 4 8 16 32 64 128 256 1 2 4 8 16 32 64 128 256 p p

(c) BDOT for the block-parallel and (d) BAXPY for the block-parallel and block-global case. block-global case.

Figure 3.2: Microbenchmarks for kernels BOP, BDOT and BAXPY executed on one core. Crosses mark the measured data. Dotted lines mark the memory bound. Dashed lines mark the compute bound. finite differences discretization of a Poisson problem ona 1000 × 1000 grid with z = 5 non-zeros per row. The other one results from a 3D Q1 finite element discretization on a 100 × 100 × 100 grid with z = 27 non-zero coefficients per row. For the SIMD interface we used the VC library and combine it with Dune::LoopSIMD to assemble larger SIMD data types as described in the previous section. For the BDOT and BAXPY kernel we used s = 256 and carried out the tests for different p. In the one-core test case we used n = 500 000 and in the multi-core test case we used n = 6 000 000. Further numerical tests show that the run-time of these kernels scale linearly with s. We see that the measured behavior of the kernels matches our theoretical expectation. For the BOP kernel it turns out that for all s the kernel is memory bound and it performs more efficient with larger s. We suppose that the slight increase of the runtime per s 30

Tcomp 0.0010 Tcomp 0.004 T Tmem mem 0.0008 0.003 ] ] s s 0.0006 [ [ t/s

t/s 0.002 0.0004

0.001 0.0002

0.0000 0.000 1 2 4 8 16 32 64 1 2 4 8 16 32 64 s s

(a) BOP with 2D Finite-Differences matrix. (b) BOP with 3D Q1-Finite-Elements matrix.

0.005 0.004 block-parallel block-parallel block-global block-global 0.004 Tcomp 0.003 Tcomp Tmem Tmem 0.003 ] ] s s [ [ 0.002

t/s 0.002 t/s

0.001 0.001

0.000 0.000 1 2 4 8 16 32 64 128 256 1 2 4 8 16 32 64 128 256 p p

(c) BDOT for the block-parallel and (d) BAXPY for the block-parallel and block-global case. block-global case.

Figure 3.3: Microbenchmarks for kernels BOP, BDOT and BAXPY executed on 20 cores. Crosses mark the measured data. Dotted lines mark the memory bound. Dashed lines mark the compute bound. for larger s in the 2D finite differences case is due to cache effects, as fewer rows ofthe block vectors can be cached. For the BDOT and BAXPY kernels we see that the runtime is memory bound for p ≲ 16 in the one-core case and memory bound for p ≲ 64 in the multi-core case. In particular the runtime does not depend on p in this regime. For larger p the run-time increases quadratically, as expected. The runtime of the block-global and block-parallel method does not differ. This was predicted by the theory as well. We already published similar results in [DE20]. 4 Block Conjugate Gradients Method

For the solution of large sparse symmetric positive definite linear systems, the Conjugate Gradients method combined with a proper preconditioner is the method of choice. In the following, we reformulate the block Conjugate Gradients (BCG) method as proposed by O’Leary [OLe80] based on the general framework presented in the last chapter and introduce a novel adaptive stabilization technique based on the paper by Dubrulle [Dub01]. We consider A as a symmetric positive definite operator, as this is a requirement of the BCG method. Furthermore, in this chapter M ∈ L(Rn, Rn) denotes a symmetric positive definite preconditioner.

4.1 Formulation of the Block Conjugate Gradients Method The objective of the BCG method in the kth iteration is to find an approximation k 0 k −1 −1 0 X ∈ X + KS (M A, M R ), which minimizes the block energy error

k ∗ X = arg min ∥Y − X ∥A,F . (4.1) Y ∈X0+Kk(M −1A,M −1R0) S

Before characterizing this minimization property in more detail, we look at a small auxiliary lemma.

Lemma 4.1. Let η ∈ S with

tr (ησ) = 0 ∀σ ∈ S.

Then η = 0 must hold. 32

Proof. Assume that η ≠ 0 and choose σ = ηT. It follows

(︂ )︂ 0 = tr (ησ) = tr ηηT > 0.

Because ηηT ≠ 0 is a positive semi-definite matrix, for which the trace is the sum ofits eigenvalues.

With this lemma, we formulate the following theorem.

Theorem 4.2. The minimization property (4.1) is equivalent to the orthogonality condition

k k (︂ −1 −1 0)︂ ⟨R ,X⟩S = 0 ∀X ∈ KS M A, M R , (4.2) where Rk = AXk − B is the residual for the approximation Xk.

k −1 −1 0 Proof. For any X ∈ KS (M A, M R ), we define the coercive functional

1 k ∗ 2 JX (ε) = ∥X + εX − X ∥ 2 A,F 1 (︂ )︂ = tr ⟨Xk + εX − X∗,Rk + εAX⟩ . 2 S

Note that the minimization of (4.1) is equivalent to the minimization of JX for all k −1 −1 0 X ∈ KS (M A, M R ). Due to the linearity of the trace and the block inner product, the differential of JX computes

(︂ k )︂ DεJX (ε) = tr ⟨R ,X⟩S + ε tr (⟨X, AX⟩S) .

(︂ T)︂ Here we used that tr (⟨X,Y ⟩S) = tr ⟨Y,X⟩S = tr (⟨Y,X⟩S). For the first implication, we assume that Xk is a minimizer of (4.1). Hence, we have

(︂ k )︂ k (︂ −1 −1 0)︂ 0 = DεJXσ(0) = tr ⟨R , Xσ⟩S ∀X ∈ KS M A, M R , σ ∈ S.

From Lemma 4.1, we obtain

k k (︂ −1 −1 0)︂ ⟨R ,X⟩S = 0 ∀X ∈ KS M A, M R .

k k −1 −1 0 For the other implication, we assume that ⟨R ,X⟩S = 0 for all X ∈ KS (M A, M R ). This yields

DεJX (ε) = ε tr (⟨X, AX⟩S) .

k −1 −1 0 k Hence, DεJX (0) vanishes for all X ∈ KS (M A, M R ). Thus, X is a minimizer of (4.1). 4.1 Formulation of the Block Conjugate Gradients Method 33

Now we deduce the formulas for the method. The method computes an A-S {P j}k Kk+1 M −1A, M −1R0 orthogonal basis j=0 of the block Krylov space S ( ). This basis is used to update the initial guess and residual iteratively

Xk+1 = Xk + P kλk (4.3) Rk+1 = Rk − AP kλk. (4.4)

k j k To compute the coefficient λ ∈ S, let an A-S-block orthogonal basis {P }j=0 be given and let Xk be a minimizer of (4.1). By Theorem 4.2, we obtain the minimizer of (4.1) for the following block Krylov space by

(︂ )︂ ⟨Rk+1,X⟩ ∀X ∈ Kk+1 M −1A, M −1R0 0 = S S k k k = ⟨R ,X⟩S − ⟨AP λ ,X⟩S.

j k For X we choose the basis {P }j=0 and get

k j k k j 0 = ⟨R ,P ⟩S − ⟨AP λ ,P ⟩S ∀j = 0, . . . , k.

Due to the A-S-orthogonality, this yields

k j 0 = ⟨R ,P ⟩S ∀j = 0, . . . , k − 1 (4.5) and

k (︂ k k )︂−1 k k λ = ⟨P , AP ⟩S ⟨P ,R ⟩S. (4.6)

By Theorem 4.2, Equation (4.5) holds, as Xk is a minimizer in the Krylov space k 0 k KS (A, R ). We use Equation (4.6) as a definition for λ . The next basis vector P k+1 is then obtained by A-S-orthogonalizing the precondi- tioned residual M −1Rk+1 against the previous basis vectors, it reads

k k+1 −1 k+1 ∑︂ j(︂ j j )︂−1 j −1 k+1 P = M R − P ⟨P , AP ⟩S ⟨P , AM R ⟩S. j=0

Next, we show that for j = 0, . . . , k − 1, the coefficient in the orthogonalization vanishes. A M M −1AP j ∈ Kj+2 M −1A, M −1R0 As and are symmetric and S ( ), we have

j −1 k+1 −1 j k+1 ⟨P , AM R ⟩S = ⟨M AP ,R ⟩S = 0, by using the orthogonality from Theorem 4.2. Hence, we get the update formula

P k+1 = M −1Rk+1 + P kβk (4.7) 34

Algorithm 4.1: Block Conjugate Gradients Method R0 = B − AX0 P 0 = M −1R0 0 0 0 ρ = ⟨P ,R ⟩S for k = 0,... until convergence do Qk = AP k k k k α = ⟨P ,Q ⟩S (︂ )︂−1 λk = αk ρk Xk+1 = Xk + P kλk Rk+1 = Rk − Qkλk k+1 break if ∥R ∥< εtol Zk+1 = M −1Rk+1 k+1 k+1 k+1 ρ = ⟨Z ,R ⟩S (︂ )︂−1 βk = ρk ρk+1 P k+1 = Zk+1 + P kβk end for with

k (︂ k k )︂−1 −1 k k+1 β = ⟨P , AP ⟩S ⟨M AP ,R ⟩S.

To reduce the number of block inner products, we reformulate the coefficients λk and βk as follows

k (︂ k k )︂−1 k k λ = ⟨P , AP ⟩S ⟨P ,R ⟩S (︂ k k )︂−1 −1 k k−1 k−1 k = ⟨P , AP ⟩S ⟨M R − P β ,R ⟩S (︂ k k )︂−1 −1 k k = ⟨P , AP ⟩S ⟨M R ,R ⟩S, (4.8) k (︂ k k )︂−1 −1 k k+1 β = ⟨P , AP ⟩S ⟨M AP ,R ⟩S (︂ k k )︂−1 k−T k k+1 −1 k+1 = ⟨P , AP ⟩S λ ⟨R − R ,M R ⟩S (︂ −1 k k )︂−1 −1 k+1 k+1 = ⟨M R ,R ⟩S ⟨M R ,R ⟩S. (4.9)

Here, we used the orthogonality relation (4.2) and the update formula for the residual (4.4). This reduces the necessary block inner products to

k k k α = ⟨P , AP ⟩S (4.10) and

k −1 k k ρ = ⟨M R ,R ⟩S. (4.11)

Putting together Equations (4.3), (4.4), (4.7), (4.8), (4.9), (4.10) and (4.11), we 4.2 Convergence 35

obtain Algorithm 4.1. k In Algorithm 4.1, we choose ∥R ∥< εtol as a break criteria, where we do not specify which norm is used. A natural choice would be the Frobenius norm which is the Euclidean norm on the block vector space. However, another possibility would be to choose the maximum column norm

k {︂ k }︂ ∥R ∥∞:= max ∥Ri ∥2 | i = 1, . . . , s .

That ensures that the residual norm of each column is smaller than εtol. In all our numerical tests, we used the latter, as this is the desired condition in most applications.

4.2 Convergence We start with a general result that holds for all choices of *-subalgebras S. From that result, we derive statements about the convergence in the elementary cases. These statements can be combined to obtain statements about the convergence of the combined *-subalgebras defined in Definition 3.9.

Lemma 4.3 (Generic convergence result). For the error Ek of the kth step of the BCG method, the estimation

k k −1 0 ∥E ∥A,F ≤ inf∥Q (M A) ◦ E ∥A,F (4.12) Qk

k k holds, where the infimum is taken over all S-valued polynomials Q ∈ PS of degree k with absolute coefficient I.

Proof. We use S-valued polynomials to represent the energy error. By Lemma 3.3, we can represent the kth error of the BCG method as

Ek = X∗ − Xk = X∗ − X0 − Pk(M −1A) ◦ M −1R0 (︂ k −1 −1 )︂ 0 = I − P (M A)M A ◦ E ,

Pk ∈ k−1 Qk − Pk for some S-valued polynomial PS . With (x) = I (x)x and taking the A-Frobenius norm, we obtain

k k −1 0 ∥E ∥A,F = ∥Q (M A) ◦ E ∥A,F .

As the BCG method minimizes the A-Frobenius norm in the Krylov space, we get

k k −1 0 ∥E ∥A,F = inf∥Q (M A) ◦ E ∥A,F Qk

k k by taking the infimum of all Q ∈ PS of this shape. 36

The next lemma gives a concrete error bound for all *-subalgebras that we consider in this work. It is the generalization of Theorem 2.4. However, the given bound is not sharp in all cases.

Lemma 4.4. Let S be a *-subalgebra of Rs×s, that contains the identity of Rs×s. Then we have

(︄√ )︄k k κ − 1 0 ∥E ∥A,F ≤ 2 √ ∥E ∥A,F , κ + 1

− 1 − 1 where κ denotes the condition number of the preconditioned operator M 2 AM 2 .

Proof. We are using Theorem 4.3 and choose

k k Q (x) = T˜︁ (x)I in Equation (4.12), where T˜︁k are the scaled Chebyshev polynomials defined by Equa- tion (2.5), scaled with respect to the eigenvalues λmin and λmax of the symmetrically − 1 − 1 preconditioned operator M 2 AM 2 . As this operator is symmetric positive-definite, it is similar to the diagonal matrix of its eigenvalues Λ, denoted by

− 1 − 1 −1 M 2 AM 2 = V ΛV , where V is the orthonormal matrix of the eigenvectors. Similar to the proof of Theorem 2.4, we compute

(︃ )︃ k −1 0 2 (︂ k −1 0)︂T k −1 0 ∥T˜︁ (M A) ◦ E ∥A,F = tr T˜︁ (M A)E AT˜︁ (M A)E (︃ )︃ (︂ k −1 1 0)︂T k −1 1 0 = tr T˜︁ (Λ)V M 2 E ΛT˜︁ (Λ)V M 2 E (︃ )︃ (︂ −1 1 0)︂T 1 k 2 1 −1 1 0 = tr V M 2 E Λ 2 T˜︁ (Λ) Λ 2 V M 2 E

n k 2 (︂ 0T 0)︂ ≤ max|T˜︁ (λi) | tr E AE i=0 √ 2k (︄ κ − 1)︄ ≤ 4 √ ∥E0∥2 . κ + 1 A,F

Applying the square-root completes the proof.

This theorem also applies for the block *-subalgebra SB. However, in this case the estimation can be improved. O’Leary showed the following convergence result to estimate the error of the classical BCG method. 4.2 Convergence 37

Theorem 4.5 (Convergence of the block Conjugate Gradients Method [OLe80, Theo- k rem 5]). For the energy-error of the ith column in the kth iteration ∥Ei ∥A of the BCG method, the following estimation holds:

k k ∥E ∥A ≤ c1µ √ i κs − 1 λn with µ = √ , κs = and some constant c1 > 0, κs + 1 λs

− 1 − 1 where λ1 ≤ ... ≤ λn denote the eigenvalues of the preconditioned matrix M 2 AM 2 . 0 The constant c1 depends on s and the initial error E but not on k or i.

The proof also makes use of on Lemma 4.3 but the construction of the polynomials is much more sophisticated and technical. As we want to concentrate on the practical aspects in this work, we refer the reader to [OLe80] for the rigorous proof. The theorem holds for the classical BCG method (SB). However, as the block- parallel method is only a data-parallel version of the block method the same convergence p rate holds with s = p for the SBP method, it reads √ κp − 1 µ = √ . κp + 1

The following lemma gives us a convergence rate for the block-global method.

Lemma 4.6 (Theoretical convergence rate of the block-global method). The theoretical p convergence rate of a block-global method using SBG is √︂ κˆp − 1 λn µˆ = √︂ , with κˆp = . κ λ p ˆp + 1 ⌈ q ⌉

Proof. A block-global method is equivalent to solve the qn-dimensional system

⎛ ⎞ ⎛ ⎞ ⎛ ⎞ A X1 ··· Xp B1 ··· Bp ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ A ⎟ ⎜ Xp+1 ··· X2p⎟ ⎜ Bp+1 ··· B2p⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ . ⎟ ⎜ . ⎟ = ⎜ . ⎟ ⎜ .. ⎟ ⎜ . ⎟ ⎜ . ⎟ ⎝ ⎠ ⎝ . ⎠ ⎝ . ⎠ A Xs−p+1 ··· Xs Bs−p+1 ··· Bs with the classical block Krylov method with p right-hand sides. The matrix of this system has the same eigenvalues as A but with q times the multiplicity. Thus, the p-smallest eigenvalue is λ p . Therefore and by applying Theorem 4.5, we deduce the ⌈ q ⌉ theoretical convergence rate.

This result makes the block-global methods irrelevant for practical use. In particular for q > 1, the block-parallel method would perform better while the building blocks are similarly expensive, as we have seen in Chapter3. 38

4.3 Residual Re-Orthonormalization k k k k k k Algorithm 4.1 requires that ρ = ⟨Z ,R ⟩S and α = ⟨Q ,P ⟩S are invertible. It is the case when the residual Rk has full-rank. In the scalar case, s = 1, this is not a problem. If the residual is rank-deficient, the linear system has been solved. In the case, s > 1, however, this leads to severe problems. An interpretation of the rank deficiency of the residual is that a linear combination is converged, because there exists a vector y ∈ Rs, such that

0 = Rky = AXky − By ⇔ AXky = By.

0 In exact arithmetic, this case appears in the iteration ξS(A, R ) defined by Equation (3.3). Initially, O’Leary [OLe80] suggested to remove dependent vectors and continue the iteration with a smaller block size. This strategy is called deflation and has two disadvantages from our perspective. Firstly, a numerical tolerance parameter must be introduced to check for the numerical rank-deficiency of the residual. Langou [Lan03] showed in his PhD thesis that a badly chosen parameter could lead to instabilities or slow down the convergence. Secondly, we want to choose the block width s as a multiple of the SIMD width to facilitate SIMD-vectorization. Deflating the system would change the block width, such that some effort is needed to handle it in our SIMD setting and we would loose the performance benefits from the exploration of the SIMD instructions. Dubrulle [Dub01] presented multiple approaches to mitigate the stabilization issues without decreasing the block size. The most promising approach is the orthonormal- ization of the residual in every iteration by computing a QR decomposition. The algorithm is very elegant without preconditioning, as ρ simplifies to the identity. As we use preconditioning, this does not hold anymore, as the M-product is used to compute ρ. Therefore, we either need to make the orthonormalization with respect to the M-product, or ρ must be computed explicitly after the orthonormalization of the residual, which needs an additional global communication. We decided to use the normalizer for the orthonormalization and compute ρ explicitly thereafter, which is the reason why we defined the normalizer also for rank-deficient block vectors. k We store the transformation from the orthonormal residual R¯ to the real residual in the variable σk ∈ S. It can be updated as

σk = γkσk−1,

k−1 where γk is the normalizer of the updated residual R˜︁k = R¯ − Qkλk, i.e.

k R˜︁k = R¯ γk. 4.3 Residual Re-Orthonormalization 39

Algorithm 4.2: BCG Method with Residual Re-Orthonormalization R0 = B − AX0 ¯ 0 0 0 R , σ = NormS (R ) 0 P 0 = M −1R¯ 0 0 ¯ 0 ρ = ⟨P ,R ⟩S for k = 0,... until convergence do Qk = AP k k k k α = ⟨P ,Q ⟩S (︂ )︂−1 λk = αk ρk Xk+1 = Xk + P kλkσk k R˜︁k+1 = R¯ − Qkλk ¯ k+1 k+1 (︂ k+1)︂ R , γ = NormS R˜︁ σk+1 = γk+1σk k+1 break if ∥σ ∥< εtol k+1 Zk+1 = M −1R¯ k+1 k+1 ¯ k+1 ρ = ⟨Z ,R ⟩S (︂ )︂−1 T βk = ρk γk+1 ρk+1 P k+1 = Zk+1 + P kβk end for

This transformation is then also used to update the solution

Xk = Xk−1 + P k−1λk−1σk−1,

k−1 because the search direction P k−1 is obtained from the transformed residual R¯ . For the same reason, we must consider the normalizer in the orthogonalization coefficient βk, because P k and P k−1 are transformed with respect to σk and σk−1, respectively. Thus, we have

(︂ )︂−1 T βk = ρk−1 γk ρk.

The resulting algorithm is shown in Algorithm 4.2. Note that it is not necessary to compute the real residual for checking the convergence criterion, as we have

k k k k ∥Ri ∥= ∥Q σi ∥= ∥σi ∥,

k k where σi denotes the ith column of σ . As the orthonormalization is expensive, it makes sense to skip it in iterations in which it is not necessary. To specify a criterion for the adaptive orthonormalization, we define the diagonally scaled condition number. 40

Definition 4.7 (Diagonally scaled condition number). For a k×k α ∈ R , we define the diagonally scaled condition number κD(α) as

− 1 − 1 κD(α) = κ(δ 2 αδ 2 ), where δ = diag (α) is the diagonal of α and κ denotes the condition number.

In contrast to the condition number, the diagonally scaled condition number equals 1 for diagonal matrices. This is desirable in particular in the parallel case. As it is only a parallel version of the scalar CG, no re-orthonormalization is necessary. Using the usual condition number, in this case, could deliver high numbers if the columns are scaled differently and would lead to superfluous re-orthonormalizations. Weuse the diagonally scaled condition number of αk for an indicator of the numerical rank deficiency of the residual. To check this, we evaluate

k √ ηκD(α ) > εmach, (4.13) where η is a tuning parameter and εmach is the machine precision of the used numerical type. Algorithm 4.3 shows the resulting algorithm. This approach assumes that the diagonally scaled condition number increases continuously with the iterations. This is not clear though. However, numerical tests show that this approach works quite well. Nevertheless, there is some mathematical background missing. An alternative approach would be to roll-back one iteration if the re-orthonormalization criterion (4.13) is satisfied. This would increase the memory requirements by one block vector and leads to some overhead as some computations must be redone. Theoretically, the normalization could add artificial directions to the orthogonal residual, if the residual is rank-deficient. In the case where this direction is already contained in the Krylov space, this has no effect, because it is orthogonal to the residual. Otherwise, this would accelerate the convergence as it enhances the Krylov subspace. Whether these new directions could be chosen more cleverly and the comparison with deflation strategies is an objective of future work.

4.4 Numerical Experiments As a first test series, we executed test runs to approve the convergence theory developed in Section 4.2. In Figure 4.1, the convergence behavior for different block sizes p and the block-global and block-parallel case is shown. As operator we used the thermal2 matrix from the SuiteSparse Matrix Collection [Law+02]. This matrix provides a realistic size for tests on one machine and we observed that the AMG preconditioner from the Dune framework worked sufficiently good. For the tests we used s = 256 randomly generated right-hand sides. 4.4 Numerical Experiments 41

Algorithm 4.3: BCG Method with Adaptive Residual Re-Orthonormalization R0 = B − AX0 if η > 0 then ¯ 0 0 0 R , σ = NormS (R ) else 0 R¯ = R0 0 σ = IS end if 0 P 0 = M −1R¯ 0 0 ¯ 0 ρ = ⟨P ,R ⟩S for k = 0,... until convergence do Qk = AP k k k k α = ⟨P ,Q ⟩S (︂ )︂−1 λk = αk ρk Xk+1 = Xk + P kλkσk k R˜︁k+1 = R¯ − Qkλk k √ if ηκD(α ) > εmach then ¯ k+1 k+1 (︂ k+1)︂ R , γ = NormS R˜︁ σk+1 = γk+1σk k+1 break if ∥σ ∥≤ εtol else k+1 R¯ = R˜︁k+1 k+1 γ = IS σk+1 = σk ¯ k+1 k+1 break if ∥R σ ∥≤ εtol end if k+1 Zk+1 = M −1R¯ k+1 k+1 ¯ k+1 ρ = ⟨Z ,R ⟩S (︂ )︂−1 T βk = ρk γk+1 ρk+1 P k+1 = Zk+1 + P kβk end for 42

block-parallel, P=1 block-parallel, P=8 106 block-parallel, P=16 block-parallel, P=64 105 block-parallel, P=256 block-global, P=1 block-global, P=8 104 block-global, P=16 block-global, P=64 block-global, P=256

∥R∥F 103

102

101

100

0 5 10 15 20 25 30 iterations

Figure 4.1: Frobenius norm of the residual vs. iterations of BCG methods. Different choices for the *-subalgebra S are taken into account. Dashed lines with crosses decode the block-parallel method. Dotted lines with circles decode the block-global method. The colors decode the parameter p.

The results confirm perfectly the theoretical expectations. In particular, we obtain a faster convergence (per iteration) for larger block size p. Furthermore, we see that p the convergence rates in the block-global method SBG with p up to 8 are similar to the convergence rate of the parallel method SP . This fits to the outcome of Lemma 4.6, ⌈︂ p ⌉︂ as it predicted that the convergence rate depends on the q smallest eigenvalue. For ⌈︂ p ⌉︂ p ≤ 8, we have q = 1. For the same reason, we see a connection of the convergence 64 rate of the block-global method with p = 64, i.e. SBG and block-parallel method with 16 p = 16, i.e. SBP . Note that the increase in the first iteration is due to the fact that the Frobenius norm of the residual is plotted, instead of the energy Frobenius norm of the error, which converges monotonically. Further, we tested the methods behavior depending on the re-orthonormalization parameter η. Table 4.1 shows iteration and orthonormalization counts for different choices of η and different floating-point precision. The HB/1138_bus matrix from the SuiteSparse Matrix Collection [DH11] with a SSOR preconditioner was used with 256 randomly generated right-hand sides. The parallel method (SP ) is shown for comparison. 4.4 Numerical Experiments 43

Table 4.1: Iteration counts and numbers of residual re-orthonormalization for single and double precision. Blank cells indicate that the method did not converged within 1000 iterations. The parallel case is added for comparison. In all other rows the block-parallel method with p = 64 was used. precision single double η iterations ortho. iterations ortho. parallel 868 0 514 0 0 0.1 434 10 146 3 1 344 14 137 5 10 286 30 137 5 100 229 55 104 3 1000 231 230 88 3 ∞ 237 238 74 75

64 In all other rows the block-parallel with p = 64 (SBP ) is used. It can be seen that the number of iterations decreases if we use a higher re-orthonormalization parameter η. It can also be seen that a higher η not necessarily leads to more re-orthonormalizations. Hence, the moment of the re-orthonormalization seems to be important, a fact that can also be observed in the next experiment. Figure 4.2 shows the convergence history of the BCG algorithm in the same setting used in Figure 4.1, for different re-orthonormalization parameters η. We used the block-parallel setting with p = 64. We see that the re-orthonormalization is necessary to achieve convergence and the re-orthonormalization must be reapplied during the iteration - it is not sufficient to orthonormalize the initial residual. Furthermore, we see that choosing η = 1 yields a converging method, but does not ensure optimal convergence rates. Only in cases where in iteration 0 and 1 a orthonormalization was applied the convergence rate was optimal. The additional re-orthonormalization in iteration 18 for η = 10 000 seems to be superfluous. Table 4.2 shows the iteration counts, the number of re-orthonormalizations and the runtime of the BCG method for the block-parallel method and different value of p for sev- eral symmetric positive definite matrices of the SuiteSparse Matrix Collection [Law+02]. We used an incomplete Cholesky preconditioner and the re-orthonormalization param- eter η = 10 000. All systems are solved for s = 256 randomly generated right-hand sides. The fastest runtime per matrix is marked with a green background. Missing numbers mark that no convergence was achieved within 1000 iterations. We aimed for a reduction of the residual in every column by a factor of 10−4. The result shows that the number of iterations can be reduced drastically by using a higher p. We suppose that this is due to the weak preconditioner, that mainly smooths the larger eigenvalues. For example for the bcsstk15 matrix, the number of iterations 44

106

105

104

∥R∥ F 103

102

101 η = 0 initial reorthogonalization only η = 1 100 η = 1000 η = 10000

0 5 10 15 20 25 30 35 40 iterations

Figure 4.2: Convergence of the BCG method for different re-orthonormalization parameters. The symbols mark the iteration in which a re-orthonormalization happened. Colors decode the different values of the re-orthonormalization parameter η. was reduced by a factor of ∼ 25 by using p = 32 compared to the parallel case. In some cases the block Krylov methods help to achieve convergence at all, e.g. in the bcsstk17 case, where the parallel case fails to converge within 1000 iterations, but the block methods converge within 73 iterations for the p = 32 case. For the bcsstk16 matrix the parallel method is fastest, although it need the most iterations. This is due to the re-orthonormalization costs, as the other building blocks are similar expensive for the p = 32 case. An improvement of the re-orthonormalization criteria is one of the goals of future work. 4.4 Numerical Experiments 45

Table 4.2: Iteration counts, runtime and number of re-orthonormalizations for the solution of several matrices from MatrixMarket with different p. The fastest time per row is marked with a green background. Blank cells indicate that the method did not converge within 1000 iterations. p = 1 p = 32 p = 256 Matrix #it #ro t #it #ro t #it #ro t bcsstk14 251 0 3.18 16 13 1.15 7 5 1.79 bcsstk15 573 0 12.24 23 12 1.97 10 5 2.41 bcsstk16 26 0 1.02 9 1 1.20 6 2 1.64 bcsstk17 73 71 14.54 16 15 8.09 bcsstk18 419 0 21.89 49 5 6.59 15 4 6.23 s1rmq4m1 85 0 3.36 16 2 1.90 9 2 2.81 s1rmt3m1 166 0 5.75 24 3 3.09 12 2 3.36 s2rmq4m1 107 0 4.24 17 3 2.02 11 2 3.33 s2rmt3m1 226 0 7.78 30 3 3.86 14 2 4.02 s3dkq4m2 217 8 190.09 65 9 138.10 s3rmq4m1 197 0 7.82 21 3 2.84 12 3 3.71 s3rmt3m1 478 0 16.80 33 5 3.03 17 5 5.13 s3rmt3m3 442 0 14.40 33 5 3.69 15 4 3.98 5 Block GMRes Method

In the previous chapter we looked at the block Conjugate Gradients method, which is the method of choice for symmetric positive definite problems. We now discuss the block GMRes (BGMRes) method [Vit90], which is a block version of the GMRes method by Saad and Schultz [SS86]. In contrast to the BCG method, it does not have any requirements on the operator. As a downside, the GMRes method can not make use of a short recurrence, i.e. the memory and arithmetically costs per iteration increase with every iteration. Nevertheless, it is one of the most important Krylov methods in practice. Note that for symmetric indefinite problems there are also block versions of the MinRes [Soo15] and conjugate residual [ZZ13] method, which are not subject of this work. Good introductions into the classical block GMRes method can be found in the monographs of Gutknecht [Gut07] and Saad [Saa03]. We will formulate the BGMRes method based on the block Krylov framework. Recently, Kubínová and Soodhalter [KS20] presented a paper that also discusses the BGMRes method in this framework and contains some results about the convergence of the method. In addition, a generalization of the Givens rotations that are used in the non-block case to triangulate the Hessenberg matrix is described. We pick up this generalization in the first section and formulate the BGMRes method. We present some simple convergence results in the second section and refer the reader to [KS20] for a more elaborate discussion. Finally, we present some numerical experiments that give further insights into the convergence behavior of the BGMRes method for different *-subalgebras.

5.1 Formulation of the Block GMRes Method The BGMRes method is based on the block Arnoldi process [Arn51; Ruh79], which computes an orthonormal basis of the Krylov space and was originally invented to 5.1 Formulation of the Block GMRes Method 47

Algorithm 5.1: Block Arnoldi Method 0 n×s 0 0 Let V ∈ R with ⟨V ,V ⟩S = 0 be given. for k = 1, . . . kmax do V k = AV k−1 for j = 0, . . . , k − 1 do j k ηj,k−1 = −⟨V ,V ⟩S k k j V ← V + V ηj,k−1 end for k (︂ k)︂ V , ηk,k−1 ← NormS V end for compute the eigenvalues of an operator. It is based on the block Gram-Schmidt orthogonalization process, see Algorithm 5.1. The coefficients η build a block matrix H ∈ Sk+1×k, that has block Hessenberg form, i.e. all blocks below the first off-diagonal under the diagonal are zero. The resulting [︂ ]︂ basis V = V 0,...,V kmax satisfies the so called block Arnoldi relation

AV˜︁ = VH,

[︂ ]︂ where V˜︁ = V 0,...,V kmax−1 . As the normalizer is also defined for rank-deficient block vectors, we do notgeta breakdown in the case where V k is rank-deficient after the Gram-Schmidt orthonormal- ization. The normalization process adds additional directions to the Krylov space in this case. Theoretically the orthogonalization must be repeated to orthogonalize the additional directions to the previous block vectors. However, numerical experiments show, that this is not necessary, even if the rigorous analysis of this effect is still missing. Algorithm 5.1 uses the modified Gram-Schmidt procedure, meaning the computation of the block inner products and the vector updates are interleaved. The modified Gram- Schmidt procedure is more stable than the classical Gram-Schmidt procedure which computes all block inner products in advance. However, as the block inner product computes multiple inner products simultaneously the stability could be affected. This could be mitigated by either using the “real” modified Gram-Schmidt that considers the columns of the block vectors individually or by doing a re-orthogonalization like presented by Björck [Bjö94]. Buhr et al. [Buh+14, Algorithm 1] presented an adaptive re-orthogonalization strategy for the Gram-Schmidt procedure, that could be applied to decide adaptively whether a re-orthonormalization is necessary. k k 0 The goal of the BGMRes method is to compute an update U ∈ KS (A, R ) for the initial guess X0 that solves the minimization problem

k 0 U = arg min ∥B − AX − AY ∥F . (5.1) Y ∈Kk(A,R0) S 48

In other words, the Frobenius norm of the block residual is minimized. With help of the Arnoldi basis the update reads

U k = V˜︁ζk, where ζk ∈ Sk denote the coefficients of U k in the basis V˜︁. Using the block Arnoldi relation, the orthonormality of V and a block QR decomposition QR = H, with Q ∈ Sk+1×k and R ∈ Sk×k we rewrite the minimization problem (5.1) as

0 k 0 k ∥B − AX − AU ∥F = ∥R − AV˜︁ζ ∥F 0 k = ∥R − VHζ ∥F T 0 k = ∥V R − Hζ ∥F T T 0 k = ∥Q V R − Rζ ∥F .

This minimization problem can then be solved for ζk by block-wise backward- substitution. In the non-block case the QR decomposition of H is computed using Given rotations to eliminate the lower off-diagonal entries. This can be generalized for our block Krylov framework. For that, we compute a full QR decomposition of the diagonal and lower off-diagonal entry, starting with

⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ρ0 η0,0 T I 0 2×2 Q0 ⎝ ⎠ = ⎝ ⎠ Q0 Q0 = ⎝ ⎠ Q0 ∈ S , ρ0 ∈ S. 0 η1,0 0 I

The lower off-diagonal element can then be eliminated by

⎛ ⎞ T ⎛ ⎞ ⎛ ⎞ ⎜Q ⎟ η0,0 ρ0 ⎜ 0 ⎟ ⎜ * ⎟ ⎜ * ⎟ ⎜ ⎟ ⎜η1,0 η1,1 ⎟ ⎜ 0 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ I ⎟ ⎜ . ⎟ = ⎜ . ⎟ . ⎜ ⎟ ⎜ η .. ⎟ ⎜ η .. ⎟ ⎜ .. ⎟ ⎜ 2,1 ⎟ ⎜ 2,1 ⎟ ⎜ . ⎟ ⎝ . ⎠ ⎝ . ⎠ ⎝ ⎠ .. .. I

The star indicates non-zero entries in the upper triangle. This procedure is repeated to eliminate the other lower off-diagonal entries. The Q-factor of the QR decomposition 5.2 Convergence 49

of H is then build by concatenating all the Q-factors of the smaller QR decompositions

⎛ ⎞ ⎛ ⎞ T I ⎜Q ⎟ ⎜ ⎟ ⎜ 0 ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ T ⎟ ⎜ ⎟ ⎜ Q ⎟ ⎜ ⎟ ⎜ 1 ⎟ ⎜ I ⎟ ⎜ ⎟ Q = ⎜ ⎟ ⎜ ⎟ ··· ⎜ I ⎟ ⎜ I ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ . ⎟ ⎜ . ⎟ ⎜ .. ⎟ ⎜ .. ⎟ ⎝ ⎠ ⎝ ⎠ I I

In the algorithm the transformation of R0 and the QR decomposition of H is performed on-the-fly. The vector

⎛ ⎞ σ0 ⎜ ⎟ ⎜ . ⎟ T T 0 σ = ⎜ . ⎟ = Q V R ⎝ ⎠ σkmax is updated during the iteration by

0 (︂ −1 0)︂ σ = NormS M R , ⎛ k ⎞ ⎛ k⎞ σ T σ ⎝ ⎠ ← Qi ⎝ ⎠ . σk+1 0

The Frobenius norm of σk+1 can be used to determine the residual in the k-iteration, as

k 0 k ∥R ∥F = ∥R − AV˜︁ζ ∥F 0 k = ∥R − VQRζ ∥F T T 0 k = ∥Q V R − Rζ ∥F k k+1 = ∥σ − Rζ ∥F = ∥σ ∥F .

Algorithm 5.2 shows the BGMRes algorithm. Preconditioning can be easily imple- mented by adapting lines2 and4.

5.2 Convergence For the GMRes method we do not have a general theoretical statement about the convergence rate, like for the CG case. Rather Greenbaum, Pták and Strakoš [GPS96] showed that any non-increasing convergence curve for the GMRes method is possible. This result was recently generalized by Kubínová and Soodhalter [KS20] for the BGMRes method. A convergence theory for the BGMRes method for special classes of operators was presented by Simoncini and Gallopoulos [SG96]. As the BGMRes method minimizes the Frobenius norm of the residual in the block 50

Algorithm 5.2: Block GMRes Method 1: R0 = B − AX0 0 0 0 2: V σ = NormS (R ) 3: for k = 0, . . . , kmax − 1 do 4: V k+1 = AV k 5: for j = 0, . . . , k do j k+1 6: ηj,k = −⟨V ,V ⟩S k+1 k+1 j 7: V ← V + V ηj,k 8: end for k+1 (︂ k+1)︂ 9: V , γ ← NormS V 10: for j = 0, . . . , k − 1 do (︄ )︄ (︄ )︄ ηj,k T ηj,k 11: ← Qj ηj+1,k ηj+1,k 12: end for (︄ )︄ (︄ )︄ ηk,k ηk,k 13: Qk ← ▷ Compute QR decompostion 0 γ (︄ σk )︄ (︄σk)︄ 14: ← Q T σk+1 k 0 k+1 15: break if ∥σ ∥≤ εtol 16: end for 17: for l = kmax − 1,..., 0 do ▷ back-substitution l l ∑︁kmax−1 j 18: σ ← σ − j=l+1 ηl,jσ l −1 l 19: σ ← (ηl,l) σ 20: end for kmax 0 ∑︁kmax−1 j j 21: X = X + j=0 V σ kmax 22: if ∥σ ∥> εtol then 23: restart with X0 ← Xkmax 24: end if

Krylov space we have

k+1 k ∥R ∥F ≤ ∥R ∥F .

This ensures a monotonic convergence of the Frobenius norm of the residual, but it cannot be ensured that the residual actually decreases, see Example 2.3. To deduce better estimations a-priori knowledge about the operator is necessary. We use again the polynomial representation to formulate an abstract statement about the convergence rate, similar to the convergence proof of the BCG method. This statement could be used to deduce concrete estimations if further assumptions on the operator are made. 5.2 Convergence 51

Lemma 5.1 (Abstract convergence of BGMRes method). Let S be a *-subalgebra of Rs×s. For the residual Rk of the kth step in the BGMRes method the following estimation holds

k k 0 k 0 ∥R ∥F ≤ inf∥Q (A) ◦ R ∥F ≤ inf∥Q (A)∥∥R ∥F . Qk Qk

k k The infimum is taken over all S-valued polynomials Q ∈ PS with absolute coefficient I. The norm ∥Qk(A)∥ denotes the operator norm of Qk(A) in the space L(Rsn, Rsn).

Proof. The BGMRes method computes the best approximation in the space 0 k 0 X + KS (A, R ). As we can represent the elements in the Krylov space with S-valued polynomials we obtain

Rk = B − AXk (5.2) = B − AX0 − AU k = R0 − APk−1(A) ◦ R0 (︂ k−1 )︂ 0 = I − AP (A) ◦ R = Qk(A) ◦ R0, where Qk(x) = I − xPk(x). Applying the Frobenius norm and taking the infimum completes the proof, as the Frobenius norm on Rn×s and the Euclidean norm on the space Rns coincide. The next lemma gives an example how this estimation could be used to create more concrete estimations. This is a generalization for the block Krylov framework of Theorem 3.1 in the work of Simoncini and Gallopoulos [SG96, Theorem 3.1].

Lemma 5.2. If the operator A is diagonalizable

−1 A = V ΛV , with Λ = diag (λ1, . . . , λn) estimation (5.2) can be precised as

⃦ ⃦ ⃦ k ⃦ k n ⃦∑︂ j ⃦ 0 ∥R ∥F ≤ κ(V ) inf max ⃦ λi cj⃦ ∥R ∥F , c ...,c ∈ i=1 ⃦ ⃦ 1 k S ⃦j=0 ⃦ where c0 = I.

k Proof. Let c1, . . . , ck ∈ S denote the coefficients of the S-valued polynomial Q . Then 52

we write ⃦ ⃦ ⃦ k ⃦ ⃦ k ⃦ ⃦∑︂ j ⃦ ⃦Q (A)⃦ = ⃦ A ⊗ ci⃦ ⃦ ⃦ ⃦ ⃦ ⃦j=0 ⃦ ⃦ ⃦ ⃦ k ⃦ ⃦∑︂ j −1 ⃦ = ⃦ V Λ V ⊗ ci⃦ ⃦ ⃦ ⃦j=0 ⃦ ⃦ ⎛ ⎞ ⃦ ⃦ k ⃦ ⃦ ∑︂ j (︂ −1 )︂⃦ = ⃦(V ⊗ ) ⎝ Λ ⊗ ci⎠ V ⊗ ⃦ ⃦ I I ⃦ ⃦ j=0 ⃦ ⃦ ⃦ ⃦ k ⃦ n ⃦∑︂ j ⃦ ≤ κ(V ) max ⃦ λi cj⃦ i=1 ⃦ ⃦ ⃦j=0 ⃦

i The challenge is to choose good coefficients c1, . . . , ck ∈ S. If all λ are positive, then probably the scaled Chebyshev polynomials would yield an estimation similar to the CG case. See the recent paper of Kubínová and Soodhalter[KS20] for a detailed discussion. Further convergence results of the classical BGMRes method can be found in the paper of Simoncini and Gallopoulos [SG96].

5.3 Numerical Experiments As the theoretical convergence results are still quite vague yet, we rely on numerical test to get an impression of the convergence behavior of the method with respect to the different *-subalgebras. As the BGMRes method minimizes the Frobenius norm ofthe residual we know that for the same p the block-parallel method converges faster than the block-global method. In both cases the larger the p the better the convergence rate (per iteration), cf. Lemma 3.12 and Lemma 3.4. It is confirmed by the result presented in Figure 5.1. It shows the convergence of the BGMRes method for the Simon/raefsky3 matrix from the SuiteSparse Matrix Collection [DH11]. The problem consists of 21 200 unknowns and originates from a computational fluid dynamics problem. We use an ILU(0) preconditioner and solve for s = 256 randomly generated right-hand sides until a reduction of the 2-norm of the residual for every column by a factor of 10−4 is reached. We see a relation of the convergence rates of the p-block-global method and the p 128 q -block-parallel method, like in the BCG case. For example the convergence for SBG 64 (p = 128 block-global) and SBP (p = 64 block-parallel) is almost identical. The same 64 16 holds for SBG (p = 64 block-global) and SBP (p = 16 block-parallel). That indicates that a similar result to Lemma 4.6 could also be possible for the BGMRes method. Note that choosing a large restart parameter in the BGMRes method is crucial for achieving good convergence. Often the choice of that parameter is limited by the memory of the machine. This means the restart length directly competes with the 5.3 Numerical Experiments 53

1012 block-parallel, p = 1 block-global, p = 1 block-parallel, p = 2 block-global, p = 2 block-parallel, p = 4 block-global, p = 4 11 10 block-parallel, p = 8 block-global, p = 8 block-parallel, p = 16 block-global, p = 16 block-parallel, p = 32 block-global, p = 32 block-parallel, p = 64 block-global, p = 64 1010 block-parallel, p = 128 block-global, p = 128 block-parallel, p = 256 block-global, p = 256

∥R∥F 109

108

107

106 0 10 20 30 40 50 60 iterations

Figure 5.1: Convergence of block GMRes method. Dashed lines with crosses decode the block-parallel methods. Dotted lines with circles denote the block-global method. Colors decode the blocking parameter p. number of right-hand sides that can be used. If this is an issue, probably the block BiCGStab method which is considered in the next chapter is a better choice to solve the problem, as its memory requirements are constant. 6 Block BiCGStab Method

As a third block Krylov method we look at the Block BiCGStab (BBiCGStab) method, a block version of the BiCGStab method presented by Van der Vorst [Van92]. This chapter is based on the paper by El Guennouni, Jbilou and Sadok [EJS03]. They deduce the block BiCGStab method, i.e. the case S = Rs×s. We adapt this deduction for the block Krylov framework. Like the GMRes method, the BiCGStab method was developed for non-symmetric problems. Unlike the GMRes method, it does not store a basis and is therefore better suited for memory limited systems. This advantage comes with the price, that no minimization property is satisfied by the approximate solution. Hence, it is difficult to develop theoretical convergence results. As with the other methods, we reformulate the BBiCGStab method based on the block Krylov framework in the first section. In the second section we introduce a stabilization for the BBiCGStab method, similar to that for the BCG method. Up to the authors knowledge such a stabilization strategy was not presented before. Finally, we present some numerical results in section 6.3.

6.1 Formulation of the Block BiCGStab Method Like the name suggests, the BBiCGStab method is based on the block BiCG (BBiCG) method [OLe80], but adds a stabilization step to mitigate instabilities. Another issue of the BBiCG method is that the transposed of the operator must be applied to a block vector. The BBiCG method actually solves an additional linear system with i i the transposed operator and computes bases (P )i and (P˜︁ )i of the Krylov spaces (︂ )︂ Kk (M −1A, M −1R0) and Kk M −TAT,M −TR˜︁0 , for some block vector R˜︁0 ∈ Rn×s and preconditioner M ∈ Rn×n. The residuals of the linear systems are projected onto the Krylov space of the other system. In the absence of rounding errors, this ensures that 6.1 Formulation of the Block BiCGStab Method 55

the method converges in at least n iterations. However, it is not the aim to proceed as many iterations, as even a direct solve would be more efficient. The BBiCG algorithm is shown in Algorithm 6.1.

Algorithm 6.1: Block BiCG Method R0 = B − AX0 0 0 0 Choose R˜︂ arbitrarily with ⟨R˜︁ ,R ⟩S ≠ 0 P 0 = M −1R0, P˜︁0 = M −TR˜︁0 0 0 0 ρ = ⟨R˜︁ ,P ⟩S for k = 0,... do k k k α = ⟨P˜︁ , AP ⟩S −1 Xk+1 = Xk + P k(αk) ρk −1 Rk+1 = Rk − AP k(αk) ρk k+1 break if ∥R ∥≤ εtol −T T R˜︁k+1 = R˜︁k − ATP˜︁k(αk) ρk Zk+1 = M −1Rk+1, Z˜︁k+1 = M −TR˜︁k+1 k+1 k+1 k+1 ρ = ⟨R˜︁ ,Z ⟩S (︂ )︂−1 P k+1 = Zk+1 + P k ρk ρk+1 (︂ )︂−T T P˜︁k+1 = Z˜︁k+1 + P˜︁k ρk ρk+1 end for

Lemma 6.1. The residual in the kth iteration is orthogonal to the Krylov space of the adjunct problem.

(︂ )︂ k k −T T −T ˜︁0 ⟨V,R ⟩S = 0 ∀V ∈ KS M A ,M R .

A proof can be found in the paper by O’Leary [OLe80, Lemma 1]. If the operator and preconditioner are symmetric the method is equivalent to the BCG method. Next, we introduce some theory about orthogonality of S-valued polynomials leading to some further properties of the BBiCG method. Based on these properties we define the BBiCGStab method, which uses enhanced polynomials that satisfy the same properties.

Lemma 6.2. The variables in the BBiCG algorithm can be expressed in the form

Rk = Rk(AM −1) ◦ R0 and

P k = Pk(M −1A) ◦ M −1R0,

k where R, P ∈ PS are S-valued polynomials of degree k defined by the recursion formulas

0 k+1 k k k R (x) = I R (x) = R (x) − xP (x)λ 56

and

0 k+1 k+1 k k P (x) = I P (x) = R (x) + P (x)β ,

(︂ )︂−1 (︂ )︂−1 with λk = αk ρk and βk = ρk ρk+1. Proof. We proof this by induction. The case k = 0 is trivial. Assume that the relations hold for k. Then we have

Rk+1 = Rk − AP kλk (︂ )︂ = Rk(AM −1) ◦ R0 − APk(M −1A) ◦ M −1R0 λk (︂ )︂ = Rk(AM −1) ◦ R0 − AM −1Pk(AM −1) ◦ R0 λk [︂ ]︂ = Rk − xPkλk (AM −1) ◦ R0.

The second equation follows from the update formula of P

P k+1 = M −1Rk+1 + P kβk (︂ )︂ = M −1Rk+1(AM −1) ◦ R0 + Pk(M −1A)βk ◦ M −1R0 (︂ )︂ = Rk+1(M −1A) ◦ M −1R0 + Pk(M −1A)βk ◦ M −1R0 [︂ ]︂ = Rk+1 + Pkβk (M −1A) ◦ M −1R0.

Now we introduce some formality to describe orthogonal polynomials and show that the polynomials of the recursion formulas of the BBiCG method satisfy these orthogonality properties. That formalism helps us to get rid of the transposed operator that must be applied in the BBiCG method. Definition 6.3 (Formally orthogonal polynomials). We define the S-valued linear (1) k k functionals C and C on PS for any P ∈ PS as

−T 0 −1 0 C(P) := ⟨M R˜︁ , P(AM ) ◦ R ⟩S C(1)(P) := C(xP).

Lemma 6.4. For the S-valued polynomials Rk and Pk from Lemma 6.2 we have

C(RkT ) = 0 (6.1) and

C(1)(PkT ) = 0 (6.2)

T ∈ k−1 for any PS . 6.1 Formulation of the Block BiCGStab Method 57

Proof. From Lemma 6.1 we know that for all i = 0, . . . , k − 1 it holds that

−T T i −T 0 k ⟨(M A ) M R˜︁ ,R ⟩S = 0.

It follows that

−T 0 −1 i k −1 0 ⟨M R˜︁ , (AM ) R (AM ) ◦ R ⟩S = 0, hence

C(xiRk) = 0.

Equation (6.1) follows from the linearity of C. Similarly, we proof Equation (6.2). By using

(︂ )︂ (︂ )︂−1 AP k = Rk − Rk+1 λk we get

C(1)(xiP) = C(xi+1P)

−T 0 (︂ −1)︂i+1 k −1 0 = ⟨M R˜︁ , AM P (AM ) ◦ R )⟩S (︂ −T T)︂i −T 0 k = ⟨ M A M R˜︁ , AP ⟩S (︂ −T T)︂i −T 0 (︂ k k+1)︂ (︂ k)︂−1 = ⟨ M A M R˜︁ , R − R λ ⟩S = 0,

(︂ )︂i where we used again Lemma 6.1 and the fact that M −TAT M −TR˜︁0 ∈ (︂ )︂ i −T T −T ˜︁0 (1) KS M A ,M R . Equation (6.2) follows by using the linearity of C . For the stabilization we enhance the residual and search direction by a polynomial Qk as

[︂ ]︂ Rk = RkQk (AM −1) ◦ R0 [︂ ]︂ P k = PkQk (M −1A) ◦ M −1R0,

k k where Q ∈ PR is a scalar polynomial recursively defined by

Q0(x) = 1 and Qk+1(x) = (1 − ωkx)Qk(x).

The ωk ∈ R are chosen to minimize the residual norm. By defining [︂ ]︂ Sk = Rk+1Qk (AM −1) ◦ R0, 58

we obtain the following update formulas

[︂ ]︂ Rk+1 = Rk+1Qk+1 (AM −1) ◦ R0 [︂ ]︂ = Rk+1Qk − ωkxRk+1Qk (AM −1) ◦ R0 = Sk − ωkAM −1Sk, [︂ ]︂ P k+1 = Pk+1Qk+1 (M −1A) ◦ M −1R0 [︂ ]︂ = Rk+1Qk+1 + PkQk+1βk (M −1A) ◦ M −1R0 [︂ ]︂ = M −1Rk+1 + PkQk − ωkxPkQk (M −1A) ◦ M −1R0βk (︂ )︂ = M −1Rk+1 + P k − ωkM −1AP k βk, [︂ ]︂ Sk = Rk+1Qk (AM −1) ◦ R0 [︂ ]︂ = RkQk − xPkQkλk (AM −1) ◦ R0 = Rk − AP kλk.

Finally, we need to deduce the coefficients ωk, λk and βk. As mentioned before, the ωk is determined by minimizing the residual Rk+1 = Sk − ωkAM −1Sk in the Frobenius norm, which yields

−1 k k k ⟨AM S ,S ⟩F ω = −1 k −1 k . ⟨AM S , AM S ⟩F

The other coefficients are determined by the formal orthogonality condition for Rk and Pk. We have

(︂ )︂ 0 = C Rk+1Qk (︂ )︂ (︂ )︂ = C RkQk − C xPkQk λk −T 0 k −T 0 k k = ⟨M R˜︁ ,R ⟩S − ⟨M R˜︁ , AP ⟩Sλ k (︂ −T 0 k )︂−1 −T 0 k ⇒ λ = ⟨M R˜︁ , AP ⟩S ⟨M R˜︁ ,R ⟩S and

0 = C(1)(Pk+1Qk) = C(1)(Rk+1Qk) + C(1)(PkQk)βk −T 0 −1 k −T 0 k k = ⟨M R˜︁ , AM S ⟩S + ⟨M R˜︁ , AP ⟩Sβ k (︂ −T 0 k )︂−1 −T 0 −1 k ⇒ β = − ⟨M R˜︁ , AP ⟩S ⟨M R˜︁ , AM S ⟩S.

Now we can formulate the BBiCGStab algorithm, see Algorithm 6.2. Note that S and R as well as Q and T and V and U can share memory pairwise. Furthermore, the 6.2 Residual Re-Orthonormalization 59

Algorithm 6.2: Block BiCGStab Method R0 = B − AX0 P 0 = M −1R0 Choose M −TR˜︁0 arbitrary (e.g. M −TR˜︁0 = P 0) V 0 = P 0 for k = 0,... do Qk = AP k −1 k (︂ −T 0 k )︂ −T 0 k λ = ⟨M R˜︁ ,Q ⟩S ⟨M R˜︁ ,R ⟩S Sk = Rk − Qkλk Zk = M −1Qk T k = V k − Zkλk U k = AT k k k k ⟨U ,S ⟩F ω = k k ⟨U ,U ⟩F Xk+1 = Xk + P kλk + ωkT k Rk+1 = Sk − ωkU k k+1 break if ∥R ∥≤ εtol V k+1 M −1Rk+1 = −1 k (︂ −T 0 k )︂ −T 0 k β = − ⟨M R˜︁ ,Q ⟩S ⟨M R˜︁ ,U ⟩S (︂ )︂ P k+1 = V k+1 + P k − ωkZk βk end for algorithm does not apply the transposed of the operator or preconditioner. It is the subject of further investigation whether we could choose the stabilization parameter ωk in the space S. A naive implementation by the author did not work. Furthermore, an adaption of the BiCGStab(ℓ) method would be interesting. An approach was recently proposed by Saito, Tadano and Imakura [STI14].

6.2 Residual Re-Orthonormalization Like in the BCG algorithm we can improve and stabilize the convergence by adding a residual re-orthonormalization approach. This helps to resolve rank-deficiencies in the residual. Unfortunately, the BBiCGStab method is not as stable as the BCG method, especially at lower precision. We transform the intermediate residual using the normalizer, such that

k Sˆ σk = Sk, where σk is an element in the *-subalgebra σk ∈ S. All other variables are transformed similarly, using the same σk. The transformed variables are denoted with a hat. The recurrence coefficients λˆ and βˆ are computed by using the recurrence formulas 60

of P and S and assuming that σ is invertible. We have

k+1 Pˆ σk+1 = P k+1 (︂ )︂ = V k+1 + P k − ωkZk βk k+1 (︃ k k )︃ = Vˆ σk+1 + Pˆ σk + ωkZˆ σk βk

k+1 (︃ k k)︃ k = Vˆ σk+1 + Pˆ + ωkZˆ βˆ σk+1

k where we define βˆ by

k βˆ σk+1 = σkβk.

k This can be used to deduce a formula for βˆ

k k k(︂ −T 0 k )︂−1 −T 0 k σ β = σ ⟨M R˜︁ ,Q ⟩S ⟨M R˜︁ ,U ⟩S (︃ )︃−1 −T 0 ˆ k −T 0 ˆ k k+1 = ⟨M R˜︁ ,Q ⟩S ⟨M R˜︁ ,U ⟩Sσ (︃ )︃−1 ˆk −T 0 ˆ k −T 0 ˆ k ⇒ β = ⟨M R˜︁ ,Q ⟩S ⟨M R˜︁ ,U ⟩S.

k Here we used, that Uˆ is transformed with respect to the new transformation σk+1.A k similar computation can be made for λˆ which shows that we have

k σkλk = λˆ σk.

Thus

(︃ )︃−1 ˆk −T 0 ˆ k −T 0 ˆ k λ = ⟨M R˜︁ ,Q ⟩S ⟨M R˜︁ ,R ⟩S.

The stabilization coefficient ωˆk is chosen to minimize the transformed residual

ˆ k ˆk k ⟨U ,S ⟩F ωˆ = k k . ⟨Uˆ ,Uˆ ⟩F

As with the BCG method, we apply the re-orthonormalization adaptively. We ˆ k ˆ k decide on the basis of the diagonally scaled condition number κD(⟨R ,R ⟩S) whether we re-orthogonalize the residual. This quantity was a heuristically choice. Additionally, we introduce a parameter η to be able to tune the re-orthonormalization behavior. As shown in the following numeric section, this choice works properly. The algorithm can be found in Algorithm 6.3. We avoid computing the inverse of γk, because it is badly conditioned if Rk is k numerically rank deficient. Therefore we apply the preconditioner directly to Sˆ to 6.2 Residual Re-Orthonormalization 61

Algorithm 6.3: BBiCGStab with Adaptive Residual Re-Orthonormalization R0 = B − AX0 if η ≠ 0 then ˆ 0 0 0 R , σ = NormS (R ) else σ0 = I end if 0 0 Pˆ = M −1Rˆ 0 Choose M −TR˜︁0 (e.g. M −TR˜︁0 = Pˆ ) 0 0 Vˆ = Pˆ for k = 0,... do k k Qˆ = APˆ k (︃ )︃−1 ˆ −T 0 ˆ k −T 0 ˆ k λ = ⟨M R˜︁ ,Q ⟩S ⟨M R˜︁ ,R ⟩S k k Zˆ = M −1Qˆ k k k k Sˆ = Rˆ − Qˆ λˆ k+ 1 k k k X 2 X Pˆ λ σ =(︃ + )︃ ˆ k ˆ k √ if ηκD ⟨R ,R ⟩S > εmach then (︃ )︃ ˆk k ˆk S , γ ← NormS S σk+1 = γkσk k k Tˆ = M −1Sˆ else k k k Tˆ = Vˆ − Zˆ λˆk end if k k Uˆ = ATˆ ˆ k ˆ k ωk ⟨U ,S ⟩F ˆ = ˆ k ˆ k ⟨U ,U ⟩F k+1 k+ 1 k k k X = X 2 + ωˆ Tˆ σ k+1 k k Rˆ = Sˆ − ωˆkUˆ k+1 break if ∥R ∥≤ εtol k+1 k+1 Vˆ = M −1Rˆ k (︃ )︃−1 ˆ −T 0 ˆ k −T 0 ˆ k β = − ⟨M R˜︁ ,Q ⟩S ⟨M R˜︁ ,U ⟩S k+1 k+1 (︃ k )︃ k Pˆ = Vˆ + Pˆ − ωˆkZˆk βˆ end for 62

Table 6.1: Convergence results of the BBiCGStab method. Gray lines indicate that no convergence was achieved within 500 iterations. rate iterations re- Method p block-global 1 0.998 5000 2 0.998 5003 4 0.998 500 18 8 1.023 500 95 16 0.942 156 59 32 0.815 59 42 64 0.590 27 23 128 0.428 13 10 256 0.267 11 8 block-parallel 1 0.929 131 0 2 0.895 90 58 4 0.829 54 49 8 0.708 33 30 16 0.649 24 21 32 0.541 18 15 64 0.437 14 11 128 0.352 12 9 256 0.267 11 8

k compute Tˆ in the case where a re-orthonormalization is performed.

6.3 Numerical Experiments Like for the other methods we perform an experiment to compare the block-parallel and block-global methods for the BBiCGStab method. We used the same matrix and preconditioner as in the numerical experiments for the BGMRes method and a re-orthonormalization parameter of η = 100. The convergence behavior is not very smooth for the BiCGStab method, therefore we decided to display the result in a table and not in a plot. The results are shown in Table 6.1. We see a similar behavior as for the other methods - the larger the p, the faster the convergence. The convergence rate denotes the average factor by that the residual norm is reduced per iteration. However, we do not have a relationship between p the convergence of the q -block-parallel method and the corresponding p-block-global method. It seems that the block-global methods perform better with the BBiCGStab method. Table 6.2 shows the number of iterations and re-orthonormalizations for the Simon/raefsky3 problem from the SuiteSparse Matrix collection [DH11]. We used a block size of s = 32 and the block method SB. The break criteria for the 6.3 Numerical Experiments 63

Table 6.2: Iterations of the BBiCGStab method with residual re-orthonormalization. Gray lines indicate that the method did not converged within 2000 iterations. rate iterations re-orthogonalizations η 0.000 1.036 20000 0.001 0.967 299 106 0.010 0.967 294 103 0.100 0.967 286 116 1.000 0.968 295 162 10.000 0.964 258 153 100.000 0.963 261 217 method was a reduction of the max-column norm by a factor of 10−7. The results show that re-orthonormalization is necessary to achieve convergence, but a larger re-orthonormalization parameter η does not necessarily lead to fewer iterations.

Part 2 Communication-aware Block Krylov Methods

We cannot solve the problems using the same kind of thinking we used when we created them. Albert Einstein 7 Challenges on Large Scale Parallel Systems

In this part of the thesis we optimize the algorithms presented in Part1 with respect to the global communications, i.e. the inner products and orthonormalizations that must be performed during the iteration. This optimization is twofold. On the one hand we reduce the number of synchronization points, i.e. compute multiple block inner products simultaneously. On the other hand we overlap the communication for the block inner products with computation. In the current research there are a couple of approaches to minimize the communica- tion overhead of Krylov methods. The most popular are so called s-step Krylov methods [CG89; CK90; Hoe10]. These methods reduce the number of global communications by enlarging the Krylov space by s dimensions in every iteration. The communication needed to find the minimizer in that space and orthogonalize the basis couldthen be carried out chunk-wise. While decreasing the number of messages by a factor of s, these methods suffer from instabilities. A lot of investigations have been madeto mitigate these instabilities. A rigorous analysis of the round-off errors in s-step methods can be found in the PhD thesis of Carson [Car15]. The techniques to stabilize the methods are quite sophisticated and require information about the spectrum of the operator. Another problem is that we often use preconditioners that realize some kind of coarse grid correction, which behave similar to a global communication. Hence, the computation of the s-step basis would take as long as s global communications, which reduces the benefits of the method. Our aim is to overlap the computation ofthe preconditioner with the block inner product, such that every iteration effectively only needs one global synchronization. Recently, Eller [Ell19] presented a very elaborate performance study of pipelined Krylov solvers on modern HPC hardware, showing that a clever organization of communication can yield a significant performance improvement. 7.1 Implementation 67

Another approach to solve linear systems while reducing the communication overhead are asynchronous iterations, cf. Chazan and Miranker [CM69] and Frommer and Szyld [FS00]. The idea behind these methods is that every process iterates in its own speed and communicates the updates asynchronously. In particular these methods are non-deterministic. The strength of this method is that they can handle unreliable networks and heterogeneous architectures well, as the computation proceeds even if one node is delayed. Finally, the block Krylov methods presented in Part1 already reduce the communi- cation overhead as they reduce the number of iterations. This is used by the enlarged Krylov methods, like presented by Grigori and Tissot [GT17] who make use of this advantage for single right-hand side problems. In the first section of this chapter we look at the implementation of the non-blocking collective communication in our code. Thereafter, we introduce a benchmark for quantifying the available time during a collective communication for computations. In the last section we look at the TSQR algorithm that is used to compute a QR decomposition of a tall-skinny matrix with only one collective communication. We use the terms “global communication”, “collective communication” and “reduc- tion” synonymous. They all denote the communication procedure that is carried out to compute for example a global sum. All numerical tests in this part are carried out on 128 nodes of the supercomputer PALMAII at the University of Münster. Each node is an Intel Xeon Gold 6140 18C (Skylake) with 2.3 GHz connected by a 100 Gbit/s Intel Omni Path network. We use the optimized Intel MPI library Version 2018 Update 4 in all experiments. Furthermore, we always use I_MPI_ASYNC_PROGRESS=1, if not stated otherwise. See Section 7.2 for a discussion about that parameter.

7.1 Implementation

The technical foundation for asynchronous collective communication was introduced with the MPI 3 standard¹. Within this standard, functions for non-blocking collective communication were introduced. These functions not only allow to overlap communication with computation, but also to have multiple communications in progress simultaneously. In particular, the MPI_Iallreduce function is what we need for the asynchronous computation of block inner products. In the Dune framework, we introduced an abstraction layer for the asynchronicity which makes use of the future concept, shown in Listing 7.1. It provides the ready() method for checking the status of the communication, as well as the wait() method to block the execution until the communication has finished. In addition, the communicated data can be obtained with the get() method. The Future object encapsulates the communicated data and the MPI_Status handle.

¹ https://www.mpi-forum.org/mpi-30/

template<class T>
class Future {
  bool valid();
  bool ready();
  void wait();
  T get();
};

Listing 7.1: The concept of a Dune::Future.

Future<Block> gamma_future = sp->idot(x, y);
op->apply(x, z); // This computation is overlapped with communication
Block gamma = gamma_future.get();

Listing 7.2: Example: How to use the Future interface of the ScalarProduct.

We made this available in the Dune framework as part of the Dune-common module. Furthermore, we extended the ScalarProduct interface by a function idot, which returns a Future for the block inner product, and a function inormalize, which computes the normalizer of a block vector and also returns it in a Future. This allows us to write concurrent code in a C++ way, without thinking too much about resource allocation and technical details. An example of how to use this interface is given in Listing 7.2. For the sequential case, e.g. if no MPI is available, we provide a fallback implementation of a Future that does nothing but hold the object. As with all the methods in this work, we plan to make the code publicly available in the Dune-istl module of the Dune framework. A prototype implementation can be found in the GitLab repository of the author².
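As a rough illustration of how such a future can be built on top of MPI_Iallreduce (this is not the Dune interface itself; the class name and the use of a plain std::vector<double> as payload are assumptions for the sketch):

#include <mpi.h>
#include <vector>

// Minimal future-like wrapper around a non-blocking global sum.
class AllreduceFuture {
  MPI_Request req_{MPI_REQUEST_NULL};
  std::vector<double> result_;
public:
  AllreduceFuture(const std::vector<double>& local, MPI_Comm comm)
    : result_(local) {
    // start the non-blocking global sum; the result becomes valid after wait()
    MPI_Iallreduce(MPI_IN_PLACE, result_.data(),
                   static_cast<int>(result_.size()),
                   MPI_DOUBLE, MPI_SUM, comm, &req_);
  }
  bool ready() {
    int flag = 0;
    MPI_Test(&req_, &flag, MPI_STATUS_IGNORE);  // also drives MPI progress
    return flag != 0;
  }
  std::vector<double> get() {
    MPI_Wait(&req_, MPI_STATUS_IGNORE);         // block until the reduction finished
    return result_;
  }
};

Between construction and get(), the caller is free to apply the operator or preconditioner, which is exactly the overlap exploited in the following chapters.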

7.2 Collective Communication Benchmark

To quantify the costs of the collective communication in our environment we implemented a benchmark. The benchmark is inspired by the one presented by Lawry et al. [Law+02]. As we want to overlap the communication with computation, we measure how much of the communication time is available for computations. For that we proceed as follows. We measure the time that is needed for initiating a global communication and immediately waiting for it (MPI_Wait). We call this time the base time (t_base). After that, we execute the same test again, but introduce a busy wait between the initiation and the finalization of the global communication. The time we measure is called the iteration time (t_iter). The time we spend in the busy wait is called the work time (t_work). To mitigate outliers, we execute all these measurements 1000 times and take the average. We start with a work time equal to one quarter of the base time.

² https://gitlab.dune-project.org/nils.dreier/dune-common

[Figure 7.1: Collective communication benchmark, I_MPI_ASYNC_PROGRESS on vs. off. Two panels (one per setting) plot the time t [s] over the number of processes (1 to 128). Solid lines show the duration of the communication (t_base, blocking vs. non-blocking); dotted lines show the portion of time that can be used for computations (t_avail, sleep vs. active).]

After each repetition, the work time is doubled. We repeat this until the iteration time is larger than twice the base time. After that we determine the overhead as

$$t_\mathrm{ovhd} = t_\mathrm{iter} - t_\mathrm{work},$$

and the available time for computation is

$$t_\mathrm{avail} = t_\mathrm{base} - t_\mathrm{ovhd}.$$
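A condensed sketch of one measurement (illustrative, not the exact benchmark code; it assumes the reduced payload is a single double and uses MPI_SUM) could look as follows:

#include <mpi.h>
#include <chrono>

// Returns t_iter for a given busy-wait duration t_work; running it with
// twork = 0 yields t_base. 'active' selects whether MPI_Test is called during
// the busy wait (NB_active) or not (NB_sleep).
double overlapped_allreduce(double value, std::chrono::microseconds twork,
                            bool active, MPI_Comm comm) {
  using clock = std::chrono::steady_clock;
  double result = value;
  MPI_Request req;
  auto t0 = clock::now();
  MPI_Iallreduce(MPI_IN_PLACE, &result, 1, MPI_DOUBLE, MPI_SUM, comm, &req);
  while (clock::now() - t0 < twork) {        // busy wait, standing in for useful work
    if (active) {
      int flag;
      MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  // keeps MPI progressing
    }
  }
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  return std::chrono::duration<double>(clock::now() - t0).count();
}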

The MPI standard only guarantees that progress in the communication is made when calls to MPI are made. However, some MPI implementations can be configured such that they make progress even if no calls are made. For Intel MPI this switch is called I_MPI_ASYNC_PROGRESS. To distinguish these cases, we introduce two tests. In the first test, NB_sleep, we do not make any MPI calls during the busy wait. In the second test, NB_active, we call MPI_Test on the MPI_Request during the busy wait. Figure 7.1 shows the results for our test environment for the MPI_Iallreduce communication. If I_MPI_ASYNC_PROGRESS is turned off, communication-computation overlap is only feasible in the NB_active case. In particular, for a large number of processes most of the communication time can then be used for computation. If I_MPI_ASYNC_PROGRESS is turned on, the NB_active and NB_sleep cases do not differ, i.e. communication-computation overlap is also feasible in the NB_sleep case, where no MPI calls are made during the communication. In this setting 99% of the communication time can be used for computation. However, the overall time in the case where I_MPI_ASYNC_PROGRESS is turned on is much higher than in the other case; notice the different scaling of the y-axes. From a practical point of view it only makes sense to activate I_MPI_ASYNC_PROGRESS if the overlap is exploited aggressively, because the overall communication time increases by a factor of approximately 50. An implementation of the benchmark can also be found in the GitLab repository of the author³.

An alternative to turning I_MPI_ASYNC_PROGRESS on was presented by Wittmann et al. [Wit+13]. In their approach, they spawn a dedicated thread that makes MPI calls regularly to ensure progress for non-blocking communication, even if no calls to MPI are made from the user code.

Let us now transfer these results to a hypothetical exascale machine that consists of the same nodes as our test environment. One node in our test environment has a peak flop rate of 2.65 TFlop/s. Hence, for an exascale machine P ≈ 380 000 nodes are needed. From Figure 7.1 we can see that the collective communication scales like

$$2\log_2(P)\,\mu s$$

(if I_MPI_ASYNC_PROGRESS=0). Thus, a global reduction on this machine would take

$$2\log_2(380\,000)\,\mu s \approx 37\,\mu s.$$

Figure 7.2 shows the time that is needed for a global reduction on a hypothetical machine as described above. It shows that the cost of a global reduction will grow for larger systems. The cost increases from petascale to exascale by $2\log_2(1000)\,\mu s \approx 20\,\mu s$, and equally from exascale to zettascale. However, this assumes a rather idealized model. It is not clear that the scaling still holds for machines of this size, as a small disturbance on one node would block the entire machine. In particular, if the solution of a system is really time critical, e.g. for weather forecasts, the only possibility to solve it within the time constraints is to use larger machines. This would probably lead to relatively small systems per node, for which the communication overhead plays a significant role.

7.3 The TSQR Algorithm

Before we consider the block Krylov methods in the next chapters, we look at the QR decomposition algorithm used in the distributed setting to compute the normalizer. Due to the shape of the block vectors for which the QR decomposition is computed, this algorithm is called Tall-Skinny QR (TSQR). It falls into the category of divide-and-conquer algorithms.

³ https://gitlab.dune-project.org/nils.dreier/dune-common/-/tree/master/dune/common/parallel/benchmark

[Figure 7.2: Time per global reduction (in µs) vs. the peak performance of the hypothetical machine, from terascale over petascale and exascale to zettascale. The x-axis is scaled logarithmically.]

The algorithm was presented by Demmel et al. [Dem+08; Dem+12]. It reduces the communication for computing a QR decomposition

$$X = Q\sigma$$

of a block vector $X \in \mathbb{R}^{n\times s}$, where $s \ll n$, $Q \in \mathbb{R}^{n\times s}$ with $Q^T Q = I$ and $\sigma \in \mathbb{R}^{s\times s}$ upper triangular. We will review this algorithm and adapt it to our framework. We start with the definition of the S-QR decomposition.

Definition 7.1 (S-QR decomposition). Let $X \in \mathbb{R}^{n\times s}$ be a block vector. We call a decomposition of the type

$$X = Q\sigma, \qquad Q \in \mathbb{R}^{n\times s},$$

where $\sigma \in S$ is upper triangular and

$$\langle Q, Q\rangle_S = I,$$

a S-QR decomposition.

Note that σ is a S-normalizer of X with respect to the normalized block vector Q. The algorithm is based on the following computation. Consider a block vector X

which is distributed onto P processes

$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_P \end{bmatrix} \in \mathbb{R}^{n\times s},$$

where $X_p$ is stored on process $p$. This vector can be decomposed without communication into

$$\begin{bmatrix} X_1 \\ \vdots \\ X_P \end{bmatrix}
= \begin{bmatrix} Q_1\sigma_1 \\ \vdots \\ Q_P\sigma_P \end{bmatrix}
= \begin{bmatrix} Q_1 & & \\ & \ddots & \\ & & Q_P \end{bmatrix}
\begin{bmatrix} \sigma_1 \\ \vdots \\ \sigma_P \end{bmatrix},$$

where

$$X_p = Q_p\sigma_p \tag{7.1}$$

is a S-QR decomposition for all $p = 1,\dots,P$. The R-factors of these decompositions are gathered on one process and decomposed into

$$\begin{bmatrix} \sigma_1 \\ \vdots \\ \sigma_P \end{bmatrix}
= \begin{bmatrix} \tilde\sigma_1 \\ \vdots \\ \tilde\sigma_P \end{bmatrix}\sigma,
\qquad \tilde\sigma_1,\dots,\tilde\sigma_P,\sigma \in S,$$

where $\sigma$ is upper triangular and

$$\sum_{p=1}^{P} \tilde\sigma_p^T \tilde\sigma_p = I.$$

The $\tilde\sigma_p$ are scattered back to the respective processes. Then the decomposition

$$X = \begin{bmatrix} Q_1\tilde\sigma_1 \\ \vdots \\ Q_P\tilde\sigma_P \end{bmatrix}\sigma$$

forms a S-QR decomposition, since

$$\left\langle \begin{bmatrix} Q_1\tilde\sigma_1 \\ \vdots \\ Q_P\tilde\sigma_P \end{bmatrix},
\begin{bmatrix} Q_1\tilde\sigma_1 \\ \vdots \\ Q_P\tilde\sigma_P \end{bmatrix} \right\rangle_S
= \sum_{p=1}^{P} \tilde\sigma_p^T \langle Q_p, Q_p\rangle_S\, \tilde\sigma_p
= \sum_{p=1}^{P} \tilde\sigma_p^T \tilde\sigma_p = I,$$

and $\sigma$ is upper triangular. A minimal code sketch of this two-level procedure is given below; on large systems with many processes the approach can be carried out recursively.
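The following sketch illustrates the two-level procedure, using Eigen for the dense QR factorizations and plain MPI collectives instead of the Dune block-vector kernels. It assumes the Euclidean block inner product and a full-rank X and is meant as an illustration, not as the implementation used in this work.

#include <mpi.h>
#include <Eigen/Dense>
#include <vector>

using Mat = Eigen::MatrixXd;

// Two-level TSQR: Xp holds the local rows of X; on return Qp holds the local
// rows of the orthonormal factor and sigma the global upper triangular R-factor.
void tsqr(const Mat& Xp, Mat& Qp, Mat& sigma, MPI_Comm comm) {
  int rank, P;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &P);
  const int s = static_cast<int>(Xp.cols());

  // 1. local QR decomposition X_p = Q_p sigma_p (no communication)
  Eigen::HouseholderQR<Mat> qr(Xp);
  Mat sigma_p = qr.matrixQR().topRows(s).triangularView<Eigen::Upper>();
  Qp = qr.householderQ() * Mat::Identity(Xp.rows(), s);

  // 2. gather the local R-factors on rank 0
  std::vector<double> gathered(rank == 0 ? P * s * s : 0);
  MPI_Gather(sigma_p.data(), s * s, MPI_DOUBLE,
             gathered.data(), s * s, MPI_DOUBLE, 0, comm);

  // 3. QR of the stacked R-factors: [sigma_1; ...; sigma_P] = [~sigma_p] sigma
  std::vector<double> scattered(rank == 0 ? P * s * s : 0);
  sigma.resize(s, s);
  if (rank == 0) {
    Mat stacked(P * s, s);
    for (int p = 0; p < P; ++p)
      stacked.middleRows(p * s, s) =
          Eigen::Map<Mat>(gathered.data() + p * s * s, s, s);
    Eigen::HouseholderQR<Mat> qr2(stacked);
    sigma = qr2.matrixQR().topRows(s).triangularView<Eigen::Upper>();
    Mat tildes = qr2.householderQ() * Mat::Identity(P * s, s);
    for (int p = 0; p < P; ++p)
      Eigen::Map<Mat>(scattered.data() + p * s * s, s, s) =
          tildes.middleRows(p * s, s);
  }

  // 4. scatter the ~sigma_p back and update the local Q-factors
  Mat tilde_p(s, s);
  MPI_Scatter(scattered.data(), s * s, MPI_DOUBLE,
              tilde_p.data(), s * s, MPI_DOUBLE, 0, comm);
  Qp = Qp * tilde_p;
  MPI_Bcast(sigma.data(), s * s, MPI_DOUBLE, 0, comm);
}

In the recursive variant discussed next, step 3 is replaced by a tree of such small factorizations.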

In the recursive case, a tree hierarchy is created among the processes and the local decomposition (7.1) is computed with the same algorithm. The communication pattern is the same as for an MPI_Allreduce. The MPI standard allows for implementing custom reduction methods; this is sufficient for computing the R-factor, as presented in the work of Langou [Lan10]. Unfortunately, the MPI standard does not allow to define a custom function for the back-scattering of the $\tilde\sigma_p$, which would be needed to compute the Q-factor in the rank-deficient case. We will see how we can use the idea of this algorithm to create a new orthogonalization algorithm for BGMRes that is communication-optimal and stable.

An alternative to the TSQR algorithm is the CholeskyQR algorithm [SW02]. It also uses only one reduction to compute the tall-skinny QR factorization of a block vector and relies on the Cholesky factorization of the block inner product

$$\lambda^T\lambda = \langle X, X\rangle_S, \qquad \lambda \in S \text{ upper triangular}.$$

The normalized vector could then be computed as

$$Q = X\lambda^{-T}. \tag{7.2}$$

From (7.2) we see that the Cholesky factor is inverted. Therefore, the algorithm only works for full-rank X; in particular, it is unsuitable for our stabilization strategies.
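For comparison with the TSQR sketch above, a CholeskyQR sketch needs only a single allreduce of the small Gram matrix. The snippet assumes the Euclidean block inner product and uses the standard convention G = R^T R with Q = X R^{-1} (R plays the role of the normalizer λ, up to the transpose convention used in the text); it is an illustration only and, as noted above, requires X to have full column rank.

#include <mpi.h>
#include <Eigen/Dense>

using Mat = Eigen::MatrixXd;

// CholeskyQR: one allreduce of the s x s Gram matrix, then a local update.
// Overwrites Xp with the local rows of Q and returns the R-factor in R.
void cholesky_qr(Mat& Xp, Mat& R, MPI_Comm comm) {
  const int s = static_cast<int>(Xp.cols());
  Mat G = Xp.transpose() * Xp;                  // local Gram matrix
  MPI_Allreduce(MPI_IN_PLACE, G.data(), s * s,  // global Gram matrix
                MPI_DOUBLE, MPI_SUM, comm);
  Eigen::LLT<Mat> llt(G);                       // G = R^T R, R upper triangular
  R = llt.matrixU();
  Mat Rinv = R.inverse();                       // s is small; fine for a sketch
  Xp = Xp * Rinv;                               // Q_p = X_p R^{-1}
}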

8 Pipelined Block CG Method

Our BCG algorithm with adaptive re-orthonormalization, Algorithm 4.3, uses three blocking global communications per iteration – minimization, orthogonalization and convergence check/re-orthonormalization. The present chapter aims at reducing this. As a first step we fuse multiple inner products, such that their communication can be carried out simultaneously, reducing the number of synchronization points per iteration. In the second section we make use of the non-blocking features described in the last section to overlap the communication with computation. Finally, we compare all the derived variants with respect to their communication overhead.

8.1 Fusing Inner Block Products

We fuse the communication of the convergence check with the orthogonalization, i.e. the communication for η^k and ρ^k. Furthermore, the update of X can be delayed such that it is done during this communication. These optimizations can be made without introducing any additional variables or operations. The resulting algorithm is shown in Algorithm 8.1. Arrows mark the initiation and finalization of the global communications. The algorithm is arithmetically equivalent to Algorithm 4.3. It could be slightly further improved by checking after lines 25 and 26 whether the communication of η^k has already finished and breaking if the convergence criterion is satisfied; this would terminate the algorithm a bit earlier.

As a next step, we reduce the number of synchronization points to one. Unfortunately, that is not possible without adding memory and computational overhead. We introduce the auxiliary variable U^k = AZ^k, which can be computed right after the

Algorithm 8.1: BCG with Two Reductions (2R-BCG)
 1: R^0 = B − A X^0
 2: if η > 0 then
 3:   R̄^0, σ^0 = Norm_S(R^0)
 4: else
 5:   R̄^0 = R^0
 6:   σ^0 = I
 7: end if
 8: P^0 = M^{−1} R̄^0
 9: ρ^0 = ⟨P^0, R̄^0⟩_S
10: for k = 0, … until convergence do
11:   Q^k = A P^k
12:   α^k = ⟨P^k, Q^k⟩_S
13:   λ^k = (α^k)^{−1} ρ^k
14:   R̃^{k+1} = R̄^k − Q^k λ^k
15:   if η κ_D(α^k) > √ε_mach then
16:     R̄^{k+1}, γ^{k+1} = Norm_S(R̃^{k+1})
17:     σ^{k+1} = γ^{k+1} σ^k
18:     η^{k+1} = ‖σ^{k+1}‖
19:   else
20:     R̄^{k+1} = R̃^{k+1}
21:     γ^{k+1} = I
22:     σ^{k+1} = σ^k
23:     η^{k+1} = ‖R̄^{k+1} σ^{k+1}‖
24:   end if
25:   X^{k+1} = X^k + P^k λ^k σ^k
26:   Z^{k+1} = M^{−1} R̄^{k+1}
27:   ρ^{k+1} = ⟨Z^{k+1}, R̄^{k+1}⟩_S
28:   break if η^{k+1} ≤ ε_tol
29:   β^k = (ρ^k)^{−1} (γ^{k+1})^T ρ^{k+1}
30:   P^{k+1} = Z^{k+1} + P^k β^k
31: end for

application of the preconditioner. Thus we can precompute α^{k+1} by

$$\begin{aligned}
\alpha^{k+1} &= \langle P^{k+1}, Q^{k+1}\rangle_S \\
&= \langle Z^{k+1} + P^k\beta^k,\; U^{k+1} + Q^k\beta^k\rangle_S \\
&= \langle Z^{k+1}, U^{k+1}\rangle_S + \langle Z^{k+1}, Q^k\beta^k\rangle_S + \langle P^k\beta^k, U^{k+1}\rangle_S + \langle P^k\beta^k, Q^k\beta^k\rangle_S \\
&= \delta^{k+1} + \langle Z^{k+1}, Q^k\rangle_S\,\beta^k + \left(\langle Z^{k+1}, Q^k\rangle_S\,\beta^k\right)^T + (\beta^k)^T\alpha^k\beta^k
\end{aligned} \tag{8.1}$$

with $\delta^{k+1} := \langle Z^{k+1}, U^{k+1}\rangle_S$. We simplify this further by computing

$$\begin{aligned}
\langle Z^{k+1}, Q^k\rangle_S &= \langle Z^{k+1}, \bar R^k - \widetilde R^{k+1}\rangle_S\,(\lambda^k)^{-1} \\
&= \langle Z^{k+1}, \bar R^k\rangle_S\,(\lambda^k)^{-1} - \langle Z^{k+1}, \widetilde R^{k+1}\rangle_S\,(\lambda^k)^{-1} \\
&= -\langle Z^{k+1}, \bar R^{k+1}\rangle_S\,\gamma^{k+1}(\rho^k)^{-1}\alpha^k \\
&= -(\beta^k)^T\alpha^k,
\end{aligned}$$

where we used that $\langle Z^{k+1}, \bar R^k\rangle_S = \langle \bar R^{k+1}, Z^k\rangle_S = 0$, as $Z^k \in \mathcal{K}^k_S(M^{-1}A, M^{-1}R^0)$, and that $\rho^{k+1}$ is symmetric. Equation (8.1) then simplifies to

$$\alpha^{k+1} = \delta^{k+1} - (\beta^k)^T\alpha^k\beta^k.$$

We amend Algorithm 8.1 by computing δ^{k+1} together with ρ^{k+1} right after the preconditioner application. The resulting algorithm can be found in Algorithm 8.2. Note that, compared to Algorithm 8.1, Algorithm 8.2 needs memory for two more block vectors: one for the additional variable U^k and one because Z^k and Q^k cannot share their memory anymore, since Q^k is updated recursively.
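To make the fusion of reductions concrete, the following sketch packs the local contributions of ρ^{k+1} and of the convergence check into a single buffer, so that one MPI_Iallreduce covers both, while the (purely local) update of X is performed during the communication. It assumes the Euclidean block inner product and the Frobenius norm for the residual check, uses Eigen for the dense kernels, and all names are illustrative rather than taken from Dune-ISTL.

#include <mpi.h>
#include <Eigen/Dense>
#include <vector>
#include <cmath>

using Mat = Eigen::MatrixXd;

// One fused reduction for rho^{k+1} = <Z, Rbar>_S and ||Rbar sigma||_F^2.
// Zloc, Rloc, Ploc, Xloc hold the local rows of the respective block vectors.
void fused_reduction(const Mat& Zloc, const Mat& Rloc, const Mat& sigma,
                     Mat& Xloc, const Mat& Ploc, const Mat& lambda,
                     Mat& rho, double& eta, MPI_Comm comm) {
  const int s = static_cast<int>(Rloc.cols());
  std::vector<double> buf(s * s + 1);
  Eigen::Map<Mat>(buf.data(), s, s) = Zloc.transpose() * Rloc;  // local rho part
  buf[s * s] = (Rloc * sigma).squaredNorm();                    // local eta part

  MPI_Request req;
  MPI_Iallreduce(MPI_IN_PLACE, buf.data(), s * s + 1,
                 MPI_DOUBLE, MPI_SUM, comm, &req);

  Xloc += Ploc * (lambda * sigma);      // X update, overlapped with the reduction

  MPI_Wait(&req, MPI_STATUS_IGNORE);
  rho = Eigen::Map<Mat>(buf.data(), s, s);
  eta = std::sqrt(buf[s * s]);
}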

8.2 Overlap Computation and Communication

With the introduction of asynchronous communication techniques we can overlap the communication procedure with computations, like the operator or preconditioner application. In this section, we amend the algorithms of the last section by integrating these techniques. All these optimizations introduce additional computational and memory overhead. The benefit of asynchronicity must compensate for this overhead, otherwise Algorithm 4.3 would perform better. Especially for very sparse matrices or many right-hand sides, the additional vector updates easily become as expensive as the operator application or the global communication. In that case, Algorithm 8.1 is probably a good choice, as it is fairly optimized and does not introduce additional computational overhead.

As a first step, we develop an algorithm based on Algorithm 8.1 that overlaps the two block inner products with the operator and preconditioner application, respectively.

Algorithm 8.2: BCG with One Reduction (1R-BCG)
 1: R^0 = B − A X^0
 2: if η > 0 then
 3:   R̄^0, σ^0 = Norm_S(R^0)
 4: else
 5:   R̄^0 = R^0
 6:   σ^0 = I
 7: end if
 8: P^0 = M^{−1} R̄^0
 9: ρ^0 = ⟨P^0, R̄^0⟩_S
10: Q^0 = A P^0
11: α^0 = ⟨P^0, Q^0⟩_S
12: for k = 0, … until convergence do
13:   λ^k = (α^k)^{−1} ρ^k
14:   R̃^{k+1} = R̄^k − Q^k λ^k
15:   if η κ_D(α^k) > √ε_mach then
16:     R̄^{k+1}, γ^{k+1} = Norm_S(R̃^{k+1})
17:     σ^{k+1} = γ^{k+1} σ^k
18:     η^{k+1} = ‖σ^{k+1}‖
19:   else
20:     R̄^{k+1} = R̃^{k+1}
21:     γ^{k+1} = I
22:     σ^{k+1} = σ^k
23:     η^{k+1} = ‖R̄^{k+1} σ^{k+1}‖
24:   end if
25:   X^{k+1} = X^k + P^k λ^k σ^k
26:   Z^{k+1} = M^{−1} R̄^{k+1}
27:   ρ^{k+1} = ⟨Z^{k+1}, R̄^{k+1}⟩_S
28:   U^{k+1} = A Z^{k+1}
29:   δ^{k+1} = ⟨Z^{k+1}, U^{k+1}⟩_S
30:   break if η^{k+1} ≤ ε_tol
31:   β^k = (ρ^k)^{−1} (γ^{k+1})^T ρ^{k+1}
32:   P^{k+1} = Z^{k+1} + P^k β^k
33:   Q^{k+1} = U^{k+1} + Q^k β^k
34:   α^{k+1} = δ^{k+1} − (β^k)^T α^k β^k
35: end for

The approach was suggested by Gropp [Gro10] for the non-block CG method. To do so, we introduce two additional variables U^k = AZ^k and V^k = M^{−1}Q^k. These variables are computed during the global communication of the block inner products and then used to update Q^k and Z^k recursively:

$$\begin{aligned}
Q^{k+1} &= U^{k+1} + Q^k\beta^k,\\
Z^{k+1} &= M^{-1}\bar R^{k+1} = M^{-1}\widetilde R^{k+1}\left(\gamma^{k+1}\right)^{-1}
= M^{-1}\bar R^k\left(\gamma^{k+1}\right)^{-1} - M^{-1}Q^k\lambda^k\left(\gamma^{k+1}\right)^{-1}
= Z^k\left(\gamma^{k+1}\right)^{-1} - V^k\lambda^k\left(\gamma^{k+1}\right)^{-1}.
\end{aligned}$$

For the reasons mentioned in Section 4.3, it is not favorable to invert γ^{k+1}. In the iterations where a re-orthonormalization is made, we therefore rely on the direct computation of Z^{k+1} = M^{−1}R̄^{k+1}. This means that the preconditioner is applied twice in that iteration; however, this case occurs rarely. In honor of Gropp [Gro10], we call this algorithm Gropp's BCG; it is shown in Algorithm 8.3.

The next algorithm is based on the one-reduction variant of our BCG method, Algorithm 8.2. We use the same technique as before and introduce auxiliary variables to resolve dependencies and precompute the operator or preconditioner application. We reuse the variable V^k and the update formula of Z^k from the previous algorithm, but update V^k recursively, too. For that we introduce the variable W^{k+1} = M^{−1}U^{k+1}, allowing us to update V^k as

$$V^{k+1} = M^{-1}Q^{k+1} = M^{-1}U^{k+1} + M^{-1}Q^k\beta^k = W^{k+1} + V^k\beta^k.$$

The variable W^{k+1} can be computed during the communication of the block inner products. The resulting algorithm is shown in Algorithm 8.4.

If the application of the preconditioner does not suffice to hide the global communication, it can be beneficial to apply the same strategy to precompute the operator application. For that we introduce the variables S^k = AV^k and T^k = AW^k. The update of S^k is similar to that of V^k; it reads

$$S^{k+1} = AV^{k+1} = AW^{k+1} + AV^k\beta^k = T^{k+1} + S^k\beta^k.$$

Algorithm 8.3: Gropp's BCG
R^0 = B − A X^0
if η > 0 then
  R̄^0, σ^0 = Norm_S(R^0)
else
  R̄^0 = R^0
  σ^0 = I
end if
P^0 = M^{−1} R̄^0
Q^0 = A P^0
Z^0 = P^0
ρ^0 = ⟨P^0, R̄^0⟩_S
for k = 0, … until convergence do
  α^k = ⟨P^k, Q^k⟩_S
  V^k = M^{−1} Q^k
  λ^k = (α^k)^{−1} ρ^k
  R̃^{k+1} = R̄^k − Q^k λ^k
  if η κ_D(α^k) > √ε_mach then
    R̄^{k+1}, γ^{k+1} = Norm_S(R̃^{k+1})
    σ^{k+1} = γ^{k+1} σ^k
    η^{k+1} = ‖σ^{k+1}‖
    Z^{k+1} = M^{−1} R̄^{k+1}
  else
    R̄^{k+1} = R̃^{k+1}
    γ^{k+1} = I
    σ^{k+1} = σ^k
    initiate η^{k+1} = ‖R̄^{k+1} σ^{k+1}‖
    Z^{k+1} = Z^k − V^k λ^k
  end if
  X^{k+1} = X^k + P^k λ^k σ^k
  ρ^{k+1} = ⟨Z^{k+1}, R̄^{k+1}⟩_S
  U^{k+1} = A Z^{k+1}
  break if η^{k+1} ≤ ε_tol
  β^k = (ρ^k)^{−1} (γ^{k+1})^T ρ^{k+1}
  P^{k+1} = Z^{k+1} + P^k β^k
  Q^{k+1} = U^{k+1} + Q^k β^k
end for

Algorithm 8.4: Partially Pipelined BCG (PPBCG)
 1: R^0 = B − A X^0
 2: if η > 0 then
 3:   R̄^0, σ^0 = Norm_S(R^0)
 4: else
 5:   R̄^0 = R^0;  σ^0 = I
 6: end if
 7: Z^0 = P^0 = M^{−1} R̄^0
 8: ρ^0 = ⟨P^0, R̄^0⟩_S
 9: Q^0 = A P^0
10: V^0 = M^{−1} Q^0
11: α^0 = ⟨P^0, Q^0⟩_S
12: for k = 0, … until convergence do
13:   λ^k = (α^k)^{−1} ρ^k
14:   R̃^{k+1} = R̄^k − Q^k λ^k
15:   if η κ_D(α^k) > √ε_mach then
16:     R̄^{k+1}, γ^{k+1} = Norm_S(R̃^{k+1})
17:     σ^{k+1} = γ^{k+1} σ^k
18:     η^{k+1} = ‖σ^{k+1}‖
19:     Z^{k+1} = M^{−1} R̄^{k+1}
20:   else
21:     R̄^{k+1} = R̃^{k+1}
22:     γ^{k+1} = I
23:     σ^{k+1} = σ^k
24:     initiate η^{k+1} = ‖R̄^{k+1} σ^{k+1}‖
25:     Z^{k+1} = Z^k − V^k λ^k
26:   end if
27:   X^{k+1} = X^k + P^k λ^k σ^k
28:   ρ^{k+1} = ⟨Z^{k+1}, R̄^{k+1}⟩_S
29:   U^{k+1} = A Z^{k+1}
30:   δ^{k+1} = ⟨Z^{k+1}, U^{k+1}⟩_S
31:   W^{k+1} = M^{−1} U^{k+1}
32:   break if η^{k+1} ≤ ε_tol
33:   β^k = (ρ^k)^{−1} (γ^{k+1})^T ρ^{k+1}
34:   P^{k+1} = Z^{k+1} + P^k β^k
35:   Q^{k+1} = U^{k+1} + Q^k β^k
36:   V^{k+1} = W^{k+1} + V^k β^k
37:   α^{k+1} = δ^{k+1} − (β^k)^T α^k β^k
38: end for

[Figure 8.1: Schematic representation of the program flow in one iteration of the different BCG variants: BCG (Algorithm 4.3), 2R-BCG (Algorithm 8.1), 1R-BCG (Algorithm 8.2), Gropp's BCG (Algorithm 8.3), PPBCG (Algorithm 8.4) and Ghysels' BCG (Algorithm 8.5). Red blocks are computationally intensive procedures (operator and preconditioner applications), green blocks are communication. Horizontal alignment indicates concurrency.]

Table 8.1: Arithmetical and communication properties of the different BCG algorithms.

  Method                   | vector storage | vector updates | synchronizations
  BCG (Alg. 4.3)           | 4              | 3              | 3
  2R-BCG (Alg. 8.1)        | 4              | 3              | 2
  1R-BCG (Alg. 8.2)        | 6              | 4              | 1
  Gropp's BCG (Alg. 8.3)   | 6              | 5              | 2 (overlapped)
  PPBCG (Alg. 8.4)         | 8              | 6              | 1 (overlapped)
  Ghysels' BCG (Alg. 8.5)  | 10             | 8              | 1 (overlapped)

The resulting algorithm was suggested by Ghysels and Vanroose [GV14; Ash+12] for the non-block CG. Therefore, we call the block variant Ghysels' BCG; it is shown in Algorithm 8.5.

Figure 8.1 and Table 8.1 give an overview of the algorithms derived in this chapter. In Figure 8.1 the schematic program flow of one iteration is shown. Red boxes represent computations, green ones represent communications. The flow direction is from top to bottom. Boxes that are placed next to each other horizontally mean that these operations are executed simultaneously. Table 8.1 gives an overview of the performance-relevant characteristics: how much memory is needed for the block vectors, how many BAXPY operations are performed per iteration, and the number of synchronization points per iteration. We see how much overhead is introduced for optimizing the communication properties.

As the pipelining introduces additional BAXPY operations, it also increases the sensitivity to round-off errors. Cools et al. [Coo+18] analyzed the round-off errors for the pipelined CG method and proposed mitigation strategies [CV17a; CCV19]. These strategies include adaptive recomputation of the residual and variables or the introduction of shifts in the recurrence formulas. For the pipelined block versions of the CG method we observe a similar behavior. In principle, the strategies proposed by Cools, Cornelis and Vanroose should be applicable to the S-BCG method as well. However, the rigorous analysis and amendment of the algorithm is out of the scope of this work. Practical experiments showed that using a

Algorithm 8.5: Ghysels' BCG
 1: R^0 = B − A X^0
 2: if η > 0 then
 3:   R̄^0, σ^0 = Norm_S(R^0)
 4: else
 5:   R̄^0 = R^0;  σ^0 = I
 6: end if
 7: Z^0 = P^0 = M^{−1} R̄^0
 8: ρ^0 = ⟨P^0, R̄^0⟩_S
 9: U^0 = Q^0 = A P^0
10: V^0 = M^{−1} Q^0
11: S^0 = A V^0
12: α^0 = ⟨P^0, Q^0⟩_S
13: for k = 0, … until convergence do
14:   λ^k = (α^k)^{−1} ρ^k
15:   R̃^{k+1} = R̄^k − Q^k λ^k
16:   if η κ_D(α^k) > √ε_mach then
17:     R̄^{k+1}, γ^{k+1} = Norm_S(R̃^{k+1})
18:     σ^{k+1} = γ^{k+1} σ^k
19:     η^{k+1} = ‖σ^{k+1}‖
20:     Z^{k+1} = M^{−1} R̄^{k+1}
21:     U^{k+1} = A Z^{k+1}
22:   else
23:     R̄^{k+1} = R̃^{k+1}
24:     γ^{k+1} = I
25:     σ^{k+1} = σ^k
26:     initiate η^{k+1} = ‖R̄^{k+1} σ^{k+1}‖
27:     Z^{k+1} = Z^k − V^k λ^k
28:     U^{k+1} = U^k − S^k λ^k
29:   end if
30:   X^{k+1} = X^k + P^k λ^k σ^k
31:   ρ^{k+1} = ⟨Z^{k+1}, R̄^{k+1}⟩_S
32:   δ^{k+1} = ⟨Z^{k+1}, U^{k+1}⟩_S
33:   W^{k+1} = M^{−1} U^{k+1}
34:   T^{k+1} = A W^{k+1}
35:   break if η^{k+1} ≤ ε_tol
36:   β^k = (ρ^k)^{−1} (γ^{k+1})^T ρ^{k+1}
37:   P^{k+1} = Z^{k+1} + P^k β^k
38:   Q^{k+1} = U^{k+1} + Q^k β^k
39:   V^{k+1} = W^{k+1} + V^k β^k
40:   S^{k+1} = T^{k+1} + S^k β^k
41:   α^{k+1} = δ^{k+1} − (β^k)^T α^k β^k
42: end for

stronger preconditioner together with the residual re-orthonormalization strategy mitigates the instabilities and the residual gap.

In extreme situations, it is possible that the application of the operator and preconditioner does not suffice to hide the entire cost of a global communication. For these situations Cornelis, Cools and Vanroose [CCV18; Coo+19] proposed a deep-pipelined CG method, which overlaps a global communication with multiple iterations. This method requires additional BAXPY operations and memory overhead. As we consider systems with multiple right-hand sides, the operator and preconditioner application is more expensive than in the non-block case. Therefore, we do not consider deep pipelining for our block methods.

8.3 Numerical Experiments

To quantify the advantages of the communication-optimized BCG variants we run strong scaling tests. We already presented a similar study in [Bas+20a]. The difficult part when running these tests is to choose a good problem size: the communication and computation costs must be well-balanced to see the differences between the algorithms. We observed three different regimes. Firstly, the computation costs dominate the communication costs (e.g. in the sequential case or for large problems). Secondly, the communication and computation costs are similar; this is the case where the overlap of communication and computation is beneficial. In the last case the communication costs dominate (e.g. on a large distributed system with a slow interconnect or for very small problems). In the latter case we expect the method that uses the fewest global reductions to be fastest.

With these considerations in mind we choose our test problem as a plain Poisson problem, discretized with a 5-point finite difference stencil on a 500×500 grid. We found this problem size experimentally by balancing the computational and communication costs for P = 16 nodes. We use a thread-parallel block SSOR preconditioner. The result is shown in Figure 8.2. It shows the speedup of the iterations with respect to the sequential case (P = 1). We can observe the three regimes. For a smaller number of processes the BCG method (Algorithm 4.1) and 2R-BCG (Algorithm 8.1) are the fastest. This matches our expectations, as these methods do not add computational overhead. For a medium number of processes (P ≈ 16) the methods that overlap communication and computation are advantageous, which are Gropp's BCG, PPBCG and Ghysels' BCG. In the scaling limit the methods that use only one global communication per iteration are fastest, which are 1R-BCG, PPBCG and Ghysels' BCG.

Note that the actual achieved speedup of 3.5× for P = 128 nodes is rather low. This is due to the small problem size. For P = 128 nodes each node has approximately 2000 DoFs. As every node consists of 36 cores, this results in approximately 55 DoFs per core. Thus, the non-optimal scaling results are due to the bad performance of the

preconditioner for this tiny problem size. However, we see that the optimization of the global communication has an impact, even if the scaling is hindered by other effects.

[Figure 8.2: Strong scaling of the time per iteration for the different BCG variants (BCG, 2R-BCG, 1R-BCG, Gropp's BCG, PPBCG, Ghysels' BCG) on 1 to 128 nodes; speedup over the sequential BCG.]

9 Communication-optimized Block GMRes Method

In this chapter, we analyze and optimize the communication effort of the BGMRes method introduced in Chapter 5. The orthogonalization in the Arnoldi process makes up most of the communication costs; in particular, the communication overhead grows with every iteration. In the literature, the classical Gram-Schmidt method is often used in communication-dominated settings. It is fairly optimal with respect to communication but suffers from instabilities for badly conditioned problems, cf. Giraud et al. [Gir+05]. Hoemmen presented in his thesis [Hoe10] a Communication-Avoiding GMRes method (CA-GMRes) that falls into the class of s-step methods: the Arnoldi procedure applies the operator multiple times and the orthogonalization is carried out block-wise. In principle, this approach could be combined with all strategies presented in this chapter. However, as with all s-step methods, the use of CA-GMRes would require a stable s-step basis, which is not trivially constructed. A pipelined method, like the one we have seen in the last chapter, was presented by Ghysels et al. [Ghy+13]. It overlaps the communication with the application of the operator and preconditioner. Nevertheless, as we consider block systems here, we concentrate on the Arnoldi procedure, as it is more expensive than in the non-block case and should be costly enough to overlap the communication therein.

We develop and compare four variants of the BGMRes method in this chapter. All methods alter the orthogonalization algorithm used in the Arnoldi procedure. We start with the classical Gram-Schmidt orthogonalization with re-orthogonalization. Thereafter, we see how we can overlap the communication with the computation of the inner block products and block vector updates, leading to a pipelined version of the Gram-Schmidt algorithm. In addition, we develop a novel orthogonalization method based on the TSQR algorithm that is optimal with respect to the communication overhead and preserves the stability of the modified Gram-Schmidt algorithm. We call this algorithm localized Arnoldi, as it performs a local orthogonalization procedure and then communicates the result with one reduction to obtain the global result. Finally, we present the results of a numerical test that compares all these methods.

Algorithm 9.1: Modified Gram-Schmidt Orthogonalization
Input: X^{k+1} and S-orthonormal basis Q = (Q^i)_{i=0}^{k}
Q^{k+1} = X^{k+1}
for i = 0, …, k do
  ρ_i^{k+1} = ⟨Q^{k+1}, Q^i⟩_S
  Q^{k+1} ← Q^{k+1} − Q^i ρ_i^{k+1}
end for
Q^{k+1}, ρ_{k+1}^{k+1} ← Norm_S(Q^{k+1})
return Q^{k+1}, ρ_0^{k+1}, …, ρ_{k+1}^{k+1}

Algorithm 9.2: Classical(#it) Gram-Schmidt Orthogonalization
Input: X^{k+1}, S-orthonormal basis Q = (Q^i)_{i=0}^{k} and a parameter #it ∈ {1, 2}
Q^{k+1} = X^{k+1}
for j = 1, …, #it do
  for i = 0, …, k do
    ρ_i^{k+1} = ⟨Q^{k+1}, Q^i⟩_S
  end for
  for i = 0, …, k do
    Q^{k+1} ← Q^{k+1} − Q^i ρ_i^{k+1}
  end for
end for
Q^{k+1}, ρ_{k+1}^{k+1} ← Norm_S(Q^{k+1})
return Q^{k+1}, ρ_0^{k+1}, …, ρ_{k+1}^{k+1}

9.1 Classical Gram-Schmidt with Re-Orthogonalization

The modified Gram-Schmidt method we used in Algorithm 5.2 needs k + 1 global reductions to compute the next orthogonal basis vector. Algorithm 9.1 shows the modified Gram-Schmidt method. It successively projects the new direction X^{k+1} onto the orthogonal complements of the basis vectors Q^i. The classical Gram-Schmidt method, shown in Algorithm 9.2, does the same, but it computes the projection factors ρ_i^{k+1} in advance, such that the communication for them can be carried out together. Hence, it only needs two global communications – one for the orthogonalization and one for the normalization.

The numerical instabilities of the classical Gram-Schmidt method can be mitigated

Algorithm 9.3: Pipelined Gram-Schmidt Orthogonalization
Input: X^{k+1}, S-orthonormal basis Q = (Q^i)_{i=0}^{k} and a parameter 1 ≤ r ≤ k
Q^{k+1} = X^{k+1}
for i = 0, …, r − 1 do
  ρ_i^{k+1} = ⟨Q^{k+1}, Q^i⟩_S
end for
for i = r, …, k do
  ρ_i^{k+1} = ⟨Q^{k+1}, Q^i⟩_S
  Q^{k+1} ← Q^{k+1} − Q^{i−r} ρ_{i−r}^{k+1}
end for
for i = k − r + 1, …, k do
  Q^{k+1} ← Q^{k+1} − Q^i ρ_i^{k+1}
end for
Q^{k+1}, ρ_{k+1}^{k+1} ← Norm_S(Q^{k+1})

by repeating the orthogonalization [Hof89; Bjö94]. In Algorithm 9.2 this is implemented with the parameter #it. Buhr et al. [Buh+14, Algorithm 1] proposed an algorithm that determines the number of re-orthogonalizations adaptively. However, numerical experiments show that in most situations two iterations are sufficient.
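As an illustration of the classical(2) variant, the following sketch performs one classical Gram-Schmidt pass twice, each pass needing a single global reduction for all projection coefficients. It assumes the Euclidean block inner product, uses Eigen for the dense kernels, and the names (Xloc, Qloc) are illustrative; the normalization that contributes the final reduction is omitted.

#include <mpi.h>
#include <Eigen/Dense>
#include <vector>

using Mat = Eigen::MatrixXd;

// Orthogonalize the local rows Xloc against the basis blocks in Qloc with
// two classical Gram-Schmidt passes ("classical(2)").
void classical2(Mat& Xloc, const std::vector<Mat>& Qloc, MPI_Comm comm) {
  const int k1 = static_cast<int>(Qloc.size());
  const int s  = static_cast<int>(Xloc.cols());
  for (int pass = 0; pass < 2; ++pass) {           // #it = 2
    Mat rho(k1 * s, s);                            // all coefficients at once
    for (int i = 0; i < k1; ++i)
      rho.middleRows(i * s, s) = Qloc[i].transpose() * Xloc;   // local BDOTs
    MPI_Allreduce(MPI_IN_PLACE, rho.data(), static_cast<int>(rho.size()),
                  MPI_DOUBLE, MPI_SUM, comm);      // one reduction per pass
    for (int i = 0; i < k1; ++i)                   // BAXPY updates
      Xloc -= Qloc[i] * rho.middleRows(i * s, s);
  }
  // normalization (e.g. via TSQR) would follow here
}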

9.2 Pipelined Gram-Schmidt

Especially for many right-hand sides s, the BAXPY and BDOT operations become as expensive as or even more expensive than the application of the operator or preconditioner. In this case, it makes sense to overlap the reduction process with the computation of the BAXPY and BDOT kernels. Algorithm 9.3 shows a pipelined Gram-Schmidt method that precomputes r inner block products before it starts to process the BAXPYs. Every reduction of a block inner product is overlapped with r BAXPY operations. Figure 9.1 shows a schematic flow diagram of the method. Another advantage is that not only computation is overlapped, but also r + 1 global communications are in progress in parallel. Hence, even in a communication-dominated environment, we can expect a speedup.

[Figure 9.1: Schematic flow diagram of the pipelined Gram-Schmidt orthogonalization (Algorithm 9.3) for r = 2, showing the interleaved sequence BDOT(1), BDOT(2), BDOT(3), BAXPY(1), BDOT(4), BAXPY(2), …, BAXPY(6); arrows indicate the overlap of communication and computation.]

The case r = 0 corresponds to the modified Gram-Schmidt process and the case r = k corresponds to the classical Gram-Schmidt process.

The algorithm is less stable for larger r. To find a good parameter r, one must estimate a trade-off between communication overlap and stability. On unreliable networks, where the duration of the global communication is not predictable, one could adapt Algorithm 9.3 such that it checks in every iteration whether a global communication is complete. In that way, the parameter r adapts automatically to the network performance. The drawback is that this would lead to non-deterministic behavior; therefore we do not pursue this strategy further. This approach could be combined with the re-orthogonalization idea of the previous section to achieve stability. To the best of the author's knowledge, this algorithm has not been presented before, although the basic idea is straightforward. The reason might be that this method only makes sense if the BAXPY and BDOT operations consume a significant amount of time, which is the case for BGMRes with a relatively sparse matrix or many right-hand sides.
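A sketch of the orthogonalization step in Algorithm 9.3 could look as follows. It assumes the Euclidean block inner product, keeps up to r block inner products in flight via MPI_Iallreduce, and uses Eigen for the dense kernels; all names are illustrative.

#include <mpi.h>
#include <Eigen/Dense>
#include <vector>
#include <algorithm>

using Mat = Eigen::MatrixXd;

// Orthogonalize Xloc (local rows of the new direction) against Qloc[0..k]
// with pipeline depth r; the final normalization is omitted.
void pipelined_gram_schmidt(Mat& Xloc, const std::vector<Mat>& Qloc,
                            int r, MPI_Comm comm) {
  const int k1 = static_cast<int>(Qloc.size());
  const int s  = static_cast<int>(Xloc.cols());
  std::vector<Mat>         rho(k1);
  std::vector<MPI_Request> req(k1, MPI_REQUEST_NULL);

  auto start_bdot = [&](int i) {                  // initiate rho_i = <X, Q_i>_S
    rho[i] = Qloc[i].transpose() * Xloc;          // local part
    MPI_Iallreduce(MPI_IN_PLACE, rho[i].data(), s * s,
                   MPI_DOUBLE, MPI_SUM, comm, &req[i]);
  };

  for (int i = 0; i < std::min(r, k1); ++i)       // precompute r block dots
    start_bdot(i);
  for (int i = r; i < k1; ++i) {                  // overlap BDOT(i) with BAXPY(i-r)
    start_bdot(i);
    MPI_Wait(&req[i - r], MPI_STATUS_IGNORE);
    Xloc -= Qloc[i - r] * rho[i - r];
  }
  for (int i = std::max(k1 - r, 0); i < k1; ++i) {  // drain the pipeline
    MPI_Wait(&req[i], MPI_STATUS_IGNORE);
    Xloc -= Qloc[i] * rho[i];
  }
}

With r = 0 this degenerates to modified Gram-Schmidt, with r ≥ k + 1 to the classical method, mirroring the discussion above.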

9.3 Localized Arnoldi

We now develop an orthogonalization method that is based on the recursive structure of the TSQR algorithm. To the best of the author's knowledge, such an algorithm has not been presented before. In principle, it is also useful in the non-block case. For simplicity, we use the following notation:

• Vectors with elements in the *-subalgebra, η ∈ S^k, are denoted by a small bold Greek letter.
• Square brackets [...] are used to concatenate matrices or vectors, both horizontally and vertically.

Our new method relies on an extended form of the block QR decomposition. It assumes a distributed vector, i.e.

$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_P \end{bmatrix},$$

where $X_p$ is stored on processor $p$ for $p = 1,\dots,P$. Let us start with the definition of the S-QR decomposition for multiple block vectors in the context of our block framework.

Definition 9.1 (S-QR decomposition for multiple block vectors). Let $(X^i)_{i=0}^{k} \in \mathbb{R}^{n\times s}$, $(Q^i)_{i=0}^{k} \in \mathbb{R}^{n\times s}$ and $\rho^0,\dots,\rho^k \in S^{k+1}$. Then a decomposition of the form

$$\begin{bmatrix} X^0 & \dots & X^k \end{bmatrix} = \begin{bmatrix} Q^0 & \dots & Q^k \end{bmatrix} \begin{bmatrix} \rho^0 & \dots & \rho^k \end{bmatrix}$$

is called a S-QR decomposition, if $\begin{bmatrix} \rho^0 & \dots & \rho^k \end{bmatrix}$ is upper triangular and Q satisfies

$$\langle Q^i, Q^j\rangle_S = \delta_{ij} I \qquad \forall\, i, j = 0,\dots,k,$$

where δij denotes the Kronecker symbol.

The next definition generalizes the S-QR decomposition for the distributed setting.

Definition 9.2 (Distributed S-QR decomposition). A decomposition of the form

$$\begin{bmatrix} X_1^0 & \dots & X_1^k \\ \vdots & & \vdots \\ X_P^0 & \dots & X_P^k \end{bmatrix}
= \begin{bmatrix}
\left[\,P_1^0 \dots P_1^k\,\right]\left[\,\zeta_1^0 \dots \zeta_1^k\,\right] \\ \vdots \\ \left[\,P_P^0 \dots P_P^k\,\right]\left[\,\zeta_P^0 \dots \zeta_P^k\,\right]
\end{bmatrix}
\begin{bmatrix} \rho^0 & \dots & \rho^k \end{bmatrix}$$

is called a distributed S-QR decomposition, if for all l = 1, …, P it holds that

$$\langle P_l^i, P_l^j\rangle_S = \delta_{ij} I \tag{9.1}$$

and

$$\begin{bmatrix} \tilde\zeta_1^0 & \dots & \tilde\zeta_1^k \\ \vdots & & \vdots \\ \tilde\zeta_P^0 & \dots & \tilde\zeta_P^k \end{bmatrix}
= \begin{bmatrix} \zeta_1^0 & \dots & \zeta_1^k \\ \vdots & & \vdots \\ \zeta_P^0 & \dots & \zeta_P^k \end{bmatrix}
\begin{bmatrix} \rho^0 & \dots & \rho^k \end{bmatrix} \tag{9.2}$$

is a S-QR decomposition with respect to the inner block product

$$\left\langle\!\!\left\langle \begin{bmatrix} \zeta_1 \\ \vdots \\ \zeta_P \end{bmatrix}, \begin{bmatrix} \tau_1 \\ \vdots \\ \tau_P \end{bmatrix} \right\rangle\!\!\right\rangle_S := \sum_{i=1}^{P} \zeta_i^T \tau_i \in S.$$

The vectors ζ̃_p^0, …, ζ̃_p^k ∈ S^{k+1} are called the local R-factors.

Remarks.
• The local parts might be rank-deficient, even if the global system [X^0 … X^k] has full rank. In particular, it might occur that the local R-factors [ζ̃_p^0 … ζ̃_p^k] are not invertible.
• Equation (9.1) needs a definition of the block inner product on the local part of the vector space. We use the definition

$$\langle P_l, Q_l\rangle_S := \left\langle \begin{bmatrix} 0 \\ \vdots \\ P_l \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ \vdots \\ Q_l \\ \vdots \\ 0 \end{bmatrix} \right\rangle_S. \tag{9.3}$$

However, as we will see, the only property that the local block inner product needs to satisfy for our theory is

$$\left\langle \begin{bmatrix} P_1 \\ \vdots \\ P_P \end{bmatrix}, \begin{bmatrix} Q_1 \\ \vdots \\ Q_P \end{bmatrix} \right\rangle_S = \sum_{l=1}^{P} \langle P_l, Q_l\rangle_S,$$

The next lemma shows that a distributed S-QR decompostion is just a special case of the S-QR decomposition.

Lemma 9.3. A distributed S-QR decomposition (Definition 9.2) is a S-QR decomposi- tion (Definition 9.1).

Proof. We have to show that the columns of the matrix

⎡ (︂[︂ 0 k]︂ [︂ 0 k]︂)︂ ⎤ P1 ...P1 ζ1 ... ζ1 ⎢ ⎥ ⎢ . ⎥ ⎢ . ⎥ ⎣(︂[︂ 0 k]︂ [︂ 0 k ]︂)︂⎦ PP ...PP ζP ... ζP

[︂ ]︂ are S-orthonormal and that ρ0 ... ρk is upper triangular. The latter follows from the fact that (9.2) is a S-QR decomposition. For the ith and jth column of the Q-factor it holds

⎡ [︂ 0 k]︂ i ⎤ ⎡ [︂ 0 k]︂ j ⎤ P1 ...P1 ζ1 P1 ...P1 ζ1 ⎢ ⎥ ⎢ ⎥ P [︂ ]︂ [︂ ]︂ ⎢ . ⎥ ⎢ . ⎥ ∑︂ 0 k i 0 k j ⟨⎢ . ⎥ , ⎢ . ⎥⟩S = ⟨ Pp ...Pp ζp, Pp ...Pp ζp⟩S ⎣[︂ 0 k]︂ i ⎦ ⎣[︂ 0 k]︂ j ⎦ p=1 PP ...PP ζP PP ...PP ζP P ∑︂ i T j = ζp ζp p=1 ⟨⟨ζi, ζj⟩⟩ δ , = S = ijI

⎛⎡ i ⎤⎞k ζ1 (︂ )︂k ⎜⎢ ⎥⎟ i ⎜⎢ . ⎥⎟ where we used that Pp as well as ⎜⎢ . ⎥⎟ are S-orthonormal systems. i=0 ⎝⎣ ⎦⎠ ζi P i=0 As we want to apply this method in the , we assume that we have given a distributed S-QR decomposition and want to extend it by another vector Xk+1. Inspired by the TSQR algorithm presented in Section 7.3, the idea is that a distributed S-QR decomposition can be extended by the direction Xk+1 by computing a local S-QR decomposition, followed by a global reduction of the local R-factors. 9.3 Localized Arnoldi 91

Let the Q-factor

$$\begin{bmatrix}
\left[\,P_1^0 \dots P_1^k\,\right]\left[\,\zeta_1^0 \dots \zeta_1^k\,\right] \\ \vdots \\ \left[\,P_P^0 \dots P_P^k\,\right]\left[\,\zeta_P^0 \dots \zeta_P^k\,\right]
\end{bmatrix}$$

of a distributed S-QR decomposition and a distributed block vector X be given. The aim of the localized Arnoldi method is to compute P_1^{k+1}, …, P_P^{k+1}, ζ_1^{k+1}, …, ζ_P^{k+1} and ρ^{k+1} such that

$$X_p = \begin{bmatrix} P_p^0 & \dots & P_p^{k+1} \end{bmatrix}
\begin{bmatrix} \begin{matrix} \zeta_p^0 & \dots & \zeta_p^k \\ 0^T & \dots & 0^T \end{matrix} & \;\zeta_p^{k+1} \end{bmatrix} \rho^{k+1}
\qquad \text{for } p = 1, \dots, P,$$

and

$$\langle P_l^i, P_l^{k+1}\rangle_S = \delta_{i,k+1} I \quad \forall\, l = 1,\dots,P,\; i = 0,\dots,k+1,
\qquad
\langle\!\langle \zeta^i, \zeta^{k+1}\rangle\!\rangle_S = \delta_{i,k+1} I \quad \forall\, i = 0,\dots,k+1.$$

The local Q-factors P_l^{k+1} are computed by applying any stable orthogonalization method locally; we use the modified Gram-Schmidt method in our examples. After that, the global S-QR decomposition of the local R-factors must be computed. This can be achieved by either gathering them on one master process or performing the decomposition recursively in a reduction procedure. The latter is preferable on large-scale machines; we describe the recursive procedure in the following.

The reduction procedure is performed on a tree in which every node stores its local orthogonal basis

$$\left( \begin{bmatrix} \zeta_0^j \\ \zeta_1^j \end{bmatrix} \right)_{j=0}^{k} \subset S^{2k}.$$

In particular, every node holds a state that is extended from iteration to iteration. Hence, in every iteration of the Arnoldi method, the same tree must be used. At the beginning of every iteration, every process orthonormalizes its local part of the new block vector P_p^{k+1} against its local orthonormal basis. The local R-factor ζ̃_p^{k+1} ∈ S^k is then sent to its parent. During the reduction operation, every node in the reduction tree receives two vectors ζ̃_0^{k+1}, ζ̃_1^{k+1}, which are stacked on top of each other and then orthonormalized against its local basis. The resulting R-factor ρ^{k+1} is sent to its parent. The root node does the same, but sends its local orthonormal Q-factors ζ_0^{k+1}, ζ_1^{k+1} back to its children. This initiates the back-propagation.

Algorithm 9.4: Reduction Process for the Localized Arnoldi Method
Receive ζ̃_0^{k+1} and ζ̃_1^{k+1} from the children
for i = 0, …, k do
  ρ_i^{k+1} = Σ_{j=0}^{i} ( (ζ_0^i)_j^T (ζ̃_0^{k+1})_j + (ζ_1^i)_j^T (ζ̃_1^{k+1})_j )
  ζ̃_0^{k+1} ← ζ̃_0^{k+1} − [ζ_0^i; 0] ρ_i^{k+1}
  ζ̃_1^{k+1} ← ζ̃_1^{k+1} − [ζ_1^i; 0] ρ_i^{k+1}
end for
[ζ_0^{k+1}; ζ_1^{k+1}] ρ_{k+1}^{k+1} = [ζ̃_0^{k+1}; ζ̃_1^{k+1}]    ▷ normalization
if not root then
  Send ρ^{k+1} to the parent
else
  Send ζ_0^{k+1} and ζ_1^{k+1} back to the children to start the back-propagation
  Broadcast ρ^{k+1} to all processes
end if

Algorithm 9.5: Back-Propagation for the Localized Arnoldi Method
Receive ρ̃^{k+1} from the parent
ζ̄_0^{k+1} = Σ_{i=0}^{k+1} [ζ_0^i; 0] ρ̃_i^{k+1}
ζ̄_1^{k+1} = Σ_{i=0}^{k+1} [ζ_1^i; 0] ρ̃_i^{k+1}
if not leaf then
  Send ζ̄_0^{k+1} and ζ̄_1^{k+1} to the children
end if

The resulting R-factor of the root is the global one. Every node in the tree receives the Q-factor from its parent and multiplies it with its local Q-factors; the result is then back-propagated to the children. On the leaves, the globally orthogonal basis can be obtained by multiplying the received Q-factor with the locally orthogonal part of the block vector P_p^{k+1}. Algorithm 9.4 and Algorithm 9.5 show the reduction and back-propagation procedures; Figure 9.2 and Figure 9.3 illustrate the corresponding operations. Unfortunately, the current MPI standard does not allow to define a custom back-propagation method. Therefore, we implemented the reduction pattern using point-to-point communication. The tree can either be a binary tree based on the MPI rank numbers, or it can be deduced from an MPI_Allreduce pattern, where we use a custom reduction function to determine the tree. A prototype implementation can be found in the GitLab repository of the author¹.

¹ https://zivgitlab.uni-muenster.de/n_drei02/tsqr_communication_pattern
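The communication pattern itself (a tree reduction followed by a back-propagation) can be illustrated with plain point-to-point messages as below. This is not the author's implementation: the payload is a flat vector and the "combine" step is an element-wise sum, standing in for the orthogonalization of Algorithm 9.4 and the multiplication of Algorithm 9.5.

#include <mpi.h>
#include <vector>

// Binary-tree reduction with subsequent back-propagation of the result.
void tree_reduce_backprop(std::vector<double>& data, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  const int n = static_cast<int>(data.size());
  const int left = 2 * rank + 1, right = 2 * rank + 2, parent = (rank - 1) / 2;

  // reduction phase: receive from the children, combine, send to the parent
  std::vector<double> buf(n);
  for (int child : {left, right})
    if (child < size) {
      MPI_Recv(buf.data(), n, MPI_DOUBLE, child, 0, comm, MPI_STATUS_IGNORE);
      for (int i = 0; i < n; ++i) data[i] += buf[i];   // placeholder for Alg. 9.4
    }
  if (rank != 0)
    MPI_Send(data.data(), n, MPI_DOUBLE, parent, 0, comm);

  // back-propagation phase: receive the result from the parent, pass it on
  if (rank != 0)
    MPI_Recv(data.data(), n, MPI_DOUBLE, parent, 1, comm, MPI_STATUS_IGNORE);
  for (int child : {left, right})
    if (child < size)
      MPI_Send(data.data(), n, MPI_DOUBLE, child, 1, comm);   // placeholder for Alg. 9.5
}

An MPI_Allreduce cannot express the back-propagation step with a custom operation, which is why this hand-rolled pattern is needed.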

Algorithm 9.6: Localized Arnoldi Method

ζ̃_p^0, P_p^0 = Norm_S(X_p^0)    ▷ local
Compute the local and global R-factors ζ_p^0, ρ^0 ∈ S^1 from ζ̃^0 using Alg. 9.4 and 9.5
for k = 0, …, k_max do
  X_p^k = Σ_{j=0}^{k} P_p^j (ζ_p^k)_j    ▷ assemble global orth. vector
  P_p^{k+1} = A X_p^k    ▷ requires neighbor communication
  for j = 0, …, k do    ▷ local modified Gram-Schmidt
    (ζ̃_p^{k+1})_j = ⟨P_p^j, P_p^{k+1}⟩_S
    P_p^{k+1} ← P_p^{k+1} − P_p^j (ζ̃_p^{k+1})_j
  end for
  (ζ̃_p^{k+1})_{k+1}, P_p^{k+1} ← Norm_S(P_p^{k+1})
  Compute the global R-factors ζ_p^{k+1}, ρ^{k+1} ∈ S^{k+2} from ζ̃^{k+1} using Alg. 9.4 and 9.5
end for

[Figure 9.2: Reduction operation of the localized Arnoldi method: the received local R-factors ζ̃_0^{k+1}, ζ̃_1^{k+1} are orthogonalized (Gram-Schmidt) against the node's stored basis, yielding the R-factor ρ^{k+1} that is passed on.]

[Figure 9.3: Back-propagation operation of the localized Arnoldi method: the factor ρ̃^{k+1} received from the parent is multiplied with the node's stored Q-factors and passed on to the children.]

Table 9.1: Comparison of the arithmetical complexity, number of messages and stability for different orthogonalization methods in the BGMRes method.

  orthogonalization | arith. complexity                  | messages                 | stable
  classical         | 4kp²q n/P                          | O(log(P))                | no
  classical(2)      | 8kp²q n/P                          | O(log(P))                | yes
  modified          | 4kp²q n/P                          | O(k log(P))              | yes
  pipelined(r)      | 4kp²q n/P                          | O(k log(P)) (overlapped) | hybrid, depends on r
  localized         | 6kp²q n/P + O(log(P) k²p²q)        | O(log(P))                | yes

Algorithm 9.6 shows the localized Arnoldi method. It uses the modified Gram-Schmidt method to compute the local orthogonalization. One small drawback of the method lies in the reduction process: as can be seen in Algorithms 9.4 and 9.5, the arithmetical complexity increases quadratically with the size k of the basis. This leads to an effort of O(k²) for the reduction operation, which could become a bottleneck for large k. Furthermore, the method computes 2(k + 1) BAXPY and k + 1 BDOT operations in the k-th iteration. This leads to an overall effort of 6kp²q + O(log(P) k²p²q).

Table 9.1 compares the localized Arnoldi method with the other methods developed in this chapter. Note that the localized Arnoldi strategy performs only one reduction pattern and the back-propagation, while the classical orthogonalization method performs two reductions and the broadcast of the result. From the theoretical values in Table 9.1, the localized method is competitive with the classical(2) method. However, the methods differ in the arithmetical complexity and the number of messages, as the localized method only needs one reduction and back-propagation instead of four reductions and four broadcasts. For a BGMRes method with fewer iterations, or a low restart parameter, the localized method would perform better, while for large k the classical(2) orthogonalization method or the pipelined version would perform better.

9.4 Numerical Experiments

In this chapter, we introduced several orthogonalization methods for the BGMRes method. All resulting BGMRes methods are mathematically equivalent but differ in their arithmetical and communication efforts. To compare all these methods, we run the same test problem as in Figure 5.1, which is the Simon/raefsky3 matrix. The result is shown in Figure 9.4. We used a block size of p = 4 and restart the BGMRes method after 100 iterations. For a fixed number of processors P all methods need almost the same number of iterations; hence, stability is not an issue here.

As in the previous benchmark, we can observe the same three regimes. In the sequential case, all methods take almost the same time. Only the classical(2) method is slightly slower, as it performs the orthogonalization procedure twice, which introduces overhead.

[Figure 9.4: Speedup of the runtime for the different BGMRes variants (modified, classical, classical(2), localized/distributed, pipelined(1), pipelined(3)) on 1, 16 and 128 nodes, compared to the modified method. Colors encode the orthogonalization method.]

Running the test on a medium number of processors shows a significant speedup of all the communication-optimized methods. On P = 16 nodes, the pipelined(3) and classical methods are fastest; they reach a speedup of almost 4× over the modified Gram-Schmidt method. The other methods, classical(2), localized Arnoldi and pipelined(1), only reach a speedup of almost 3×. In the scaling limit (P = 128) the classical method is clearly the fastest, followed by the pipelined(3) method. The classical(2) method performs significantly better than the localized method in this regime. This could be due to the suboptimal scaling of the self-implemented reduction operation or the k² term in the reduction operation. Furthermore, the reduction pattern is not yet optimized for the network hardware. We assume that the performance of the localized method could be significantly improved if this communication pattern were supported by the MPI implementation. A better implementation and quantification of the performance is an objective of future work. Overall, we see that the optimization of the orthogonalization strategy can have a significant impact on the performance of the BGMRes method.

10 Pipelined Block BiCGStab Method

In this chapter, we apply the same techniques to the BBiCGStab method as we used for the BCG method in Chapter 8. As the introduction of further recursions introduces sources of numerical instability, we do not pursue this very aggressively. Numerical experiments showed that the mitigation of these instabilities is not trivial and is therefore left for further work.

A study of the BBiCGStab method (Algorithm 6.3) shows that it is already in a good state to overlap the communication with the preconditioner application without many changes. In every iteration two applications of the operator and two applications of the preconditioner are performed. Global communication is needed to compute the coefficients λ̂^k, β̂^k and ω̂^k. The global communication for λ̂^k can already be overlapped with the computation of Ẑ^k = M^{−1}Q̂^k. Furthermore, the computation of ω̂^k and β̂^k can be fused. To overlap that communication, we introduce a new variable Ŵ^k = M^{−1}Û^k that can be computed during the communication and then be used to update V̂ recursively, i.e.

$$\hat V^{k+1} = M^{-1}\hat R^{k+1} = \hat T^k - \hat\omega^k \hat W^k.$$

Together with this communication, we can also compute the product ⟨Ŝ^k, Ŝ^k⟩_S, which is then used to determine the residual norm for the break criterion and for the decision whether we need to re-orthogonalize the residual in the next iteration. In that way, we overlap all communication with the two preconditioner applications. The communication-optimized variant of the BBiCGStab method can be found in Algorithm 10.1. The arrows indicate which operations can overlap.

A further optimization would be to precompute the operator application, such that it can also be computed during the communication. An attempt by the author failed due to instabilities. The analysis and mitigation of these instabilities is an objective of further work; it could be based on the work of Cools and Vanroose [CV17b; Coo19], who analyzed the round-off errors for the pipelined non-block BiCGStab method.

Algorithm 10.1: Pipelined BBiCGStab with Adaptive Residual Re-Orthonormalization
R^0 = B − A X^0
if η > 0 then
  R̂^0, σ^0 = Norm_S(R^0)
else
  R̂^0 = R^0
  σ^0 = I
end if
P̂^0 = M^{−1} R̂^0
χ^0 = ⟨R̂^0, R̂^0⟩_S
Choose M^{−T}R̃^0 (e.g. M^{−T}R̃^0 = P̂^0)
V̂^0 = P̂^0
for k = 0, … do
  Q̂^k = A P̂^k
  λ̂^k = (⟨M^{−T}R̃^0, Q̂^k⟩_S)^{−1} ⟨M^{−T}R̃^0, R̂^k⟩_S
  Ẑ^k = M^{−1} Q̂^k
  Ŝ^k = R̂^k − Q̂^k λ̂^k
  X^{k+1/2} = X^k + P̂^k λ̂^k σ^k
  if η κ_D(χ^k) > √ε_mach then
    Ŝ^k, γ^k ← Norm_S(Ŝ^k)
    σ^{k+1} = γ^k σ^k
    T̂ = M^{−1} Ŝ^k
  else
    T̂ = V̂^k − Ẑ^k λ̂^k
  end if
  χ^{k+1} = ⟨Ŝ^k, Ŝ^k⟩_S
  Û^k = A T̂
  ω̂^k = ⟨Û^k, Ŝ^k⟩_F / ⟨Û^k, Û^k⟩_F
  β̂^k = −(⟨M^{−T}R̃^0, Q̂^k⟩_S)^{−1} ⟨M^{−T}R̃^0, Û^k⟩_S
  Ŵ = M^{−1} Û^k
  break if ‖χ^{k+1}‖ ≤ ε_tol
  X^{k+1} = X^{k+1/2} + ω̂^k T̂ σ^k
  R̂^{k+1} = Ŝ^k − ω̂^k Û^k
  V̂^{k+1} = T̂ − ω̂^k Ŵ
  P̂^{k+1} = V̂^{k+1} + (P̂^k − ω̂^k Ẑ^k) β̂^k
end for

[Figure 10.1: Speedup of the runtime for the different BiCGStab variants (BBiCGStab and pipelined BBiCGStab, each for p = 1 and p = k) on 1, 16 and 128 nodes. The numbers above the bars indicate the number of iterations needed to reduce the residual by a factor of 10^{-7}.]

10.1 Numerical Experiments

Although we did not spend much effort on the optimization of the BBiCGStab method, we want to quantify its performance in the parallel case. We compare the BBiCGStab method (Algorithm 6.3) with its optimized version (Algorithm 10.1). For that, we use s = 4 right-hand sides and compare the results for the parallel (S_P) and block (S_B) method. The test problem is the same as in Section 6.3, i.e. the raefsky3 matrix. We use an ILU(0) preconditioner with an additive Schwarz decomposition for the parallelization. Figure 10.1 shows the speedup of the different versions compared to the parallel BBiCGStab method. The numbers at the top of the bars show the number of iterations the method needs to reduce the residual by a factor of 10^{-7}. As expected, in the sequential case the block method is faster than the parallel method, but the optimization does

not yield any remarkable benefits. We see that the pipelined version is slightly slower in this regime, as it introduces arithmetical overhead. However, the optimization of the algorithm does not affect the stability; the pipelined BBiCGStab method actually needs slightly fewer iterations than its non-optimized counterpart. In the parallel cases, the optimized versions are superior to the non-optimized ones. They are approximately 50% faster, which is due to the fewer global synchronizations they perform. The increase in iterations for larger node counts is due to the weaker preconditioning, as the domain is decomposed into more Schwarz domains. Also in this case, the pipelined BBiCGStab method needs fewer iterations than its BBiCGStab counterpart.

11 Summary and Outlook

Nothing in life is to be feared, it is only to be understood. Marie Curie

Finally, we summarize the achievements of this thesis and give an outlook on future work and ideas on how to transfer some of our approaches to other contexts.

In Chapter 3 we reviewed the block Krylov framework by Frommer, Szyld and Lund [FSL17; FLS19] and presented the novel block-global method, which, as we saw later, turned out to be irrelevant in practice. We analyzed the framework and its building blocks concerning the performance on modern hardware. In particular, we considered the applicability of SIMD instructions and their arithmetical intensity. The advantage of the block Krylov framework over classical block Krylov methods is that the blocking overhead can be balanced even for a fixed large number of right-hand sides. We saw that the vector update and inner product kernels perform with constant time per right-hand side up to a certain blocking parameter. This means that the faster convergence rate comes for free in this setting.

In the following chapters, we formulated the block variants of the CG, GMRes and BiCGStab methods based on that block Krylov framework. For the block CG method, we provided a convergence analysis which gave insights into the behavior of the different block Krylov variants. In particular, we saw why the novel block-global method is inferior to the block-parallel method. Furthermore, we introduced a novel stabilization strategy which stabilizes the method and avoids the process of deflation. Deflation, which is applied in a lot of other works in the literature, is not appropriate in our context, as we stick to the number of lanes given by the SIMD interface. For the stabilization strategy, we decided to orthonormalize the residual with respect to the Euclidean block inner product. Dubrulle [Dub01] suggested in his work to orthonormalize with respect to the

inner product that is given by the preconditioner. This would simplify the algorithm but disqualifies the Householder algorithm for the orthonormalization. The analysis and implementation of this approach is left for future work. All the theoretical findings about the convergence rate and the benefits of the stabilization strategy are supported by numerical tests.

In Chapter 5, we addressed the block GMRes method. As for the block CG method, we formulated the method in the context of the block Krylov framework. To do so, we used a generalization of the Givens rotations to triangulate the Hessenberg matrix. In a numerical experiment, we observed that the convergence rates of the different block Krylov variants are connected similarly as for the CG method.

As a last block Krylov method, we considered the block BiCGStab method in Chapter 6. We provided a formulation based on orthogonal polynomials in the context of the block Krylov framework. Furthermore, we applied a similar residual re-orthonormalization strategy as for the block CG method. Numerical tests showed that also for the block BiCGStab method the block-global variant performs worse. In future work, it would be interesting to investigate whether the stabilization coefficient, which we have chosen as a scalar in our method, could be chosen as an element of the *-subalgebra S. Besides that, higher-level stabilized BiCGStab methods, as presented by Saito et al. [STI14], could be applied in our context.

The second part of the thesis is about the optimization of the block Krylov methods with respect to communication on distributed memory systems. For that, we discussed the conditions and challenges in distributed systems and created a benchmark to measure the amount of time that can be used for computation while a collective communication is active. Furthermore, we reviewed the TSQR algorithm, which is a key building block for block Krylov methods on distributed systems. In the following chapter, we presented five variants of the block CG method with different communication properties by applying the approaches of Ghysels and Vanroose [GV14] and Gropp [Gro10]. In the numerical tests on a medium-sized supercomputer, we observed three regimes in which the methods behave differently. This coincides with our theoretical expectations. In the communication-dominated regime, we achieved good speedups compared to the default block CG method.

To optimize the block GMRes method with respect to communication, we considered, in contrast to many approaches in the literature, the orthogonalization method in the Arnoldi process. This makes sense in our context, as the block vector update and the inner block product are rather expensive, so that overlapping them with computation already suffices to hide the whole communication costs. Moreover, we presented a novel method for the orthogonalization process that is based on the TSQR algorithm and allows an orthogonalization with only one global reduction communication. It would be interesting to examine how this method could be combined with other communication-avoiding approaches, like s-step GMRes. Another interesting subject would be to use this method, for example, in the MINRes method. In principle, the applicability of the

method is not restricted to block Krylov methods. In the numerical experiments, we compared the different orthogonalization methods and observed a significant speedup compared to the standard block GMRes method.

In the last chapter of the thesis we briefly reviewed the amenability of pipelining techniques to the block BiCGStab method. We found that some approaches can easily be applied, while others introduce too much numerical instability, such that we did not pursue them further; this would be an interesting subject for future work. However, the approaches and optimizations that we have applied already yield a considerable speedup in our numerical experiments.

In summary, we presented tailored methods to solve large sparse systems on modern supercomputing hardware. We demonstrated their advantages in numerical tests on both the node level and a large distributed system. The author aims for the integration of the methods into the Dune-ISTL module and intends to present a merge request soon after submitting this thesis to make the results of the thesis available to the community. In the future, we want to provide a better comparison of the new stabilization methods with deflation strategies, both experimentally and analytically. We hope that this could give better insights for choosing the re-orthonormalization parameter. Moreover, we want to improve the implementation of the reduction and back-propagation communication pattern and make it available to the community, as it could be a generic building block for more methods like the localized Arnoldi method. In addition, we want to investigate how we could make the methods usable for a broader range of problems. One idea is to apply them in ODE solvers and compute multiple time steps simultaneously. This would formally lead to an approach like in parallel-in-time methods, which are currently very popular in the scientific computing research community. However, the gained parallelism could also be used to apply block Krylov methods instead of distributing it over the nodes. This would not only decrease the inter-node communication, but also improve the other aspects discussed in this work.

Bibliography

[Al 18] Hussam Al Daas. “Solving linear systems arising from reservoirs modeling”. PhD thesis. Inria Paris; Sorbonne Université, UPMC University of Paris 6, Laboratoire Jacques-Louis Lions, Dec. 2018. url: https://hal.inria.fr/tel-01984047 (cit. on p. 15).
[Al +18] Hussam Al Daas, Laura Grigori, Pascal Hénon and Philippe Ricoux. “Enlarged GMRES for solving linear systems with one or multiple right-hand sides”. In: IMA Journal of Numerical Analysis (2018) (cit. on p. 15).
[Kry31] Aleksey Nikolaevich Krylov. “On the numerical solution of the equation by which in technical questions frequencies of small oscillations of material systems are determined”. In: Izvestija AN SSSR (News of Academy of Sciences of the USSR), Otdel. mat. i estest. nauk 7.4 (1931), pp. 491–539 (cit. on p. 6).
[And+99] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney and D. Sorensen. LAPACK Users’ Guide. Third. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999 (cit. on p. 25).
[Arn+20] Daniel Arndt, Wolfgang Bangerth, Bruno Blais, Thomas C. Clevenger, Marc Fehling, Alexander V. Grayver, Timo Heister, Luca Heltai, Martin Kronbichler, Matthias Maier, Peter Munch, Jean-Paul Pelteret, Reza Rastak, Ignacio Tomas, Bruno Turcksin, Zhuoran Wang and David Wells. “The deal.II Library, Version 9.2”. In: submitted (2020) (cit. on p. 4).
[Arn51] Walter Edwin Arnoldi. “The principle of minimized iterations in the solution of the matrix eigenvalue problem”. In: Quarterly of Applied Mathematics 9.1 (1951), pp. 17–29. url: https://hal.archives-ouvertes.fr/hal-01712943 (cit. on pp. 7, 46).
[Ash+12] Thomas J. Ashby, Pieter Ghysels, Wim Heirman and Wim Vanroose. “The Impact of Global Communication Latency at Extreme Scales on Krylov Methods”. In: Algorithms and Architectures for Parallel Processing. Ed. by Yang Xiang, Ivan Stojmenovic, Bernady O. Apduhan, Guojun Wang, Koji Nakano and Albert Zomaya. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 428–442 (cit. on p. 81).

[Bal+20] Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Alp Dener, Victor Eijkhout, William D. Gropp, Dmitry Karpeyev, Dinesh Kaushik, Matthew G. Knepley, Dave A. May, Lois Curfman McInnes, Richard Tran Mills, Todd Munson, Karl Rupp, Patrick Sanan, Barry F. Smith, Stefano Zampini, Hong Zhang and Hong Zhang. PETSc Web page. 2020. url: https://www.mcs.anl.gov/petsc (cit. on p. 4).
[Bal+97] Satish Balay, William D. Gropp, Lois Curfman McInnes and Barry F. Smith. “Efficient Management of Parallelism in Object Oriented Numerical Software Libraries”. In: Modern Software Tools in Scientific Computing. Ed. by E. Arge, A. M. Bruaset and H. P. Langtangen. Birkhäuser Press, 1997, pp. 163–202 (cit. on p. 4).
[Bas+08a] P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, R. Kornhuber, M. Ohlberger and O. Sander. “A Generic Grid Interface for Parallel and Adaptive Scientific Computing. Part II: Implementation and Tests in DUNE”. In: Computing 82.2–3 (2008), pp. 121–138. url: http://dx.doi.org/10.1007/s00607-008-0004-9 (cit. on pp. 4, 23).
[Bas+08b] P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, M. Ohlberger and O. Sander. “A Generic Grid Interface for Parallel and Adaptive Scientific Computing. Part I: Abstract Framework”. In: Computing 82.2–3 (2008), pp. 103–119. url: http://dx.doi.org/10.1007/s00607-008-0003-x (cit. on pp. 4, 23).
[Bas+20a] Peter Bastian, Mirco Altenbernd, Nils-Arne Dreier, Christian Engwer, Jorrit Fahlke, René Fritze, Markus Geveler, Dominik Göddeke, Oleg Iliev, Olaf Ippisch, Jan Mohring, Steffen Müthing, Mario Ohlberger, Dirk Ribbrock, Nikolay Shegunov and Stefan Turek. “Exa-Dune—Flexible PDE Solvers, Numerical Methods and Applications”. In: Software for Exascale Computing: Some Remarks on the Priority Program SPPEXA. Springer International Publishing, 2020, pp. 225–269 (cit. on p. 83).
[Bas+20b] Peter Bastian, Markus Blatt, Andreas Dedner, Nils-Arne Dreier, Christian Engwer, René Fritze, Carsten Gräser, Christoph Grüninger, Dominic Kempf, Robert Klöfkorn, et al. “The DUNE framework: basic concepts and recent developments”. In: Computers & Mathematics with Applications (2020) (cit. on pp. 4, 23).
[Bha+12] Amit Bhaya, Pierre-Alexandre Bliman, Guilherme Niedu and Fernando Pazos. “A cooperative conjugate gradient method for linear systems permitting multithread implementation of low complexity”. In: 2012 IEEE 51st IEEE Conference on Decision and Control (CDC) (Dec. 2012) (cit. on p. 15).

[Bjö94] Åke Björck. “Numerics of Gram-Schmidt orthogonalization”. In: Linear Algebra and Its Applications 197 (1994), pp. 297–316 (cit. on pp. 47, 87).
[BB07] M. Blatt and P. Bastian. “The Iterative Solver Template Library”. In: Applied Parallel Computing – State of the Art in Scientific Computing. Ed. by B. Kågström, E. Elmroth, J. Dongarra and J. Wasniewski. Berlin/Heidelberg: Springer, 2007, pp. 666–675 (cit. on p. 25).
[Bla+16] M. Blatt, A. Burchardt, A. Dedner, Ch. Engwer, J. Fahlke, B. Flemisch, Ch. Gersbacher, C. Gräser, F. Gruber, Ch. Grüninger, D. Kempf, R. Klöfkorn, T. Malkmus, S. Müthing, M. Nolte, M. Piatkowski and O. Sander. “The Distributed and Unified Numerics Environment, Version 2.4”. In: Archive of Numerical Software 4.100 (2016), pp. 13–29. url: http://dx.doi.org/10.11588/ans.2016.100.26526 (cit. on pp. 4, 23).
[Bos14] Siegfried Bosch. Lineare Algebra. Springer-Verlag, 2014 (cit. on p. 16).
[Buh+14] Andreas Buhr, Christian Engwer, Mario Ohlberger and Stephan Rave. “A Numerically Stable A Posteriori Error Estimator for Reduced Basis Approximations of Elliptic Equations.” In: 11th World Congress on Computational Mechanics, WCCM 2014, 5th European Conference on Computational Mechanics, ECCM 2014 and 6th European Conference on Computational Fluid Dynamics, ECFD 2014:1407.8005. Ed. by E. Oñate, X. Oliver and A. Huerta. CIMNE, Barcelona, 2014, pp. 4094–4102 (cit. on pp. 47, 87).
[Car15] Erin Claire Carson. “Communication-avoiding Krylov subspace methods in theory and practice”. PhD thesis. UC Berkeley, 2015 (cit. on pp. 3, 66).
[CM69] Daniel Chazan and Willard Miranker. “Chaotic relaxation”. In: Linear Algebra and its Applications 2.2 (1969), pp. 199–222 (cit. on p. 67).
[CG89] Anthony T. Chronopoulos and Charles W. Gear. “s-Step iterative methods for symmetric linear systems”. In: Journal of Computational and Applied Mathematics 25.2 (1989), pp. 153–168 (cit. on p. 66).
[CK90] Anthony T. Chronopoulos and Sung-han Kim. “s-Step Orthomin and GMRES implemented on parallel computers”. In: 90/43R (1990) (cit. on p. 66).
[CK10] Anthony T. Chronopoulos and Andrey B. Kucherov. “Block s-step Krylov iterative methods”. In: Numerical Linear Algebra with Applications 17.1 (2010), pp. 3–15 (cit. on p. 4).
[Coo19] Siegfried Cools. “Analyzing and improving maximal attainable accuracy in the communication hiding pipelined BiCGStab method”. In: Parallel Computing 86 (2019), pp. 16–35 (cit. on p. 99).

[Coo+19] Siegfried Cools, Jeffrey Cornelis, Pieter Ghysels and Wim Vanroose. “Improving strong scaling of the conjugate gradient method for solving large linear systems using global reduction pipelining”. In: arXiv preprint arXiv:1905.06850 (2019) (cit. on p. 83).
[CCV19] Siegfried Cools, Jeffrey Cornelis and Wim Vanroose. “Numerically stable recurrence relations for the communication hiding pipelined conjugate gradient method”. In: IEEE Transactions on Parallel and Distributed Systems (2019) (cit. on pp. 3, 81).
[CV17a] Siegfried Cools and Wim Vanroose. “Numerically Stable Variants of the Communication-hiding Pipelined Conjugate Gradients Algorithm for the Parallel Solution of Large Scale Symmetric Linear Systems”. In: arXiv preprint arXiv:1706.05988 (2017) (cit. on pp. 3, 5, 81).
[CV17b] Siegfried Cools and Wim Vanroose. “The communication-hiding pipelined BiCGStab method for the parallel solution of large unsymmetric linear systems”. In: Parallel Computing 65 (2017), pp. 1–20 (cit. on p. 99).
[Coo+18] Siegfried Cools, Emrullah Fatih Yetkin, Emmanuel Agullo, Luc Giraud and Wim Vanroose. “Analyzing the effect of local rounding error propagation on the maximal attainable accuracy of the pipelined Conjugate Gradient method”. In: SIAM Journal on Matrix Analysis and Applications 39.1 (2018), pp. 426–450 (cit. on pp. 3, 81).
[CCV18] Jeffrey Cornelis, Siegfried Cools and Wim Vanroose. “The communication-hiding conjugate gradient method with deep pipelines”. In: arXiv preprint arXiv:1801.04728 (2018) (cit. on p. 83).
[DH11] Timothy A. Davis and Yifan Hu. “The University of Florida Sparse Matrix Collection”. In: ACM Trans. Math. Softw. 38.1 (Dec. 2011). url: https://doi.org/10.1145/2049662.2049663 (cit. on pp. 42, 52, 62).
[Dem+08] James Demmel, Laura Grigori, Mark Hoemmen and Julien Langou. “Communication-avoiding parallel and sequential QR factorizations”. In: CoRR abs/0806.2159 (2008) (cit. on pp. 3, 71).
[Dem+12] James Demmel, Laura Grigori, Mark Hoemmen and Julien Langou. “Communication-optimal parallel and sequential QR and LU factorizations”. In: SIAM Journal on Scientific Computing 34.1 (2012), pp. 206–239 (cit. on pp. 3, 71).
[DE20] Nils-Arne Dreier and Christian Engwer. “Strategies for the vectorized Block Conjugate Gradients method”. In: Numerical Mathematics and Advanced Applications – ENUMATH 2019. Vol. 139. To appear. Springer, 2020 (cit. on p. 30).

[Dub01] Augustin A. Dubrulle. “Retooling the method of block conjugate gradients”. In: Electronic Transactions on Numerical Analysis 12 (2001), pp. 216–233. url: http://etna.mcs.kent.edu/vol.12.2001/pp216-233.dir/pp216-233.pdf (cit. on pp. 31, 38, 101).
[EES83] Stanley C. Eisenstat, Howard C. Elman and Martin H. Schultz. “Variational iterative methods for nonsymmetric systems of linear equations”. In: SIAM Journal on Numerical Analysis 20.2 (1983), pp. 345–357 (cit. on p. 11).
[EJS03] A. El Guennouni, K. Jbilou and H. Sadok. “A block version of BiCGSTAB for linear systems with multiple right-hand sides”. In: Electronic Transactions on Numerical Analysis 16.129-142 (2003), p. 2 (cit. on pp. 16, 54).
[Ell19] Paul R. Eller. “Scalable non-blocking Krylov solvers for extreme-scale computing”. PhD thesis. University of Illinois at Urbana-Champaign, 2019 (cit. on p. 66).
[FY02] Robert D. Falgout and Ulrike Meier Yang. “hypre: A library of high performance preconditioners”. In: International Conference on Computational Science. Springer, 2002, pp. 632–641 (cit. on p. 4).
[Fog] Agner Fog. VCL C++ vector class library. url: https://www.agner.org/optimize/vectorclass.pdf (cit. on p. 24).
[FN91] Roland W. Freund and Noël M. Nachtigal. “QMR: a quasi-minimal residual method for non-Hermitian linear systems”. In: Numerische Mathematik 60.1 (1991), pp. 315–339 (cit. on p. 11).
[FLS19] Andreas Frommer, Kathryn Lund and Daniel B. Szyld. Block Krylov subspace methods for functions of matrices II: Modified block FOM. Tech. rep. MATHICSE, EPFL, 2019 (cit. on pp. 3, 15, 19, 101).
[FS00] Andreas Frommer and Daniel B. Szyld. “On asynchronous iterations”. In: Journal of Computational and Applied Mathematics 123.1-2 (2000), pp. 201–216 (cit. on p. 67).
[FSL17] Andreas Frommer, Daniel B. Szyld and Kathryn Lund. “Block Krylov subspace methods for functions of matrices”. In: Electron. Trans. Numer. Anal. 47 (2017), pp. 100–126. url: http://etna.math.kent.edu/vol.47.2017/pp100-126.dir/pp100-126.pdf (cit. on pp. 3, 4, 15, 19, 20, 101).
[Ghy+13] Pieter Ghysels, Thomas J. Ashby, Karl Meerbergen and Wim Vanroose. “Hiding global communication latency in the GMRES algorithm on massively parallel machines”. In: SIAM Journal on Scientific Computing 35.1 (2013), pp. C48–C71 (cit. on p. 85).

[GV14] Pieter Ghysels and Wim Vanroose. “Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm”. In: Parallel Computing 40.7 (2014), pp. 224–238. url: https://hdl.handle.net/10067/1096860151162165141 (cit. on pp. 81, 102).
[Gir+05] Luc Giraud, Julien Langou, Miroslav Rozložník and Jasper van den Eshof. “Rounding error analysis of the classical Gram-Schmidt orthogonalization process”. In: Numerische Mathematik 101.1 (2005), pp. 87–100 (cit. on p. 85).
[GO89] Gene H. Golub and Dianne P. O’Leary. “Some history of the conjugate gradient and Lanczos algorithms: 1948–1976”. In: SIAM Review 31.1 (1989), pp. 50–102 (cit. on p. 6).
[Gre97] Anne Greenbaum. Iterative methods for solving linear systems. Vol. 17. SIAM, 1997 (cit. on pp. 6, 8).
[GPS96] Anne Greenbaum, Vlastimil Pták and Zdeněk Strakoš. “Any nonincreasing convergence curve is possible for GMRES”. In: SIAM Journal on Matrix Analysis and Applications 17.3 (1996), pp. 465–469 (cit. on p. 49).
[GT17] Laura Grigori and Olivier Tissot. “Reducing the communication and computational costs of Enlarged Krylov subspaces Conjugate Gradient”. In: (2017). url: https://hal.inria.fr/hal-01451199 (cit. on pp. 4, 15, 67).
[Gro10] W. Gropp. Update on libraries for Blue Waters. 2010 (cit. on pp. 78, 102).
[Gut07] Martin H. Gutknecht. “Block Krylov space methods for linear systems with multiple right-hand sides. An introduction”. In: Modern mathematical models, methods and algorithms for real world systems. Ed. by Abul Hasan Siddiqi, Iain S. Duff and Ole Christensen. Tunbridge Wells: Anshan, 2007, pp. 420–447 (cit. on pp. 17, 46).
[GS09] Martin H. Gutknecht and Thomas Schmelzer. “The block grade of a block Krylov space”. In: Linear Algebra and its Applications 430.1 (2009), pp. 174–185 (cit. on p. 17).
[Hac94] Wolfgang Hackbusch. Iterative solution of large sparse systems of equations. Vol. 95. Springer, 1994 (cit. on p. 6).
[HS+52] Magnus R. Hestenes, Eduard Stiefel, et al. “Methods of conjugate gradients for solving linear systems”. In: Journal of Research of the National Bureau of Standards 49.6 (1952), pp. 409–436 (cit. on pp. 6, 7, 11).
[Hoe10] Mark Hoemmen. “Communication-avoiding Krylov subspace methods”. PhD thesis. UC Berkeley, 2010 (cit. on pp. 4, 66, 85).
[Hof89] Walter Hoffmann. “Iterative algorithms for Gram-Schmidt orthogonalization”. In: Computing 41.4 (1989), pp. 335–348 (cit. on p. 87).

[JMS99] Khalide Jbilou, Abderrahim Messaoudi and Hassane Sadok. “Global FOM and GMRES algorithms for matrix equations”. In: Applied Numerical Mathematics 31.1 (1999), pp. 49–63 (cit. on p. 19).
[Kre15] Matthias Kretz. “Extending C++ for explicit data-parallel programming via SIMD vector types”. PhD thesis. 2015, p. 256 (cit. on p. 24).
[KL12] Matthias Kretz and Volker Lindenstruth. “Vc: A C++ library for explicit vectorization”. In: Software: Practice and Experience 42.11 (2012), pp. 1409–1430. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.1149 (cit. on p. 24).
[KS20] Marie Kubínová and Kirk M. Soodhalter. “Admissible and attainable convergence behavior of block Arnoldi and GMRES”. In: SIAM Journal on Matrix Analysis and Applications 41.2 (2020), pp. 464–486 (cit. on pp. 46, 49, 52).
[Lan50] Cornelius Lanczos. “An iteration method for the solution of the eigenvalue problem of linear differential and integral operators”. In: Journal of Research of the National Bureau of Standards 45.4 (1950), pp. 255–282 (cit. on p. 6).
[Lan03] Julien Langou. “Iterative methods for solving linear systems with multiple right-hand sides”. PhD thesis. INSA Toulouse, 2003 (cit. on p. 38).
[Lan10] Julien Langou. “Computing the R of the QR factorization of tall and skinny matrices using MPI_Reduce”. In: arXiv preprint arXiv:1002.4250 (2010) (cit. on p. 73).
[Law+02] William Lawry, Christopher Wilson, Arthur B. Maccabe and Ron Brightwell. “COMB: A portable benchmark suite for assessing MPI overlap”. In: Proceedings. IEEE International Conference on Cluster Computing. IEEE, 2002, pp. 472–475 (cit. on pp. 40, 43, 68).
[Lun18] Kathryn Lund. “A new block Krylov subspace framework with applications to functions of matrices acting on multiple vectors”. PhD thesis. Temple University, 2018 (cit. on pp. 3, 15, 21).
[Mou14] Sophie Moufawad. “Enlarged Krylov Subspace Methods and Preconditioners for Avoiding Communication”. PhD thesis. Université Pierre et Marie Curie, 2014 (cit. on p. 15).
[NY95] A. A. Nikishin and A. Yu. Yeremin. “Variable block CG algorithms for solving large sparse symmetric positive definite linear systems on parallel computers. I. General iterative scheme”. In: SIAM J. Matrix Anal. Appl. 16.4 (1995), pp. 1135–1153 (cit. on p. 15).

[OLe80] Dianne P. O’Leary. “The block conjugate gradient algorithm and related methods”. In: Linear Algebra and its Applications (1980), pp. 293–322 (cit. on pp. 14, 16, 20, 31, 37, 38, 54, 55).
[PS75] Christopher C. Paige and Michael A. Saunders. “Solution of sparse indefinite systems of linear equations”. In: SIAM Journal on Numerical Analysis 12.4 (1975), pp. 617–629 (cit. on pp. 7, 11).
[Ruh79] Axel Ruhe. “Implementation aspects of band Lanczos algorithms for computation of eigenvalues of large sparse symmetric matrices”. In: Mathematics of Computation 33.146 (1979), pp. 680–687 (cit. on p. 46).
[SS86] Youcef Saad and Martin H. Schultz. “GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems”. In: SIAM J. Sci. Stat. Comput. 7.3 (1986), pp. 856–869 (cit. on pp. 7, 11, 46).
[Saa03] Yousef Saad. Iterative methods for sparse linear systems. Vol. 82. SIAM, 2003 (cit. on pp. 6, 8, 46).
[STI14] Shusaku Saito, Hiroto Tadano and Akira Imakura. “Development of the Block BiCGSTAB(ℓ) method for solving linear systems with multiple right hand sides”. In: JSIAM Letters 6 (2014), pp. 65–68 (cit. on pp. 59, 102).
[Sch97] Joachim Schöberl. “NETGEN: An advancing front 2D/3D-mesh generator based on abstract rules”. In: Computing and Visualization in Science 1.1 (1997), pp. 41–52 (cit. on p. 4).
[SG96] V. Simoncini and E. Gallopoulos. “Convergence properties of block GMRES and matrix polynomials”. In: Linear Algebra and its Applications 247 (1996), pp. 97–119 (cit. on pp. 49, 51, 52).
[Soo15] Kirk M. Soodhalter. “A block MINRES algorithm based on the band Lanczos method”. In: Numerical Algorithms 69.3 (2015), pp. 473–494 (cit. on pp. 19, 46).
[Spi16] Nicole Spillane. “An adaptive multipreconditioned conjugate gradient algorithm”. In: SIAM Journal on Scientific Computing 38.3 (2016), A1896–A1918 (cit. on p. 4).
[Spi+14] Nicole Spillane, Victorita Dolean, Patrice Hauret, Frédéric Nataf, Clemens Pechstein and Robert Scheichl. “Abstract robust coarse spaces for systems of PDEs via generalized eigenproblems in the overlaps”. In: Numerische Mathematik 126.4 (2014), pp. 741–770 (cit. on p. 11).
[SW02] Andreas Stathopoulos and Kesheng Wu. “A block orthogonalization procedure with constant synchronization requirements”. In: SIAM Journal on Scientific Computing 23.6 (2002), pp. 2165–2182 (cit. on p. 73).
[Ste08] G. W. Stewart. “Block Gram–Schmidt orthogonalization”. In: SIAM Journal on Scientific Computing 31.1 (2008), pp. 761–775 (cit. on p. 27).

[Sti55] Eduard Stiefel. “Relaxationsmethoden bester Strategie zur Lösung linearer Gleichungssysteme”. In: Commentarii Mathematici Helvetici 29.1 (1955), pp. 157–179 (cit. on p. 11).
[Tea] The Trilinos Project Team. The Trilinos Project Website (cit. on p. 4).
[Tis19] Olivier Tissot. “Iterative methods for solving linear systems on massively parallel architectures”. PhD thesis. Sorbonne Université, 2019 (cit. on p. 15).
[TOP500] Top 500 Supercomputing Sites. July 2020. url: https://top500.org (cit. on p. 2).
[TB97] Lloyd N. Trefethen and David Bau. Numerical linear algebra. Vol. 50. SIAM, 1997 (cit. on pp. 6, 8).
[THW10] J. Treibig, G. Hager and G. Wellein. “LIKWID: A lightweight performance-oriented tool suite for multicore environments”. In: Proceedings of PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures. San Diego, CA, 2010 (cit. on p. 27).
[Und75] Richard Underwood. An iterative block Lanczos method for the solution of large sparse symmetric eigenproblems. Tech. rep. 1975 (cit. on p. 14).
[Van92] Henk A. Van der Vorst. “Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems”. In: SIAM Journal on Scientific and Statistical Computing 13.2 (1992), pp. 631–644 (cit. on pp. 11, 54).
[Vit90] Brigitte Vital. “Etude de quelques methodes de resolution de problemes lineaires de grande taille sur multiprocesseur”. PhD thesis. Université de Rennes I, 1990 (cit. on p. 46).
[Wit+13] Markus Wittmann, Georg Hager, Thomas Zeiser and Gerhard Wellein. “Asynchronous MPI for the Masses”. In: (2013). arXiv: 1302.4280 [cs.DC] (cit. on p. 70).
[ZZ13] Jianhua Zhang and Jing Zhao. “A novel class of block methods based on the block AA^T-Lanczos bi-orthogonalization process for matrix equations”. In: International Journal of Computer Mathematics 90.2 (2013), pp. 341–359 (cit. on p. 46).