C Parallelogram Triangular Fill Bulge

UC Berkeley
UC Berkeley Electronic Theses and Dissertations

Title: Avoiding Communication in Dense Linear Algebra
Permalink: https://escholarship.org/uc/item/95n2b7vr
Author: Ballard, Grey
Publication Date: 2013
Peer reviewed | Thesis/dissertation

Avoiding Communication in Dense Linear Algebra
by Grey Malone Ballard

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science and the Designated Emphasis in Computational Science and Engineering in the Graduate Division of the University of California, Berkeley

Committee in charge:
Professor James Demmel, Chair
Professor Ming Gu
Professor Katherine Yelick

Fall 2013

Copyright 2013 by Grey Malone Ballard

Abstract

Dense linear algebra computations are essential to nearly every problem in scientific computing and to countless other fields. Most matrix computations enjoy a high computational intensity (i.e., ratio of computation to data), and therefore the algorithms for the computations have a potential for high efficiency. However, performance for many linear algebra algorithms is limited by the cost of moving data between processors on a parallel computer or throughout the memory hierarchy of a single processor, which we will refer to generally as communication. Technological trends indicate that algorithmic performance will become even more limited by communication in the future.
In this thesis, we consider the fundamental computations within dense linear algebra and address the following question: can we significantly improve the current algorithms for these computations, in terms of the communication they require and their performance in practice? To answer the question, we analyze algorithms on sequential and parallel architectural models that are simple enough to determine coarse communication costs but accurate enough to predict performance of implementations on real hardware. For most of the computations, we prove lower bounds on the communication that any algorithm must perform. If an algorithm exists with communication costs that match the lower bounds (at least in an asymptotic sense), we call the algorithm communication optimal. In many cases, the most commonly used algorithms are not communication optimal, and we can develop new algorithms that require less data movement and attain the communication lower bounds. In this thesis, we develop both new communication lower bounds and new algorithms, tightening (and in many cases closing) the gap between the best known lower bound and the best known algorithm (or upper bound). We consider both sequential and parallel algorithms, and we assess both classical and fast algorithms (e.g., Strassen's matrix multiplication algorithm).
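As an illustration of what "communication optimal" means in the two-level sequential model, the following is a minimal sketch and not taken from the thesis. It assumes a fast memory of M words, the well-known asymptotic bandwidth lower bound of n^3/sqrt(M) words for classical n-by-n matrix multiplication (constant factors omitted), and two deliberately crude word counts: an unblocked triple loop with no reuse, and a blocked algorithm that keeps three b-by-b blocks in fast memory. All function names are hypothetical.

```python
import math


def classical_matmul_lower_bound(n: int, fast_words: int) -> float:
    """Asymptotic bandwidth lower bound n^3 / sqrt(M); constant factor omitted."""
    return n**3 / math.sqrt(fast_words)


def naive_matmul_words(n: int) -> int:
    """Crude model (assumption): the unblocked triple loop gets no reuse,
    so every inner iteration reads one entry of A and one of B,
    and each entry of C is written once."""
    return 2 * n**3 + n**2


def blocked_matmul_words(n: int, fast_words: int) -> int:
    """Crude model (assumption): three b-by-b blocks fit in fast memory."""
    b = int(math.sqrt(fast_words / 3))
    blocks = math.ceil(n / b)
    # Each block of C is read and written once: 2*b*b words per block.
    c_traffic = blocks**2 * 2 * b * b
    # For each block of C, one block of A and one of B is read per step.
    ab_traffic = blocks**3 * 2 * b * b
    return c_traffic + ab_traffic


def ratios(n: int, fast_words: int) -> tuple[float, float]:
    """Return (naive / bound, blocked / bound) for the given problem size."""
    bound = classical_matmul_lower_bound(n, fast_words)
    return (naive_matmul_words(n) / bound,
            blocked_matmul_words(n, fast_words) / bound)


if __name__ == "__main__":
    M = 3 * 64 * 64  # fast memory holding exactly three 64-by-64 blocks
    for n in (4096, 8192):
        naive_ratio, blocked_ratio = ratios(n, M)
        print(f"n={n}: naive/bound={naive_ratio:.1f}, "
              f"blocked/bound={blocked_ratio:.2f}")
```

Under these assumptions the blocked count stays within a small constant factor of the bound as n grows, which is the asymptotic sense of optimality used above, while the unblocked count exceeds the bound by a factor proportional to sqrt(M).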
In particular, the central contributions of this thesis are
• proving new communication lower bounds for nearly all classical direct linear algebra computations (dense or sparse), including factorizations for solving linear systems, least squares problems, and eigenvalue and singular value problems,
• proving new communication lower bounds for Strassen's and other fast matrix multiplication algorithms,
• proving new parallel communication lower bounds for classical and fast computations that set limits on an algorithm's ability to perfectly strong scale,
• summarizing the state-of-the-art in communication efficiency for both sequential and parallel algorithms for the computations to which the lower bounds apply,
• developing a new communication-optimal algorithm for computing a symmetric-indefinite factorization (observing speedups of up to 2.8× compared to alternative shared-memory parallel algorithms),
• developing new, more communication-efficient algorithms for reducing a symmetric band matrix to tridiagonal form via orthogonal similarity transformations (observing speedups of 2–6× compared to alternative sequential and parallel algorithms), and
• developing a new communication-optimal parallelization of Strassen's matrix multiplication algorithm (observing speedups of up to 2.84× compared to alternative distributed-memory parallel algorithms).

Table of Contents

Table of Contents
List of Figures
List of Tables

1 Introduction
  1.1 The Role of Scientific Computing
  1.2 The Importance of Dense Linear Algebra
  1.3 The Rise of Parallelism and the Relative Costs of Communication
  1.4 Thesis Goals and Contributions
  1.5 Thesis Organization

2 Preliminaries
  2.1 Notation and Definitions
    2.1.1 Asymptotic Notation
    2.1.2 Algorithmic Terminology
    2.1.3 Communication Terminology
  2.2 Memory Models
    2.2.1 Two-Level Sequential Memory Model
    2.2.2 Distributed-Memory Parallel Model
  2.3 Data Layouts
    2.3.1 Matrix Layouts in Slow Memory
    2.3.2 Matrix Distributions on Parallel Machines
  2.4 Fast Matrix Multiplication Algorithms
    2.4.1 Strassen's Algorithm
    2.4.2 Strassen-Winograd Algorithm
  2.5 Lower Bound Lemmas
    2.5.1 Loomis-Whitney Inequality
    2.5.2 Expansion Preliminaries
    2.5.3 Latency Lower Bounds
  2.6 Numerical Stability Lemmas

I Communication Lower Bounds

3 Communication Lower Bounds via Reductions
  3.1 Classical Matrix Multiplication
  3.2 Reduction Arguments
    3.2.1 LU Decomposition
    3.2.2 Cholesky Decomposition
  3.3 Conclusions

4 Lower Bounds for Classical Linear Algebra
  4.1 Lower Bounds for Three-Nested-Loops Computation
    4.1.1 Lower Bound Argument
    4.1.2 Applications of the Lower Bound
  4.2 Lower Bounds for Three-Nested-Loop Computation with Temporary Operands
    4.2.1 Lower Bound Argument
    4.2.2 Applications of the Lower Bound
  4.3 Applying Orthogonal Transformations
    4.3.1 First Lower Bound Argument: Applying Theorem 4.10
    4.3.2 Second Lower Bound Argument: Bounding Z Values
    4.3.3 Generalizing to Eigenvalue and Singular Value Reductions
    4.3.4 Applicability of the Lower Bounds
  4.4 Attainability

5 Lower Bounds for Strassen's Matrix Multiplication
  5.1 Relating Edge Expansion to Communication
    5.1.1 Computation Graph
    5.1.2 Partition Argument
    5.1.3 Edge Expansion and Communication
  5.2 Expansion Properties of Strassen's Algorithm
    5.2.1 Computation Graph for n-by-n Matrices
  5.3 Communication Lower Bounds
  5.4 Conclusions

6 Extensions of the Lower Bounds
  6.1 Strassen-like Algorithms
    6.1.1 Connected Decoding Graph Assumption
    6.1.2 Communication Costs of Strassen-like Algorithms
    6.1.3 Fast Linear Algebra
    6.1.4 Fast Rectangular Matrix Multiplication Algorithms
  6.2 Memory-Independent Lower Bounds
    6.2.1 Communication Lower Bounds
    6.2.2 Limits of Strong Scaling
    6.2.3 Extensions of Memory-Independent Bounds
  6.3 Other Extensions
    6.3.1 k-Nested-Loops Computations
    6.3.2 Sparse Matrix-Matrix Multiplication

II Algorithms and Communication Cost Analysis

7 Sequential Algorithms and their Communication Costs
  7.1 Classical Linear Algebra
    7.1.1 BLAS Computations
    7.1.2 Cholesky Decomposition
    7.1.3 Symmetric-Indefinite Decompositions
    7.1.4 LU Decomposition
    7.1.5 QR Decomposition
    7.1.6 Symmetric Eigendecomposition and SVD
    7.1.7 Nonsymmetric Eigendecomposition
  7.2 Fast Linear Algebra
  7.3 Conclusions and Future Work

8 Parallel Algorithms and their Communication Costs
  8.1 Classical Linear Algebra (with Minimal Memory)
    8.1.1 BLAS Computations
    8.1.2 Cholesky Decomposition
    8.1.3 Symmetric-Indefinite Decompositions
    8.1.4 LU Decomposition
    8.1.5 QR Decomposition
    8.1.6 Symmetric Eigendecomposition and SVD
    8.1.7 Nonsymmetric Eigendecomposition
  8.2 Classical Linear Algebra (with Extra Memory)
    8.2.1 Matrix Multiplication
    8.2.2 Other Linear Algebra Computations
  8.3 Fast Linear Algebra
  8.4 Conclusions and Future Work

9 Communication-Avoiding Symmetric-Indefinite Factorization
  9.1 Block-Aasen Algorithm
    9.1.1 Correctness
    9.1.2 Solving Two-Sided Triangular Linear Systems
    9.1.3 Pivoting
    9.1.4 Computing W and H
    9.1.5 The Second Phase of the Algorithm: Factoring T
  9.2 Numerical Stability
    9.2.1 Stability of the Two-Sided Triangular Solver
    9.2.2 Stability of the Block-Aasen Algorithm
    9.2.3 Growth
  9.3 Sequential Complexity Analyses
    9.3.1 Computational Cost
    9.3.2 Communication Costs
  9.4 Numerical Experiments
  9.5 Conclusions

10 Communication-Avoiding Successive Band Reduction
  10.1 Preliminaries
    10.1.1 Eigendecomposition of Band Matrices
    10.1.2 SBR Notation
...
