Parallel Reduction from Block Hessenberg to Hessenberg Using MPI
Viktor Jonsson

May 24, 2013

Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Lars Karlsson
Examiner: Fredrik Georgsson

Umeå University
Department of Computing Science
SE-901 87 UMEÅ, SWEDEN

Abstract

In many scientific applications, eigenvalues of a matrix have to be computed. By first reducing a matrix from fully dense to Hessenberg form, eigenvalue computations with the QR algorithm become more efficient. Previous work on shared memory architectures has shown that the Hessenberg reduction is in some cases most efficient when performed in two stages: first reduce the matrix to block Hessenberg form, then reduce the block Hessenberg matrix to true Hessenberg form. This Thesis concerns the adaptation of an existing parallel reduction algorithm implementing the second stage to a distributed memory architecture. Two algorithms were designed using the Message Passing Interface (MPI) for communication between processes. The algorithms have been evaluated through an analysis of traces and run-times for different problem sizes and numbers of processes. Results show that the two adaptations are not efficient compared to a shared memory algorithm, but possibilities for further improvement have been identified. We found that an uneven distribution of work, a large sequential part, and significant communication overhead are the main bottlenecks in the distributed algorithms. Suggested further improvements are dynamic load balancing, sequential computation overlap, and hidden communication.

Contents

1 Introduction
  1.1 Memory Hierarchies
  1.2 Parallel Systems
  1.3 Linear Algebra
  1.4 Hessenberg Reduction Algorithms
  1.5 Problem statement
2 Methods
  2.1 Question 1
  2.2 Questions 2 and 3
  2.3 Question 4
3 Literature Review
  3.1 Householder reflections
  3.2 Givens Rotations
  3.3 Hessenberg reduction algorithm
  3.4 WY Representation
  3.5 Blocking and Data Distribution
  3.6 Block Reduction
  3.7 Parallel blocked algorithm
    3.7.1 QR Phase
    3.7.2 Givens Phase
  3.8 Successive Band Reduction
  3.9 Two-staged algorithm
    3.9.1 Stage 1
    3.9.2 Stage 2
  3.10 MPI Functions
  3.11 Storage
4 Results
  4.1 Root-Scatter algorithm
  4.2 DistBlocks algorithm
  4.3 Performance
5 Conclusions
  5.1 Future work
6 Acknowledgements
References

Chapter 1
Introduction

This chapter is an introduction to topics in matrix computations and parallel systems that are related to this Thesis.

1.1 Memory Hierarchies

The hierarchy of memory inside a computer has changed the way efficient algorithms are designed. Central processing unit (CPU) caches exploit temporal and spatial locality: data that has been used recently (temporal locality) and data that is close in memory to recently used data (spatial locality) have shorter access times. Figure 1.1 illustrates a common memory architecture for modern computers. Fast and small memory is located close to the CPU. A memory reference m is accessed from the slow and large RAM and loaded into the fast cache memory, along with the data located in the same cache line (typically 64–128 B) as m. If a subsequent memory access is to m or to data near m, it will be satisfied from the fast cache memory unless the corresponding cache line has been evicted. To avoid costly memory communication, many efficient algorithms are designed for data reuse.
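As an aside, the following minimal C sketch (written for illustration here, not code from the thesis; the matrix order N and the function names are arbitrary) shows how spatial locality affects even a simple computation: summing a row-major matrix row by row touches consecutive addresses and uses every element of each loaded cache line, while summing it column by column skips N elements between accesses and reuses almost nothing from each line.

#include <stdio.h>
#include <stdlib.h>

#define N 4096

/* Row-major traversal: consecutive iterations touch consecutive
 * addresses, so each loaded cache line is fully used (spatial locality). */
double sum_row_major(const double *a)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[(size_t)i * N + j];
    return s;
}

/* Column-major traversal of the same row-major array: consecutive
 * iterations are N * sizeof(double) bytes apart, so almost every access
 * loads a new cache line of which only one element is used. */
double sum_col_major(const double *a)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[(size_t)i * N + j];
    return s;
}

int main(void)
{
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (size_t k = 0; k < (size_t)N * N; k++)
        a[k] = 1.0;
    /* Both calls compute the same sum; the first is typically much faster. */
    printf("%f %f\n", sum_row_major(a), sum_col_major(a));
    free(a);
    return 0;
}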
By often reusing data that has been brought into the cache, an algorithm can amortize the high cost of the initial main memory communication.

Figure 1.1: Memory hierarchy of many uni-core modern computers. The CPU works on small and fast registers. These registers are loaded with data from RAM. Data is cached in the fast L1, L2, and L3 caches. If the data stays in cache, access time will be shorter the next time the data is accessed.

Basic Linear Algebra Subprograms (BLAS) is a standard interface for linear algebra operations. BLAS Level 1 and 2 operations (see Figures 1.2(a) and 1.2(b) for examples) feature little data reuse and are therefore bounded by the memory bandwidth. Data reuse and locality are very important in order to achieve efficiency on modern computer architectures. BLAS Level 3 operations involve many more arithmetic operations than data accesses. For example, a matrix-matrix multiplication (see Figure 1.2(c)) involves O(n^3) arithmetic operations and O(n^2) data, for matrices of order n. The amount of data reuse is high for Level 3 operations. Many modern linear algebra algorithms try to minimize the amount of Level 1 (see Figure 1.2(a)) and Level 2 operations and maximize the amount of Level 3 operations [3].

Figure 1.2: Types of BLAS operations, exemplified with matrix/vector multiplication. (a) BLAS level 1, vector-vector. (b) BLAS level 2, vector-matrix/matrix-vector. (c) BLAS level 3, matrix-matrix.

1.2 Parallel Systems

There are two main types of system architectures used for parallel computations. The first one is the shared memory architecture (SM). SM is a computer architecture where all processing nodes share the same memory space. Figure 1.3 illustrates a shared memory architecture with four processing nodes. All processing nodes can perform computations on the same memory, which requires each node to have access to that memory.

Figure 1.3: Shared memory architecture. Processing nodes P0 to P3 share the same memory space.

The second type of architecture is the distributed memory architecture (DM). DM is an architecture where the processing nodes do not share the same memory space. In a DM, the nodes work on local memory and communicate with other nodes through an interconnection network. Figure 1.4 illustrates a distributed memory architecture with four processing nodes, each with its own local memory.

Figure 1.4: Distributed memory architecture. Processing nodes P0 to P3 have separate memory spaces and interact through the interconnection network.

Designing an algorithm for distributed memory has the advantage that the algorithm can scale to larger problems than an analogous SM algorithm, because an SM algorithm is bound by the performance and capacity of its shared memory. A disadvantage of designing an algorithm for a DM is communication, which needs to be done explicitly through message passing. Communication in a DM is often costly, and algorithms designed for this type of architecture have to avoid excessive communication. DM also has the property of explicit data distribution. Explicit data distribution can be difficult to utilize in some problems, but once the data is distributed, the programmer does not have to deal with race conditions. Memory is scalable in a DM system: when more processes are added, the total memory size increases. Local memory can be accessed efficiently by the processes of a DM system, without interference from other processes.
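To give a concrete picture of explicit data distribution, the following C sketch maps the block columns of a matrix onto P processes in a one-dimensional block-cyclic fashion. The layout and the names owner_of_block and local_block_index are illustrative assumptions for this introduction only; they are not necessarily the distribution used by the algorithms described later (see Section 3.5, Blocking and Data Distribution).

#include <stdio.h>

/* One-dimensional block-cyclic distribution of block columns.
 * Block column jb (0-based) of a matrix partitioned into blocks of
 * width nb is owned by process jb mod P and is stored locally as
 * block column jb / P on that process. */
int owner_of_block(int jb, int P)    { return jb % P; }
int local_block_index(int jb, int P) { return jb / P; }

int main(void)
{
    int n = 1000, nb = 100, P = 4;          /* example sizes only */
    int nblocks = (n + nb - 1) / nb;        /* number of block columns */

    for (int jb = 0; jb < nblocks; jb++)
        printf("block column %d -> process %d (local block %d)\n",
               jb, owner_of_block(jb, P), local_block_index(jb, P));
    return 0;
}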
Another advantage of DM-based systems is economy: they can be built from low-cost computers connected through an inexpensive network.

MPI is a large set of operations that are based on message passing. Message passing is the de facto standard way of working with a DM: memory is not shared between processes, and interaction is done by explicitly passing messages.

1.3 Linear Algebra

The subjects of linear algebra and matrix computations contain many terms. The ones related to this Thesis are described in this section.

A square matrix is a matrix A ∈ R^{m×n} with the same number of rows as columns (m = n). The main diagonal of A consists of all elements a_{ij} where i = j, i.e., [a_{11}, a_{22}, ..., a_{nn}], and a symmetric matrix is a matrix A where a_{ij} = a_{ji} for every element a_{ij}. One type of square matrix is the triangular matrix, which has all elements below (upper triangular) or above (lower triangular) the main diagonal equal to zero. An (upper/lower) Hessenberg matrix is an (upper/lower) triangular matrix with an extra subdiagonal (upper Hessenberg) or superdiagonal (lower Hessenberg). An eigenvalue λ of A is defined by Ax = λx, where x is a non-zero eigenvector. The two variants of a Hessenberg matrix are illustrated in Figure 1.5. By reducing full matrices to Hessenberg matrices, some matrix computations require less computational effort. The most important scientific application of Hessenberg reductions is the QR algorithm for finding the eigenvalues of a non-symmetric matrix.

    (a) Upper Hessenberg:        (b) Lower Hessenberg:
    [ 2 2 7 0 ]                  [ 2 7 0 0 ]
    [ 2 8 3 2 ]                  [ 2 8 4 0 ]
    [ 0 7 5 6 ]                  [ 1 8 5 1 ]
    [ 0 0 5 9 ]                  [ 7 2 7 9 ]

Figure 1.5: Examples of Hessenberg matrices.

Hessenberg reduction is the process of transforming a full matrix to Hessenberg form by means of an (orthogonal) similarity transformation

    H = Q^T A Q,

where Q is an orthogonal matrix. Orthogonal matrices are invertible, with their inverse being the transpose (Q^{-1} = Q^T for an orthogonal matrix Q). A similarity transformation A ↦ P^{-1}AP is a transformation where a square matrix A is multiplied from the left by P^{-1} and from the right by P, for an invertible matrix P. A similarity transformation preserves the eigenvalues of A [8, p. 312]. Let B = P^{-1}AP be a similarity transformation of A. Then

    B = P^{-1}AP  ⇔  PB = AP  ⇔  PBP^{-1} = A,

which can be substituted into the definition of eigenvalues:

    Ax = λx
      ⇔ PBP^{-1}x = λx             (substitute A)
      ⇔ BP^{-1}x = P^{-1}λx        (multiply from the left by P^{-1})
      ⇔ B(P^{-1}x) = λ(P^{-1}x)    (factor out the vector P^{-1}x).

This shows that if λ is an eigenvalue of A corresponding to an eigenvector x, then λ is also an eigenvalue of B corresponding to the eigenvector P^{-1}x.
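As a small worked example (constructed for this summary, not taken from the thesis), let

    A = [ 2 1 ]    P = [ 1 1 ]    P^{-1} = [ 1 -1 ]
        [ 0 3 ],       [ 0 1 ],            [ 0  1 ].

A is upper triangular, so its eigenvalues are its diagonal entries 2 and 3. Computing the similarity transformation gives

    AP = [ 2 3 ]    and    B = P^{-1}AP = [ 2 0 ]
         [ 0 3 ],                          [ 0 3 ],

so B has the same eigenvalues, 2 and 3. Moreover, the eigenvector x = (1, 1)^T of A for λ = 3 is mapped to P^{-1}x = (0, 1)^T, which is an eigenvector of B for λ = 3, exactly as the derivation above predicts.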