Parallel Reduction from Block Hessenberg to Hessenberg using MPI

Viktor Jonsson

May 24, 2013
Master's Thesis in Computing Science, 30 credits
Supervisor at CS-UmU: Lars Karlsson
Examiner: Fredrik Georgsson

Umeå University, Department of Computing Science, SE-901 87 Umeå, Sweden

Abstract

In many scientific applications, the eigenvalues of a matrix have to be computed. By first reducing a matrix from fully dense to Hessenberg form, eigenvalue computations with the QR algorithm become more efficient. Previous work on shared memory architectures has shown that the Hessenberg reduction is in some cases most efficient when performed in two stages: first reduce the matrix to block Hessenberg form, then reduce the block Hessenberg matrix to true Hessenberg form. This Thesis concerns the adaptation of an existing parallel reduction algorithm implementing the second stage to a distributed memory architecture. Two algorithms were designed using the Message Passing Interface (MPI) for communication between processes. The algorithms have been evaluated by analyzing traces and run-times for different problem sizes and numbers of processes. Results show that the two adaptations are not efficient compared to a shared memory algorithm, but possibilities for further improvement have been identified. We found that an uneven distribution of work, a large sequential part, and significant communication overhead are the main bottlenecks in the distributed algorithms. Suggested further improvements are dynamic load balancing, sequential computation overlap, and hidden communication.

Contents

1 Introduction
  1.1 Memory Hierarchies
  1.2 Parallel Systems
  1.3 Linear Algebra
  1.4 Hessenberg Reduction Algorithms
  1.5 Problem statement

2 Methods
  2.1 Question 1
  2.2 Questions 2 and 3
  2.3 Question 4

3 Literature Review
  3.1 Householder reflections
  3.2 Givens Rotations
  3.3 Hessenberg reduction algorithm
  3.4 WY Representation
  3.5 Blocking and Data Distribution
  3.6 Block Reduction
  3.7 Parallel blocked algorithm
    3.7.1 QR Phase
    3.7.2 Givens Phase
  3.8 Successive Band Reduction
  3.9 Two-staged algorithm
    3.9.1 Stage 1
    3.9.2 Stage 2
  3.10 MPI Functions
  3.11 Storage

4 Results
  4.1 Root-Scatter algorithm
  4.2 DistBlocks algorithm
  4.3 Performance

5 Conclusions
  5.1 Future work

6 Acknowledgements

References

Chapter 1

Introduction

This chapter is an introduction to topics in matrix computations and parallel systems that are related to this Thesis.

1.1 Memory Hierarchies

The hierarchy of memory inside a computer has changed the way efficient algorithms are designed. Central processing unit (CPU) caches exploit temporal and spatial locality. Data that have been used recently (temporal locality) and data that are close, in memory, to recently used data (spatial locality) have shorter access times. Figure 1.1 illustrates a common memory architecture for modern computers. Fast and small memory is located close to the CPU. Memory reference m is accessed from the slow and large RAM and loaded into fast cache memory, along with data located in the same cache line (typically 64–128 B) as m. If a subsequent memory access is to m or to data near m, it will be satisfied from the fast cache memory unless the corresponding cache line has been evicted. To avoid costly memory communication, many efficient algorithms are designed for data reuse. By frequently reusing data that has been brought into the cache, an algorithm can amortize the high cost of the initial main memory access.

Figure 1.1: Memory hierarchy of many uni-core modern computers. The CPU works on small and fast registers. These registers are loaded with data from RAM. Data is cached in the fast L1, L2, and L3 caches. If the data stays in cache, access time will be shorter the next time data is accessed.

Basic Linear Algebra Subprograms (BLAS) is a standard interface for linear algebra operations. BLAS Level 1 and 2 operations (see Figures 1.2(a) and 1.2(b) for examples) feature little data reuse and are therefore bounded by the memory bandwidth. Data reuse and

locality are very important in order to achieve efficiency on modern computer architectures. BLAS Level 3 operations involve many more arithmetic operations than data accesses. For example, a matrix-matrix multiplication (see Figure 1.2(c)) involves O(n^3) arithmetic operations and O(n^2) data, for matrices of order n. The amount of data reuse is high for Level 3 operations. Many modern linear algebra algorithms try to minimize the amount of Level 1 (see Figure 1.2(a)) and Level 2 operations and maximize the amount of Level 3 operations [3].

(a) BLAS Level 1, vector-vector. (b) BLAS Level 2, vector-matrix/matrix-vector. (c) BLAS Level 3, matrix-matrix.

Figure 1.2: Types of BLAS operations, exemplified with matrix/vector multiplication.
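The flop-to-data ratios behind this classification can be made concrete with a small sketch (Python with NumPy is assumed here purely for illustration; the counts are the usual ones for the three operation types):

import numpy as np

n = 1000
x, y = np.ones(n), np.ones(n)
A, B = np.ones((n, n)), np.ones((n, n))

_ = 2.0 * x + y      # Level 1: ~2n   flops on ~2n   values -> O(1) reuse
_ = A @ x            # Level 2: ~2n^2 flops on ~n^2  values -> O(1) reuse
_ = A @ B            # Level 3: ~2n^3 flops on ~3n^2 values -> O(n) reuse

for name, flops, data in [("Level 1 (axpy)", 2 * n, 2 * n),
                          ("Level 2 (gemv)", 2 * n**2, n**2 + 2 * n),
                          ("Level 3 (gemm)", 2 * n**3, 3 * n**2)]:
    print(f"{name}: {flops / data:8.1f} flops per value touched")

For n = 1000 the Level 3 operation performs hundreds of flops per value touched, which is exactly what allows a cache hierarchy to be exploited.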

1.2 Parallel Systems

There are two main types of system architectures used for parallel computations. The first one is the shared memory architecture (SM). SM is a computer architecture where all processing nodes share the same memory space. Figure 1.3 illustrates a shared memory architecture with four processing nodes. All processing nodes can perform computations on the same memory, which requires each node to have access to that memory.

Figure 1.3: Shared memory architecture. Processing nodes P0 to P3 share the same memory space.

The second type of architecture is the distributed memory architecture (DM). DM is an architecture where the processing nodes do not share the same memory space. In a DM, the nodes work on local memory and communicate with other nodes through an interconnection network. Figure 1.4 illustrates a distributed memory architecture with four processing nodes, each with its own local memory. Designing an algorithm for distributed memory has the advantage that the algorithm can scale to larger problems than an analogous SM algorithm, because an SM is bound by the performance and capacity of its shared memory. A disadvantage with designing an algorithm for a DM is communication, which needs to be done explicitly through message passing. Communication in a DM is often costly and algorithms designed for this type of architecture have to avoid excessive communication. DM also has the property of explicit

Figure 1.4: Distributed memory architecture. Processing nodes P0 to P3 have separate memory space and they interact through the interconnection network.

data distribution. Explicit data distribution can be difficult to exploit in some problems, but when the data is distributed, the programmer does not have to deal with race conditions. Memory is scalable in a DM system: when more processes are added, the total memory size increases. Local memory can be accessed efficiently by the processes of the DM system, without interference from other processes. Another advantage of systems based on DM is economy: they can be built using low-cost computers connected through an inexpensive network. MPI (the Message Passing Interface) is a large set of operations that are based on message passing. Message passing is the de facto standard way of working with a DM. Memory is not shared between processes and interaction is done by explicitly passing messages.
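As a minimal illustration of explicit message passing (a sketch assuming the mpi4py Python bindings to MPI; the thesis implementations themselves are not written this way), two processes exchange an array only by sending and receiving messages:

# Run with: mpiexec -n 2 python send_recv.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(4, dtype='d')
    comm.Send(data, dest=1, tag=0)        # explicit message: no shared address space
elif rank == 1:
    buf = np.empty(4, dtype='d')
    comm.Recv(buf, source=0, tag=0)       # the data exists here only once the message arrives
    print("process 1 received", buf)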

1.3 Linear Algebra

The subjects linear algebra and matrix computations contain many terms. The ones related to this Thesis are described in this section. A square matrix is a matrix A ∈ R^{m×n} with the same number of rows as columns (n = m). The main diagonal of A is all elements a_ij where i = j ([a_11, a_22, ..., a_nn]), and a symmetric matrix is a matrix A where a_ij = a_ji for every element a_ij. One type of square matrix is the triangular matrix, which has all elements below (upper triangular) or above (lower triangular) the main diagonal equal to zero. An (upper/lower) Hessenberg matrix is an (upper/lower) triangular matrix with an extra subdiagonal (upper Hessenberg) or superdiagonal (lower Hessenberg). An eigenvalue λ is defined by Ax = λx, where x is a non-zero eigenvector. The two variants of a Hessenberg matrix are illustrated in Figure 1.5. By reducing full matrices to Hessenberg matrices, some matrix computations require less computational effort. The most important scientific application of Hessenberg reductions is the QR algorithm for finding eigenvalues of a non-symmetric matrix.

 2 2 7 0   2 7 0 0   2 8 3 2   2 8 4 0       0 7 5 6   1 8 5 1  0 0 5 9 7 2 7 9 (a) Upper Hessen- (b) Lower Hessen- berg. berg.

Figure 1.5: Examples of Hessenberg matrices.

Hessenberg reduction is the process of transforming a full matrix to Hessenberg form by means of an (orthogonal) similarity transformation

H = Q^T A Q, where Q is an orthogonal matrix. Orthogonal matrices are invertible, with their inverse being the transpose (Q^{-1} = Q^T for an orthogonal matrix Q). A similarity transformation A ↦ P^{-1} A P is a transformation where a square matrix A is multiplied from the left with P^{-1} and from the right with P, for an invertible matrix P. A similarity transformation preserves the eigenvalues of A [8, (p.312)]. Let B = P^{-1} A P be a similarity transformation of A. Then

B = P^{-1} A P  ⇔  P B = A P  ⇔  P B P^{-1} = A,

which can be substituted into the definition of eigenvalues:

Ax = λx  ⇔  P B P^{-1} x = λx          (substitute A)
         ⇔  B P^{-1} x = P^{-1} λx      (multiply with P^{-1})
         ⇔  B (P^{-1} x) = λ (P^{-1} x)  (factor out the vector P^{-1} x).

This shows that if λ is an eigenvalue of A corresponding to an eigenvector x, then λ is also an eigenvalue of B corresponding to the eigenvector P^{-1} x.
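This can also be checked numerically. The sketch below (assuming SciPy is available; scipy.linalg.hessenberg computes exactly the similarity reduction H = Q^T A Q discussed here) verifies that the eigenvalues are unchanged:

import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))

H, Q = hessenberg(A, calc_q=True)          # H = Q^T A Q, upper Hessenberg, Q orthogonal

assert np.allclose(Q.T @ A @ Q, H)
assert np.allclose(np.tril(H, -2), 0.0)    # zeros below the first subdiagonal

# A similarity transformation preserves the eigenvalues (sorted for comparison)
eig_A = np.sort_complex(np.linalg.eigvals(A))
eig_H = np.sort_complex(np.linalg.eigvals(H))
assert np.allclose(eig_A, eig_H)
print("eigenvalues preserved")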

1.4 Hessenberg Reduction Algorithms

There are several known algorithms for reducing a full matrix to Hessenberg form. The following section is a brief introduction required to state the questions in Section 1.5. The details of the following algorithm will be presented in Chapter 3. The unblocked Hessenberg reduction algorithm for a matrix A is described in Algorithm 1. The underlying idea is to reduce the matrix one column at a time from left to right using Householder reflections. For each reduced column, a trailing submatrix is fully updated before the algorithm proceeds with the next column. Figure 1.6 illustrates which elements of an 8 × 8 matrix are modified by the second iteration of Algorithm 1. The figure shows how the algorithm reduces column c = 2 (Figure 1.6(a)) and applies reflection P_c from the left (Figure 1.6(b)) and from the right (Figure 1.6(c)). The unblocked Hessenberg reduction algorithm requires 10n^3/3 floating point operations (flops) [8, p. 345] for a matrix of size n × n. Flops are used as a measure of work and flops/s is a measure of performance for computers and programs, where Gflops/s (10^9 flops/s) is most common today.

Algorithm 1 Unblocked Hessenberg reduction for input matrix A of order n. When terminated, the algorithm has overwritten A with the Hessenberg form of A. Q = P_1 P_2 ... P_{n-2} is updated in each iteration by a multiplication with P_c from the right.

Q = I_{n×n}
for c = 1, 2, ..., n − 2 do
  Generate Householder reflection P_c that reduces column c of A
  Q(:, c + 1 : n) = Q(:, c + 1 : n) P_c
  A(c + 1 : n, c : n) = P_c^T A(c + 1 : n, c : n)   // Left update
  A(1 : n, c + 1 : n) = A(1 : n, c + 1 : n) P_c     // Right update
end for

(a) Step 1: Generate Householder reflection. (b) Step 2: Apply reflection from the left. (c) Step 3: Apply reflection from the right.

Figure 1.6: Unblocked Hessenberg reduction. Gray squares are the elements used for the reduction. This example shows the second iteration of an 8 × 8 matrix reduction.

The problem with the basic Algorithm 1 is that in each iteration it executes only a few operations on a large amount of data, in the form of BLAS Level 2 operations. Few operations on a large set of data lead to low data reuse, which is not suitable for the memory hierarchy of modern computers. Reducing a full matrix to block Hessenberg form has proven to be more efficient than the direct reduction of Algorithm 1. Block (upper/lower) Hessenberg form is a matrix with more than one sub/superdiagonal. A block is a contiguous submatrix, and in block (upper/lower) Hessenberg form each sub/superdiagonal block is on upper/lower triangular form. The block Hessenberg form is illustrated in the middle of Figure 1.8. An improved adaptation of a known two-staged Hessenberg reduction algorithm has been proposed by researchers at Umeå University [11]. The outline of the two-staged algorithm is:

1. Reduce a full matrix A to block Hessenberg form.

2. Reduce the block Hessenberg matrix to true Hessenberg form.

The first step of the two-staged algorithm works on blocks, where each block is reduced to upper trapezoidal form. A trapezoidal matrix (see Figure 1.7 for an illustration) is a rectangular matrix with zeros below or above a diagonal (there can be several diagonals in a rectangular matrix). The second stage uses band reduction methods for reducing the matrix to true Hessenberg form. Band reduction is the process of reducing the number of sub- or superdiagonals. In Figure 1.8 the full 6 × 6 matrix is first reduced to block upper Hessenberg form. The block upper Hessenberg matrix is further reduced to true Hessenberg form by reducing the number of subdiagonals.

Figure 1.7: Example of a trapezoidal matrix. The gray squares represent zero elements. Zero elements do not have to be explicitly stored in memory.

[ x x x x x x ]     [ x x x x x x ]     [ x x x x x x ]
[ x x x x x x ]     [ x x x x x x ]     [ x x x x x x ]
[ x x x x x x ]     [ x x x x x x ]     [ 0 x x x x x ]
[ x x x x x x ]  →  [ 0 x x x x x ]  →  [ 0 0 x x x x ]
[ x x x x x x ]     [ 0 0 x x x x ]     [ 0 0 0 x x x ]
[ x x x x x x ]     [ 0 0 0 x x x ]     [ 0 0 0 0 x x ]

Figure 1.8: Outline of the two-staged Hessenberg reduction algorithm.

The two-staged method requires 10n^3/3 flops for the first stage and 2n^3 flops for the second stage [11], assuming the number of subdiagonals in the block Hessenberg form is small relative to n and constant. This is 2n^3 more flops than the unblocked algorithm, which is significantly more work (+60%). In return, the two-staged algorithm has better data reuse than the unblocked algorithm and, as a consequence, the two-staged algorithm has the potential to be faster on a modern computer. In some environments, the two-staged algorithm is proven to be faster. This has been shown in [11], where the two algorithms were compared on a shared memory architecture.
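The 60% figure is simply the ratio of the extra work to the flop count of the unblocked algorithm:

2n^3 / (10n^3/3) = 3/5 = 60%.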

1.5 Problem statement

The main goal of this project is to design and implement an algorithm with better scalability than previous designs. Whether or not the main goal is reached, the project could generate knowledge about limitations and difficulties with this type of problem. There are several questions that should be answered in this project. The questions are:

1. Is it possible to adapt or redesign the second stage of the two-stage Hessenberg reduction algorithm for a DM while preserving efficiency?

2. How does the DM implementation compare with the existing SM implementation?

3. How scalable is the DM implementation?

4. What are the main factors that ultimately limit the performance and scalability of the DM implementation?

The work presented in this report has been done in collaboration with Umeå University, research group Parallel and High Performance Computing. The rest of the report is structured as follows. Chapter 2 describes the planning of the work and how it was accomplished. Chapter 3 contains a literature review that covers previous research related to this subject. The result of the work is then presented in Chapter 4, followed by conclusions in Chapter 5.

Chapter 2

Methods

In order to answer the questions raised in Section 1.5, one or several algorithms have to be designed and evaluated. Before design and implementation, previous designs have been studied in a literature review. The literature review spans several weeks, so that enough knowledge can be collected to design an algorithm. Notes from the literature review are used for the literature review chapter. When sufficient knowledge has been collected, algorithms can be designed and implemented. The implementations are developed systematically through careful design decisions and tests, in order to see which solution candidates are useful and which are not. By making intermediate performance tests and comparing the numerical errors with a stable algorithm, bad solutions can be discarded. Through a series of tests on a large parallel system, the resulting solution candidates are evaluated. Tests generate test data that is visualized and analyzed. The results from all these steps are documented in this report.

2.1 Question 1

Algorithms have to be designed and implemented for a DM in order to evaluate a solution. Question 1 will be answered by evaluating the performance results. Performance is measured by running tests on a system and comparing Gflop/s with the theoretical peak performance of the system. A successful implementation can reach a large fraction of the theoretical peak performance. The results from the SM adaptation [11] will be used as a reference.

2.2 Questions 2 and 3

Questions 2 and 3 will be answered by a series of carefully designed experiments. The experiments will contain tests that are similar to those made for the SM adaptation [12] in order to compare the implementations. Scalability will be evaluated with the speedup measure for parallel algorithms.

2.3 Question 4

Question 4 will be answered by an introspective analysis of the implementation in order to identify bottlenecks. Time for the different operations in the iterations will be analyzed in order to explain the behavior of the implementations.

Chapter 3

Literature Review

In order to solve the problems stated in Chapter 1, several subjects need to be studied. The chosen subjects for the literature review are listed in Table 3.1, with corresponding motivation on why the subject is relevant.

Subject: Motivation

Householder reflections: It is the cornerstone of many reduction algorithms. It is required to know the Householder reflection's properties in order to use them in an algorithm.

Hessenberg reduction algorithm: A basic reduction to Hessenberg form. Many Hessenberg reduction algorithms are based on this one.

WY representation: Efficient application of aggregated Householder reflections.

Block reduction: Efficient way of reducing to block Hessenberg form.

Parallel blocked algorithm: DM algorithm for reduction to block Hessenberg form. Important because it deals with distributed memory and the problems that it includes.

Successive band reduction: Similar to the second stage of the two-staged algorithm.

Two-staged algorithm: The main subject of this Thesis.

Table 3.1: Subjects for literature review.

3.1 Householder reflections

In matrix computations, the factorizations of a matrix A listed in Table 3.2 all include orthogonal transformations applied to a non-symmetric (QR, Hessenberg, bidiagonal) or symmetric (tridiagonal) matrix. The transformations zero elements after position i in a vector x. QR, Hessenberg, tridiagonal, and bidiagonal are important factorizations that are used in eigenvalue and singular value problems. A method [10] of reducing the columns of a full matrix one by one to Hessenberg form is to apply Householder reflections

P = I − 2uu^T,


QR, Hessenberg, Tridiagonal, Bidiagonal

Table 3.2: Different types of factorizations (QR, Hessenberg, tridiagonal, bidiagonal), with example illustrations on a 6 × 6 matrix A. Squares represent non-zero values.

where ||u||_2 = 1. Given a vector x, a Householder reflection P, defined by a vector u, can be generated such that all but one of the elements in the vector P^T x are zero. Figure 3.1 illustrates the procedure of applying a Householder reflection. The reflection transforms a vector x ∈ R^m by reflecting it onto a subspace R^i ⊂ R^m. R^i is spanned by the basis vector e_i = [0, 0, ..., 0, 1, 0, ..., 0, 0]^T, where i is the position of the value one (1). R^i is chosen to be the subspace that spans only the first dimension of the vector x. When x is reflected onto R^i, component x_i is the only element of x that “survives” the reflection; every other element of x is annihilated. The unit vector u is chosen to be orthogonal to the reflection plane M.

Example 3.1.1. If the last two elements of x = [1, 4, 3]^T (e_1 = [1, 0, 0]^T) are to be annihilated, then u is chosen as:

u = y / ||y||_2, where y = x − αe_1 and α = ±||x||. u constructs the reflection matrix P such that:

T  0.19 ... 0.78 ... 0.58 ...   1   5.09 ...  T T P x = (I − 2uu )x =  0.78 ... 0.23 ... −0.57 ...   4  =  0  . 0.58 ... −0.57 ... 0.56 ... 3 0

Two reasons why Householder reflections are used for the factorizations in Table 3.2 are:

1. Householder reflections are numerically stable, because the reflections are orthogonal transformations. The transformations do not change the norm of the vector x. A numerical error in x does not grow when the reflections are applied.

2. A Householder reflection can introduce multiple zeros at once. This is opposed to Givens rotations where each element of x below i gets zeroed one at a time.

Figure 3.1: By applying a reflection constructed from the vector u that reflects the vector x onto the subspace R^i (e is a basis vector in R^i), all components of x except the one along e are set to zero.

3.2 Givens Rotations

Givens rotations are another type of orthogonal transformation used in reductions. A 2 × 2 rotation has the form

G = [  cos(θ)  sin(θ) ]
    [ −sin(θ)  cos(θ) ].

In Givens rotations, the rotation angle θ is chosen such that an element in a vector can be annihilated. The rotations can only annihilate one element at a time and require more flops than Householder reflections, but Givens rotations are better for selectively annihilating elements. The updates required by a rotation are applied to only a few rows or columns. They are therefore preferable for selective annihilation in some applications that require low data dependency [8, (p.216)].

Example 3.2.1. If x = [1, √3]^T, then the angle is chosen as θ = 60°, so that

G = [  1/2    √3/2 ]
    [ −√3/2   1/2  ],

which transforms the vector x into Gx = [2, 0]^T.
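A corresponding sketch (NumPy assumed) computes c and s directly from the vector and reproduces the example above:

import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0], r = sqrt(a^2 + b^2)."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

x = np.array([1.0, np.sqrt(3.0)])
c, s = givens(x[0], x[1])                 # c = cos(60 deg), s = sin(60 deg)
G = np.array([[c, s], [-s, c]])
print(G @ x)                              # ~ [2, 0]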

3.3 Hessenberg reduction algorithm

Table 3.2 shows the Hessenberg reduction, which is a series of similarity transformations applied to a matrix in order to create a Hessenberg matrix. Since the reduction is a similarity transformation, the eigenvalues cannot change during the process. Algorithm 1 does this transformation by applying one Householder reflection at a time. The transformations reduce one column of the initial full matrix at a time, from left to right. Figure 3.2 shows an example of a full 4 × 4 matrix A. After two similarity transformations (four matrix multiplications) the matrix is reduced to upper Hessenberg form. Similarity transformations (described in Section 1.3) do not preserve the eigenvectors. While in many applications only the eigenvalues are required, eigenvectors are necessary in some. To recover the eigenvectors, the transformations Q = P_1 P_2 ... P_i ... P_{n−2} made on A have to be stored. The transformations can be stored in the lower triangle of A, as described in [13]. After the transformation, most of the lower triangular part of A is zero

(a) Generate reflection vector u1 that reduces A(1 : n, 1). Apply reflection from the left.

(b) Apply reflection from the right.

(c) Generate reflection vector u2 that reduces A(2 : n, 2). Apply reflection from the left.

(d) Apply reflection from the right.

Figure 3.2: Example of the basic Hessenberg reduction algorithm. Gray squares represent the elements in A that are not used in the multiplication.

(the subdiagonal is not zero). The zero values can be considered implicit after the reduction, and instead of the implicit zero values, a normalized version v_i of u_i is stored under the subdiagonal (see Figure 3.3). Vector v_i is a normalization of u_i such that v_i(1) = 1. Because v_i(1) = 1 for all i, the first element of v can implicitly be considered as 1 and does not have to be stored. The normalization changes the norm of u_i and therefore requires extra storage for a scaling factor τ_i (one real value per i). v_i(2 : n) and τ_i can reconstruct P_i as

P_i = I − τ_i [ 1; v_i(2 : n) ] [ 1, v_i(2 : n)^T ].

By storing the reflection vectors “in-place”, no extra memory except for τ_1, τ_2, ..., τ_{n−2} is required to be able to recover the eigenvectors.
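For concreteness, a dense NumPy sketch of the column-by-column reduction described in this section (Q is accumulated explicitly here instead of being stored in-place under the subdiagonal):

import numpy as np

def hessenberg_reduce(A):
    """Unblocked Hessenberg reduction: returns (H, Q) with H = Q^T A Q upper Hessenberg."""
    H = np.array(A, dtype=float)
    n = H.shape[0]
    Q = np.eye(n)
    for c in range(n - 2):
        x = H[c + 1:, c]
        nrm = np.linalg.norm(x)
        if nrm == 0.0:
            continue                              # column already reduced
        alpha = -np.copysign(nrm, x[0])           # sign choice avoids cancellation
        y = x.copy(); y[0] -= alpha               # y = x - alpha * e1
        u = y / np.linalg.norm(y)
        P = np.eye(n - c - 1) - 2.0 * np.outer(u, u)
        H[c + 1:, c:] = P @ H[c + 1:, c:]         # left update
        H[:, c + 1:] = H[:, c + 1:] @ P           # right update
        Q[:, c + 1:] = Q[:, c + 1:] @ P           # accumulate Q = P_1 P_2 ... P_{n-2}
    return H, Q

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H, Q = hessenberg_reduce(A)
assert np.allclose(Q.T @ A @ Q, H)
assert np.allclose(np.tril(H, -2), 0.0)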

3.4 WY Representation

The WY representation of a product of Householder reflections enables the efficient application of several reflections at once. Let Q_k = P_1 P_2 ··· P_k, where Q_k ∈ R^{m×m}, be a product of k reflections. Then Q_k can be rewritten as

Q_k = I + W_k Y_k^T,   W_k ∈ R^{m×k},   Y_k ∈ R^{m×k}.

Figure 3.3: Storage of transformations in a 6 × 6 Hessenberg matrix.

The matrices W_k and Y_k are defined by the following recurrence. For k = 1, the transformation matrix is

Q_1 = P_1 = I + w_1 u_1^T = I + W Y^T,

where w_1 = −2u_1. Here, W is chosen as w_1 and Y as u_1. For k > 1, W_k and Y_k are calculated by multiplying Q_{k−1} with the k-th Householder reflection P_k = I + w_k u_k^T, where u_k is the k-th Householder reflection vector and w_k = −2u_k. The product Q_{k−1} P_k can be rewritten as

Q_k = Q_{k−1} P_k = (I + W_{k−1} Y_{k−1}^T)(I + w_k u_k^T) = I + [W_{k−1}, Q_{k−1} w_k] [Y_{k−1}, u_k]^T,

where the updated matrices W_k = [W_{k−1}, Q_{k−1} w_k] and Y_k = [Y_{k−1}, u_k] are W_{k−1} and Y_{k−1} appended with the vectors Q_{k−1} w_k and u_k, respectively. Figure 3.4 shows the form of the W and Y matrices.

Figure 3.4: Shape of the W and Y matrices when k = 2 and m = 6.
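A small numerical check of this recurrence (NumPy assumed): the W and Y built column by column reproduce the explicit product of reflections.

import numpy as np

def wy_accumulate(us):
    """Accumulate reflections P_k = I - 2 u_k u_k^T into Q_k = I + W Y^T."""
    m = us[0].shape[0]
    W = np.zeros((m, 0)); Y = np.zeros((m, 0))
    for u in us:
        u = u / np.linalg.norm(u)
        w = -2.0 * u
        qw = w + W @ (Y.T @ w)                    # Q_{k-1} w_k, using Q_{k-1} = I + W Y^T
        W = np.column_stack([W, qw])
        Y = np.column_stack([Y, u])
    return W, Y

rng = np.random.default_rng(1)
us = [rng.standard_normal(6) for _ in range(3)]
W, Y = wy_accumulate(us)

Q = np.eye(6)
for u in us:
    u = u / np.linalg.norm(u)
    Q = Q @ (np.eye(6) - 2.0 * np.outer(u, u))    # explicit product P_1 P_2 P_3
assert np.allclose(np.eye(6) + W @ Y.T, Q)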

Aggregated reflections in the form of a WY representation involve a larger proportion of BLAS Level 3 operations than reflections applied one by one. However, the WY representation requires extra flops compared to non-aggregated transformations, both in the formation of the WY representation and in its application. In comparison with the LINPACK QR reduction (at the time [6] was written, 1985), the WY representation has a higher amount of work by a factor of (1 + 2/N) [6], when applied in a blocked manner with N blocks. By choosing an appropriate value of N, the WY representation can perform better than reflections that are applied one by one. If N is too large, the representation loses its advantage of applying several reflections at once. If N is too small, the increased number of flops can become significant. Y can be stored in the lower triangular part of A as in the previous section. Y is the only matrix needed in order to recover the eigenvectors because it contains all reflection vectors u. The compact WY representation is a more storage-efficient variant of the WY representation. Aggregating the compact WY representation Q = I + Y T Y^T requires fewer flops than the WY representation, and the two require almost the same amount for applying Q [15].

The compact WY representation Q = I + Y T Y^T is illustrated in Figure 3.5. Y is a trapezoidal matrix of size m × k and T is an upper triangular matrix of size k × k for k aggregated transformations. The sparse and small structure of Y and T enables the compact storage, where Y^T can be derived from Y.

Figure 3.5: Compact WY representation. Black areas represent non-zero elements.

3.5 Blocking and Data Distribution

Blocking is a proven way of increasing the performance of a dense matrix algorithm [7]. The concept of blocking is that a matrix is divided into smaller blocks. If a block is small, it can fit in the fast cache memory, as opposed to the whole matrix. If the blocks are often reused, blocking gives better performance. Blocking can also be used to exploit parallelism in two ways:

1. Operations on distinct blocks may be done in parallel and

2. operations within a block may be parallelized.

One difficulty with blocking is to find a suitable block layout (see Figure 3.6 for examples) and block size. Blocking can also be difficult because of dependencies in the computations. Another disadvantage with blocking is that it introduces a concept that is not related to the original problem; it is not always intuitive and therefore reduces code readability.

(a) Row blocking. (b) Column blocking. (c) 2D blocking.

Figure 3.6: Examples of blocking of an 8 × 8 matrix into blocks B.

Blocking is also used for partitioning data in a DM system. Data can be distributed as in Figure 3.6, where each process stores one block which is either blocked along rows, columns, or is square shaped (2D). Another type of distribution is the block cyclic partitioning of data, where the matrix is partitioned into 2D blocks that are distributed cyclically in both directions over a logical two-dimensional process mesh. Figure 3.7 describes the layout on a 32 × 32 matrix, divided into 4 pieces of 4 × 4 blocks. The block cyclic distribution is preferable for an algorithm that works on both columns and rows. An advantage of the block cyclic distribution is load balance: if some parts of the matrix require more work than others, a block row, block column, or basic 2D blocking could give an uneven load. An even load is important in parallel computations. If one process has much more work to do than other processes of the same speed, the program will be close to sequential and nothing is gained by parallelization. Another advantage of the block cyclic distribution is that data can be divided into very small blocks. This is not possible with row cyclic or column cyclic distribution (the matrix is divided into block rows or block columns and distributed to each process in a round-robin fashion), where the minimum size of each block is bounded by the number of rows or columns. The block cyclic distribution, on the other hand, is bounded only by the size of an element.

Figure 3.7: A block cyclic distribution. Every block Bij consists of 4×4 elements distributed to a process pij.
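The index-to-owner map of a 2D block cyclic distribution is simple. A sketch, assuming square nb × nb blocks on a pr × pc process grid (the parameter names and the 2 × 2 grid in the usage example are illustrative assumptions, not taken from the text):

def owner(i, j, nb, pr, pc):
    """Process grid coordinates (row, col) owning element (i, j), 0-based indices."""
    return ((i // nb) % pr, (j // nb) % pc)

# Example: a 32 x 32 matrix in 4 x 4 blocks on a 2 x 2 process grid
print(owner(0, 0, 4, 2, 2))    # -> (0, 0)
print(owner(5, 13, 4, 2, 2))   # -> (1, 1): block row 1, block column 3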

3.6 Block Reduction

The block reduction algorithms presented in [9] reduce a matrix to any of the forms listed in Table 3.2 by working on blocks. This is done in a similar way to the WY representation. For the Hessenberg reduction, reflections can be aggregated and applied in a blocked manner, as described in [9]. The aggregation of reflections is outlined as follows:

– The Hessenberg form of matrix A ∈ R^{n×n} is defined as H = P_{n−2}^T ··· P_2^T P_1^T A P_1 P_2 ··· P_{n−2}.

– Every update AP_i = A(I − 2uu^T) = A − 2(Au)u^T is a rank-1 update (outer product update) of A. When the update is made from both the left and the right, the combined update has rank 2.

T T – The rank-2 update of step i+1 can be rewritten as Ai+1 = Ai −2uivi −2wiui where: T yi = Ai ui,

zi = 2Aiui, T vi = yi − (zi ui)ui, and T wi = zi − (yi ui)ui. 16 Chapter 3. Literature Review

T T – The transformations at column k + 1 are aggregated as Ak+1 = A1 − 2UV − 2WU T The important part with this rewriting scheme is that the the update Ak+1 = A − 2UV − 2WU T is a rank-2k update that is rich in BLAS Level 3 operations and can be performed in blocks. U, V, and Y are trapezoidal and due to their sparse structure, they can be stored efficiently. Algorithm 2 shows an outline of the blocked reduction for number of blocks N − 1. For every block k, b aggregated transformations are generated. The transformations are applied as a rank-2b update on the trailing submatrix after column (k − 1) ∗ b in A.

Algorithm 2 Blocked Hessenberg reduction of matrix A. A is overwritten, one column block of size b at a time, with the Hessenberg form of A.

N = (n − 2)/b
for k = 1, 2, ..., N − 1 do
  s = (k − 1)b + 1
  for j = s, s + 1, ..., s + b − 1 do
    Generate U_j, V_j, and W_j
    Aggregate U_j, V_j, and W_j in U, V, and W, respectively
  end for
  // Perform rank-2b update on the trailing submatrix
  A(1 : n, s + b : n) = (A − 2UV^T − 2WU^T)(1 : n, s + b : n)
end for
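A quick numerical check of the rank-2 form of one two-sided Householder update (NumPy assumed). The normalization differs slightly from the one in the list above: here z_i = A_i u_i rather than 2A_i u_i, so that the identity P A P = A − 2uv^T − 2wu^T holds with the coefficients written out explicitly:

import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
u = rng.standard_normal(n); u /= np.linalg.norm(u)
P = np.eye(n) - 2.0 * np.outer(u, u)

y = A.T @ u                      # row part of the update
z = A @ u                        # column part (the text scales this by 2)
v = y - (y @ u) * u
w = z - (z @ u) * u

# One two-sided update expressed as a rank-2 correction of A
assert np.allclose(P @ A @ P, A - 2.0 * np.outer(u, v) - 2.0 * np.outer(w, u))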

The first stage in the two-staged algorithm (see Algorithm 3) uses a similar approach for reducing a matrix to block upper Hessenberg form [11]. The algorithm applies aggregated reflections from left to right. The reflections are generated by a recursive QR factorization and aggregated in compact WY form. When the algorithm is finished, matrix A has been overwritten by a block upper Hessenberg matrix with r diagonals below the main diagonal.

3.7 Parallel blocked algorithm

As described in Chapter 1, the two-staged reduction requires two stages for reducing a matrix to Hessenberg form. Two stages are preferred in some cases because the blocked updates in the first stage are more memory efficient than those of the direct reduction. The first stage reduces the full matrix to block (upper) Hessenberg form. An efficient way of doing this is by dividing a matrix A into four blocks:

A = [ A_11  A_12 ]
    [ A_21  A_22 ],

where A ∈ R^{n×n} and A_11 ∈ R^{b×b}. Reduction of block A_21 to upper triangular form R_1 = Q̃_1^T A_21, computed with a QR factorization, can be done without using any of the elements in the other blocks. When A_21 has been reduced, all blocks are updated as

Q_1^T A Q_1 = [ A_11   A_12 Q̃_1        ]        Q_1 = [ I_b   0   ]
              [ R_1    Q̃_1^T A_22 Q̃_1 ],             [ 0     Q̃_1 ],

where I_b is the identity matrix of size b. This procedure is repeated from left to right for N − 1 iterations k = 1, ..., N − 1, by reducing submatrix A(n − kb : n, b : n) in every iteration.

Algorithm 3 First stage of the two-staged algorithm. Outer block size is given by b and the resulting block upper Hessenberg matrix is given a lower bandwidth of r.

for j1 = 1 : b : n − r − 1 do
  b̂ = min(b, n − r − j1)
  Y ∈ R^{n×b̂}, V ∈ R^{n×b̂}, T ∈ R^{b̂×b̂}
  for j2 = j1 : r : j1 + b̂ − 1 do
    r̂ = min(r, j1 + b̂ − 1)
    i4 = j2 + r : n
    i5 = 1 : j2 − j1
    i6 = j2 − j1 + 1 : j2 + r̂ − j1
    i7 = j2 : j2 + r̂ − 1
    A(i2, i7) = A(i2, i7) − Y(i2, i5) V(i7, i5)^T
    A(i2, i7) = (I − V(i2, i5) T(i5, i5) V(i2, i5)^T)^T A(i2, i7)
    QR-factorize the block as A(i4, i7) = (I − V̂ T̂ V̂^T) R
    Aggregate the reflections: V(i4, i6) = V̂
    T(i6, i6) = T̂
    T(i5, i6) = V(i4, i5)^T V(i4, i6) T(i6, i6)
    Y(i2, i6) = A(i2, i4) V(i4, i6) T(i6, i6) − Y(i2, i5) T(i5, i6)
    T(i5, i6) = −T(i5, i5) T(i5, i6)
  end for
  Y(i1, :) = A(i1, i2) V(i2, :) T
  Apply the compact WY transformations:
  A(i1, i2) = A(i1, i2) − Y(i1, :) V(i2, :)^T
  A(i2, i3) = A(i2, i3) − Y(i2, :) V(i3, :)^T
  A(i2, i3) = (I − V(i2, :) T V(i2, :)^T)^T A(i2, i3)
end for

When N − 1 subdiagonal blocks have been reduced, a block upper Hessenberg matrix develops:

H = Q^T A Q = [ H_11   H_12   ···    ···         H_1N ]
              [ H_21   H_22   ···    ···         H_2N ]
              [ 0      H_32   ···    ···         H_3N ]
              [ ···    ···    ···    ···         ···  ]
              [ 0      0      ···    H_{N,N−1}   H_NN ]

Each transformation Q_i is represented in WY form for efficient application. The algorithm described above for block upper Hessenberg form can be parallelized, and this has been done for a SM architecture in [11]. Algorithm 3 is parallelized such that the operations are divided into coarse grained tasks. The tasks are scheduled for a parallel SM system using threads, where each thread calls a sequential implementation of BLAS operations. In [4], another parallel algorithm for reducing a full matrix to block upper Hessenberg form is described. It is an adaptation of the block Hessenberg reduction for a DM architecture. The algorithm performs operations on both columns and rows and therefore uses a block cyclic partitioning of data. Each block column k = 1, 2, ..., N − 1 of size b, for N − 1 block columns, is reduced to block upper Hessenberg form and the trailing submatrix beginning at column kb is updated. After N − 1 iterations, the full matrix is reduced to block upper Hessenberg form. In each iteration k, two phases are executed: a QR phase and a Givens phase. In the QR phase, all blocks below the diagonal in block column k are reduced to upper triangular form. In the Givens phase, all blocks below the diagonal in block column k, except the block closest to the diagonal, are annihilated with Givens rotations.

3.7.1 QR Phase

In the QR phase of the algorithm, every block B_ik (belonging to process P_ij) below the subdiagonal in the column that should be reduced (iteration k) is reduced to upper triangular form. This is carried out individually (without communication) using a QR reduction. The QR reduction generates transformation matrices Q_ik. Figure 3.8(a) shows the form of the process blocks after this reduction. When the processes below the subdiagonal in column k have finished their reduction, the rest of the matrix has to be updated, as seen in the previous section. The update is carried out by broadcasting each Q_ik to all other blocks, which can apply it individually. Process P_ij broadcasts Q_ij in an efficient way by sending to all processes on the same block row and column. Figure 3.8(b) shows the path of the messages sent in this phase. The matrix is now in block upper Hessenberg form, except for the upper triangular blocks below the subdiagonal in block column k (and all block columns > k). The upper triangular blocks below the subdiagonal have to be eliminated in order to complete the iteration.

(a) Individual reduction to upper triangular form in block column k. (b) Message paths when broadcasting Q_ik.

Figure 3.8: The QR-phase of the parallel block Hessenberg reduction algorithm.

3.7.2 Givens Phase

As seen in Section 3.2, Givens rotations are suited for applications where elements should be selectively annihilated with low data dependency, and this is why Givens rotations are used in the second phase of the algorithm. The Givens phase consists of two steps, local annihilation and global annihilation. In the local annihilation step, each processor block below the subdiagonal in block column k reduces its local blocks without communication. The outline of this step is:

1. Each process Pij picks its upper triangular block Bd, that is closest to the diagonal.

2. For every other block owned by Pij, reduce that block with Givens rotations, using the Bd block as a pivot block.

The pivot block is used as a reference for eliminating the other local blocks. Figure 3.9(a) shows an outline of the local annihilation step. In Figure 3.9(a), the example is too small to show the general case, but for a larger matrix the concept is the same. When a process has computed a transformation (Givens rotation), it broadcasts the transformation matrices to all block columns and rows. The broadcast and application of each transformation is executed in the same way as in the QR phase. The result from the local annihilation step is illustrated in Figure 3.9(b).

(a) Local annihilation using the blocks closest to the diagonal as pivot blocks. This example is extended with gray blocks in order to illustrate the general idea. (b) Result after local annihilation.

Figure 3.9: Local annihilation step of the Givens phase.

The global annihilation step reduces the remaining upper triangular blocks by using the subdiagonal block as a pivot block. This is executed in log2(np) steps, where np is the number of processes involved in the reduction. Figures 3.10(a) and 3.10(b) illustrate the procedure of reducing the blocks below the subdiagonal in block column k. The result of the second step is illustrated in Figure 3.10(c).

3.8 Successive Band Reduction

In the previous section, a full matrix was reduced to block upper Hessenberg form. The analog of block Hessenberg form for symmetric matrices is banded form (or block tridiagonal form). The resulting symmetric banded matrix has to be further reduced to true tridiagonal form in order to be used in eigenvalue computations. If the bandwidth (the number of non-zero diagonals) is low compared to the full matrix size, there are more efficient approaches than the full reduction to Hessenberg form described previously. The Successive Band Reduction Toolbox (SBR) [5] is an implementation of a framework for algorithms that “peel off” diagonals from a symmetric banded matrix. These algorithms have the same structure:

(a) Global annihilation step, iteration one. (b) Global annihilation step, iteration two. (c) Result after global annihilation.

Figure 3.10: Global annihilation step of the Givens phase.

– Annihilate one or several elements close to the diagonal.

– Bulge chasing in order to restore the banded form.

When one or several elements are annihilated, the rest of the matrix has to be updated with the transformation. This introduces a bulge, a sort of “lump” on the banded matrix. If the introduced bulge is not taken into account and the algorithm continues to annihilate elements, the bandwidth will rapidly increase. If the bandwidth increases too much, the algorithm becomes as inefficient as a full reduction algorithm (see Algorithm 1). Between each annihilation in the reduction, a bulge chasing step is executed. Bulge chasing has the purpose of chasing bulges off down the diagonal in order to prepare for the next iteration. When a bulge has been chased off, the next iteration can begin without further increase of the bandwidth. Figure 3.11 explains the concept of bulge chasing. If all lower diagonals except the subdiagonal are eliminated, tridiagonal form is achieved. This direct reduction only requires 6bn^2 flops [5] for a symmetric matrix A ∈ R^{n×n} with half bandwidth b. If b ≪ n, this is a huge improvement over Algorithm 1, which can also be used for tridiagonal reduction. If matrix A is a full symmetric matrix, the reduction to banded form as in the previous algorithms contributes additional work. This additional work reduces the performance gap between Algorithm 1 and the reduction from a full symmetric matrix to banded form followed by a band reduction.

(a) Banded matrix with semi-bandwidth 3 (non-zeros below the main diagonal). (b) The first column is reduced and the rest of the matrix is updated. (c) This introduces a bulge that has to be chased off the diagonal before the second column can be reduced. (d) If the first column of the bulge is reduced, the next iteration can begin. (e) The bulge moves down the diagonal.

Figure 3.11: Bulge chasing.

An iteration, with the annihilation of elements and the bulge chasing that follows, is called a sweep. What can be varied in the process of the

SBR is the number of elements d that are eliminated per iteration. By changing d, the SBR can have three optimized implementations, with

1. minimum algorithmic complexity (fewest flops),

2. minimum algorithmic complexity subject to limited storage or

3. better support for using Level 3 BLAS kernels (sub-programs).

The last implementation type requires an idea used in previous sections. By aggregating several transformations in the annihilation step, the updates can be represented in WY form and involve more Level 3 BLAS operations. Figure 3.12 shows how the bulge chasing works with aggregated updates. Aggregated updates can be executed in parallel and give high performance. In the band reduction, the reflection vectors cannot be stored in the lower part of the matrix, as in the previous methods. The large number of reflections required in the bulge chasing does not fit in the zeroed part of the matrix.

(a) The lower diagonals. (b) The first q = 3 columns are reduced with a QR reduction step. A bulge is introduced. (c) The first q columns of the bulge are reduced with a QR reduction. (d) Result after one reduced bulge.

Figure 3.12: Bulge chasing with aggregated transformations. This shows the lower diagonals because they are the only diagonals that need to be updated for a symmetric matrix. Updates are identical on the upper diagonals and do not have to be shown.

3.9 Two-staged algorithm

The two-staged algorithm, as proposed in [11], is a parallel algorithm for a SM architecture that reduces a full matrix A to Hessenberg form in two stages. The first stage is to reduce A to block upper Hessenberg form with r subdiagonals. The second stage is to reduce the block upper Hessenberg matrix to true upper Hessenberg form by trimming the lower bandwidth down to r = 1 (see Figure 1.8).

3.9.1 Stage 1

The first stage reduces a full matrix to block upper Hessenberg form, as described in Algorithm 3 and parallelized for a SM architecture in [11] and Section 3.7. The implementation of the blocked algorithm in [11] requires 10n^3/3 flops. 80% of the flops are BLAS Level 3 operations (matrix-matrix multiplications) and the remaining 20% are BLAS Level 2 operations (matrix-vector multiplications).

3.9.2 Stage 2

The second stage is highly related to the Successive Band Reduction described in Section 3.8. By adapting SBR to the unsymmetric case (Hessenberg reduction), the lower bandwidth r can be reduced. The basic reduction from block Hessenberg to true Hessenberg form is shown in Algorithm 4 (described and improved in [11] and [12]). In this algorithm, one column at a time is reduced. When one column is reduced, the reflections are applied from the left and the right. This introduces a bulge that is chased off the diagonal in the same way as the column reduction was made.

Algorithm 4 Unblocked reduction from a block Hessenberg matrix A with lower bandwidth r. Notice that this algorithm uses zero-based indexation.

for j = 1 : n − 2 do
  k1 = 1 + ⌈(n − j − 2)/r⌉
  for k = 0 : k1 − 1 do
    l = j + max(0, (k − 1)r + 1)
    i1 = j + kr + 1 : min(j + (k + 1)r, n)
    i2 = l : n
    i3 = 1 : min(j + (k + 2)r, n)
    Reduce column A(i1, l) with Householder reflection Q_k^j
    Apply the reflection from the left:  A(i1, i2) = (Q_k^j)^T A(i1, i2)
    Apply the reflection from the right: A(i3, i1) = A(i3, i1) Q_k^j
  end for
end for

(a) A block upper Hessenberg matrix. (b) The first column is reduced with a Householder reflection. (c) The reflection is applied from the left. (d) The reflection is applied from the right. (e) A bulge is introduced.

Figure 3.13: One iteration of the inner loop of Algorithm 4. As seen in the last figure, a bulge is introduced. The inner for-loop of Algorithm 4 both reduces a column and chases the introduced bulge down the diagonal.

An example execution of Algorithm 4 is illustrated in Figure 3.13. This algorithm has the same disadvantage as Algorithm 1: it performs a low number of operations compared to the amount of data. In order to improve this, the bulge-sweep process is reordered and the reduction is divided into two steps per iteration, as described in [12].

Step 1, Generate

The first step is to generate transformations from several sweeps. This procedure is similar to Algorithm 4 except that it only reads and updates values close to the diagonal. Transformations for q consecutive sweeps are generated along the diagonal. Before a column is reduced, it must be brought up-to-date by applying reflections from previous sweeps. Algorithm 5 gives a detailed description of the generate step. Because of data dependencies and fine grained tasks, the generate step is not suitable for parallelization.

Algorithm 5 Generates reflections Q_*^j for q columns and updates matrix A close to the diagonal. Notice that this algorithm uses zero-based indexation.

for j = j1 : j1 + q − 1 do
  k1 = 2 + ⌈(n − j − 1)/r⌉
  for k = 0 : k1 − 1 do
    α1 = j + kr + 1
    α2 = min(α1 + r − 1, n)
    β = j + max(0, (k − 1)r + 1)
    for ĵ = j1 : j − 1 do
      α̂1 = ĵ + kr + 1
      α̂2 = min(α̂1 + r − 1, n)
      if α̂2 − α̂1 + 1 ≥ 2 then
        Bring A(α̂1 : α̂2, β) up-to-date by A(α̂1 : α̂2, β) = (Q_k^ĵ)^T A(α̂1 : α̂2, β)
      end if
    end for
    if α2 − α1 + 1 ≥ 2 then
      Reduce A(α1 : α2, β) with reflection Q_k^j
      Update the column: A(α1 : α2, β) = (Q_k^j)^T A(α1 : α2, β)
      γ1 = j1 + 1 + max(0, (k + j − j1 − q + 2)r)
      γ2 = min(j + (k + 2)r, n)
      Introduce the bulge: A(γ1 : γ2, α1 : α2) = A(γ1 : γ2, α1 : α2) Q_k^j
    end if
  end for
end for

Step 2, Update

When Algorithm 5 has completed, the rest of the matrix blocks can be updated with the reflections Q_*^j. Using threads, the updates are made in parallel on a shared memory architecture. Algorithm 6 describes how the updates are applied. Each process has its own range of rows (r1 : r2) and columns (c1 : c2) and can execute its updates on these ranges individually. Between variants R and L, the processes have to synchronize in order to avoid conflicts. In [12], the two steps are further optimized by dividing variant R into PR, UR and L into PL, UL (P and U stand for Prepare and Update). By dividing the updates in this manner, PR and PL update values close to the diagonal, which are required for the next generate step. This allows the sequential computations involved in the generate step to be performed on one process at the same time as the other processes apply the UR and UL variants. Another improvement made in [12] is that the processes are given row and column ranges that match their performance. These ranges are recalculated between iterations in order to minimize idle time at the processes. Ranges r1 : r2 and c1 : c2 depend on how fast the processes executed their updates in the previous iterations. Using all these optimization techniques, the Generate and Update algorithm runs 13 times faster than Algorithm 4 in the test environment used in [12] (8 cores, r = 12, q = 16 and A ∈ R^{n×n}, n ≈ 2200).

Algorithm 6 Updates the remaining part of matrix A with reflections generated in Algo- rithm 5. The updates come in different variants (R and L). The variant specifies if the reflection should be applied from the right or left to the block.

k1 = 1 + ⌈(n − j1 − 2)/r⌉
for k = k1 − 1 : −1 : 0 do
  for j = j1 : j1 + q − 1 do
    α1 = j + kr + 1
    α2 = min(α1 + r − 1, n)
    if α2 − α1 ≥ 2 then
      if variant is R then
        γ1 = max(r1, 1)
        γ2 = max(r2, j1 + max(0, (k + j − j1 − q + 2)r))
        A(γ1 : γ2, α1 : α2) = A(γ1 : γ2, α1 : α2) Q_k^j
      end if
      if variant is L then
        β1 = max(c1, j1 + q, j1 + q + q(k − 1)r + 1)
        β2 = min(c2, n)
        A(α1 : α2, β1 : β2) = (Q_k^j)^T A(α1 : α2, β1 : β2)
      end if
    end if
  end for
end for

Figure 3.14: Broadcast operation.

3.10 MPI Functions

For communication in the distributed algorithms, four operations are used: broadcast, scatter, gather, and all-to-all. Broadcast is an operation where one process sends a message to all other processes (also called one-to-all broadcast) [1, (p.167–186)]. Figure 3.14 illustrates the broadcast operation.

Scatter, also called one-to-all personalized communication [1, (p.167–186)], is an operation where one process sends message M_i to process p_i (see Figure 3.15). A customization of the MPI_Scatterv function is used in the distributed algorithms. MPI_Scatterv is used for sending messages from one process to all other processes, with a different size and displacement for each message (see Figure 3.16 for a comparison). This allows the root process to send a variety of messages. (A limitation of MPI_Scatterv is that it can only scatter data using one data type. Because of this, a customized scatter function has been implemented, which can send blocks of different sizes to each process.) In the implemented scatter, the root process sends one message M_i to process p_i. Process p_i receives M_i and the root process continues by sending message M_{i+1} to process p_{i+1}. When all processes p = 1, ..., np − 1 have received a message, the scatter of data is finished.

Gather is the dual of the scatter operation. In this operation, each process p_i sends a personalized message M_i to the root process. When the operation is finished, root holds all messages M_0, M_1, ..., M_{np−1}.

Figure 3.15: Scatter and gather operations.

(a) Scatter of memory from the root process. Every process gets an equal amount of data. (b) MPI_Scatterv: blocks of different sizes are scattered, starting at different positions. Gray memory is not sent.

Figure 3.16: Comparison between scatter and MPI_Scatterv.
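As a sketch of such a variable-sized scatter (assuming the mpi4py bindings; the thesis's own routine is custom and may differ in detail), the standard MPI_Scatterv can distribute row blocks of unequal height from the root:

# Run with: mpiexec -n 4 python scatter_rows.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nproc = comm.Get_rank(), comm.Get_size()

n = 10                                            # 10 x 10 matrix, scattered by rows
rows = [n // nproc + (1 if p < n % nproc else 0) for p in range(nproc)]
counts = [r * n for r in rows]                    # elements sent to each process
displs = [sum(counts[:p]) for p in range(nproc)]  # start offset of each block

A = np.arange(n * n, dtype='d').reshape(n, n) if rank == 0 else None
local = np.empty((rows[rank], n), dtype='d')
comm.Scatterv([A, counts, displs, MPI.DOUBLE], local, root=0)
print(f"process {rank} holds a {local.shape[0]} x {local.shape[1]} row block")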

In the all-to-all operation (also called all-to-all personalized communication) every process sends a distinct message to every other process (see Figure 3.17). This operation is implemented as MPI_Alltoall in MPI. MPI_Alltoall has a related MPI_Alltoallv function that can work with messages of different sizes, but MPI_Alltoallv cannot send messages with different data types. A custom all-to-all function has therefore been implemented that can send messages with different data types. (The custom all-to-all function behaves like MPI_Alltoallw, which was unknown to the implementor at the time. No performance comparison has been done between the custom implementation and MPI_Alltoallw; this is a topic for further studies.) Many functions in MPI have a non-blocking counterpart, which does not wait for communication to finish. Non-blocking functions can be used to overlap communication with computation and hide communication overhead. In the custom implementation of all-to-all, each process p_i sends its messages M_{i,k} to the corresponding receivers k = 0, 1, ..., np − 1. The sending is done using non-blocking sends, so that the next send can begin immediately. This is also the case for the receiving side: receiving is done non-blocking. When all non-blocking sends and receives have been initiated, all processes wait for their receives and then for their sends.

Figure 3.17: The all-to-all operation. In total np × np messages are exchanged between the np processes.

3.11 Storage

For storage in the distributed algorithms, four data types are used: full matrix, block row, block column, and banded matrix. The block row and block column data types are generated at all processes using MPI_Type_create_subarray, a function for creating sub-blocks of a larger matrix. The blocking is done in rows and columns, similarly to the row and column blocking described in Section 3.5. When used for Hessenberg reduction, the j1 already reduced columns can be excluded from the distribution. Figure 3.18 shows an example of how the rows and columns could be blocked.

(a) Row blocking with j1 = 5. (b) Column blocking with j1 = 5.

Figure 3.18: Row and column blocking with j1 reduced columns and 4 processors.

The generate step (see Algorithm 5) only operates around the diagonal of A and the root process only needs some diagonals of A. When only some diagonals are used, a banded matrix [2] is a storage-efficient data type. When stored as a banded matrix, only kl subdiagonals, ku superdiagonals and the main diagonal have to be stored. Figure 3.19 illustrates this storage. In order to access an element in the banded matrix using a position based on the full matrix, the following conversion is made: an element at position (i, j) in the original matrix A is located at position (ku + 1 + i − j, j) in the banded matrix.
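The conversion is a one-liner; a small sketch using the text's 1-based indexing convention:

def banded_position(i, j, ku):
    """Map element (i, j) of the full matrix (1-based) to its (row, column)
    in the banded storage with ku superdiagonals."""
    return (ku + 1 + i - j, j)

print(banded_position(1, 1, 3))   # -> (4, 1): the main diagonal is stored in row ku + 1
print(banded_position(2, 5, 3))   # -> (1, 5): the third superdiagonal is stored in row 1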

Figure 3.19: Banded matrix storage. ku = 3 superdiagonals and kl = 3 subdiagonals from the full matrix on the left are stored in the banded matrix on the right.

Chapter 4

Results

The second stage of the two-staged algorithm (see Section 3.9) has been adapted for a distributed memory architecture. The result is two implementations that use MPI for communication.

4.1 Root-Scatter algorithm

Our Root-Scatter algorithm is a basic adaptation of the algorithm described in Section 3.9. This algorithm uses the same algorithms as the SM implementation [12], combined with message passing communication. The algorithm works similarly to the SM implementation, described in Section 3.9.2. One difference is that before each update, the matrix A(:, j1 : n) is scattered in blocks from the root process, which always holds the whole matrix A. For the R update (see Algorithm 6), A(:, j1 : n) is row blocked and for the L update, A(:, j1 : n) is column blocked. Each process p_i updates its block A_i and the root process gathers the results. Algorithm 7 describes the procedure, where q columns are reduced in each iteration, from left to right. A problem with this approach is that the root process stores the whole matrix of n^2 elements. For a large matrix, the root process could run out of memory. Another problem is that the sequence scatter → update → gather → scatter → update → gather makes the root process a bottleneck (see Figure 4.1); all communication has to go through the root process. For many processes, this solution should not scale well, but the algorithm was implemented for comparison with other solutions. All blocks have to be gathered to the root process before the next update can occur. One row block is of size (n − j1 + 1) × n/np and one column block is of size (n − j1 + 1)/np × n (they have the same number of elements). In each update, the root process sends np − 1 blocks and gathers the same number of blocks. In one iteration, sending blocks of matrix A gives a total communication volume of

4(np − 1)(n − j1 + 1) n/np

elements, with the additional latency involved in 4(np − 1) message transfers. The complexity is O(np), which could be improved to logarithmic complexity by using existing collective communication algorithms. This has not been done for the Root-Scatter algorithm because of limited time.


Figure 4.1: Communication of A in the Root-Scatter algorithm.

Algorithm 7 Root-Scatter. Matrix A ∈ R^(n×n) is held at root and is updated in n/q iterations, where q is the number of consecutive sweeps per iteration.

for j1 = 1 : q : n do
    if root process then
        Run Algorithm 5 with A and j1 in order to generate transformations Q
    end if
    Broadcast Q
    Scatter A(:, j1 : n) as row blocks Ai
    Update block Ai with Algorithm 6 (variant=R)
    Gather blocks to A(:, j1 : n)
    Scatter A(:, j1 : n) as column blocks Ai
    Update block Ai with Algorithm 6 (variant=L)
    Gather blocks to A(:, j1 : n)
end for
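For illustration, the communication skeleton of one Root-Scatter iteration could look roughly as follows in C++ with MPI. This is a minimal sketch under simplifying assumptions (n divisible by np, blocks already packed contiguously, and the kernels of Algorithms 5 and 6 appearing only as placeholder comments); it is not the thesis implementation.

#include <mpi.h>
#include <vector>

void root_scatter_iteration(std::vector<double>& A,      // full matrix, valid on root only
                            std::vector<double>& block,  // local row/column block
                            std::vector<double>& Q,      // packed transformations
                            int block_elems, int q_len, int root, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == root) {
        // Algorithm 5: generate transformations Q from A (root only).
    }
    MPI_Bcast(Q.data(), q_len, MPI_DOUBLE, root, comm);

    // R update on row blocks.
    MPI_Scatter(A.data(), block_elems, MPI_DOUBLE,
                block.data(), block_elems, MPI_DOUBLE, root, comm);
    // Algorithm 6 (variant R) applied to the local block here.
    MPI_Gather(block.data(), block_elems, MPI_DOUBLE,
               A.data(), block_elems, MPI_DOUBLE, root, comm);

    // L update on column blocks (repacking into column blocks is omitted here).
    MPI_Scatter(A.data(), block_elems, MPI_DOUBLE,
                block.data(), block_elems, MPI_DOUBLE, root, comm);
    // Algorithm 6 (variant L) applied to the local block here.
    MPI_Gather(block.data(), block_elems, MPI_DOUBLE,
               A.data(), block_elems, MPI_DOUBLE, root, comm);
}

The sketch makes the bottleneck visible: every scatter and gather is rooted at the same process, so all data passes through it twice per update.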

4.2 DistBlocks algorithm

Our DistBlocks algorithm is an improvement of the Root-Scatter algorithm. By using a more distributed communication model and partitioning, the scalability issues can be avoided. Like the Root-Scatter algorithm, the DistBlocks algorithm uses algorithms from the SM implementation. The key differences are the communication between the updates and the storage of A at the root process.

First, the (q ∗ r − 1) superdiagonals, the (2 ∗ r − 1) subdiagonals, and the main diagonal of matrix A are copied to a banded matrix banded, as seen in Figure 3.19. The root process needs (q ∗ r − 1) superdiagonals and (2 ∗ r − 1) subdiagonals to execute the generate step. The root process scatters matrix A in equally sized row blocks. When the generate step is finished, the root process broadcasts the transformations Q and scatters the banded matrix to the processes that hold diagonal elements. All processes run the R variant of Algorithm 6 on their row range r1 : r2. The transition from row blocks to column blocks is done through the all-to-all operation. Each process can then update its columns by running the L variant of Algorithm 6 on its column range c1 : c2. A transition back to row blocks is made. Before the next iteration can begin, the root process gathers the updated entries near the diagonal into the banded matrix. The root process holds the full matrix A in the beginning, but uses it only to store the result of q reduced columns per iteration. A description can be found in Algorithm 8, where q columns are reduced in each iteration. The transition from row blocks to column blocks is illustrated in Figure 4.2.

The scatter and gather of the banded matrix are implemented using MPI_Type_indexed, which can describe complex data layouts such as the banded form. The scatter of the banded matrix is illustrated in Figure 4.3. The transition from row blocks to column blocks requires every process to send np blocks with a total of (n − j1 + 1)n/np elements. Per iteration, the transition executes twice, which leads to a total communication cost of 2(n − j1 + 1)n/np. This is an improvement over the previous algorithm: the communication volume for the all-to-all operation is reduced for a higher number of processes, which is not the case for the communication in the Root-Scatter algorithm.
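When all sub-blocks have equal size, the row-block to column-block transition can be expressed with a standard collective. The following C++ sketch uses MPI_Alltoall under that assumption and omits the packing and unpacking of sub-blocks; the thesis implementation instead uses a custom all-to-all function (see the footnote on MPI_Alltoallw above), so this is only an illustration.

#include <mpi.h>
#include <vector>

// Each process contributes np equally sized sub-blocks, packed contiguously in
// packed_row_block, and receives np sub-blocks that together form its column block.
void rows_to_columns(std::vector<double>& packed_row_block,
                     std::vector<double>& packed_col_block,
                     int sub_block_elems, MPI_Comm comm) {
    MPI_Alltoall(packed_row_block.data(), sub_block_elems, MPI_DOUBLE,
                 packed_col_block.data(), sub_block_elems, MPI_DOUBLE, comm);
}

The reverse transition (columns back to rows) is the same call with the roles of the two buffers exchanged.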

Figure 4.2: Transition from block rows to block columns, after j1 = 5 columns have been reduced. The reverse procedure can be used to go from block columns to block rows.

4.3 Performance

In order to test the performance of the two distributed algorithms, tests are made on the Abisko high performance computing (HPC) cluster at High Performance Computing Center North (HPC2N). Abisko is a cluster of 322 nodes, each comprising 4 AMD Opteron 12 core

Algorithm 8 DistBlocks. Root process (rank = 0) uses a banded matrix banded in order to do the generate step. When the L updates have been made, the root process holds at least q columns. This allows the root process to store the result from each iteration in A, without communication.

if root process then
    Copy bands from A to banded matrix banded
end if
Scatter A as row blocks Ai
for j1 = 1 : q : n do
    if root process then
        Run Algorithm 5 with banded and j1 in order to generate transformations Q
    end if
    Broadcast Q
    Scatter banded(:, j1 : n) to row blocks Ai
    Update block Ai with Algorithm 6 (variant=R)
    Transit to column blocks with all-to-all
    Update block Ai with Algorithm 6 (variant=L)
    if root process then
        Copy the first q columns of A0 to matrix A(:, 1 : j1 + q)
    end if
    Transit to row blocks with all-to-all
    Gather diagonals from blocks to banded
end for

Figure 4.3: Scatter of updated diagonals, from the root process to all processes. The reverse can be used as gather.
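A derived datatype for the banded part of a column range could be built with MPI_Type_indexed roughly as follows. The column-major layout, 1-based indices, and one contiguous band segment per column are assumptions made for illustration; the actual datatype construction in the implementation may differ.

#include <mpi.h>
#include <algorithm>
#include <vector>

// Builds a datatype selecting, for each column j in [c1, c2], the rows of an
// n-by-n column-major array that lie within ku superdiagonals and kl
// subdiagonals of the main diagonal.
MPI_Datatype make_band_type(int n, int c1, int c2, int ku, int kl) {
    std::vector<int> blocklens, displs;
    for (int j = c1; j <= c2; ++j) {
        int first = std::max(1, j - ku);              // first row inside the band
        int last  = std::min(n, j + kl);              // last row inside the band
        blocklens.push_back(last - first + 1);
        displs.push_back((j - 1) * n + (first - 1));  // offset in elements
    }
    MPI_Datatype band;
    MPI_Type_indexed(static_cast<int>(blocklens.size()), blocklens.data(),
                     displs.data(), MPI_DOUBLE, &band);
    MPI_Type_commit(&band);
    return band;  // use in point-to-point or collective calls, then MPI_Type_free
}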

2.6 GHz processors. Each core has a theoretical peak performance (TPP) of 10.4 Gflop/s. Two ranges of tests are designed in order to answer the questions stated in Section 1.5:

1. Test the performance of the Root-Scatter algorithm and the DistBlocks algorithm compared to the unblocked band reduction algorithm. Tests are made on sizes n = {800 : 200 : 6000} with q = 16, r = 12 and on 8 processes. The tests are similar to the ones done in [12] for testing the SM implementation. They are used to compare the SM implementation with the DM implementation. The results from these tests will be used to answer question 2.

2. The DM implementation can run on a large number of cores. In order to test its efficiency, a test of the scaled speedup is executed. Speedup is a measure of parallel scalability. It is defined as S = T1/Tp, where T1 is the execution time for the sequential algorithm (in this case, Algorithm 4) and Tp is the time for the parallel algorithm, executed on p processes [1, p. 130]. Scaled speedup (S0) is a measurement where the problem size increases with the number of processes in the test. In this case,

the problem size W is defined as Cn^3 and as the number of processes increases with a factor of 1, 2, 3, . . . , 8, the order n of matrix A increases with a factor of ∛1, ∛2, . . . , ∛8. Scaled speedup is used in order to analyze the efficiency of a parallel program. Ideally, the scaled speedup should be linear in relation to the problem size W. Table 4.1 presents the parameters used in the scaled speedup test. The scaled matrix order is reduced to the nearest multiple of the number of processes, np, due to a limitation in the implementation. All tests run with q = 16, r = 12. The results from these tests will be used to answer questions 1 and 3.

Problem size W    Matrix order n           Processors np
1                 2000                     8
2                 2000 ∗ ∛2 ≈ 2526         8 ∗ 2 = 16
3                 2000 ∗ ∛3 ≈ 2880         8 ∗ 3 = 24
4                 2000 ∗ ∛4 ≈ 3168         8 ∗ 4 = 32
5                 2000 ∗ ∛5 ≈ 3440         8 ∗ 5 = 40
6                 2000 ∗ ∛6 ≈ 3648         8 ∗ 6 = 48
7                 2000 ∗ ∛7 ≈ 3808         8 ∗ 7 = 56
8                 2000 ∗ ∛8 ≈ 4032         8 ∗ 8 = 64

Table 4.1: Tests created for scaled speedup.
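For reference, the scaled matrix orders in Table 4.1 can be approximately reproduced by scaling n with the cube root of the work factor and rounding to a multiple of np. The exact rounding mode below is an assumption; Table 4.1 lists the values actually used in the tests.

#include <cmath>
#include <cstdio>

// n1 * k^(1/3), adjusted to a nearby multiple of np (a limitation of the
// implementation, as noted in the text).
int scaled_order(int n1, int k, int np) {
    double n = n1 * std::cbrt(static_cast<double>(k));
    return static_cast<int>(std::lround(n / np)) * np;
}

int main() {
    for (int k = 1; k <= 8; ++k)
        std::printf("W=%d: n=%d, np=%d\n", k, scaled_order(2000, k, 8 * k), 8 * k);
}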

Besides this, tests are made on 4 processes and n = 4000 (q = 16, r = 12) in order to make an introspective analysis of some iterations. The introspective analysis is made by time stamping certain key passages in the code. This produces trace data that will be analyzed in order to answer question 4.

The implementations have been made in C++ using the Open MPI implementation of MPI [14]. The code is compiled with GCC version 4.4.3. Each test is executed three times. Because of the startup time of the program (setting up the MPI environment), the first execution is disregarded and the result is an average of the two last runs.

Figure 4.4 shows the performance of the two distributed implementations of the algorithm described in Section 3.9.2. The algorithms have an approximate flop count of 2n^3, which is the factor used to calculate the performance Wn/Tp, where Tp is the parallel execution time and Wn is the number of flops required for a problem of size n. Ideally, this test would reach the TPP of 83.8 Gflop/s (8 cores). In Figure 4.4, the highest performance, 2.5 Gflop/s, is reached at n = 6000, which is 3% of the TPP. For similar tests of the SM implementation in [12], another cluster (2.5 GHz processors, TPP = 80 Gflop/s) reaches approximately 12.5% of its theoretical peak for sizes n > 3000. The performance of the unblocked algorithm declines for n > 1000. The reason is that for large problems, larger parts of the matrix have to be used in the computations. If these parts do not fit in the fast cache memory, they are mostly stored in slower caches. The unblocked algorithm therefore scales worse on computers with caches than cache-efficient algorithms do.

Figure 4.5 shows the result of the scaled speedup test. In Figure 4.5, the distributed algorithms outperform the sequential unblocked algorithm at small problem sizes and with few processes. The reason for this is that the distributed algorithms apply their updates in a more cache-efficient way than the unblocked one. For larger problem sizes, the distributed algorithms perform poorly, even though more processes are doing the computation. For large problem sizes, the DistBlocks algorithm performs better than the Root-Scatter algorithm. The DistBlocks algorithm has a more efficient communication model than the Root-Scatter algorithm, which could be the reason for the performance difference.
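The performance and speedup figures above follow directly from the measured run times. A small helper, assuming the 2n^3 flop count stated in the text, could look as follows; the function names and the example timing are for illustration only.

#include <cstdio>

double gflops(double n, double seconds) {
    double flops = 2.0 * n * n * n;      // approximate flop count of the algorithm
    return flops / seconds / 1e9;
}

double speedup(double t_sequential, double t_parallel) {
    return t_sequential / t_parallel;    // S = T1 / Tp
}

int main() {
    // e.g. n = 6000 at roughly 2.5 Gflop/s corresponds to about 173 s of run time.
    std::printf("%.2f Gflop/s\n", gflops(6000, 172.8));
}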

Figures 4.6(a) and 4.6(b) show traces of the two distributed algorithms for n = 4000. What they demonstrate is that a substantial amount of time is spent on waiting for other processes. The three main factors behind the poor performance are:

1. Bad load balancing. The processes have very different loads in each iteration. The equally partitioned columns do not give an equal amount of work.

2. Sequential generate step. The sequential generate step requires much time. During this time, the other processes are waiting for process 0 to finish.

3. Communication overhead. Figure 4.6(b) illustrates that much more time (about six times as much) is spent on communication than on the generate step.

Three theoretical scenarios are constructed in order to see how much the three factors above affect the performance. The three scenarios are:

1. Perfect load balancing (time tload). This is simulated by summing the time spent in updates and dividing the sum evenly by the number of processes.

2. Hidden generate step (time tgen). This is simulated by taking the maximum of tgentot and tuptot/(np − 1), where tgentot is the total time spent in the generate step and tuptot is the total time for the updates. This simulates that the root process can do the generate step at the same time as the other processes do the updates.

3. No communication overhead (time tcomm). This is simulated by subtracting all time spent on communication.

Table 4.2 shows the result of combinations of these simulated implementations when running on a problem of order n = 4000; a sketch of how the scenario times can be derived from the trace sums is given further below. The results show that removing communication gives the largest improvement, but this is a very unrealistic case. The results also show how badly the load is distributed over the processes.

To see how much the sequential generate step slows down the algorithm, a calculation has been made for a theoretically perfectly load-balanced case. The total time spent in the update steps is summed into tR and tL over all processes. tR and tL are then divided by the number of processes, which simulates perfect load balancing, as in the previous test. The tR and tL values are compared with tG, the total amount of time spent in the generate step. Table 4.3 shows the calculations for both implementations. The last column is the fraction of time that is spent in the generate step (when conditions are ideal). The two implementations spend 15% and 26%, respectively, of the time in the generate step, even though the executions are theoretically perfectly load balanced and have no communication overhead.

The DistBlocks algorithm performs worse than the Root-Scatter algorithm. The difference is not large for the updates, but in the generation of transformations the gap is larger. This could be an effect of the banded form used in the DistBlocks algorithm. In order to use a banded matrix, every index is converted with (ku + 1 + i − j, j). This calculation could slow down the DistBlocks algorithm. The kernels used in the algorithms are not optimized for banded matrices.
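The following sketch shows how the three scenario times could be derived from per-process trace sums. The struct and function names are assumptions made for illustration; the input values would come from the time-stamped trace data described above.

#include <algorithm>

struct TraceSums {
    double t_update_total;    // total time spent in R and L updates, over all processes
    double t_generate_total;  // total time spent in the generate step (root process)
    double t_comm_total;      // total time spent in communication
    int    np;                // number of processes
};

// Scenario 1: perfect load balancing of the update work.
double ideal_load(const TraceSums& t) { return t.t_update_total / t.np; }

// Scenario 2: generate step hidden behind the updates of the other processes.
double ideal_gen(const TraceSums& t) {
    return std::max(t.t_generate_total, t.t_update_total / (t.np - 1));
}

// Scenario 3: all communication time removed from the measured runtime.
double without_comm(double runtime, const TraceSums& t) { return runtime - t.t_comm_total; }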

Figure 4.4: Performance of the two distributed memory algorithms compared with the unblocked sequential.

Figure 4.5: Scaled speedup S0 for normalized problem sizes W = 1 : 8.

(a) Trace of the Root-Scatter implementation.

(b) Trace of the DistBlocks implementation.

Figure 4.6: Trace of the two distributed algorithms, running on 4 processes with n = 4000, q = 16, r = 12. Iterations 1 to 4 of a total of 250 iterations are illustrated. The time for broadcasting the transformations Q is too short to be visible in the figures.

                             Root-Scatter    DistBlocks
Actual runtime [s]           82.93           86.87
tload [s]                    49.47           50.61
tload and tgen [s]           44.88           39.75
tload and tcomm [s]          18              26.39
tgen [s]                     78.34           76.01
tgen and tcomm [s]           46.86           51.79
tcomm [s]                    51.45           62.65
tload, tgen and tcomm [s]    13.4            15.52

Table 4.2: Combinations of different types of ideal scenarios, running on 4 processes with n = 4000, q = 16, r = 12.

Implementation    tG [s]    tR/p [s]    tL/p [s]    tG / (tG + tR/p + tL/p)
Root-Scatter      4.59      13.75       12.55       0.148
DistBlocks        10.86     16.75       14.98       0.255

Table 4.3: Sum of time spent in each computation step (ideally), running on 4 processes with n = 4000, q = 16, r = 12.

Chapter 5

Conclusions

Recall the questions stated in Section 1.5:

1. Is it possible to adapt or redesign the second stage of the two-stage Hessenberg reduc- tion algorithm for a DM while preserving efficiency?

2. How does the DM implementation compare with the existing SM implementation?

3. How scalable is the DM implementation?

4. Which are the main factors that ultimately limit the performance and scalability of the DM implementation?

The results in Chapter 4 are used to answer these questions.

For question 1, the results in this Thesis are insufficient to answer it. Two adaptations have indeed been made, but their efficiency is too low. There is a good chance that such an adaptation exists, but neither of the adaptations made in this Thesis is one of them.

Compared with the SM implementation, the DM algorithms do not perform well. This answers question 2.

For question 3, the results have shown that the scalability of the DM adaptations is poor; a scaled speedup of 3 on an 8 times larger problem is not considered scalable.

The main factors that limited the performance of the DM adaptations are not communication, but the uneven load and the sequential generate step. If adaptive workload and an overlapping generate step are implemented, as in the SM adaptation, the performance can be improved. This answers question 4. An attempt to implement adaptive workload was made in this project, but because of insufficient time it could not be completed.

A solution with better performance and scalability has not been realized, but much knowledge has been gained in the process. One important lesson is that the DM adaptations can be made faster, but there was not enough time for this. A DM adaptation of the two-staged algorithm should not be disregarded: for large-scale problems, developing an efficient DM adaptation is worthwhile.

5.1 Future work

There are many areas to further explore in the DM adaptation of a Hessenberg reduction. Future work related to this Thesis could include the following (ordered by importance):


1. Develop a solution for a DM that applies the improvements specified for the SM adaptation in [12]. The results from this work show tendencies that the techniques used in [12] could be suitable for the DM case.

2. How can the communication be hidden? Many of the communication operations could be made non-blocking so that some computations could start earlier. The communication functions used in the distributed algorithms have poor scalability and should be replaced by modern implementations with logarithmic complexity.

3. The eigenvectors were not preserved in the DM adaptations by storing the transformations. Could this be done in an efficient way on a DM?

4. Can an adaptation be made for a heterogeneous architecture? Modern graphical processing units (GPUs) are very good at performing highly parallel tasks. Which tasks are candidates for being performed on a GPU?

5. The matrices considered in this Thesis are all real (R^(n×n)). A further investigation would be to see if there are differences in the algorithms when the matrix is complex (C^(n×n)). Do calculations on complex values require a considerably higher number of flops than in the real case?

Chapter 6

Acknowledgements

I would like to express my sincere gratitude to Dr. Lars Karlsson for supervision and support. Without Dr. Karlsson's guidance, this Thesis would never have been completed. I would also like to thank Prof. Bo Kågström for his help with arranging this project. He has made it possible for me to do my work from abroad. Dr. Pedher Johansson has spent time helping me with practical issues and should therefore be thanked. Last but not least, I would like to thank my family and my girlfriend Jennie for help and motivational support.

References

[1] A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing, 2nd Edition. Pearson Education Limited, Harlow, 2003.

[2] Netlib Repository at UTK and ORNL. Band storage. http://www.netlib.org/lapack/lug/node124.html, 1999. Accessed: 20130521.

[3] Netlib Repository at UTK and ORNL. The BLAS as the key to portability. http://www.netlib.org/lapack/lug/node65.html, 1999. Accessed: 20130521.

[4] Michael W. Berry, Jack J. Dongarra, and Youngbae Kim. A parallel algorithm for the reduction of a nonsymmetric matrix to block upper-Hessenberg form. Parallel Comput., 21(8):1189–1211, August 1995.

[5] Christian H. Bischof, Bruno Lang, and Xiaobai Sun. A framework for symmetric band reduction. ACM Trans. Math. Softw., 26(4):581–601, December 2000.

[6] Christian H. Bischof and Charles Van Loan. The WY representation for products of Householder matrices. In Parallel Processing for Scientific Computing, pages 2–13, 1985.

[7] Jack Dongarra and Robert Schreiber. Automatic blocking of nested loops. Technical report, Knoxville, TN, USA, 1990.

[8] G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd Edition. The Johns Hopkins University Press, Baltimore and London, 1996.

[9] Sven J. Hammarling, Danny C. Sorensen, and Jack J. Dongarra. Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math., 27:215–227, 1987.

[10] Alston S. Householder. Unitary triangularization of a nonsymmetric matrix. J. ACM, 5(4):339–342, October 1958.

[11] L. Karlsson and B. Kågström. Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures. Parallel Comput., 37(12):771–782, December 2011.

[12] Lars Karlsson and Bo Kågström. In PARA 2010: State of the Art in Scientific and Parallel Computing, Reykjavik, June 6–9, 2010.

[13] Roger S. Martin and J. H. Wilkinson. Similarity reduction of a general matrix to Hessenberg form. Numerische Mathematik, 12(5):349–368, 1968.


[14] The Open MPI Project. Open MPI documentation. http://www.open-mpi.org/doc/, 2013. Accessed: 20130521.

[15] R. Schreiber and C. Van Loan. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput., 10(1):53–57, January 1989.