The Pennsylvania State University

The Graduate School

Department of Computer Science and Engineering

PARALLEL BOUNDARY ELEMENT SOLUTIONS OF BLOCK

CIRCULANT LINEAR SYSTEMS FOR ACOUSTIC RADIATION

PROBLEMS WITH ROTATIONALLY SYMMETRIC

BOUNDARY SURFACES

A Thesis in

Computer Science and Engineering

by

Kenneth D. Czuprynski

© 2012 Kenneth D. Czuprynski

Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

May 2012

The thesis of Kenneth D. Czuprynski was reviewed and approved* by the following:

Suzanne M. Shontz, Assistant Professor of Computer Science and Engineering, Thesis Adviser

Jesse L. Barlow, Professor of Computer Science and Engineering

John B. Fahnline, Assistant Professor of Acoustics

Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering

*Signatures are on file in the Graduate School.

Abstract

Coupled finite element/boundary element (FE/BE) formulations are commonly used to solve structural-acoustic problems where a vibrating structure is idealized as being submerged in a fluid that extends to infinity in all directions. Typically in FE/BE formulations, the structural analysis is performed using the finite element method, and the acoustic analysis is performed using the boundary element method. In general, the problem is solved frequency by frequency, and the coefficient matrix for the boundary element analysis is fully populated, so little can be done to alleviate the storage and computational requirements. Because acoustic boundary element calculations require approximately six elements per wavelength to produce accurate solutions, the boundary element formulation is limited to relatively low frequencies. However, when the outer surface of the structure is rotationally symmetric, the system of linear equations becomes block circulant. We propose a parallel algorithm for distributed memory systems which takes advantage of the underlying concurrency of the inversion formula for block circulant matrices. By using the structure of the coefficient matrix in tandem with a distributed memory system setting, we show that the storage and computational requirements are substantially lessened.

Table of Contents

List of Figures

Chapter 1. Introduction
1.1 Acoustic Radiation Problems
1.2 Boundary Element Method
1.3 The Fourier Matrix and Fast Fourier Transform
1.4 Circulant Matrices

Chapter 2. Literature Review

Chapter 3. Problem Formulation
3.1 Coefficient Matrix Derivation
3.2 Block Circulant Inversion
3.3 Invertibility

Chapter 4. Parallel Solution Algorithm
4.1 Block DFT Algorithm
4.2 Block FFT Algorithm
4.3 System Solves
4.4 Parallel Algorithm

Chapter 5. Theoretical Timing Analysis
5.1 Parallel Linear System Solve
5.2 Block DFT using the DFT Algorithm
5.3 Block DFT Using the FFT Algorithm
5.4 Bounds

Chapter 6. Numerical Experiments
6.1 Experiment 1
6.2 Experiment 2
6.3 Numerical Results
6.3.1 Experiment 1
6.3.2 Experiment 2

Chapter 7. Conclusions

Appendix. BEM Code
A.1 STATIC MULTIPOLE ARRAYS
A.1.1 Sequential
A.1.1.1 General Case
A.1.1.2 Rotationally Symmetric
A.1.2 Parallel
A.1.2.1 General Case
A.1.2.2 Rotationally Symmetric
A.2 COEFF MATRIX
A.2.1 Sequential
A.2.1.1 General Case
A.2.1.2 Rotationally Symmetric
A.2.2 Parallel
A.2.2.1 General Case
A.2.2.2 Rotationally Symmetric
A.3 SOURCE AMPLITUDES MODES
A.3.1 Sequential
A.3.1.1 General Case
A.3.1.2 Rotationally Symmetric
A.3.2 Parallel
A.3.2.1 General Case
A.3.2.2 Rotationally Symmetric
A.4 SOURCE POWER
A.4.1 Sequential
A.4.1.1 General Case
A.4.1.2 Rotationally Symmetric
A.4.2 Parallel
A.4.2.1 General Case
A.4.2.2 Rotationally Symmetric
A.5 MODAL RESISTANCE
A.5.1 Sequential
A.5.1.1 General Case
A.5.2 Rotationally Symmetric
A.5.3 Parallel
A.5.3.1 General Case
A.5.3.2 Rotationally Symmetric

References

List of Figures

1.1 Radix 2 element interaction pattern obtained from [18].

3.1 A propeller with three times rotational symmetry [37].

3.2 A four times rotationally symmetric sketch of a propeller.

4.1 Initial data distribution assumed in the DFT computation for the case P = m = 4.

4.2 The DFT computation for the case P = m = 4. Each arrow indicates the communication of a processor's owned submatrix to a neighboring processor in the direction of the arrow.

4.3 Parallel block DFT data decomposition for P > m.

4.4 Parallel block DFT data decomposition and processor groupings for P > m.

4.5 Process illustrating the distributed FFT. Lines crossing to different processors indicate communication from left to right. Note the output is in reverse bit-reversed order relative to numbering starting at zero; that is, A1 is element 0, A2 is element 1, etc.

4.6 Processor grid creation for P = 16 and m = 4.

6.1 Runtime comparison using the DFT algorithm for varying P and N with m = 4.

6.2 Runtime comparison using the FFT algorithm for varying P and N with m = 4.

6.3 Speedups using the DFT algorithm for varying P and N with m = 4.

6.4 Speedups using the FFT algorithm for varying P and N with m = 4.

6.5 Efficiency using the DFT algorithm for varying N and P with m = 4.

6.6 Efficiency using the FFT algorithm for varying N and P with m = 4.

6.7 Runtime comparison using the DFT algorithm for varying P and N with m = 8.

6.8 Runtime comparison using the FFT algorithm for varying P and N with m = 8.

6.9 Speedup comparison using the DFT algorithm for varying P and N when m = 8.

6.10 Speedup comparison using the FFT algorithm for varying P and N when m = 8.

6.11 Efficiency comparison using the DFT algorithm for varying P and N when m = 8.

6.12 Efficiency comparison using the FFT algorithm for varying P and N when m = 8.

Chapter 1

Introduction

Coupled finite element/boundary element (FE/BE) formulations are commonly used to solve structural-acoustic problems where a vibrating structure is idealized as being submerged in a fluid that extends to infinity in all directions. Typically in FE/BE formulations, the structural analysis is performed using the finite element method, and the acoustic analysis is performed using the boundary element method (BEM). The boundary element formulation is advantageous for the acoustic radiation problem because only the outer surface of the structure in contact with the acoustic medium is discretized. This formulation also allows us to neglect meshing the infinite fluid exterior to the structure, as would be required if the finite element method were used instead.

Using the BEM, we compute the radiated sound field of a vibrating structure Ω ⊂ ℝ³. The main obstacle in computing the sound radiation is solving the linear system of equations to enforce the specified boundary conditions. In the context of the BEM, this requires the solution of a dense, complex linear system. In general, the problem is solved frequency by frequency, and the coefficient matrix for the boundary element analysis is fully populated and exhibits no exploitable structure. The size, N², of the coefficient matrix is directly correlated with the level of discretization, N, used for the surface in question. Because acoustic boundary element calculations require approximately six elements per wavelength to produce accurate solutions, the boundary element formulation is limited to relatively low frequencies. For high frequency problems, and for problems which involve large and/or complex surfaces, these matrices are large, dense, and unstructured; therefore, there is little which can be done to alleviate the storage and computational requirements. Iterative solvers have been investigated [4, 5, 28] and are a natural choice for large problems because the cost of direct solvers can become prohibitive. While the computational requirements can be lessened by iterative methods, the storage requirements can still present a problem. One obvious solution is to perform the solve in a distributed memory parallel setting. A distributed memory parallel algorithm distributes the workload and allows the storage of the matrix to be split between many individual systems with local memories, thereby increasing the total available memory. In addition, because linear systems are ubiquitous throughout scientific computation, libraries exist for their efficient parallel solution. In particular, because the matrix is dense, Scalable LAPACK (ScaLAPACK) [6] is a favored choice.

While in general these matrices exhibit no exploitable structure, when the boundary surface is rotationally symmetric, the coefficient matrix is block circulant. Circulant matrices are defined by each row being a circular shift of the row above it. One property of circulant matrices is that they are all diagonalizable by the Fourier matrix. Therefore, the Discrete or Fast Fourier Transform (D/FFT) can be used in the solution of the system. These results generalize to the block case and can be used in the solution of block circulant linear systems arising from acoustic radiation problems involving rotationally symmetric boundary surfaces. In addition, the inversion formula for block circulant matrices is highly amenable to parallel computation.

We propose an algorithm for distributed memory systems which takes advantage of the underlying concurrency of the inversion formula for block circulant matrices. By using the structure of the coefficient matrix in tandem with a distributed system setting, the storage and computational limitations are substantially lessened. Therefore, the algorithm allows larger and higher frequency acoustic radiation problems to be explored.

1.1 Acoustic Radiation Problems

The goal is to compute the radiated sound field due to a vibrating structure Ω ⊂ ℝ³ subject to given boundary conditions. The governing partial differential equation (PDE) for acoustic radiation problems is the Helmholtz equation, i.e.,

\[ \left( \nabla^2 - k^2 \right) u(p) = 0, \qquad p \in \Omega_+, \qquad (1.1) \]

where ∇² is the Laplacian, k = ω/c is the wave number, ω is the angular frequency, and c is the speed of sound in the chosen medium. Ω₊ = ℝ³ \ {Ω} denotes the region exterior to Ω. In structural acoustics problems, it is common for the velocity distribution over the boundary of Ω, denoted by ∂Ω, to be specified. This equates to the Neumann boundary condition

\[ \frac{\partial u(p)}{\partial n_p} = f(p), \qquad p \in \partial\Omega, \qquad (1.2) \]

where ∂/∂n_p denotes differentiation in the direction of the outward normal at p ∈ ∂Ω. In addition, to ensure all radiated waves are outgoing, the Sommerfeld radiation condition

\[ \lim_{r \to \infty} r \left( \frac{\partial u(p)}{\partial r} - i k\, u(p) \right) = 0 \qquad (1.3) \]

is enforced, where r is the distance of p from a fixed origin. Therefore, in order to solve for the radiated sound field due to Ω, a solution to the Helmholtz equation (1.1), subject to equations (1.2) and (1.3), must be found.

1.2 Boundary Element Method

The boundary element method is an algorithm for the numerical solution of PDEs which have an equivalent boundary integral representation. The BEM reformulates the PDE into an equivalent boundary integral equation (BIE), which is then solved numerically. The benefit of the formulation is that it reduces the problem to one over the boundary. However, because the BEM requires an equivalent BIE formulation, if the PDE cannot be represented as an equivalent BIE, the BEM cannot be used. The remainder of the section will outline the BEM within the context of an acoustic radiation problem.

Consider a vibrating structure Ω ⊂ ℝ³. The Helmholtz equation is the governing PDE for the radiated sound field produced by Ω and is given by (1.1). A standard boundary integral formulation of (1.1) yields the following equations

\[ \frac{1}{4\pi} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) = u(p), \qquad p \in \Omega_+ \qquad (1.4) \]

and

\[ \frac{1}{2\pi} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) = u(p), \qquad p \in \partial\Omega, \qquad (1.5) \]

where G(p, q) is the Green's function, which can loosely be thought of as the effect the point q has on the point p. In the context of an acoustic radiation problem, the Green's function corresponds to the fundamental solution of the Helmholtz equation and is given by G(p, q) = e^{ik|p−q|}/|p − q|, in which |p − q| denotes the Euclidean distance between the points p and q. A solution for u in the exterior domain with respect to the points on the boundary is provided by (1.4). Therefore, if the quantities u and ∂u/∂n_q are known over the boundary, the solution for the points in the exterior can be easily computed. In addition, (1.5) provides a means of solving for the aforementioned quantities. However, by applying the Fredholm alternative to (1.5), it is found that the solutions are not unique for all wave numbers k, and thus an alternative formulation is required [34]. Burton and Miller [9] showed how a unique solution can be derived. Differentiating (1.5) in the direction of the outward normal yields

\[ \frac{1}{2\pi} \frac{\partial}{\partial n_p} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) = \frac{\partial u(p)}{\partial n_p}, \qquad p \in \partial\Omega. \qquad (1.6) \]

Then constructing a linear combination of equations (1.5) and (1.6) using a purely imaginary coupling coefficient, β, produces a modified BIE formulation with a unique solution.

The formulation is given by

\[ \frac{1}{2\pi} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) \;+\; \beta\, \frac{1}{2\pi} \frac{\partial}{\partial n_p} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) = u(p) + \beta \frac{\partial u(p)}{\partial n_p}. \qquad (1.7) \]

Assuming a Neumann boundary condition, (1.7) can be rearranged as follows:

\[ \int_{\partial\Omega} u(q) \left( \beta \frac{\partial^2 G(p,q)}{\partial n_q \partial n_p} + \frac{\partial G(p,q)}{\partial n_q} \right) d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} \frac{\partial u(q)}{\partial n_q} \left( G(p,q) + \beta \frac{\partial G(p,q)}{\partial n_p} \right) d(\partial\Omega) + 2\pi\beta \frac{\partial u(p)}{\partial n_p}, \qquad p \in \partial\Omega. \qquad (1.8) \]

Note, in the case of a Dirichlet boundary condition, the normal derivative ∂u/∂n can be solved for by rearranging (1.8). Once u(p) has been solved for over the boundary, the solution for all points in the exterior can be obtained. Therefore, a means for numerically solving equation (1.8) must be devised. For notational convenience, let v(q) = ∂u(q)/∂n_q, and redefine portions of both integrands as

\[ T(p,q) = \beta \frac{\partial^2 G(p,q)}{\partial n_q \partial n_p} + \frac{\partial G(p,q)}{\partial n_q} \qquad (1.9) \]

and

\[ H(p,q) = G(p,q) + \beta \frac{\partial G(p,q)}{\partial n_p}. \qquad (1.10) \]

Equation (1.8) becomes

\[ \int_{\partial\Omega} u(q)\, T(p,q)\, d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} v(q)\, H(p,q)\, d(\partial\Omega) + 2\pi\beta\, v(p), \qquad p \in \partial\Omega. \qquad (1.11) \]

The next step in the BEM is to discretize the boundary surface, ∂Ω, into smaller quadrilateral or triangular surface elements. After the discretization, the boundary can be represented as ∂Ω = ∂Ω₁ ∪ ∂Ω₂ ∪ · · · ∪ ∂Ω_N, where ∂Ω_i represents the i-th surface element in the discretization of ∂Ω and ∂Ω_i ∩ ∂Ω_j = ∅ for i ≠ j. Equation (1.11) can then be represented as

\[ \sum_{i=1}^{N} \left[ \int_{\partial\Omega_i} u(q)\, T(p,q)\, d(\partial\Omega_i) \right] - 2\pi u(p) = \sum_{i=1}^{N} \left[ \int_{\partial\Omega_i} v(q)\, H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v(p), \qquad p \in \partial\Omega. \qquad (1.12) \]

The most straightforward approach to numerically solving equation (1.12) is to assume u(p) and v(p) are constant along each surface element, ∂Ω_i, i = 1, . . . , N. Therefore, let u(p) ≈ u_j and v(p) ≈ v_j for p ∈ ∂Ω_j, j = 1, . . . , N. Under this assumption, equation (1.12) can be decomposed into N equations, i.e., one equation for each surface element; that is,

\[ \sum_{i=1}^{N} u_i \left[ \int_{\partial\Omega_i} T(p,q)\, d(\partial\Omega_i) \right] - 2\pi u_j = \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_j, \qquad p \in \partial\Omega_j. \qquad (1.13) \]

Equation (1.13) yields a solution for the j-th surface element of the boundary. The boundary is constructed of N surface elements; therefore, there are N equations and N unknowns total. Using this, equation (1.13) can more concisely be expressed in matrix notation. Let

\[ M = \begin{bmatrix} \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \\ \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \\ \vdots & \vdots & \ddots & \vdots \\ \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \end{bmatrix}, \]

where the integrals in row j are evaluated at the collocation point p ∈ ∂Ω_j. Similarly, let the column vector b represent the right-hand side; that is,

\[ b = \begin{bmatrix} \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_1 \\ \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_2 \\ \vdots \\ \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_N \end{bmatrix}. \]

With a Neumann boundary condition, each v_i, i = 1, . . . , N, is known, and the integrals can be computed via numerical quadrature. Therefore, the matrix M and the vector b are known quantities. Using the new quantities, the linear system

\[ (M - 2\pi I)\, u = b, \qquad p \in \partial\Omega, \qquad (1.14) \]

can be used to solve for the approximation of u over the boundary. Once we have an approximate solution for u over the surface, (1.4) can be used to solve for u in the exterior.
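To make the constant-element structure of (1.13) and (1.14) concrete, the following is a minimal Python sketch, not the thesis code (the production routines are listed in the Appendix). The element-integration routines `integrate_T` and `integrate_H` are hypothetical placeholders standing in for the numerical quadrature mentioned above.

```python
import numpy as np

def assemble_and_solve(N, v, integrate_T, integrate_H, beta):
    """Constant-element collocation sketch of (1.13)-(1.14).

    v            : known normal velocities v_1, ..., v_N (Neumann data)
    integrate_T  : hypothetical callable (j, i) -> integral of T(p_j, q) over element i
    integrate_H  : hypothetical callable (j, i) -> integral of H(p_j, q) over element i
    beta         : purely imaginary coupling coefficient
    """
    M = np.array([[integrate_T(j, i) for i in range(N)] for j in range(N)],
                 dtype=complex)
    H = np.array([[integrate_H(j, i) for i in range(N)] for j in range(N)],
                 dtype=complex)
    b = H @ v + 2.0 * np.pi * beta * v        # right-hand side of (1.13)
    A = M - 2.0 * np.pi * np.eye(N)           # coefficient matrix of (1.14)
    return np.linalg.solve(A, b)              # surface approximation u_1, ..., u_N
```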

It is difficult to precisely enforce the boundary conditions for the surface velocity at edges and corners when the basis functions are constructed using surface distributions of simple and dipole sources, as they are in Burton and Miller’s standard implementation.

To avoid this difficulty, it is possible to rewrite the solution in terms of surface-averaged quantities instead, which is common in acoustics. For example, surface-averaged pressures and volume velocities are commonly used in lumped parameter representations of transducers. Since the goal is no longer to match the boundary conditions on a point-by-point basis, it becomes permissible to simplify the solution by constructing the basis functions from discrete sources rather than distributions of sources. Using surface-averaged pressures and volume velocities as variables can also be shown to produce a solution that converges with mesh density, unlike the standard formulation, which can produce a less accurate solution as the mesh is refined. The solution is then derived in terms of source amplitudes rather than physical quantities, such as pressure or velocity.

For this type of indirect solution, an approach similar to Burton and Miller's can be used to prevent nonexistence/nonuniqueness difficulties. A hybrid "tripole" source type is created from a simple and a dipole source with a complex-valued coupling coefficient, as is discussed by Hwang and Chang [19]. The numerical implementation discussed in this thesis is based on an indirect solution using tripole sources, but the basic formulation shares many characteristics with the standard Burton and Miller approach discussed previously.

1.3 The Fourier Matrix and Fast Fourier Transform

The Fourier matrix is given by

\[ F = \frac{1}{\sqrt{n}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega_n^1 & \omega_n^2 & \cdots & \omega_n^{n-1} \\ 1 & \omega_n^2 & \omega_n^4 & \cdots & \omega_n^{2(n-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega_n^{n-1} & \omega_n^{2(n-1)} & \cdots & \omega_n^{(n-1)(n-1)} \end{bmatrix}, \qquad (1.15) \]

where ω_n = e^{i2π/n}, i = √−1, and normalizing by 1/√n makes F unitary. The discrete Fourier transform (DFT) is defined as a matrix vector multiplication involving the Fourier matrix. That is,

\[ y = F x. \qquad (1.16) \]

The vector y is called the DFT of x. Similarly, the inverse discrete Fourier transform (IDFT) of x is given by

\[ y = F^{-1} x. \qquad (1.17) \]

However, because F has been defined to be unitary, (1.17) becomes

\[ y = F^* x. \qquad (1.18) \]

The Fourier matrix is highly structured, and this structure can be used to compute the DFT. The improved method of computing the DFT is called the Fast Fourier transform (FFT) and was first introduced by Cooley and Tukey [12]. It was shown that for vectors with n = 2^h elements, h ∈ ℤ⁺, the DFT can be computed in O(n log n). Over the years, the method has been extended to handle vectors with an arbitrary number of elements; a comprehensive overview of these can be found in [11, 26]. This thesis uses the Cooley and Tukey version of the algorithm, also now termed the radix-2 FFT. We thus now overview the radix-2 algorithm.

Assuming the first column and first row are indexed by 0, consider the element in the k-th row and the j-th column of the Fourier matrix, which is given by ω_n^{kj} = e^{i2πkj/n}. Note then that each element is periodic in n. This can readily be seen by using Euler's formula. Applying Euler's formula, we have

\[ \omega_n^{kj} = \cos\!\left( 2\pi \frac{kj}{n} \right) + i \sin\!\left( 2\pi \frac{kj}{n} \right). \qquad (1.19) \]

Because sin and cos both have period 2π, by (1.19), if kj ≥ n, the elements begin to repeat. It follows that each element in the Fourier matrix can be represented by ω_n^k for k = 0, . . . , n − 1. For example, consider the four-by-four Fourier matrix

\[ F = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ 1 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{bmatrix}. \qquad (1.20) \]

By the periodicity of the elements, (1.20) becomes

\[ F = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & 1 & \omega_4^2 \\ 1 & \omega_4^3 & \omega_4^2 & \omega_4^1 \end{bmatrix}. \qquad (1.21) \]

The FFT algorithm uses properties of ω coupled with a divide and conquer strategy.

The following derivation relies heavily on [11]; we follow their derivation closely.

Recall that n = 2^h for h ∈ ℤ⁺, and consider the operation y = F x. Expanding the matrix vector product gives

\[ y_k = \sum_{j=0}^{n-1} x_j \omega_n^{jk}, \qquad k = 0, \ldots, n-1. \qquad (1.22) \]

Equation (1.22) can be split into two summations: one containing all of the even terms, and one containing all of the odd terms, i.e.,

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_n^{2jk} + \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_n^{(2j+1)k}, \qquad k = 0, \ldots, n-1. \qquad (1.23) \]

A ω_n^k term in the second summation can be pulled out of the summation, i.e.,

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_n^{2jk} + \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_n^{2jk}, \qquad k = 0, \ldots, n-1. \qquad (1.24) \]

Using the fact that ω_n² = ω_{n/2}, (1.24) becomes

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} + \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, n-1. \qquad (1.25) \]

The next observation to make is that ω_{n/2}^{(k+n/2)j} = ω_{n/2}^{kj} for k = 0, . . . , n/2 − 1. That is, because ω_{n/2} has a smaller period, the elements begin to repeat sooner, and k, in turn, need not go beyond n/2 − 1. Therefore, (1.25) becomes

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} + \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \qquad (1.26) \]

Looking more closely, each summation represents a DFT of length n/2. Therefore, a DFT of length n can be broken into two DFTs, each half the size of the previous DFT. However, (1.26) contains only the first n/2 terms of y. Computing the remaining terms yields

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{j(k+n/2)} + \omega_n^{k+n/2} \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{j(k+n/2)}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \qquad (1.27) \]

We then obtain

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} \omega_{n/2}^{jn/2} + \omega_n^{k+n/2} \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk} \omega_{n/2}^{jn/2}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \qquad (1.28) \]

Because ω_{n/2}^{jn/2} = 1 and ω_n^{k+n/2} = −ω_n^k, (1.28) becomes

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} - \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \qquad (1.29) \]

Therefore, the entire vector y can be obtained by

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} + \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \tfrac{n}{2}-1, \qquad (1.30) \]
\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} - \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \]

Let s_j = x_{2j} and t_j = x_{2j+1} for j = 0, . . . , n/2 − 1; that is, s is the vector containing all the even elements of x, and t is the vector containing all of its odd elements. Then (1.30) may be written as

\[ [F_n x]_k = \left[ F_{n/2}\, s \right]_k + \omega_n^k \left[ F_{n/2}\, t \right]_k, \qquad k = 0, \ldots, \tfrac{n}{2}-1, \qquad (1.31) \]
\[ [F_n x]_{k+n/2} = \left[ F_{n/2}\, s \right]_k - \omega_n^k \left[ F_{n/2}\, t \right]_k, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \]

From (1.31), the recursive nature of the algorithm should be clear. The DFT of a vector can be split into two DFTs of half the size. We can proceed in computing F_{n/2} s and F_{n/2} t as if it were the first time, and proceed as above. Algorithm 1.1 gives pseudocode for the algorithm.

Algorithm 1.1 Radix-2 FFT pseudocode.
1: Y = Radix-2FFT(X, n)
2: if n == 1 then
3:   return X;
4: else
5:   s = Radix-2FFT(Even(X), n/2);
6:   t = Radix-2FFT(Odd(X), n/2);
7:   for k = 0 to n/2 − 1 do
8:     Y_k = s_k + ω_n^k t_k;
9:     Y_{k+n/2} = s_k − ω_n^k t_k;
10:   end for
11: end if
12: return Y;
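For readers who prefer executable code, the following is a small Python transcription of Algorithm 1.1; it is an illustrative sketch rather than the thesis implementation. It uses the root ω_n = e^{i2π/n} of (1.15) but omits the 1/√n normalization, so its output equals n times numpy's inverse FFT.

```python
import numpy as np

def radix2_fft(x):
    """Recursive radix-2 FFT following Algorithm 1.1.

    Uses omega_n = exp(i*2*pi/n) as in (1.15) with no 1/sqrt(n) scaling,
    so the result equals n * numpy.fft.ifft(x).  len(x) must be a power of 2.
    """
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    s = radix2_fft(x[0::2])                 # DFT of the even-indexed elements
    t = radix2_fft(x[1::2])                 # DFT of the odd-indexed elements
    k = np.arange(n // 2)
    w = np.exp(2j * np.pi * k / n)          # twiddle factors omega_n**k
    return np.concatenate([s + w * t, s - w * t])   # the two halves of (1.30)/(1.31)

if __name__ == "__main__":
    x = np.random.default_rng(0).standard_normal(16)
    print(np.allclose(radix2_fft(x), 16 * np.fft.ifft(x)))   # expected: True
```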

Algorithm 1.1 follows nicely from the derived mathematics; however, the recursion can be unrolled into an iterative format which will later facilitate the explanation of our parallel algorithm. The algorithm can be found in [24], and our explanation follows their discussion closely.

Algorithm 1.2 Iterative Radix-2 FFT pseudocode as presented in [24].
1: Y = Radix-2FFT(X, Y, n)
2: r = log n;
3: R = X;
4: for m = 0 to r − 1 do
5:   S = R;
6:   for i = 0 to n − 1 do
7:     // Let (b_0 b_1 . . . b_{r−1}) be the binary representation of i
8:     j = (b_0 . . . b_{m−1} 0 b_{m+1} . . . b_{r−1});
9:     k = (b_0 . . . b_{m−1} 1 b_{m+1} . . . b_{r−1});
10:     r = (b_m b_{m−1} . . . b_0 0 . . . 0);
11:     R_i = S_j + S_k ω_n^r;
12:   end for
13: end for
14: Y = R;

Algorithm 1.2 is the iterative version of Algorithm 1.1. Each iteration of the outer loop (line 4) represents one level of the recursion, starting with the deepest level. At each level of recursion, the output vector is updated by two entries of the given input vector and a multiple of the factor ω (lines 8 and 9 of Algorithm 1.1 and line 11 of Algorithm 1.2). Algorithm 1.1 uses the input to the function at each level of recursion to update the output vector, whereas Algorithm 1.2 uses binary representations of the index being modified.

The most relevant property to notice, with respect to the parallel algorithm, is the pattern of interaction between different elements of the input vector. Figure 1.1 shows which elements in the input vector, denoted x, are used in computing each element of the output vector, denoted X, for a vector of length n = 16.

Fig. 1.1 Radix 2 element interaction pattern obtained from [18].

In order to solidify this notion and to clarify the meaning behind Figure 1.1, consider the transformation of x(0). The elements of the initial input vector involved in the transformation of x(0) are: x(0), x(8), followed by modified versions of x(4), x(2), and x(1). Similarly, each element of the input vector in the diagram can be traced to see the elements of the initial vector involved in each computation.

A final note about FFTs is the ordering of the output. When the algorithm is run in place, such that it overwrites the array containing the initial data, the output is in bit-reversed order. This can be seen in Figure 1.1. For another example, let n = 8, and consider the computation x = F₈x, where the vector x is overwritten. This yields

\[ \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} \longmapsto \begin{bmatrix} x_0 \\ x_4 \\ x_2 \\ x_6 \\ x_1 \\ x_5 \\ x_3 \\ x_7 \end{bmatrix}. \]

The indices are converted to binary, and the bit string is reversed before being converted back into decimal. In the above example, consider the indices one and four, i.e., (1)₁₀ = (001)₂, and flipping the bit string yields (100)₂ = (4)₁₀. This means that data migrates to bit-reversed order when the FFT is done in place.
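A short Python illustration of this reordering (again an illustration, not thesis code): reversing the h-bit binary representation of each index reproduces the permutation shown above for n = 8.

```python
def bit_reverse(i, h):
    """Reverse the h-bit binary representation of index i."""
    out = 0
    for _ in range(h):
        out = (out << 1) | (i & 1)   # shift the result left, append the lowest bit of i
        i >>= 1
    return out

n, h = 8, 3
print([bit_reverse(i, h) for i in range(n)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```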

1.4 Circulant Matrices

Circulant matrices are a subset of Toeplitz matrices which have the added property that each row is a circular shift of the previous row. The matrix C is circulant if it has the form

\[ C = \begin{bmatrix} c_1 & c_2 & c_3 & \cdots & c_n \\ c_n & c_1 & c_2 & \cdots & c_{n-1} \\ c_{n-1} & c_n & c_1 & \cdots & c_{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_2 & c_3 & c_4 & \cdots & c_1 \end{bmatrix}. \]

Matrices of this form can be uniquely represented by their first row and will be denoted by C = circ(c_1, c_2, . . . , c_n).

A thorough treatment of circulant matrices is given in [13]. The important property of circulant matrices that is used heavily throughout this thesis concerns the eigenvalues and eigenvectors of circulant matrices. Let v = [c_1 c_2 c_3 . . . c_n]^T be the column vector constructed from the first row of a circulant matrix C. Then the eigenvalues of C are given by

\[ \lambda = F v, \qquad (1.32) \]

where F is the unitary Fourier matrix [13]. That is, the discrete Fourier transform (DFT) of the first row of C yields the eigenvalues of C. Further, the eigenvectors of a circulant matrix C are given by the columns of the Fourier matrix of appropriate dimension. Thus, C has the eigenvalue decomposition

\[ C = F^* D F, \qquad (1.33) \]

where F is again the Fourier matrix, and D is the diagonal matrix whose elements are the eigenvalues of C, i.e., D = diag(λ). This means that every circulant matrix of the same dimension has the same eigenvectors, and that the matrix C is given by

\[ C = F^* \operatorname{diag}(\lambda)\, F. \qquad (1.34) \]

With this decomposition, a formulation for the inversion of C can easily be obtained. The inverse of C is then given by

\[ C^{-1} = F \operatorname{diag}(\lambda)^{-1} F^*. \qquad (1.35) \]

This formulation can then be used to solve a linear system. Consider the linear system

\[ C x = b. \qquad (1.36) \]

Left multiplication by C^{-1} yields

\[ x = C^{-1} b. \qquad (1.37) \]

Now, substituting for the definition of C^{-1} given by (1.35) yields

\[ x = F \operatorname{diag}(\lambda)^{-1} F^* b. \qquad (1.38) \]

Rearranging gives

\[ \operatorname{diag}(\lambda)\, F^* x = F^* b. \qquad (1.39) \]

Let x̃ = F^* x and b̃ = F^* b; then (1.39) becomes

\[ \operatorname{diag}(\lambda)\, \tilde{x} = \tilde{b}, \qquad (1.40) \]

whose solution is trivial. Therefore, the solution of a linear system equates to computing three DFTs and a backsolve involving a diagonal matrix. The steps are:

1. Compute λ = Fv.

2. Compute b̃ = F*b.

3. Solve diag(λ)x̃ = b̃.

4. Compute x = Fx̃.

This formulation is advantageous because the most expensive operation needed is the computation of the DFT, which, in its crudest form, is a matrix vector multiplication and is thus O(n²). However, if permissible, the fast Fourier transform (FFT) can be used in place of the DFT, and the computation becomes O(n log n).
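As a concrete illustration of these four steps, the following numpy sketch (not the thesis code) solves a circulant system. numpy's fft/ifft pair absorbs the unitary 1/√n factors and the conjugation, so the conventions differ slightly from the text, but the computed solution is the same.

```python
import numpy as np

def circulant_from_first_row(c):
    """Build C with first row c, each row a circular right shift of the one above."""
    n = len(c)
    return np.array([np.roll(c, k) for k in range(n)])

def solve_circulant(c, b):
    """Solve C x = b via the DFT, following steps 1-4 of Section 1.4."""
    first_col = np.roll(np.asarray(c)[::-1], 1)   # first column of C: c_1, c_n, ..., c_2
    lam = np.fft.fft(first_col)                   # eigenvalues of C
    b_tilde = np.fft.fft(b)                       # transformed right-hand side
    x_tilde = b_tilde / lam                       # diagonal solve
    return np.fft.ifft(x_tilde)                   # transform back

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c, b = rng.standard_normal(8), rng.standard_normal(8)
    C = circulant_from_first_row(c)
    print(np.allclose(C @ solve_circulant(c, b), b))   # expected: True
```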

Chapter 2

Literature Review

Circulant matrices are a desirable structure in computation because of their relation to the Fast Fourier transform (FFT). Therefore, many variations of circulant matrices have appeared throughout the literature and in a wide variety of contexts.

These range from the solution of circulant tridiagonal and banded systems [32, 16, 15] to effective preconditioners [25], all of which exploit the computational relation to the FFT.

We are concerned with the solution of linear systems involving block circulant matrices and assume the blocks in the matrix themselves are dense and contain no additional structure. The desirable properties extend to the block case as well; namely, block circulant matrices are block diagonalizable by the block Fourier matrix. The generalization to the block case, however, means that the inversion/solution formula must be extended. We first note that every block circulant matrix (BCM) can be mapped to an equivalent block matrix with circulant blocks (CBM). This can be accomplished by multiplying by the appropriate permutation matrices. Therefore, algorithms for solving

BCMs and CBMs are equivalent.

Within engineering, when problems with periodicity properties are considered, block circulant matrices arise in many contexts. These usually result when such periodic problems are solved by means of integral equations, which includes the BEM. Using the method of fundamental solutions [17], block circulant matrices in the contexts of axisymmetric problems in potential theory [21], as well as axisymmetric harmonic and biharmonic [38], linear elasticity [23, 22], and heat conduction problems [36] have been investigated. In addition, scattering and radiation problems in electromagnetics have taken advantage of block circulant matrices for a variety of integral equation techniques

[33, 30, 14, 20] including the BEM [40]. With respect to acoustics, a National Physical

Laboratory tech report discussed some properties of rotationally symmetric problems for the BEM as applied to the Helmholtz equation [42].

Just as circulant matrices are a subset of Toeplitz matrices, block circulant matrices are a subset of block Toeplitz matrices. Therefore, it is not surprising that one of the first inversion algorithms applied to block circulant matrices was an inversion algorithm for block Toeplitz matrices [2]. Closed form solutions for the inversion of block circulant matrices were formalized in [27] and presented again more concisely in [41]. The sequential inversion formula shows that a BCM, A, has the decomposition A = F_b^* D F_b, in which F_b represents the block Fourier matrix, and D represents a block diagonal matrix. The blocks along the diagonal are obtained by computing the block DFT of the first block row of A; this means that if v is defined to be the first block row of A, then D = diag{F_b v}. The inversion is then given by A^{-1} = F_b (diag{F_b v})^{-1} F_b^*, and only the blocks of the block diagonal matrix are inverted. Extending the closed form inversion formulations, an algorithm for solving a block circulant linear system was developed alongside many variants of circulant linear systems [10]. The solution of the linear system involving BCMs resulted from a straightforward application of the inversion formula. Following these efforts, [31] proposed an algorithm for the solution of CBMs. The most recent contribution to CBMs was given in [39]. The algorithm first diagonalizes each block of the matrix by the Fourier relation. The matrix is then a block matrix with diagonal blocks. The algorithm decomposes the matrix into a two-by-two block matrix and successively applies this decomposition to the first principal submatrix until a diagonal matrix is reached. The diagonal matrix is inverted, and the Schur complement formulation for the inverse of a two-by-two block matrix is successively used to compute the inversion of the entire matrix. All inversion/solution formulas of consequence use the spectral properties of circulant matrices. This is exploited in all aforementioned sequential inversion/solution algorithms.

While sequential solution algorithms have been fully developed, little work has been done on parallel algorithms for block circulant linear systems. A parallel solution for block Toeplitz matrices exists and parallelizes the generalized Schur algorithm [3]. Yet, using a Toeplitz solver neglects the use of the FFT and the potential concurrent calculations found in the BCM inversion formula. In fact, the only work we are aware of is a parallel solver for electromagnetic problems which considers the axisymmetric case [29]. The proposed parallel algorithm was for distributed memory systems and parallelized the inversion formulation for BCMs. The assumptions of that work differ from our own; that is, they assume a larger number of blocks of smaller order and, in turn, assume that the number of processors is some fraction of the number of blocks in the matrix. This means each processor contains multiple blocks, denoted q, of the BCM. For each block owned by a processor, the corresponding right-hand side also resides on that processor. This means that when solving the block diagonal matrix, each processor can perform the solve of its q blocks simultaneously. However, when solving the linear system, multiplications by the Fourier matrix are needed. These are needed in order to obtain the block diagonal matrix, modify the right-hand side vector, and modify the solution vector. This distribution means that multiplying by the Fourier matrix requires communication among the processors. Using the fact that block Fourier transforms can be decomposed into independent Fourier transforms, the algorithm performs an all-to-all communication to give each processor the data needed to compute an independent FFT. They tested the algorithm for BCMs with m = 256 blocks of order n = 318, m = 128 blocks of order n = 189, and m = 64 blocks of order n = 93. This is where our assumptions diverge significantly, and as a result our algorithm differs significantly in implementation of the same inversion formula.

Chapter 3

Problem Formulation

Consider a rotationally symmetric vibrating structure, Ω ⊂ ℝ³. The rotational symmetry implies Ω can be constructed by rotations of a single element around a fixed axis. Define Ω′ to be a structure in ℝ³, and let Ω′_θ represent the structure obtained by rotating Ω′ by angle θ. Then, supposing Ω has m rotational symmetries, Ω can be written as Ω = Ω′_0 ∪ Ω′_{2π/m} ∪ Ω′_{4π/m} ∪ · · · ∪ Ω′_{(m−1)2π/m}; that is,

\[ \Omega = \bigcup_{k=0}^{m-1} \Omega'_{\frac{2\pi k}{m}}. \qquad (3.1) \]

For example, for m = 4 the structure Ω can be written as

\[ \Omega = \Omega'_0 \cup \Omega'_{\pi/2} \cup \Omega'_{\pi} \cup \Omega'_{3\pi/2}. \qquad (3.2) \]

Note, the angle θ is relative to an initial orientation of the structure. This means that the structure being rotated can have any initial orientation; as long as the rotation is around a fixed axis and the rotation angle is uniform, the constructed structure is rotationally symmetric. Figure 3.1 shows a real-world example of a structure containing three rotational symmetries. 26

Fig. 3.1 A propeller with three times rotational symmetry [37].

3.1 Coefficient Matrix Derivation

Before beginning the algebraic derivation, we first present the underlying intuition. Figure 3.2 shows a sketch of a propeller with four times rotational symmetry. Consider the effect Ω′_0 has on Ω′_{π/2}, as well as the effect Ω′_{π/2} has on Ω′_π. Because the blades are identical and dist(Ω′_0, Ω′_{π/2}) = dist(Ω′_{π/2}, Ω′_π), the entries in the coefficient matrix which describe the effect of Ω′_0 on Ω′_{π/2} and of Ω′_{π/2} on Ω′_π will be identical. This continues for the remaining interactions of this form; therefore, the entries of the coefficient matrix due to the effect of Ω′_0 on Ω′_{π/2}, Ω′_{π/2} on Ω′_π, Ω′_π on Ω′_{3π/2}, and Ω′_{3π/2} on Ω′_0 will be identical. This same idea is used for all of the remaining interactions to finish populating the coefficient matrix. The equality between interactions due to symmetry is what leads to the block circulant structure of the coefficient matrix.

Fig. 3.2 A four times rotationally symmetric sketch of a propeller.

This decomposition of the initial structure in ℝ³ into the union of rotated structures gives insight into the structure of the coefficient matrix. Recall, in the derivation of the BEM, the solution over the boundary of the structure must first be solved in order to obtain the solution in the exterior domain. Consider only the base element Ω′ = Ω′_0 before any rotations. For clarity, we suppose m = 2 and use the standard boundary integral formulations given by (1.4) and (1.5). The integral formulations which promise uniqueness follow in the same manner. Assuming a Neumann boundary condition and rearranging into knowns and unknowns, the equation over the boundary of Ω′_0 is given by

\[ \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_0) - 2\pi u(p) = \int_{\partial\Omega'_0} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega'_0), \qquad p \in \partial\Omega'_0. \qquad (3.3) \]

Next, consider the solution of u over the boundary element ∂Ω′_{π/2}; that is, the boundary surface obtained by rotating the base element Ω′_0 by 90 degrees. This yields the following boundary integral formulation

\[ \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_{\pi/2}) - 2\pi u(p) = \int_{\partial\Omega'_{\pi/2}} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega'_{\pi/2}), \qquad p \in \partial\Omega'_{\pi/2}. \qquad (3.4) \]

As stand-alone structures, Ω′_0 and Ω′_{π/2} are identical aside from their orientation. The boundaries, ∂Ω′_0 and ∂Ω′_{π/2}, are unaffected by rotations and are therefore identical. Equations (3.3) and (3.4) involve only points on the boundary and, therefore, assuming the Neumann conditions are identical for both equations, equality holds. Note, by the uniqueness, and the equality for identical right-hand sides, it follows that the left-hand sides must be identical.

Intuitively, (3.3) shows the relation between a point p on ∂Ω′_0 and all the points q on ∂Ω′_0. If a point p is chosen on ∂Ω′_0, all of the points on ∂Ω′_0 contribute to the value of u at that point. In this sense, an N-body problem is being solved. Similarly, if a point p is chosen on ∂Ω′_{π/2}, all of the points on ∂Ω′_{π/2} contribute to the value of u at that point; however, ∂Ω′_0 and ∂Ω′_{π/2} are identical. Therefore, under identical boundary conditions, the same N-body problem is being solved.

Now, consider the solution of u over the boundary of the structure obtained by combining the two aforementioned structures, Ω′_0 and Ω′_{π/2}. The boundary is then given by ∂Ω = ∂Ω′_0 ∪ ∂Ω′_{π/2} and the integral equation is

\[ \int_{\partial\Omega} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega), \qquad p \in \partial\Omega. \qquad (3.5) \]

Using the rotational symmetries, equation (3.5) becomes

\[ \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_0) + \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_{\pi/2}) - 2\pi u(p) = \int_{\partial\Omega'_0} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega'_0) + \int_{\partial\Omega'_{\pi/2}} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega'_{\pi/2}), \qquad p \in \partial\Omega'_0 \cup \partial\Omega'_{\pi/2}. \qquad (3.6) \]

Redefine v₁(p) = u(p) for p ∈ ∂Ω′_0 and v₂(p) = u(p) for p ∈ ∂Ω′_{π/2}. In addition, define

\[ \Gamma_0[v_1] = \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} v_1(q)\, d(\partial\Omega'_0), \qquad \Gamma_{\pi/2}[v_2] = \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} v_2(q)\, d(\partial\Omega'_{\pi/2}), \]
\[ \Sigma_0 = \int_{\partial\Omega'_0} G(p,q) \frac{\partial v_1(q)}{\partial n_q}\, d(\partial\Omega'_0), \qquad \text{and} \qquad \Sigma_{\pi/2} = \int_{\partial\Omega'_{\pi/2}} G(p,q) \frac{\partial v_2(q)}{\partial n_q}\, d(\partial\Omega'_{\pi/2}). \]

Note, the variables v₁(p) and v₂(p) are unknowns, and, therefore, Γ₀[v₁] and Γ_{π/2}[v₂] are defined as operators; whereas Σ₀ and Σ_{π/2} are known quantities and are treated as known values. Using the newly-defined quantities, (3.6) can be split into two simultaneous equations over ∂Ω′_0 and ∂Ω′_{π/2}:

\[ \Gamma_0[v_1] + \Gamma_{\pi/2}[v_2] - 2\pi v_1(p) = \Sigma_0 + \Sigma_{\pi/2}, \qquad p \in \partial\Omega'_0, \qquad (3.7) \]
\[ \Gamma_0[v_1] + \Gamma_{\pi/2}[v_2] - 2\pi v_2(p) = \Sigma_0 + \Sigma_{\pi/2}, \qquad p \in \partial\Omega'_{\pi/2}. \]

Upon appropriate discretization, (3.7) can be written as the following linear system

\[ \begin{bmatrix} \Gamma_0 - 2\pi I & \Gamma_{\pi/2} \\ \Gamma_0 & \Gamma_{\pi/2} - 2\pi I \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} \Sigma_0 + \Sigma_{\pi/2} \\ \Sigma_0 + \Sigma_{\pi/2} \end{bmatrix}, \qquad (3.8) \]

where I is the identity matrix. Let A denote the coefficient matrix in (3.8) and consider the entries (Γ₀ − 2πI) and (Γ_{π/2} − 2πI). By the previous arguments in establishing the equivalence of (3.3) and (3.4), it follows that

\[ (\Gamma_0 - 2\pi I) = (\Gamma_{\pi/2} - 2\pi I). \qquad (3.9) \]

This is true even when the right-hand sides of (3.3) and (3.4) are not identical. With this relation established, define A₁ = (Γ₀ − 2πI) = (Γ_{π/2} − 2πI). Similarly, consider the entries Γ₀ and Γ_{π/2}. We would like to show Γ₀ = Γ_{π/2}. By definition,

\[ \Gamma_0[v_1] = \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} v_1(q)\, d(\partial\Omega'_0), \qquad (3.10) \]

and upon discretization as described in Section 1.2, we obtain

\[ \Gamma_0[v_1] = \sum_{i=1}^{N} (v_1)_i \left( \int_{[\partial\Omega'_0]_i} \frac{\partial G(p,q)}{\partial n_q}\, d\!\left([\partial\Omega'_0]_i\right) \right). \qquad (3.11) \]

The quantity Γ₀[v₁] becomes the product Γ₀v₁, in which v₁ is the discretization of the unknown v₁(q), and Γ₀ is a matrix of known quantities populated by integrating the normal derivative of the Green's function over the individual surface elements of ∂Ω′_0. In considering the discretization of Γ_{π/2}[v₂], we obtain

\[ \Gamma_{\pi/2}[v_2] = \sum_{i=1}^{N} (v_2)_i \left( \int_{[\partial\Omega'_{\pi/2}]_i} \frac{\partial G(p,q)}{\partial n_q}\, d\!\left([\partial\Omega'_{\pi/2}]_i\right) \right). \qquad (3.12) \]

Again, the quantity Γ_{π/2}[v₂] becomes the product Γ_{π/2}v₂, in which v₂ is the discretization of the unknown v₂(q), and Γ_{π/2} is a matrix of known quantities populated by integrating the normal derivative of the Green's function over the individual surface elements of ∂Ω′_{π/2}. Assuming the discretizations of the boundaries are the same, because the boundaries ∂Ω′_0 and ∂Ω′_{π/2} are identical, the values populating Γ₀ and Γ_{π/2} are identical, and thus Γ₀ = Γ_{π/2}. Let A₂ = Γ₀ = Γ_{π/2}; then with the previously established definition A₁ = (Γ_{π/2} − 2πI) = (Γ₀ − 2πI), the matrix A comprising the linear system (3.8) has the form

\[ A = \begin{bmatrix} A_1 & A_2 \\ A_2 & A_1 \end{bmatrix}, \qquad (3.13) \]

which is a 2 × 2 block circulant matrix. In general, given m rotational symmetries, an m × m block circulant matrix can be obtained.

3.2 Block Circulant Inversion

Let N = nm. The coefficient matrix A ∈ ℂ^{N×N} arising from the BEM applied to an acoustic radiation problem with a rotationally symmetric boundary surface has the form

\[ A = \begin{bmatrix} A_1 & A_2 & \cdots & A_m \\ A_m & A_1 & \cdots & A_{m-1} \\ A_{m-1} & A_m & \cdots & A_{m-2} \\ \vdots & \vdots & \ddots & \vdots \\ A_2 & A_3 & \cdots & A_1 \end{bmatrix}, \qquad (3.14) \]

where each A_j, j = 1, . . . , m, is contained in ℂ^{n×n} and is dense. The matrix A is block circulant and therefore can be represented by circular shifts of its first block row. The circulant structure of A is contained in the m blocks forming the first block row of A. Therefore, in order to perform block DFT operations, we need to scale the Fourier matrix F ∈ ℂ^{m×m} to the block Fourier matrix F_b ∈ ℂ^{N×N}. The Fourier matrix F is defined as

\[ F = \frac{1}{\sqrt{m}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega_m^1 & \omega_m^2 & \cdots & \omega_m^{m-1} \\ 1 & \omega_m^2 & \omega_m^4 & \cdots & \omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega_m^{m-1} & \omega_m^{2(m-1)} & \cdots & \omega_m^{(m-1)(m-1)} \end{bmatrix}, \qquad (3.15) \]

where ω_m = e^{i2π/m}, i = √−1, and normalizing by 1/√m makes F unitary. Scaling each element of F by the n × n identity matrix, I_n, produces the block Fourier matrix F_b. This is equivalent to the Kronecker product F ⊗ I_n. After scaling, we have the block Fourier matrix

\[ F_b = \frac{1}{\sqrt{m}} \begin{bmatrix} I_n & I_n & I_n & \cdots & I_n \\ I_n & I_n\omega_m^1 & I_n\omega_m^2 & \cdots & I_n\omega_m^{m-1} \\ I_n & I_n\omega_m^2 & I_n\omega_m^4 & \cdots & I_n\omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_n & I_n\omega_m^{m-1} & I_n\omega_m^{2(m-1)} & \cdots & I_n\omega_m^{(m-1)(m-1)} \end{bmatrix}. \qquad (3.16) \]

  I I I ··· I  n n n n       I I ω1 I ω2 ··· I ωm−1   n n m n m n m  1    2 4 2(m−1)  Fb = √  I I ω I ω ··· I ω  . (3.16) m  n n m n m n m     . . . .   . . . .   . . . ··· .     (m−1) 2(m−1) (m−1)(m−1)  In Inωm Inωm ··· Inωm

N×n Next, the DFT relations needed for the inversion formula are established. Let X  C be the block column vector containing the first block row of A. The block DFT of X is

˜ given by X = FbX; that is,

      A˜ I I I ··· I A  1   n n n n   1               A˜   I I ω1 I ω2 ··· I ωm−1   A   2   n n m n m n m   2               A˜  =  I I ω2 I ω4 ··· I ω2(m−1)   A  , (3.17)  3   n n m n m n m   3         .   . . . .   .   .   . . . .   .   .   . . . ··· .   .              ˜ (m−1) 2(m−1) (m−1)(m−1) Am In Inωm Inωm ··· Inωm Am which is nothing more than a DFT of length m with n × n matrices as coefficients in the transform. Using the formulation of the inverse in [41], we have

−1 ˜ −1 ˜ −1 ˜ −1 ∗ A = Fbdiag{(A1) , (A2) ,..., (Am) }Fb , (3.18) 34

˜ −1 ˜ −1 ˜ −1 where diag{(A1) , (A2) ,..., (Am) } is a block diagonal matrix whose diagonal blocks are precisely the inverses of the blocks obtained from the DFT of the first block row of

A. From the formula, we can derive the algorithm for the solution of a linear system.

Consider the system Ax = b; multiplying by A⁻¹ yields

\[ x = A^{-1} b. \qquad (3.19) \]

Substituting in the definition for A⁻¹ from (3.18), we obtain

\[ x = F_b\, \operatorname{diag}\{ (\tilde{A}_1)^{-1}, (\tilde{A}_2)^{-1}, \ldots, (\tilde{A}_m)^{-1} \}\, F_b^*\, b. \qquad (3.20) \]

Rearranging yields

\[ \operatorname{diag}\{ \tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m \}\, F_b^*\, x = F_b^*\, b. \qquad (3.21) \]

Let x̃ = F_b^* x and b̃ = F_b^* b. This yields

\[ \operatorname{diag}\{ \tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m \}\, \tilde{x} = \tilde{b}. \qquad (3.22) \]

Blocking the vectors x̃ and b̃ to match the block sizes of each A_j, it is easy to see we obtain m independent linear systems to solve:

\[ \tilde{A}_j\, \tilde{x}_j = \tilde{b}_j, \qquad j = 1, \ldots, m. \qquad (3.23) \]

The steps for the solution of the linear system Ax = b are given by Algorithm 3.1. Each multiplication by the matrix F_b or F_b^* represents a block DFT or inverse DFT (IDFT) operation, respectively. It is worth noting that the system solves in line 3 of the algorithm are completely independent, which makes the algorithm very amenable to parallel implementation, as noted in [35].

Algorithm 3.1 Pseudocode for the sequential solution of a block circulant linear system.
1: Compute b̃ = F_b^* b;
2: Compute X̃ = F_b X;
3: Solve Ã_j x̃_j = b̃_j, j = 1, . . . , m;
4: Compute x = F_b x̃;
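A compact sequential realization of Algorithm 3.1 is sketched below in numpy (an illustration, not the thesis code). The FFT along the block index plays the role of the block Fourier transform; numpy's forward/inverse pair absorbs the 1/√m normalization and the conjugation of (3.16), so the scaling differs from the unitary convention in the text, but the computed solution is the same.

```python
import numpy as np

def solve_block_circulant(blocks, b):
    """Solve A x = b for A block circulant with first block row `blocks`.

    blocks : array of shape (m, n, n) holding A_1, ..., A_m
    b      : array of shape (m, n)   holding the blocked right-hand side
    """
    m = blocks.shape[0]
    # Block DFT of the first block row: D[l] = sum_k blocks[k] * w**(k*l), w = exp(2*pi*i/m).
    D = np.fft.ifft(blocks, axis=0) * m
    b_hat = np.fft.fft(b, axis=0)                      # transformed right-hand side
    x_hat = np.stack([np.linalg.solve(D[l], b_hat[l])  # m independent n-by-n solves
                      for l in range(m)])
    return np.fft.ifft(x_hat, axis=0)                  # transform back (scaling cancels)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 4, 3
    blocks = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))
    b = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
    # Assemble the dense block circulant matrix of (3.14) for comparison.
    A = np.block([[blocks[(j - i) % m] for j in range(m)] for i in range(m)])
    x = solve_block_circulant(blocks, b)
    print(np.allclose(A @ x.ravel(), b.ravel()))       # expected: True
```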

3.3 Invertibility

Algorithm 3.1 requires the inversion of the blocks obtained from computing the

DFT of the first block row of A. Therefore, assumptions on the invertibility of these blocks are required by the algorithm. The section will show that if the initial matrix A is assumed to be nonsingular, then each diagonal block is also nonsingular.

In order to facilitate the proof, we first show that the block Fourier matrix given in (3.16) is unitary.

Lemma 3.1. The block Fourier matrix F_b, as defined in (3.16), is unitary.

Proof. Recall, the N × N block Fourier matrix F_b can be constructed as a Kronecker product of the unitary m × m Fourier matrix F with the n × n identity matrix I_n. That is,

\[ F_b = F \otimes I_n. \qquad (3.24) \]

By the properties of Kronecker products [13] we have (A ⊗ B)^* = A^* ⊗ B^*. Therefore,

\[ F_b^* = (F \otimes I_n)^* = F^* \otimes I_n^* = F^* \otimes I_n. \qquad (3.25) \]

So F_b^* can be constructed in the same fashion. Now consider F_b^{-1}. By the Kronecker product property (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, for square nonsingular A and B, we have

\[ F_b^{-1} = F^{-1} \otimes I_n^{-1}. \qquad (3.26) \]

However, the Fourier matrix F is unitary, and thus

\[ F_b^{-1} = F^* \otimes I_n. \qquad (3.27) \]

It has been established that F_b^* = F^* ⊗ I_n, and, therefore,

\[ F_b^{-1} = F_b^*. \qquad (3.28) \]

Thus F_b is unitary.

Theorem 3.1. Given a nonsingular block circulant matrix A, the block diagonal matrix diag{Ã₁, Ã₂, . . . , Ã_m} is nonsingular, where the Ã_j, j = 1, . . . , m, are the blocks obtained by computing the block Fourier transform of the first block row of A.

Proof. Since A is block circulant we have

\[ A = F_b^*\, \operatorname{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\}\, F_b. \qquad (3.29) \]

Taking the determinant yields

\[ \det(A) = \det\!\left( F_b^*\, \operatorname{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\}\, F_b \right). \qquad (3.30) \]

Using a property of determinants we obtain

\[ \det(A) = \det(F_b^*)\, \det\!\left( \operatorname{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\} \right)\, \det(F_b). \qquad (3.31) \]

By Lemma 3.1, F_b is unitary, and thus det(F_b^*) det(F_b) = det(F_b^* F_b) = det(I) = 1; therefore,

\[ \det(A) = \det\!\left( \operatorname{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\} \right). \qquad (3.32) \]

Using the relation for the determinant of block diagonal matrices, we have

\[ \det(A) = \det(\tilde{A}_1)\det(\tilde{A}_2)\cdots\det(\tilde{A}_m). \qquad (3.33) \]

Because A is nonsingular, det(A) ≠ 0, and, therefore, det(Ã_j) ≠ 0 for j = 1, . . . , m. It follows that each Ã_j, j = 1, . . . , m, is nonsingular, and, therefore, diag{Ã₁, Ã₂, . . . , Ã_m} is nonsingular.

Chapter 4

Parallel Solution Algorithm

4.1 Block DFT Algorithm

While it is enticing to develop the algorithm around the Fast Fourier Transform (FFT), the robustness of the algorithm would be lost. Recall that the length of the DFT is determined by the number of symmetries of the boundary surface. For problems involving real world structures, such as propellers or wind turbines, the number of symmetries will be small, e.g., m ≤ 30. Indeed, even if a structure contained symmetries arising every one degree, i.e., m = 360, there must be at least one surface element in the discretization representing the symmetry, meaning n ≥ 360. This case is somewhat pathological, and, in general, we assume each symmetry has a large number of surface elements. This means it can be reasonably assumed that m ≪ n. In addition, FFTs make assumptions on the properties of m, the most common being that m is a power of two. While there are now FFT algorithms for any value of m [11, 26], the algorithms applied to feasible sizes of m have negligible benefits due to constants in the computation. We therefore designed our algorithm to be robust in the sense that it will work for any boundary surface input, and thus we use a DFT approach.

We derive the algorithm in the context of computing the block DFT of the first block row of A, given by (3.17), as this computation is needed during the system solve.

Define P to be the number of processors and assume P = m. The initial data distribution is obtained by assigning each submatrix A_j to processor P_j, for j = 1, . . . , m. The initial data distribution for P = m = 4 is illustrated in Figure 4.1.

Fig. 4.1 Initial data distribution assumed in the DFT computation for the case P = m = 4.

Expanding the DFT relation X̃ = F_b X, we obtain

\[ \begin{aligned} \tilde{A}_1 &= A_1 + A_2 + A_3 + \cdots + A_m \\ \tilde{A}_2 &= A_1 + A_2\omega_m^1 + A_3\omega_m^2 + \cdots + A_m\omega_m^{m-1} \\ \tilde{A}_3 &= A_1 + A_2\omega_m^2 + A_3\omega_m^4 + \cdots + A_m\omega_m^{2(m-1)} \\ &\;\;\vdots \\ \tilde{A}_m &= A_1 + A_2\omega_m^{m-1} + A_3\omega_m^{2(m-1)} + \cdots + A_m\omega_m^{(m-1)(m-1)}. \end{aligned} \qquad (4.1) \]

Given this initial data distribution, in the computation of Ã₁, processor P₁ already contains a portion of the summation, namely A₁. In fact, in all of the Ã_j computations, each processor contains a scaled portion of the corresponding summation. In addition, the scalar values ω_m^{(k−1)(j−1)}, for j, k = 1, . . . , m, are computable. This means that for the cost of scaling a submatrix by a term of the Fourier matrix, we already have a portion of the computation of each Ã_j, j = 1, . . . , m. The algorithm expands on this idea to compute the entire summation.

Starting from the initial data distribution, each processor computes the portion of the summation that corresponds to the data it owns. Then, each P_i cyclically sends its submatrix to P_{i−1} (P₁ sends its data to P_m). Each processor computes the corresponding term in the summation and propagates the submatrix. The computation completes after m − 1 communications. Figure 4.2 illustrates this process for the case P = m = 4.

Fig. 4.2 The DFT computation for the case P = m = 4. Each arrow indicates the communication of a processor's owned submatrix to a neighboring processor in the direction of the arrow.
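For concreteness, the following mpi4py sketch implements the P = m cyclic exchange just described. It is an illustrative sketch under stated assumptions, not the thesis implementation: it assumes the program is launched with exactly m ranks, that each rank already holds its n × n block (generated randomly here as A_local), and it uses the root ω_m = e^{i2π/m} of (3.15) without the 1/√m normalization.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p, m = comm.Get_rank(), comm.Get_size()    # run with exactly m ranks
n = 64                                     # block order (illustrative)

rng = np.random.default_rng(p)
A_local = rng.standard_normal((n, n)) + 0j # this rank's block A_{p+1}
omega = np.exp(2j * np.pi / m)

A_tilde = np.zeros_like(A_local)           # running partial sum for this rank
block = A_local.copy()                     # block currently in hand
recv = np.empty_like(block)

for step in range(m):
    k = (p + step) % m                     # global index of the block in hand
    if step < m - 1:
        # post the cyclic exchange first so communication overlaps the update
        reqs = [comm.Isend(block, dest=(p - 1) % m),
                comm.Irecv(recv, source=(p + 1) % m)]
    A_tilde += omega ** (p * k) * block    # add this block's contribution
    if step < m - 1:
        MPI.Request.Waitall(reqs)
        block, recv = recv, block          # received block becomes the working block

# A_tilde now holds the (p+1)-st block of the unnormalized block DFT of (4.1).
```

Posting the nonblocking send/receive before the local update lets the next block arrive while the current contribution is accumulated, which anticipates the communication/computation overlap discussed below.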

The algorithm can be generalized to the case with P = cm, where c ∈ ℤ⁺, by observing that a block DFT with a block size of n × n can be broken into n² independent DFTs with block size 1. To see this, consider the k-th summation taken from (4.1). We then have

\[ \tilde{A}_k = A_1 + A_2\omega_m^{(k-1)} + A_3\omega_m^{2(k-1)} + \cdots + A_m\omega_m^{(m-1)(k-1)}. \qquad (4.2) \]

Recall that A_j ∈ ℂ^{n×n} for each j = 1, . . . , m. For illustrative purposes, let n = 2. Then (4.2) becomes

\[ \begin{bmatrix} \tilde{a}_{11}^k & \tilde{a}_{12}^k \\ \tilde{a}_{21}^k & \tilde{a}_{22}^k \end{bmatrix} = \begin{bmatrix} a_{11}^1 & a_{12}^1 \\ a_{21}^1 & a_{22}^1 \end{bmatrix} + \begin{bmatrix} a_{11}^2 & a_{12}^2 \\ a_{21}^2 & a_{22}^2 \end{bmatrix} \omega_m^{(k-1)} + \begin{bmatrix} a_{11}^3 & a_{12}^3 \\ a_{21}^3 & a_{22}^3 \end{bmatrix} \omega_m^{2(k-1)} + \cdots + \begin{bmatrix} a_{11}^m & a_{12}^m \\ a_{21}^m & a_{22}^m \end{bmatrix} \omega_m^{(m-1)(k-1)}, \qquad (4.3) \]

where the superscript k indicates that a_{ij}^k is an element of A_k. From here the computation of the elements of Ã_k can be written as the following n² = 4 independent summations

\[ \begin{aligned} \tilde{a}_{11}^k &= a_{11}^1 + a_{11}^2\omega_m^{(k-1)} + a_{11}^3\omega_m^{2(k-1)} + \cdots + a_{11}^m\omega_m^{(m-1)(k-1)} \\ \tilde{a}_{12}^k &= a_{12}^1 + a_{12}^2\omega_m^{(k-1)} + a_{12}^3\omega_m^{2(k-1)} + \cdots + a_{12}^m\omega_m^{(m-1)(k-1)} \\ \tilde{a}_{21}^k &= a_{21}^1 + a_{21}^2\omega_m^{(k-1)} + a_{21}^3\omega_m^{2(k-1)} + \cdots + a_{21}^m\omega_m^{(m-1)(k-1)} \\ \tilde{a}_{22}^k &= a_{22}^1 + a_{22}^2\omega_m^{(k-1)} + a_{22}^3\omega_m^{2(k-1)} + \cdots + a_{22}^m\omega_m^{(m-1)(k-1)}. \end{aligned} \qquad (4.4) \]

The independence of each summation permits us, given a sufficient number of processors, to perform these summations simultaneously. In a more general setting, this equates to partitioning each Aj, j = 1, . . . , m, into smaller block sizes, and then simultaneously performing block DFTs of this smaller block size.
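In numpy terms, this independence is simply the statement that a block DFT of an (m, n, n) stack is n² scalar DFTs taken across the block index; numpy's sign convention differs from (3.15), but the decomposition argument is identical. A tiny illustration (not thesis code):

```python
import numpy as np

m, n = 8, 4
blocks = np.random.default_rng(0).standard_normal((m, n, n)) + 0j

# Block DFT of the stack: one length-m transform per (i, j) entry position.
block_dft = np.fft.fft(blocks, axis=0)

# The same result, computed as n*n independent scalar DFTs.
elementwise = np.empty_like(block_dft)
for i in range(n):
    for j in range(n):
        elementwise[:, i, j] = np.fft.fft(blocks[:, i, j])

print(np.allclose(block_dft, elementwise))   # expected: True
```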

Now that it has been established that a block DFT can be broken down into block DFTs of smaller block size, we explain how to exploit this in the P = cm case. Let c = 4, i.e., P = 4m, and partition each A_j, j = 1, . . . , m, into c = 4 blocks of size n/√c × n/√c. Note that the block size is arbitrary; therefore, if √c is not an integer, the submatrix is simply split into c blocks with slightly different block sizes. The data decomposition can be seen in Figure 4.3, where again the superscript k indicates that A_{ij}^k is a block of A_k.

Fig. 4.3 Parallel block DFT data decomposition for P > m.

We rewrite these as c = 4 independent block DFTs of block size n/√c × n/√c. We then group the processors into c = 4 processor groups of size m. Grouping the processors, we obtain four DFTs in the form presented when P = m. Figure 4.4 shows the processor group organization. We then apply the P = m DFT algorithm within each processor group simultaneously. Therefore, when P = cm, we can decompose each A_j into c independent block DFTs of smaller block size. This decomposition can proceed all the way down until each A_j is decomposed into n² independent DFTs of block size 1. In this case, c = n², i.e., P = n²m, and n² one-dimensional DFTs are being performed simultaneously.

Since the most expensive part of computing the blocked DFT is the communication of the submatrices, it is desirable to overlap communication and computation as much as possible. With this in mind, we introduce asynchronous sends/receives. Starting from the P = m initial data distribution, begin with the asynchronous send of the processor's owned submatrix followed by the asynchronous receive of the neighboring processor's submatrix. While the processor's current submatrix data is being sent, a neighboring processor's submatrix is being received. During this communication, the data being sent is still able to be used because no modifications are being made. The data being sent is then used to update the partial sum. Therefore, we are sending, receiving, and computing the partial sum simultaneously.

Fig. 4.4 Parallel block DFT data decomposition and processor groupings for P > m.

There is a cost associated with the communication overlap. The cost is in the amount of memory being used to enable this overlapped communication/computation.

Three times the amount of memory is now being used: the unmodified submatrix, the neighboring processor's unmodified submatrix, and the running partial sum for the transformed submatrix. However, the amount of extra memory used can be managed by only communicating portions of a submatrix at a time. While theoretically it is best to minimize the communication startups, in practice, for large volumes of data, it is beneficial to send the data spread over a number of smaller packets. This blocking factor for optimal

Note that this algorithm is used for both X̃ = FbX and X = Fb*X̃; the only difference is the Fourier matrix that appears. When referring to the parallel DFT algorithm, we differentiate the use of Fb and Fb* as the parallel DFT and IDFT, respectively.

4.2 Block FFT Algorithm

As mentioned in Section 4.1, the FFT is difficult to apply when considering an arbitrary number of rotational symmetries, m, because of its restriction on the value of m, i.e., a power of two in the radix-2 algorithm. In certain cases, however, when the FFT is applicable, it can effectively be used. A relevant example concerns acoustic radiation problems involving axisymmetric structures. These problems deal with structures obtained by rotating a two-dimensional object around a third fixed, orthogonal axis. For example, cylinders or spheres are types of axisymmetric structures. By considering the structure of a propeller, or fan blade, it can be readily deduced that while all axisymmetric structures are rotationally symmetric, not all rotationally symmetric structures are axisymmetric; that is, axisymmetric structures are a subset of rotationally symmetric structures. The advantage of axisymmetric structures comes from the ability to choose the number of rotational symmetries in the discretization of the problem. Being able to choose the value of m means that the choice can be made to exploit the FFT.

Section 4.1 began by detailing a DFT algorithm for the P = m case. It then extended the algorithm to the P = cm case by breaking the block DFT into c independent block DFTs of smaller blocksize. The algorithm then constructs c processor groups, each with m processors, around the decompositions. It then uses the P = m algorithm within each processor group to simultaneously compute the block DFTs of smaller blocksize.

The FFT algorithm keeps the exact same framework as the DFT algorithm. The difference arises in how the P = m algorithm computes the DFT; in this case, a distributed FFT algorithm is used.

In order to derive the parallel algorithm, consider the sequential FFT algorithm given by Algorithm 1.2; the accompanying discussion in Section 1.3 concerned the pattern of interaction between elements of the initial input vector in producing the transformed vector. Indeed, this is the essence of the FFT. Figure 1.1 gave a visualization of the interaction pattern; in addition, it also showed how the data migrated to a bit-reversed order. This is important. The parallel algorithm will distribute each element of the input vector onto different processors, and these element interactions will become communication patterns. The algorithm used to compute the distributed one-dimensional FFT has been termed the binary exchange algorithm [24]. Only small modifications to Algorithm 1.2 are needed to fit the parallel case.

As in Section 4.1, we present the algorithm in the context of computing the block FFT of the first block row of A. Define P to be the number of processors and assume P = m; the initial data distribution is obtained by assigning each submatrix Aj to processor Pj, for j = 1, . . . , m. The initial data distribution for P = m = 4 can again be seen in Figure 4.1.

Now, consider Algorithm 4.1, which is the parallel FFT algorithm resulting from simple modifications to Algorithm 1.2.

Algorithm 4.1 Distributed radix-2 FFT pseudocode [24].
  Y = Radix-2FFT(X, Y, n)
  r = log n;
  R = X;
  for m = 0 to r − 1 do
    S = R;
    // Let (b0 b1 . . . br−1) be the binary representation of pid
    j = (b0 . . . bm−1 0 bm+1 . . . br−1);
    k = (b0 . . . bm−1 1 bm+1 . . . br−1);
    e = (bm bm−1 . . . b0 0 . . . 0);
    if pid == j then
      Send Apid to processor k;
      Receive Ak from processor k;
      Apid = Apid + Ak ωn^e;
    else
      Receive Aj from processor j;
      Send Apid to processor j;
      Apid = Aj + Apid ωn^e;
    end if
  end for
  Y = R;

The first difference to note is that the second loop in Algorithm 1.2 is no longer needed. The iteration variable i served two purposes: identifying which element of the initial vector to update, and determining the other elements involved in the computation. As each processor only has one element, there is no question which element each processor is responsible for updating. The second property remains intact because each Ai is contained on the processor with pid = i, where pid is the processor id. During each iteration of Algorithm 4.1, each processor needs one extra piece of data to perform the update to the owned data. Each processor uses its processor id to compute which element it needs to complete the current computation. By determining the element number, the pid of the processor which owns the data is determined; this can then be used to set up the communication to obtain the data. Figure 4.5 illustrates this process.

Fig. 4.5 Process illustrating the distributed FFT. Lines crossing to different processors indicate communication from left to right. Note the output is in reverse bit-reversed order relative to numbering starting at zero; that is, A1 is element 0; A2 is element 1, etc.
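The partner selection in Algorithm 4.1 amounts to flipping one bit of the processor id per stage; a small illustrative helper (hypothetical names, not from the BEM code) makes this explicit.

# A sketch of the partner computation behind Algorithm 4.1: in stage s the partner
# of a process is obtained by flipping bit s (counted from the most significant bit)
# of its pid.
def fft_partners(pid: int, num_procs: int):
    """Yield (stage, partner, is_lower) for a distributed radix-2 FFT."""
    r = num_procs.bit_length() - 1            # r = log2(P), P assumed a power of two
    for stage in range(r):
        bit = 1 << (r - 1 - stage)            # the bit flipped at this stage
        partner = pid ^ bit                   # flip that bit to find the partner
        is_lower = (pid & bit) == 0           # True -> this process plays the "j" role
        yield stage, partner, is_lower

if __name__ == "__main__":
    for stage, partner, lower in fft_partners(pid=5, num_procs=8):
        print(f"stage {stage}: partner {partner}, role {'j' if lower else 'k'}")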

The extension to the P = cm case is identical to the discussion in Section 4.1.

The block DFT can be decomposed into c independent block DFTs of smaller block size. The processors then create c processor groupings and simultaneously perform the P = m FFT algorithm. The advantage of computing the distributed FFT in this way is that the number of communications is minimized. The parallel DFT algorithm requires O(m) communications, whereas the FFT requires only O(log m). Although we have assumed m to be quite small, in the P = m case each communication requires that n² data elements be sent. This means the packet sizes are quite large; therefore, any reduction in the number of communications is beneficial.

4.3 System Solves

The goal is to solve all systems in line 3 of Algorithm 3.1 simultaneously. In addition, the ScaLAPACK routine PZGESV is used to further parallelize each system solve. By using ScaLAPACK, we are forced to work within the limits of its required data distribution and processor organization. In particular, the matrix data must be distributed in a block cyclic fashion, and the processors logically arranged in a grid format [6]. Using these restrictions, the initial system is set up as follows. Assume P = cm processors with c ∈ Z+. Now define m processor grids of size √c × √c and denote them by Gi, i = 1, . . . , m. If √c is not an integer, the processors are arranged in a rectangular grid format such that the numbers of rows and columns are integers.

Figure 4.6 illustrates the grid creation process for P = 16 and m = 4.

Fig. 4.6 Processor grid creation for P=16 and m=4.

Next, each Aj and corresponding right-hand side bj are block cyclically distributed over process grid Gj for j = 1, . . . , m. We require that the block cyclic distribution be performed using the same blocking factor for each Aj and bj. Each Gj is then in a position where it can solve a system involving Aj and bj. However, before these system solves can be performed, the left- and right-hand sides must be transformed by the DFT and IDFT, respectively.
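A hedged sketch of how the two kinds of process groupings can be formed with communicator splits is shown below, using mpi4py rather than the Fortran/BLACS calls actually used; the contiguous rank ordering is an assumption.

# Forming m grid communicators (c processes each) for the ScaLAPACK solves, and
# c cross-grid communicators (m processes each) for the block (I)DFT partial sums.
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
m = 4                                  # number of rotational symmetries (assumed to divide P)
c = P // m

grid_id = rank // c                    # which grid G_i this process belongs to
grid_comm = comm.Split(color=grid_id, key=rank)        # c processes per grid

dft_id = rank % c                      # position within the grid
dft_comm = comm.Split(color=dft_id, key=grid_id)       # m processes per DFT group

# grid_comm is where A_j and b_j live and where PZGESV would run; dft_comm links
# corresponding pieces across grids so the block DFT/IDFT sums can be formed.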

4.4 Parallel Algorithm

We have established an initial data distribution which can be used by ScaLAPACK and an algorithm for computing the DFT. Working within this data distribution and using the DFT or FFT algorithm, we present the parallel algorithm.

Assume we have P = cm processors, where c ∈ Z+. Define m processor grids Gj, j = 1, . . . , m, and block cyclically distribute each Aj and bj onto processor grid Gj for j = 1, . . . , m. The first step is to apply the IDFT to the right-hand side b. Each bj is distributed onto its respective processor grid of c processors. Because each bj was distributed over its corresponding processor grid using the same blocking factor, the distribution process is identical to decomposing each bj into c smaller blocks. Therefore, we can create c processor groupings of size m, where each processor group is composed of one processor from each grid. That is, processor group 1 is formed by taking each Gj's first element; group 2 is formed by taking each Gj's second element, and this process continues until we have c processor groupings. These processor groupings create c independent IDFTs of smaller blocksize which can use the DFT/FFT algorithm. Therefore, the IDFT involving each bj has been decomposed into c IDFTs of smaller size which can be done simultaneously. Using the DFT/FFT algorithm, we perform the IDFT of b, transforming each bj into b̃j. In the same way, we transform each Aj to Ãj. Now note that the data distribution has not changed, and each Gj now has the system Ãjx̃j = b̃j, which are precisely the systems that need to be solved. Note also that if the FFT algorithm is used, the data has migrated into a bit-reversed order during the IDFT transformations; however, both sides of the equation have migrated into a bit-reversed order, and the correct systems are still obtained. More precisely, if we let rev(j) denote the bit reversal of j, then after the IDFT transformations of Aj and bj, each system Ãjx̃j = b̃j resides on process grid Grev(j), for j = 1, . . . , m. Each Gj calls the ScaLAPACK routine PZGESV and solves its respective system. PZGESV overwrites b̃j with the solution x̃j. Because the solution overwrites the entries of b̃j, the data distribution has not changed, and we simply use the DFT/FFT algorithm again to transform each x̃j to xj. Thus we have the solution of the original linear system. If the FFT algorithm was used, x̃j would be in bit-reversed order; that is, x̃j is contained in grid Grev(j) for j = 1, . . . , m; however, when transforming back to xj, the bit-reversed order is negated. Therefore, xj is contained on grid Gj, j = 1, . . . , m, and the solution vector is in the same form as if the DFT algorithm had been used. Algorithm 4.2 shows the pseudocode for the parallel algorithm as six concise steps.

Algorithm 4.2 Pseudocode for the parallel solution of a block circulant linear system, assuming P = cm.
  1: Define m √c × √c process grids.
  2: Block cyclically distribute each Aj and bj onto grid Gj in an identical fashion.
  3: Perform c simultaneous IDFTs transforming bj to b̃j.
  4: Perform c simultaneous DFTs transforming Aj to Ãj.
  5: Simultaneously solve each Ãjx̃j = b̃j in parallel using PZGESV.
  6: Perform c simultaneous DFTs transforming x̃j to xj.
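As a serial cross-check of steps 3 through 6, the following NumPy sketch solves a small block circulant system in the frequency domain and verifies the result against a densely assembled matrix. NumPy's FFT sign and normalization conventions are used, which may differ from Fb by scaling; the block layout assumed is the usual one in which each block row is a right circular shift of the previous one.

# Serial reference for the block circulant solve (illustrative, not the parallel code).
import numpy as np

m, n = 4, 5
rng = np.random.default_rng(1)
blocks = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))  # first block row A_1..A_m
b = rng.standard_normal(m * n) + 1j * rng.standard_normal(m * n)

# Transformed diagonal blocks: m * ifft along the block axis plays the role of the
# block DFT of the first block row (up to normalization).
A_tilde = m * np.fft.ifft(blocks, axis=0)

# Transform the right-hand side, solve the m small systems, transform back.
b_hat = np.fft.fft(b.reshape(m, n), axis=0)
y = np.stack([np.linalg.solve(A_tilde[k], b_hat[k]) for k in range(m)])
x = np.fft.ifft(y, axis=0).reshape(m * n)

# Check against a densely assembled block circulant matrix: block (r, s) = A_{(s-r) mod m}.
A_dense = np.block([[blocks[(s - r) % m] for s in range(m)] for r in range(m)])
assert np.allclose(A_dense @ x, b)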

Chapter 5

Theoretical Timing Analysis

In this chapter, the theoretical runtime analysis for the parallel implementations discussed in Chapter 4 is developed. Algorithm 4.2 contains two core operations: parallel computation of the DFT and the parallel linear system solve. Therefore, the parallel runtime, denoted TP(n, m), can be expressed as:

T_P(n, m) = T_{FT}(n, m) + T_{LS}(n, m),    (5.1)

where TFT(n, m) denotes the parallel runtime in computing the DFT, and TLS(n, m) denotes the runtime of the parallel linear system solve. Chapter 4 presented two different implementations of the DFT, and, therefore, two parallel runtimes will be developed.

Let A be a block circulant matrix with m blocks of order n, and let X contain A's first block row; that is,

X = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{bmatrix}.

Further, let b be a single column vector and the right-hand side of the linear system Ax = b.

5.1 Parallel Linear System Solve

The parallel linear system solves are performed by ScaLAPACK, which conveniently provides the theoretical analysis of the implementation [6]. The term TLS(n, m) is then given by

T_{LS}(n, m) = \frac{2n^3}{3P}\, t_f + \frac{3 + \frac{1}{4}\log_2 P}{\sqrt{P}}\, n^2 t_v + (6 + \log_2 P)\, t_m,    (5.2)

where tf is the time per complex floating point operation, tm is the startup time for each communication, and tv is the time per data item sent. In general, tm > tv; thus, the number of communication startups should be minimized. Equation (5.2) can be broken into three parts: the first term in the summation is the computation term; the second term is the communication cost concerning the quantity of data items sent, and the last term corresponds to the number of communication startups.

The variable P in (5.2) is used to denote all processors; however, in the general case where P = cm, the parallel implementation contains m simultaneous system solves, each with c = P/m processors devoted to the parallel system solve. Therefore, the term P in (5.2) should be replaced by c, obtaining

T_{LS}(n, m) = \frac{2n^3}{3c}\, t_f + \frac{3 + \frac{1}{4}\log_2 c}{\sqrt{c}}\, n^2 t_v + (6 + \log_2 c)\, t_m.    (5.3)

Note that (5.3) is the parallel runtime for all of the m linear system solves. Due to the concurrency of the m linear system solves, solving m linear systems with P = cm processors is equivalent to solving one linear system with c processors. This overlap in parallelized operations is what makes the inversion formulation so amenable to parallel solution.
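For back-of-the-envelope estimates, (5.3) can be transcribed directly; the machine constants passed in below are hypothetical placeholders, not measured values.

# Direct transcription of (5.3) with c = P / m.
import math

def t_linear_solve(n: int, m: int, P: int, tf: float, tv: float, tm: float) -> float:
    """Estimated time of the m simultaneous linear system solves per equation (5.3)."""
    c = P / m
    compute = 2.0 * n**3 / (3.0 * c) * tf
    volume = (3.0 + 0.25 * math.log2(c)) / math.sqrt(c) * n**2 * tv
    startups = (6.0 + math.log2(c)) * tm
    return compute + volume + startups

# Example with made-up constants: t_linear_solve(6000, 4, 16, tf=1e-9, tv=1e-8, tm=1e-5)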

5.2 Block DFT using the DFT Algorithm

In this section, the runtime analysis of Algorithm 4.2 is considered when the block DFT algorithm (see Section 4.1) is used. There are three transformations which use the DFT algorithm presented in Section 4.1: the transformation of Aj to Ãj, bj to b̃j, and the solution vector x̃j to xj, for j = 1, . . . , m. Each of these transformations requires m − 1 communications. When transforming Aj to Ãj, for j = 1, . . . , m, each communication involves messages of size n²; similarly, the transformations of bj to b̃j and x̃j to xj, for j = 1, . . . , m, both involve messages of size n. Using this, the communication term in the analysis, denoted To(n, m), can be constructed. Accounting for the communications needed by these transformations, To(n, m) is given by

T_o(n, m) = 3(m − 1)\, t_m + (m − 1)(n^2 + 2n)\, t_v,    (5.4)

where, again, tm is the time to initialize a communication, and tv is the time per data item sent.

The computational term in the analysis is relatively straightforward. During each step of the algorithm, each processor multiplies the data it currently owns and adds it to its running sum. When transforming Aj to Ãj, for j = 1, . . . , m, each processor scales n² elements by a term in the Fourier matrix and adds them to the running sum; therefore, we have n²m multiplications plus n²(m − 1) additions in the transformation of Aj to Ãj, for j = 1, . . . , m. Similarly, the transformations of bj to b̃j and x̃j to xj, for j = 1, . . . , m, both involve nm multiplications and n(m − 1) additions. Combining the computational and communication terms yields

T_{DFT}(n, m) = (m − 1)(n^2 + 2n)\, t_f + m(n^2 + 2n)\, t_f + 3(m − 1)\, t_m + (m − 1)(n^2 + 2n)\, t_v.    (5.5)

The analysis can easily be extended to the P = cm case. Recall, the P = cm

DFT algorithm creates c DFTs of smaller blocksize and arranges c processor groups.

Using these processor groups, c simultaneous P = m DFTs of smaller blocksize are then performed. While the same number of communication startups is still needed, the size of the messages as well as the amount of computation is reduced by 1/c; therefore, by dividing the appropriate terms in (5.5) by c, the P = cm case is obtained:

T_{DFT}(n, m) = \frac{(m − 1)(n^2 + 2n) + m(n^2 + 2n)}{c}\, t_f + 3(m − 1)\, t_m + \frac{(m − 1)(n^2 + 2n)}{c}\, t_v.    (5.6)

More compactly,

T_{DFT}(n, m) = \frac{(2m − 1)(n^2 + 2n)}{c}\, t_f + 3(m − 1)\, t_m + \frac{(m − 1)(n^2 + 2n)}{c}\, t_v.    (5.7)

By combining (5.3) and (5.7), the parallel runtime for Algorithm 4.2, which is given by (5.8), is obtained:

T_{P1}(n, m) = \frac{(2m − 1)(n^2 + 2n)}{c}\, t_f + 3(m − 1)\, t_m + \frac{(m − 1)(n^2 + 2n)}{c}\, t_v + \frac{2n^3}{3c}\, t_f + \frac{3 + \frac{1}{4}\log_2 c}{\sqrt{c}}\, n^2 t_v + (6 + \log_2 c)\, t_m.    (5.8)

By rearranging (5.8) and grouping computation- and communication-specific constants, the final parallel runtime using the DFT algorithm is given by

T_{P1}(n, m) = \left[ \frac{2n^3}{3c} + \frac{(2m − 1)(n^2 + 2n)}{c} \right] t_f + \left[ 3(m − 1) + (6 + \log_2 c) \right] t_m + \left[ \frac{(m − 1)(n^2 + 2n)}{c} + \frac{3 + \frac{1}{4}\log_2 c}{\sqrt{c}}\, n^2 \right] t_v.    (5.9)

5.3 Block DFT Using the FFT Algorithm

The FFT timing analysis follows directly from Section 5.2. Recall that the main difference between the DFT algorithm and the FFT algorithm is the communication pattern. Whereas the DFT required m − 1 communications, the FFT only requires log2 m, for m a power of two. Consider the DFT implementation's communication term (5.4). By substituting log2 m for the appropriate communication terms, (5.4) becomes

T_o(n, m) = 3\log_2(m)\, t_m + \log_2(m)(n^2 + 2n)\, t_v    (5.10)

when the FFT algorithm is used. In the FFT case, after each communication, each processor scales a portion of its owned data by a term in the Fourier matrix. This modified data is then added to the processor's running sum; therefore, log2 m communications imply that log2 m multiplications and log2 m additions are performed. This is reflected in the computational term. Note that these are the only terms that change relative to the analysis involving the DFT algorithm. By proceeding in the same manner as Section 5.2, we obtain

T_{P2}(n, m) = \left[ \frac{2n^3}{3c} + \frac{2\log_2(m)(n^2 + 2n)}{c} \right] t_f + \left[ 3\log_2 m + (6 + \log_2 c) \right] t_m + \left[ \frac{\log_2(m)(n^2 + 2n)}{c} + \frac{3 + \frac{1}{4}\log_2 c}{\sqrt{c}}\, n^2 \right] t_v    (5.11)

for the final runtime of Algorithm 4.2 when using the FFT algorithm.

5.4 Bounds

Constructing the parallel complexity analysis allows us to find the dominating term in both parallel algorithms. Recall the assumptions in the development of the parallel algorithms in Chapter 4, namely, n ≫ m; that is, the order of each block in the coefficient matrix is large relative to the number of blocks. In general, it was assumed m < 30. Looking at (5.9), it is clear that the first term, i.e., 2n³/(3c), dominates the computation. Similarly, by considering (5.11), it follows that both expressions have the same dominating term. Therefore, we obtain

T_{P1}(n, m) = O\!\left(\frac{n^3}{c}\right)    (5.12)

and

T_{P2}(n, m) = O\!\left(\frac{n^3}{c}\right).    (5.13)
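The full models behind these bounds, (5.9) and (5.11), can also be evaluated directly to compare the two variants for particular n, m, and c; the following transcription is illustrative, and any constants supplied are hypothetical.

# Transcriptions of (5.9) and (5.11) for side-by-side comparison under a chosen
# machine model (tf, tv, tm are placeholders, not measured values).
import math

def t_parallel_dft(n, m, c, tf, tv, tm):
    """T_P1(n, m) from (5.9): the solve using the DFT algorithm."""
    comp = (2.0 * n**3 / (3.0 * c) + (2 * m - 1) * (n**2 + 2 * n) / c) * tf
    start = (3 * (m - 1) + (6 + math.log2(c))) * tm
    vol = ((m - 1) * (n**2 + 2 * n) / c
           + (3 + 0.25 * math.log2(c)) / math.sqrt(c) * n**2) * tv
    return comp + start + vol

def t_parallel_fft(n, m, c, tf, tv, tm):
    """T_P2(n, m) from (5.11): the same solve using the FFT algorithm (m a power of two)."""
    comp = (2.0 * n**3 / (3.0 * c) + 2 * math.log2(m) * (n**2 + 2 * n) / c) * tf
    start = (3 * math.log2(m) + (6 + math.log2(c))) * tm
    vol = (math.log2(m) * (n**2 + 2 * n) / c
           + (3 + 0.25 * math.log2(c)) / math.sqrt(c) * n**2) * tv
    return comp + start + vol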

Recalling that N = nm, for N large, the term which dominates arises from the ScaLAPACK linear system solve. Therefore, under our assumptions, the most expensive part of the computation is offloaded to the ScaLAPACK routine. This means that although large packets of data must be communicated between processors in computing the DFT, when N is large, the dominating term comes from the linear system solves. While this is not extremely surprising, the implication is that the communication terms in the developed algorithms do not overwhelm the overall algorithm. As a result, the computational portion of the linear system solves dominates. This is also the result reached in the ScaLAPACK user guide [6], where only the parallel linear system solve is analyzed. This is considered advantageous because the linear system solves are computed via ScaLAPACK, which is optimized for scalability.

Chapter 6

Numerical Experiments

All experiments were run using the Intel Nehalem processors of the Cyberstar compute cluster [1] running at 2.66 GHz with 24 GB of RAM. We implemented the parallel algorithm in Fortran 90 and used the ScaLAPACK and MPI libraries. A blocking factor of 50 was used for the block cyclic distribution of each Aj and bj onto their respective processor grids.

The FFT and DFT parallel algorithms differ in the communication routines used.

The DFT algorithm broke the communications into blocks of size 4000 which were sent and received asynchronously using MPI’s ISEND/IRECV functions. In the case that a processor does not contain 4000 elements, all of its data is sent in one communication.

The blocking of the communications also parsed each matrix columnwise to work within FORTRAN's column-major data storage format. In contrast, the FFT algorithm did not perform asynchronous sends/receives and used the standard BLACS routines for sending 2D blocks of data.

6.1 Experiment 1

First, we look at the runtime, speedup, and efficiency for a vibrating structure with four times rotational symmetry for both the DFT algorithm and the FFT algorithm. In each case, the number of processors P and matrix size N are varied. The number of processors is varied from 4 to 48, and N is varied from roughly 13,000 to 24,000.

6.2 Experiment 2

We look at the runtime, speedup, and efficiency for a vibrating structure with eight times rotational symmetry for both the DFT algorithm and the FFT algorithm. In each case, the number of processors P and matrix size N are varied. The number of processors is varied from 8 to 48, and N is varied from roughly 13,000 to 24,000.

6.3 Numerical Results

6.3.1 Experiment 1

First, consider the algorithm’s behavior when a structure with four times rota- tional symmetry, m = 4, is examined using the DFT algorithm as well as the FFT algorithm. Figure 6.1 shows the runtimes when using the DFT algorithm; a sharp de- cline in runtime can be seen as the number of processors increase for various N. The runtimes using the FFT implementation are given in Figure 6.2 showing similar trends and runtimes as their DFT counterpart. The runtime improvements are also apparent when looking at the speedup, which are given in Figures 6.3 and 6.4. The oscillations in the speedups can be explained by looking more closely at the values of the runtimes.

Figures 6.1 and 6.2 show that the wall clock times are quite low, and small benign vari- ances in the runtime for large P cause large oscillations in the speedup. This is why the 60 oscillations are flushed out for larger problems. Therefore as N increases, the oscillations are dampened, and the speedups become more linear.

Fig. 6.1 Runtime comparison using the DFT algorithm for varying P and N with m = 4.

Fig. 6.2 Runtime comparison using the FFT algorithm for varying P and N with m = 4.

Fig. 6.3 Speedups using the DFT algorithm for varying P and N with m = 4.

Fig. 6.4 Speedups using the FFT algorithm for varying P and N with m = 4.

The most important category in parallel algorithm analysis is probably efficiency.

Efficiency is a measure of useful work done by a parallel algorithm and gives insight into how much time the algorithm spends waiting on communication. Ideally, we would like the efficiency to be as close to 1 as possible, which means all of the work is useful. However, we are restricted by the underlying parallelism of the computation being performed. Here we look at the behavior of the efficiency for varying P and N. Figures 6.5 and 6.6 show the efficiency as a function of problem size for the DFT and FFT implementations, respectively. For nearly all processor numbers, excluding P = 4 which simply remains efficient, the algorithm becomes more efficient as the problem size increases. This tells us that as the problem size increases, the amount of time spent doing useful work increases. Because N = nm, for m fixed, an increase in problem size directly correlates to an increase in n. Recall the discussion in Section 5.4; for fixed m such that n ≫ m, the dominating term comes from the computational portion of the ScaLAPACK linear system solve. This fact is seen in Figures 6.5 and 6.6; as a function of problem size, the amount of time spent computing grows faster than the amount of time spent communicating.
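For reference, the speedup and efficiency reported in these figures follow the standard definitions; a trivial helper (with made-up timings in the example) is:

def speedup_and_efficiency(t_serial: float, t_parallel: float, p: int):
    """Return (speedup, efficiency) for a run on p processors."""
    s = t_serial / t_parallel
    return s, s / p

# e.g. speedup_and_efficiency(100.0, 20.0, 8) -> (5.0, 0.625) for made-up timings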

Notice that, generally, the larger the number of processors, the lower the efficiency. Although the DFT computations are ideally parallel and are able to overlap the communications resulting from additional processors, the linear system solve computations are not. More processors imply that the communication term in the linear system solve will contribute more to the overall runtime; however, as the size of the linear system grows, the computational portion of the ScaLAPACK solve begins to dominate. That is, more processors mean the efficiency will be lower for problems of the same size, but as the computational term of the linear system solve begins to dominate, the efficiency increases. Therefore, even though the efficiencies for different processor counts are decreasing with P in Figure 6.5, it is only because the data point, i.e., the value of N, is fixed.

Fig. 6.5 Efficiency using the DFT algorithm for varying N and P with m = 4.

An interesting observation regarding the two implementations is the similarity in their performance. Recall that the main difference between the algorithms is the number of communications needed when computing the DFT, whether via the DFT algorithm, i.e., matrix multiplication, or the FFT algorithm. For the case m = 4, the number of communications is negligible; however, m = 4 also means the linear systems are larger. Therefore, the computational term in the ScaLAPACK linear system solve will dominate the computation more, making the algorithms behave in a similar fashion. This was also alluded to in the theoretical analysis given in Chapter 5.

Fig. 6.6 Efficiency using the FFT algorithm for varying N and P with m = 4.

6.3.2 Experiment 2

Now, consider the performance of the algorithm when the number of rotational symmetries is m = 8. Figures 6.7 and 6.8 show the runtime analysis for the DFT and FFT implementations. The trend in the runtimes is similar to that of the four times rotational symmetry case. The main difference is the runtime values. Consider the largest case, N = 24,000; Figure 6.1 shows the runtime for P = 8 is roughly 38 seconds, whereas Figure 6.7 shows that for the same values of P and N, the computation time is only 12 seconds. As m increases, the size of the linear system decreases. This means that the most expensive part of the computation, which is the linear system solve, decreases with m, and results in a faster overall runtime. Even though the number of communications in the DFT/FFT algorithms grows with m, the messages per communication are smaller.

Fig. 6.7 Runtime comparison using the DFT algorithm for varying P and N with m = 8.

Fig. 6.8 Runtime comparison using the FFT algorithm for varying P and N with m = 8.

Figure 6.9 shows the speedup for the DFT algorithm in the eight times rotational symmetry case. What is interesting is that for smaller problems the speedup begins leveling off past a certain point. The DFT and FFT algorithms have no increasing dependence on P in their communication terms, and, therefore, this must be due to the size of the linear systems. This shows that for a fixed problem size, the advantage of additional processors becomes negligible after a certain point due to the ratio of computation to communication in the linear system solve. What is important is that until this point of leveling off, the speedup increases nearly linearly. This means that the extra communications in the DFT/FFT algorithm, which are due to the increase in m, do not overwhelm the algorithm. Indeed, by considering the larger values of N, Figure 6.9 shows that the speedups are nearly linear for larger problem sizes. The speedup in the case of the FFT algorithm is given in Figure 6.10. The trend for the smaller values of N appears to extend further than in Figure 6.9 and is most likely due to the savings of the FFT.

Lastly, efficiency is considered and is shown in Figures 6.11 and 6.12 for varying values of P and N. Again, it is found that the efficiency increases for increasing problem size. It can be seen that the overall value of the efficiency is slightly less than the m = 4 case; this is due to two things: the increase in communications due to the

DFT/FFT algorithms, and the size of the linear system solve. However, because m is

fixed, the number of communications by the DFT/FFT algorithm will not grow with

N, even though the message size will grow. Therefore, as the problem size increases, the computational term of the linear system solve will again begin to dominate, and the efficiency can be expected to increase.

Fig. 6.9 Speedup comparison using the DFT algorithm for varying P and N when m = 8.

Fig. 6.10 Speedup comparison using the FFT algorithm for varying P and N when m = 8.

Fig. 6.11 Efficiency comparison using the DFT algorithm for varying P and N when m = 8.

The effect of using the FFT over the DFT can be seen by comparing the first data point N = 13,000 of Figures 6.11 and 6.12. The efficiencies for the FFT algorithm are higher at this data point. In this instance, the linear systems are still relatively small, and the computational term of the ScaLAPACK solve does not yet dominate. This is because for the given value of N = 13,000, the communications of the DFT transformations contribute more. Because the FFT implementation uses fewer communications, the efficiencies are higher for smaller N.

As in the four times rotational symmetry case, we find that the DFT and FFT implementations perform similarly. The observed benefits of using the FFT appeared at the lower bound of our experimental values. The FFT algorithm showed a higher efficiency when N was small. In this instance, the linear system solves did not yet dominate, and, therefore, the communications contribute more. However, in both the m = 4 and m = 8 cases, as N increases, the algorithms exhibit similar performance. In our case, the assumptions rely on the solution of larger linear systems. In the case of smaller linear systems and larger m, the FFT algorithm could be expected to produce better performance results.

Fig. 6.12 Efficiency comparison using the FFT algorithm for varying P and N when m = 8.

Chapter 7

Conclusions

We have proposed a parallel algorithm for the solution of block circulant linear systems arising from acoustic radiation problems with rotationally symmetric boundary surfaces. A derivation of the linear system was given along with conditions for application of the algorithm. The algorithm takes advantage of the ScaLAPACK library and exploits the embarrassingly parallel nature of block DFTs within ScaLAPACK’s required data distributions. In addition, by exploiting the block circulant structure of the matrix in the context of the parallel algorithm, the memory requirements are reduced. The reduction in the memory requirements allows for the solution of larger block circulant linear systems. Because the size of the matrix directly correlates with the number of surface elements in the discretization, problems which require a finer discretization, i.e., higher frequency problems, can be explored. In addition, problems with larger overall structures can be investigated.

The behavior of the DFT and FFT algorithms was similar for large N. The experimental results show near linear speedup for varying problem sizes and that the speedups become more linear for increasingly large N. We also showed that the efficiency of the algorithm increases as a function of problem size. The theoretical analysis coupled with the experimental results showed that in both cases the algorithm becomes dominated by the ScaLAPACK linear system solve portion of the algorithm. Given the requirements of the problem, i.e., n ≫ m with m ≤ 30, it is found that for larger problems, the difference in the two algorithms is negligible. It has also been established that the block

DFT transformations can be performed within the ScaLAPACK data distribution, and that the necessary communications for the DFT transformations do not overwhelm the algorithm’s runtime.

In addition, because we developed an algorithm using a matrix multiplication

DFT approach, it can be applied to any rotationally symmetric structure. The parallel algorithm therefore permits the efficient computation of larger acoustic radiation problems with rotationally symmetric boundary surfaces. While small gains exist from choosing the FFT algorithm over the developed DFT algorithm, these gains are negligible given our assumptions on N and m. The FFT also places additional requirements on the values of m, i.e., m must be a power of 2. Indeed, for the assumption m ≤ 30, there are only four viable values of m, namely, 2, 4, 8, and 16. Nevertheless, small gains do exist, and, therefore, one avenue for further investigation is the development of a robust algorithm which uses FFTs within the context of using ScaLAPACK for the linear systems. If an elegant domain decomposition can be devised, and if a robust FFT algorithm, such as Bluestein's FFT algorithm [7, 8], can be fitted to the problem, the algorithm could be further improved.

Appendix

BEM Code

The modified code, in its most general form, has four different cases:

1. Sequential with no rotational symmetries.

2. Sequential with rotational symmetries.

3. Parallel with no rotational symmetries.

4. Parallel with rotational symmetries.

Therefore, in the main program, logic exists to direct the program flow through one of the four cases given above. There are five core functions which have been modified to support these cases. These are:

1. STATIC MULTIPOLE ARRAYS

2. COEFF MATRIX

3. SOURCE AMPLITUDES MODES

4. SOURCE POWER

5. MODAL RESISTANCE

Before describing each function individually, we first define some frequently used terminology. When using the term "distributed data structure", we are referring to each processor containing a portion of a global data structure. For example, assume we are given a matrix A and we have P processors. Instead of one processor containing all of the matrix A, the elements are split up, and each processor has a data structure which contains these portions of the matrix. We refer to the collection of these data structures as a "distributed data structure" and denote it as sub[A]. This is because when all processors combine their corresponding sub[A], we obtain the global data structure A.

A.1 STATIC MULTIPOLE ARRAYS

This function uses multipole expansions to approximate values which will end up populating the coefficient matrix. It attempts to speed up future runs by storing the approximated values in a file. The function initially checks for the existence of the file. If it is not there, the function proceeds to compute the approximations and create the file. If, however, a file containing the approximations exists, the function immediately returns, performing no computations.

A.1.1 Sequential

A.1.1.1 General Case

In the sequential case, the BEM code does not change with respect to the original code. The function generates the approximations and writes the data to a file or returns.

The pseudocode for this case is given by Algorithm A.1.

Algorithm A.1 Pseudocode for the STATIC MULTIPOLE ARRAYS general sequential case.
  if (Multipole data file exists) then
    return;
  else
    Compute multipole expansion approximations;
    Write multipole expansion data to file;
  end if

A.1.1.2 Rotationally Symmetric

Rotational symmetry plays no role in the sequential computation, and the function performs as it does in the general case (see Section A.1.1.1 and Algorithm A.1).

A.1.2 Parallel

A.1.2.1 General Case

The parallel code behaves differently than the sequential code. A distributed data structure, called sub[U], is created to store the data in a distributed setting. This data structure is a three-dimensional array. The first two dimensions vary with respect to the total number of acoustic elements in the BEM computation. The third dimension has a

fixed value of 5, which corresponds to the number of terms in the multipole expansion.

Each processor then, simultaneously, populates its data structure. When all processors have populated the corresponding sub[U] data structure, the function returns. The main difference in the computation is that no file is generated in the parallel case.

The multipole expansion data is instead held in memory distributed over the available processors. The pseudocode is given in Algorithm A.2.

Algorithm A.2 Pseudocode for the STATIC MULTIPOLE ARRAYS general parallel case. Define sub[U]n and sub[U]m to be the number of rows and columns of the processor's owned sub[U] data structure, respectively.
  for i = 1 to sub[U]n do
    for j = 1 to sub[U]m do
      Compute multipole expansion approximation;
      Assign sub[U](i, j);
    end for
  end for

A.1.2.2 Rotationally Symmetric

The multipole file is written out in this case. Due to the way the computation proceeds in the generation of the coefficient matrix (see Section A.2), the parallel rotationally symmetric case behaves in the same fashion as the sequential case (see Algorithm A.1). That is, a file containing the multipole approximations is written out if no such file already exists, or the function returns. Because the generation of the multipole expansion data is not time consuming, the benefits of computing the multipole expansions in parallel are lost in the communication back to a single processor for the writing of the file. Therefore, one processor computes the multipole expansion data and writes the data out to a file. All other processors wait for the processor performing the calculations and file creation to finish. Once the working processor finishes, the remaining processors continue with the computation.

A.2 COEFF MATRIX

The COEFF MATRIX routine populates the coefficient matrix A to be used in the computation of Ax = b. It now uses the multipole data which was computed in the

STATIC MULTIPOLE ARRAYS routine.

A.2.1 Sequential

A.2.1.1 General Case

This routine is the same as the original; it loops through each entry of the matrix, reading one row of the multipole data at a time, and populates the matrix. Note that the multipole data is only used if the distance between the points on the surface is sufficiently large; however, even when the multipole data is not used, the file is still read.

Algorithm A.3 Pseudocode for the COEFF MATRIX general sequential case. Define N to be the number of rows and columns of A.
  for i = 1 to N do
    Read row i of multipole data in from file;
    for j = 1 to N do
      if dist(pi, pj) > threshold then
        Compute using multipole data;
      else
        Compute without multipole data;
      end if
      Assign A(i, j);
    end for
  end for

A.2.1.2 Rotationally Symmetric

The coefficient matrix for this case is block circulant. As noted previously, block circulant matrices can be uniquely represented by their first block row. Therefore, only the first block row of the matrix is generated by this routine. It proceeds in the same manner as the general sequential version; however, it does not fill in the matrix beyond the first block row.

Algorithm A.4 Pseudocode for the COEFF MATRIX rotationally symmetric sequential case. Define N to be the number of rows and columns of A. In addition, define m to be the number of symmetries.
  for i = 1 to N/m do
    Read row i of multipole data from file;
    for j = 1 to N do
      if dist(pi, pj) > threshold then
        Compute using multipole data;
      else
        Compute without multipole data;
      end if
      Assign A(i, j);
    end for
  end for

A.2.2 Parallel

A.2.2.1 General Case

In this case, each processor contains a distributed data structure containing the global matrix A, and is denoted by sub[A]. Each processor’s data structure contains only a portion of the data contained in the entire coefficient matrix A. Each processor then populates its data structure, sub[A], simultaneously. The simultaneous population of the matrix is due to the sub[U] data structure populated in STATIC MULTIPOLE ARRAYS.

Without this distributed data structure, the file containing the multipole data would have to be opened and read sequentially.

A.2.2.2 Rotationally Symmetric

Again, the coefficient matrix can be uniquely defined by its first block row. The

first block row will contain m blocks each of order n. In this case, the number of processors, defined by P , is assumed to be some multiple of m. That is, P = cm for

c ∈ Z+. From here, m processor grids are defined; these are denoted by Gi for i = 1, . . . , m.

Algorithm A.5 Pseudocode for the COEFF MATRIX general parallel case. Define sub[A]N and sub[A]M to be the number of rows and columns of sub[A], respectively.
  for i = 1 to sub[A]N do
    for j = 1 to sub[A]M do
      if dist(pi, pj) > threshold then
        Compute using multipole data in sub[U];
      else
        Compute without multipole data;
      end if
      Assign sub[A](i, j);
    end for
  end for

Each Gi contains c processors and is of dimension √c × √c. If √c is not an integer, the closest rectangular grid is formed. In addition to defining the grids, each processor defines variables called pId and gId. The variable pId is the processor number, and gId identifies which of the m processor grids a processor is a part of; their existence is acknowledged only because they are used in the coefficient matrix generation. At this point, m processor grids have been defined. In addition, the first block row of

A is composed of m blocks of order n. Therefore, each of the m blocks in the first block row of A will be distributed onto a corresponding processor grid. That is, block

Ai is distributed over Gi for i = 1, . . . , m. In order to distribute Ai onto grid Gi for i = 1, . . . , m, the processors belonging to grid Gi must define a distributed data structure for the corresponding Ai. The distributed data structure is denoted by sub[Ai]. Note, only the processors which are a part of Gi contain the data structure sub[Ai]. That is, processors belonging to G1 use the distributed data structure sub[A1]; processors belonging to G2 use the distributed data structure sub[A2], and so on and so forth.

Each grid is populated simultaneously, but not the distributed data structures, sub[Ai], i = 1, . . . , m, within the grid. This is due to the file containing the multipole approximations. The file must be read sequentially, and only one row is read at a time. Therefore, the loops are of length n, which is the order of each Ai, i = 1, . . . , m. The computation proceeds as follows: for element Ai(j, k), j, k = 1, . . . , n, the function BCCMPT L INDX is called and returns the processor whose local data structure, sub[Ai], contains element Ai(j, k). In addition, the function returns the index into the local data structure, denoted by (lj, lk). Therefore, in iteration (j, k), sub[Ai](lj, lk) is populated, and this happens simultaneously for each grid. Algorithm A.6 shows the pseudocode for the routine.

Algorithm A.6 Pseudocode for the COEFF MATRIX rotationally symmetric parallel case. The variable aP denotes the processor whose data structure will be assigned in a given iteration. Note that the variable kG allows the grids to be populated in parallel.
  Define processor grids;
  for j = 1 to n do
    Read row of multipole data in from file;
    for k = 1 to n do
      kG = n ∗ gId + k;
      aP = processor containing AgId(j, kG);
      Compute index (lj, lkG) into sub[AgId] using global index (j, kG);
      if pId == aP then
        Assign sub[AgId](lj, lkG);
      end if
    end for
  end for
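For reference, the per-dimension arithmetic performed by a block cyclic index helper such as BCCMPT L INDX follows the standard mapping sketched below; this is an illustrative 0-based version, and the BEM code's 1-based Fortran conventions differ in the obvious way.

# Standard 1-D block cyclic mapping from a global index to (owning process, local index),
# assuming the distribution starts on process 0.
def block_cyclic_owner_and_local(g: int, nb: int, p: int):
    """Map a 0-based global index g to (owner, local) for block size nb over p processes."""
    block = g // nb                      # which block the index falls in
    owner = block % p                    # blocks are dealt out cyclically
    local = (block // p) * nb + g % nb   # position inside the owner's local array
    return owner, local

# Applying this to rows with the row block size and grid rows, and to columns with
# the column block size and grid columns, gives the 2-D mapping used for sub[Ai].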

A.3 SOURCE AMPLITUDES MODES

A.3.1 Sequential

A.3.1.1 General Case

The general sequential case makes no changes to the original routine. The right-hand side, b, is populated by a double for loop. It should be noted that, in general, there will be multiple right-hand sides. That is, b will not be a single column vector. Following this, the system Ax = b is solved using the LAPACK routine ZGESV. Algorithm A.7 gives the pseudocode.

Algorithm A.7 Pseudocode for the SOURCE AMPLITUDES MODES general sequential case. Let N and rhsn denote the number of rows and columns of b, respectively.
  for i = 1 to N do
    for j = 1 to rhsn do
      Assign b(i, j);
    end for
  end for
  Solve Ax = b using LAPACK routine ZGESV;

A.3.1.2 Rotationally Symmetric

In the initial section of the routine, the right-hand side, b, is populated from a simple double for loop. Following this, the solve of the system Ax = b is performed. The rotationally symmetric system solve has been discussed in detail in Section 3.2; therefore, the discussion in this section will be somewhat terse. The inversion formula for a block circulant matrix A is given by

A^{-1} = F_b^* \, \mathrm{diag}\{\tilde{A}_1^{-1}, \tilde{A}_2^{-1}, \ldots, \tilde{A}_m^{-1}\}\, F_b,    (A.1)

where Fb is the block Fourier matrix, and diag{Ã1^{-1}, Ã2^{-1}, . . . , Ãm^{-1}} is a block diagonal matrix. In the context of solving the linear system Ax = b, we obtain

\mathrm{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\}\, F_b^* x = F_b^* b.    (A.2)

Now, let X̃ be a block column vector constructed from the elements of the block diagonal matrix in (A.2). That is,

\tilde{X} = \begin{bmatrix} \tilde{A}_1 \\ \tilde{A}_2 \\ \tilde{A}_3 \\ \vdots \\ \tilde{A}_m \end{bmatrix}.    (A.3)

In addition, let X be the column vector of the first block row of A. We then have the relation X̃ = FbX; therefore, the elements of the block diagonal matrix are precisely the values obtained from computing the block DFT of the first block row of A. With these relations, the solve is computed by the following steps:

1. Compute b̃ = Fb* b.

2. Compute X̃ = Fb X.

3. Solve Ãj x̃j = b̃j, j = 1, . . . , m.

4. Compute x = Fb x̃.

The pseudocode for this case of the SOURCE AMPLITUDES MODES routine is given by Algorithm A.8.

Algorithm A.8 Pseudocode for the SOURCE AMPLITUDES MODES rotationally symmetric sequential case. Let N and rhsn denote the number of rows and columns of b, respectively. In addition, let m be the number of blocks in the first block row of A.
  for i = 1 to N do
    for j = 1 to rhsn do
      Assign b(i, j);
    end for
  end for
  Compute the inverse DFT of the right-hand side by b̃ = Fb* b;
  Compute the elements of diag{Ã1, Ã2, . . . , Ãm} by X̃ = Fb X;
  for k = 1 to m do
    Solve Ãk x̃k = b̃k using LAPACK routine ZGESV;
  end for
  Compute the DFT of the solution vector x̃ by x = Fb x̃;

A.3.2 Parallel

A.3.2.1 General Case

The general parallel case is very similar to the sequential general case. The only difference is that the global matrix A has been distributed over the processors and resides in the distributed data structure sub[A]. Recall that this data structure was populated by the COEFF MATRIX routine (see Section A.2.2.1). In addition, a distributed data structure, sub[b], is defined for the global right-hand side b. Each processor simultaneously populates its corresponding data structure. The pseudocode is given in Algorithm A.9. Note the existence of the variable x in line 6. This variable is only for clarity of presentation of the algorithm. The routine PZGESV overwrites the distributed data structure sub[b] with the result. In this way, there is no need to maintain a distributed data structure for the variable x.

Algorithm A.9 Pseudocode for the SOURCE AMPLITUDES MODES general parallel case. Let sub[b]n and sub[b]m denote the number of rows and columns of b, respectively.
  1: for i = 1 to sub[b]n do
  2:   for j = 1 to sub[b]m do
  3:     Assign sub[b](i, j);
  4:   end for
  5: end for
  6: Solve sub[A]x = sub[b] using ScaLAPACK routine PZGESV;

A.3.2.2 Rotationally Symmetric

With the exception of one additional operation, this routine performs the same operations as the preceding cases. That is, it populates the right-hand side and solves the linear system. The extra operation comes from moving the distributed solution vector into a different distributed format needed for a later parallel matrix multiplication. The parallel rotationally symmetric system solve is discussed in detail and is the main topic of this work; therefore, this section will not discuss the details of the solve. Rather, this section will detail the population of the right-hand side in a way which is amenable to the parallel block circulant system solve. Recall that the routine in Section A.2.2.2 defined m processor grids, Gi, for i = 1, . . . , m. In addition, the routine distributed each Ai onto grid Gi. In a similar fashion, this routine will block b into m blocks of corresponding size, denoted by bi, i = 1, . . . , m, and distribute each bi onto Gi for i = 1, . . . , m. Each bi is in C^{n×rhsn}, where n is the order of each block Ai, and rhsn denotes the number of right-hand sides, i.e., columns of b. In order to distribute each bi onto grid Gi for i = 1, . . . , m, a distributed data structure sub[bi] is defined. Note, only the processors which are part of Gi contain the data structure sub[bi]. That is, processors belonging to G1 use the distributed data structure sub[b1]; processors belonging to G2 use the distributed data structure sub[b2], and so on and so forth. All of the distributed data structures are populated simultaneously by looping over their corresponding distributed data structures. At this point, each Ai and bi reside on grid Gi for i = 1, . . . , m, and this is precisely the setting which is needed for the parallel block circulant system solve. Once the block circulant linear system has been solved, the solution will reside in each sub[bi] for i = 1, . . . , m. However, following this, a parallel matrix-vector multiplication will be performed using the solution vector. The multiplication is performed in the context of one large process grid containing all processors, and therefore the right-hand side, b, is distributed over the processor grid using a distributed data structure sub[b]. Since the solution resides on the distributed data structures sub[bi] for i = 1, . . . , m, the data needs to be communicated to the appropriate format for sub[b].

In order to perform this reorganization of data, the routine uses a double for loop to loop through the global b, computes which processor currently owns each entry and which processor needs it, and performs the communication. Once this reorganization is complete, the routine is finished.

A.4 SOURCE POWER

At this point in the program, the system Ax = b has been solved, and the resultant vector, x, has been obtained. Because LAPACK and ScaLAPACK overwrite b with the solution vector x, the data structures containing b now hold the solution x. Therefore, any data structures previously denoted by a b will be denoted by x. The matrix A and the right-hand side b are no longer needed (in terms of their initial values). This routine populates a matrix S with the intent of computing s = x∗Sx, in which x is the solution obtained from the SOURCE AMPLITUDES MODES routine.

Algorithm A.10 Pseudocode for the SOURCE AMPLITUDES MODES rotationally symmetric parallel case. The variables pId and gId denote the processor number and the grid the processor belongs to, respectively. Let sub[bgId]n and sub[bgId]m denote the number of rows and columns of b, respectively.
  for j = 1 to sub[bgId]n do
    for k = 1 to sub[bgId]m do
      Assign sub[bgId](j, k);
    end for
  end for
  Solve Ax = b using the block circulant solve (Algorithm 4.2);
  for j = 1 to N do
    for k = 1 to rhsn do
      SendProc = processor which has data b(j, k) in sub[bgId];
      RecvProc = processor which needs the b(j, k) data;
      Compute the index into sub[bgId]; denote it by (sj, sk);
      Compute the index into sub[b]; denote it by (lj, lk);
      if SendProc == RecvProc then
        sub[b](lj, lk) = sub[bgId](sj, sk);
      else
        if pId == SendProc then
          Send sub[bgId](sj, sk) to processor RecvProc;
        else if pId == RecvProc then
          Receive temp = sub[bSendProc](sj, sk) from processor SendProc;
          sub[b](lj, lk) = temp;
        end if
      end if
    end for
  end for

A.4.1 Sequential

A.4.1.1 General Case

The general sequential case simply populates the matrix; however, the method of population differs from the previous routines. Instead of populating the matrix by looping over each element in the matrix, the routine loops over the sources used in the overall computation for populating S. That is, for each source, it computes which element, S(i, j), uses that source, and adds the source’s contribution to S(i, j). In general, there can be multiple sources per element and, therefore, S(i, j) will be updated multiple times.

There are three different types of sources: simple, dipole, and a coupled simple and dipole source which will be called a tripole source. The contribution of each source type is done separately. That is, the simple source contributions are computed, followed by computation of the dipole sources, and finally by computation of the tripole sources. In the SOURCE POWER routine, there is a separate routine for each source type; however, the algorithmic idea for populating the matrix S is the same in all cases.

In addition, the matrix S is Hermitian, and so only the upper triangular portion is computed using the source contributions discussed above. After the upper triangular portion is populated, the routine fills in the second half of the matrix by copying the conjugate of the elements into the lower triangular portion of the matrix. Algorithm A.11 gives the pseudocode for the routine.

Algorithm A.11 Pseudocode for the SOURCE POWER general sequential case. Let N1, N2, and N3 be the number of simple, dipole, and tripole sources, respectively. Let N be the number of rows and columns of S.
  // Fill the upper triangular portion of S
  for l = 1 to 3 do
    for k = 1 to Nl do
      Let (i, j) be the element to which source k contributes;
      if i ≤ j then
        Update S(i, j);
      end if
    end for
  end for
  // S is Hermitian, copy the data
  for i = 2 to N do
    for j = 1 to i − 1 do
      S(i, j) = Conj(S(j, i));
    end for
  end for
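For reference, the conjugate-copy step at the end of Algorithm A.11 is equivalent to the following NumPy expression (illustrative only; the BEM code performs the copy element by element in Fortran).

import numpy as np

def complete_hermitian(S_upper: np.ndarray) -> np.ndarray:
    """Fill the strictly lower triangle of S from the conjugate of the upper triangle."""
    return np.triu(S_upper) + np.triu(S_upper, k=1).conj().T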

A.4.1.2 Rotationally Symmetric

In the rotationally symmetric case, the matrix is also block circulant. Because the matrix is block circulant, only the first block row of S is filled. Then, using the first block row of S, the matrix is filled. The fact that S is Hermitian is also used, but in this case, it is used only for the first block in the first block row of S. The pseudocode given in Algorithm A.12 is very similar to Algorithm A.11 except for a change in the bounds.

A.4.2 Parallel

A.4.2.1 General Case

Essentially, this routine populates the matrix S in parallel. It reuses the distributed data structure sub[A], which will now be denoted as sub[S], and populates it by having each processor simultaneously loop over its corresponding data structure.

However, as discussed in the sequential cases, the original routines loop over sources, not matrix elements. Therefore, this routine's computations proceed by taking a processor's local indices, (li, lj), into sub[S], converting the local indices into global matrix indices, (i, j), finding which sources are owned by S(i, j), and looping through these sources to compute their contributions to S(i, j). As in the sequential case, there are three types of sources and, therefore, there are three separate routines for computing their contributions. However, the overall algorithmic structure is the same.

Algorithm A.12 Pseudocode for the SOURCE POWER rotationally symmetric sequential case. Let N1, N2, and N3 be the number of simple, dipole, and tripole sources, respectively. Let m be the number of blocks in the first block row of S and N be the number of rows and columns of S.
  // Fill the first block row of S except for the lower triangular portion of the first block
  for l = 1 to 3 do
    for k = 1 to Nl do
      Let (i, j) be the element to which source k contributes;
      if i ≤ j and i ≤ N/m then
        Update S(i, j);
      end if
    end for
  end for
  // The first block of S is Hermitian, copy the data
  Let n = N/m;
  for i = 1 to n do
    for j = 1 to i do
      S(i, j) = Conj(S(j, i));
    end for
  end for
  // Fill the remainder of S knowing it is block circulant
  for k = 1 to m − 1 do
    for i = 1 to n do
      l = n ∗ k + i;
      for j = 1 to N do
        t = n ∗ k + j;
        if t > N then
          t = t − N;
        end if
        S(l, t) = S(i, j);
      end for
    end for
  end for

Again, the matrix S is Hermitian. In contrast to the sequential case, instead of computing only the upper triangular portion of S, all of S is computed using the source contributions. While this adds some extra computation, the computation is being done in parallel. If data were to be copied from the upper triangular section to the lower triangular section, a large number of communications would have to take place and would result in a bottleneck.

Algorithm A.13 Pseudocode for the SOURCE POWER general parallel case. Let m be the number of blocks in the first block row of S, and let sub[S]N and sub[S]M be the number of rows and columns of sub[S], respectively.

  for li = 1 to sub[S]N do
    for lj = 1 to sub[S]M do
      Compute global indices (i, j) corresponding to (li, lj);
      Let SourceList = Sources corresponding to S(i, j);
      for each source type t do
        for each source of type t in SourceList do
          Update sub[S](li, lj);
        end for
      end for
    end for
  end for

A.4.2.2 Rotationally Symmetric

The rotationally symmetric case is identical to the general parallel case in Section A.4.2.1 with one modification. Because the matrix S is block circulant, the global indices are modified to stay within the first block row of S when accessing the source list. For example, say a processor's local index corresponds to an entry residing in the

first block of the second block row. Knowing the matrix is block circulant, the second block row is only a circular shift of the first block row. This means that the first block of the second row is the last block of the first row. Therefore, the indices corresponding to the first block of the second row are modified to point to the last block of the first row.

In performing the computations this way, no communication between the processors is necessary to fill in the matrix S. The pseudocode for the algorithm, including the index modifications, is given by Algorithm A.14.

Algorithm A.14 Pseudocode for the SOURCE POWER rotationally symmetric parallel case. Let N be the number of rows and columns of S, m the number of blocks in the first block row of S, and n the order of each block. In addition, define sub[S]N and sub[S]M to be the number of rows and columns of sub[S], respectively.
for li = 1 to sub[S]N do
  for lj = 1 to sub[S]M do
    Compute global indices (i, j) corresponding to (li, lj);
    j = j + m ∗ n − n ∗ ⌊(i − 1)/n⌋;
    if j > N then
      j = j − N;
    end if
    i = mod(i − 1, n) + 1;
    Let SourceList = Sources corresponding to S(i, j);
    for each source type in SourceList do
      t = SourceType;
      for each source of type t in SourceList do
        Update sub[S](li, lj);
      end for
    end for
  end for
end for
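The index modification itself can be expressed compactly. The following sketch (using 0-based indexing, unlike the pseudocode above) remaps a global entry (i, j) of the block circulant matrix into the equivalent entry of the first block row; the function name is illustrative only.

def remap_to_first_block_row(i, j, m, n):
    # S has m blocks of order n, so N = m * n rows and columns.
    N = m * n
    block_row = i // n            # which block row the entry lies in
    jp = (j - n * block_row) % N  # undo the circular shift of that block row
    ip = i % n                    # corresponding row within the first block row
    return ip, jp

# e.g., remap_to_first_block_row(n, 0, m, n) == (0, (m - 1) * n):
# the first block of the second block row maps to the last block of the first,
# as described above.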

A.5 MODAL RESISTANCE

This routine is quite straightforward. Using the solution vector x obtained from the SOURCE AMPLITUDES MODES routine (see Section A.3) and the matrix S from the SOURCE POWER routine (see Section A.4), it computes s = x∗Sx. Because previous routines have already populated the needed data structures, this routine simply performs the required multiplications using LAPACK or ScaLAPACK.

A.5.1 Sequential

A.5.1.1 General Case

The sequential routine performs two multiplications and uses one intermediate data structure, W, to hold the result of the first multiplication, i.e., W = Sx. After computing the first multiplication, the second multiplication S = x∗W is performed, reusing the data structure S to hold the solution. The LAPACK routine ZGEMM is used for the multiplications. The ZGEMM routine performs matrix-matrix multiplication and is used here because, in general, the solution vector x contains multiple columns. For completeness, the pseudocode for this operation is given by Algorithm A.15.

Algorithm A.15 Pseudocode for the MODAL RESISTANCE general sequential case.
Compute W = Sx using LAPACK routine ZGEMM;
Compute S = x∗W using LAPACK routine ZGEMM;
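In NumPy terms, the two ZGEMM calls amount to the following sketch, where x may have several columns; the function name is illustrative only.

import numpy as np

def modal_resistance(S, x):
    W = S @ x               # first ZGEMM: W = S x
    return x.conj().T @ W   # second ZGEMM: x* W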

A.5.1.2 Rotationally Symmetric

This case behaves in exactly the same way as the general sequential case (see Section A.5.1.1).

A.5.2 Parallel

A.5.2.1 General Case

As in the sequential case, an additional data structure is required for the intermediate multiplication. Therefore, the distributed data structure sub[W] is defined. Because the distributed data structures needed for the parallel multiplications, i.e., sub[S] and sub[x], have already been populated, this routine simply uses ScaLAPACK to perform the parallel matrix multiplications. The routine calls PZGEMM to compute the multiplication W = Sx using the distributed data structures. Following the first multiplication, S = x∗W is computed, reusing the distributed data structure sub[S] to store the solution. Algorithm A.16 shows the pseudocode for the routine.

Algorithm A.16 Pseudocode for the MODAL RESISTANCE general parallel case.
Compute sub[W] = sub[S]sub[x] using ScaLAPACK routine PZGEMM;
Compute sub[S] = sub[x∗]sub[W] using ScaLAPACK routine PZGEMM;

A.5.2.2 Rotationally Symmetric

The parallel rotationally symmetric case is exactly the same as the general parallel case (see Section A.5.2.1 and Algorithm A.16).
