The Pennsylvania State University

The Graduate School

Department of Computer Science and Engineering

PARALLEL BOUNDARY ELEMENT SOLUTIONS OF BLOCK

CIRCULANT LINEAR SYSTEMS FOR ACOUSTIC RADIATION

PROBLEMS WITH ROTATIONALLY SYMMETRIC

BOUNDARY SURFACES

A Thesis in

Computer Science and Engineering

by

Kenneth D. Czuprynski

© 2012 Kenneth D. Czuprynski

Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

May 2012

The thesis of Kenneth D. Czuprynski was reviewed and approved* by the following:

Suzanne M. Shontz, Assistant Professor of Computer Science and Engineering, Thesis Adviser

Jesse L. Barlow, Professor of Computer Science and Engineering

John B. Fahnline, Assistant Professor of Acoustics

Raj Acharya, Professor of Computer Science and Engineering, Head of the Department of Computer Science and Engineering

*Signatures are on file in the Graduate School.

Abstract

Coupled finite element/boundary element (FE/BE) formulations are commonly used to solve structural-acoustic problems where a vibrating structure is idealized as being submerged in a fluid that extends to infinity in all directions. Typically in FE/BE formulations, the structural analysis is performed using the finite element method, and the acoustic analysis is performed using the boundary element method. In general, the problem is solved frequency by frequency, and the coefficient matrix for the boundary element analysis is fully populated, so little can be done to alleviate the storage and computational requirements. Because acoustic boundary element calculations require approximately six elements per wavelength to produce accurate solutions, the boundary element formulation is limited to relatively low frequencies. However, when the outer surface of the structure is rotationally symmetric, the system of linear equations becomes block circulant. We propose a parallel algorithm for distributed memory systems which takes advantage of the underlying concurrency of the inversion formula for block circulant matrices. By using the structure of the coefficient matrix in tandem with a distributed memory system setting, we show that the storage and computational requirements are substantially lessened.

Table of Contents

List of Figures

Chapter 1. Introduction
1.1 Acoustic Radiation Problems
1.2 Boundary Element Method
1.3 The Fourier Matrix and Fast Fourier Transform
1.4 Circulant Matrices

Chapter 2. Literature Review

Chapter 3. Problem Formulation
3.1 Coefficient Matrix Derivation
3.2 Block Circulant Inversion
3.3 Invertibility

Chapter 4. Parallel Solution Algorithm
4.1 Block DFT Algorithm
4.2 Block FFT Algorithm
4.3 System Solves
4.4 Parallel Algorithm

Chapter 5. Theoretical Timing Analysis
5.1 Parallel Linear System Solve
5.2 Block DFT using the DFT Algorithm
5.3 Block DFT Using the FFT Algorithm
5.4 Bounds

Chapter 6. Numerical Experiments
6.1 Experiment 1
6.2 Experiment 2
6.3 Numerical Results
6.3.1 Experiment 1
6.3.2 Experiment 2

Chapter 7. Conclusions

Appendix. BEM Code
A.1 STATIC MULTIPOLE ARRAYS
A.1.1 Sequential
A.1.1.1 General Case
A.1.1.2 Rotationally Symmetric
A.1.2 Parallel
A.1.2.1 General Case
A.1.2.2 Rotationally Symmetric
A.2 COEFF MATRIX
A.2.1 Sequential
A.2.1.1 General Case
A.2.1.2 Rotationally Symmetric
A.2.2 Parallel
A.2.2.1 General Case
A.2.2.2 Rotationally Symmetric
A.3 SOURCE AMPLITUDES MODES
A.3.1 Sequential
A.3.1.1 General Case
A.3.1.2 Rotationally Symmetric
A.3.2 Parallel
A.3.2.1 General Case
A.3.2.2 Rotationally Symmetric
A.4 SOURCE POWER
A.4.1 Sequential
A.4.1.1 General Case
A.4.1.2 Rotationally Symmetric
A.4.2 Parallel
A.4.2.1 General Case
A.4.2.2 Rotationally Symmetric
A.5 MODAL RESISTANCE
A.5.1 Sequential
A.5.1.1 General Case
A.5.2 Rotationally Symmetric
A.5.3 Parallel
A.5.3.1 General Case
A.5.3.2 Rotationally Symmetric

References

List of Figures

1.1 Radix 2 element interaction pattern obtained from [18].

3.1 A propeller with three times rotational symmetry [37].

3.2 A four times rotationally symmetric sketch of a propeller.

4.1 Initial data distribution assumed in the DFT computation for the case P = m = 4.

4.2 The DFT computation for the case P = m = 4. Each arrow indicates the communication of a processor's owned submatrix to a neighboring processor in the direction of the arrow.

4.3 Parallel block DFT data decomposition for P > m.

4.4 Parallel block DFT data decomposition and processor groupings for P > m.

4.5 Process illustrating the distributed FFT. Lines crossing to different processors indicate communication from left to right. Note the output is in reverse bit-reversed order relative to numbering starting at zero; that is, A1 is element 0, A2 is element 1, etc.

4.6 Processor grid creation for P = 16 and m = 4.

6.1 Runtime comparison using the DFT algorithm for varying P and N with m = 4.

6.2 Runtime comparison using the FFT algorithm for varying P and N with m = 4.

6.3 Speedups using the DFT algorithm for varying P and N with m = 4.

6.4 Speedups using the FFT algorithm for varying P and N with m = 4.

6.5 Efficiency using the DFT algorithm for varying N and P with m = 4.

6.6 Efficiency using the FFT algorithm for varying N and P with m = 4.

6.7 Runtime comparison using the DFT algorithm for varying P and N with m = 8.

6.8 Runtime comparison using the FFT algorithm for varying P and N with m = 8.

6.9 Speedup comparison using the DFT algorithm for varying P and N when m = 8.

6.10 Speedup comparison using the FFT algorithm for varying P and N when m = 8.

6.11 Efficiency comparison using the DFT algorithm for varying P and N when m = 8.

6.12 Efficiency comparison using the FFT algorithm for varying P and N when m = 8.

Chapter 1

Introduction

Coupled finite element/boundary element (FE/BE) formulations are commonly used to solve structural-acoustic problems where a vibrating structure is idealized as being submerged in a fluid that extends to infinity in all directions. Typically in FE/BE formulations, the structural analysis is performed using the finite element method, and the acoustic analysis is performed using the boundary element method (BEM). The boundary element formulation is advantageous for the acoustic radiation problem because only the outer surface of the structure in contact with the acoustic medium is discretized. This formulation also allows us to neglect meshing the infinite fluid exterior to the structure, as would be required if the finite element method were used instead.

Using the BEM, we compute the radiated sound field of a vibrating structure Ω ⊂ ℝ³. The main obstacle in computing the sound radiation is solving the linear system of equations to enforce the specified boundary conditions. In the context of the BEM, this requires the solution of a dense, complex linear system. In general, the problem is solved frequency by frequency, and the coefficient matrix for the boundary element analysis is fully populated and exhibits no exploitable structure. The size, N², of the coefficient matrix is directly correlated with the level of discretization, N, used for the surface in question. Because acoustic boundary element calculations require approximately six elements per wavelength to produce accurate solutions, the boundary element formulation is limited to relatively low frequencies. For high frequency problems, and for problems which involve large and/or complex surfaces, these matrices are large, dense, and unstructured; therefore, there is little which can be done to alleviate the storage and computational requirements. Iterative solvers have been investigated [4, 5, 28] and are a natural choice for large problems because the cost of direct solvers can become prohibitive. While the computational requirements can be lessened by iterative methods, the storage requirements can still present a problem. One obvious solution is to perform the solve in a distributed memory parallel setting. A distributed memory parallel algorithm distributes the workload and allows the storage of the matrix to be split between many individual systems with local memories, thereby increasing the total available memory. In addition, because linear systems are ubiquitous throughout scientific computation, libraries exist for their efficient parallel solution. In particular, because the matrix is dense, Scalable LAPACK (ScaLAPACK) [6] is a favored choice.

While in general these matrices exhibit no exploitable structure, when the boundary surface is rotationally symmetric, the coefficient matrix is block circulant. Circulant matrices are defined by each row being a circular shift of the row above it. One property of circulant matrices is that they are all diagonalizable by the Fourier matrix. Therefore, the Discrete or Fast Fourier Transform (D/FFT) can be used in the solution of the system. These results generalize to the block case and can be used in the solution of block circulant linear systems arising from acoustic radiation problems involving rotationally symmetric boundary surfaces. In addition, the inversion formula for block circulant matrices is highly amenable to parallel computation.

We propose an algorithm for distributed memory systems which takes advantage of the underlying concurrency of the inversion formula for block circulant matrices. By using the structure of the coefficient matrix in tandem with a distributed system setting, the storage and computational limitations are substantially lessened. Therefore, the algorithm allows larger and higher frequency acoustic radiation problems to be explored.

1.1 Acoustic Radiation Problems

The goal is to compute the radiated sound field due to a vibrating structure Ω ⊂ ℝ³ subject to given boundary conditions. The governing partial differential equation (PDE) for acoustic radiation problems is the Helmholtz equation, i.e.,

\[ \left( \nabla^2 - k^2 \right) u(p) = 0, \qquad p \in \Omega_+, \qquad (1.1) \]

where ∇² is the Laplacian, k = ω/c is the wave number, ω is the angular frequency, and c is the speed of sound in the chosen medium. Ω₊ = ℝ³ \ {Ω} denotes the region exterior to Ω. In structural acoustics problems, it is common for the velocity distribution over the boundary of Ω, denoted by ∂Ω, to be specified. This equates to the Neumann boundary condition

\[ \frac{\partial u(p)}{\partial n_p} = f(p), \qquad p \in \partial\Omega, \qquad (1.2) \]

where ∂/∂n_p denotes differentiation in the direction of the outward normal at p ∈ ∂Ω. In addition, to ensure all radiated waves are outgoing, the Sommerfeld radiation condition

\[ \lim_{r \to \infty} r \left( \frac{\partial u(p)}{\partial r} - i k\, u(p) \right) = 0 \qquad (1.3) \]

is enforced, where r is the distance of p from a fixed origin. Therefore, in order to solve for the radiated sound field due to Ω, a solution to the Helmholtz equation (1.1), subject to equations (1.2) and (1.3), must be found.

1.2 Boundary Element Method

The boundary element method is an algorithm for the numerical solution of PDEs which have an equivalent boundary integral representation. The BEM reformulates the PDE into an equivalent boundary integral equation (BIE), which is then solved numerically. The benefit of the formulation is that it reduces the problem to one over the boundary. However, because the BEM requires an equivalent BIE formulation, if the PDE cannot be represented as an equivalent BIE, the BEM cannot be used. The remainder of the section will outline the BEM within the context of an acoustic radiation problem.

Consider a vibrating structure Ω ⊂ ℝ³. The Helmholtz equation is the governing PDE for the radiated sound field produced by Ω and is given by (1.1). A standard boundary integral formulation of (1.1) yields the following equations

\[ \frac{1}{4\pi} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) = u(p), \qquad p \in \Omega_+ \qquad (1.4) \]

and

\[ \frac{1}{2\pi} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) = u(p), \qquad p \in \partial\Omega, \qquad (1.5) \]

where G(p, q) is the Green's function, which can loosely be thought of as the effect the point q has on the point p. In the context of an acoustic radiation problem, the Green's function corresponds to the fundamental solution of the Helmholtz equation and is given by G(p, q) = e^{ik|p−q|}/|p − q|, in which |p − q| denotes the Euclidean distance between the points p and q. A solution for u in the exterior domain with respect to the points on the boundary is provided by (1.4). Therefore, if the quantities u and ∂u/∂n_q are known over the boundary, the solution for the points in the exterior can be easily computed. In addition, (1.5) provides a means of solving for the aforementioned quantities. However, by applying the Fredholm alternative to (1.5), it is found that the solutions are not unique for all wave numbers k, and thus an alternative formulation is required [34]. Burton and Miller [9] showed how a unique solution can be derived. Differentiating (1.5) in the direction of the outward normal yields

\[ \frac{1}{2\pi} \frac{\partial}{\partial n_p} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) = \frac{\partial u(p)}{\partial n_p}, \qquad p \in \partial\Omega. \qquad (1.6) \]

Then constructing a linear combination of equations (1.5) and (1.6) using a purely imaginary coupling coefficient, β, produces a modified BIE formulation with a unique solution.

The formulation is given by

\[ \frac{1}{2\pi} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) \;+\; \beta\, \frac{1}{2\pi} \frac{\partial}{\partial n_p} \int_{\partial\Omega} \left( u(q) \frac{\partial G(p,q)}{\partial n_q} - G(p,q) \frac{\partial u(q)}{\partial n_q} \right) d(\partial\Omega) = u(p) + \beta \frac{\partial u(p)}{\partial n_p}. \qquad (1.7) \]

Assuming a Neumann boundary condition, (1.7) can be rearranged as follows:

\[ \int_{\partial\Omega} u(q) \left( \beta \frac{\partial^2 G(p,q)}{\partial n_q \partial n_p} + \frac{\partial G(p,q)}{\partial n_q} \right) d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} \frac{\partial u(q)}{\partial n_q} \left( G(p,q) + \beta \frac{\partial G(p,q)}{\partial n_p} \right) d(\partial\Omega) + 2\pi\beta \frac{\partial u(p)}{\partial n_p}, \qquad p \in \partial\Omega. \qquad (1.8) \]

Note, in the case of a Dirichlet boundary condition, the normal derivative ∂u/∂n can be solved for by rearranging (1.8). Once u(p) has been solved for over the boundary, the solution for all points in the exterior can be obtained. Therefore, a means for numerically solving equation (1.8) must be devised. For notational convenience, let v(q) = ∂u(q)/∂n_q, and redefine portions of both integrands as

\[ T(p,q) = \beta \frac{\partial^2 G(p,q)}{\partial n_q \partial n_p} + \frac{\partial G(p,q)}{\partial n_q} \qquad (1.9) \]

and

\[ H(p,q) = G(p,q) + \beta \frac{\partial G(p,q)}{\partial n_p}. \qquad (1.10) \]

Equation (1.8) becomes

\[ \int_{\partial\Omega} u(q)\, T(p,q)\, d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} v(q)\, H(p,q)\, d(\partial\Omega) + 2\pi\beta\, v(p), \qquad p \in \partial\Omega. \qquad (1.11) \]

The next step in the BEM is to discretize the boundary surface, ∂Ω, into smaller quadrilateral or triangular surface elements. After the discretization, the boundary can be represented as ∂Ω = ∂Ω₁ ∪ ∂Ω₂ ∪ · · · ∪ ∂Ω_N, where ∂Ω_i represents the i-th surface element in the discretization of ∂Ω and ∂Ω_i ∩ ∂Ω_j = ∅ for i ≠ j. Equation (1.11) can then be represented as

\[ \sum_{i=1}^{N} \left[ \int_{\partial\Omega_i} u(q)\, T(p,q)\, d(\partial\Omega_i) \right] - 2\pi u(p) = \sum_{i=1}^{N} \left[ \int_{\partial\Omega_i} v(q)\, H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v(p), \qquad p \in \partial\Omega. \qquad (1.12) \]

The most straightforward approach to numerically solving equation (1.12) is to assume u(p) and v(p) are constant along each surface element, ∂Ω_i, i = 1, . . . , N. Therefore, let u(p) ≈ u_j and v(p) ≈ v_j for p ∈ ∂Ω_j, j = 1, . . . , N. Under this assumption, equation (1.12) can be decomposed into N equations, i.e., one equation for each surface element; that is,

\[ \sum_{i=1}^{N} u_i \left[ \int_{\partial\Omega_i} T(p,q)\, d(\partial\Omega_i) \right] - 2\pi u_j = \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_j, \qquad p \in \partial\Omega_j. \qquad (1.13) \]

Equation (1.13) yields a solution for the j-th surface element of the boundary. The boundary is constructed of N surface elements; therefore, there are N equations and N unknowns total. Using this, equation (1.13) can more concisely be expressed in matrix notation. Let

\[ M = \begin{bmatrix} \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \\ \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \\ \vdots & \vdots & \ddots & \vdots \\ \int_{\partial\Omega_1} T(p,q)\, d(\partial\Omega_1) & \int_{\partial\Omega_2} T(p,q)\, d(\partial\Omega_2) & \cdots & \int_{\partial\Omega_N} T(p,q)\, d(\partial\Omega_N) \end{bmatrix}, \]

where the integrals in row j are evaluated at the collocation point p ∈ ∂Ω_j. Similarly, let the column vector b represent the right-hand side; that is,

\[ b = \begin{bmatrix} \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_1 \\ \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_2 \\ \vdots \\ \sum_{i=1}^{N} v_i \left[ \int_{\partial\Omega_i} H(p,q)\, d(\partial\Omega_i) \right] + 2\pi\beta\, v_N \end{bmatrix}. \]

With a Neumann boundary condition, each v_i, i = 1, . . . , N, is known, and the integrals can be computed via numerical quadrature. Therefore, the matrix M and the vector b are known quantities. Using the new quantities, the linear system

\[ (M - 2\pi I)\, u = b, \qquad p \in \partial\Omega, \qquad (1.14) \]

can be used to solve for the approximation of u over the boundary. Once we have an approximate solution for u over the surface, (1.4) can be used to solve for u in the exterior.
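To make the constant-element structure of (1.13) and (1.14) concrete, the following is a minimal Python sketch, not the thesis code (the production routines are listed in the Appendix). The element-integration routines `integrate_T` and `integrate_H` are hypothetical placeholders standing in for the numerical quadrature mentioned above.

```python
import numpy as np

def assemble_and_solve(N, v, integrate_T, integrate_H, beta):
    """Constant-element collocation sketch of (1.13)-(1.14).

    v            : known normal velocities v_1, ..., v_N (Neumann data)
    integrate_T  : hypothetical callable (j, i) -> integral of T(p_j, q) over element i
    integrate_H  : hypothetical callable (j, i) -> integral of H(p_j, q) over element i
    beta         : purely imaginary coupling coefficient
    """
    M = np.array([[integrate_T(j, i) for i in range(N)] for j in range(N)],
                 dtype=complex)
    H = np.array([[integrate_H(j, i) for i in range(N)] for j in range(N)],
                 dtype=complex)
    b = H @ v + 2.0 * np.pi * beta * v        # right-hand side of (1.13)
    A = M - 2.0 * np.pi * np.eye(N)           # coefficient matrix of (1.14)
    return np.linalg.solve(A, b)              # surface approximation u_1, ..., u_N
```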

It is difficult to precisely enforce the boundary conditions for the surface velocity at edges and corners when the basis functions are constructed using surface distributions of simple and dipole sources, as they are in Burton and Miller’s standard implementation.

To avoid this difficulty, it is possible to rewrite the solution in terms of surface-averaged quantities instead, which is common in acoustics. For example, surface-averaged pressures and volume velocities are commonly used in lumped parameter representations of transducers. Since the goal is no longer to match the boundary conditions on a point-by-point basis, it becomes permissible to simplify the solution by constructing the basis functions from discrete sources rather than distributions of sources. Using surface-averaged pressures and volume velocities as variables can also be shown to produce a solution that converges with mesh density, unlike the standard formulation, which can produce a less accurate solution as the mesh is refined. The solution is then derived in terms of source amplitudes rather than physical quantities, such as pressure or velocity.

For this type of indirect solution, an approach similar to Burton and Miller's can be used to prevent nonexistence/nonuniqueness difficulties. A hybrid "tripole" source type is created from a simple and a dipole source with a complex-valued coupling coefficient, as is discussed by Hwang and Chang [19]. The numerical implementation discussed in this thesis is based on an indirect solution using tripole sources, but the basic formulation shares many characteristics with the standard Burton and Miller approach discussed previously.

1.3 The Fourier Matrix and Fast Fourier Transform

The Fourier matrix is given by

\[ F = \frac{1}{\sqrt{n}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega_n^1 & \omega_n^2 & \cdots & \omega_n^{n-1} \\ 1 & \omega_n^2 & \omega_n^4 & \cdots & \omega_n^{2(n-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega_n^{n-1} & \omega_n^{2(n-1)} & \cdots & \omega_n^{(n-1)(n-1)} \end{bmatrix}, \qquad (1.15) \]

where ω_n = e^{i2π/n}, i = √−1, and normalizing by 1/√n makes F unitary. The discrete Fourier transform (DFT) is defined as a matrix vector multiplication involving the Fourier matrix. That is,

\[ y = F x. \qquad (1.16) \]

The vector y is called the DFT of x. Similarly, the inverse discrete Fourier transform (IDFT) of x is given by

\[ y = F^{-1} x. \qquad (1.17) \]

However, because F has been defined to be unitary, (1.17) becomes

\[ y = F^* x. \qquad (1.18) \]

The Fourier matrix is highly structured, and this structure can be used to compute the DFT. The improved method of computing the DFT is called the Fast Fourier transform (FFT) and was first introduced by Cooley and Tukey [12]. It was shown that for vectors with n = 2^h elements, h ∈ ℤ⁺, the DFT can be computed in O(n log n). Over the years, the method has been extended to handle vectors with an arbitrary number of elements; a comprehensive overview of these can be found in [11, 26]. This thesis uses the Cooley and Tukey version of the algorithm, also now termed the radix-2 FFT. We thus now overview the radix-2 algorithm.

Assuming the first column and first row are indexed by 0, consider the element in the k-th row and the j-th column of the Fourier matrix, which is given by ω_n^{kj} = e^{i2πkj/n}. Note then that each element is periodic in n. This can readily be seen by using Euler's formula. Applying Euler's formula, we have

\[ \omega_n^{kj} = \cos\!\left( 2\pi \frac{kj}{n} \right) + i \sin\!\left( 2\pi \frac{kj}{n} \right). \qquad (1.19) \]

Because sin and cos both have period 2π, by (1.19), if kj ≥ n, the elements begin to repeat. It follows that each element in the Fourier matrix can be represented by ω_n^k for k = 0, . . . , n − 1. For example, consider the four-by-four Fourier matrix

\[ F = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ 1 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{bmatrix}. \qquad (1.20) \]

By the periodicity of the elements, (1.20) becomes

\[ F = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & 1 & \omega_4^2 \\ 1 & \omega_4^3 & \omega_4^2 & \omega_4^1 \end{bmatrix}. \qquad (1.21) \]

The FFT algorithm uses properties of ω coupled with a divide and conquer strategy.

The following derivation relies heavily on [11]; we follow their derivation closely.

Recall that n = 2^h for h ∈ ℤ⁺, and consider the operation y = F x. Expanding the matrix vector product gives

\[ y_k = \sum_{j=0}^{n-1} x_j \omega_n^{jk}, \qquad k = 0, \ldots, n-1. \qquad (1.22) \]

Equation (1.22) can be split into two summations: one containing all of the even terms, and one containing all of the odd terms, i.e.,

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_n^{2jk} + \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_n^{(2j+1)k}, \qquad k = 0, \ldots, n-1. \qquad (1.23) \]

A ω_n^k term in the second summation can be pulled out of the summation, i.e.,

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_n^{2jk} + \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_n^{2jk}, \qquad k = 0, \ldots, n-1. \qquad (1.24) \]

Using the fact that ω_n² = ω_{n/2}, (1.24) becomes

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} + \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, n-1. \qquad (1.25) \]

The next observation to make is that ω_{n/2}^{(k+n/2)j} = ω_{n/2}^{kj} for k = 0, . . . , n/2 − 1. That is, because ω_{n/2} has a smaller period, the elements begin to repeat sooner, and k, in turn, need not go beyond n/2 − 1. Therefore, (1.25) becomes

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} + \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \qquad (1.26) \]

Looking more closely, each summation represents a DFT of length n/2. Therefore, a DFT of length n can be broken into two DFTs, each half the size of the previous DFT. However, (1.26) contains only the first n/2 terms of y. Computing the remaining terms yields

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{j(k+n/2)} + \omega_n^{k+n/2} \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{j(k+n/2)}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \qquad (1.27) \]

We then obtain

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} \omega_{n/2}^{jn/2} + \omega_n^{k+n/2} \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk} \omega_{n/2}^{jn/2}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \qquad (1.28) \]

Because ω_{n/2}^{jn/2} = 1 and ω_n^{k+n/2} = −ω_n^k, (1.28) becomes

\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} - \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \qquad (1.29) \]

Therefore, the entire vector y can be obtained by

\[ y_k = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} + \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \tfrac{n}{2}-1, \qquad (1.30) \]
\[ y_{k+n/2} = \sum_{j=0}^{n/2-1} x_{2j}\, \omega_{n/2}^{jk} - \omega_n^k \sum_{j=0}^{n/2-1} x_{2j+1}\, \omega_{n/2}^{jk}, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \]

Let s_j = x_{2j} and t_j = x_{2j+1} for j = 0, . . . , n/2 − 1; that is, s is the vector containing all the even elements of x, and t is the vector containing all of its odd elements. Then (1.30) may be written as

\[ [F_n x]_k = \left[ F_{n/2}\, s \right]_k + \omega_n^k \left[ F_{n/2}\, t \right]_k, \qquad k = 0, \ldots, \tfrac{n}{2}-1, \qquad (1.31) \]
\[ [F_n x]_{k+n/2} = \left[ F_{n/2}\, s \right]_k - \omega_n^k \left[ F_{n/2}\, t \right]_k, \qquad k = 0, \ldots, \tfrac{n}{2}-1. \]

From (1.31), the recursive nature of the algorithm should be clear. The DFT of a vector can be split into two DFTs of half the size. We can proceed in computing F_{n/2} s and F_{n/2} t as if it were the first time, and proceed as above. Algorithm 1.1 gives pseudocode for the algorithm.

Algorithm 1.1 Radix-2 FFT pseudocode.
1: Y = Radix-2FFT(X, n)
2: if n == 1 then
3:   return X;
4: else
5:   s = Radix-2FFT(Even(X), n/2);
6:   t = Radix-2FFT(Odd(X), n/2);
7:   for k = 0 to n/2 − 1 do
8:     Y_k = s_k + ω_n^k t_k;
9:     Y_{k+n/2} = s_k − ω_n^k t_k;
10:   end for
11: end if
12: return Y;
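For readers who prefer executable code, the following is a small Python transcription of Algorithm 1.1; it is an illustrative sketch rather than the thesis implementation. It uses the root ω_n = e^{i2π/n} of (1.15) but omits the 1/√n normalization, so its output equals n times numpy's inverse FFT.

```python
import numpy as np

def radix2_fft(x):
    """Recursive radix-2 FFT following Algorithm 1.1.

    Uses omega_n = exp(i*2*pi/n) as in (1.15) with no 1/sqrt(n) scaling,
    so the result equals n * numpy.fft.ifft(x).  len(x) must be a power of 2.
    """
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    s = radix2_fft(x[0::2])                 # DFT of the even-indexed elements
    t = radix2_fft(x[1::2])                 # DFT of the odd-indexed elements
    k = np.arange(n // 2)
    w = np.exp(2j * np.pi * k / n)          # twiddle factors omega_n**k
    return np.concatenate([s + w * t, s - w * t])   # the two halves of (1.30)/(1.31)

if __name__ == "__main__":
    x = np.random.default_rng(0).standard_normal(16)
    print(np.allclose(radix2_fft(x), 16 * np.fft.ifft(x)))   # expected: True
```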

Algorithm 1.1 follows nicely from the derived mathematics; however, the recursion can be unrolled into an iterative format which will later facilitate the explanation of our parallel algorithm. The algorithm can be found in [24], and our explanation follows their discussion closely.

Algorithm 1.2 Iterative Radix-2 FFT pseudocode as presented in [24].
1: Y = Radix-2FFT(X, Y, n)
2: r = log n;
3: R = X;
4: for m = 0 to r − 1 do
5:   S = R;
6:   for i = 0 to n − 1 do
7:     // Let (b_0 b_1 . . . b_{r−1}) be the binary representation of i
8:     j = (b_0 . . . b_{m−1} 0 b_{m+1} . . . b_{r−1});
9:     k = (b_0 . . . b_{m−1} 1 b_{m+1} . . . b_{r−1});
10:     r = (b_m b_{m−1} . . . b_0 0 . . . 0);
11:     R_i = S_j + S_k ω_n^r;
12:   end for
13: end for
14: Y = R;

Algorithm 1.2 is the iterative version of Algorithm 1.1. Each iteration of the outer loop (line 4) represents one level of the recursion, starting with the deepest level. At each level of recursion, the output vector is updated by two entries of the given input vector and a multiple of the factor ω (lines 8 and 9 of Algorithm 1.1 and line 11 of Algorithm 1.2). Algorithm 1.1 uses the input to the function at each level of recursion to update the output vector, whereas Algorithm 1.2 uses binary representations of the index being modified.

The most relevant property to notice, with respect to the parallel algorithm, is the pattern of interaction between different elements of the input vector. Figure 1.1 shows which elements in the input vector, denoted x, are used in computing each element of the output vector, denoted X, for a vector of length n = 16.

Fig. 1.1 Radix 2 element interaction pattern obtained from [18].

In order to solidify this notion and to clarify the meaning behind Figure 1.1, consider the transformation of x(0). The elements of the initial input vector involved in the transformation of x(0) are: x(0), x(8), followed by modified versions of x(4), x(2), and x(1). Similarly, each element of the input vector in the diagram can be traced to see the elements of the initial vector involved in each computation.

A final note about FFTs is the ordering of the output. When the algorithm is run in place, such that it overwrites the array containing the initial data, the output is in bit-reversed order. This can be seen in Figure 1.1. For another example, let n = 8, and consider the computation x = F₈x, where the vector x is overwritten. This yields

\[ \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} \longmapsto \begin{bmatrix} x_0 \\ x_4 \\ x_2 \\ x_6 \\ x_1 \\ x_5 \\ x_3 \\ x_7 \end{bmatrix}. \]

The indices are converted to binary, and the bit string is reversed before being converted back into decimal. In the above example, consider the indices one and four, i.e., (1)₁₀ = (001)₂, and flipping the bit string yields (100)₂ = (4)₁₀. This means that data migrates to bit-reversed order when the FFT is done in place.
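A short Python illustration of this reordering (again an illustration, not thesis code): reversing the h-bit binary representation of each index reproduces the permutation shown above for n = 8.

```python
def bit_reverse(i, h):
    """Reverse the h-bit binary representation of index i."""
    out = 0
    for _ in range(h):
        out = (out << 1) | (i & 1)   # shift the result left, append the lowest bit of i
        i >>= 1
    return out

n, h = 8, 3
print([bit_reverse(i, h) for i in range(n)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```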

1.4 Circulant Matrices

Circulant matrices are a subset of Toeplitz matrices which have the added property that each row is a circular shift of the previous row. The matrix C is circulant if it has the form

\[ C = \begin{bmatrix} c_1 & c_2 & c_3 & \cdots & c_n \\ c_n & c_1 & c_2 & \cdots & c_{n-1} \\ c_{n-1} & c_n & c_1 & \cdots & c_{n-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_2 & c_3 & c_4 & \cdots & c_1 \end{bmatrix}. \]

Matrices of this form can be uniquely represented by their first row and will be denoted by C = circ(c_1, c_2, . . . , c_n).

A thorough treatment of circulant matrices is given in [13]. The important property of circulant matrices that is used heavily throughout this thesis concerns the eigenvalues and eigenvectors of circulant matrices. Let v = [c_1 c_2 c_3 . . . c_n]^T be the column vector constructed from the first row of a circulant matrix C. Then the eigenvalues of C are given by

\[ \lambda = F v, \qquad (1.32) \]

where F is the unitary Fourier matrix [13]. That is, the discrete Fourier transform (DFT) of the first row of C yields the eigenvalues of C. Further, the eigenvectors of a circulant matrix C are given by the columns of the Fourier matrix of appropriate dimension. Thus, C has the eigenvalue decomposition

\[ C = F^* D F, \qquad (1.33) \]

where F is again the Fourier matrix, and D is the diagonal matrix whose elements are the eigenvalues of C, i.e., D = diag(λ). This means that every circulant matrix of the same dimension has the same eigenvectors, and that the matrix C is given by

\[ C = F^* \operatorname{diag}(\lambda)\, F. \qquad (1.34) \]

With this decomposition, a formulation for the inversion of C can easily be obtained. The inverse of C is then given by

\[ C^{-1} = F \operatorname{diag}(\lambda)^{-1} F^*. \qquad (1.35) \]

This formulation can then be used to solve a linear system. Consider the linear system

\[ C x = b. \qquad (1.36) \]

Left multiplication by C^{-1} yields

\[ x = C^{-1} b. \qquad (1.37) \]

Now, substituting for the definition of C^{-1} given by (1.35) yields

\[ x = F \operatorname{diag}(\lambda)^{-1} F^* b. \qquad (1.38) \]

Rearranging gives

\[ \operatorname{diag}(\lambda)\, F^* x = F^* b. \qquad (1.39) \]

Let x̃ = F^* x and b̃ = F^* b; then (1.39) becomes

\[ \operatorname{diag}(\lambda)\, \tilde{x} = \tilde{b}, \qquad (1.40) \]

whose solution is trivial. Therefore, the solution of a linear system equates to computing three DFTs and a backsolve involving a diagonal matrix. The steps are:

1. Compute λ = Fv.

2. Compute b̃ = F*b.

3. Solve diag(λ)x̃ = b̃.

4. Compute x = Fx̃.

This formulation is advantageous because the most expensive operation needed is the computation of the DFT, which, in its crudest form, is a matrix vector multiplication and is thus O(n²). However, if permissible, the fast Fourier transform (FFT) can be used in place of the DFT, and the computation becomes O(n log n).
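As a concrete illustration of these four steps, the following numpy sketch (not the thesis code) solves a circulant system. numpy's fft/ifft pair absorbs the unitary 1/√n factors and the conjugation, so the conventions differ slightly from the text, but the computed solution is the same.

```python
import numpy as np

def circulant_from_first_row(c):
    """Build C with first row c, each row a circular right shift of the one above."""
    n = len(c)
    return np.array([np.roll(c, k) for k in range(n)])

def solve_circulant(c, b):
    """Solve C x = b via the DFT, following steps 1-4 of Section 1.4."""
    first_col = np.roll(np.asarray(c)[::-1], 1)   # first column of C: c_1, c_n, ..., c_2
    lam = np.fft.fft(first_col)                   # eigenvalues of C
    b_tilde = np.fft.fft(b)                       # transformed right-hand side
    x_tilde = b_tilde / lam                       # diagonal solve
    return np.fft.ifft(x_tilde)                   # transform back

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c, b = rng.standard_normal(8), rng.standard_normal(8)
    C = circulant_from_first_row(c)
    print(np.allclose(C @ solve_circulant(c, b), b))   # expected: True
```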

Chapter 2

Literature Review

Circulant matrices are a desirable structure in computation because of their relation to the Fast Fourier transform (FFT). Therefore, many variations of circulant matrices have appeared throughout the literature and in a wide variety of contexts.

These range from the solution of circulant tridiagonal and banded systems [32, 16, 15] to effective preconditioners [25], all of which exploit the computational relation to the FFT.

We are concerned with the solution of linear systems involving block circulant matrices and assume the blocks in the matrix themselves are dense and contain no additional structure. The desirable properties extend to the block case as well; namely, block circulant matrices are block diagonalizable by the block Fourier matrix. The generalization to the block case, however, means that the inversion/solution formula must be extended. We first note that every block circulant matrix (BCM) can be mapped to an equivalent block matrix with circulant blocks (CBM). This can be accomplished by multiplying by the appropriate permutation matrices. Therefore, algorithms for solving

BCMs and CBMs are equivalent.

Within engineering, when problems with periodicity properties are considered, block circulant matrices arise in many contexts. These usually result when such periodic problems are solved by means of integral equations, which includes the BEM. Using the method of fundamental solutions [17], block circulant matrices in the contexts of axisymmetric problems in potential theory [21], as well as axisymmetric harmonic and biharmonic [38], linear elasticity [23, 22], and heat conduction problems [36] have been investigated. In addition, scattering and radiation problems in electromagnetics have taken advantage of block circulant matrices for a variety of integral equation techniques

[33, 30, 14, 20] including the BEM [40]. With respect to acoustics, a National Physical

Laboratory tech report discussed some properties of rotationally symmetric problems for the BEM as applied to the Helmholtz equation [42].

Just as circulant matrices are a subset of Toeplitz matrices, block circulant matrices are a subset of block Toeplitz matrices. Therefore, it is not surprising that one of the first inversion algorithms applied to block circulant matrices was an inversion algorithm for block Toeplitz matrices [2]. Closed form solutions for the inversion of block circulant matrices were formalized in [27] and presented again more concisely in [41]. The sequential inversion formula shows that a BCM, A, has the decomposition A = F_b^* D F_b, in which F_b represents the block Fourier matrix, and D represents a block diagonal matrix. The blocks along the diagonal are obtained by computing the block DFT of the first block row of A; this means that if v is defined to be the first block row of A, then D = diag{F_b v}. The inversion is then given by A^{-1} = F_b (diag{F_b v})^{-1} F_b^*, and only the blocks of the block diagonal matrix are inverted. Extending the closed form inversion formulations, an algorithm for solving a block circulant linear system was developed alongside many variants of circulant linear systems [10]. The solution of the linear system involving BCMs resulted from a straightforward application of the inversion formula. Following these efforts, [31] proposed an algorithm for the solution of CBMs. The most recent contribution to CBMs was given in [39]. The algorithm first diagonalizes each block of the matrix by the Fourier relation. The matrix is then a block matrix with diagonal blocks. The algorithm decomposes the matrix into a two-by-two block matrix and successively applies this decomposition to the first principal submatrix until a diagonal matrix is reached. The diagonal matrix is inverted, and the Schur complement formulation for the inverse of a two-by-two block matrix is successively used to compute the inversion of the entire matrix. All inversion/solution formulas of consequence use the spectral properties of circulant matrices. This is exploited in all aforementioned sequential inversion/solution algorithms.

While sequential solution algorithms have been fully developed, little work has been done on parallel algorithms for block circulant linear systems. A parallel solution for block Toeplitz matrices exists and parallelizes the generalized Schur algorithm [3]. Yet, using a Toeplitz solver neglects the use of the FFT and the potential concurrent calculations found in the BCM inversion formula. In fact, the only work we are aware of is a parallel solver for electromagnetic problems which considers the axisymmetric case [29]. The proposed parallel algorithm was for distributed memory systems and parallelized the inversion formulation for BCMs. The assumptions of that work differ from our own; that is, they assume a larger number of blocks of smaller order and, in turn, assume that the number of processors is some fraction of the number of blocks in the matrix. This means each processor contains multiple blocks, denoted q, of the BCM. For each block owned by a processor, the corresponding right-hand side also resides on that processor. This means that when solving the block diagonal matrix, each processor can perform the solve of its q blocks simultaneously. However, when solving the linear system, multiplications by the Fourier matrix are needed. These are needed in order to obtain the block diagonal matrix, modify the right-hand side vector, and modify the solution vector. This distribution means that multiplying by the Fourier matrix requires communication among the processors. Using the fact that block Fourier transforms can be decomposed into independent Fourier transforms, the algorithm performs an all-to-all communication to give each processor the data needed to compute an independent FFT. They tested the algorithm for BCMs with m = 256 blocks of order n = 318, m = 128 blocks of order n = 189, and m = 64 blocks of order n = 93. This is where our assumptions diverge significantly, and as a result our algorithm differs significantly in implementation of the same inversion formula.

Chapter 3

Problem Formulation

Consider a rotationally symmetric vibrating structure, Ω ⊂ ℝ³. The rotational symmetry implies Ω can be constructed by rotations of a single element around a fixed axis. Define Ω′ to be a structure in ℝ³, and let Ω′_θ represent the structure obtained by rotating Ω′ by angle θ. Then, supposing Ω has m rotational symmetries, Ω can be written as Ω = Ω′_0 ∪ Ω′_{2π/m} ∪ Ω′_{4π/m} ∪ · · · ∪ Ω′_{(m−1)2π/m}; that is,

\[ \Omega = \bigcup_{k=0}^{m-1} \Omega'_{\frac{2\pi k}{m}}. \qquad (3.1) \]

For example, for m = 4 the structure Ω can be written as

\[ \Omega = \Omega'_0 \cup \Omega'_{\pi/2} \cup \Omega'_{\pi} \cup \Omega'_{3\pi/2}. \qquad (3.2) \]

Note, the angle θ is relative to an initial orientation of the structure. This means that the structure being rotated can have any initial orientation; as long as the rotation is around a fixed axis and the rotation angle is uniform, the constructed structure is rotationally symmetric. Figure 3.1 shows a real-world example of a structure containing three rotational symmetries. 26

Fig. 3.1 A propeller with three times rotational symmetry [37].

3.1 Coefficient Matrix Derivation

Before beginning the algebraic derivation, we first present the underlying intuition. Figure 3.2 shows a sketch of a propeller with four times rotational symmetry. Consider the effect Ω′_0 has on Ω′_{π/2}, as well as the effect Ω′_{π/2} has on Ω′_π. Because the blades are identical and dist(Ω′_0, Ω′_{π/2}) = dist(Ω′_{π/2}, Ω′_π), the entries in the coefficient matrix which describe the effect of Ω′_0 on Ω′_{π/2} and of Ω′_{π/2} on Ω′_π will be identical. This continues for the remaining interactions of this form; therefore, the entries of the coefficient matrix due to the effect of Ω′_0 on Ω′_{π/2}, Ω′_{π/2} on Ω′_π, Ω′_π on Ω′_{3π/2}, and Ω′_{3π/2} on Ω′_0 will be identical. This same idea is used for all of the remaining interactions to finish populating the coefficient matrix. The equality between interactions due to symmetry is what leads to the block circulant structure of the coefficient matrix.

Fig. 3.2 A four times rotationally symmetric sketch of a propeller.

This decomposition of the initial structure in ℝ³ into the union of rotated structures gives insight into the structure of the coefficient matrix. Recall, in the derivation of the BEM, the solution over the boundary of the structure must first be solved in order to obtain the solution in the exterior domain. Consider only the base element Ω′ = Ω′_0 before any rotations. For clarity, we suppose m = 2 and use the standard boundary integral formulations given by (1.4) and (1.5). The integral formulations which promise uniqueness follow in the same manner. Assuming a Neumann boundary condition and rearranging into knowns and unknowns, the equation over the boundary of Ω′_0 is given by

\[ \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_0) - 2\pi u(p) = \int_{\partial\Omega'_0} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega'_0), \qquad p \in \partial\Omega'_0. \qquad (3.3) \]

Next, consider the solution of u over the boundary element ∂Ω′_{π/2}; that is, the boundary surface obtained by rotating the base element Ω′_0 by 90 degrees. This yields the following boundary integral formulation

\[ \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_{\pi/2}) - 2\pi u(p) = \int_{\partial\Omega'_{\pi/2}} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega'_{\pi/2}), \qquad p \in \partial\Omega'_{\pi/2}. \qquad (3.4) \]

As stand-alone structures, Ω′_0 and Ω′_{π/2} are identical aside from their orientation. The boundaries, ∂Ω′_0 and ∂Ω′_{π/2}, are unaffected by rotations and are therefore identical. Equations (3.3) and (3.4) involve only points on the boundary and, therefore, assuming the Neumann conditions are identical for both equations, equality holds. Note, by the uniqueness, and the equality for identical right-hand sides, it follows that the left-hand sides must be identical.

Intuitively, (3.3) shows the relation between a point p on ∂Ω′_0 and all the points q on ∂Ω′_0. If a point p is chosen on ∂Ω′_0, all of the points on ∂Ω′_0 contribute to the value of u at that point. In this sense, an N-body problem is being solved. Similarly, if a point p is chosen on ∂Ω′_{π/2}, all of the points on ∂Ω′_{π/2} contribute to the value of u at that point; however, ∂Ω′_0 and ∂Ω′_{π/2} are identical. Therefore, under identical boundary conditions, the same N-body problem is being solved.

Now, consider the solution of u over the boundary of the structure obtained by combining the two aforementioned structures, Ω′_0 and Ω′_{π/2}. The boundary is then given by ∂Ω = ∂Ω′_0 ∪ ∂Ω′_{π/2} and the integral equation is

\[ \int_{\partial\Omega} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega) - 2\pi u(p) = \int_{\partial\Omega} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega), \qquad p \in \partial\Omega. \qquad (3.5) \]

Using the rotational symmetries, equation (3.5) becomes

\[ \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_0) + \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} u(q)\, d(\partial\Omega'_{\pi/2}) - 2\pi u(p) = \int_{\partial\Omega'_0} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega'_0) + \int_{\partial\Omega'_{\pi/2}} G(p,q) \frac{\partial u(q)}{\partial n_q}\, d(\partial\Omega'_{\pi/2}), \qquad p \in \partial\Omega'_0 \cup \partial\Omega'_{\pi/2}. \qquad (3.6) \]

Redefine v₁(p) = u(p) for p ∈ ∂Ω′_0 and v₂(p) = u(p) for p ∈ ∂Ω′_{π/2}. In addition, define

\[ \Gamma_0[v_1] = \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} v_1(q)\, d(\partial\Omega'_0), \qquad \Gamma_{\pi/2}[v_2] = \int_{\partial\Omega'_{\pi/2}} \frac{\partial G(p,q)}{\partial n_q} v_2(q)\, d(\partial\Omega'_{\pi/2}), \]
\[ \Sigma_0 = \int_{\partial\Omega'_0} G(p,q) \frac{\partial v_1(q)}{\partial n_q}\, d(\partial\Omega'_0), \qquad \text{and} \qquad \Sigma_{\pi/2} = \int_{\partial\Omega'_{\pi/2}} G(p,q) \frac{\partial v_2(q)}{\partial n_q}\, d(\partial\Omega'_{\pi/2}). \]

Note, the variables v₁(p) and v₂(p) are unknowns, and, therefore, Γ₀[v₁] and Γ_{π/2}[v₂] are defined as operators; whereas Σ₀ and Σ_{π/2} are known quantities and are treated as known values. Using the newly-defined quantities, (3.6) can be split into two simultaneous equations over ∂Ω′_0 and ∂Ω′_{π/2}:

\[ \Gamma_0[v_1] + \Gamma_{\pi/2}[v_2] - 2\pi v_1(p) = \Sigma_0 + \Sigma_{\pi/2}, \qquad p \in \partial\Omega'_0, \qquad (3.7) \]
\[ \Gamma_0[v_1] + \Gamma_{\pi/2}[v_2] - 2\pi v_2(p) = \Sigma_0 + \Sigma_{\pi/2}, \qquad p \in \partial\Omega'_{\pi/2}. \]

Upon appropriate discretization, (3.7) can be written as the following linear system

\[ \begin{bmatrix} \Gamma_0 - 2\pi I & \Gamma_{\pi/2} \\ \Gamma_0 & \Gamma_{\pi/2} - 2\pi I \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} \Sigma_0 + \Sigma_{\pi/2} \\ \Sigma_0 + \Sigma_{\pi/2} \end{bmatrix}, \qquad (3.8) \]

where I is the identity matrix. Let A denote the coefficient matrix in (3.8) and consider the entries (Γ₀ − 2πI) and (Γ_{π/2} − 2πI). By the previous arguments in establishing the equivalence of (3.3) and (3.4), it follows that

\[ (\Gamma_0 - 2\pi I) = (\Gamma_{\pi/2} - 2\pi I). \qquad (3.9) \]

This is true even when the right-hand sides of (3.3) and (3.4) are not identical. With this relation established, define A₁ = (Γ₀ − 2πI) = (Γ_{π/2} − 2πI). Similarly, consider the entries Γ₀ and Γ_{π/2}. We would like to show Γ₀ = Γ_{π/2}. By definition,

\[ \Gamma_0[v_1] = \int_{\partial\Omega'_0} \frac{\partial G(p,q)}{\partial n_q} v_1(q)\, d(\partial\Omega'_0), \qquad (3.10) \]

and upon discretization as described in Section 1.2, we obtain

\[ \Gamma_0[v_1] = \sum_{i=1}^{N} (v_1)_i \left( \int_{[\partial\Omega'_0]_i} \frac{\partial G(p,q)}{\partial n_q}\, d\!\left([\partial\Omega'_0]_i\right) \right). \qquad (3.11) \]

The quantity Γ₀[v₁] becomes the product Γ₀v₁, in which v₁ is the discretization of the unknown v₁(q), and Γ₀ is a matrix of known quantities populated by integrating the normal derivative of the Green's function over the individual surface elements of ∂Ω′_0. In considering the discretization of Γ_{π/2}[v₂], we obtain

\[ \Gamma_{\pi/2}[v_2] = \sum_{i=1}^{N} (v_2)_i \left( \int_{[\partial\Omega'_{\pi/2}]_i} \frac{\partial G(p,q)}{\partial n_q}\, d\!\left([\partial\Omega'_{\pi/2}]_i\right) \right). \qquad (3.12) \]

Again, the quantity Γ_{π/2}[v₂] becomes the product Γ_{π/2}v₂, in which v₂ is the discretization of the unknown v₂(q), and Γ_{π/2} is a matrix of known quantities populated by integrating the normal derivative of the Green's function over the individual surface elements of ∂Ω′_{π/2}. Assuming the discretizations of the boundaries are the same, because the boundaries ∂Ω′_0 and ∂Ω′_{π/2} are identical, the values populating Γ₀ and Γ_{π/2} are identical, and thus Γ₀ = Γ_{π/2}. Let A₂ = Γ₀ = Γ_{π/2}; then with the previously established definition A₁ = (Γ_{π/2} − 2πI) = (Γ₀ − 2πI), the matrix A comprising the linear system (3.8) has the form

\[ A = \begin{bmatrix} A_1 & A_2 \\ A_2 & A_1 \end{bmatrix}, \qquad (3.13) \]

which is a 2 × 2 block circulant matrix. In general, given m rotational symmetries, an m × m block circulant matrix can be obtained.

3.2 Block Circulant Inversion

Let N = nm. The coefficient matrix A ∈ ℂ^{N×N} arising from the BEM applied to an acoustic radiation problem with a rotationally symmetric boundary surface has the form

\[ A = \begin{bmatrix} A_1 & A_2 & \cdots & A_m \\ A_m & A_1 & \cdots & A_{m-1} \\ A_{m-1} & A_m & \cdots & A_{m-2} \\ \vdots & \vdots & \ddots & \vdots \\ A_2 & A_3 & \cdots & A_1 \end{bmatrix}, \qquad (3.14) \]

where each A_j, j = 1, . . . , m, is contained in ℂ^{n×n} and is dense. The matrix A is block circulant and therefore can be represented by circular shifts of its first block row. The circulant structure of A is contained in the m blocks forming the first block row of A. Therefore, in order to perform block DFT operations, we need to scale the Fourier matrix F ∈ ℂ^{m×m} to the block Fourier matrix F_b ∈ ℂ^{N×N}. The Fourier matrix F is defined as

\[ F = \frac{1}{\sqrt{m}} \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega_m^1 & \omega_m^2 & \cdots & \omega_m^{m-1} \\ 1 & \omega_m^2 & \omega_m^4 & \cdots & \omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega_m^{m-1} & \omega_m^{2(m-1)} & \cdots & \omega_m^{(m-1)(m-1)} \end{bmatrix}, \qquad (3.15) \]

where ω_m = e^{i2π/m}, i = √−1, and normalizing by 1/√m makes F unitary. Scaling each element of F by the n × n identity matrix, I_n, produces the block Fourier matrix F_b. This is equivalent to the Kronecker product F ⊗ I_n. After scaling, we have the block Fourier matrix

\[ F_b = \frac{1}{\sqrt{m}} \begin{bmatrix} I_n & I_n & I_n & \cdots & I_n \\ I_n & I_n\omega_m^1 & I_n\omega_m^2 & \cdots & I_n\omega_m^{m-1} \\ I_n & I_n\omega_m^2 & I_n\omega_m^4 & \cdots & I_n\omega_m^{2(m-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_n & I_n\omega_m^{m-1} & I_n\omega_m^{2(m-1)} & \cdots & I_n\omega_m^{(m-1)(m-1)} \end{bmatrix}. \qquad (3.16) \]

  I I I ··· I  n n n n       I I ω1 I ω2 ··· I ωm−1   n n m n m n m  1    2 4 2(m−1)  Fb = √  I I ω I ω ··· I ω  . (3.16) m  n n m n m n m     . . . .   . . . .   . . . ··· .     (m−1) 2(m−1) (m−1)(m−1)  In Inωm Inωm ··· Inωm

N×n Next, the DFT relations needed for the inversion formula are established. Let X  C be the block column vector containing the first block row of A. The block DFT of X is

˜ given by X = FbX; that is,

      A˜ I I I ··· I A  1   n n n n   1               A˜   I I ω1 I ω2 ··· I ωm−1   A   2   n n m n m n m   2               A˜  =  I I ω2 I ω4 ··· I ω2(m−1)   A  , (3.17)  3   n n m n m n m   3         .   . . . .   .   .   . . . .   .   .   . . . ··· .   .              ˜ (m−1) 2(m−1) (m−1)(m−1) Am In Inωm Inωm ··· Inωm Am which is nothing more than a DFT of length m with n × n matrices as coefficients in the transform. Using the formulation of the inverse in [41], we have

−1 ˜ −1 ˜ −1 ˜ −1 ∗ A = Fbdiag{(A1) , (A2) ,..., (Am) }Fb , (3.18) 34

˜ −1 ˜ −1 ˜ −1 where diag{(A1) , (A2) ,..., (Am) } is a block diagonal matrix whose diagonal blocks are precisely the inverses of the blocks obtained from the DFT of the first block row of

A. From the formula, we can derive the algorithm for the solution of a linear system.

Consider the system Ax = b; multiplying by A⁻¹ yields

\[ x = A^{-1} b. \qquad (3.19) \]

Substituting in the definition for A⁻¹ from (3.18), we obtain

\[ x = F_b\, \operatorname{diag}\{ (\tilde{A}_1)^{-1}, (\tilde{A}_2)^{-1}, \ldots, (\tilde{A}_m)^{-1} \}\, F_b^*\, b. \qquad (3.20) \]

Rearranging yields

\[ \operatorname{diag}\{ \tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m \}\, F_b^*\, x = F_b^*\, b. \qquad (3.21) \]

Let x̃ = F_b^* x and b̃ = F_b^* b. This yields

\[ \operatorname{diag}\{ \tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m \}\, \tilde{x} = \tilde{b}. \qquad (3.22) \]

Blocking the vectors x̃ and b̃ to match the block sizes of each A_j, it is easy to see we obtain m independent linear systems to solve:

\[ \tilde{A}_j\, \tilde{x}_j = \tilde{b}_j, \qquad j = 1, \ldots, m. \qquad (3.23) \]

The steps for the solution of the linear system Ax = b are given by Algorithm 3.1. Each multiplication by the matrix F_b or F_b^* represents a block DFT or inverse DFT (IDFT) operation, respectively. It is worth noting that the system solves in line 3 of the algorithm are completely independent, which makes the algorithm very amenable to parallel implementation, as noted in [35].

Algorithm 3.1 Pseudocode for the sequential solution of a block circulant linear system.
1: Compute b̃ = F_b^* b;
2: Compute X̃ = F_b X;
3: Solve Ã_j x̃_j = b̃_j, j = 1, . . . , m;
4: Compute x = F_b x̃;
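A compact sequential realization of Algorithm 3.1 is sketched below in numpy (an illustration, not the thesis code). The FFT along the block index plays the role of the block Fourier transform; numpy's forward/inverse pair absorbs the 1/√m normalization and the conjugation of (3.16), so the scaling differs from the unitary convention in the text, but the computed solution is the same.

```python
import numpy as np

def solve_block_circulant(blocks, b):
    """Solve A x = b for A block circulant with first block row `blocks`.

    blocks : array of shape (m, n, n) holding A_1, ..., A_m
    b      : array of shape (m, n)   holding the blocked right-hand side
    """
    m = blocks.shape[0]
    # Block DFT of the first block row: D[l] = sum_k blocks[k] * w**(k*l), w = exp(2*pi*i/m).
    D = np.fft.ifft(blocks, axis=0) * m
    b_hat = np.fft.fft(b, axis=0)                      # transformed right-hand side
    x_hat = np.stack([np.linalg.solve(D[l], b_hat[l])  # m independent n-by-n solves
                      for l in range(m)])
    return np.fft.ifft(x_hat, axis=0)                  # transform back (scaling cancels)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 4, 3
    blocks = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))
    b = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
    # Assemble the dense block circulant matrix of (3.14) for comparison.
    A = np.block([[blocks[(j - i) % m] for j in range(m)] for i in range(m)])
    x = solve_block_circulant(blocks, b)
    print(np.allclose(A @ x.ravel(), b.ravel()))       # expected: True
```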

3.3 Invertibility

Algorithm 3.1 requires the inversion of the blocks obtained from computing the

DFT of the first block row of A. Therefore, assumptions on the invertibility of these blocks are required by the algorithm. The section will show that if the initial matrix A is assumed to be nonsingular, then each diagonal block is also nonsingular.

In order to facilitate the proof, we first show that the block Fourier matrix given in (3.16) is unitary.

Lemma 3.1. The block Fourier matrix F_b, as defined in (3.16), is unitary.

Proof. Recall, the N × N block Fourier matrix F_b can be constructed as a Kronecker product of the unitary m × m Fourier matrix F with the n × n identity matrix I_n. That is,

\[ F_b = F \otimes I_n. \qquad (3.24) \]

By the properties of Kronecker products [13] we have (A ⊗ B)^* = A^* ⊗ B^*. Therefore,

\[ F_b^* = (F \otimes I_n)^* = F^* \otimes I_n^* = F^* \otimes I_n. \qquad (3.25) \]

So F_b^* can be constructed in the same fashion. Now consider F_b^{-1}. By the Kronecker product property (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, for square nonsingular A and B, we have

\[ F_b^{-1} = F^{-1} \otimes I_n^{-1}. \qquad (3.26) \]

However, the Fourier matrix F is unitary, and thus

\[ F_b^{-1} = F^* \otimes I_n. \qquad (3.27) \]

It has been established that F_b^* = F^* ⊗ I_n, and, therefore,

\[ F_b^{-1} = F_b^*. \qquad (3.28) \]

Thus F_b is unitary.

Theorem 3.1. Given a nonsingular block circulant matrix A, the block diagonal matrix diag{Ã₁, Ã₂, . . . , Ã_m} is nonsingular, where the Ã_j, j = 1, . . . , m, are the blocks obtained by computing the block Fourier transform of the first block row of A.

Proof. Since A is block circulant we have

\[ A = F_b^*\, \operatorname{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\}\, F_b. \qquad (3.29) \]

Taking the determinant yields

\[ \det(A) = \det\!\left( F_b^*\, \operatorname{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\}\, F_b \right). \qquad (3.30) \]

Using a property of determinants we obtain

\[ \det(A) = \det(F_b^*)\, \det\!\left( \operatorname{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\} \right)\, \det(F_b). \qquad (3.31) \]

By Lemma 3.1, F_b is unitary, and thus det(F_b^*) det(F_b) = det(F_b^* F_b) = det(I) = 1; therefore,

\[ \det(A) = \det\!\left( \operatorname{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\} \right). \qquad (3.32) \]

Using the relation for the determinant of block diagonal matrices, we have

\[ \det(A) = \det(\tilde{A}_1)\det(\tilde{A}_2)\cdots\det(\tilde{A}_m). \qquad (3.33) \]

Because A is nonsingular, det(A) ≠ 0, and, therefore, det(Ã_j) ≠ 0 for j = 1, . . . , m. It follows that each Ã_j, j = 1, . . . , m, is nonsingular, and, therefore, diag{Ã₁, Ã₂, . . . , Ã_m} is nonsingular.

Chapter 4

Parallel Solution Algorithm

4.1 Block DFT Algorithm

While it is enticing to develop the algorithm around the Fast Fourier Transform (FFT), the robustness of the algorithm would be lost. Recall that the length of the DFT is determined by the number of symmetries of the boundary surface. For problems involving real world structures, such as propellers or wind turbines, the number of symmetries will be small, e.g., m ≤ 30. Indeed, even if a structure contained symmetries arising every one degree, i.e., m = 360, there must be at least one surface element in the discretization representing the symmetry, meaning n ≥ 360. This case is somewhat pathological, and, in general, we assume each symmetry has a large number of surface elements. This means it can be reasonably assumed that m ≪ n. In addition, FFTs make assumptions on the properties of m, the most common being that m is a power of two. While there are now FFT algorithms for any value of m [11, 26], the algorithms applied to feasible sizes of m have negligible benefits due to constants in the computation. We therefore designed our algorithm to be robust in the sense that it will work for any boundary surface input, and thus we use a DFT approach.

We derive the algorithm in the context of computing the block DFT of the first block row of A, given by (3.17), as this computation is needed during the system solve.

Define P to be the number of processors and assume P = m. The initial data distribution is obtained by assigning each submatrix A_j to processor P_j, for j = 1, . . . , m. The initial data distribution for P = m = 4 is illustrated in Figure 4.1.

Fig. 4.1 Initial data distribution assumed in the DFT computation for the case P = m = 4.

Expanding the DFT relation X̃ = F_b X, we obtain

\[ \begin{aligned} \tilde{A}_1 &= A_1 + A_2 + A_3 + \cdots + A_m \\ \tilde{A}_2 &= A_1 + A_2\omega_m^1 + A_3\omega_m^2 + \cdots + A_m\omega_m^{m-1} \\ \tilde{A}_3 &= A_1 + A_2\omega_m^2 + A_3\omega_m^4 + \cdots + A_m\omega_m^{2(m-1)} \\ &\;\;\vdots \\ \tilde{A}_m &= A_1 + A_2\omega_m^{m-1} + A_3\omega_m^{2(m-1)} + \cdots + A_m\omega_m^{(m-1)(m-1)}. \end{aligned} \qquad (4.1) \]

Given this initial data distribution, in the computation of Ã₁, processor P₁ already contains a portion of the summation, namely A₁. In fact, in all of the Ã_j computations, each processor contains a scaled portion of the corresponding summation. In addition, the scalar values ω_m^{(k−1)(j−1)}, for j, k = 1, . . . , m, are computable. This means that for the cost of scaling a submatrix by a term of the Fourier matrix, we already have a portion of the computation of each Ã_j, j = 1, . . . , m. The algorithm expands on this idea to compute the entire summation.

Starting from the initial data distribution, each processor computes the portion of the summation that corresponds to the data it owns. Then, each P_i cyclically sends its submatrix to P_{i−1} (P₁ sends its data to P_m). Each processor computes the corresponding term in the summation and propagates the submatrix. The computation completes after m − 1 communications. Figure 4.2 illustrates this process for the case P = m = 4.

Fig. 4.2 The DFT computation for the case P = m = 4. Each arrow indicates the communication of a processor's owned submatrix to a neighboring processor in the direction of the arrow.
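For concreteness, the following mpi4py sketch implements the P = m cyclic exchange just described. It is an illustrative sketch under stated assumptions, not the thesis implementation: it assumes the program is launched with exactly m ranks, that each rank already holds its n × n block (generated randomly here as A_local), and it uses the root ω_m = e^{i2π/m} of (3.15) without the 1/√m normalization.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
p, m = comm.Get_rank(), comm.Get_size()    # run with exactly m ranks
n = 64                                     # block order (illustrative)

rng = np.random.default_rng(p)
A_local = rng.standard_normal((n, n)) + 0j # this rank's block A_{p+1}
omega = np.exp(2j * np.pi / m)

A_tilde = np.zeros_like(A_local)           # running partial sum for this rank
block = A_local.copy()                     # block currently in hand
recv = np.empty_like(block)

for step in range(m):
    k = (p + step) % m                     # global index of the block in hand
    if step < m - 1:
        # post the cyclic exchange first so communication overlaps the update
        reqs = [comm.Isend(block, dest=(p - 1) % m),
                comm.Irecv(recv, source=(p + 1) % m)]
    A_tilde += omega ** (p * k) * block    # add this block's contribution
    if step < m - 1:
        MPI.Request.Waitall(reqs)
        block, recv = recv, block          # received block becomes the working block

# A_tilde now holds the (p+1)-st block of the unnormalized block DFT of (4.1).
```

Posting the nonblocking send/receive before the local update lets the next block arrive while the current contribution is accumulated, which anticipates the communication/computation overlap discussed below.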

The algorithm can be generalized to the case with P = cm, where c ∈ ℤ⁺, by observing that a block DFT with a block size of n × n can be broken into n² independent DFTs with block size 1. To see this, consider the k-th summation taken from (4.1). We then have

\[ \tilde{A}_k = A_1 + A_2\omega_m^{(k-1)} + A_3\omega_m^{2(k-1)} + \cdots + A_m\omega_m^{(m-1)(k-1)}. \qquad (4.2) \]

Recall that A_j ∈ ℂ^{n×n} for each j = 1, . . . , m. For illustrative purposes, let n = 2. Then (4.2) becomes

\[ \begin{bmatrix} \tilde{a}_{11}^k & \tilde{a}_{12}^k \\ \tilde{a}_{21}^k & \tilde{a}_{22}^k \end{bmatrix} = \begin{bmatrix} a_{11}^1 & a_{12}^1 \\ a_{21}^1 & a_{22}^1 \end{bmatrix} + \begin{bmatrix} a_{11}^2 & a_{12}^2 \\ a_{21}^2 & a_{22}^2 \end{bmatrix} \omega_m^{(k-1)} + \begin{bmatrix} a_{11}^3 & a_{12}^3 \\ a_{21}^3 & a_{22}^3 \end{bmatrix} \omega_m^{2(k-1)} + \cdots + \begin{bmatrix} a_{11}^m & a_{12}^m \\ a_{21}^m & a_{22}^m \end{bmatrix} \omega_m^{(m-1)(k-1)}, \qquad (4.3) \]

where the superscript k indicates that a_{ij}^k is an element of A_k. From here the computation of the elements of Ã_k can be written as the following n² = 4 independent summations

\[ \begin{aligned} \tilde{a}_{11}^k &= a_{11}^1 + a_{11}^2\omega_m^{(k-1)} + a_{11}^3\omega_m^{2(k-1)} + \cdots + a_{11}^m\omega_m^{(m-1)(k-1)} \\ \tilde{a}_{12}^k &= a_{12}^1 + a_{12}^2\omega_m^{(k-1)} + a_{12}^3\omega_m^{2(k-1)} + \cdots + a_{12}^m\omega_m^{(m-1)(k-1)} \\ \tilde{a}_{21}^k &= a_{21}^1 + a_{21}^2\omega_m^{(k-1)} + a_{21}^3\omega_m^{2(k-1)} + \cdots + a_{21}^m\omega_m^{(m-1)(k-1)} \\ \tilde{a}_{22}^k &= a_{22}^1 + a_{22}^2\omega_m^{(k-1)} + a_{22}^3\omega_m^{2(k-1)} + \cdots + a_{22}^m\omega_m^{(m-1)(k-1)}. \end{aligned} \qquad (4.4) \]

The independence of each summation permits us, given a sufficient number of processors, to perform these summations simultaneously. In a more general setting, this equates to partitioning each Aj, j = 1, . . . , m, into smaller block sizes, and then simultaneously performing block DFTs of this smaller block size.
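In numpy terms, this independence is simply the statement that a block DFT of an (m, n, n) stack is n² scalar DFTs taken across the block index; numpy's sign convention differs from (3.15), but the decomposition argument is identical. A tiny illustration (not thesis code):

```python
import numpy as np

m, n = 8, 4
blocks = np.random.default_rng(0).standard_normal((m, n, n)) + 0j

# Block DFT of the stack: one length-m transform per (i, j) entry position.
block_dft = np.fft.fft(blocks, axis=0)

# The same result, computed as n*n independent scalar DFTs.
elementwise = np.empty_like(block_dft)
for i in range(n):
    for j in range(n):
        elementwise[:, i, j] = np.fft.fft(blocks[:, i, j])

print(np.allclose(block_dft, elementwise))   # expected: True
```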

Now that it has been established that a block DFT can be broken down into block DFTs of smaller block size, we explain how to exploit this in the P = cm case. Let c = 4, i.e., P = 4m, and partition each A_j, j = 1, . . . , m, into c = 4 blocks of size n/√c × n/√c. Note that the block size is arbitrary; therefore, if √c is not an integer, the submatrix is simply split into c blocks with slightly different block sizes. The data decomposition can be seen in Figure 4.3, where again the superscript k indicates that A_{ij}^k is a block of A_k.

Fig. 4.3 Parallel block DFT data decomposition for P > m.

We rewrite these as c = 4 independent block DFTs of block size n/√c × n/√c. We then group the processors into c = 4 processor groups of size m. Grouping the processors, we obtain four DFTs in the form presented when P = m. Figure 4.4 shows the processor group organization. We then apply the P = m DFT algorithm within each processor group simultaneously. Therefore, when P = cm, we can decompose each A_j into c independent block DFTs of smaller block size. This decomposition can proceed all the way down until each A_j is decomposed into n² independent DFTs of block size 1. In this case, c = n², i.e., P = n²m, and n² one-dimensional DFTs are being performed simultaneously.

Since the most expensive part of computing the blocked DFT is the communication of the submatrices, it is desirable to overlap communication and computation as much as possible. With this in mind, we introduce asynchronous sends/receives. Starting from the P = m initial data distribution, begin with the asynchronous send of the processor's owned submatrix followed by the asynchronous receive of the neighboring processor's submatrix. While the processor's current submatrix data is being sent, a neighboring processor's submatrix is being received. During this communication, the data being sent is still able to be used because no modifications are being made. The data being sent is then used to update the partial sum. Therefore, we are sending, receiving, and computing the partial sum simultaneously.

Fig. 4.4 Parallel block DFT data decomposition and processor groupings for P > m.

There is a cost associated with the communication overlap. The cost is in the amount of memory being used to enable this overlapped communication/computation.

Three times the amount of memory is now being used: the unmodified submatrix, the neighboring processor's unmodified submatrix, and the running partial sum for the transformed submatrix. However, the amount of extra memory used can be managed by only communicating portions of a submatrix at a time. While theoretically it is best to minimize the communication startups, in practice, for large volumes of data, it is beneficial to send the data spread over a number of smaller packets. This blocking factor for optimal

Note that this algorithm is used for both X̃ = FbX and X = Fb*X̃; the only difference is the Fourier matrix that appears. When referring to the parallel DFT algorithm, we differentiate the use of Fb and Fb* as the parallel DFT and IDFT, respectively.

4.2 Block FFT Algorithm

As mentioned in Section 4.1, the FFT is difficult to apply when considering an arbitrary number of rotational symmetries, m, because of its restriction on the value of m, i.e., a power of two in the radix-2 algorithm. In certain cases, however, when the FFT is applicable, it can effectively be used. A relevant example concerns acoustic radiation problems involving axisymmetric structures. These problems deal with structures obtained by rotating a two-dimensional object around a third fixed, orthogonal axis. For example, cylinders or spheres are types of axisymmetric structures. By considering the structure of a propeller, or fan blade, it can be readily deduced that while all axisymmetric structures are rotationally symmetric, not all rotationally symmetric structures are axisymmetric; that is, axisymmetric structures are a subset of rotationally symmetric structures. The advantage of axisymmetric structures comes from the ability to choose the number of rotational symmetries in the discretization of the problem. Being able to choose the value of m means that the choice can be made to exploit the FFT.

Section 4.1 began by detailing a DFT algorithm for the P = m case. It then extended the algorithm to the P = cm case by breaking the block DFT into c independent block DFTs of smaller blocksize. The algorithm then constructs c processor groups, each with m processors, around the decompositions. It then uses the P = m algorithm within each processor group to simultaneously compute the block DFTs of smaller blocksize.

The FFT algorithm keeps the exact same framework as the DFT algorithm. The difference arises in how the P = m algorithm computes the DFT; in this case, a distributed FFT algorithm is used.

In order to derive the parallel algorithm, consider the sequential FFT algorithm given by Algorithm 1.2; the accompanying discussion in Section 1.3 concerned the pattern of interaction between elements of the initial input vector in producing the transformed vector. Indeed, this is the essence of the FFT. Figure 1.1 gave a visualization of the interaction pattern; in addition, it also showed how the data migrated to a bit-reversed order. This is important. The parallel algorithm will distribute each element of the input vector onto different processors, and these element interactions will become communication patterns. The algorithm used to compute the distributed one-dimensional FFT has been termed the binary exchange algorithm [24]. Only small modifications to Algorithm 1.2 are needed to fit the parallel case.

As in Section 4.1, we present the algorithm in the context of computing the block FFT of the first block row of A. Define P to be the number of processors and assume P = m; the initial data distribution is obtained by assigning each submatrix Aj to processor Pj, for j = 1, . . . , m. The initial data distribution for P = m = 4 can again be seen in Figure 4.1.

Now, consider Algorithm 4.1, which is the parallel FFT algorithm resulting from simple modifications to Algorithm 1.2.

Algorithm 4.1 Distributed radix-2 FFT pseudocode [24].
  Y = Radix-2FFT(X, Y, n)
  r = log n;
  R = X;
  for m = 0 to r − 1 do
    S = R;
    // Let (b0 b1 . . . br−1) be the binary representation of pid
    j = (b0 . . . bm−1 0 bm+1 . . . br−1);
    k = (b0 . . . bm−1 1 bm+1 . . . br−1);
    e = (bm bm−1 . . . b0 0 . . . 0);
    if pid == j then
      Send Apid to processor k;
      Receive Ak from processor k;
      Apid = Apid + Ak ωn^e;
    else
      Receive Aj from processor j;
      Send Apid to processor j;
      Apid = Aj + Apid ωn^e;
    end if
  end for
  Y = R;

The first difference to note is that the second loop in Algorithm 1.2 is no longer needed. The iteration variable i served two purposes: identifying which element of the initial vector to update, and determining the other elements involved in the computation. As each processor only has one element, there is no question which element each processor is responsible for updating. The second property remains intact because each Ai is contained on the processor with pid = i, where pid is the processor id. During each iteration of Algorithm 4.1, each processor needs one extra piece of data to perform the update to the owned data. Each processor uses its processor id to compute which element it needs to complete the current computation. By determining the element number, the pid of the processor which owns the data is determined; this can then be used to set up the communication to obtain the data. Figure 4.5 illustrates this process.

Fig. 4.5 Process illustrating the distributed FFT. Lines crossing to different processors indicate communication from left to right. Note the output is in reverse bit-reversed order relative to numbering starting at zero; that is, A1 is element 0; A2 is element 1, etc.
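The partner selection in Algorithm 4.1 amounts to flipping one bit of the processor id per stage; a small illustrative helper (hypothetical names, not from the BEM code) makes this explicit.

# A sketch of the partner computation behind Algorithm 4.1: in stage s the partner
# of a process is obtained by flipping bit s (counted from the most significant bit)
# of its pid.
def fft_partners(pid: int, num_procs: int):
    """Yield (stage, partner, is_lower) for a distributed radix-2 FFT."""
    r = num_procs.bit_length() - 1            # r = log2(P), P assumed a power of two
    for stage in range(r):
        bit = 1 << (r - 1 - stage)            # the bit flipped at this stage
        partner = pid ^ bit                   # flip that bit to find the partner
        is_lower = (pid & bit) == 0           # True -> this process plays the "j" role
        yield stage, partner, is_lower

if __name__ == "__main__":
    for stage, partner, lower in fft_partners(pid=5, num_procs=8):
        print(f"stage {stage}: partner {partner}, role {'j' if lower else 'k'}")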

The extension to the P = cm case is identical to the discussion in Section 4.1.

The block DFT can be decomposed into c independent block DFTs of smaller block size. The processors then create c processor groupings and simultaneously perform the P = m FFT algorithm. The advantage of computing the distributed FFT in this way is that the number of communications is minimized. The parallel DFT algorithm requires O(m) communications, whereas the FFT requires only O(log m). Although we have assumed m to be quite small, in the P = m case each communication requires that n² data elements be sent. This means the packet sizes are quite large; therefore, any reduction in the number of communications is beneficial.

4.3 System Solves

The goal is to solve all systems in line 3 of Algorithm 3.1 simultaneously. In addition, the ScaLAPACK routine PZGESV is used to further parallelize each system solve. By using ScaLAPACK, we are forced to work within the limits of its required data distribution and processor organization. In particular, the matrix data must be distributed in a block cyclic fashion, and the processors logically arranged in a grid format [6]. Using these restrictions, the initial system is set up as follows. Assume P = cm processors with c ∈ Z+. Now define m processor grids of size √c × √c and denote them by Gi, i = 1, . . . , m. If √c is not an integer, the processors are arranged in a rectangular grid format such that the numbers of rows and columns are integers.

Figure 4.6 illustrates the grid creation process for P = 16 and m = 4.

Fig. 4.6 Processor grid creation for P=16 and m=4.

Next, each Aj and corresponding right-hand side bj are block cyclically distributed over process grid Gj for j = 1, . . . , m. We require that the block cyclic distribution be performed using the same blocking factor for each Aj and bj. Each Gj is then in a position where it can solve a system involving Aj and bj. However, before these system solves can be performed, the left- and right-hand sides must be transformed by the DFT and IDFT, respectively.
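A hedged sketch of how the two kinds of process groupings can be formed with communicator splits is shown below, using mpi4py rather than the Fortran/BLACS calls actually used; the contiguous rank ordering is an assumption.

# Forming m grid communicators (c processes each) for the ScaLAPACK solves, and
# c cross-grid communicators (m processes each) for the block (I)DFT partial sums.
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
m = 4                                  # number of rotational symmetries (assumed to divide P)
c = P // m

grid_id = rank // c                    # which grid G_i this process belongs to
grid_comm = comm.Split(color=grid_id, key=rank)        # c processes per grid

dft_id = rank % c                      # position within the grid
dft_comm = comm.Split(color=dft_id, key=grid_id)       # m processes per DFT group

# grid_comm is where A_j and b_j live and where PZGESV would run; dft_comm links
# corresponding pieces across grids so the block DFT/IDFT sums can be formed.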

4.4 Parallel Algorithm

We have established an initial data distribution which can be used by ScaLAPACK and an algorithm for computing the DFT. Working within this data distribution and using the DFT or FFT algorithm, we present the parallel algorithm.

Assume we have P = cm processors, where c ∈ Z+. Define m processor grids Gj, j = 1, . . . , m, and block cyclically distribute each Aj and bj onto processor grid Gj for j = 1, . . . , m. The first step is to apply the IDFT to the right-hand side b. Each bj is distributed onto its respective processor grid of c processors. Because each bj was distributed over its corresponding processor grid using the same blocking factor, the distribution process is identical to decomposing each bj into c smaller blocks. Therefore, we can create c processor groupings of size m, where each processor group is composed of one processor from each grid. That is, processor group 1 is formed by taking each Gj's first element; group 2 is formed by taking each Gj's second element, and this process continues until we have c processor groupings. These processor groupings create c independent IDFTs of smaller blocksize which can use the DFT/FFT algorithm. Therefore, the IDFT involving each bj has been decomposed into c IDFTs of smaller size which can be done simultaneously. Using the DFT/FFT algorithm, we perform the IDFT of b, transforming each bj into b̃j. In the same way, we transform each Aj to Ãj. Now note that the data distribution has not changed, and each Gj now has the system Ãjx̃j = b̃j, which are precisely the systems that need to be solved. Note also that if the FFT algorithm is used, the data has migrated into a bit-reversed order during the IDFT transformations; however, both sides of the equation have migrated into a bit-reversed order, and the correct systems are still obtained. More precisely, if we let rev(j) denote the bit reversal of j, then after the IDFT transformations of Aj and bj, each system Ãjx̃j = b̃j resides on process grid Grev(j), for j = 1, . . . , m. Each Gj calls the ScaLAPACK routine PZGESV and solves its respective system. PZGESV overwrites b̃j with the solution x̃j. Because the solution overwrites the entries of b̃j, the data distribution has not changed, and we simply use the DFT/FFT algorithm again to transform each x̃j to xj. Thus we have the solution of the original linear system. If the FFT algorithm was used, x̃j would be in bit-reversed order; that is, x̃j is contained in grid Grev(j) for j = 1, . . . , m; however, when transforming back to xj, the bit-reversed order is negated. Therefore, xj is contained on grid Gj, j = 1, . . . , m, and the solution vector is in the same form as if the DFT algorithm had been used. Algorithm 4.2 shows the pseudocode for the parallel algorithm as six concise steps.

Algorithm 4.2 Pseudocode for the parallel solution of a block circulant linear system, assuming P = cm.
  1: Define m √c × √c process grids.
  2: Block cyclically distribute each Aj and bj onto grid Gj in an identical fashion.
  3: Perform c simultaneous IDFTs transforming bj to b̃j.
  4: Perform c simultaneous DFTs transforming Aj to Ãj.
  5: Simultaneously solve each Ãjx̃j = b̃j in parallel using PZGESV.
  6: Perform c simultaneous DFTs transforming x̃j to xj.
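As a serial cross-check of steps 3 through 6, the following NumPy sketch solves a small block circulant system in the frequency domain and verifies the result against a densely assembled matrix. NumPy's FFT sign and normalization conventions are used, which may differ from Fb by scaling; the block layout assumed is the usual one in which each block row is a right circular shift of the previous one.

# Serial reference for the block circulant solve (illustrative, not the parallel code).
import numpy as np

m, n = 4, 5
rng = np.random.default_rng(1)
blocks = rng.standard_normal((m, n, n)) + 1j * rng.standard_normal((m, n, n))  # first block row A_1..A_m
b = rng.standard_normal(m * n) + 1j * rng.standard_normal(m * n)

# Transformed diagonal blocks: m * ifft along the block axis plays the role of the
# block DFT of the first block row (up to normalization).
A_tilde = m * np.fft.ifft(blocks, axis=0)

# Transform the right-hand side, solve the m small systems, transform back.
b_hat = np.fft.fft(b.reshape(m, n), axis=0)
y = np.stack([np.linalg.solve(A_tilde[k], b_hat[k]) for k in range(m)])
x = np.fft.ifft(y, axis=0).reshape(m * n)

# Check against a densely assembled block circulant matrix: block (r, s) = A_{(s-r) mod m}.
A_dense = np.block([[blocks[(s - r) % m] for s in range(m)] for r in range(m)])
assert np.allclose(A_dense @ x, b)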

Chapter 5

Theoretical Timing Analysis

In this chapter, the theoretical runtime analysis for the parallel implementations discussed in Chapter 4 is developed. Algorithm 4.2 contains two core operations: parallel computation of the DFT and the parallel linear system solve. Therefore, the parallel runtime, denoted TP(n, m), can be expressed as:

T_P(n, m) = T_{FT}(n, m) + T_{LS}(n, m),    (5.1)

where TFT(n, m) denotes the parallel runtime in computing the DFT, and TLS(n, m) denotes the runtime of the parallel linear system solve. Chapter 4 presented two different implementations of the DFT, and, therefore, two parallel runtimes will be developed.

Let A be a block circulant matrix with m blocks of order n, and let X contain A's first block row; that is,

X = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{bmatrix}.

Further, let b be a single column vector and the right-hand side of the linear system Ax = b.

5.1 Parallel Linear System Solve

The parallel linear system solves are performed by ScaLAPACK, which conveniently provides the theoretical analysis of the implementation [6]. The term TLS(n, m) is then given by

T_{LS}(n, m) = \frac{2n^3}{3P}\, t_f + \frac{3 + \frac{1}{4}\log_2 P}{\sqrt{P}}\, n^2 t_v + (6 + \log_2 P)\, t_m,    (5.2)

where tf is the time per complex floating point operation, tm is the startup time for each communication, and tv is the time per data item sent. In general, tm > tv; thus, the number of communication startups should be minimized. Equation (5.2) can be broken into three parts: the first term in the summation is the computation term; the second term is the communication cost concerning the quantity of data items sent, and the last term corresponds to the number of communication startups.

The variable P in (5.2) is used to denote all processors; however, in the general case where P = cm, the parallel implementation contains m simultaneous system solves, each with c = P/m processors devoted to the parallel system solve. Therefore, the term P in (5.2) should be replaced by c, obtaining

T_{LS}(n, m) = \frac{2n^3}{3c}\, t_f + \frac{3 + \frac{1}{4}\log_2 c}{\sqrt{c}}\, n^2 t_v + (6 + \log_2 c)\, t_m.    (5.3)

Note that (5.3) is the parallel runtime for all of the m linear system solves. Due to the concurrency of the m linear system solves, solving m linear systems with P = cm processors is equivalent to solving one linear system with c processors. This overlap in parallelized operations is what makes the inversion formulation so amenable to parallel solution.
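For back-of-the-envelope estimates, (5.3) can be transcribed directly; the machine constants passed in below are hypothetical placeholders, not measured values.

# Direct transcription of (5.3) with c = P / m.
import math

def t_linear_solve(n: int, m: int, P: int, tf: float, tv: float, tm: float) -> float:
    """Estimated time of the m simultaneous linear system solves per equation (5.3)."""
    c = P / m
    compute = 2.0 * n**3 / (3.0 * c) * tf
    volume = (3.0 + 0.25 * math.log2(c)) / math.sqrt(c) * n**2 * tv
    startups = (6.0 + math.log2(c)) * tm
    return compute + volume + startups

# Example with made-up constants: t_linear_solve(6000, 4, 16, tf=1e-9, tv=1e-8, tm=1e-5)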

5.2 Block DFT using the DFT Algorithm

In this section, the runtime analysis of Algorithm 4.2 is considered when the block DFT algorithm (see Section 4.1) is used. There are three transformations which use the DFT algorithm presented in Section 4.1: the transformation of Aj to Ãj, bj to b̃j, and the solution vector x̃j to xj, for j = 1, . . . , m. Each of these transformations requires m − 1 communications. When transforming Aj to Ãj, for j = 1, . . . , m, each communication involves messages of size n²; similarly, the transformations of bj to b̃j and x̃j to xj, for j = 1, . . . , m, both involve messages of size n. Using this, the communication term in the analysis, denoted To(n, m), can be constructed. Accounting for the communications needed by these transformations, To(n, m) is given by

T_o(n, m) = 3(m − 1)\, t_m + (m − 1)(n^2 + 2n)\, t_v,    (5.4)

where, again, tm is the time to initialize a communication, and tv is the time per data item sent.

The computational term in the analysis is relatively straightforward. During each step of the algorithm, each processor multiplies the data it currently owns and adds it to its running sum. When transforming Aj to Ãj, for j = 1, . . . , m, each processor scales n² elements by a term in the Fourier matrix and adds them to the running sum; therefore, we have n²m multiplications plus n²(m − 1) additions in the transformation of Aj to Ãj, for j = 1, . . . , m. Similarly, the transformations of bj to b̃j and x̃j to xj, for j = 1, . . . , m, both involve nm multiplications and n(m − 1) additions. Combining the computational and communication terms yields

T_{DFT}(n, m) = (m − 1)(n^2 + 2n)\, t_f + m(n^2 + 2n)\, t_f + 3(m − 1)\, t_m + (m − 1)(n^2 + 2n)\, t_v.    (5.5)

The analysis can easily be extended to the P = cm case. Recall, the P = cm

DFT algorithm creates c DFTs of smaller blocksize and arranges c processor groups.

Using these processor groups, c simultaneous P = m DFTs of smaller blocksize are then performed. While the same number of communication startups is still needed, the size of the messages as well as the amount of computation is reduced by 1/c; therefore, by dividing the appropriate terms in (5.5) by c, the P = cm case is obtained:

T_{DFT}(n, m) = \frac{(m − 1)(n^2 + 2n) + m(n^2 + 2n)}{c}\, t_f + 3(m − 1)\, t_m + \frac{(m − 1)(n^2 + 2n)}{c}\, t_v.    (5.6)

More compactly,

T_{DFT}(n, m) = \frac{(2m − 1)(n^2 + 2n)}{c}\, t_f + 3(m − 1)\, t_m + \frac{(m − 1)(n^2 + 2n)}{c}\, t_v.    (5.7)

By combining (5.3) and (5.7), the parallel runtime for Algorithm 4.2, which is given by (5.8), is obtained:

T_{P1}(n, m) = \frac{(2m − 1)(n^2 + 2n)}{c}\, t_f + 3(m − 1)\, t_m + \frac{(m − 1)(n^2 + 2n)}{c}\, t_v + \frac{2n^3}{3c}\, t_f + \frac{3 + \frac{1}{4}\log_2 c}{\sqrt{c}}\, n^2 t_v + (6 + \log_2 c)\, t_m.    (5.8)

By rearranging (5.8) and grouping computation- and communication-specific constants, the final parallel runtime using the DFT algorithm is given by

T_{P1}(n, m) = \left[ \frac{2n^3}{3c} + \frac{(2m − 1)(n^2 + 2n)}{c} \right] t_f + \left[ 3(m − 1) + (6 + \log_2 c) \right] t_m + \left[ \frac{(m − 1)(n^2 + 2n)}{c} + \frac{3 + \frac{1}{4}\log_2 c}{\sqrt{c}}\, n^2 \right] t_v.    (5.9)

5.3 Block DFT Using the FFT Algorithm

The FFT timing analysis follows directly from Section 5.2. Recall that the main difference between the DFT algorithm and the FFT algorithm is the communication pattern. Whereas the DFT required m − 1 communications, the FFT only requires log2 m, for m a power of two. Consider the DFT implementation's communication term (5.4). By substituting log2 m for the appropriate communication terms, (5.4) becomes

T_o(n, m) = 3\log_2(m)\, t_m + \log_2(m)(n^2 + 2n)\, t_v    (5.10)

when the FFT algorithm is used. In the FFT case, after each communication, each processor scales a portion of its owned data by a term in the Fourier matrix. This modified data is then added to the processor's running sum; therefore, log2 m communications imply that log2 m multiplications and log2 m additions are performed. This is reflected in the computational term. Note that these are the only terms that change relative to the analysis involving the DFT algorithm. By proceeding in the same manner as Section 5.2, we obtain

T_{P2}(n, m) = \left[ \frac{2n^3}{3c} + \frac{2\log_2(m)(n^2 + 2n)}{c} \right] t_f + \left[ 3\log_2 m + (6 + \log_2 c) \right] t_m + \left[ \frac{\log_2(m)(n^2 + 2n)}{c} + \frac{3 + \frac{1}{4}\log_2 c}{\sqrt{c}}\, n^2 \right] t_v    (5.11)

for the final runtime of Algorithm 4.2 when using the FFT algorithm.

5.4 Bounds

Constructing the parallel complexity analysis allows us to find the dominating term in both parallel algorithms. Recall the assumptions in the development of the parallel algorithms in Chapter 4, namely, n ≫ m; that is, the order of each block in the coefficient matrix is large relative to the number of blocks. In general, it was assumed m < 30. Looking at (5.9), it is clear that the first term, i.e., 2n³/(3c), dominates the computation. Similarly, by considering (5.11), it follows that both expressions have the same dominating term. Therefore, we obtain

T_{P1}(n, m) = O\!\left(\frac{n^3}{c}\right)    (5.12)

and

T_{P2}(n, m) = O\!\left(\frac{n^3}{c}\right).    (5.13)
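The full models behind these bounds, (5.9) and (5.11), can also be evaluated directly to compare the two variants for particular n, m, and c; the following transcription is illustrative, and any constants supplied are hypothetical.

# Transcriptions of (5.9) and (5.11) for side-by-side comparison under a chosen
# machine model (tf, tv, tm are placeholders, not measured values).
import math

def t_parallel_dft(n, m, c, tf, tv, tm):
    """T_P1(n, m) from (5.9): the solve using the DFT algorithm."""
    comp = (2.0 * n**3 / (3.0 * c) + (2 * m - 1) * (n**2 + 2 * n) / c) * tf
    start = (3 * (m - 1) + (6 + math.log2(c))) * tm
    vol = ((m - 1) * (n**2 + 2 * n) / c
           + (3 + 0.25 * math.log2(c)) / math.sqrt(c) * n**2) * tv
    return comp + start + vol

def t_parallel_fft(n, m, c, tf, tv, tm):
    """T_P2(n, m) from (5.11): the same solve using the FFT algorithm (m a power of two)."""
    comp = (2.0 * n**3 / (3.0 * c) + 2 * math.log2(m) * (n**2 + 2 * n) / c) * tf
    start = (3 * math.log2(m) + (6 + math.log2(c))) * tm
    vol = (math.log2(m) * (n**2 + 2 * n) / c
           + (3 + 0.25 * math.log2(c)) / math.sqrt(c) * n**2) * tv
    return comp + start + vol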

Recalling that N = nm, for N large, the term which dominates arises from the ScaLAPACK linear system solve. Therefore, under our assumptions, the most expensive part of the computation is offloaded to the ScaLAPACK routine. This means that although large packets of data must be communicated between processors in computing the DFT, when N is large, the dominating term comes from the linear system solves. While this is not extremely surprising, the implication is that the communication terms in the developed algorithms do not overwhelm the overall algorithm. As a result, the computational portion of the linear system solves dominates. This is also the result reached in the ScaLAPACK user guide [6], where only the parallel linear system solve is analyzed. This is considered advantageous because the linear system solves are computed via ScaLAPACK, which is optimized for scalability.

Chapter 6

Numerical Experiments

All experiments were run using the Intel Nehalem processors of the Cyberstar compute cluster [1] running at 2.66 GHz with 24 GB of RAM. We implemented the parallel algorithm in Fortran 90 and used the ScaLAPACK and MPI libraries. A blocking factor of 50 was used for the block cyclic distribution of each Aj and bj onto their respective processor grids.

The FFT and DFT parallel algorithms differ in the communication routines used.

The DFT algorithm broke the communications into blocks of size 4000 which were sent and received asynchronously using MPI’s ISEND/IRECV functions. In the case that a processor does not contain 4000 elements, all of its data is sent in one communication.

The blocking of the communications also parsed each matrix columnwise to work within FORTRAN's column-major data storage format. In contrast, the FFT algorithm did not perform asynchronous sends/receives and used the standard BLACS routines for sending 2D blocks of data.

6.1 Experiment 1

First, we look at the runtime, speedup, and efficiency for a vibrating structure with four times rotational symmetry for both the DFT algorithm and the FFT algorithm. In each case, the number of processors P and matrix size N are varied. The number of processors is varied from 4 to 48, and N is varied from roughly 13,000 to 24,000.

6.2 Experiment 2

We look at the runtime, speedup, and efficiency for a vibrating structure with eight times rotational symmetry for both the DFT algorithm and the FFT algorithm. In each case, the number of processors P and matrix size N are varied. The number of processors is varied from 8 to 48, and N is varied from roughly 13,000 to 24,000.

6.3 Numerical Results

6.3.1 Experiment 1

First, consider the algorithm’s behavior when a structure with four times rota- tional symmetry, m = 4, is examined using the DFT algorithm as well as the FFT algorithm. Figure 6.1 shows the runtimes when using the DFT algorithm; a sharp de- cline in runtime can be seen as the number of processors increase for various N. The runtimes using the FFT implementation are given in Figure 6.2 showing similar trends and runtimes as their DFT counterpart. The runtime improvements are also apparent when looking at the speedup, which are given in Figures 6.3 and 6.4. The oscillations in the speedups can be explained by looking more closely at the values of the runtimes.

Figures 6.1 and 6.2 show that the wall clock times are quite low, and small benign vari- ances in the runtime for large P cause large oscillations in the speedup. This is why the 60 oscillations are flushed out for larger problems. Therefore as N increases, the oscillations are dampened, and the speedups become more linear.

Fig. 6.1 Runtime comparison using the DFT algorithm for varying P and N with m = 4.

Fig. 6.2 Runtime comparison using the FFT algorithm for varying P and N with m = 4.

Fig. 6.3 Speedups using the DFT algorithm for varying P and N with m = 4.

Fig. 6.4 Speedups using the FFT algorithm for varying P and N with m = 4.

The most important category in parallel algorithm analysis is probably efficiency.

Efficiency is a measure of useful work done by a parallel algorithm and gives insight into how much time the algorithm spends waiting on communication. Ideally, we would like the efficiency to be as close to 1 as possible, which means all of the work is useful. However, we are restricted by the underlying parallelism of the computation being performed. Here we look at the behavior of the efficiency for varying P and N. Figures 6.5 and 6.6 show the efficiency as a function of problem size for the DFT and FFT implementations, respectively. For nearly all processor numbers, excluding P = 4 which simply remains efficient, the algorithm becomes more efficient as the problem size increases. This tells us that as the problem size increases, the amount of time spent doing useful work increases. Because N = nm, for m fixed, an increase in problem size directly correlates to an increase in n. Recall the discussion in Section 5.4; for fixed m such that n ≫ m, the dominating term comes from the computational portion of the ScaLAPACK linear system solve. This fact is seen in Figures 6.5 and 6.6; as a function of problem size, the amount of time spent computing grows faster than the amount of time spent communicating.
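For reference, the speedup and efficiency reported in these figures follow the standard definitions; a trivial helper (with made-up timings in the example) is:

def speedup_and_efficiency(t_serial: float, t_parallel: float, p: int):
    """Return (speedup, efficiency) for a run on p processors."""
    s = t_serial / t_parallel
    return s, s / p

# e.g. speedup_and_efficiency(100.0, 20.0, 8) -> (5.0, 0.625) for made-up timings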

Notice that, generally, the larger the number of processors, the lower the efficiency. Although the DFT computations are ideally parallel and are able to overlap the communications resulting from additional processors, the linear system solve computations are not. More processors imply that the communication term in the linear system solve will contribute more to the overall runtime; however, as the size of the linear system grows, the computational portion of the ScaLAPACK solve begins to dominate. That is, more processors mean the efficiency will be lower for problems of the same size, but as the computational term of the linear system solve begins to dominate, the efficiency increases. Therefore, even though the efficiencies for different processor counts are decreasing with P in Figure 6.5, it is only because the data point, i.e., the value of N, is fixed.

Fig. 6.5 Efficiency using the DFT algorithm for varying N and P with m = 4.

An interesting observation regarding the two implementations is the similarity in their performance. Recall that the main difference between the algorithms is the number of communications needed when computing the DFT, whether via the DFT algorithm, i.e., matrix multiplication, or the FFT algorithm. For the case m = 4, the number of communications is negligible; however, m = 4 also means the linear systems are larger. Therefore, the computational term in the ScaLAPACK linear system solve will dominate the computation more, making the algorithms behave in a similar fashion. This was also alluded to in the theoretical analysis given in Chapter 5.

Fig. 6.6 Efficiency using the FFT algorithm for varying N and P with m = 4.

6.3.2 Experiment 2

Now, consider the performance of the algorithm when the number of rotational symmetries is m = 8. Figures 6.7 and 6.8 show the runtime analysis for the DFT and FFT implementations. The trend in the runtimes is similar to that of the four times rotational symmetry case. The main difference is the runtime values. Consider the largest case, N = 24,000; Figure 6.1 shows the runtime for P = 8 is roughly 38 seconds, whereas Figure 6.7 shows that for the same values of P and N, the computation time is only 12 seconds. As m increases, the size of the linear system decreases. This means that the most expensive part of the computation, which is the linear system solve, decreases with m, and results in a faster overall runtime. Even though the number of communications in the DFT/FFT algorithms grows with m, the messages per communication are smaller.

Fig. 6.7 Runtime comparison using the DFT algorithm for varying P and N with m = 8.

Fig. 6.8 Runtime comparison using the FFT algorithm for varying P and N with m = 8.

Figure 6.9 shows the speedup for the DFT algorithm in the eight times rotational symmetry case. What is interesting is that for smaller problems the speedup begins leveling off past a certain point. The DFT and FFT algorithms have no increasing dependence on P in their communication terms, and, therefore, this must be due to the size of the linear systems. This shows that for a fixed problem size, the advantage of additional processors becomes negligible after a certain point due to the ratio of computation to communication in the linear system solve. What is important is that until this point of leveling off, the speedup increases nearly linearly. This means that the extra communications in the DFT/FFT algorithm, which are due to the increase in m, do not overwhelm the algorithm. Indeed, by considering the larger values of N, Figure 6.9 shows that the speedups are nearly linear for larger problem sizes. The speedup in the case of the FFT algorithm is given in Figure 6.10. The trend for the smaller values of N appears to extend further than in Figure 6.9 and is most likely due to the savings of the FFT.

Lastly, efficiency is considered and is shown in Figures 6.11 and 6.12 for varying values of P and N. Again, it is found that the efficiency increases for increasing problem size. It can be seen that the overall value of the efficiency is slightly less than the m = 4 case; this is due to two things: the increase in communications due to the

DFT/FFT algorithms, and the size of the linear system solve. However, because m is

fixed, the number of communications by the DFT/FFT algorithm will not grow with

N, even though the message size will grow. Therefore, as the problem size increases, the computational term of the linear system solve will again begin to dominate, and the efficiency can be expected to increase.

Fig. 6.9 Speedup comparison using the DFT algorithm for varying P and N when m = 8.

Fig. 6.10 Speedup comparison using the FFT algorithm for varying P and N when m = 8.

Fig. 6.11 Efficiency comparison using the DFT algorithm for varying P and N when m = 8.

The effect of using the FFT over the DFT can be seen by comparing the first data point N = 13,000 of Figures 6.11 and 6.12. The efficiencies for the FFT algorithm are higher at this data point. In this instance, the linear systems are still relatively small, and the computational term of the ScaLAPACK solve does not yet dominate. This is because for the given value of N = 13,000, the communications of the DFT transformations contribute more. Because the FFT implementation uses fewer communications, the efficiencies are higher for smaller N.

As in the four times rotational symmetry case, we find that the DFT and FFT implementations perform similarly. The observed benefits of using the FFT appeared at the lower bound of our experimental values. The FFT algorithm showed a higher efficiency when N was small. In this instance, the linear system solves did not yet dominate, and, therefore, the communications contribute more. However, in both the m = 4 and m = 8 cases, as N increases, the algorithms exhibit similar performance. In our case, the assumptions rely on the solution of larger linear systems. In the case of smaller linear systems and larger m, the FFT algorithm could be expected to produce better performance results.

Fig. 6.12 Efficiency comparison using the FFT algorithm for varying P and N when m = 8.

Chapter 7

Conclusions

We have proposed a parallel algorithm for the solution of block circulant linear systems arising from acoustic radiation problems with rotationally symmetric boundary surfaces. A derivation of the linear system was given along with conditions for application of the algorithm. The algorithm takes advantage of the ScaLAPACK library and exploits the embarrassingly parallel nature of block DFTs within ScaLAPACK’s required data distributions. In addition, by exploiting the block circulant structure of the matrix in the context of the parallel algorithm, the memory requirements are reduced. The reduction in the memory requirements allows for the solution of larger block circulant linear systems. Because the size of the matrix directly correlates with the number of surface elements in the discretization, problems which require a finer discretization, i.e., higher frequency problems, can be explored. In addition, problems with larger overall structures can be investigated.

The behavior of the DFT and FFT algorithms was similar for large N. The experimental results show near linear speedup for varying problem sizes and that the speedups become more linear for increasingly large N. We also showed that the efficiency of the algorithm increases as a function of problem size. The theoretical analysis coupled with the experimental results showed that in both cases the algorithm becomes dominated by the ScaLAPACK linear system solve portion of the algorithm. Given the requirements of the problem, i.e., n ≫ m with m ≤ 30, it is found that for larger problems, the difference in the two algorithms is negligible. It has also been established that the block

DFT transformations can be performed within the ScaLAPACK data distribution, and that the necessary communications for the DFT transformations do not overwhelm the algorithm’s runtime.

In addition, because we developed an algorithm using a matrix multiplication

DFT approach, it can be applied to any rotationally symmetric structure. The parallel algorithm therefore permits the efficient computation of larger acoustic radiation problems with rotationally symmetric boundary surfaces. While small gains exist from choosing the FFT algorithm over the developed DFT algorithm, these gains are negligible given our assumptions on N and m. The FFT also places additional requirements on the values of m, i.e., m must be a power of 2. Indeed, for the assumption m ≤ 30, there are only four viable values of m, namely, 2, 4, 8, and 16. Nevertheless, small gains do exist, and, therefore, one avenue for further investigation is the development of a robust algorithm which uses FFTs within the context of using ScaLAPACK for the linear systems. If an elegant domain decomposition can be devised, and if a robust FFT algorithm, such as Bluestein's FFT algorithm [7, 8], can be fitted to the problem, the algorithm could be further improved.

Appendix

BEM Code

The modified code, in its most general form, has four different cases:

1. Sequential with no rotational symmetries.

2. Sequential with rotational symmetries.

3. Parallel with no rotational symmetries.

4. Parallel with rotational symmetries.

Therefore, in the main program, logic exists to direct the program flow through one of the four cases given above. There are five core functions which have been modified to support these cases. These are:

1. STATIC MULTIPOLE ARRAYS

2. COEFF MATRIX

3. SOURCE AMPLITUDES MODES

4. SOURCE POWER

5. MODAL RESISTANCE

Before describing each function individually, we first define some frequently used terminology. When using the term "distributed data structure", we are referring to each processor containing a portion of a global data structure. For example, assume we are given a matrix A and we have P processors. Instead of one processor containing all of the matrix A, the elements are split up, and each processor has a data structure which contains these portions of the matrix. We refer to the collection of these data structures as a "distributed data structure" and denote it as sub[A]. This is because when all processors combine their corresponding sub[A], we obtain the global data structure A.

A.1 STATIC MULTIPOLE ARRAYS

This function uses multipole expansions to approximate values which will end up populating the coefficient matrix. It attempts to speed up future runs by storing the approximated values in a file. The function initially checks for the existence of the file. If it is not there, the function proceeds to compute the approximations and create the file. If, however, a file containing the approximations exists, the function immediately returns, performing no computations.

A.1.1 Sequential

A.1.1.1 General Case

In the sequential case, the BEM code does not change with respect to the original code. The function generates the approximations and writes the data to a file or returns.

The pseudocode for this case is given by Algorithm A.1.

Algorithm A.1 Pseudocode for the STATIC MULTIPOLE ARRAYS general sequential case.
  if (Multipole data file exists) then
    return;
  else
    Compute multipole expansion approximations;
    Write multipole expansion data to file;
  end if

A.1.1.2 Rotationally Symmetric

Rotational symmetry plays no role in the sequential computation, and the function performs as it does in the general case (see Section A.1.1.1 and Algorithm A.1).

A.1.2 Parallel

A.1.2.1 General Case

The parallel code behaves differently than the sequential code. A distributed data structure, called sub[U], is created to store the data in a distributed setting. This data structure is a three-dimensional array. The first two dimensions vary with respect to the total number of acoustic elements in the BEM computation. The third dimension has a

fixed value of 5, which corresponds to the number of terms in the multipole expansion.

Each processor then, simultaneously, populates its data structure. When all processors have populated the corresponding sub[U] data structure, the function returns. The main difference in the computation is that no file is generated in the parallel case.

The multipole expansion data is instead held in memory distributed over the available processors. The pseudocode is given in Algorithm A.2.

Algorithm A.2 Pseudocode for the STATIC MULTIPOLE ARRAYS general parallel case. Define sub[U]n and sub[U]m to be the number of rows and columns of the processor's owned sub[U] data structure, respectively.
  for i = 1 to sub[U]n do
    for j = 1 to sub[U]m do
      Compute multipole expansion approximation;
      Assign sub[U](i, j);
    end for
  end for

A.1.2.2 Rotationally Symmetric

The multipole file is written out in this case. Due to the way the computation proceeds in the generation of the coefficient matrix (see Section A.2), the parallel rotationally symmetric case behaves in the same fashion as the sequential case (see Algorithm A.1). That is, a file containing the multipole approximations is written out if no such file already exists, or the function returns. Because the generation of the multipole expansion data is not time consuming, the benefits of computing the multipole expansions in parallel are lost in the communication back to a single processor for the writing of the file. Therefore, one processor computes the multipole expansion data and writes the data out to a file. All other processors wait for the processor performing the calculations and file creation to finish. Once the working processor finishes, the remaining processors continue with the computation.

A.2 COEFF MATRIX

The COEFF MATRIX routine populates the coefficient matrix A to be used in the computation of Ax = b. It now uses the multipole data which was computed in the

STATIC MULTIPOLE ARRAYS routine.

A.2.1 Sequential

A.2.1.1 General Case

This routine is the same as the original; it loops through each entry of the matrix, reading one row of the multipole data at a time, and populates the matrix. Note that the multipole data is only used if the distance between the points on the surface is sufficiently large; however, even when the multipole data is not used, the file is still read.

Algorithm A.3 Pseudocode for the COEFF MATRIX general sequential case. Define N to be the number of rows and columns of A.
  for i = 1 to N do
    Read row i of multipole data in from file;
    for j = 1 to N do
      if dist(pi, pj) > threshold then
        Compute using multipole data;
      else
        Compute without multipole data;
      end if
      Assign A(i, j);
    end for
  end for

A.2.1.2 Rotationally Symmetric

The coefficient matrix for this case is block circulant. As noted previously, block circulant matrices can be uniquely represented by their first block row. Therefore, only the first block row of the matrix is generated by this routine. It proceeds in the same manner as the general sequential version; however, it does not fill in the matrix beyond the first block row.

Algorithm A.4 Pseudocode for the COEFF MATRIX rotationally symmetric sequential case. Define N to be the number of rows and columns of A. In addition, define m to be the number of symmetries.
  for i = 1 to N/m do
    Read row i of multipole data from file;
    for j = 1 to N do
      if dist(pi, pj) > threshold then
        Compute using multipole data;
      else
        Compute without multipole data;
      end if
      Assign A(i, j);
    end for
  end for

A.2.2 Parallel

A.2.2.1 General Case

In this case, each processor contains a distributed data structure containing the global matrix A, and is denoted by sub[A]. Each processor’s data structure contains only a portion of the data contained in the entire coefficient matrix A. Each processor then populates its data structure, sub[A], simultaneously. The simultaneous population of the matrix is due to the sub[U] data structure populated in STATIC MULTIPOLE ARRAYS.

Without this distributed data structure, the file containing the multipole data would have to be opened and read sequentially.

A.2.2.2 Rotationally Symmetric

Again, the coefficient matrix can be uniquely defined by its first block row. The

first block row will contain m blocks each of order n. In this case, the number of processors, defined by P , is assumed to be some multiple of m. That is, P = cm for

c ∈ Z+. From here, m processor grids are defined; these are denoted by Gi for i = 1, . . . , m.

Algorithm A.5 Pseudocode for the COEFF MATRIX general parallel case. Define sub[A]N and sub[A]M to be the number of rows and columns of sub[A], respectively.
  for i = 1 to sub[A]N do
    for j = 1 to sub[A]M do
      if dist(pi, pj) > threshold then
        Compute using multipole data in sub[U];
      else
        Compute without multipole data;
      end if
      Assign sub[A](i, j);
    end for
  end for

Each Gi contains c processors and is of dimension √c × √c. If √c is not an integer, the closest rectangular grid is formed. In addition to defining the grids, each processor defines variables called pId and gId. The variable pId is the processor number, and gId identifies which of the m processor grids a processor is a part of; their existence is acknowledged only because they are used in the coefficient matrix generation. At this point, m processor grids have been defined. In addition, the first block row of

A is composed of m blocks of order n. Therefore, each of the m blocks in the first block row of A will be distributed onto a corresponding processor grid. That is, block

Ai is distributed over Gi for i = 1, . . . , m. In order to distribute Ai onto grid Gi for i = 1, . . . , m, the processors belonging to grid Gi must define a distributed data structure for the corresponding Ai. The distributed data structure is denoted by sub[Ai]. Note, only the processors which are a part of Gi contain the data structure sub[Ai]. That is, processors belonging to G1 use the distributed data structure sub[A1]; processors belonging to G2 use the distributed data structure sub[A2], and so on and so forth.

Each grid is populated simultaneously, but not the distributed data structures, sub[Ai], i = 1, . . . , m, within the grid. This is due to the file containing the multipole approximations. The file must be read sequentially, and only one row is read at a time. Therefore, the loops are of length n, which is the order of each Ai, i = 1, . . . , m. The computation proceeds as follows: for element Ai(j, k), j, k = 1, . . . , n, the function BCCMPT L INDX is called and returns the processor whose local data structure, sub[Ai], contains element Ai(j, k). In addition, the function returns the index into the local data structure, denoted by (lj, lk). Therefore, in iteration (j, k), sub[Ai](lj, lk) is populated, and this happens simultaneously for each grid. Algorithm A.6 shows the pseudocode for the routine.

Algorithm A.6 Pseudocode for the COEFF MATRIX rotationally symmetric parallel case. The variable aP denotes the processor whose data structure will be assigned in a given iteration. Note that the variable kG allows the grids to be populated in parallel.
  Define processor grids;
  for j = 1 to n do
    Read row of multipole data in from file;
    for k = 1 to n do
      kG = n ∗ gId + k;
      aP = processor containing AgId(j, kG);
      Compute index (lj, lkG) into sub[AgId] using global index (j, kG);
      if pId == aP then
        Assign sub[AgId](lj, lkG);
      end if
    end for
  end for
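For reference, the per-dimension arithmetic performed by a block cyclic index helper such as BCCMPT L INDX follows the standard mapping sketched below; this is an illustrative 0-based version, and the BEM code's 1-based Fortran conventions differ in the obvious way.

# Standard 1-D block cyclic mapping from a global index to (owning process, local index),
# assuming the distribution starts on process 0.
def block_cyclic_owner_and_local(g: int, nb: int, p: int):
    """Map a 0-based global index g to (owner, local) for block size nb over p processes."""
    block = g // nb                      # which block the index falls in
    owner = block % p                    # blocks are dealt out cyclically
    local = (block // p) * nb + g % nb   # position inside the owner's local array
    return owner, local

# Applying this to rows with the row block size and grid rows, and to columns with
# the column block size and grid columns, gives the 2-D mapping used for sub[Ai].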

A.3 SOURCE AMPLITUDES MODES

A.3.1 Sequential

A.3.1.1 General Case

The general sequential case makes no changes to the original routine. The right-hand side, b, is populated by a double for loop. It should be noted that, in general, there will be multiple right-hand sides. That is, b will not be a single column vector. Following this, the system Ax = b is solved using the LAPACK routine ZGESV. Algorithm A.7 gives the pseudocode.

Algorithm A.7 Pseudocode for the SOURCE AMPLITUDES MODES general sequential case. Let N and rhsn denote the number of rows and columns of b, respectively.
  for i = 1 to N do
    for j = 1 to rhsn do
      Assign b(i, j);
    end for
  end for
  Solve Ax = b using LAPACK routine ZGESV;

A.3.1.2 Rotationally Symmetric

In the initial section of the routine, the right-hand side, b, is populated from a simple double for loop. Following this, the solve of the system Ax = b is performed. The rotationally symmetric system solve has been discussed in detail in Section 3.2; therefore, the discussion in this section will be somewhat terse. The inversion formula for a block circulant matrix A is given by

A^{-1} = F_b^* \, \mathrm{diag}\{\tilde{A}_1^{-1}, \tilde{A}_2^{-1}, \ldots, \tilde{A}_m^{-1}\}\, F_b,    (A.1)

where Fb is the block Fourier matrix, and diag{Ã1^{-1}, Ã2^{-1}, . . . , Ãm^{-1}} is a block diagonal matrix. In the context of solving the linear system Ax = b, we obtain

\mathrm{diag}\{\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_m\}\, F_b^* x = F_b^* b.    (A.2)

Now, let X̃ be a block column vector constructed from the elements of the block diagonal matrix in (A.2). That is,

\tilde{X} = \begin{bmatrix} \tilde{A}_1 \\ \tilde{A}_2 \\ \tilde{A}_3 \\ \vdots \\ \tilde{A}_m \end{bmatrix}.    (A.3)

In addition, let X be the column vector of the first block row of A. We then have the relation X̃ = FbX; therefore, the elements of the block diagonal matrix are precisely the values obtained from computing the block DFT of the first block row of A. With these relations, the solve is computed by the following steps:

1. Compute b̃ = Fb* b.

2. Compute X̃ = Fb X.

3. Solve Ãj x̃j = b̃j, j = 1, . . . , m.

4. Compute x = Fb x̃.

The pseudocode for this case of the SOURCE AMPLITUDES MODES routine is given by Algorithm A.8.

Algorithm A.8 Pseudocode for the SOURCE AMPLITUDES MODES rotationally symmetric sequential case. Let N and rhsn denote the number of rows and columns of b, respectively. In addition, let m be the number of blocks in the first block row of A.
  for i = 1 to N do
    for j = 1 to rhsn do
      Assign b(i, j);
    end for
  end for
  Compute the inverse DFT of the right-hand side by b̃ = Fb* b;
  Compute the elements of diag{Ã1, Ã2, . . . , Ãm} by X̃ = Fb X;
  for k = 1 to m do
    Solve Ãk x̃k = b̃k using LAPACK routine ZGESV;
  end for
  Compute the DFT of the solution vector x̃ by x = Fb x̃;

A.3.2 Parallel

A.3.2.1 General Case

The general parallel case is very similar to the sequential general case. The only difference is that the global matrix A has been distributed over the processors and resides in the distributed data structure sub[A]. Recall that this data structure was populated by the COEFF MATRIX routine (see Section A.2.2.1). In addition, a distributed data structure, sub[b], is defined for the global right-hand side b. Each processor simultaneously populates its corresponding data structure. The pseudocode is given in Algorithm A.9. Note the existence of the variable x in line 6. This variable is only for clarity of presentation of the algorithm. The routine PZGESV overwrites the distributed data structure sub[b] with the result. In this way, there is no need to maintain a distributed data structure for the variable x.

Algorithm A.9 Pseudocode for the SOURCE AMPLITUDES MODES general parallel case. Let sub[b]n and sub[b]m denote the number of rows and columns of b, respectively.
  1: for i = 1 to sub[b]n do
  2:   for j = 1 to sub[b]m do
  3:     Assign sub[b](i, j);
  4:   end for
  5: end for
  6: Solve sub[A]x = sub[b] using ScaLAPACK routine PZGESV;

A.3.2.2 Rotationally Symmetric

With the exception of one additional operation, this routine performs the same operations as the preceding cases. That is, it populates the right-hand side and solves the linear system. The extra operation comes from moving the distributed solution vector into a different distributed format needed for a later parallel matrix multiplication. The parallel rotationally symmetric system solve is discussed in detail and is the main topic of this work; therefore, this section will not discuss the details of the solve. Rather, this section will detail the population of the right-hand side in a way which is amenable to the parallel block circulant system solve. Recall that the routine in Section A.2.2.2 defined m processor grids, Gi, for i = 1, . . . , m. In addition, the routine distributed each Ai onto grid Gi. In a similar fashion, this routine will block b into m blocks of corresponding size, denoted by bi, i = 1, . . . , m, and distribute each bi onto Gi for i = 1, . . . , m. Each bi is in C^{n×rhsn}, where n is the order of each block Ai, and rhsn denotes the number of right-hand sides, i.e., columns of b. In order to distribute each bi onto grid Gi for i = 1, . . . , m, a distributed data structure sub[bi] is defined. Note, only the processors which are part of Gi contain the data structure sub[bi]. That is, processors belonging to G1 use the distributed data structure sub[b1]; processors belonging to G2 use the distributed data structure sub[b2], and so on and so forth. All of the distributed data structures are populated simultaneously by looping over their corresponding distributed data structures. At this point, each Ai and bi reside on grid Gi for i = 1, . . . , m, and this is precisely the setting which is needed for the parallel block circulant system solve. Once the block circulant linear system has been solved, the solution will reside in each sub[bi] for i = 1, . . . , m. However, following this, a parallel matrix-vector multiplication will be performed using the solution vector. The multiplication is performed in the context of one large process grid containing all processors, and therefore the right-hand side, b, is distributed over the processor grid using a distributed data structure sub[b]. Since the solution resides on the distributed data structures sub[bi] for i = 1, . . . , m, the data needs to be communicated to the appropriate format for sub[b].

In order to perform this reorganization of data, the routine uses a double for loop to loop through the global b, computes which processor currently owns each entry and which processor needs it, and performs the communication. Once this reorganization is complete, the routine is finished.

A.4 SOURCE POWER

At this point in the program, the system Ax = b has been solved, and the resultant vector, x, has been obtained. Because LAPACK and ScaLAPACK overwrite b with the solution vector x, the data structures containing b now hold the solution x. Therefore, any data structures previously denoted by a b will be denoted by x. The matrix A and the right-hand side b are no longer needed (in terms of their initial values). This routine populates a matrix S with the intent of computing s = x∗Sx, in which x is the solution obtained from the SOURCE AMPLITUDES MODES routine.

Algorithm A.10 Pseudocode for the SOURCE AMPLITUDES MODES rotationally symmetric parallel case. The variables pId and gId denote the processor number and the grid the processor belongs to, respectively. Let sub[bgId]n and sub[bgId]m denote the number of rows and columns of b, respectively.
  for j = 1 to sub[bgId]n do
    for k = 1 to sub[bgId]m do
      Assign sub[bgId](j, k);
    end for
  end for
  Solve Ax = b using the block circulant solve (Algorithm 4.2);
  for j = 1 to N do
    for k = 1 to rhsn do
      SendProc = processor which has data b(j, k) in sub[bgId];
      RecvProc = processor which needs the b(j, k) data;
      Compute the index into sub[bgId]; denote it by (sj, sk);
      Compute the index into sub[b]; denote it by (lj, lk);
      if SendProc == RecvProc then
        sub[b](lj, lk) = sub[bgId](sj, sk);
      else
        if pId == SendProc then
          Send sub[bgId](sj, sk) to processor RecvProc;
        else if pId == RecvProc then
          Receive temp = sub[bSendProc](sj, sk) from processor SendProc;
          sub[b](lj, lk) = temp;
        end if
      end if
    end for
  end for

A.4.1 Sequential

A.4.1.1 General Case

The general sequential case simply populates the matrix; however, the method of population differs from the previous routines. Instead of populating the matrix by looping over each element in the matrix, the routine loops over the sources used in the overall computation for populating S. That is, for each source, it computes which element, S(i, j), uses that source, and adds the source’s contribution to S(i, j). In general, there can be multiple sources per element and, therefore, S(i, j) will be updated multiple times.

There are three different types of sources: simple, dipole, and a coupled simple and dipole source which will be called a tripole source. The contribution of each source type is done separately. That is, the simple source contributions are computed, followed by computation of the dipole sources, and finally by computation of the tripole sources. In the SOURCE POWER routine, there is a separate routine for each source type; however, the algorithmic idea for populating the matrix S is the same in all cases.

In addition, the matrix S is Hermitian, and so only the upper triangular portion is computed using the source contributions discussed above. After the upper triangular portion is populated, the routine fills in the second half of the matrix by copying the conjugate of the elements into the lower triangular portion of the matrix. Algorithm A.11 gives the pseudocode for the routine.

Algorithm A.11 Pseudocode for the SOURCE POWER general sequential case. Let N1, N2, and N3 be the number of simple, dipole, and tripole sources, respectively. Let N be the number of rows and columns of S.
  // Fill the upper triangular portion of S
  for l = 1 to 3 do
    for k = 1 to Nl do
      Let (i, j) be the element to which source k contributes;
      if i ≤ j then
        Update S(i, j);
      end if
    end for
  end for
  // S is Hermitian, copy the data
  for i = 2 to N do
    for j = 1 to i − 1 do
      S(i, j) = Conj(S(j, i));
    end for
  end for
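For reference, the conjugate-copy step at the end of Algorithm A.11 is equivalent to the following NumPy expression (illustrative only; the BEM code performs the copy element by element in Fortran).

import numpy as np

def complete_hermitian(S_upper: np.ndarray) -> np.ndarray:
    """Fill the strictly lower triangle of S from the conjugate of the upper triangle."""
    return np.triu(S_upper) + np.triu(S_upper, k=1).conj().T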

A.4.1.2 Rotationally Symmetric

In the rotationally symmetric case, the matrix is also block circulant. Because the matrix is block circulant, only the first block row of S is filled. Then, using the first block row of S, the matrix is filled. The fact that S is Hermitian is also used, but in this case, it is used only for the first block in the first block row of S. The pseudocode given in Algorithm A.12 is very similar to Algorithm A.11 except for a change in the bounds.

A.4.2 Parallel

A.4.2.1 General Case

Essentially, this routine populates the matrix S in parallel. It reuses the distributed data structure sub[A], which will now be denoted as sub[S], and populates it by having each processor simultaneously loop over its corresponding data structure.

However, as discussed in the sequential cases, the original routines loop over sources, not matrix elements. Therefore, this routine's computations proceed by taking a processor's local indices, (li, lj), into sub[S], converting the local indices into global matrix indices, (i, j), finding which sources are owned by S(i, j), and looping through these sources to compute their contributions to S(i, j). As in the sequential case, there are three types of sources and, therefore, there are three separate routines for computing their contributions. However, the overall algorithmic structure is the same.

Algorithm A.12 Pseudocode for the SOURCE POWER rotationally symmetric sequential case. Let N1, N2, and N3 be the number of simple, dipole, and tripole sources, respectively. Let m be the number of blocks in the first block row of S and N be the number of rows and columns of S.
  // Fill the first block row of S except for the lower triangular portion of the first block
  for l = 1 to 3 do
    for k = 1 to Nl do
      Let (i, j) be the element to which source k contributes;
      if i ≤ j and i ≤ N/m then
        Update S(i, j);
      end if
    end for
  end for
  // The first block of S is Hermitian, copy the data
  Let n = N/m;
  for i = 1 to n do
    for j = 1 to i do
      S(i, j) = Conj(S(j, i));
    end for
  end for
  // Fill the remainder of S knowing it is block circulant
  for k = 1 to m − 1 do
    for i = 1 to n do
      l = n ∗ k + i;
      for j = 1 to N do
        t = n ∗ k + j;
        if t > N then
          t = t − N;
        end if
        S(l, t) = S(i, j);
      end for
    end for
  end for

Again, the matrix S is Hermitian. In contrast to the sequential case, instead of computing only the upper triangular portion of S, all of S is computed using the source contributions. While this adds some extra computation, the computation is being done in parallel. If data were to be copied from the upper triangular section to the lower triangular section, a large number of communications would have to take place and would result in a bottleneck.

Algorithm A.13 Pseudocode for the SOURCE POWER general parallel case. Let m be the number of blocks in the first block row of S, and let sub[S]N and sub[S]M be the number of rows and columns of sub[S], respectively.

  for li = 1 to sub[S]N do
    for lj = 1 to sub[S]M do
      Compute global indices (i, j) corresponding to (li, lj);
      Let SourceList = Sources corresponding to S(i, j);
      for each source type t do
        for each source of type t in SourceList do
          Update sub[S](li, lj);
        end for
      end for
    end for
  end for

A.4.2.2 Rotationally Symmetric

The rotationally symmetric case is identical to the general parallel case in Section A.4.2.1 with one modification. Because the matrix S is block circulant, the global indices are modified to stay within the first block row of S when accessing the source list. For example, say a processor's local index corresponds to an entry residing in the

first block of the second block row. Knowing the matrix is block circulant, the second block row is only a circular shift of the first block row. This means that the first block of the second row is the last block of the first row. Therefore, the indices corresponding to the first block of the second row are modified to point to the last block of the first row.

In performing the computations this way, no communication between the processors is necessary to fill in the matrix S. The pseudocode for the algorithm, including the index modifications, is given by Algorithm A.14.

Algorithm A.14 Pseudocode for the SOURCE POWER rotationally symmetric parallel case. Let N be the number of rows and columns of S, m the number of blocks in the first block row of S, and n the order of each block. In addition, define sub[S]N and sub[S]M to be the number of rows and columns of sub[S], respectively.
for li = 1 to sub[S]N do
  for lj = 1 to sub[S]M do
    Compute global indices (i, j) corresponding to (li, lj);
    j = j + m ∗ n − n ∗ ⌊(i − 1)/n⌋;
    if j > N then
      j = j − N;
    end if
    i = mod(i − 1, n) + 1;
    Let SourceList = Sources corresponding to S(i, j);
    for each source type in SourceList do
      t = SourceType;
      for each source of type t in SourceList do
        Update sub[S](li, lj);
      end for
    end for
  end for
end for
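The index modification itself can be expressed compactly. The following sketch (using 0-based indexing, unlike the pseudocode above) remaps a global entry (i, j) of the block circulant matrix into the equivalent entry of the first block row; the function name is illustrative only.

def remap_to_first_block_row(i, j, m, n):
    # S has m blocks of order n, so N = m * n rows and columns.
    N = m * n
    block_row = i // n            # which block row the entry lies in
    jp = (j - n * block_row) % N  # undo the circular shift of that block row
    ip = i % n                    # corresponding row within the first block row
    return ip, jp

# e.g., remap_to_first_block_row(n, 0, m, n) == (0, (m - 1) * n):
# the first block of the second block row maps to the last block of the first,
# as described above.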

A.5 MODAL RESISTANCE

This routine is quite straightforward. Using the solution vector x obtained from the SOURCE AMPLITUDES MODES routine (see Section A.3) and the matrix S from the SOURCE POWER routine (see Section A.4), it computes s = x∗Sx. Because previous routines have already populated the needed data structures, this routine simply performs the required multiplications using LAPACK or ScaLAPACK.

A.5.1 Sequential

A.5.1.1 General Case

The sequential routine performs two multiplications and uses one intermediate data structure, W, to hold the result of the first multiplication, i.e., W = Sx. After computing the first multiplication, the second multiplication S = x∗W is performed, reusing the data structure S to hold the solution. The LAPACK routine ZGEMM is used for the multiplications. The ZGEMM routine performs matrix-matrix multiplication and is used here because, in general, the solution vector x contains multiple columns. For completeness, the pseudocode for this operation is given by Algorithm A.15.

Algorithm A.15 Pseudocode for the MODAL RESISTANCE general sequential case.
Compute W = Sx using LAPACK routine ZGEMM;
Compute S = x∗W using LAPACK routine ZGEMM;
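In NumPy terms, the two ZGEMM calls amount to the following sketch, where x may have several columns; the function name is illustrative only.

import numpy as np

def modal_resistance(S, x):
    W = S @ x               # first ZGEMM: W = S x
    return x.conj().T @ W   # second ZGEMM: x* W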

A.5.1.2 Rotationally Symmetric

This case behaves in exactly the same way as the general sequential case (see Section A.5.1.1).

A.5.2 Parallel

A.5.2.1 General Case

As in the sequential case, an additional data structure is required for the intermediate multiplication. Therefore, the distributed data structure sub[W] is defined. Because the distributed data structures needed for the parallel multiplications, i.e., sub[S] and sub[x], have already been populated, this routine simply uses ScaLAPACK to perform the parallel matrix multiplications. The routine calls PZGEMM to compute the multiplication W = Sx using the distributed data structures. Following the first multiplication, S = x∗W is computed, reusing the distributed data structure sub[S] to store the solution. Algorithm A.16 shows the pseudocode for the routine.

Algorithm A.16 Pseudocode for the MODAL RESISTANCE general parallel case.
Compute sub[W] = sub[S]sub[x] using ScaLAPACK routine PZGEMM;
Compute sub[S] = sub[x∗]sub[W] using ScaLAPACK routine PZGEMM;

A.5.2.2 Rotationally Symmetric

The parallel rotationally symmetric case is exactly the same as the general parallel case (see Section A.5.2.1 and Algorithm A.16).
