Single and Dual-GPU Generalized Sparse Eigenvalue Solvers for Finding a Few Low-Order Resonances of a Microwave Cavity Using the Finite-Element Method

Adam DZIEKONSKI, Michal MROZOWSKI

Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Narutowicza 11/12, 80-233 Gdańsk, Poland
[email protected], [email protected]

Submitted August 30, 2018 / Accepted September 3, 2018
RADIOENGINEERING, VOL. 27, NO. 4, DECEMBER 2018, DOI: 10.13164/re.2018.0930

Abstract. This paper presents two fast generalized eigenvalue solvers for sparse symmetric matrices that arise when electromagnetic cavity resonances are investigated using the higher-order finite element method (FEM). To find a few low-order resonances, the locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm with null-space deflation is applied. The computations are expedited by using one or two graphics processing units (GPUs) as accelerators. The performance of the solver is tested for single and dual GPU hardware setups, making use of two types of GPU: NVIDIA Kepler K40s and NVIDIA Pascal P100s. The speed of the GPU-accelerated solvers is compared to a multithreaded implementation of the same algorithm using a multicore central processing unit (CPU, Intel Xeon E5-2680 v3 with twelve cores). It was found that, even for the least efficient setups, the GPU-accelerated code is approximately twice as fast as a parallel CPU-only implementation.

Keywords
FEM, generalized eigenvalue problem, GPU, Maxwell's equations, resonators

1. Introduction

Computational electromagnetics (CEM) concerns the design and implementation of fast and accurate solvers of Maxwell's equations. The ultimate goal is to solve complex problems in the shortest possible time. To achieve this goal, algorithms have to be considered in the context of the hardware they will be implemented on. As hardware evolves, known numerical techniques need to be reexamined and modified in order to take advantage of the possibilities offered by new hardware. GPU computing is an emerging trend in computational electromagnetics. Initially, GPU acceleration was considered for the finite-difference time-domain (FDTD) method [1–4]. In recent years, some effort has also been devoted to taking advantage of the processing power of graphics accelerators for other techniques, such as the method of moments [5–7], the multilevel fast multipole algorithm (MLFMA) [8], and the discontinuous Galerkin (DG) method [9], as well as other techniques [10]. In this paper, we consider GPU acceleration of the finite element method, one of the most powerful and versatile numerical techniques, which is often applied to the solution of boundary value problems that arise in electromagnetics.

To date, most publications on FEM in the context of electromagnetics have been related to the solution of linear systems of equations [11–13] and to FEM matrix generation and assembly [13–15]. In this paper, we concentrate on solving the generalized eigenvalue problems that emerge when FEM is applied to simulate the free oscillations of microwave cavities. GPU acceleration of eigenvalue solvers is less frequently discussed in the literature, and in most cases standard, rather than generalized, eigenvalue problems are considered [16–19]. Solvers for standard eigenvalue problems are not suitable for FEM analysis, so we focus on generalized eigenvalue problems, extending our recent work on this topic [20].

2. Finite Element Method

Let us consider a lossless dielectric-loaded electromagnetic cavity Ω enclosed by a perfect electric conductor boundary S. The behavior of the electric field inside Ω and the frequencies of free oscillations are determined by the vector Helmholtz equation:

∇ × ∇ × E − k₀² εᵣ E = 0    (1)

where E is the electric field vector, k₀ = ω/c is the wavenumber, εᵣ is the relative permittivity, ω is the angular frequency, and c is the speed of light in vacuum. The nonzero frequencies for which the above equation has a nontrivial solution determine the resonances of the cavity.

To solve the vector Helmholtz equation in 3D, we use the finite-element method with a tetrahedral mesh and higher-order basis functions [21], [22]. To this end, we work with the weak formulation of (1):

∫_Ω (∇ × w · ∇ × E − k₀² w · εᵣ E) dΩ = 0    (2)

where w is a vector testing function. We then use the Galerkin method with hierarchical vector basis functions up to the third order [23], arriving at:

(K − k₀² M) e = 0    (3)

where K, M ∈ ℝ^{n×n} are large and sparse real-valued stiffness and mass matrices, respectively, e ∈ ℝ^n is a vector of expansion coefficients for the basis functions, and n is the number of degrees of freedom (DoF). Additionally, the matrix M is positive definite. Equation (3) defines the generalized eigenvalue problem. Since the matrices are real and symmetric, and one of them is positive definite, the eigenvalues are real and nonnegative.

3. Finding Small Nonzero Eigenvalues

The system matrix that emerges from FEM using higher-order basis functions is large and sparse. This means that direct solution methods for eigenproblems cannot be applied. The only way to find the resonances and modal fields is to use iterative techniques. In practice, we are interested in a few low-order resonances that correspond to eigenvalues with small magnitudes. One algorithm recommended for finding a few algebraically smallest eigenvalues of symmetric positive definite matrix pencils is the locally optimal block preconditioned conjugate gradient (LOBPCG) method proposed by Knyazev [24]. With a good preconditioner, this method converges rapidly. Moreover, it does not require any matrix factorization and, because it involves a three-term recurrence, its memory requirements are relatively low. Finally, the basic operation in the algorithm is a sparse matrix–vector product. These features are desirable from the point of view of efficiently using the massive parallelization capabilities offered by state-of-the-art graphics processing units.

The LOBPCG algorithm cannot be applied directly to the FEM problems resulting from the weak form (2), as the ∇ × ∇ × (·) operator yields zero when applied to the gradient of any scalar function. As a result, the eigenproblem (3) has multiple zero eigenvalues. Such eigenvalues would be found by LOBPCG, but they are of no interest, since they are not physical solutions fulfilling Maxwell's equations. More precisely, the modal field associated with each zero eigenvalue does not fulfill the condition ∇ · εᵣE = 0. These nonphysical solutions therefore have to be deflated during each iteration of LOBPCG [25]. This process can be called null-space filtering, as imposing the divergence-free condition is, in practice, implemented as a deflation using a projector that ensures that the projected vectors are M-orthogonal to the null-space. The projected vectors thus have no components from the null-space, and are then used to construct modal fields (eigenvectors) that correspond to nonzero eigenvalues and fulfill ∇ · εᵣE = 0.

The LOBPCG method with null-space filtration is shown in Algorithm 1. The null-space filtering is applied in lines 2 and 13. In practice, this is implemented as the solution of a relatively small system of sparse equations with multiple right-hand sides. The system matrix for null-space filtration is identical to the system matrix obtained when solving the Poisson equation with Lagrange finite elements. The matrix is given by YᵀMY, with Y being a discrete gradient operator with only two nonzero elements per row (1 or −1) [25]. To solve the Poisson equation, we use the conjugate gradient method with a hierarchical multilevel preconditioner operating in a V-cycle (Algorithm 2) [22]. The hierarchical multilevel preconditioner (Hie-ML) is shown in lines P.1–P.11 of Algorithm 2. The preconditioner takes advantage of the hierarchy of the basis functions used in FEM. In our case, basis functions up to the third order are considered (QTCuN) [23], so we used three levels in the V-cycle. On the lowest level (P.4), the direct solution of a linear system is needed. However, since only first-order basis functions are involved on this level, the system is very small and its solution does not pose any problem. For the smoothing that is performed when transfers between levels occur during restriction and prolongation (lines P.6–P.10), we used a few weighted Jacobi iterations.

Another important element is the preconditioner P used in the LOBPCG (Algorithm 1, line 12). In this paper, we employ for this a sequence of operations approximately equal to applying the inverse of the matrix A = K − κM to a vector or a block of vectors (κ is a real scalar chosen to be smaller than, but close to, the smallest nonzero eigenvalue of the matrix pencil (K, M)). This is implemented as a few iterations of the preconditioned conjugate gradient method (PCG-V), the same method used for null-space filtering. In other words, in Step 12 of the LOBPCG algorithm, instead of computing Hₖ = P⁻¹Rₖ, we use an approximate solution of the matrix system (K − κM)Hₖ = Rₖ.

Note that, except for the direct solution on the lowest level (Algorithm 2, line P.4), almost all other steps in the LOBPCG algorithm can be expressed in terms of a matrix-times-vector operation or simple BLAS1-type operations. Also, the preconditioned conjugate-gradient algorithm involves a sparse matrix–vector product (SpMV). This is typical of all iterative techniques based on Krylov spaces. In order to execute the iterations rapidly, the performance of the sparse matrix–vector product should be increased.
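Since SpMV dominates the run time of the Krylov-type iterations discussed above, it is worth recalling what the kernel actually computes. The following is a minimal reference sketch of an SpMV in the CSR format, checked against SciPy's own product; the toy matrix and sizes are our own illustrative choices, not taken from the paper:

```python
import numpy as np
import scipy.sparse as sp

def csr_spmv(data, indices, indptr, x):
    """Reference CSR sparse matrix-vector product y = A @ x.
    Row i uses the slice data[indptr[i]:indptr[i+1]] together with the
    matching column indices; this per-row dot product is the work that
    GPU SpMV kernels distribute over threads/warps."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

# Toy example: a small random sparse matrix, verified against SciPy.
A = sp.random(50, 50, density=0.1, format="csr", random_state=0)
x = np.arange(50, dtype=float)
assert np.allclose(csr_spmv(A.data, A.indices, A.indptr, x), A @ x)
```

In an optimized GPU code this loop is, of course, replaced by a tuned kernel; the sketch only fixes the semantics that such kernels must reproduce.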
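The generalized eigenproblem (3) and the role of LOBPCG can be illustrated on a small stand-in pencil. The sketch below uses SciPy's `lobpcg` routine (not the authors' GPU implementation) on hypothetical tridiagonal stiffness- and mass-like matrices; the matrices, sizes, and tolerances are illustrative assumptions only:

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import eigh
from scipy.sparse.linalg import lobpcg

n, k = 100, 4                      # problem size and number of wanted modes
# Hypothetical SPD stand-ins for the FEM stiffness (K) and mass (M) matrices.
K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
M = sp.diags([1/6, 4/6, 1/6], [-1, 0, 1], shape=(n, n), format="csr")

rng = np.random.default_rng(0)
X0 = rng.standard_normal((n, k))   # random initial block of k vectors
# largest=False asks for the algebraically smallest eigenvalues of the
# pencil (K, M), as required for low-order resonances.
vals, vecs = lobpcg(K, X0, B=M, largest=False, tol=1e-9, maxiter=2000)

# Cross-check against a dense generalized eigensolver.
ref = eigh(K.toarray(), M.toarray(), eigvals_only=True)[:k]
assert np.allclose(np.sort(vals), ref, rtol=1e-4)
```

Every step of the iteration reduces to SpMV with K and M plus dense block operations, which is what makes the method a good fit for GPUs; in the paper the convergence is further accelerated by the preconditioner and null-space filtering described in Section 3.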
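The null-space filtering step of Section 3, X ← X − Y(YᵀMY)⁻¹YᵀMX, can be sketched directly. The discrete gradient Y below is a hypothetical incidence-style operator with two nonzeros (±1) per row, as in the paper, built on a simple path graph with one grounded node; the diagonal mass matrix is an illustrative SPD stand-in:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def nullspace_filter(X, Y, M):
    """Make the block X M-orthogonal to range(Y):
    X <- X - Y (Y^T M Y)^{-1} (Y^T M X).
    The small sparse system with matrix Y^T M Y (a Poisson-like matrix)
    is solved here by a direct factorization for clarity; the paper uses
    PCG with a hierarchical multilevel preconditioner instead."""
    G = (Y.T @ M @ Y).tocsc()        # Poisson-like system matrix
    rhs = Y.T @ (M @ X)              # multiple right-hand sides
    return X - Y @ splu(G).solve(rhs)

n = 40                               # number of "edge" unknowns
# Hypothetical discrete gradient: edge i connects nodes i-1 and i,
# with node 0 grounded, giving +1/-1 entries in each row.
Y = (sp.eye(n) - sp.eye(n, k=-1)).tocsr()
M = sp.diags(np.linspace(1.0, 2.0, n)).tocsr()   # SPD "mass" stand-in

rng = np.random.default_rng(1)
X = rng.standard_normal((n, 3))
Xf = nullspace_filter(X, Y, M)
# The filtered block carries no null-space components:
assert np.allclose(Y.T @ (M @ Xf), 0.0, atol=1e-10)
```

The final assertion is the algebraic form of the divergence-free condition: after projection, YᵀM(X − Y(YᵀMY)⁻¹YᵀMX) = YᵀMX − YᵀMX = 0 identically.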
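The structure of the V-cycle preconditioner (Algorithm 2) can be conveyed with a minimal two-level geometric analogue. The paper's Hie-ML preconditioner works on the p-hierarchy of basis functions with three levels, so everything here (the 1D Poisson matrix, the hat-function transfer operator, the damping factor) is an illustrative assumption standing in for that hierarchy:

```python
import numpy as np

def weighted_jacobi(A, x, b, omega=2/3, sweeps=2):
    """Smoother used around level transfers: x <- x + omega * D^{-1} (b - A x)."""
    d = np.diag(A)
    for _ in range(sweeps):
        x = x + omega * (b - A @ x) / d
    return x

def two_level_vcycle(A, b, x, P):
    """One V-cycle: pre-smooth, restrict the residual, solve the coarse
    system directly (cf. line P.4 of Algorithm 2), prolongate the
    correction, post-smooth."""
    x = weighted_jacobi(A, x, b)                 # pre-smoothing
    r_c = P.T @ (b - A @ x)                      # restrict residual
    A_c = P.T @ A @ P                            # coarse (Galerkin) operator
    x = x + P @ np.linalg.solve(A_c, r_c)        # direct coarse solve
    return weighted_jacobi(A, x, b)              # post-smoothing

# Toy 1D Poisson problem with linear interpolation between levels.
n = 63
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
nc = (n - 1) // 2
P = np.zeros((n, nc))
for j in range(nc):                              # hat-function interpolation
    P[2*j:2*j+3, j] = [0.5, 1.0, 0.5]

b = np.ones(n)
x = two_level_vcycle(A, b, np.zeros(n), P)
# One cycle should sharply reduce the residual:
assert np.linalg.norm(b - A @ x) < 0.2 * np.linalg.norm(b)
```

The division of labor mirrors the description in the text: the weighted Jacobi sweeps damp the components the coarse level cannot represent, while the small direct solve handles the lowest level cheaply.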
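Finally, the LOBPCG preconditioner of Step 12, i.e. an approximate solve of (K − κM)Hₖ = Rₖ, can be mimicked by running a fixed, small number of conjugate gradient iterations. The sketch uses plain (unpreconditioned) CG from SciPy as a stand-in for the paper's PCG-V; the matrices, the value of κ, and the inner iteration count are illustrative assumptions:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

n = 100
K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
M = sp.diags([1/6, 4/6, 1/6], [-1, 0, 1], shape=(n, n), format="csr")

kappa = 1e-4        # below the smallest pencil eigenvalue, so K - kappa*M stays SPD
A = (K - kappa * M).tocsr()

def apply_precond(r, n_inner=10):
    """Approximate H = A^{-1} r with a few CG steps (inner iteration).
    A nonzero info flag from cg() just means it stopped at maxiter,
    which is exactly the intended behavior here."""
    h, _ = cg(A, r, maxiter=n_inner)
    return h

rng = np.random.default_rng(2)
r = rng.standard_normal(n)
h = apply_precond(r)

# CG decreases the A-norm of the error monotonically, so even a few
# steps yield a useful approximation to the exact solve (x0 = 0):
h_exact = np.linalg.solve(A.toarray(), r)
e = h_exact - h
assert e @ (A @ e) < h_exact @ (A @ h_exact)
```

Truncating the inner solve like this trades preconditioner quality for cost per outer iteration; the paper tunes that balance with PCG-V rather than plain CG.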