2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

Discrete Sine and Cosine Transform and Helmholtz Equation Solver on GPU

Mingming Ren, Yuyang Gao∗, Gang Wang, Xiaoguang Liu TJ Key Lab of NDST, College of Computer Science, Nankai University, Tianjin, China {renmingming, gaoyy, wgzwp, liuxg}@nbjl.nankai.edu.cn

978-0-7381-3199-3/20/$31.00 ©2020 IEEE. DOI 10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00034

Abstract—The Helmholtz equation is a special kind of elliptic partial differential equation. Solving the Helmholtz equation is often needed in scientific and engineering problems. An efficient approach to solving it is to use the Fast Fourier Transform (FFT). In practice, boundary conditions must be considered, and several discrete Fourier-related transforms such as the Discrete Sine and Cosine Transforms (DST, DCT) are needed for solving problems with different boundary conditions. With the development of Compute Unified Device Architecture (CUDA) technology in recent years, many researchers use Graphics Processing Units (GPUs) to accelerate their programs. In view of the importance of the FFT, NVIDIA officially provides the library cuFFT, but it does not include the calculation of DST and DCT, which is inconvenient for users. In this paper, we present cuHelmholtz, which is designed and implemented based on CUDA. It can be used to compute several kinds of three-dimensional DST and DCT, and to solve three-dimensional Helmholtz equations with various boundary conditions. The experimental results show that, compared with FISHPACK, a FORTRAN software package that can solve the three-dimensional Helmholtz equation, we achieve about 10× to 30× speedup.

Index Terms—Helmholtz equation, CUDA, GPU, Discrete Sine Transform, Discrete Cosine Transform

I. INTRODUCTION

The Helmholtz equation is a special kind of elliptic partial differential equation. In the process of solving many mathematical and physical problems (such as the wave equation, the potential distribution in electrostatics and incompressible flow problems [1], [2], etc.) one may need to solve the Helmholtz equation (or its special cases, such as the Laplace equation and the Poisson equation). Furthermore, many problems need to be solved iteratively, and in each iteration one may need to solve a Helmholtz equation [3], [4]. Therefore, it is helpful to solve the Helmholtz equation as quickly as possible.

Fourier-related transforms have important applications in solving partial differential equations [5]–[7]. Many partial differential equations, including Helmholtz equations with constant coefficients, can be solved efficiently by Fourier-related transforms.

Fourier-related transforms have also been widely used in many other fields [8]. According to whether the signal is continuous or discrete in time, and periodic or non-periodic, they can be divided into four categories: continuous time Fourier transform, discrete time Fourier transform, continuous time Fourier series and discrete time Fourier series. Thanks to the rapid development of digital technology, especially computer hardware and software technology, there are quite a lot of resources available for discrete signal processing. For example, the discrete-time Fourier series of periodic signals (i.e. the Discrete Fourier Transform, DFT [9]) is usually computed by the Fast Fourier Transform (FFT). Many well-known software packages such as MATLAB, Intel MKL and FFTW [10] provide corresponding implementations.

Fourier based methods for the Helmholtz equation can handle not only periodic boundary conditions but also other kinds of boundary conditions, such as Dirichlet or Neumann boundary conditions. By using the finite difference method, the solution can be obtained through variants of the DFT, such as the Discrete Sine Transform (DST) and the Discrete Cosine Transform (DCT) [9], [11]. Different variants of the DFT correspond to different homogeneous boundary conditions. Furthermore, non-homogeneous boundary conditions can be transformed into homogeneous boundary conditions. It is worth mentioning that the type 2 DCT is widely used in fields such as image processing and video coding [12], [13], and it is called "the" DCT in these fields.

Since Nvidia Corporation launched Compute Unified Device Architecture (CUDA), Graphics Processing Units (GPUs) have played an increasingly important role in scientific and engineering computing over the past decade. Many studies have used GPUs to reduce run time. Nvidia also provides a library named cuFFT for computing the FFT. However, the library does not contain the calculation of DST and DCT, which makes it inconvenient to solve the Helmholtz equation with non-periodic boundary conditions on a GPU.

To the best of the authors' knowledge, there is no literature that computes all the common kinds of DST and DCT on a GPU, nor that solves the three-dimensional Helmholtz equation with several common boundary conditions on a GPU. Here we introduce some related works. Wu et al. [14], [15] computed the solution of the three-dimensional Poisson equation on a GPU platform for one Neumann boundary condition (the other two are still periodic boundary conditions). In [16], Ghetia et al. computed the type 2 DCT on a GPU platform in the two-dimensional case. Their method requires extra GPU memory and is not suitable for large-scale three-dimensional transforms.

In this paper we design and implement algorithms for computing several kinds of three-dimensional DST and DCT on a GPU, and use them to solve three-dimensional Helmholtz equations with various boundary conditions. The code library we implemented is named cuHelmholtz. The paper is organized as follows. We briefly introduce some basics of CUDA and cuFFT in Section II. Section III describes how to solve the Helmholtz equation by the finite difference method and the Fourier transform method. Section IV describes the design and implementation of cuHelmholtz. In Section V, we test and discuss the accuracy and performance of cuHelmholtz. Section VI is a brief summary.

II. CUDA PROGRAMMING MODEL AND CUFFT

A. CUDA Programming Model

CUDA makes GPUs easy to use in general scientific and engineering computation. In the CUDA programming model, the CPU is the host, and the GPU is the co-processor or device. If we launch a kernel function on the GPU (device) side, it is executed by a large number of GPU threads, which are organized into thread blocks. The thread blocks form a grid. The threads are scheduled by the GPU thread scheduler to run on Streaming Multiprocessors (SMs) in units of warps, which are groups of 32 consecutive threads.

Host and device have separate memory spaces. GPU threads can access data from different types of device memory (including global memory, shared memory, registers, constant memory, etc.), of which global memory is the largest and slowest. Global memory is usually used to store intermediate results, which are generated by a previous kernel launch and consumed by subsequent kernel calls. Shared memory is an on-chip memory that is shared between threads of the same thread block. Registers are also on-chip memory and are allocated per thread. Constant memory can only store constant values, and the values in constant memory are cached.

Shared memory is often used to optimize memory-bound CUDA applications. In the current hardware specification, shared memory usually has 32 banks. When accessing shared memory, bank conflicts need to be considered. A bank conflict occurs when two threads in the same warp access different addresses in the same bank; the access requests are then served serially, which degrades performance.

CUDA also provides warp shuffle functions, which can exchange data between threads in a warp without using shared memory. They can be used to implement reductions efficiently.

B. cuFFT

cuFFT is the official FFT library provided by Nvidia. It is designed to provide high performance on Nvidia GPUs. It can calculate three types of discrete Fourier transform: complex to complex (C2C), real to complex (R2C) and complex to real (C2R). One- to three-dimensional transforms are supported, as are batched transforms. A transform can be in-place or out-of-place.

To calculate a DFT with cuFFT, we need to create a plan first. The plan interface provides a way of reusing configuration: if the configuration of the transform is unchanged and only the array to be transformed has changed, the plan can be reused. cuFFT provides multiple interfaces to create a plan, which correspond to different data layouts and batched executions. Next, we can call execution functions such as cufftExecR2C() to perform the corresponding transform (the data layout was specified when creating the plan). The plan can be used to transform different data several times. When the plan is no longer needed, cufftDestroy() can be called to release the relevant resources.

III. SOLUTION OF HELMHOLTZ EQUATIONS

The three-dimensional Helmholtz equation in Cartesian coordinates is

$$\Delta u(x, y, z) + \lambda u(x, y, z) = f(x, y, z) \tag{1}$$

where $u$ is an unknown function, $\lambda$ is a constant, the source term $f$ is a known function, and $\Delta$ is the Laplace operator $\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}$. The functions $u$ and $f$ are defined on a rectangular region $x_s \le x \le x_f$, $y_s \le y \le y_f$, $z_s \le z \le z_f$. $\lambda$ is generally less than or equal to 0. For $\lambda > 0$, and for $\lambda = 0$ in some cases, the Helmholtz equation may not have a solution. If $\lambda = 0$, it is called the Poisson equation, and if in addition $f$ equals 0, it is the Laplace equation.

In order to solve the Helmholtz equation numerically, we use the finite difference method. We put a mesh on the solution region. The mesh has $N_x, N_y, N_z$ mesh points in the $x, y, z$ directions respectively. Our purpose is to find the value of $u$ at the mesh points, that is, we use the values of the function on discrete mesh points to represent the function. The denser the mesh (by increasing $N_x, N_y, N_z$), the more accurate the solution usually is. We denote the function value $u(x, y, z)$ at mesh point $(x_s + i\Delta x,\ y_s + j\Delta y,\ z_s + k\Delta z)$ as $u_{i,j,k}$, where $\Delta x, \Delta y, \Delta z$ are the grid widths, i.e. $\Delta x = \frac{x_f - x_s}{N_x}$, $\Delta y = \frac{y_f - y_s}{N_y}$, $\Delta z = \frac{z_f - z_s}{N_z}$. The discretized Helmholtz equation is obtained with the 7-point central difference scheme as follows:

$$\frac{u_{i+1,j,k} - 2u_{i,j,k} + u_{i-1,j,k}}{(\Delta x)^2} + \frac{u_{i,j+1,k} - 2u_{i,j,k} + u_{i,j-1,k}}{(\Delta y)^2} + \frac{u_{i,j,k+1} - 2u_{i,j,k} + u_{i,j,k-1}}{(\Delta z)^2} + \lambda u_{i,j,k} = f_{i,j,k}$$

In this way, we form a linear system which is usually very large and sparse; solving this linear system gives the numerical solution to the original equation. Since the Helmholtz equation has a special structure, it can be solved by Fourier based methods, which is usually much faster [9].

A. Fourier based method – periodic boundary

For brevity, we illustrate the method in the one-dimensional case. The 7-point difference scheme then becomes the 3-point scheme

$$\frac{u_{j+1} - 2u_j + u_{j-1}}{\Delta^2} + \lambda u_j = f_j \tag{2}$$

where $\Delta$ denotes the grid width. We use the index $j$ rather than $i$ because in this section $i$ denotes the imaginary unit $\sqrt{-1}$.
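The three-point scheme (2) can be checked numerically before turning to how it is solved. The sketch below is only an illustration and not part of cuHelmholtz: the test problem $u(x) = \sin(2\pi x)$ on the periodic interval $[0, 1)$, the value $\lambda = -1$ and the function name `max_residual` are all assumptions made here. It inserts the exact solution into (2) and measures how fast the residual shrinks when the grid is refined.

```python
import math


def max_residual(n, lam=-1.0):
    """Largest residual of scheme (2) when the exact solution is inserted.

    Hypothetical test problem (not from the paper): u(x) = sin(2*pi*x) on the
    periodic interval [0, 1), so that f = u'' + lam*u = (lam - 4*pi**2) * u.
    """
    h = 1.0 / n  # grid width, written as Delta in (2)
    u = [math.sin(2.0 * math.pi * j * h) for j in range(n)]
    f = [(lam - 4.0 * math.pi ** 2) * uj for uj in u]
    worst = 0.0
    for j in range(n):
        # Periodic wrap-around: index -1 maps to n-1 and index n maps to 0.
        second_diff = (u[(j + 1) % n] - 2.0 * u[j] + u[(j - 1) % n]) / h ** 2
        worst = max(worst, abs(second_diff + lam * u[j] - f[j]))
    return worst


r32 = max_residual(32)
r64 = max_residual(64)
# Halving the grid width should divide the residual by about 4 (second order).
observed_order = math.log2(r32 / r64)
print(r32, r64, observed_order)
```

Under these assumptions the observed order is close to 2, which is the behaviour expected of a central difference scheme.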

We illustrate how to solve the Helmholtz equation with periodic boundary conditions by the Fourier based method. In this case, we make use of the following discrete Fourier series of the one-dimensional data $\{u_j\}$ and $\{f_j\}$:

$$u_j = \sum_{k=0}^{N-1} \hat{u}_k e^{i 2\pi k j / N}, \qquad f_j = \sum_{k=0}^{N-1} \hat{f}_k e^{i 2\pi k j / N} \tag{3}$$

If we substitute expressions (3) in (2), we find

$$\hat{u}_k \left( \frac{e^{i 2\pi k / N} - 2 + e^{-i 2\pi k / N}}{\Delta^2} + \lambda \right) = \hat{f}_k$$

or

$$\hat{u}_k = \hat{f}_k \Big/ \left( \frac{2\cos\frac{2\pi k}{N} - 2}{\Delta^2} + \lambda \right) \tag{4}$$

Thus the strategy for solving equation (2) by the Fourier based method is
(a) Compute $\hat{f}_k$ as the Discrete Fourier Transform (DFT)

$$\hat{f}_k = \sum_{j=0}^{N-1} f_j e^{-i 2\pi k j / N} \tag{5}$$

(b) Compute $\hat{u}_k$ from equation (4).
(c) Compute $u_j$ by the inverse Discrete Fourier Transform (3) (IDFT). Computing the IDFT (3) followed by its inverse (5) yields the original array scaled by $N$, so the results need to be divided by $N$.

B. Homogeneous Dirichlet boundary condition and Neumann Boundary condition

Boundary conditions are essential in solving partial differential equations. Besides the periodic boundary condition, two boundary conditions are common in numerical solution: the Dirichlet boundary condition and the Neumann boundary condition. The former specifies the value of the unknown function at the boundary, and the latter specifies its (normal) derivative at the boundary. Here homogeneous means that the given value or derivative at the boundary is zero.

There are two boundaries in each direction (left and right) and two boundary conditions for each boundary (Dirichlet, Neumann). Therefore, together with the periodic boundary condition, there are five possibilities for the boundary conditions in each direction (a periodic boundary condition applies to both ends). These cases are represented by the numbers 0 to 4 as follows:
• Case 0: periodic boundary
• Case 1: left and right are both Dirichlet boundaries
• Case 2: left is a Dirichlet boundary, right is a Neumann boundary
• Case 3: left and right are both Neumann boundaries
• Case 4: left is a Neumann boundary, right is a Dirichlet boundary

For case 0, the solution by FFT is illustrated in Section III-A. For the other cases, we need to use DST and DCT to solve the equation. We again take the one-dimensional case as an example.

1) Case 1 – Dirichlet-Dirichlet: In this case, the values of the function at the two boundaries are given. In the homogeneous case, $u_0 = u_N = 0$. Instead of (3) we need to use an expansion in sine waves

$$u_j = \sum_{k=1}^{N-1} \hat{u}_k \sin\frac{\pi k j}{N} \tag{6}$$

This time instead of (4) we find $\hat{u}_k$ by

$$\hat{u}_k = \hat{f}_k \Big/ \left( \frac{2\cos\frac{\pi k}{N} - 2}{\Delta^2} + \lambda \right) \tag{7}$$

Thus the strategy for solving equation (2) by the Fourier based method is
(a) Compute $\hat{f}_k$ as the Discrete Sine Transform (DST1)

$$\hat{f}_k = \sum_{j=1}^{N-1} f_j \sin\frac{\pi k j}{N} \tag{8}$$

(b) Compute $\hat{u}_k$ from equation (7).
(c) Compute $u_j$ by the inverse Discrete Sine Transform (6), which is DST1 itself up to normalization: applying (8) and then (6) returns the original array scaled by $N/2$.

2) Case 2 – Dirichlet-Neumann: In this case, the value of the function at the left boundary and the derivative at the right boundary are given. In the homogeneous case, $u_0 = (\nabla u)_N = 0$. Instead of (3) we need to use an expansion in sine waves

$$u_j = \sum_{k=1}^{N} \hat{u}_k \sin\left( \frac{\pi}{N}\left(k - \frac{1}{2}\right) j \right) \tag{9}$$

This time instead of (4) we find $\hat{u}_k$ by

$$\hat{u}_k = \hat{f}_k \Big/ \left( \frac{2\cos\frac{\pi (k - \frac{1}{2})}{N} - 2}{\Delta^2} + \lambda \right) \tag{10}$$

Thus the strategy for solving equation (2) by the Fourier based method is
(a) Compute $\hat{f}_k$ as the type 3 Discrete Sine Transform (DST3)

$$\hat{f}_k = \frac{(-1)^{k-1}}{2} f_N + \sum_{j=1}^{N-1} f_j \sin\left( \frac{\pi}{N}\, j \left(k - \frac{1}{2}\right) \right) \tag{11}$$

(b) Compute $\hat{u}_k$ from equation (10).
(c) Compute $u_j$ by the type 2 Discrete Sine Transform (9) (DST2), which inverts DST3 up to normalization: the round trip scales the array by $N/2$.

3) Case 3 – Neumann-Neumann: In this case, the derivatives of the function at the two boundaries are given. In the homogeneous case, $(\nabla u)_0 = (\nabla u)_N = 0$. Instead of (3) we need to use an expansion in cosine waves

$$u_j = \sum_{k=1}^{N-1} \hat{u}_k \cos\frac{\pi k j}{N} \tag{12}$$

This time instead of (4) we find $\hat{u}_k$ by

$$\hat{u}_k = \hat{f}_k \Big/ \left( \frac{2\cos\frac{\pi k}{N} - 2}{\Delta^2} + \lambda \right) \tag{13}$$

Thus the strategy for solving equation (2) by the Fourier based method is
(a) Compute $\hat{f}_k$ as the Discrete Cosine Transform (DCT1)

$$\hat{f}_k = \sum_{j=1}^{N-1} f_j \cos\frac{\pi k j}{N} \tag{14}$$

(b) Compute $\hat{u}_k$ from equation (13).
(c) Compute $u_j$ by the inverse Discrete Cosine Transform (12), which is DCT1 itself up to normalization: the round trip scales the array by $N/2$.

4) Case 4 – Neumann-Dirichlet: In this case, the derivative of the function at the left boundary and the value at the right boundary are given. In the homogeneous case, $(\nabla u)_0 = u_N = 0$. Instead of (3) we need to use an expansion in cosine waves

$$u_j = \sum_{k=0}^{N-1} \hat{u}_k \cos\left( \frac{\pi}{N}\left(k + \frac{1}{2}\right) j \right) \tag{15}$$

This time instead of (4) we find $\hat{u}_k$ by

$$\hat{u}_k = \hat{f}_k \Big/ \left( \frac{2\cos\frac{\pi (k + \frac{1}{2})}{N} - 2}{\Delta^2} + \lambda \right) \tag{16}$$

Thus the strategy for solving equation (2) by the Fourier based method is
(a) Compute $\hat{f}_k$ as the type 3 Discrete Cosine Transform (DCT3)

$$\hat{f}_k = \frac{1}{2} f_0 + \sum_{j=1}^{N-1} f_j \cos\left( \frac{\pi}{N}\, j \left(k + \frac{1}{2}\right) \right) \tag{17}$$

(b) Compute $\hat{u}_k$ from equation (16).
(c) Compute $u_j$ by the type 2 Discrete Cosine Transform (15) (DCT2), which inverts DCT3 up to normalization: the round trip scales the array by $N/2$.

C. Inhomogeneous boundary conditions

For problems with inhomogeneous boundary conditions, the boundary conditions can be merged into the right-hand term, which turns them into homogeneous boundary conditions. We follow Chapter 20 of Numerical Recipes [9] for this transformation and omit the details here.

IV. DESIGN AND IMPLEMENTATION OF CUHELMHOLTZ

A. Helmholtz solver

According to Section III, we design Algorithm 1 for solving the three-dimensional Helmholtz equation on a GPU.

Algorithm 1 Solver for three-dimensional Helmholtz equation
Input: f defines the right-hand term; λ is the coefficient in (1); xs, xf, ys, yf, zs, zf define the solution region; Nx, Ny, Nz define the number of mesh points in each direction; bx, by, bz define the boundary conditions; other necessary boundary values
Output: solution u (which is stored in f)
1: Initialize global variables on CPU side
2: Copy global variables to GPU constant memory
3: for d in {x, y, z} do
4:     Deal with inhomogeneous boundary condition in d axis. f will be combined with the contribution of the inhomogeneous boundary term
5: end for
6: Allocate GPU global memory ex to accommodate Fourier transform
7: Copy f to ex
8: if Nx = Ny = Nz then
9:     Prepare in-place transform
10: else
11:     Prepare out-of-place transform
12: end if
13: Call forward_transform(ex) to do three-dimensional transform
14: Divide in frequency domain as (4), (7), (10), (13), (16) in Section III
15: Call backward_transform(ex) to do three-dimensional inverse transform
16: Copy ex back to f
17: if bx = 0 or by = 0 or bz = 0 then
18:     Assign values of right boundary for each periodic direction
19: end if

Lines 1 to 2 of Algorithm 1 initialize some global variables that are used in the computation and copy them to constant memory on the GPU side. These variables include $N_x, N_y, N_z, \Delta x, \Delta y, \Delta z, x_s, x_f, y_s, y_f, z_s, z_f, \Delta x^2, \Delta y^2, \Delta z^2$ and so on. Some variables, such as $\Delta x^2, \Delta y^2, \Delta z^2$, are included to reduce the computation time on the GPU side. Storing these global variables in constant memory helps to reduce register usage and improve performance.

Lines 3 to 5 of Algorithm 1 deal with inhomogeneous boundary conditions. Because cuFFT needs a slightly larger array than the transformed data for the real to complex DFT, lines 6 to 7 allocate this space on the GPU side and copy the data into it. Lines 8 to 12 do some preparation for the in-place or out-of-place transform. For example, when the numbers of grid points in the three directions are not equal, the out-of-place transform is needed, and an additional array of the same size as the input is allocated.

Line 13 performs the three-dimensional transform (see Section IV-B). In this step we get $\hat{f}$ from $f$. Line 14 performs the division, in which $\hat{u}$ is calculated from $\hat{f}$. If the three boundaries include at least one periodic boundary, special treatment is needed at this step. This is because in the periodic case only half of the complex values are stored by the Fourier transform, which requires different processing in the division step. Line 15 performs the three-dimensional inverse transform (also see Section IV-B). After this, we get $u$ from $\hat{u}$. Lines 16 to 19 copy $u$ to the output array $f$. If there is a periodic boundary, the right end of the periodic boundary needs to be assigned.

B. CUDA library of three-dimensional transforms

1) Approach: We have seen that the core of solving the three-dimensional Helmholtz equation by the Fourier based method is to carry out two three-dimensional Fourier transforms, and that different boundary conditions correspond to different Fourier transforms. In the process of solving the Helmholtz equation, eight different types of Fourier transforms are involved: DFT and IDFT (case 0, periodic), DST1 (case 1, Dirichlet-Dirichlet), DCT1 (case 3, Neumann-Neumann), DST3 and DST2 (case 2, Dirichlet-Neumann), and DCT3 and DCT2 (case 4, Neumann-Dirichlet). These transforms are all invertible. The inverse of DFT is IDFT and vice versa, of DST1 is DST1, of DST3 is DST2 and vice versa, of DCT1 is DCT1, of DCT3 is DCT2 and vice versa. However, each transform as listed above is unnormalized: computing a transform followed by its inverse yields the original array scaled by $N/2$, except for DFT and IDFT, where the scale is $N$.

Each transform requires the sequence to be transformed to have a certain symmetry. Depending on the choice of symmetry points at both ends, the symmetry requirements stated in different references may be inconsistent. The symmetry requirements adopted in this paper are listed in Table I.

TABLE I: Symmetric requirements for eight different kinds of discrete Fourier transforms

| Transform | Symmetric requirements for the sequence |
|---|---|
| DFT | periodic sequence |
| IDFT | periodic sequence |
| DST1 | odd around $j = 0$ and odd around $j = N$ |
| DST2 | odd around $j = \frac{1}{2}$ and odd around $j = N + \frac{1}{2}$ |
| DST3 | odd around $j = 0$ and even around $j = N$ |
| DCT1 | even around $j = 0$ and even around $j = N$ |
| DCT2 | even around $j = -\frac{1}{2}$ and even around $j = N - \frac{1}{2}$ |
| DCT3 | even around $j = 0$ and odd around $j = N$ |

A three-dimensional transform can be obtained by transforming each dimension in turn, so the core of the problem is how to calculate the one-dimensional transform. A simple method is to extend the sequence according to its required symmetry. For instance, in order to compute the DST1 of the sequence $s = (0, x_1, x_2, \cdots, x_{N-1}, 0)$, we can simply extend it to $s' = (0, x_1, x_2, \cdots, x_{N-1}, 0, -x_{N-1}, \cdots, -x_1)$. By calculating the DFT of $s'$ through the FFT, we get the DST1 of $s$. This method introduces a factor of two inefficiency. For a one-dimensional problem, this seems acceptable. But in the three-dimensional case the factor becomes eight, which is usually unacceptable. Furthermore, with this extension method, DST3 and DCT3 introduce a factor of four inefficiency.

Another way [17] to calculate each type of DST or DCT is to transform it into a DFT of a sequence of the same size; these methods need pre-processing and post-processing steps. In the implementation of cuHelmholtz, we adopt this approach.

Take DST1 as an example,

$$c_j = \sum_{k=1}^{N-1} x_k \sin\frac{\pi k j}{N}, \qquad j = 0, 1, \cdots, N-1$$

The pre-processing step is to construct an auxiliary array $d_k$,

$$d_0 = 0, \qquad d_k = \sin\left(\frac{k\pi}{N}\right)(x_k + x_{N-k}) + \frac{1}{2}(x_k - x_{N-k}), \qquad k = 1, 2, \cdots, N-1$$

This array is of the same size as the original. We then compute the DFT of $d_k$ (real to complex, R2C) and get a complex sequence $X_k = R_k + iI_k$. The DST1 of the sequence $x_k$ is obtained by post-processing the sequence $X_k$ as follows:

$$c_0 = 0, \quad c_1 = \frac{1}{2}R_0, \quad c_{2j} = -I_j, \quad c_{2j+1} = c_{2j-1} + R_j, \qquad j = 1, 2, \cdots, \frac{N}{2} - 1$$

We can see that the pre-processing can be efficiently vectorized, but the post-processing contains a recurrence, which cannot be efficiently vectorized.

An alternative approach leads to an algorithm in which pre- and post-processing are both efficiently vectorized. Since DST1 is its own inverse up to a scale factor, DST1 can be computed by reversing the order of the calculations. The pre-processing and post-processing steps are both invertible, and the exact inverse of the R2C DFT between them is the C2R DFT scaled by $1/N$. Combined with the $N/2$ scaling of DST1, this means that when the unnormalized C2R transform is used, each element of the calculated sequence must be divided by 2 to obtain the DST1 of the original sequence. The advantage of this reversed order computation is that pre- and post-processing are both efficiently vectorized, since neither contains a recurrence relation.

We summarize the pre- and post-processing for each DST and DCT from the literature [9], [17]. If the forward computation has low concurrency, we switch to the reversed computation in order to get an algorithm better suited to the GPU. We list all the formulas for computing each transform in Table II. It can be seen that only DST3 and DCT3 are efficiently vectorized in their forward (R2C) form; all the others use the reversed (C2R) computation to improve concurrency.

TABLE II: Computing Discrete Sine and Cosine Transforms. Here $x$ is the input sequence, $c$ is the output sequence, $d$ is the real array and $R_k + iI_k$ the complex array on either side of the DFT.

| Transform | Pre-processing | DFT | Post-processing |
|---|---|---|---|
| DST1 | $R_0 = 2x_1$, $R_{N/2} = -2x_{N-1}$; $I_j = -2x_{2j}$, $R_j = x_{2j+1} - x_{2j-1}$ for $j = 1, 2, \cdots, \frac{N}{2} - 1$ | C2R | $c_0 = 0$; $c_n = \frac{1}{4}(d_n - d_{N-n}) + \frac{1}{8\sin\frac{n\pi}{N}}(d_n + d_{N-n})$ for $n = 1, 2, \cdots, N-1$ |
| DST2 | $R_0 = x_1$, $R_{N/2} = -x_N$; $R_k = \frac{x_{2k+1} - x_{2k}}{2}$, $I_k = -\frac{x_{2k+1} + x_{2k}}{2}$ for $k = 1, 2, \cdots, \frac{N}{2} - 1$ | C2R | $c_0 = 0$; $c_n = \frac{(d_n - d_{N-n})\cos\frac{n\pi}{2N} + (d_n + d_{N-n})\sin\frac{n\pi}{2N}}{2}$ for $n = 1, 2, \cdots, N$ |
| DST3 | $d_n = (x_n - x_{N-n})\sin\frac{n\pi}{2N} + (x_n + x_{N-n})\cos\frac{n\pi}{2N}$ for $n = 0, 1, \cdots, N-1$ | R2C | $c_0 = 0$, $c_1 = \frac{1}{2}R_0$, $c_N = -\frac{1}{2}R_{N/2}$; $c_{2k} = -\frac{1}{2}(R_k + I_k)$, $c_{2k+1} = \frac{1}{2}(R_k - I_k)$ for $k = 1, 2, \cdots, \frac{N}{2} - 1$ |
| DCT1 | $R_0 = x_0$, $R_{N/2} = x_N$; $R_j = x_{2j}$, $I_j = x_{2j-1} - x_{2j+1}$ for $j = 1, 2, \cdots, \frac{N}{2} - 1$ | C2R | $c_n = \frac{1}{4}(d_n + d_{N-n}) - \frac{1}{8\sin\frac{n\pi}{N}}(d_n - d_{N-n})$ for $n = 1, 2, \cdots, N-1$; $c_0 = \frac{d_0}{2} + \beta$, $c_N = \frac{d_0}{2} - \beta$, with $\beta = \sum_{k=0}^{N/2-1} x_{2k+1}$ |
| DCT2 | $R_0 = x_0$, $R_{N/2} = x_{N-1}$; $R_k = \frac{x_{2k-1} + x_{2k}}{2}$, $I_k = -\frac{x_{2k-1} - x_{2k}}{2}$ for $k = 1, 2, \cdots, \frac{N}{2} - 1$ | C2R | $c_0 = d_0$; $c_n = \frac{(d_n + d_{N-n})\cos\frac{n\pi}{2N} + (d_n - d_{N-n})\sin\frac{n\pi}{2N}}{2}$ for $n = 1, 2, \cdots, N-1$ |
| DCT3 | $d_n = (x_n + x_{N-n})\sin\frac{n\pi}{2N} + (x_n - x_{N-n})\cos\frac{n\pi}{2N}$ for $n = 0, 1, \cdots, N-1$ | R2C | $c_0 = \frac{1}{2}R_0$, $c_{N-1} = \frac{1}{2}R_{N/2}$; $c_{2k} = \frac{1}{2}(R_k + I_k)$, $c_{2k-1} = \frac{1}{2}(R_k - I_k)$ for $k = 1, 2, \cdots, \frac{N}{2} - 1$ |

2) Algorithm: Algorithm 2 lists the algorithm of the three-dimensional in-place transform, where xkind, ykind and zkind are ENUM variables whose values are one of the eight transforms listed in Table I. In the case of the out-of-place transform, the implementation is similar, except that only the transpose is out-of-place.

Algorithm 2 Algorithm for three-dimensional in-place transform
Input: d_data defines the input array; Nx, Ny, Nz define the number of mesh points in each direction; xkind, ykind, zkind indicate the kind of transform in each direction
Output: d_data contains the transformed array
1: if xkind or ykind or zkind is DST3 or DCT3 or DFT_R2C then
2:     Create a CUFFT batched real to complex plan
3: else
4:     Create a CUFFT batched complex to real plan
5: end if
6: Do x direction transform according to value of xkind
7: Transpose d_data in x and y direction
8: Do y direction transform according to value of ykind
9: Transpose d_data in y and x direction
10: Transpose d_data in x and z direction
11: Do z direction transform according to value of zkind
12: Transpose d_data in z and x direction

Lines 1 to 5 of Algorithm 2 create the required batched plan. When the transform is in-place, we can share plans between different directions because only the data to be transformed is different. Line 6 performs the transform in the x direction, which consists of batched one-dimensional transforms. According to the value of xkind, the corresponding calculation in Table II is carried out. If xkind is DFT_R2C or DFT_C2R, that is, the boundary condition in this direction is periodic, the corresponding R2C or C2R calculation is executed without pre-processing or post-processing.

Then we transform the y direction in lines 7 to 9. Since the data in the y direction are not contiguous, the x and y directions need to be transposed first; then the transform can be carried out, and finally we transpose the data back. Lines 10 to 12 transform the z direction, which is similar to the computation in the y direction.

3) Implementation and Optimization: As can be seen from Table II, the complexity of pre- and post-processing for each transform is O(N). In some pre- and post-processing calculations, two elements can be calculated in pairs, which can be implemented efficiently on the GPU: two paired elements can be processed by the same thread without even using shared memory. We take the pre-processing for DCT1 as an example to illustrate the implementation of pre- and post-processing in cuHelmholtz.

For a three-dimensional array, each row needs to be pre-processed and post-processed. Because the direction to be transformed has been transposed to the x direction, the data to be processed are contiguous. cuHelmholtz adopts a simple task assignment strategy, in which each thread block is responsible for processing one row, and the number of thread blocks is determined by the sizes of the other two dimensions.

In each thread block, the pre-processing of DCT1 transforms $N + 1$ real numbers into $\frac{N}{2} + 1$ complex numbers. Let us assume that $N$ is even and the array subscript starts at 0. As shown in Figure 1, the pre-processing of DCT1 keeps the even indexed items untouched. It first loads the odd indexed items into shared memory, and then updates the odd indexed items of the original array according to the pre-processing formula of DCT1 (which is not shown in Figure 1). Since there is no data dependency, this step can be executed concurrently and efficiently. Next, we need to calculate $\beta$, which is a reduction of the odd indexed terms in the original array. The needed data has already been stored contiguously in shared memory. The reduction is shown in Figure 1; it guarantees high bandwidth for shared memory access and avoids bank conflicts. However, it requires the length of the array to be a power of 2. At present, cuHelmholtz first finds the smallest power of 2 (call it $M$) not less than the length of the array in shared memory, then expands the array to length $M$ by zero-padding, so as to ensure that the length of the reduction is a power of 2.

[Fig. 1: Pre-processing of DCT1. The figure shows the odd indexed items of one row loaded into the shared memory array sh_mem, followed by a tree reduction in Step 1, Step 2 and Step 3.]

The DFT (R2C or C2R) between pre-processing and post-processing is computed by calling functions in cuFFT. The transpose operation is done by cuTranspose [18]. We have also implemented the corresponding transpose operation ourselves, and its performance is comparable to that of cuTranspose. For the in-place transpose, our implementation is slightly better, but cuTranspose performs better for the out-of-place transpose. The reason we adopt cuTranspose is that lines 9 to 10 of Algorithm 2 can be optimized by cuTranspose. cuTranspose supports a transpose of three directions called rotation transpose, i.e. one kernel launch transposes three directions simultaneously, so that the two successive involution transposes in lines 9 to 10 ($x \leftrightarrow y$ and $x \leftrightarrow z$) can be simplified to a single rotation transpose ($x \to y \to z \to x$).

There are also some code optimizations, such as using sincos() or sincospi() (one function call) instead of sin() and cos() (two function calls), which reduces the calculation time of DST2, DST3, DCT2 and DCT3 (see Table II). Possible further optimizations include reducing the number of transposes (when solving the Helmholtz equation, we do not need to transpose the array back to its original state, which saves one more transpose), using warp shuffle instructions for the reduction, etc.

V. EVALUATION AND DISCUSSION

In the experiments, we compare cuHelmholtz with FISHPACK [19], which is widely used in research [2], [3], [20], [21]. FISHPACK is an efficient Fortran package for solving separable elliptic partial differential equations, and it includes a subroutine hw3crt for solving the three-dimensional Helmholtz equation.

A. Experimental Platform

Experiments are conducted on a platform equipped with one Intel i7 3770 CPU and one Nvidia Tesla K20 GPU. We use the compilers GCC and NVCC, the latter provided by the Nvidia CUDA Toolkit. The CUDA version used is CUDA 6.0.

We use GCC (version 4.8.4) with the -O2 flag to compile the CPU side source code and NVCC with the -O2 -arch sm_35 flags to compile the GPU side source code. For the comparison with FISHPACK, we compile FISHPACK using gfortran with the -O2 -fdefault-real-8 flags. We also implement a C wrapper for the FISHPACK function hw3crt in order to reuse the same test code. All the source code is available at https://github.com/rmingming/cuHelmholtz/.

B. Experimental Results

We have tested different functions with different boundary conditions. In this paper we only illustrate the tests for one function. Other testing examples can be found in the source code. We choose the testing function as

$$u = \frac{1}{3}\sin(2\pi x)\sin(2\pi y)\sin(2\pi z)$$
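For this testing function the right-hand side of (1) has a closed form, which is what allows the computed solution to be compared with an exact one. The short derivation below is an added worked step; it keeps $\lambda$ symbolic because the value used in the experiments is not stated in this section.

```latex
\begin{aligned}
\frac{\partial^2 u}{\partial x^2} &= -(2\pi)^2 u, \qquad
\frac{\partial^2 u}{\partial y^2} = -(2\pi)^2 u, \qquad
\frac{\partial^2 u}{\partial z^2} = -(2\pi)^2 u, \\
\Delta u &= -12\pi^2 u, \\
f = \Delta u + \lambda u &= \frac{\lambda - 12\pi^2}{3}\,
  \sin(2\pi x)\sin(2\pi y)\sin(2\pi z).
\end{aligned}
```

With this $f$, the maximum error is the largest absolute difference between the numerical $u_{i,j,k}$ and the testing function evaluated at the mesh points.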

TABLE III: Maximum error and order of error for in-place cuHelmholtz. N denotes an N × N × N grid. Case ijk means that the x boundary is case i, the y boundary is case j, and the z boundary is case k. See Section III-B.

N    Case 000                  Case 111                  Case 222                  Case 333                  Case 444
     Error        Order        Error        Order        Error        Order        Error        Order        Error        Order
32   0.001063976  2.002058067  0.001063976  2.002058067  0.001870815  1.992185600  0.001620738  1.991624427  0.001870815  1.992185600
64   0.000265615  2.000514253  0.000265615  2.000514253  0.000470244  1.998049270  0.000407544  1.997909148  0.000470244  1.998049270
128  0.000066380  2.000128544  0.000066380  2.000128544  0.000117720  1.999512479  0.000102034  1.999455602  0.000117720  1.999512479
256  0.000016594  2.000032195  0.000016594  2.000032195  0.000029440  1.999877725  0.000025518  1.999868509  0.000029440  1.999877725
512  0.000004148  -            0.000004148  -            0.000007361  -            0.000006380  -            0.000007361  -

N    Case 120                  Case 134                  Case 204                  Case 300                  Case 432
     Error        Order        Error        Order        Error        Order        Error        Order        Error        Order
32   0.001064916  2.002079421  0.000987213  1.994081380  0.001065451  2.002098837  0.000983521  1.993753685  0.001736370  1.991464813
64   0.000265846  2.000519573  0.000247818  2.000490251  0.000265975  2.000524409  0.000246947  2.000481201  0.000436668  1.998049039
128  0.000066437  2.000129874  0.000061933  2.000122570  0.000066470  2.000131084  0.000061716  2.000120305  0.000109315  1.999480812
256  0.000016608  2.000032490  0.000015482  2.000030659  0.000016616  2.000032752  0.000015428  2.000030140  0.000027339  1.999877334
512  0.000004152  -            0.000003870  -            0.000004154  -            0.000003857  -            0.000006835  -
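The Order column can be recomputed from the Error column using the formula given in the text, Order(N) = log2(Error(N) / Error(2N)). The following is a minimal Python sketch of that check for Case 000; the dictionary and function names are illustrative and are not part of cuHelmholtz. Because the tabulated errors are rounded, the recomputed orders are expected to match the table only to about three or four decimal places.

```python
import math

# Maximum errors for Case 000, copied from the Error column of Table III.
errors = {
    32: 0.001063976,
    64: 0.000265615,
    128: 0.000066380,
    256: 0.000016594,
    512: 0.000004148,
}


def order_of_error(errors, n):
    """Observed convergence order between the grids n and 2n."""
    return math.log2(errors[n] / errors[2 * n])


# The finest grid has no 2N partner, so it gets no order (a dash in the table).
orders = {n: order_of_error(errors, n) for n in errors if 2 * n in errors}

for n, p in sorted(orders.items()):
    print(f"N = {n:3d}: order = {p:.4f}")
```

An order close to 2 on every row is what a second-order central difference scheme should produce when the grid spacing is halved.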

TABLE IV: Maximum error and order of error for out-of-place cuHelmholtz. Case ijk has the same meaning as in Table III.

Nx × Ny × Nz        Case 120                  Case 134                  Case 204                  Case 300
                    Error        Order        Error        Order        Error        Order        Error        Order
20 × 30 × 40        0.001531321  1.994556899  0.001472749  2.002066505  0.002868435  1.965554046  0.002691936  1.966039070
40 × 60 × 80        0.000384277  2.000588875  0.000367660  1.998811576  0.000734437  1.993901092  0.000689014  1.993323537
80 × 120 × 160      0.000096030  2.000147216  0.000091991  2.000129094  0.000184387  1.998464307  0.000173052  1.998320055
160 × 240 × 320     0.000024005  2.000036814  0.000022996  2.000032261  0.000046146  1.999603956  0.000043314  1.999579326
320 × 480 × 640     0.000006001  -            0.000005749  -            0.000011540  -            0.000010832  -
128 × 1024 × 1024   0.000024499  -            0.000027532  -            0.000081145  -            0.000077959  -
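The last two rows of Table IV illustrate that more grid points do not necessarily give a smaller error when the grid is anisotropic. The sketch below evaluates the leading-order error scale Δx² + Δy² + Δz² for both grids, assuming a unit cube so that Δx = 1/Nx and so on, as in the text's own estimate; the function and variable names are hypothetical, and the constant hidden in the O(·) notation is ignored.

```python
def error_scale(nx, ny, nz):
    """Leading-order scale dx^2 + dy^2 + dz^2, assuming dx = 1/nx on a unit cube."""
    return 1.0 / nx**2 + 1.0 / ny**2 + 1.0 / nz**2


grids = {
    "320 x 480 x 640": (320, 480, 640),
    "128 x 1024 x 1024": (128, 1024, 1024),
}

for name, (nx, ny, nz) in grids.items():
    points = nx * ny * nz
    print(f"{name:>18}: {points:>11,d} points, error scale {error_scale(nx, ny, nz):.3e}")

# Ratio of the two error scales, coarse-in-x grid over the balanced grid.
ratio = error_scale(128, 1024, 1024) / error_scale(320, 480, 640)
print(f"ratio of error scales: {ratio:.2f}")
```

The coarse x direction of the 128 × 1024 × 1024 grid dominates its error scale, which is consistent with that row showing errors roughly four times larger than the 320 × 480 × 640 row in Table IV.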

and let the solution domain be $0 \le x \le 1$, $0 \le y \le 1$, $0 \le z \le 1$, and let $\lambda = -1$, so the Helmholtz equation becomes

$\Delta u - u = -\frac{1}{3}(1 + 12\pi^2)\sin(2\pi x)\sin(2\pi y)\sin(2\pi z).$

1) Accuracy: In this section, we illustrate that our implementation gives an accurate solution. Since the central difference method we adopted is a second-order method, the discretization error is $O(\Delta x^2 + \Delta y^2 + \Delta z^2)$. For our testing function, if we choose N = Nx = Ny = Nz (in this case the in-place implementation of cuHelmholtz is used; otherwise the out-of-place version is used), we have $\Delta = \Delta x = \Delta y = \Delta z$ and the discretization error is $O(\Delta^2)$. Letting N = 32, 64, 128, 256, 512 in turn, Table III lists the maximum error (over all grid points) and the order of the maximum error for the in-place cuHelmholtz. Table IV lists the corresponding results for the out-of-place version. The results show that the solution is accurate and that the maximum error is indeed second order.

The order of error is computed through the following formula:

$\text{Order of error}(N) = \log_2 \frac{\text{Error}(N)}{\text{Error}(2N)}$

From the underlined results in Table IV (the rows for 320 × 480 × 640 and 128 × 1024 × 1024), we can see that although case 128 × 1024 × 1024 uses more grid points than case 320 × 480 × 640, its maximum error is not as small as that of the latter. This is because the maximum error is $O(\Delta x^2 + \Delta y^2 + \Delta z^2)$: for case 128 × 1024 × 1024 it is $O(\frac{1}{128^2} + \frac{1}{1024^2} + \frac{1}{1024^2})$, which is greater than $O(\frac{1}{320^2} + \frac{1}{480^2} + \frac{1}{640^2})$.

2) Performance: Figure 2 shows the speedup of the in-place version of cuHelmholtz over FISHPACK. When Nx = Ny = Nz, cuHelmholtz performs the computation in place, so no additional GPU memory is required. It can be seen that when N is large, the speedups are between 10 and 20 (non-periodic boundary cases) or between 20 and 30 (periodic boundary cases).

Figure 3 shows the speedup of the out-of-place version of cuHelmholtz over FISHPACK. If Nx, Ny, Nz are not all equal, cuHelmholtz performs the computation out of place; in this case additional GPU memory is required and must be allocated in advance. It can be seen that when N is large, the speedups are almost the same as for the in-place version.

We list the run times of each step for different kinds of DSTs and DCTs in Table V. For each in-place transform, pre- and post-processing need to be executed three times (once per axis), and so does the CUFFT R2C/CUFFT C2R step. The involution transposes x ↔ y and x ↔ z are executed once. The rotation transpose x → y → z → x is also executed once. For the out-of-place transform, everything is the same except for the transposes: the involution transposes x ↔ y and x ↔ z are not needed, but the rotation transpose x → y → z → x needs to be executed three times. The grid size we use is 512 × 512 × 512, so pre-processing, post-processing and transpose each need to read and write about 2GB of data (512 × 512 ×

[Figure: speedup (vertical axis, 0 to 30) versus Nx (horizontal axis, 24 to 536, with Ny = Nz = Nx); series: Case 000, Case 111, Case 222, Case 333, Case 444.]

Fig. 2: Speedup of in-place cuHelmholtz vs. FISHPACK. The grid size we use is Nx × Ny × Nz. Nx is the value on the horizontal axis, and Ny = Nz = Nx.

[Figure: speedup (vertical axis, 0 to 30) versus Nx (horizontal axis, 20 to 360, with Ny = 1.6 × Nx and Nz = 2 × Nx); series: Case 000, Case 111, Case 222, Case 333, Case 444.]

Fig. 3: Speedup of out-of-place cuHelmholtz vs. FISHPACK. The grid size we use is Nx × Ny × Nz. Nx is the value on the horizontal axis, and Ny and Nz are computed as Ny = 1.6 × Nx and Nz = 2 × Nx respectively.

512 × 8 (double) × 2 (read and write)). The device-to-device bandwidth of the GPU we use is about 140GB/s; that is to say, we need at least 2/140 second (about 14 milliseconds) to complete each of these steps. From Table V we see that the run time of pre-processing or post-processing is close to this limit, and the run time of transpose is comparable to this limit.

TABLE V: The run times (in milliseconds) of each step for six kinds of DSTs and DCTs.

512 × 512 × 512               DCT1   DCT2   DCT3   DST1   DST2   DST3
pre-processing                14.5   14.5   16.7   14.5   14.4   16.6
post-processing               16.7   16.2   15.4   16.2   16.3   16.3
CUFFT R2C                     -      -      44.2   -      -      44.2
CUFFT C2R                     46.5   46.6   -      46.6   46.6   -
x ↔ y                         23.2   23.1   23.1   23.2   23.2   23.2
x ↔ z                         22.5   22.5   22.5   22.5   22.5   22.4
x → y → z → x (in-place)      31.1   31.1   31.1   31.1   31.1   31.1
x → y → z → x (out-of-place)  21.1   21.1   21.1   21.1   21.1   21.1

Although the out-of-place cuHelmholtz requires additional GPU memory allocation, it performs transposes faster than the in-place version. As can be seen in Table V, the rotation transpose is much faster in the out-of-place transform. As a result, in many cases, the out-of-place cuHelmholtz is slightly faster. The disadvantage of the out-of-place version is that it takes up twice as much GPU memory, so the default configuration for cuHelmholtz is to use the in-place version if Nx = Ny = Nz.

We list the run times of in-place cuHelmholtz and FISHPACK in Table VI. It can be seen that the Fourier based method is very efficient for solving the Helmholtz equation. For the problem with a grid size of 512^3, it takes only 0.5 to 0.7 seconds to solve on the GPU (compared to more than 10 seconds on the CPU). For problems requiring hundreds of thousands of iterations, this saves a lot of time.

C. Discussion and future work

In the case of periodic boundary conditions, there are still many possibilities for optimization. In fact, if all three directions have periodic boundary conditions, the three-dimensional FFT function provided by cuFFT can be invoked directly. If there are periodic boundary conditions in two directions, the batched two-dimensional transform provided by cuFFT can also be used. These should be more efficient than the batched one-dimensional transform used in the current implementation.

For large-scale problems which are limited by the memory size of a single GPU device, we need to use multiple GPUs to solve them. Currently, cuFFT provides partial support for multi-GPU FFT computation. This is one of the further research directions.
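The timing arguments made around Table V can be re-derived with a few lines of arithmetic. The sketch below is a rough estimate under stated assumptions: "GB" is read as 2^30 bytes, the step counts follow the description in the text (three pre-processing, post-processing and FFT passes; one x ↔ y, one x ↔ z and one rotation transpose in place; three rotation transposes out of place), and kernel launch and other overheads are ignored. The variable names are illustrative only, and the totals are sums of the tabulated step times for the DCT1 column, not measured end-to-end run times.

```python
GIB = 2**30  # assumption: "GB" in the text is read as 2^30 bytes

# Memory traffic of one pass over a 512^3 array of doubles (one read, one write).
n = 512
bytes_moved = n**3 * 8 * 2
bandwidth = 140 * GIB  # about 140 GB/s device-to-device, as quoted in the text
t_min_ms = bytes_moved / bandwidth * 1e3
print(f"data moved: {bytes_moved / GIB:.1f} GB, lower bound: {t_min_ms:.1f} ms")

# Step times in milliseconds, DCT1 column of Table V.
pre, post, fft = 14.5, 16.7, 46.5
xy, xz = 23.2, 22.5
rot_in, rot_out = 31.1, 21.1

per_axis = pre + post + fft  # executed once for each of the three axes
in_place_ms = 3 * per_axis + xy + xz + rot_in
out_of_place_ms = 3 * per_axis + 3 * rot_out
print(f"in-place DCT1 estimate:     {in_place_ms:.1f} ms")
print(f"out-of-place DCT1 estimate: {out_of_place_ms:.1f} ms")
```

Under these assumptions the out-of-place estimate comes out a little lower than the in-place one, which agrees with the observation that the out-of-place version is slightly faster in many cases.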

TABLE VI: Run times (in seconds) of in-place cuHelmholtz and FISHPACK. N denotes an N × N × N grid. Case ijk has the same meaning as in Table III.

N    Case 000                  Case 111                  Case 222                  Case 333                  Case 444
     FISHPACK     cuHelmholtz  FISHPACK     cuHelmholtz  FISHPACK     cuHelmholtz  FISHPACK     cuHelmholtz  FISHPACK     cuHelmholtz
32   0.001506     0.000954     0.00289      0.00113      0.004246     0.001153     0.004241     0.001272     0.003532     0.00112
64   0.011747     0.002726     0.013213     0.003407     0.013169     0.003558     0.012932     0.003743     0.012548     0.003392
128  0.108342     0.00985      0.115394     0.013529     0.113257     0.013832     0.110791     0.014432     0.107593     0.013367
256  0.958591     0.064529     0.975956     0.090115     0.927669     0.091725     0.906317     0.093427     0.901174     0.090732
512  13.707784    0.490883     11.612114    0.697392     11.111694    0.69984      10.945038    0.704277     10.803941    0.694797
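The speedups quoted in the text can be recomputed directly from Table VI as the ratio of the FISHPACK time to the cuHelmholtz time. A minimal sketch for the N = 512 row follows; the dictionary name is illustrative only.

```python
# (FISHPACK seconds, cuHelmholtz seconds) for N = 512, copied from Table VI.
times_512 = {
    "000": (13.707784, 0.490883),
    "111": (11.612114, 0.697392),
    "222": (11.111694, 0.69984),
    "333": (10.945038, 0.704277),
    "444": (10.803941, 0.694797),
}

# Speedup = CPU time / GPU time for each boundary-condition case.
speedups = {case: cpu / gpu for case, (cpu, gpu) in times_512.items()}

for case, s in speedups.items():
    print(f"Case {case}: speedup = {s:.1f}x")
```

At this grid size Case 000 falls in the 20 to 30 band and the other four cases fall in the 10 to 20 band, in line with the ranges reported for Figure 2.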

VI. CONCLUSIONS

In this paper, we have designed and implemented a library for solving the three-dimensional Helmholtz equation with various boundary conditions. Since we use the Fourier based method to solve Helmholtz equations with Dirichlet and Neumann boundary conditions, we have also designed and implemented algorithms for computing different kinds of three-dimensional DST and DCT on GPU. The experimental results have verified the correctness of the developed library. Compared with FISHPACK, another software package that can solve Helmholtz equations with various boundary conditions, we achieve a 10 to 30 times speedup.

ACKNOWLEDGMENT

This work is partially supported by NSF of China (61373018, 61602266) and Science and Technology Development Plan of Tianjin (17JCYBJC15300, 16JCYBJC41900).

REFERENCES

[1] Jia Zhao and Qi Wang. Three-dimensional numerical simulations of biofilm dynamics with quorum sensing in a flow cell. Bulletin of Mathematical Biology, 79(4):884–919, Apr 2017.
[2] J. Rafael Pacheco, Kang Ping Chen, Arturo Pacheco-Vega, Baisong Chen, and Mark A. Hayes. Chaotic mixing enhancement in electro-osmotic flows by random period modulation. Physics Letters A, 372(7):1001–1008, 2008.
[3] Adriani Giulia, De Tullio Marco Donato, Ferrari Mauro, Hussain Fazle, Pascazio Giuseppe, Liu Xuewu, and Decuzzi Paolo. The preferential targeting of the diseased microvasculature by disk-like particles. Biomaterials, 33(22):5504–5513, 2012.
[4] Chen Chen, Shuyu Hou, Dacheng Ren, Mingming Ren, and Wang Qi. 3-d spatio–temporal structures of biofilms in a water channel. Mathematical Methods in the Applied Sciences, 38(18):4461–4478, 2016.
[5] E. N. Houstis and T. S. Papatheodorou. High-order fast elliptic equation solvers. ACM Transactions on Mathematical Software, 5(4):431–441, 1979.
[6] Ronald F. Boisvert. A fourth-order-accurate fourier method for the helmholtz equation in three dimensions. ACM Transactions on Mathematical Software, 13(3):221–234, 1985.
[7] E. Braverman, M. Israeli, and A. Averbuch. A fast spectral solver for a 3d helmholtz equation. SIAM Journal on Scientific Computing, 20(6):2237–2260, 2006.
[8] R. N. Bracewell. The fourier transform and its applications. Electronics & Power, 11(10):357, 2009.
[9] William Press, Saul Teukolsky, William Vetterling, and Brian Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[10] M Frigo and S. G Johnson. The design and implementation of fftw3. Proceedings of the IEEE, 93(2):216–231, 2005.
[11] Stephen A. Martucci. Symmetric convolution and the discrete sine and cosine transforms. IEEE Transactions on Signal Processing, 42(5):1038–1051, 1994.
[12] Y L Chan and W C Siu. Variable temporal-length 3-d discrete cosine transform coding. IEEE Transactions on Image Processing, 6(5):758–763, 1997.
[13] T. H. Lai and Guan Ling. Video coding algorithm using 3-d dct and vector quantization. In International Conference on Image Processing, 2002.
[14] Jing Wu, Joseph Jaja, and Elias Balaras. An optimized fft-based direct poisson solver on cuda gpus. IEEE Transactions on Parallel & Distributed Systems, 25(3):550–559, 2014.
[15] Jing Wu and Jaja Joseph. Optimized fft computations on heterogeneous platforms with application to the poisson equation. Journal of Parallel & Distributed Computing, 74(8):2745–2756, 2014.
[16] Shivang Ghetia, Nagendra Gajjar, and Ruchi Gajjar. Implementation of 2-d discrete cosine transform algorithm on gpu. International Journal of Advanced Research in Electrical Electronics & Instrumentation Engineering, 2(7):1–35(35), 2013.
[17] Garry Rodrigue. Parallel Computations. Academic Press, 1982.
[18] Jose L. Jodra, Ibai Gurrutxaga, and Javier Muguerza. Efficient 3d transpositions in graphics processing units. International Journal of Parallel Programming, 43(5):876–891, Oct 2015.
[19] Paul N. Swarztrauber. A direct method for the discrete solution of separable elliptic equations. SIAM Journal on Numerical Analysis, 11(6):1136–1150, 1974.
[20] Tullio M. D De, A Cristallo, E Balaras, and R Verzicco. Direct numerical simulation of the pulsatile flow through an aortic bileaflet mechanical heart valve. Journal of Fluid Mechanics, 622(622):259–290, 2009.
[21] Ludovic Métivier, Aude Allain, Romain Brossier, Quentin Mérigot, Edouard Oudet, and Jean Virieux. Optimal transport for mitigating cycle skipping in full waveform inversion: a graph space transform approach. Geophysics, pages 1–84, 2018.
