Discrete Sine and Cosine Transform and Helmholtz Equation Solver on GPU

2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

978-0-7381-3199-3/20/$31.00 ©2020 IEEE. DOI 10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00034

Mingming Ren, Yuyang Gao∗, Gang Wang, Xiaoguang Liu
TJ Key Lab of NDST, College of Computer Science, Nankai University, Tianjin, China
{renmingming, gaoyy, wgzwp, liuxg}@nbjl.nankai.edu.cn

Abstract: The Helmholtz equation is a special kind of elliptic partial differential equation. Solving the Helmholtz equation is often needed in scientific and engineering problems. An efficient approach to solving it is to use the Fast Fourier Transform (FFT). In practice, boundary conditions must be considered, and several discrete Fourier-related transforms such as the Discrete Sine and Cosine Transforms (DST, DCT) are needed for solving problems with different boundary conditions.

With the development of Compute Unified Device Architecture (CUDA) technology in recent years, many researchers use Graphics Processing Units (GPUs) to accelerate their programs. In view of the importance of FFT, NVIDIA officially provides the library cuFFT, but it does not include the calculation of DST and DCT, which is inconvenient for users.

In this paper, we present cuHelmholtz, which is designed and implemented based on CUDA. It can be used to compute several kinds of three-dimensional DST and DCT, and to solve three-dimensional Helmholtz equations with various boundary conditions. The experimental results show that, compared with FISHPACK, a FORTRAN software package which can be used to solve the three-dimensional Helmholtz equation, we achieve about 10× to 30× speedup.

Index Terms: Helmholtz equation, CUDA, GPU, Discrete Sine Transform, Discrete Cosine Transform

I. INTRODUCTION

The Helmholtz equation is a special kind of elliptic partial differential equation. In the process of solving many mathematical and physical problems (such as the wave equation, potential distribution in electrostatics and incompressible flow problems [1], [2], etc.) one may need to solve the Helmholtz equation (or its special cases, such as the Laplace equation and the Poisson equation). Furthermore, many problems need to be solved iteratively, and in each iteration one may need to solve a Helmholtz equation [3], [4]. Therefore, it is helpful to solve the Helmholtz equation as quickly as possible.

Fourier-related transforms have important applications in solving partial differential equations [5]–[7]. Many partial differential equations, including Helmholtz equations with constant coefficients, can be solved efficiently by Fourier-related transforms.

Fourier-related transforms have also been widely used in many other fields [8]. According to whether the signal is continuous or discrete in time, and periodic or non-periodic, they can be divided into four categories: continuous-time Fourier transform, discrete-time Fourier transform, continuous-time Fourier series and discrete-time Fourier series. Thanks to the rapid development of digital technology, especially computer hardware and software technology, there are quite a lot of resources available for discrete signal processing. For example, the discrete-time Fourier series of periodic signals (i.e. the Discrete Fourier Transform, DFT [9]) is usually computed by the Fast Fourier Transform (FFT). Many well-known software packages such as MATLAB, Intel MKL and FFTW [10] provide corresponding implementations.

Fourier-based methods for solving the Helmholtz equation can handle not only periodic boundary conditions, but also other kinds of boundary conditions, such as Dirichlet or Neumann boundary conditions. By using the finite difference method, the solution can be obtained through variants of the DFT, such as the Discrete Sine Transform (DST) and the Discrete Cosine Transform (DCT) [9], [11]. Different variants of the DFT correspond to different homogeneous boundary conditions. Furthermore, non-homogeneous boundary conditions can be transformed into homogeneous boundary conditions. It is worth mentioning that the type 2 DCT is widely used in data compression, image processing and video coding [12], [13], and it is called "the" DCT in these fields.

Since NVIDIA Corporation launched Compute Unified Device Architecture (CUDA), Graphics Processing Units (GPUs) have been playing an increasingly important role in scientific and engineering computing over the past decade. Many studies have used GPUs to optimize the computation process in order to reduce the run time. NVIDIA also provides a library named cuFFT for computing FFTs. However, the library does not contain the calculation of DST and DCT, which makes it inconvenient to solve the Helmholtz equation with non-periodic boundary conditions on GPU.

To the best of the authors' knowledge, there is no literature that covers all kinds of common DST and DCT on GPU, nor the solution of the three-dimensional Helmholtz equation with several common boundary conditions on GPU. Here we introduce some related works. Wu et al. [14], [15] computed the solution of the three-dimensional Poisson equation on a GPU platform for one Neumann boundary condition (the other two are still periodic boundary conditions). In [16], Ghetia et al. computed the type 2 DCT on a GPU platform in the two-dimensional case. Their method requires extra GPU memory and is not suitable for the calculation of large-scale three-dimensional transforms.

In this paper we design and implement algorithms for computing several kinds of three-dimensional DST and DCT on GPU, and use them to solve three-dimensional Helmholtz equations with various boundary conditions. The code library we implemented is named cuHelmholtz. The paper is organized as follows. We briefly introduce some basics of CUDA and cuFFT in Section II. Section III describes how to solve the Helmholtz equation by the finite difference method and the Fourier transform method. Section IV describes the design and implementation of cuHelmholtz. In Section V, we test and discuss the accuracy and performance of cuHelmholtz. Section VI is a brief summary.

II. CUDA PROGRAMMING MODEL AND CUFFT

A. CUDA Programming Model

CUDA makes GPUs easy to use in general scientific and engineering computation. In the CUDA programming model, the CPU is the host, and the GPU is the co-processor or device. If we launch a kernel function on the GPU (device) side, it will be executed by a large number of GPU threads, which are organized into GPU thread blocks. The thread blocks form a grid. The threads are scheduled by the GPU thread scheduler to run on Streaming Multiprocessors (SMs) in units of warps, which are groups of 32 consecutive threads.

Host and device have separate memory spaces. GPU threads can access data from different types of device memory (including global memory, shared memory, registers, constant memory, etc.), of which global memory is the largest and slowest. Global memory is usually used to store intermediate results, which are generated by a previous kernel launch and consumed by subsequent kernel calls. Shared memory is an on-chip memory that is shared between threads of the same thread block. Registers are also on-chip memory and are allocated per thread. Constant memory can only store constant values, and the values in constant memory are cached.

Shared memory is often used in optimization for memory-constrained CUDA applications. In the current hardware specification, shared memory usually has 32 banks. When accessing shared memory, bank conflicts need to be considered. A bank conflict occurs if two threads in the same warp access different addresses in the same bank; the shared memory access requests then have to be performed serially, which degrades performance.

[...] plan, which corresponds to different data layouts and batched executions. Next, we can call execution functions such as cufftExecR2C() to do the corresponding transformations (the data layout was specified when creating the plan). The plan can be used to transform different data several times. When the plan is no longer needed, the function cufftDestroy() can be called to release the relevant resources.

III. SOLUTION OF HELMHOLTZ EQUATIONS

The three-dimensional Helmholtz equation in Cartesian coordinates is

$$\Delta u(x, y, z) + \lambda u(x, y, z) = f(x, y, z) \tag{1}$$

where $u$ is an unknown function, $\lambda$ is a constant, the source term $f$ is a known function, and $\Delta$ is the Laplace operator $\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}$. The functions $u$ and $f$ are defined on a rectangular area $x_s \le x \le x_f$, $y_s \le y \le y_f$, $z_s \le z \le z_f$. $\lambda$ is generally less than or equal to 0. For $\lambda > 0$, and for $\lambda = 0$ in some cases, the Helmholtz equation may not have a solution. If $\lambda = 0$, it is called the Poisson equation, and if further $f$ equals 0, it is Laplace's equation.

In order to numerically solve the Helmholtz equation, we use the finite difference method. We put a mesh in the solution area. The mesh has $N_x, N_y, N_z$ mesh points in the $x, y, z$ directions respectively. Our purpose is to find the value of $u$ at the mesh points, that is, we use the values of the function at discrete mesh points to represent the function. The denser the mesh (by increasing $N_x, N_y, N_z$), the more accurate the solution we can usually get. We denote the function value $u(x, y, z)$ at mesh point $(x_s + i\Delta x, y_s + j\Delta y, z_s + k\Delta z)$ as $u_{i,j,k}$, where $\Delta x, \Delta y, \Delta z$ are the grid widths, i.e. $\Delta x = \frac{x_f - x_s}{N_x}$, $\Delta y = \frac{y_f - y_s}{N_y}$, $\Delta z = \frac{z_f - z_s}{N_z}$. The discretized Helmholtz equation can be obtained by using the 7-point central difference scheme as follows:

$$\frac{u_{i+1,j,k} - 2u_{i,j,k} + u_{i-1,j,k}}{(\Delta x)^2} + \frac{u_{i,j+1,k} - 2u_{i,j,k} + u_{i,j-1,k}}{(\Delta y)^2} + \frac{u_{i,j,k+1} - 2u_{i,j,k} + u_{i,j,k-1}}{(\Delta z)^2} + \lambda u_{i,j,k} = f_{i,j,k}$$
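The 7-point central difference scheme can be checked numerically with a minimal sketch. The code below is not part of cuHelmholtz; the function names `discrete_helmholtz` and `max_residual`, the test function $u = \sin(\pi x)\sin(\pi y)\sin(\pi z)$ on the unit cube, the choice $\lambda = -1$, and the convention that the index runs from 0 to N with both boundary points stored are all assumptions made for illustration. It evaluates the left-hand side of the discretized equation at interior mesh points and compares it with the exact source term $f = (\lambda - 3\pi^2)u$.

```python
import math


def discrete_helmholtz(u, i, j, k, dx, dy, dz, lam):
    """Left-hand side of the 7-point scheme at interior mesh point (i, j, k)."""
    d2x = (u[i + 1][j][k] - 2.0 * u[i][j][k] + u[i - 1][j][k]) / dx ** 2
    d2y = (u[i][j + 1][k] - 2.0 * u[i][j][k] + u[i][j - 1][k]) / dy ** 2
    d2z = (u[i][j][k + 1] - 2.0 * u[i][j][k] + u[i][j][k - 1]) / dz ** 2
    return d2x + d2y + d2z + lam * u[i][j][k]


def max_residual(n, lam=-1.0):
    """Largest |discrete LHS - f| over interior points of the unit cube.

    Assumption: x_s = y_s = z_s = 0 and x_f = y_f = z_f = 1, with n
    intervals per direction, so the grid width is h = 1 / n.
    """
    h = 1.0 / n

    def exact(x, y, z):
        # Hypothetical smooth test solution; its Laplacian is -3*pi^2*u.
        return math.sin(math.pi * x) * math.sin(math.pi * y) * math.sin(math.pi * z)

    # Sample u at all mesh points, including the boundary (indices 0..n).
    u = [[[exact(i * h, j * h, k * h) for k in range(n + 1)]
          for j in range(n + 1)]
         for i in range(n + 1)]

    worst = 0.0
    for i in range(1, n):
        for j in range(1, n):
            for k in range(1, n):
                f = (lam - 3.0 * math.pi ** 2) * u[i][j][k]  # exact source term
                lhs = discrete_helmholtz(u, i, j, k, h, h, h, lam)
                worst = max(worst, abs(lhs - f))
    return worst


if __name__ == "__main__":
    r8, r16 = max_residual(8), max_residual(16)
    # Halving the grid width should shrink the residual by roughly a factor of 4.
    print(r8, r16, r8 / r16)
```

Halving the grid width reduces the residual by a factor close to 4, which matches the second-order accuracy of central differences and illustrates why a denser mesh usually gives a more accurate solution.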
