Journal of Computational Science and Technology, Vol. 5, No. 3, 2011

Implementation of GPU-FFT into Planewave Based First Principles Calculation Method∗

∗ Received 30 May, 2011 (No. 11-0305). [DOI: 10.1299/jcst.5.89] Copyright © 2011 by JSME

Hidekazu TOMONO∗∗, Masaru AOKI∗∗∗,∗∗, Toshiaki IITAKA† and Kazuo TSUMURAYA∗∗
∗∗ Department of Mechanical Engineering Informatics, School of Science and Technology, Meiji University, 1–1–1 Higashimita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan
   E-mail: [email protected] (KT)
∗∗∗ School of Management, Shizuoka Sangyo University, 1572-1 Ohwara, Iwata, Shizuoka 438-0043, Japan
† RIKEN (The Institute of Physical and Chemical Research), 2-1 Hirosawa, Wako, Saitama 351-0198, Japan

Abstract
We present an implementation of a GPU based FFT (Fast Fourier Transform) routine into a CPU based ab initio periodic DFT (Density Functional Theory) calculation code. The FFT calculation is the most time-consuming part of the CPU based DFT codes; for the 128 silicon system, the CPU FFT calculation accounts for a fraction 0.64 of the whole periodic DFT calculation time. Replacing the double precision FFT in the periodic PWscf code with a single precision FFT gives no appreciable differences in either the numerical total energies or the interatomic forces, which justifies the use of the single precision GPU based FFT library, CUFFT, in the code. The use of the CUFFT reduces the FFT fraction to 0.20 of the whole PWscf calculation; the replacement speeds up the code by a factor of 2.2 on a single CPU system. The use of a multi-CPU system together with the GPU FFT accelerates the calculation by 2.2 f, where f is the acceleration factor of the multi-CPU system. The single precision GPU calculation is implementable in any self-consistent electronic structure code, except for the eigensolver part of the DFT codes.

Key words : GPU, First Principles Calculation, DFT, Planewaves, GPGPU, CUFFT

1. Introduction

First principles or ab initio electronic structure calculation methods are a robust technique for calculating and predicting the properties of materials; the methods enable us not only to predict solid state properties but also to design new materials. The methods, however, require a high computational cost. There have been two approaches to accelerating the computation: one is an improvement of the hardware system and the other is an improvement of the software system. Devices with a higher carrier mobility than silicon, such as GaAs compound devices, have been tried as CPU (Central Processing Unit) devices to accelerate the calculations. Since the GaAs compound is a two-component system, it has been difficult to manufacture defect-free GaAs devices, and at present no commercial computers using GaAs devices can be found. Instead of the two-component CPU devices, silicon devices with ever finer wiring pitches have been manufactured to accelerate the computations. The decrease of the wiring pitch, however, has saturated owing to the limit set by the short wavelengths of light available for lithography. These circumstances have led to a stagnation of the acceleration attainable with single CPU's. On one hand, the MPI (Message Passing Interface) system has enabled us to accelerate the calculations and overcome this stagnation; this is parallel computing. On the other hand, the GPU devices, which have been used for fast output of data to graphic displays, have begun to be used for scientific numerical calculations.


This is another approach to overcoming the stagnation: the GPGPU (General Purpose computation on Graphics Processing Units). The application of the GPGPU to the planewave based first principles electronic structure codes is essential for accelerating them. Such a code solves the Kohn-Sham equation, a partial differential equation,

    H_{KS} \Psi_i = \epsilon_i \Psi_i .                                            (1)

Here Ψ_i is the wavefunction of eigenstate i and ε_i is its eigenvalue. H_KS is the Kohn-Sham Hamiltonian, which is given by

    H_{KS} = -\frac{\nabla^2}{2} + V_{ion}(r) + V_{H}(r) + V_{xc}(r)                (2)

in Hartree energy (E_h) and Bohr radius (a_0) units. The first term is the kinetic energy operator of the electrons, the second the electron-atom potential, the third the Hartree potential, and the fourth the exchange and correlation (XC) potential of the electrons. The XC potential is given by

    V_{xc} = \frac{\delta E_{xc}[\rho]}{\delta \rho(r)} ,                           (3)

where ρ is the charge density. The planewave based DFT methods expand the wavefunctions into planewaves to solve the equation. The following iterative process is needed to reach the self-consistent solution of the Kohn-Sham equation. The charge density ρ calculated from the wavefunctions Ψ, the solutions of Eq. (1), allows us to calculate the Hartree potential and the XC potential. Substituting for the old potential in Eq. (1) the new potential, which is the sum of the Hartree potential, the XC potential, and the pseudopotentials, enables us to solve the equation again, which in turn gives yet another new potential. This is an improvement process of the potential and constitutes the self-consistent cycle of the Kohn-Sham equation. Since the XC potential is defined in real space, we need to transform the charge density from spectral space into real space to evaluate the XC potential, and then transform back to spectral space. These transformations use fast Fourier transform (FFT) routines. To obtain the wavefunctions Ψ, the eigenvectors of the equation, we can use an iterative procedure suited to the large rectangular Hamiltonian matrices; this is the CP (Car-Parrinello) method(1). The planewave based self-consistent process calls the FFT and inverse FFT routines many times before reaching the self-consistent solution, so the FFT calculations take a large fraction of the total time of a first principles calculation. As will be shown later, the FFT takes a fraction 0.65 of the total calculation time for the system with two silicon atoms in a rhombohedral unitcell, and a fraction 0.64 for the system with 128 silicon atoms. The implementation of GPU FFT routines in place of the CPU FFT routines is therefore expected to reduce the total computation time of the self-consistent process. In this paper, we accelerate the code by using the GPU FFT instead of the CPU FFT. An earlier, primitive version of this work has been presented elsewhere(2). While Goedecker's group accelerated their realspace BIGDFT code with a GPU based transformation routine and the Intel compiler in 2009(3), there has been no report on the implementation of a GPU based FFT in a reciprocal space, periodic DFT code.

2. GPU and GPGPU

2.1. Graphics Processing Unit: GPU
The GPU's are specialized devices for rendering and accelerating graphics operations and are commercial, mass-market products for general consumers, including gamers. In a broad sense, a GPU is the graphics card itself; in a narrow sense, GPU means the processor on the graphics card. In this paper, we use GPU in the former, broad sense. The GPU has the following features. First, the GPU is a processor system for floating point operations. Almost all GPU's are designed for four-byte floating point operations, which are faster in computation than eight-byte operations.


Second, the GPU's are designed as Single Instruction Multiple Data (SIMD) systems; we can apply a single instruction stream to multiple data streams to perform operations that are naturally parallelized. The GPU's have advantages over the CPU in primitive operations for graphics display output, such as matrix multiplications and trigonometric functions, and are able to draw directly to the screen much faster than the CPU. Third, the GPU's incorporate many processor cores; for instance, the GeForce GTX 285 contains 240 processor cores, which are used for special mathematical operations such as trigonometric functions and matrix operations. Fourth, the GPU's have three types of memories, namely local, shared, and global memories, and each of the microchips uses all of these types. The microchips have a high memory bandwidth: while the bandwidth of a CPU system is from 2.0 to 25.6 GB/s, that of a GPU system ranges from 10.0 GB/s to over 177 GB/s. These four features enable us to extend the original use of the GPU's to general computational science and engineering. This application is the GPGPU.

2.2. GPU as a heterogeneous multicore
The processing of computers has been accelerated by the development of the CPU in the following sequence: scalar, vector, parallel, and multicore parallel processing. In the past, a single computer contained a single processor; to accelerate the computations, the CPU changed from the scalar CPU to the vector CPU and then to the parallel CPU. After that, more than one processor was integrated into a single computer; these are the multicore processors. These processors have enabled us to accelerate numerical calculations. This development has been forced by the stagnation of the development of high speed devices with high carrier mobilities.
The next stage in the acceleration of the computations has been heterogeneous multicore processing. One example is the GRAPE system, which uses dedicated hardware to accelerate the computation of the gravitational interaction, the part that takes a large fraction of the total computation time(4).

2.3. GPGPU
Since the GPGPU is the use of GPU's for numerical calculations rather than for computer graphics(5), it is another form of heterogeneous multicore processing. The GPGPU allows us to accelerate numerical calculations beyond any previous heterogeneous multicore system or multicore processing, especially for matrix multiplications, FFT's, N-body problems(6), and vortex methods(7). Matsuoka et al. deployed 680 GPU's in a multi-CPU system, Tokyo Tech's TSUBAME 1.2 supercomputer, and demonstrated the GPGPU acceleration(8).
The GPGPU has also been applied to realspace electronic structure analysis, in which the two-electron integrals have been accelerated(9). Ufimtsev et al. have also accelerated realspace electronic structure analyses(10)–(12). All of these earlier applications to DFT codes are realspace methods.

2.4. CUDA: Compute Unified Device Architecture
The CUDA is an integrated development environment for the GPU's. In the early days of GPU programming, programmers used assembly languages to access the GPU from the main functions. In 2007, NVIDIA Co. released the CUDA(13), (14). The CUDA permits us to use the standard C language to access the GPU from the main functions. The CUDA also provides standard numerical libraries for the FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subroutines).
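For reference, the following minimal sketch shows how a single precision 3-D complex-to-complex transform is planned and executed with the CUFFT library of the CUDA. The mesh size and variable names are illustrative choices of ours, the input data are left uninitialized, and the host-device data transfer is omitted here; it is treated in Sec. 4.

    /* Minimal sketch of a single precision 3-D FFT with the CUFFT library.
       The mesh size and names are illustrative; data transfer is omitted. */
    #include <cufft.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        const int n = 64;                        /* a 64 x 64 x 64 mesh, for illustration */
        cufftComplex *d_data;                    /* single precision complex data on the GPU */
        cudaMalloc((void **)&d_data, sizeof(cufftComplex) * n * n * n);

        cufftHandle plan;
        cufftPlan3d(&plan, n, n, n, CUFFT_C2C);              /* complex-to-complex 3-D plan */
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   /* in-place forward FFT */
        cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);   /* inverse FFT (unnormalized) */

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }

3. Four bottlenecks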

There have, however, been several bottlenecks in the use of the GPU in the codes of the first principles electronic structure analyses.


Since almost all the codes have been written in the Fortran language, it is necessary to develop a tool to access the GPU from code written in Fortran. This is the first bottleneck. Second, to accelerate the computations with the GPU, we first allocate the memory area and transfer the data from the CPU (host) to the GPU (device); after finishing the calculations on the GPU, we transfer the data back from the GPU to the host and deallocate the memory area. The time for this preprocessing and postprocessing amounts to a substantial fraction of the total calculation time of the codes. Third, the operations in the GPU are limited essentially to trigonometric functions and matrix-matrix multiplications. Fourth, almost all the operations in the GPU are single precision. While GPU's with double precision have been released on the market, their prices are still high and their use is slower than that of the single precision GPU's; this will be discussed later. These bottlenecks have to be overcome before the GPU can be applied to the codes of the first principles electronic structure analyses.

In the present paper, we describe in detail the procedures that remove the four bottlenecks. For the first bottleneck, the use of a wrapper function enables us to access the GPU from the Fortran language. Second, the use of a pinned memory reduces the cost of the preprocessing and the postprocessing operations. Third, since the operations in the GPU are limited to the trigonometric functions and the matrix-matrix multiplications, we replace only the CPU based FFT with the GPU based FFT; we use the CUFFT of the CUDA as the GPU based FFT, after evaluating the speed of the CUFFT and comparing it with that of the CPU FFT. Fourth, we compare the accuracies of the total energies and interatomic forces calculated with the single precision FFT with those calculated with the double precision FFT, and show the validity of using the single precision CUFFT on the GPU for the FFT calculation in the code. The elimination of these bottlenecks provides the possibility of accelerating the planewave based first principles calculations.

4. Methods

4.1. Hardware and Software
The computer used is a desktop personal computer with the following specifications; mother board: Intel X58 chipset, CPU: Intel Core i7 Quad 920 (2.66 GHz) / 4.8 GT/s QPI / 8 MB cache, and main memory: DDR3 1066, 3 GB. In the present implementation, we use a single CPU with a single core. The GPU used is an NVIDIA GeForce GTX 285 with 1 GB of memory, which provides single precision calculation for the floating-point variables. The following software is used in the acceleration. The operating system is openSUSE 11.1 x86_64(15). The first principles code used in this study is PWscf (Plane-Wave Self-Consistent Field) in the package espresso 4.0.4, which is distributed under the GNU General Public License(16). The source code of the PWscf is compiled and linked with the -O3 option by g95 version 0.91 (March 2008)(17) and gcc version 4.3.2 of the openSUSE 11.1 distribution. The source code is written in Fortran 90 using double precision for the floating point variables. The CUDA used is version 2.1(14). Since the GPU GTX 285 provides single precision calculation for the floating-point variables, we convert the eight-byte variables into four-byte variables before transferring the data to the GPU FFT, and convert them back from four bytes to eight bytes after the GPU FFT calculation. We define these conversions as the cast. We check the accuracy of the use of the single precision in the GPU FFT by evaluating the total energies and the interatomic forces of a test system. The computation time required in each routine is counted in milliseconds using the system clock, a Fortran 90 intrinsic subroutine. The FFT library used on the GPU is the CUFFT 1.1 of the CUDA(14). We compare the extent of the acceleration of the GPU CUFFT with that of the CPU FFTW(18) (Fastest Fourier Transform in the West) 3.2.1 with the option FFTW_ESTIMATE. The FFTW routines are the fastest CPU based FFT routines available. The API (Application Programming Interface) of the CUFFT is almost the same as that of the FFTW and enables us to calculate superparallel FFT's on the GPU, where the parallel calculations on the GPU have been called superparallel.


The driver parameters for the use of the GPU are given by the CUFFT. First, we compare the extent of the acceleration of the CUFFT with that of the FFTW. Second, we check the differences in the total energies and the interatomic forces between the PWscf code with the single precision CUFFT and that with the double precision FFTW. Third, we compare the extent of the acceleration of the PWscf with the FFTW with that of the PWscf with the CUFFT.
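To make the cast and the data transfer (memcpy) concrete, the following is a hedged sketch of a single GPU FFT call as it could be wrapped on the host side. The function name, the interleaved real/imaginary storage of the double precision array, and the creation of the plan inside the call are our own simplifications and are not taken from the actual PWscf implementation.

    /* Sketch: one forward 3-D FFT of a double precision host array through the
       single precision CUFFT.  Names and interface are illustrative only. */
    #include <cufft.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    void fft3d_gpu_sp(const double *h_in_dp, double *h_out_dp, int nx, int ny, int nz)
    {
        size_t n = (size_t)nx * ny * nz;
        cufftComplex *h_sp = (cufftComplex *)malloc(sizeof(cufftComplex) * n);
        cufftComplex *d_sp;
        cudaMalloc((void **)&d_sp, sizeof(cufftComplex) * n);

        /* cast: eight-byte -> four-byte; real and imaginary parts assumed interleaved */
        for (size_t i = 0; i < n; i++) {
            h_sp[i].x = (float)h_in_dp[2 * i];
            h_sp[i].y = (float)h_in_dp[2 * i + 1];
        }

        cudaMemcpy(d_sp, h_sp, sizeof(cufftComplex) * n, cudaMemcpyHostToDevice);  /* memcpy CPU -> GPU */

        cufftHandle plan;
        cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);
        cufftExecC2C(plan, d_sp, d_sp, CUFFT_FORWARD);                             /* FFT exec on the GPU */
        cufftDestroy(plan);

        cudaMemcpy(h_sp, d_sp, sizeof(cufftComplex) * n, cudaMemcpyDeviceToHost);  /* memcpy GPU -> CPU */

        /* cast back: four-byte -> eight-byte */
        for (size_t i = 0; i < n; i++) {
            h_out_dp[2 * i]     = (double)h_sp[i].x;
            h_out_dp[2 * i + 1] = (double)h_sp[i].y;
        }

        cudaFree(d_sp);
        free(h_sp);
    }

The conversion loops and the two cudaMemcpy calls in this sketch correspond to the cast and memcpy contributions that are timed separately from the FFT execution in the measurements reported below.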

4.2. System investigated
To compare the computation times of the GPU with those of the CPU, we select a silicon system with the diamond type rhombohedral structure. Expanding the cell with lattice constant 10.21 a_0, which contains two silicon atoms, we prepare supercells of 1 × 1 × 1, 2 × 2 × 2, and 4 × 4 × 4; the respective supercells contain 2, 16, and 128 atoms. The pseudopotential for the silicon pseudoatom is a norm conserving version by Hamann(19). The XC potential is calculated within the framework of the local-density approximation by Perdew and Zunger(20). The k-points in the Brillouin zone are sampled with the Monkhorst-Pack method(21). A k-point mesh of 8 × 8 × 8 points is selected for the two silicon atoms in the rhombohedral unit cell; for the supercells containing 16 and 128 silicon atoms, lower densities of k-points are adopted to conserve the same accuracy as for the two silicon system: 4 × 4 × 4 points for 16 atoms and 2 × 2 × 2 points for 128 atoms. The optimization method for the electron system is DIIS (Direct Inversion in the Iterative Subspace)(22). The convergence threshold for the electronic self-consistency is set to 5.0 × 10^-6 E_h/atom, for which the number of electronic SCF iterations has been five for all the cases.

4.3. Number of FFT meshes
While the maximum component G_c^WF of the wavefunction relates to the cutoff energy E_c^WF of the wavefunction as

    E_c^{WF} = \left( G_c^{WF} \right)^2 ,                                          (4)

the maximum component G^CD of the charge density relates to the cutoff energy E_c^CD of the charge density as

    E_c^{CD} = \left( 2 G_c^{WF} \right)^2 = 4 E_c^{WF} ,                           (5)

since the charge density ρ(G) is given by the product between the wavefunctions and their conjugates. So the number N_PW^CD of the planewaves inside the sphere of radius 2G_c^WF becomes

    N_{PW}^{CD} = \frac{\tfrac{4}{3}\pi \left( 2 G_c^{WF} \right)^3}{(2\pi)^3 / \Omega}
                = \frac{\tfrac{4}{3}\pi \left( 4 E_c^{WF} \right)^{3/2}}{(2\pi)^3 / \Omega} .    (6)

We have set the cutoff energy of the wavefunction to 9.0 E_h for the two silicon atoms in the rhombohedral unitcell with a lattice constant of 10.21 a_0. The corresponding number N_PW^CD of planewaves becomes 2,733, as shown in Table 1, in which we also show the numbers N_PW^CD for the systems containing 16, 54, 128, and 256 atoms in the supercells. Confining the number of the 3D FFT cubic meshes to radix-2,

    N_{FFT} = 2^m \times 2^m \times 2^m ,                                           (7)

we calculate the numbers N_FFT of total meshes in the supercells that accommodate the spheres of radius 2G_c^WF. The restriction of the FFT meshes to radix-2 is made because the CUFFT and FFTW libraries use different algorithms for the other radixes, which would disturb our comparison of the computation times of the CUFFT and the FFTW.


Table 1  The numbers N_PW^CD of the planewaves for the supercells with the cutoff energy E_c = 9.0 E_h and the corresponding minimum FFT mesh sizes N_FFT with radix-2 to accommodate the spheres with the radius 2G_c.

    N_atom    N_PW^CD       N_FFT^CD
        2        2,733       32,768 = 32^3
       16       22,075      262,144 = 64^3
       54       74,129      262,144 = 64^3
      128      175,215    2,097,152 = 128^3
      256      342,133    2,097,152 = 128^3

The smallest 3D FFT mesh size that accommodates the number N_PW^CD = 2,733 of planewaves is 32^3. Table 1 shows the other 3D FFT mesh sizes for 16, 54, 128, and 256 silicon atoms in the supercells. We choose the same cutoff energy for all the supercells to conserve the same accuracy in all cases.
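The minimum mesh dimension per lattice direction follows from requiring that the grid hold all reciprocal lattice vectors up to the radius 2G_c. The sizing rule used in the sketch below, N >= 2 (2G_c) |a| / (2π) + 1 for a lattice vector of length |a|, rounded up to the next power of two, is the usual prescription in planewave codes and is stated here as our assumption, since the paper only quotes the resulting mesh sizes.

    /* Sketch: smallest radix-2 FFT dimension that accommodates a planewave sphere of
       radius gmax = 2*Gc along a lattice vector of length a_len (atomic units).
       The sizing rule is our assumption, not quoted from the paper. */
    #include <math.h>
    #include <stdio.h>

    static int radix2_mesh(double gmax, double a_len)
    {
        const double pi = 3.14159265358979;
        int nreq = (int)ceil(2.0 * gmax * a_len / (2.0 * pi)) + 1;  /* required points */
        int n = 2;
        while (n < nreq)          /* round up to the next power of two (radix-2) */
            n *= 2;
        return n;
    }

    int main(void)
    {
        /* Illustrative values: an fcc primitive vector of length a/sqrt(2) for a cubic
           lattice constant a = 10.21 a0, and a hypothetical sphere radius of 8.49 1/a0. */
        double a_len = 10.21 / sqrt(2.0);
        double gmax  = 8.49;
        printf("minimum radix-2 mesh: %d^3\n", radix2_mesh(gmax, a_len));   /* prints 32^3 */
        return 0;
    }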

4.4. Wrapper function
While the FFTW has both C and Fortran interfaces, the CUDA has only a C interface. For the PWscf code written in Fortran to access the CUDA, the following methods are possible. One is to execute the following series of processes at every access to a CUDA FFT function, i.e. the CUFFT: allocation of the memory, transfer of the data to the GPU, the FFT operation on the data, transfer of the data back to the CPU, and deallocation of the memory. Since this preprocessing and postprocessing is required for every access to the function, it takes a substantial time apart from the FFT operation itself. The cost of the processing is small when the number of CUDA FFT accesses is small enough. However, the present PWscf code, into which we are going to implement the CUDA FFT, accesses the FFT routine as often as 1,000 times per single iteration of the electron optimization in the two silicon system and 60,000 times in the 128 silicon system. These frequent accesses lead us to select a more efficient method: reducing the number of the preprocessing and the postprocessing operations by using a wrapper method. There are two ways to do this. One is to use commercial software such as the Portland Group Fortran compiler, and the other is to program a wrapper function ourselves. The wrapper function makes a single precision variable in Fortran correspond to a float* type address in the CUDA, among other correspondences; it covers not only the single precision variables but also all the other types of variables in Fortran except the double precision variable. No cost for the preprocessing or the postprocessing is required for the use of the CUDA FFT function other than the setting of the types at the initial and final parts of the PWscf code. The use of this wrapper function removes the first bottleneck.
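As an illustration of such a wrapper, the sketch below shows a C function that Fortran code could call directly. The trailing-underscore name and the pass-by-reference arguments follow the usual g95/gcc calling convention; the function name and interface are hypothetical and are not those of the actual PWscf implementation.

    /* Hypothetical Fortran-callable wrapper around a single precision CUFFT call.
       From Fortran (g95/gfortran convention) it would be invoked as, e.g.,
           call cufft3d_fwd(data, nx, ny, nz)
       where data is a complex(4) array and nx, ny, nz are default integers. */
    #include <cufft.h>
    #include <cuda_runtime.h>

    void cufft3d_fwd_(cufftComplex *h_data, int *nx, int *ny, int *nz)
    {
        size_t bytes = sizeof(cufftComplex) * (size_t)(*nx) * (*ny) * (*nz);

        cufftComplex *d_data;
        cudaMalloc((void **)&d_data, bytes);
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        cufftHandle plan;
        cufftPlan3d(&plan, *nz, *ny, *nx, CUFFT_C2C);   /* Fortran arrays are column major,
                                                           so the fastest index nx goes last */
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cufftDestroy(plan);

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_data);
    }

The allocation, transfer, and deallocation are shown inside the call only for compactness; as discussed above, in a code that calls the FFT tens of thousands of times these steps are moved out of the per-call path.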

4.5. Pinned memory
The pinned memory is a facility of the CUDA libraries for storing data in the CPU memory in a way that allows rapid transfer of the data between the GPU and the CPU memories. The optimization of the configuration of the data in the CPU memory region enables rapid transfer between the GPU and the CPU memories. Table 2 shows the acceleration of the data transfer speed between the CPU and the GPU when the pinned memory of the CUDA is used. The pageable memory, as defined in the CUDA, corresponds to a malloc in the C language, which is the conventional memory allocation method. The transfer speed from the CPU to the GPU is faster than the reverse, since the GPU memory was originally designed with asymmetric transfer rates, i.e. the transfer rate from the CPU to the GPU is faster than the reverse, because of the original purpose of fast output of data to the display. The use of the pinned memory thus removes the second bottleneck.
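The difference between the two allocation methods can be sketched as follows; the buffer size is arbitrary and error checking is omitted. cudaMallocHost is the page-locked allocator of the CUDA runtime, and a page-locked buffer can be transferred by DMA without an intermediate staging copy, which is the origin of the higher bandwidth in Table 2.

    /* Pageable versus pinned (page-locked) host buffers for GPU transfers.
       Buffer size is arbitrary; error checking omitted for brevity. */
    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t bytes = 64u * 1024u * 1024u;          /* 64 MB test buffer */
        float *d_buf;
        cudaMalloc((void **)&d_buf, bytes);

        /* pageable host memory: conventional malloc */
        float *h_pageable = (float *)malloc(bytes);
        memset(h_pageable, 0, bytes);
        cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

        /* pinned host memory: page-locked allocation */
        float *h_pinned;
        cudaMallocHost((void **)&h_pinned, bytes);
        memset(h_pinned, 0, bytes);
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

        cudaFreeHost(h_pinned);
        free(h_pageable);
        cudaFree(d_buf);
        return 0;
    }

The bandwidthTest program quoted in Table 2 performs essentially this comparison with repeated, timed copies.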

4.6. Accuracies of the single precision CUFFT
Here we check the accuracy of using the single precision CUFFT in the double precision PWscf code by comparing both the total energies and the forces calculated with the single precision CUFFT with those calculated with the double precision FFTW.


Table 2  Comparison of the data transfer speeds between the pinned memory and the pageable memory using bandwidthTest of the CUDA. The use of the pinned memory of the CUDA is faster than that of the pageable memory in data transfer. The pageable memory is the conventional memory allocation method. The speeds of 124.49 and 124.48 GB/s for data transfer within the GPU are about 22 times as fast as the transfer speed of the pinned memory between the GPU and the CPU.

    Data transfer speeds in units of GB/s
                      pinned    pageable
    CPU to GPU          5.78       4.89
    GPU to CPU          5.45       4.11
    GPU to GPU        124.49     124.48

Table 3  The differences in the calculated total energies per atom and in the forces between the PWscf with the single precision CUFFT and that with the double precision FFTW for the supercells. Each force is the sum of the three components on a single silicon atom displaced along the 111 direction by 0.2 % of the lattice constant.

    N_atom    total energy [E_h/atom]    force [E_h/a_0]
        2          8.30 × 10^-7            2.17 × 10^-7
       16          1.61 × 10^-6            2.68 × 10^-7
      128          2.12 × 10^-6            1.65 × 10^-7

This check is essential, since all the planewave based DFT codes have been developed using double precision FFT routines. Table 3 shows the differences in the total energies per atom and in the interatomic forces; the differences in the energies between the CUFFT and the FFTW are smaller than the present total energy convergence threshold of 5.0 × 10^-6 E_h/atom for the electronic self-consistency. We have also calculated the interatomic forces when a single silicon atom is displaced by 0.2 % of the lattice constant along the 111 direction. The differences in the forces between the CUFFT and the FFTW are less than 10^-6 E_h/a_0, which is far below the usual force convergence thresholds of the order of 5.0 × 10^-4 E_h/a_0 in geometrical optimizations. The differences in the energies and the forces between the double precision FFTW calculation and the single precision GPU calculation (not shown here) have been independent of the magnitude of the convergence threshold energies; the differences have been of the order of magnitude of the machine epsilon of the single precision variables. The single precision CUFFT therefore allows us to replace the double precision FFTW in the PWscf code.

5. Results

5.1. Comparison of speeds between CUFFT and FFTW
First, we compare the speed of the single precision CUFFT with that of the FFTW, which has two types of FFT's, i.e. single precision and double precision.

Figure 1 shows the computation times of the three 3-D FFT's, including overheads, as a function of N_FFT log_2 N_FFT. The three 3-D FFT's are the FFTW SP (single precision FFT on the CPU), the FFTW DP (double precision FFT on the CPU), and the CUFFT (single precision FFT on the GPU). The times for the sizes with 4 ≤ m ≤ 8 in Eq. (7) are shown in this figure. The times for the mesh sizes with m ≤ 3 are shown as zero, indicating that they are less than 0.001 s, which is the minimum unit of the system clock.

All the slopes for the large FFT meshes with m = 7 and 8 are unity, indicating that the computation times are proportional to N_FFT log_2 N_FFT. In this region, the CUFFT is faster than the FFTW SP or the FFTW DP for m ≥ 7, whereas the CUFFT is slower than those for m ≤ 6. The speed of the CUFFT is faster than that of the FFTW DP by a factor of 7 for m = 8 (N_FFT = 256^3), indicating the high efficiency of the CUFFT for larger systems. Next, we implement the CUFFT into the PWscf code to evaluate the acceleration in the first principles calculation.
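Timings of this kind can be taken either with the Fortran system clock used throughout this paper or, on the GPU side, with CUDA events. The hedged sketch below shows the event-based pattern for a transform including its host-device transfers; the mesh size and variable names are our own choices.

    /* Sketch: timing one CUFFT execution, including host-device transfers, with
       CUDA events.  The buffer contents are irrelevant for the timing. */
    #include <cufft.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 128;
        size_t bytes = sizeof(cufftComplex) * (size_t)n * n * n;

        cufftComplex *h_data, *d_data;
        cudaMallocHost((void **)&h_data, bytes);     /* pinned host buffer */
        cudaMalloc((void **)&d_data, bytes);

        cufftHandle plan;
        cufftPlan3d(&plan, n, n, n, CUFFT_C2C);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);      /* elapsed time in milliseconds */
        printf("FFT including transfers: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cufftDestroy(plan);
        cudaFree(d_data);
        cudaFreeHost(h_data);
        return 0;
    }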

Fig. 1  The total computation times of the three 3-D FFT's, including overheads, as a function of N_FFT log_2 N_FFT, to which the number of operations in an FFT is proportional, where N_FFT is the total number of 3-D FFT meshes.

5.2. Acceleration of PWscf(CUFFT) in the two silicon system
First, we investigate the computation times of the PWscf(CUFFT), in which we use the single precision CUFFT routine, and those of the PWscf(FFTW), in which we use the double precision FFTW routine, as a function of the number N_FFT of 3D meshes for the system containing two silicon atoms. The increase of N_FFT corresponds to an increase of the cutoff energy E_c of the system, i.e. to an increase of the number of reciprocal lattice vectors, which leads to an increase of the accuracy of the calculation. The acceleration obtained by implementing the CUFFT into the PWscf will be given in a subsequent section. Figure 2 shows the computation times of the PWscf(CUFFT) and the PWscf(FFTW) as a function of the number N_FFT of 3D meshes for the system containing two silicon atoms.

All the times increase with the 3-D mesh size N_FFT. The slope of the PWscf(CUFFT) is unity for m = 7 and 8, indicating that the computation time is proportional to N_FFT. The slope of the PWscf(FFTW) is larger than unity for these cases, and the PWscf(FFTW) is slower than the PWscf(CUFFT). The CUFFT needs two kinds of extra operations besides the pure FFT operation: the type conversions between the SP and the DP variables in the CPU, which we have defined as the cast, and the transfers of the data between the CPU and the GPU, which we define as the memcpy (memory copy).

Figure 3 shows the times for the cast as a function of the 3-D N_FFT; the times are proportional to the number N_FFT of FFT meshes. Figure 4 shows the times for the memcpy as a function of N_FFT. While the overall times increase with N_FFT, there are two types of dependence: for the cases of m = 4, 5, and 6, the time of the memcpy varies with sqrt(N_FFT), whereas for the cases of m = 7 and 8 it is proportional to N_FFT^1.6. These imply that the CUDA uses two different methods for the data communication between the CPU and the GPU, depending on the size of the data. Figure 5 shows the breakdown of the execution times, namely those for the FFT, the cast, the memcpy, and the else, in the total PWscf computation time for (a) the PWscf(CUFFT) and (b) the PWscf(FFTW). The 'else' means the rest of the computation time apart from the first three operations. In the case of (a) the PWscf(CUFFT), the percentage of the CUFFT exec decreases with the index m in Eq. (7), while in the case of (b) the PWscf(FFTW), the percentage of the FFTW exec increases with the index m. The decrease is due to the acceleration by the CUFFT in the PWscf. In (a) the PWscf(CUFFT), the fractions of the cast increase with the index m, which relates to the linear increase of the cast time shown in Fig. 3. The decrease of the fractions of the memcpy relates to the increase of the memcpy time shown in Fig. 4. Their sum is almost constant over the index m. For m = 8 in (a) the PWscf(CUFFT), the sum of the percentages related to the CUFFT reduces to 54 % from the 93 % of the FFTW exec in (b) the PWscf(FFTW). This is due to the high speed of the CUFFT operation.


Fig. 2  The total computation times of the PWscf(CUFFT) and the PWscf(FFTW) as a function of N_FFT for the two silicon atoms in the rhombohedral unitcell. The numbers N_FFT are those for m = 4, 5, 6, 7, and 8.

Fig. 3  The times required to cast the variables from the four-byte to the eight-byte type and from the eight-byte to the four-byte type for the system with two silicon atoms, for m = 4, 5, 6, 7, and 8, as a function of the 3-D N_FFT. The times show two dependences; the times for m = 7 and 8 are slower than those for m ≤ 6, while both slopes are unity.


Fig. 4  The times required to transfer the data from the CPU memory to the GPU memory and from the GPU memory to the CPU memory for the system containing two silicon atoms. For the cases of m = 4, 5, and 6 the slope is 0.5, and for the cases of m = 7 and 8 it is 1.6.

Fig. 5  The fractions of the times of the FFT exec, the memcpy, the cast, and the else for (a) PWscf(CUFFT) and (b) PWscf(FFTW) as a function of m, which relates to the size of N_FFT through Eq. (7).


Fig. 6  The total numbers of FFT calls for the PWscf(CUFFT) to achieve the self-consistent solution as a function of the system size N_atom. The same numbers are obtained for the PWscf(FFTW).

5.3. Acceleration of PWscf(CUFFT) in supercells
In this section, we evaluate the extent of the acceleration of the PWscf(CUFFT) for the supercells containing 2, 16, and 128 silicon atoms. We have set the same cutoff energy of 9.0 E_h for the supercells as for the unitcell to conserve the numerical accuracy of the calculation. The numbers of planewaves for the supercells have been shown in Table 1. The corresponding total numbers of the FFT calls for the PWscf(CUFFT) to achieve the self-consistent convergence are plotted in Fig. 6, while the total number of iterations to convergence has been five for all the supercells. The numbers of FFT calls are 4,766, 10,483, and 31,508, respectively; the number increases with the system size, and for the PWscf(CUFFT) this has indeed been the case. Figure 7 compares the computation times of the PWscf(CUFFT) with those of the PWscf(FFTW). The total computation times of the PWscf(CUFFT) and the PWscf(FFTW) are plotted in Fig. 7(a); the use of the CUFFT in the PWscf speeds up the calculation by a factor of 2.2 for the system with 128 silicon atoms, whereas for the two silicon system the use retards the speed by a factor of 0.41. The crossover is located at about five atoms. The use of the GPU FFT is thus effective in the acceleration of larger systems in the periodic DFT calculations. Figure 7(b) shows the computation times for the CUFFT exec and the FFTW exec in the PWscf. The use of the CUFFT exec gives a speedup by a factor of 88 for the system with 128 silicon atoms compared with the use of the FFTW, whereas for the two silicon system the times are almost the same. Figures 7(c) and (d) show the times for the data transfer (memcpy) and the type conversion (cast), which will be discussed later. Figure 8 shows the percentages of the CUFFT and FFTW exec times and their related times in the PWscf(CUFFT) and the PWscf(FFTW). For the 128 silicon system, the sum of the percentages of the CUFFT exec, the memcpy, and the cast reduces to 20 % of the total calculation of the PWscf(CUFFT), while for the two silicon system the sum amounts to 80 %. This is because the use of the CUFFT in the PWscf gives the factor 2.2 speedup for the system with 128 silicon atoms shown in Fig. 7. The percentages of the time of the FFTW in the PWscf are independent of the system size. This indicates that, in the CPU calculation, an increased system size increases the load not only in the FFT operation but also in the other parts, i.e., the 'else' operations in the PWscf(FFTW). To enhance the acceleration of the PWscf(CUFFT), we have combined the multi-CPU with the GPU, which is the so-called hybrid parallelization. Compiling an MPICH library with the CUDA enables us to hybridize. The ratios of the computation time of the multi-CPU with the GPU to that of the multi-CPU without the GPU have been independent of the number of the CPU's.


Fig. 7  The computation times of the PWscf(CUFFT) and the PWscf(FFTW) as a function of the system size N_atom for the cutoff energy of 9.0 E_h, where the + marks correspond to the times for the PWscf(CUFFT) and the × marks to those for the PWscf(FFTW). (a) Total times, (b) CUFFT exec and FFTW exec, (c) CUFFT data transfer, and (d) CUFFT cast. To compare the slopes among the panels, we show different ranges of both the abscissa and the ordinate axes.


Fig. 8  The percentages of the computation times of (a) the CUFFT routine in the PWscf and (b) the FFTW routines in the PWscf as a function of the system size, where the black regions are the time percentages for the net FFT calculations. For the CUFFT, the memcpy is the time percentage for the data transfer and the cast is that for the type conversion.


The computation times of the multi-CPU system with the GPU have been inversely proportional to the number of the CPU's. These results indicate that the use of the multi-CPU system with the GPU FFT accelerates the calculation by 2.2 f, where f is the acceleration factor of the multi-CPU system. We have confirmed this by using another multi-CPU system with the GPU.

6. Discussion

In the first part of this article, we replaced the double precision FFT routine in the PWscf code with the single precision CUFFT and evaluated the errors induced by the implementation of the single precision FFT routine through the total energies and the interatomic forces. The differences in the energies between the CUFFT and the FFTW have been smaller than the usual total energy convergence threshold of 5.0 × 10^-6 E_h/atom for the electronic self-consistency. We have also calculated the forces when a single silicon atom is displaced by 0.2 % of the lattice constant along the 111 direction. The differences in the forces between the CUFFT and the FFTW have been less than 10^-6 E_h/a_0, which is far below the usual force convergence thresholds of the order of 5.0 × 10^-4 E_h/a_0 in geometrical optimizations. The differences in the total energies between the CUFFT and the FFTW have been less than 1.7 × 10^-6 E_h/atom and those in the interatomic forces have been less than 4.0 × 10^-7 E_h/a_0; both differences have been independent of the convergence threshold of the energy and of the system size. These results, summarized in Table 3, indicate that the PWscf(CUFFT) gives no appreciable differences in the physical properties compared with the PWscf(FFTW), which uses the double precision FFTW. The single precision CUFFT has therefore allowed us to replace the double precision FFTW in the PWscf code. To the best of the authors' knowledge, this is the first evaluation of both errors and the first demonstration of the validity of this replacement in a planewave based DFT calculation.

There may be another possible way to accelerate the PWscf with the GPU, namely the use of a double precision GPU. However, the double precision GPU operations are slower than the single precision ones. For instance, in the case of the NVIDIA Tesla C1060, the speed of the double precision operation is 8 % of that of the single precision operation(23). Whereas the replacement of the single precision GPU by a double precision GPU would eliminate the cast part (see Fig. 8(a)), the CUFFT exec part (the black part) would increase by a factor of 13.6 and the memcpy part would increase by a factor of two. The factor 13.6 arises from the product of the ratio 1476 MHz / 1296 MHz of the processor clock of the GeForce GTX 285 to that of the Tesla C1060 and the ratio 933 GFlops / 78 GFlops of the single to double precision floating point performance of the Tesla C1060. This is also the case for the Fermi series. Besides, the memory clock frequency of the former is 1.55 times higher than that of the latter(23). We therefore predict that the implementation of a double precision GPU such as the NVIDIA Tesla C1060 would retard the PWscf(CUFFT) calculation relative to that with the single precision GPU. The price of the double precision GPU is also higher than that of the single precision GPU. Thus the implementation of the single precision GPU is more efficient than the double precision GPU for the planewave based DFT calculations with respect to both the cost and the speedup.

While the fraction of time of the CPU FFT calculation amounted to 0.64 of the whole PWscf code for the 128 silicon system, the fraction of time of the CUFFT calculation has decreased to 0.20 of the whole PWscf code, and the replacement has accelerated the whole PWscf calculation by a factor of 2.2 for the system with 128 silicon atoms.
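These three numbers are mutually consistent in the sense of Amdahl's law. If the FFT related fraction 0.64 of the original run time is accelerated by some factor s while the remaining 0.36 is unchanged, the reported overall speedup of 2.2 and the residual FFT related fraction of about 0.20 both follow from s ≈ 6.8; the rearrangement below is our own check of the paper's numbers, not an additional measurement.

    \frac{T_{new}}{T_{old}} = 0.36 + \frac{0.64}{s} = \frac{1}{2.2} \approx 0.455
    \quad\Rightarrow\quad s \approx 6.8 ,
    \qquad
    \frac{0.64/s}{0.36 + 0.64/s} \approx \frac{0.094}{0.455} \approx 0.21 .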
This speedup is due to the replacement of the FFTW by the fast CUFFT in the code. While the factor is modest, the GPU calculation in the PWscf becomes more effective when we extend the part of the computation performed on the GPU; for a further acceleration of the PWscf code, we will be able to implement on the GPU not only the CUFFT but also matrix calculations such as matrix-matrix and matrix-vector products. Since, while the order of the FFT operations is O(N log_2 N), that of the matrix-matrix operation is O(N^3), a further acceleration of the calculation can be expected.


Here, we elucidate why the use of the single precision FFT routines has given no appreciable differences in the total energies or the interatomic forces from those calculated with the double precision FFT routines. The self-consistent iterative process of the electronic system has a degree of freedom in the choice of the mixing ratio of the new potential to the previous one. The use of the single precision FFT routines for calculating the new potentials is the same kind of situation as the choice of these ratios. Even with the single precision FFT routine, the self-consistent process converges to the eigenstates through the iteration process, as long as the wavefunctions are calculated with double precision. Here, the differences in the total energies or the interatomic forces between the use of the single precision CUFFT routine and that of the double precision FFTW routine have been of the order of magnitude of the machine epsilon of the single precision variables and independent of the convergence threshold energies. This has also been the case for the implementation of the two-electron integral calculations on the single precision GPU in the realspace DFT calculations(9)–(12). Thus, in general, the single precision GPU calculation is implementable in any self-consistent electronic structure code, except for the implementation in the eigensolver routine of the DFT codes.

In the last place of the discussion, we discuss the acceleration of the BIGDFT code with the GPU Tesla C1060(3); BIGDFT is a wavelet based periodic DFT code written in the Fortran language. The acceleration by a factor of six has been accomplished using a GPU based wavelet transformation and convolution operation, among others. The following points, however, should be clarified before the BIGDFT acceleration can be compared with the present PWscf acceleration. First, the BIGDFT report gave no full specification of the CPU used and its clock frequency; the paper states that they used either a Xeon Harpertown (3 GHz) CPU or a Xeon Nehalem CPU. The specification is needed to evaluate the exact extent of the acceleration, since the factor depends on the clock speed of the CPU used. Second, the paper used different compile options for the BIGDFT(GPU) code and the BIGDFT(CPU) code; for the BIGDFT(CPU) code they used a "deeply optimized" option, and for the BIGDFT(GPU) code they used the -O2 -xT options. The discrepancy gives an incorrect evaluation of the acceleration factor, since the factor contains the effect of the compile options. Moreover, although at present only the nvcc and the PGI compiler are capable of compiling the CUDA code, they wrote that "the BIGDFT(GPU) code was compiled with the Intel Fortran compiler". An Intel Fortran compiler capable of compiling the BIGDFT(GPU) has, however, not been released to the public.

7. Conclusions

We have accelerated the ab initio CPU based periodic DFT calculation code PWscf using a GPU FFT routine. Before the implementation, we evaluated the fraction of the CPU FFTW calculation in the PWscf code and found that the FFT calculation is the most time-consuming part of the PWscf code; for the 128 silicon system, the fraction of time of the CPU FFT calculation amounts to 0.64 of the whole PWscf code. The implementation of the single precision GPU FFT routine CUFFT has given no appreciable differences from the double precision PWscf(FFTW) code in either the numerical total energies or the interatomic forces. The replacement has accelerated the PWscf by a factor of 2.2 for the system with 128 silicon atoms, and the fraction of time of the CUFFT calculation has decreased to 0.20 of the whole PWscf code, whereas the replacement has retarded the speed for the two silicon system. The use of the multi-CPU system with the GPU FFT accelerates the calculation by 2.2 f, where f is the acceleration factor of the multi-CPU system. The implementation of the GPU FFT is effective in the acceleration of the larger systems in the periodic DFT calculations. Further increases in the processing speed of the GPU's will contribute to accelerating the calculation of large systems.


Acknowledgments

An earlier stage of the numerical calculations on the PWscf(FFTW) was carried out on the Altix3700 BX2 at the YITP (Yukawa Institute for Theoretical Physics) of Kyoto University and on the SCore system at the Ikuta Media Support Office of Meiji University.

References

( 1 ) Car R. and Parrinello M., Unified Approach for Molecular Dynamics and Density-functional Theory, Physical Review Letters, Vol. 55 (1985), pp. 2471–2474.
( 2 ) Tomono H., Aoki M., Iitaka T. and Tsumuraya K., GPU Based Acceleration of First Principles Calculation, Journal of Physics: Conference Series, Vol. 215 (2010), pp. 012121.1–012121.4.
( 3 ) Genovese L., Ospici M., Deutsch T., Méhaut J-F., Neelov A. and Goedecker S., Density Functional Theory Calculation on Many-cores Hybrid Central Processing Unit-Graphic Processing Unit Architectures, Journal of Chemical Physics, Vol. 131 (2009), pp. 034103.1–034108.8.
( 4 ) Makino J. and Taiji M., Scientific Simulations with Special Purpose Computers: The GRAPE System, (1998), Wiley Blackwell.
( 5 ) Harris M., "GPGPU.org", General-Purpose Computation on Graphics Processing Units, (online), available from , (accessed 2011-05-27).
( 6 ) Hamada T. and Iitaka T., The Chamomile Scheme: An Optimized Algorithm for N-body Simulations on Programmable Graphics Processing Units, astro-ph/0703100 (2009).
( 7 ) Yokota R. and Obi S., Vortex Methods for the Simulation of Turbulent Flows: Review, Journal of Fluid Science and Technology, Vol. 6 (2011), pp. 14–29.
( 8 ) Matsuoka S., Aoki T., Endo T., Nukada A., Kato T. and Hasegawa A., GPU Accelerated Computing – from Hype to Mainstream, the Rebirth of Vector Computing, Journal of Physics: Conference Series, Vol. 180 (2009), pp. 012043.1–012043.10.
( 9 ) Yasuda K., Accelerating Density Functional Calculations with Graphics Processing Unit, Journal of Chemical Theory and Computation, Vol. 4 (2008), pp. 1230–1236.
(10) Ufimtsev I. S. and Martínez T. J., Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-electron Integral Evaluation, Journal of Chemical Theory and Computation, Vol. 4 (2008), pp. 222–231.
(11) Ufimtsev I. S. and Martínez T. J., Quantum Chemistry on Graphical Processing Units. 2. Direct Self-consistent-field Implementation, Journal of Chemical Theory and Computation, Vol. 5 (2009), pp. 1004–1015. See also the erratum: Journal of Chemical Theory and Computation, Vol. 5 (2009), p. 313.
(12) Ufimtsev I. S. and Martínez T. J., Quantum Chemistry on Graphical Processing Units. 3. Analytical Energy Gradients, Geometry Optimization, and First Principles Molecular Dynamics, Journal of Chemical Theory and Computation, Vol. 5 (2009), pp. 2619–2628.
(13) Nguyen H., GPU Gems 3, (2007), Addison-Wesley Professional.
(14) NVIDIA Co., "NVIDIA CUDA ZONE", (online), available from , (accessed 2011-05-27).
(15) "openSUSE project", openSUSE.org, (online), available from , (accessed 2011-05-27).
(16) Giannozzi P., Baroni S., Bonini N., Calandra M., Car R., Cavazzoni C., Ceresoli D., Chiarotti G. L., Cococcioni M., Dabo I., Dal Corso A., De Gironcoli S., Fabris S., Fratesi G., Gebauer R., Gerstmann U., Gougoussis C., Kokalj A., Lazzeri M., Martin-Samos L., Marzari N., Mauri F., Mazzarello R., Paolini S., Pasquarello A., Paulatto L., Sbraccia C., Scandolo S., Sclauzero G., Seitsonen A. P., Smogunov A., Umari P. and Wentzcovitch R. M., QUANTUM ESPRESSO: a Modular and Open-source Software Project for Quantum Simulations of Materials, Journal of Physics: Condensed Matter, Vol. 21 (2009), pp. 395502.1–395502.19.


(17) "The g95 Project", (online), available from , (accessed 2011-05-27).
(18) Frigo M. and Johnson S. G., The Design and Implementation of FFTW3, Proceedings of the IEEE, Vol. 93 (2005), pp. 216–231.
(19) Hamann D. R., Schlüter M. and Chiang C., Norm-conserving Pseudopotentials, Physical Review Letters, Vol. 43 (1979), pp. 1494–1497.
(20) Perdew J. P. and Zunger A., Self-interaction Correction to Density-functional Approximations for Many-electron Systems, Physical Review B, Vol. 23 (1981), pp. 5048–5079.
(21) Monkhorst H. J. and Pack J. D., Special Points for Brillouin-zone Integrations, Physical Review B, Vol. 13 (1976), pp. 5188–5192.
(22) Pulay P., Convergence Acceleration of Iterative Sequences. The Case of SCF Iteration, Chemical Physics Letters, Vol. 73 (1980), pp. 393–398.
(23) NVIDIA Co., "NVIDIA Products", (online), available from , (accessed 2011-05-27).
