Journal of Computational Science and Technology

Vol. 5, No. 3, 2011

Implementation of GPU-FFT into Planewave Based First Principles Calculation Method∗

Hidekazu TOMONO∗∗, Masaru AOKI∗∗∗,∗∗, Toshiaki IITAKA† and Kazuo TSUMURAYA∗∗

∗∗ Department of Mechanical Engineering Informatics, School of Science and Technology, Meiji University, 1–1–1 Higashimita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan
E-mail: [email protected] (KT)
∗∗∗ School of Management, Shizuoka Sangyo University, 1572-1 Ohwara, Iwata, Shizuoka 438-0043, Japan
† RIKEN (The Institute of Physical and Chemical Research), 2-1 Hirosawa, Wako, Saitama 351-0198, Japan

∗ Received 30 May, 2011 (No. 11-0305)
[DOI: 10.1299/jcst.5.89]
Copyright © 2011 by JSME

Abstract

We present an implementation of a GPU based FFT routine (Graphics Processing Unit based Fast Fourier Transformation) in a CPU based ab initio periodic DFT (Density Functional Theory) calculation code. The FFT is the most time-consuming part of CPU based DFT codes: for a system of 128 silicon atoms, the CPU FFT accounts for 0.64 of the total time of the periodic DFT calculation. Replacing the double precision FFT in the periodic PWscf code with a single precision FFT produces no appreciable differences in either the total energies or the interatomic forces, which justifies the use of the single precision GPU based FFT, CUFFT, in the code. The use of CUFFT reduces the FFT fraction to 0.20 of the whole PWscf calculation and speeds up the calculation by a factor of 2.2 on a single-CPU system. On a multi-CPU system with the GPU FFT, the calculation is accelerated by a factor of 2.2f, where f is the acceleration factor of the multi-CPU system. The single precision GPU calculation can be implemented in any self-consistent electronic structure code, except for the eigensolver part of the DFT codes.

Key words: GPU, First Principles Calculation, DFT, Planewaves, GPGPU, CUFFT

1. Introduction

First principles, or ab initio, electronic structure calculation methods are a robust technique for calculating and predicting the properties of materials; the methods enable us not only to predict solid state properties but also to design new materials. The methods, however, carry a high computational cost. There have been two approaches to accelerating the computation: one is to improve the hardware and the other is to improve the software.

On the hardware side, devices with higher carrier mobility than silicon, such as the compound GaAs, have been tried for CPU (Central Processing Unit) devices to accelerate the calculations. Since GaAs is a two-component system, it has been difficult to manufacture defect-free GaAs devices. At present we are unable to find commercial computers that use GaAs devices. Silicon devices with more finely spaced wiring have therefore been manufactured instead of two-component CPU devices to accelerate the computations. The reduction of the wiring spacing, however, has saturated because of the limit on the use of short-wavelength light. These factors have led to a stagnation in the acceleration of calculations on single CPUs.

On the one hand, the MPI (Message Passing Interface) system has enabled us to accelerate the calculations and overcome the stagnation. This is parallel computing. On the other hand, GPU devices, which have been used for fast output of data to graphics displays, have begun to be used for scientific numerical calculations. This is another attempt to overcome the stagnation, known as GPGPU (General Purpose Graphics Processing Units). The application of GPGPU to planewave based first principles electronic structure codes is essential for accelerating them.
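The fractions quoted in the abstract can be cross-checked with a simple Amdahl's-law estimate. The sketch below is not taken from the paper: it assumes that only the FFT part is accelerated, that the rest of the run time is unchanged, and that any data-transfer cost is counted inside the accelerated FFT time. The function and variable names are illustrative.

```python
def overall_speedup(fft_fraction, fft_speedup):
    """Whole-run speedup when only the FFT part runs fft_speedup times faster."""
    return 1.0 / ((1.0 - fft_fraction) + fft_fraction / fft_speedup)


def implied_fft_speedup(fft_fraction, measured_overall):
    """Invert the relation above: FFT speedup needed for a given overall speedup."""
    fft_time_after = 1.0 / measured_overall - (1.0 - fft_fraction)
    if fft_time_after <= 0.0:
        raise ValueError("overall speedup exceeds the limit 1 / (1 - fft_fraction)")
    return fft_fraction / fft_time_after


def fft_fraction_after(fft_fraction, fft_speedup):
    """Share of the shortened run that is still spent in the FFT."""
    fft_time_after = fft_fraction / fft_speedup
    return fft_time_after / ((1.0 - fft_fraction) + fft_time_after)


# Values quoted in the abstract (128-atom silicon system, single CPU).
FFT_FRACTION = 0.64
OVERALL_SPEEDUP = 2.2

fft_speedup = implied_fft_speedup(FFT_FRACTION, OVERALL_SPEEDUP)
remaining = fft_fraction_after(FFT_FRACTION, fft_speedup)
ceiling = 1.0 / (1.0 - FFT_FRACTION)  # limit for an infinitely fast FFT

print(f"implied FFT speedup : {fft_speedup:.2f}")
print(f"FFT share afterwards: {remaining:.2f}")
print(f"upper bound on gain : {ceiling:.2f}")
```

Under these assumptions the implied FFT speedup is about 6.8 and the remaining FFT share is about 0.21, close to the 0.20 quoted in the abstract. The same model gives an upper bound of about 2.8 on the overall speedup obtainable from accelerating the FFT alone.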
The code allows us to solve the Kohn-Sham equation, a partial differential equation,

\[ H_{KS} \Psi_i = \varepsilon_i \Psi_i . \tag{1} \]

Here $\Psi_i$ is the wavefunction of the $i$-th eigenstate and $\varepsilon_i$ is its eigenvalue. $H_{KS}$ is the Kohn-Sham Hamiltonian, which is given by

\[ H_{KS} = -\frac{\nabla^2}{2} + V_{ion}(\mathbf{r}) + V_{H}(\mathbf{r}) + V_{xc}(\mathbf{r}) \tag{2} \]

in units of the Hartree energy $E_h$ and the Bohr radius $a_0$. The first term is the kinetic energy operator of the electron, the second the electron-atom potential, the third the Hartree potential, and the fourth the exchange and correlation (XC) potential of the electrons. The XC potential is given by

\[ V_{xc} = \frac{\delta E_{xc}[\rho]}{\delta \rho(\mathbf{r})} , \tag{3} \]

where $\rho$ is the charge density.

The planewave based DFT methods expand the wavefunctions in planewaves to solve the equation. The following iteration is needed to reach the self-consistent solution of the Kohn-Sham equation. The charge density $\rho$, calculated from the wavefunctions $\Psi_i$ that solve Eq. (1), allows us to calculate the Hartree potential and the XC potential. Replacing the old potential in Eq. (1) with the new potential, which is the sum of the Hartree potential, the XC potential, and the pseudopotentials, enables us to solve the equation again, which in turn gives another new potential. This repeated improvement of the potential is the self-consistent process of the Kohn-Sham equation.

Since the XC potential is defined in real space, we need to transform the charge density from spectral space into real space, evaluate the XC potential, and transform back to spectral space. These steps use fast Fourier transformation (FFT) routines. To obtain the wavefunctions $\Psi_i$, which are the eigenvectors of the equation, we are able to use an iterative procedure for the large rectangular Hamiltonian matrices. This is the CP (Car-Parrinello) method(1). The planewave based self-consistent process calls the FFT and inverse FFT routines many times before it reaches the self-consistent solution.
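The round trip between spectral space and real space described above can be sketched on a small one-dimensional grid. This is a minimal illustration and not the PWscf implementation. It assumes the convention ρ(r) = Σ_G ρ(G) exp(iGr), a hand-written radix-2 FFT in place of a library routine, and the exchange-only LDA potential V_x = -(3ρ/π)^(1/3) as a stand-in for the XC functional, which the text does not specify. All names and numerical values are hypothetical.

```python
import cmath
import math


def fft(values, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT (length must be a power of two).

    Unnormalised in both directions; callers apply the 1/N factor.
    """
    n = len(values)
    if n == 1:
        return list(values)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def density_to_real_space(rho_g):
    """rho(r_j) = sum_G rho(G) exp(i G r_j), an unnormalised inverse FFT."""
    return [z.real for z in fft(rho_g, inverse=True)]


def exchange_potential(rho_r):
    """Exchange-only LDA (assumed stand-in for V_xc), Hartree atomic units."""
    return [-((3.0 * rho / math.pi) ** (1.0 / 3.0)) for rho in rho_r]


def potential_to_spectral_space(v_r):
    """V(G) = (1/N) sum_j V(r_j) exp(-i G r_j), a normalised forward FFT."""
    n = len(v_r)
    return [z / n for z in fft(v_r)]


# Hypothetical density: a constant background plus one cosine modulation.
N = 16
RHO0, AMP = 0.05, 0.02
rho_g = [0j] * N
rho_g[0] = RHO0
rho_g[1] = AMP / 2
rho_g[-1] = AMP / 2

rho_r = density_to_real_space(rho_g)     # spectral -> real space
v_r = exchange_potential(rho_r)          # evaluated point by point in real space
v_g = potential_to_spectral_space(v_r)   # real -> spectral space

print("V_x(G=0) =", round(v_g[0].real, 6))
```

Each self-consistent iteration repeats this pattern on a three-dimensional grid, which is why the FFT count grows with the number of iterations.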
The FFT calculations therefore take up a large fraction of the total time of first principles calculations. As will be shown later, the FFT accounts for 0.65 of the total calculation time for the system with two silicon atoms in a rhombohedral unit cell and 0.64 for the system with 128 silicon atoms. Replacing the CPU FFT routines with GPU FFT routines is expected to reduce the total computation time of the self-consistent process. In this paper, we accelerate the code by using the GPU FFT instead of the CPU FFT. An earlier, primitive version of this paper has been presented elsewhere(2). While Goedecker's group accelerated their real-space BIGDFT code with a GPU based wavelet transformation routine and the Intel compiler in 2009(3), there has been no report on the implementation of a GPU based FFT in a reciprocal-space, periodic DFT code.

2. GPU and GPGPU

2.1. Graphics Processing Unit: GPU

GPUs are specialized devices for rendering and accelerating graphics operations and are commercial, mass-produced products for general consumers, including gamers. In a broad sense, a GPU is a graphics card; in a narrow sense, GPU means the processor on the graphics card. In this paper, we use GPU in the former, broad sense.

The GPU has the following features. First, the GPU is a processor system for floating point operations. Almost all GPUs are designed for four-byte floating point operations, which are faster to compute than eight-byte operations. Second, GPUs are designed as Single Instruction Multiple Data (SIMD) systems. We are able to apply a single instruction stream to multiple data streams to perform operations that are naturally parallel.
GPUs have advantages over CPUs in the primitive operations used for graphics display output, such as matrix multiplication and trigonometric functions, and are able to draw directly to the screen much faster than CPUs. Third, GPUs incorporate many processor cores; for instance, the GeForce GTX 285 contains 240 processor cores, which are used for special mathematical operations such as trigonometric functions and matrix operations. Fourth, GPUs have three types of memory: local, shared, and global. Each of the microchips uses all three types. The microchips have high memory bandwidth: while the bandwidth of a CPU system ranges from 2.0 to 25.6 GB/s, that of a GPU system ranges from 10.0 GB/s to over 177 GB/s. These four features enable us to extend the original use of GPUs to general computational science and engineering. This application is GPGPU.

2.2. GPU as heterogeneous multicore

Computer processing has been accelerated by the development of the CPU in the following sequence: scalar, vector, parallel, and multicore parallel processing. In the past a single computer contained a single processor. To accelerate the computations, the CPU changed from the scalar CPU to the vector CPU and then to the parallel CPU. After that, more than one processor was integrated into a single computer. These are multicore processors, and they have enabled us to accelerate numerical calculations. This development was forced by the stagnation in the development of high speed devices with high carrier mobilities.

The next stage in the acceleration of computations has been heterogeneous multicore processing. One example is the GRAPE system, which uses dedicated hardware to accelerate the computation of the gravitational interaction, the part that takes up a large fraction of the total computation time(4).
