Proceedings of the International School-seminar “New Physics and Quantum Chromodynamics at External Conditions”, pp. 29 – 33, 3-6 May 2011, Dnipropetrovsk, Ukraine

SU(3) GLUODYNAMICS ON GRAPHICS PROCESSING UNITS

V. I. Demchik

Dnipropetrovsk National University, Dnipropetrovsk, Ukraine

SU(3) gluodynamics is implemented on graphics processing units (GPU) with the open cross-platform computational language OpenCL. The architecture of the first open cluster for high performance computing on graphics processing units (HGPU), with a peak performance over 12 TFlops, is described.

1 Introduction

Most of the modern achievements in science and technology are closely related to the use of high performance computing. The ever-increasing performance requirements of computing systems cause high rates of their development. According to the popular TOP-500 list of the most powerful (non-distributed) computer systems in the world [1], every year the newly introduced computing power roughly matches the combined performance of all high-end systems built before. The obvious desire to reduce the cost of acquisition and maintenance of computer systems, together with the growing demands on their performance, shifts supercomputing architecture in the direction of heterogeneous systems. Such architectures contain special computational accelerators in addition to traditional central processing units (CPU). One such accelerator is the graphics processing unit (GPU), traditionally designed and used primarily for the narrow task of visualizing 2D and 3D scenes in games and applications. The peak performance of modern high-end GPUs (AMD Radeon HD 6970, AMD Radeon HD 5870, nVidia GeForce GTX 580) is about 20 times higher than the peak performance of comparable CPUs (see Fig. 1). Over the last five years GPU performance has roughly doubled every year. Thanks to a substantially lower cost (per Flops^1), as well as higher rates of performance growth compared to CPUs, graphics accelerators have become a very attractive platform for high performance computing.

The current overall performance of all computer systems ranked in the latest TOP-500 list is about 85 PFlops [1]. The number of computer systems per country is: USA - 255 (51.0%), China - 61 (12.2%), Germany - 30 (6.0%), UK - 27 (5.4%), Japan - 26 (5.2%), France - 25 (5.0%), Russia - 12 (2.4%). Some of these systems, several of them equipped with graphics accelerators, are shown in Table 1 (the performance Rmax of the corresponding system, measured in GFlops, is given in the last column).

According to Flynn's taxonomy [2], which describes four classes of architectures in both a serial and a parallel context, GPUs are based on the so-called SIMD architecture. SIMD (Single Instruction stream - Multiple Data stream) is an architecture in which each processing unit executes the same instruction stream as the others on its own set of data. In other words, a set of processors shares the same control unit, and their execution differs only by the data elements each processor operates on. Due to the large number of processing units in a GPU and the SIMD architecture, it is natural to implement lattice simulations on GPUs. The technique of using a GPU, which typically handles computations only for computer graphics, to perform computations in applications traditionally handled by the CPU is called general-purpose computing on graphics processing units (GPGPU).

There are different approaches to GPU utilization. One of the novel languages for programming heterogeneous systems is the open computational language OpenCL [3]. It is maintained by a group of industry vendors, the Khronos Group, that have an interest in promoting a general way to target different computational devices. In spite of some performance inefficiency, OpenCL possesses a very important feature for high performance computing: portability. Therefore we chose OpenCL as the programming environment for our task. The technical details of lattice simulations on computers are well known [4, 5, 6, 7].
We implemented the standard lattice simulation scheme on the GPU with the OpenCL language. The scheme is described in the next Section; the details of the GPU implementation are given in Section 3.
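To make the SIMD picture above concrete, the following toy sketch (plain C, standing in for an OpenCL kernel) shows a single work-item body that every processing element executes identically, differing only in the element index it is given; the function name saxpy_work_item, the array size and the serial driver loop are purely illustrative assumptions and are not part of the package described in this paper.

    #include <stdio.h>

    #define N 8

    /* Body executed by every "work-item": the same instructions, applied to
       the element selected by the global id (SIMD / data-parallel model).   */
    static void saxpy_work_item(int gid, float a, const float *x, float *y)
    {
        y[gid] = a * x[gid] + y[gid];
    }

    int main(void)
    {
        float x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = (float)i; y[i] = 1.0f; }

        /* On a GPU all N work-items would run concurrently; here the launch
           is emulated by a serial loop over the global ids.                 */
        for (int gid = 0; gid < N; ++gid)
            saxpy_work_item(gid, 2.0f, x, y);

        for (int i = 0; i < N; ++i)
            printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }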

^1 Flops (floating point operations per second) is a measure of a computer's performance.

e-mail: [email protected]
© Demchik V.I., 2011.

Figure 1. GPU and CPU performance growth.
Figure 2. Checkerboard decomposition of the lattice.

2 Lattice formulation

We used a hypercubic lattice L_t × L_s^3 (L_t < L_s) with hypertorus geometry; L_t and L_s are the temporal and spatial sizes of the lattice, respectively. Evidently, the total number of lattice sites is N_{sites} = L_1 × L_2 × L_3 × L_t. In order to parallelize the simulation algorithm we have implemented the so-called checkerboard scheme (see Fig. 2): the whole lattice is divided into two parts, odd (red) and even (green) sites. The number of odd sites is N_{vect} = N_{sites}/2, as is the number of even ones. The lattice may be initialized with a cold or a hot configuration. The standard Wilson one-plaquette action of SU(3) lattice gauge theory is used,

S_W = \beta \sum_x \sum_{\mu<\nu} \left[ 1 - \frac{1}{3}\,\mathrm{Tr}\, U_{\mu\nu}(x) \right];    (1)

U_{\mu\nu}(x) = U_\mu(x)\, U_\nu(x+a\hat{\mu})\, U^\dagger_\mu(x+a\hat{\nu})\, U^\dagger_\nu(x),    (2)

where β = 6/g^2 is the lattice coupling constant, g is the bare coupling, U_\mu(x) is the link variable located on the link leaving the lattice site x in the direction μ, and U_{\mu\nu}(x) is the ordered product of the link variables (here and below we omit the group indices). The defining representation of the gauge field variables is given by complex 3×3 matrices for SU(3). Due to the unitarity of the SU(3) group the minimal number of parameters is 8. There are different possible representations of SU(3) matrices in computer memory [4]. In order to speed up practical calculations it is more convenient to use a redundant representation. The SU(3) group elements may be represented either by the complete complex matrix, or by its first two rows, corresponding to the complex 3-vectors u and v. Due to the group properties of SU(3) matrices, the third row of the matrix can be reconstructed from the vectors u and v:

U_\mu(x) = \begin{pmatrix} u \\ v \\ u^* \times v^* \end{pmatrix}.    (3)

So, an SU(3) lattice variable can be presented as a set of 6 complex numbers (12 real numbers).
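The reconstruction of the third row in eq. (3) can be illustrated with a minimal plain C sketch; the complex type cplx and the helpers cconj, cmul, csub are introduced here only for illustration and are not taken from the package itself.

    #include <stdio.h>

    typedef struct { float re, im; } cplx;        /* single-precision complex */

    static cplx cconj(cplx a) { cplx r = { a.re, -a.im }; return r; }
    static cplx cmul (cplx a, cplx b)
    {
        cplx r = { a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re };
        return r;
    }
    static cplx csub (cplx a, cplx b) { cplx r = { a.re - b.re, a.im - b.im }; return r; }

    /* Reconstruct the third row w = u* x v* (complex cross product of the
       conjugated first two rows), so that the full matrix is (u, v, w).     */
    static void su3_third_row(const cplx u[3], const cplx v[3], cplx w[3])
    {
        cplx uc[3], vc[3];
        for (int i = 0; i < 3; ++i) { uc[i] = cconj(u[i]); vc[i] = cconj(v[i]); }

        w[0] = csub(cmul(uc[1], vc[2]), cmul(uc[2], vc[1]));
        w[1] = csub(cmul(uc[2], vc[0]), cmul(uc[0], vc[2]));
        w[2] = csub(cmul(uc[0], vc[1]), cmul(uc[1], vc[0]));
    }

    int main(void)
    {
        /* rows of the identity matrix: u=(1,0,0), v=(0,1,0) -> w must be (0,0,1) */
        cplx u[3] = { {1,0}, {0,0}, {0,0} };
        cplx v[3] = { {0,0}, {1,0}, {0,0} };
        cplx w[3];
        su3_third_row(u, v, w);
        printf("w = (%g%+gi, %g%+gi, %g%+gi)\n",
               w[0].re, w[0].im, w[1].re, w[1].im, w[2].re, w[2].im);
        return 0;
    }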

No    Computer                                      Country    Year    Cores     Rmax
1     K computer, SPARC64 VIIIfx                    Japan      2011    548352    8162000
2     NUDT TH MPP, Tesla C2050 GPU                  China      2010    186368    2566000
3     Cray XT5-HE                                   USA        2009    224162    1759000
4     Dawning TC3600 Blade, Tesla C2050 GPU         China      2010    120640    1271000
5     HP ProLiant SL390s, Tesla C2050 GPU           Japan      2010    73278     1192000
13    T-Platforms T-Blade2/1.1, Tesla C2070 GPU     Russia     2011    33072     674105
22    Supermicro Cluster, ATI Radeon GPU            Germany    2011    16368     299300
499   Cluster Platform 3000 BL2x220                 Germany    2010    4800      40217
500   BladeCenter HS22 Cluster                      China      2010    7104      40187

Table 1. Some computer systems from the world TOP-500 list [1]; Rmax is given in GFlops.

To avoid the accumulation of rounding errors in the multiplications of the group elements, all matrices have to be regularly projected back onto the SU(3) group. This is done with the well-known Gram-Schmidt method for building orthonormal basis vectors,

u_{new} = u/|u|,    (4)
v_{new} = v'/|v'|,  \qquad  v' = v - u_{new}\,(v \cdot u^*_{new}).

To generate a candidate matrix for the updating algorithm we used the Cabibbo-Marinari scheme [14]. SU(3) random matrices are constructed from three 2×2 SU(2) random matrices (R, S and T):

\bar{U}^{a}_{\mu}(x) = \begin{pmatrix} r_{11} & r_{12} & 0 \\ r_{21} & r_{22} & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} s_{11} & 0 & s_{12} \\ 0 & 1 & 0 \\ s_{21} & 0 & s_{22} \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & t_{11} & t_{12} \\ 0 & t_{21} & t_{22} \end{pmatrix}.    (5)

To update the lattice, the heat-bath algorithm with overrelaxation was used [8, 15]. At every MC iteration we successively replace each lattice link

U_\mu(x) \to \bar{U}_\mu(x) = w\,W_\mu(x),    (6)

W_\mu(x) = \sum_{(\mu<\nu)} \left[ U_\nu(x)\, U_\mu(x+a\hat{\nu})\, U^\dagger_\nu(x+a\hat{\mu}) + U^\dagger_\nu(x-a\hat{\nu})\, U_\mu(x-a\hat{\nu})\, U_\nu(x+a(\hat{\mu}-\hat{\nu})) \right],    (7)

where w is a normalization constant such that

\bar{U}_\mu(x) \in SU(3).    (8)

The central part of the lattice simulation program is the updating subroutine. We have to systematically update all lattice links of a given configuration by suggesting new variables and accepting them. Visiting all links once is called a lattice sweep. Sufficiently many updates have to be performed until the subsequently generated configurations represent the equilibrium distribution; this procedure is known as thermalization. After thermalization one can start to compute the observables (e.g. the energy) on the Monte Carlo configurations. The configurations used for measuring are separated by several intermediate updates (a schematic sketch of this driver loop is given at the end of this section).

There are several software packages for Monte Carlo simulations on a lattice:

• FermiQCD: an open C++ library for the development of parallel lattice quantum field theory computations [9, 10].
• MILC: open high performance research software written in C for SU(3) lattice gauge theory simulations on several different (MIMD) parallel computers [11].
• SZIN: an open-source software system supporting data-parallel programming constructs for lattice field theory and in particular lattice QCD [12].
• QUDA: a library for performing calculations in lattice QCD on graphics processing units [13].

The first three packages are CPU-based, while the last one is intended for nVidia GPUs only. In the next Section we will show how the described simulation algorithm can be implemented both on AMD/ATI and nVidia GPUs.
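The sweep/thermalization/measurement structure described above can be summarized in a minimal host-side C sketch. The routines update_parity(), reunitarize_all() and measure_plaquette(), as well as the sweep counts, are hypothetical stand-ins for the OpenCL kernel launches and run parameters of the actual package.

    #include <stdio.h>

    /* Hypothetical stand-ins for the OpenCL kernel launches of the package. */
    static void update_parity(int parity) { (void)parity; /* stub: heat-bath + overrelaxation sweep over one checkerboard parity */ }
    static void reunitarize_all(void)     { /* stub: Gram-Schmidt projection of all links, eq. (4) */ }
    static double measure_plaquette(void) { return 0.0;   /* stub: double-precision measurement kernel */ }

    enum { N_THERM = 200, N_MEAS = 100, N_SKIP = 10, N_REUNIT = 10 };

    int main(void)
    {
        int sweep = 0;

        /* Thermalization: update until the generated configurations represent
           the equilibrium distribution; nothing is measured here.            */
        for (int i = 0; i < N_THERM; ++i, ++sweep) {
            update_parity(0);                         /* even (green) sites   */
            update_parity(1);                         /* odd  (red)   sites   */
            if (sweep % N_REUNIT == 0) reunitarize_all();
        }

        /* Measurements, separated by N_SKIP intermediate updates.            */
        for (int m = 0; m < N_MEAS; ++m) {
            for (int i = 0; i < N_SKIP; ++i, ++sweep) {
                update_parity(0);
                update_parity(1);
                if (sweep % N_REUNIT == 0) reunitarize_all();
            }
            printf("measurement %d: plaquette = %f\n", m, measure_plaquette());
        }
        return 0;
    }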

3 GPU implementation

To design GPU applications the following key features of the GPU hardware architecture must be taken into consideration: each general-purpose register (GPR), as well as each memory cell, has four 32-bit components that are called GPR slots and are usually designated .x, .y, .z and .w; memory operations are the main bottleneck of a GPU; floating-point operations are more efficient on a GPU than integer operations (in comparison with CPUs); double precision floating-point operations are the slowest on a GPU.

First of all, we introduce the GPU-memory representation of the SU(3) link variable. As mentioned in the previous section, SU(3) matrices can be presented in the form (3), so we need 12 real single precision numbers to store one SU(3) matrix. These 12 numbers can be placed into three GPU-memory cells r1, r2 and r3:

        .x        .y        .z        .w
  r1    Re[u_1]   Re[u_2]   Re[u_3]   Re[v_3]
  r2    Im[u_1]   Im[u_2]   Im[u_3]   Im[v_3]                               (9)
  r3    Re[v_1]   Re[v_2]   Im[v_1]   Im[v_2]
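A minimal plain C illustration of layout (9) is given below; the struct float4 with fields .x, .y, .z, .w mimics the built-in OpenCL vector type, the pack/unpack helpers su3_pack and su3_unpack are illustrative names rather than the actual kernel code, and 0-based indices u[0]..u[2] stand for u_1..u_3.

    #include <stdio.h>

    typedef struct { float x, y, z, w; } float4;  /* stand-in for the OpenCL float4 type */
    typedef struct { float re, im; } cplxf;

    /* Pack the first two rows u, v of an SU(3) matrix into three
       4-component cells according to layout (9).                            */
    static void su3_pack(const cplxf u[3], const cplxf v[3],
                         float4 *r1, float4 *r2, float4 *r3)
    {
        r1->x = u[0].re; r1->y = u[1].re; r1->z = u[2].re; r1->w = v[2].re;
        r2->x = u[0].im; r2->y = u[1].im; r2->z = u[2].im; r2->w = v[2].im;
        r3->x = v[0].re; r3->y = v[1].re; r3->z = v[0].im; r3->w = v[1].im;
    }

    /* Inverse operation: restore the two rows from the three cells. */
    static void su3_unpack(float4 r1, float4 r2, float4 r3, cplxf u[3], cplxf v[3])
    {
        u[0].re = r1.x; u[1].re = r1.y; u[2].re = r1.z; v[2].re = r1.w;
        u[0].im = r2.x; u[1].im = r2.y; u[2].im = r2.z; v[2].im = r2.w;
        v[0].re = r3.x; v[1].re = r3.y; v[0].im = r3.z; v[1].im = r3.w;
    }

    int main(void)
    {
        cplxf u[3] = { {1,0}, {0,0}, {0,0} }, v[3] = { {0,0}, {1,0}, {0,0} };
        cplxf u2[3], v2[3];
        float4 r1, r2, r3;

        su3_pack(u, v, &r1, &r2, &r3);              /* round-trip check */
        su3_unpack(r1, r2, r3, u2, v2);
        printf("u2[0] = %g%+gi, v2[1] = %g%+gi\n",
               u2[0].re, u2[0].im, v2[1].re, v2[1].im);
        return 0;
    }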

It should be noted that, in contrast to the SU(3) case, one SU(2) matrix can be placed in a single GPR. All data necessary for the simulations are stored in global GPU memory. The GPU carries out the intermediate computations and returns the results to the host program for final data handling and output. To speed up the execution we avoid any run-time data transfer between the host program and the GPU. Intermediate configurations can be saved, but this slows down the execution and is turned off by default.

The lattice is stored in GPU memory as single precision data. This significantly reduces the memory requirements of the simulation and increases the performance of the program. All measuring and averaging kernels work in double precision and output the results as 64-bit values to avoid error accumulation. A usual amount of memory on a modern GPU is 2 GB, so lattices up to 16 × 48^3 can be simulated on a single GPU device without involving host memory. Due to this memory limitation, bigger lattices can be simulated on the GPU only with run-time CPU-GPU data transfer; this slows down the program considerably, but allows one to investigate very large lattices.

The next step is the implementation of a pseudo-random number generator on the GPU. As was shown in [16], the best GPU-oriented generators are XOR128 and RANLUX. We have implemented both generators in OpenCL. We use the XOR128 generator (as a very fast one) for a rough scan of the parameter space and the RANLUX generator to perform all key measurements; the generator can be chosen by an initial parameter (a generic sketch of the XOR128 update step is given at the end of this section).

The lattice may be initialized with a cold or a hot configuration. There are two possibilities for hot initialization, i.e. for filling the lattice with pseudo-random SU(3) matrices: a constant series (predefined seeds) or a random series (system-timer-dependent random seeds) of pseudo-random numbers. In the first case every startup reproduces the same results for every execution, which is convenient for debugging purposes.

At the moment the stable working simulation program is a single-GPU-device program, so on a computational cluster with multiple GPU devices we can use only the trivial parallelization scheme: each GPU device performs the simulation with its own initial parameters. After each run the simulation program writes all measurements to a file on disk. Such a scheme allows different machines to be used simultaneously in one investigation. It should be noted that the use of the OpenCL language allows the simulation program to be run not only on GPUs, but also on CPUs.
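For reference, the update step of Marsaglia's xorshift128 generator, the algorithm underlying XOR128-type generators, can be sketched in plain C as follows; this is a generic single-stream version with arbitrary example seeds, not the multi-threaded OpenCL implementation of [16].

    #include <stdint.h>
    #include <stdio.h>

    /* Generator state: four 32-bit words, must not be all zero. */
    static uint32_t sx = 123456789u, sy = 362436069u, sz = 521288629u, sw = 88675123u;

    /* One step of Marsaglia's xorshift128; period 2^128 - 1. */
    static uint32_t xor128(void)
    {
        uint32_t t = sx ^ (sx << 11);
        sx = sy;  sy = sz;  sz = sw;
        sw = sw ^ (sw >> 19) ^ (t ^ (t >> 8));
        return sw;
    }

    /* Uniform deviate in [0, 1). */
    static float xor128_uniform(void)
    {
        return (float)(xor128() * (1.0 / 4294967296.0));
    }

    int main(void)
    {
        for (int i = 0; i < 5; ++i)
            printf("%f\n", xor128_uniform());
        return 0;
    }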

4 HGPU cluster

In order to increase the performance of our computational system we have joined several GPUs into one HGPU (High performance computing on Graphics Processing Units) cluster (see Fig. 3). The HGPU cluster covers almost the whole ATI/AMD product line: ATI Radeon HD 4850, ATI Radeon HD 4870, ATI Radeon HD 5850, ATI Radeon HD 5870 and ATI Radeon HD 6970. In addition, an nVidia GeForce 560 Ti is included in the HGPU cluster. The HGPU cluster has a mixed structure: it contains dedicated and non-dedicated resources. The current overall peak performance of the whole HGPU cluster is about 12 TFlops; the peak performance of its dedicated servers is about 7 TFlops.

For Monte Carlo simulations on the GPU cluster we apply the trivial parallelization scheme: each node simulates the lattice with its own lattice parameters, which are defined by the host through an ini-file. The cluster task scheduler runs on one of the cluster nodes, which plays the host role. After execution a node outputs its results to a text file with a unique filename (containing the node prefix and the system time, to avoid overwriting files) and writes a finish.txt file in its working directory. The host scans the working directories of the nodes with some delay and waits for the file finish.txt; once it appears, the host creates an ini-file for the node and deletes finish.txt. The node, in turn, waits for the deletion of finish.txt in its working directory before starting the new task (a minimal sketch of this handshake from the node side is given after the feature list below). The main features of the HGPU cluster are the following:

• the cluster may be totally heterogeneous (different nodes may have different GPUs, different OSs and so on);
• the host and node software take a small amount of computer resources, so they can be run on a very modest PC with modern GPUs;
• the host software may be run on one of the nodes;
• several nodes may reside on one PC (dual-GPU hardware, several cards, and so on);
• the host parses the task file, where the cluster parameters and tasks are specified (for example, only a part of the cluster may be used);
• only the parameters which have to be changed for a particular run are listed between the tasks in the task file.
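The host-node file handshake described in this section can be sketched, on the node side, as a simple polling loop in plain C. The file finish.txt is named in the text; the ini-file name node.ini, the polling interval and the run_task() placeholder are illustrative assumptions rather than the actual cluster scripts.

    #include <stdio.h>
    #include <unistd.h>    /* sleep() (POSIX) */

    /* Placeholder for running one simulation task described by the ini-file. */
    static void run_task(const char *ini_file)
    {
        printf("running task from %s ...\n", ini_file);
    }

    static int file_exists(const char *name)
    {
        FILE *f = fopen(name, "r");
        if (f) { fclose(f); return 1; }
        return 0;
    }

    int main(void)
    {
        for (;;) {
            /* Wait until the host has consumed the previous results: the host
               deletes finish.txt after writing a new ini-file for the node.  */
            while (file_exists("finish.txt"))
                sleep(5);

            run_task("node.ini");                 /* perform the simulation   */

            /* Signal completion: results were already written to a uniquely
               named output file; finish.txt asks the host for a new task.    */
            FILE *f = fopen("finish.txt", "w");
            if (f) fclose(f);
        }
        return 0;
    }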

Figure 3. Dedicated servers of the HGPU cluster.

5 hgpu.org

There are several Internet resources devoted to GPGPU [17, 18]. We have created the site http://hgpu.org/ to provide technical information for people who seek the possibility to accelerate their scientific computations by means of graphics processing units [19]. The site contains links to reviews, tutorials, research papers and program packages concerning various aspects of graphics and non-graphics (general purpose) computing on GPUs and related parallel architectures (FPGAs, Cell processors, etc.). The site also collects and catalogues information about programming tools and modern hardware for the development of GPU applications. The Events page provides links to related conferences, seminars and meetings. We are continuously updating links, papers and other information in our effort to push the envelope of GPU computations. At the moment the site contains over 4000 research papers and over 300 GPU program packages. Registered users can place their own information on http://hgpu.org/. The dedicated servers of the HGPU cluster are available through the Dashboard free of charge for registered users.

6 Conclusions

In this paper we introduced a newly developed OpenCL package for SU(3) gluodynamics simulations on GPUs. The package makes it possible to study gauge field dynamics on lattices up to 16 × 48^3 in single-GPU mode. The framework of the package allows the gauge group SU(3) to be replaced easily with any SU(N) group. Currently we continue to work on incorporating MPI technology in order to develop a full multi-GPU package.

In order to increase the overall performance of our computational system we have built a small GPU cluster with a peak performance of about 12 TFlops. The trivial architecture of the cluster allows its compute performance to be increased additively by adding new nodes. The dedicated servers of the cluster are free of charge and available to researchers through the first open GPU-hardware Internet resource, hgpu.org.

References

[1] The TOP-500 list, http://www.top500.org/ (Last access: October 2011).
[2] M. Flynn, IEEE Trans. Comput. C-21, 948 (1972).
[3] OpenCL - The open standard for parallel programming of heterogeneous systems, Khronos Group, http://www.khronos.org/opencl/ (Last access: October 2011).
[4] C. Gattringer, C. B. Lang, Lect. Notes Phys. 788, 1-211 (2010).
[5] T. DeGrand, C. E. Detar, New Jersey, USA: World Scientific (2006) 345 p.
[6] H. J. Rothe, World Sci. Lect. Notes Phys. 74, 1-605 (2005).
[7] I. Montvay, G. Munster, Cambridge, UK: Univ. Pr. (1994) 491 p. (Cambridge Monographs on Mathematical Physics).
[8] M. Creutz, Phys. Rev. D36, 515 (1987).
[9] M. Di Pierro and J. M. Flynn, "Lattice QFT with FermiQCD", PoS LAT2005, 104 (2006) [arXiv:hep-lat/0509058].
[10] FermiQCD, http://web2py.com/fermiqcd/ (Last access: October 2011).
[11] MILC Collaboration, http://www.physics.utah.edu/~detar/milc/ (Last access: October 2011).
[12] SZIN Software System, http://www.jlab.org/~edwards/szin/ (Last access: October 2011).
[13] QUDA: A library for QCD on GPUs, http://lattice.github.com/quda/ (Last access: October 2011).
[14] N. Cabibbo, E. Marinari, Phys. Lett. B119, 387-390 (1982).
[15] A. D. Kennedy, B. J. Pendleton, Phys. Lett. B156, 393-399 (1985).
[16] V. Demchik, Comput. Phys. Commun. 182, 692-705 (2011) [arXiv:1003.1898 [hep-lat]].
[17] General-Purpose Computation on Graphics Hardware, http://gpgpu.org/ (Last access: October 2011).
[18] GPUcomputing, http://gpucomputing.net/ (Last access: October 2011).
[19] High performance computing on graphics processing units, http://hgpu.org/ (Last access: October 2011).