Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel® Xeon Phi™ Coprocessor and NVIDIA GPU Accelerator
A Thesis
Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University
By
Bobo Shi, M.S.
Graduate Program in Computer Science and Engineering
The Ohio State University
2016
Master’s Examination Committee:
Dr. P. Sadayappan, Advisor
Dr. Louis-Noel Pouchet

© Copyright by
Bobo Shi
2016

Abstract
CCSD(T), part of the coupled cluster (CC) family of methods, is one of the most accurate methods applicable to reasonably large molecules in the field of computational chemistry. The availability of an efficient parallel CCSD(T) implementation will have a significant impact on the application of such high-accuracy methods. The Intel® Xeon Phi™ coprocessor and NVIDIA GPUs are among the most important coprocessors/accelerators, offering powerful parallel computing ability due to their massively parallel many-core architectures. In this work, the CCSD(T) code is implemented on the Intel® Xeon Phi™ coprocessor and on NVIDIA GPUs.
The CCSD(T) method performs tensor contractions. To obtain an efficient implementation, we allocate the result tensor only on the Intel Xeon Phi coprocessor or the GPU, and keep it there to receive a sequence of results from tensor contractions performed on the device. The input tensors are offloaded from the host to the coprocessor/accelerator for each tensor contraction. After all tensor contractions are finished, the final result is accumulated on the coprocessor/accelerator, avoiding a huge data transfer from the device back to the host.
The tensor contractions are performed using BLAS dgemm on the coprocessor/accelerator. The result is then post-processed using a 6-dimensional loop. For the Intel Xeon Phi implementation, OpenMP is used to bind threads to physical processing units on the coprocessor, and the OpenMP thread affinity is tuned to obtain the best performance. For the GPU, an algorithm is designed to map the 6-dimensional post-processing loop to CUDA threads, and gridDim and blockDim are tuned to reach the best performance. An overall speedup of 4x is obtained for the Intel Xeon Phi implementation and 9x ∼ 13x for the GPU implementation.
This is dedicated to my parents, my sister and my wife
Acknowledgments
Firstly, I would like to take this opportunity to express my sincere appreciation to my master's thesis adviser, Dr. P. (Saday) Sadayappan, for his continuous support, patience and immense knowledge throughout my master's studies and thesis work. I met Saday when I was in his class, Introduction to Parallel Computing. His teaching skills, patience with students and broad knowledge of the field impressed me. My major was Biophysics at that time, and the class sparked my interest in computer science, especially in the field of high performance computing. Most importantly, I started to learn more about computer science and finally decided to get a master's degree in it.
Saday is the one who opened another gate through which I could start a new adventure.
I would like to give my sincere gratitude to Dr. Sriram Krishnamoorthy, who provided direct advice about my thesis work. Sriram had many video chats with Saday and me.
I appreciate the insightful discussions with him. He always offered good ideas, which helped my thesis work.
I also want to thank the member of my thesis committee, Prof. Louis-Noel Pouchet, for participating in my master's exam and taking the time to read my thesis.
I would like to thank the lab members, Venmugil Elango, Weilei Bao, Changwan Hong,
Prashant Singh Rawat and John Eisenlohr for the help during my thesis work.
Vita

2010 ............... B.S. Physics, Fudan University
2015 ............... M.S. Biophysics, Ohio State University
2010–present ....... Ph.D. student in Biophysics Program; Graduate Teaching Associate, Graduate Research Associate, NSEC fellow, Ohio State University
2015–present ....... Master's student in Computer Science, Ohio State University
Fields of Study
Major Field: Computer Science and Engineering
Table of Contents
Page
Abstract ...... ii
Dedication ...... iv
Acknowledgments ...... v
Vita ...... vi
List of Tables ...... ix
List of Figures ...... x
1. Introduction ...... 1
1.1 CCSD(T) method ...... 2
1.1.1 Algebraic structure of the CCSD(T) approach ...... 2
1.1.2 Non-iterative CCSD(T) correction ...... 4
1.2 General structure of the algorithm ...... 5
1.3 Introduction to Intel® Xeon Phi™ ...... 9
1.3.1 Intel® Xeon Phi™ coprocessor architecture ...... 9
1.3.2 Offload mode ...... 12
1.4 GPU and CUDA ...... 13
1.4.1 CUDA C ...... 13
2. Implementation ...... 16
2.1 Implementation on Intel® Xeon Phi™ Coprocessor ...... 16
2.1.1 Dgemm and offload on Intel® Xeon Phi™ Coprocessor ...... 16
2.1.2 OpenMP thread optimization ...... 17
2.2 Implementation on CUDA ...... 18
2.2.1 Dgemm on CUDA ...... 18
2.2.2 Explicit implementation of LOOP1 and LOOP2 on CUDA ...... 19
3. Performance Analysis ...... 23
3.1 Performance for Intel® Xeon Phi™ Coprocessor ...... 23
3.1.1 Tilesize ...... 24
3.1.2 OMP threads affinity ...... 27
3.1.3 Comparison between MIC implementation and original CPU version ...... 29
3.2 Performance for CUDA implementation ...... 29
3.2.1 Tuning for gridDim and blockDim ...... 31
3.2.2 Comparison between CUDA implementation and the original CPU version ...... 32
3.3 Discussion ...... 34
4. Conclusion and Future work ...... 38
List of Tables
Table Page
1.1 The CUDA keywords for functions declaration...... 13
1.2 The CUDA keywords for variable declaration...... 14
3.1 Coprocessor architectural comparison of Intel® Xeon Phi™ Coprocessors 3120A/P and 5110P ...... 23
3.2 Doubles part dgemm calculation time and data transfer time for tilesizes of 40 and 50, for MIC_USE_2MB_BUFFERS = 16K, 32K and 64K ...... 26
3.3 Wall time (second) for 4 main parts of CCSD(T) for CPU and GPU for different tilesize...... 33
3.4 Wall time (second) for singles dgemm and doubles dgemm of CCSD(T) for multicores for different tilesize...... 35
3.5 Wall time (second) for LOOP1 and LOOP2 of CCSD(T) for multicores for different tilesize...... 35
3.6 GFlops for 4 parts for CPU, MIC and GPU...... 37
3.7 Cache miss of LOOP1 and LOOP2 for CPU and GPU ...... 37
List of Figures
Figure Page
1.1 Diagram to calculate C[i,j]. Matrices A and B are symmetric. Blue blocks show the values we actually use in the calculation. Red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we are able to store only half of the elements of A and B to save memory and, as a result, blue blocks are used instead ...... 6
1.2 Diagram to calculate C[j,i]. Matrices A and B are symmetric. Blue blocks show the values we actually use in the calculation. Red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we can store only half of the elements of A and B to save memory. As a result, blue blocks are used instead. After we obtain C[i,j] and C[j,i], the symmetrized result C*[i,j] can be obtained by C*[i,j] = (1/2)(C[i,j] + C[j,i]) ...... 7
1.3 High-level architecture of the Intel® Xeon Phi™ coprocessor with cores and ring interconnect [1] ...... 10
1.4 Architecture overview of an Intel MIC architecture core [2] ...... 11
1.5 Grid of Thread Blocks...... 15
2.1 Scheme of OpenMP thread affinity control for compact, scatter and balanced. compact: pack threads close to each other; scatter: round-robin threads across cores; balanced: keep OMP thread ids consecutive (MIC only) ...... 18
3.1 Total time for dgemm calculation and data transfer for singles and doubles part of a small system (2 ozone) at different tilesize...... 24
3.2 MIC offload dgemm data transfer speed vs. data size for an example dgemm code. The red dot indicates the data transfer speed for the NWChem MIC offload doubles dgemm, tilesize = 40 case ...... 26
3.3 OpenMP threads affinity experiment showing that performance depends on the number of OpenMP threads and the thread affinity. This experiment is done on an Intel® Xeon Phi™ Coprocessor 5110P; a system with 3 ozone molecules is calculated ...... 27
3.4 The performance between CPU and MIC with several threads affinity for four main parts of the CCSD(T): singles part dgemm, doubles part dgemm, LOOP1 and LOOP2...... 30
3.5 Overall performance comparison between CPU and MIC for different tile- sizes...... 31
3.6 Tuning of NB, NT for LOOP1 for tilesize = 30. NB is number of blocks. NT is number of threads per block. x axis is NB, y axis is wall time (second) for LOOP1...... 32
3.7 Tuning of NB, NT for LOOP2 for tilesize = 30. NB is number of blocks. NT is number of threads per block. x axis is NB, y axis is wall time (second) for LOOP2...... 33
3.8 Overall CCSD(T) wall time in seconds for the original CPU version and the MIC implementation for different tilesizes ...... 34
3.9 Overall CCSD(T) wall time in second for multi-CPU-cores for different tilesize...... 36
Chapter 1: Introduction
Computational chemistry has always required highly accurate methods to describe the instantaneous interactions between electrons, i.e., the correlation effects in molecules. From Hartree-Fock [3] and density functional theory [4] to coupled cluster [5] and configuration interaction [6] methods, each new method improves the accuracy of computational chemistry and bridges the gap between theory and experiment. Among the many methods that describe correlation effects, the coupled cluster (CC) method [5, 7–9] has proven to be, and is widely used as, a very accurate method for solving the electronic Schrödinger equation. The CCSD(T) [10] method, part of the coupled cluster (CC) family, is often called the "gold standard" of computational chemistry, since it is one of the most accurate methods applicable to reasonably large molecules.
Many modern coprocessors/accelerators are capable of exploiting data-level parallelism through Single-Instruction-Multiple-Data (SIMD) execution, which is a power-efficient way of boosting peak performance. The most important coprocessors/accelerators include the Intel® Xeon Phi™ and NVIDIA GPUs. The Intel® Xeon Phi™ is a recently released high-performance coprocessor which features 57/60/61 cores, each supporting 4 hardware threads. NVIDIA GPUs (Graphics Processing Units) have powerful parallel computing ability due to their massively parallel many-core architecture. The availability of an efficient parallel CCSD(T) implementation on the Intel Many Integrated Core (MIC) architecture and the NVIDIA CUDA architecture will have a significant impact on the application of high-accuracy methods.
In this work, we map the CCSD(T) code to the Intel Many Integrated Core (MIC) architecture and NVIDIA GPUs. NWChem [11], a computational chemistry package, is extended: its CCSD(T) code is revised to add Intel® Xeon Phi™ coprocessor / NVIDIA CUDA offload entry points.
1.1 CCSD(T) method
The CCSD(T) approach [8] and its analogs have been explored in Ref. [5]. Here, we highlight only the basic theoretical background necessary to understand its parallel implementation [12].
1.1.1 Algebraic structure of the CCSD(T) approach
Coupled-cluster theory is predicated on the assumption that there exists a judicious choice of a single Slater determinant $|\Phi\rangle$, referred to as the reference function, which is capable of providing a zeroth-order description of the exact ground-state wavefunction $|\Psi\rangle$.
In most cases, the reference function is chosen as the Hartree-Fock (HF) determinant, although other choices have been discussed in the literature. The coupled-cluster formulation is based on the exponential parameterization of the correlated wavefunction
$$|\Psi\rangle = e^{T}|\Phi\rangle \qquad (1.1)$$
where $T$ refers to the so-called cluster operator. Perturbative analysis, based on the linked cluster theorem, shows that only connected diagrams are included in the cluster operator. The most common approximation used in routine high-precision calculations is the CCSD method (CC with singles and doubles), where the cluster operator is defined by its singly ($T_1$) and doubly ($T_2$) excited many-body components. This leads to the following representation of the CCSD wavefunction

$$|\Psi_{\mathrm{CCSD}}\rangle = e^{T_1+T_2}|\Phi\rangle, \qquad (1.2)$$
where $T_1$ and $T_2$ are represented by singly- ($t_a^i$) and doubly-excited ($t_{ab}^{ij}$) cluster amplitudes and the corresponding strings of creation/annihilation operators ($X_p^+$/$X_p$):

$$T_1 = \sum_{i,a} t_a^i X_a^+ X_i, \qquad (1.3)$$
$$T_2 = \frac{1}{4} \sum_{i,j,a,b} t_{ab}^{ij} X_a^+ X_b^+ X_j X_i. \qquad (1.4)$$

As always, the $i, j, \ldots$ ($a, b, \ldots$) indices refer to occupied (unoccupied) spin-orbital indices
in the reference function $|\Phi\rangle$. The standard CCSD equations for the cluster amplitudes are obtained from the connected form of the Schrödinger equation, defined by projections onto singly ($|\Phi_i^a\rangle = X_a^+ X_i |\Phi\rangle$) and doubly ($|\Phi_{ij}^{ab}\rangle = X_a^+ X_b^+ X_j X_i |\Phi\rangle$) excited configurations:

$$\langle \Phi_i^a | e^{-T} H e^{T} | \Phi \rangle = 0 \quad \forall\, i, a, \qquad (1.5)$$
$$\langle \Phi_{ij}^{ab} | e^{-T} H e^{T} | \Phi \rangle = 0 \quad \forall\, i, j, a, b, \qquad (1.6)$$

where $H$ designates the electronic Hamiltonian operator.
Once the cluster amplitudes are determined from Equations (1.5)–(1.6), the energy is calculated from the expression

$$E = \langle \Phi | e^{-T} H e^{T} | \Phi \rangle. \qquad (1.8)$$
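A useful reasoning step, standard in coupled-cluster theory although not spelled out in the text above, is why the similarity-transformed Hamiltonian in Equation (1.8) is tractable at all: by the Baker–Campbell–Hausdorff expansion, and because the electronic Hamiltonian contains at most two-body operators, the commutator series terminates exactly after the fourth nested commutator:

```latex
\bar{H} \;\equiv\; e^{-T} H e^{T}
  \;=\; H + [H,T] + \frac{1}{2!}\,[[H,T],T]
        + \frac{1}{3!}\,[[[H,T],T],T]
        + \frac{1}{4!}\,[[[[H,T],T],T],T].
```

This finite expansion is what makes the projected equations (1.5)–(1.6) a closed polynomial system in the cluster amplitudes.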
Using diagrammatic techniques, one can easily determine the algebraic structure of Equations (1.5)–(1.6) and the corresponding numerical complexity of the equations, which is proportional to $n_o^2 n_u^4$ ($n_o$ and $n_u$ stand for the number of occupied and unoccupied spin-orbitals, respectively). Unfortunately, the accuracy obtained with the CCSD formalism is in many cases not sufficient to provide so-called chemical accuracy, which is typically defined as an error below 1 kcal/mol. To achieve chemical accuracy, the inclusion of triply excited effects is necessary.
1.1.2 Non-iterative CCSD(T) correction
The direct inclusion of the triply excited $T_3$ effects in the cluster operator results in high numerical scaling ($\simeq n_o^3 n_u^5$) and huge memory demands ($\simeq n_o^3 n_u^3$), which prohibit CCSDT (CC with singles, doubles, and triples) calculations even for relatively small molecular systems. In order to reduce the scaling of the CCSDT method without significant loss of accuracy, several methods have been introduced in the past in which the $T_3$ amplitudes are estimated perturbatively [13, 14]. The most popular method in this class of formalisms is the CCSD(T) method, in which the ground-state energy is represented as the sum of the CCSD energy ($E^{\mathrm{CCSD}}$) and a non-iterative correction, which combines elements of the fourth- and fifth-order standard many-body perturbation theory (MBPT) expansion containing triply excited intermediate states:

$$E^{\mathrm{CCSD(T)}} = E^{\mathrm{CCSD}} + \langle \Phi | (T_2^+ V_N) R_3^{(0)} V_N T_2 | \Phi \rangle + \langle \Phi | (T_1^+ V_N) R_3^{(0)} V_N T_2 | \Phi \rangle, \qquad (1.9)$$

where $V_N$ is the two-body part of the electronic Hamiltonian in normal product form. Currently, the CCSD(T) method is the most frequently employed CC approach, especially in studies of spectroscopic properties, geometry optimization, and chemical reactions.
Using the definition of the 3-body resolvent $R_3^{(0)}$, the CCSD(T) energy can be re-written as

$$E^{\mathrm{CCSD(T)}} = E^{\mathrm{CCSD}} + \sum_{\substack{i<j<k \\ a<b<c}} \frac{\langle \Phi | (T_1^+ + T_2^+)\, V_N | \Phi_{ijk}^{abc} \rangle\, \langle \Phi_{ijk}^{abc} | V_N T_2 | \Phi \rangle}{\epsilon_i + \epsilon_j + \epsilon_k - \epsilon_a - \epsilon_b - \epsilon_c}. \qquad (1.10)$$

The most expensive step (scaling as $n_o^3 n_u^4$) is associated with the presence of the $\langle \Phi_{ijk}^{abc} | V_N T_2 | \Phi \rangle$ term, which is defined by the expression:

$$\begin{aligned}
\langle \Phi_{ijk}^{abc} | V_N T_2 | \Phi \rangle ={}
& v_{ma}^{ij} t_{bc}^{mk} - v_{mb}^{ij} t_{ac}^{mk} + v_{mc}^{ij} t_{ab}^{mk}
- v_{ma}^{ik} t_{bc}^{mj} + v_{mb}^{ik} t_{ac}^{mj} - v_{mc}^{ik} t_{ab}^{mj} \\
& + v_{ma}^{jk} t_{bc}^{mi} - v_{mb}^{jk} t_{ac}^{mi} + v_{mc}^{jk} t_{ab}^{mi}
- v_{ab}^{ei} t_{ec}^{jk} + v_{ac}^{ei} t_{eb}^{jk} - v_{bc}^{ei} t_{ea}^{jk} \\
& + v_{ab}^{ej} t_{ec}^{ik} - v_{ac}^{ej} t_{eb}^{ik} + v_{bc}^{ej} t_{ea}^{ik}
- v_{ab}^{ek} t_{ec}^{ij} + v_{ac}^{ek} t_{eb}^{ij} - v_{bc}^{ek} t_{ea}^{ij},
\end{aligned} \qquad (1.11)$$

$(i < j < k,\ a < b < c)$, where $v_{rs}^{pq}$ is the tensor of two-electron integrals. Equation (1.11) can be separated into terms defined by contractions over occupied indices ($A_{ijk}^{abc}$: the first nine terms on the right-hand side of Equation (1.11)) and terms corresponding to contractions over unoccupied indices ($B_{ijk}^{abc}$: the remaining nine terms on the right-hand side of Equation (1.11)).

1.2 General structure of the algorithm

The original code calculates the non-iterative CCSD(T) correction using tensor contractions of the form

R[a,b,c,i,j,k] -= T[l,k,c,b] * V[l,a,i,j]

A 6-dimensional loop is used to calculate the result tensor R. In this work, we implement CCSD(T) on the Intel® Xeon Phi™ coprocessor and on the GPU based on an improved algorithm structure. Instead of executing a 6-dimensional loop, we use dgemm, since a BLAS dgemm is expected to be faster than the simple loops. We also need to permute indices to enable the dgemm calculations. Because of properties arising in quantum chemistry [15], the tensors T[l,k,c,b] and V[l,a,i,j] are symmetric. The product of two symmetric tensors is not necessarily symmetric; however, quantum chemistry [15] requires the result tensor R[a,b,c,i,j,k] to be symmetric, and R can be obtained by symmetrization. Therefore, we can calculate R in a different way because of this symmetry property.
We use a simple 2-dimensional example to illustrate the calculation:

C[i,j] = A[i,k] * B[k,j]

Suppose A and B are symmetric. Figure 1.1 shows how we calculate C[i,j].

Figure 1.1: Diagram to calculate C[i,j]. Matrices A and B are symmetric. Blue blocks show the values we actually use in the calculation. Red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we are able to store only half of the elements of A and B to save memory and, as a result, blue blocks are used instead.

Blue blocks show the values we actually use in the calculation. Red blocks and their complementary blue blocks together show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we can store only half of the elements of A and B to save memory; as a consequence, blue blocks are used instead.

Figure 1.2: Diagram to calculate C[j,i]. Matrices A and B are symmetric. Blue blocks show the values we actually use in the calculation. Red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we can store only half of the elements of A and B to save memory. As a result, blue blocks are used instead. After we obtain C[i,j] and C[j,i], the symmetrized result C*[i,j] can be obtained by C*[i,j] = (1/2)(C[i,j] + C[j,i]).

Although A and B are symmetric, their product C is not necessarily symmetric. The symmetric counterpart C[j,i] can be calculated in a similar way, as shown in Figure 1.2. After we obtain both C[i,j] and C[j,i], the symmetrized result can be obtained by C*[i,j] = (1/2)(C[i,j] + C[j,i]).

In our system, we calculate higher-dimensional matrix multiplications, and the analogous steps are used. The symmetrized result tensor R[a,b,c,i,j,k] is calculated by symmetrizing nine components because of the quantum chemistry properties [15].
In the algorithm, we use dgemm to calculate each symmetric component and store them in nine buffers, for both the singles part and the doubles part. After converting the tensor contractions to dgemm form, we are able to call BLAS dgemm procedures [16]. T and V each need local memory of (tilesize)^4 and R needs local memory of (tilesize)^6. In this work, to avoid excessive data transfer between the host and the MIC/GPU, we store R[a,b,c,i,j,k] in a buffer on the MIC/GPU at the very beginning of each task and keep it until the end of the task, where the calculation of the CCSD(T) energy contribution for that task is also performed on the MIC/GPU. For each dgemm calculation on the MIC/GPU, we only transfer T[l,k,c,b] and V[l,a,i,j] from the host to the MIC/GPU. The following pseudocode shows the algorithm structure used in this work to calculate E^{CCSD(T)}:

do t_p4b = noab+1,noab+nvab
 do t_p5b = t_p4b,noab+nvab
  do t_p6b = t_p5b,noab+nvab
   do t_h1b = 1,noab
    do t_h2b = t_h1b,noab
     do t_h3b = t_h2b,noab
      ! allocate memory space r_buf on MIC/GPU
      ! calculate singles part using BLAS dgemm and
      ! store 9 results to 9 buffers in r_buf on MIC/CUDA
      call ccsd_t_singles(...)
      ! LOOP1: a 6-dimensional loop to sum the 9 buffers into one buffer
      ! calculate doubles part using BLAS dgemm and
      ! store 9 results to 9 buffers in r_buf on MIC/CUDA
      call ccsd_t_doubles(...)
      ! LOOP2: a 6-dimensional loop to sum the 9 buffers into one buffer,
      ! then accumulate the contribution to E^{CCSD(T)}
      ! free r_buf on MIC/GPU
     enddo
    enddo
   enddo
  enddo
 enddo
enddo

The noab and nvab parameters refer to the total number of occupied and unoccupied tiles, respectively. Inside the 6-dimensional loop nest, a large memory space is allocated on the MIC or GPU and kept on the device for each iteration. The function ccsd_t_singles(...) is then called to calculate the singles part using dgemm, after which LOOP1 sums the nine buffers into one buffer. It is similar for the doubles part.
The doubles dgemm and LOOP2 are performed to accumulate the results into E^{CCSD(T)}.

1.3 Introduction to Intel® Xeon Phi™

1.3.1 Intel® Xeon Phi™ coprocessor architecture

The Intel Xeon Phi coprocessor is connected to an Intel Xeon processor, also known as the host, through a PCI Express (PCIe) bus. Since the Intel Xeon Phi coprocessor runs a Linux operating system, a virtualized TCP/IP stack can be implemented over the PCIe bus, allowing the user to access the coprocessor as a network node. Thus, any user can connect to the coprocessor through a secure shell and directly run individual jobs or submit batch jobs to it. The coprocessor also supports heterogeneous applications, wherein part of the application executes on the host while another part executes on the coprocessor.

Figure 1.3: High-level architecture of the Intel® Xeon Phi™ coprocessor with cores and ring interconnect [1]

Fig. 1.3 shows a high-level overview of the Intel Many Integrated Core architecture and the Intel Xeon Phi coprocessor. In this work, we use the 5110P model of the coprocessor, which offers 60 general-purpose cores with in-order execution and a fixed clock speed of 1.053GHz. The MIC card has 8GB of on-card GDDR5 memory with 16 memory channels. The Intel MIC architecture is based on the x86 ISA, extended with 64-bit addressing and new 512-bit wide SIMD vector instructions and registers. Each core supports hyper-threading with a four-way round-robin scheduling of hardware threads. Each core has a 32KB L1 data cache, a 32KB L1 instruction cache, and a 512KB L2 cache. The L2 caches of all cores are interconnected with each other and with the memory controllers via a bidirectional ring bus, effectively creating a shared last-level cache of up to 32MB. The peak floating-point performance is 1010.88 GFLOPS for double-precision (DP) and 2021.76 GFLOPS for single-precision (SP) computation.

Fig. 1.3 shows that the cores are connected together by the ring interconnect. The TD in Fig.
1.3 is a distributed duplicate tag directory for cross-snooping the L2 caches of all cores. The L2 caches are kept fully coherent with each other by the TDs, which are referenced after an L2 cache miss.

Figure 1.4: Architecture overview of an Intel MIC architecture core [2]

1.3.2 Offload mode

The Intel® Xeon Phi™ coprocessor provides native and offload modes for programming. Native mode runs applications directly on Intel® Xeon Phi™ coprocessors without offload from a host system. Offload code deals with two levels of memory blocking: one to fit the input data into the coprocessor, and another within the offload code to fit within the processor caches. Based on our design, we only calculate the CCSD(T) part on the coprocessor, so we use offload mode. The offload mode has to be explicitly coded: we add offload directives to the code and then compile it as host code. For example, to offload a single call, we use

!dir$ offload target(mic[:n])
call subroutine_name(args)

or

!dir$ offload target(mic[:n])
ret_val = function_name(args)

To run a code block in offload mode, we use the following:

!dir$ offload begin target(mic[:n])
...
!dir$ end offload

1.4 GPU and CUDA

The Graphics Processing Unit (GPU), introduced by NVIDIA in 1999, allows for faster graphics processing speeds. It has powerful parallel computing ability due to its massively parallel many-core architecture. CUDA [17] is a proprietary framework developed by NVIDIA for developing general-purpose applications on Graphics Processing Units. In this work, we use an NVIDIA Tesla K40 GPU for the tests and benchmarks. The NVIDIA Tesla K40 GPU has a memory size of 12GB and 2880 CUDA cores. Its peak single- and double-precision floating-point performance is 4.29 TFLOPS and 1.43 TFLOPS, respectively.

1.4.1 CUDA C

In 2006, NVIDIA introduced CUDA, a general-purpose parallel computing platform and programming model that allows programmers to code in high-level languages without considering the graphics interfaces of GPUs [18].
CUDA extends the traditional C/C++ programming language, providing some new keywords and API functions to the programmer. NVCC (the NVIDIA C Compiler) is used to compile CUDA source programs. The NVCC compiler uses keywords added by the programmer to distinguish host code from device code. Table 1.1 shows the keywords for function declaration and Table 1.2 shows the keywords for variable declaration.

Table 1.1: The CUDA keywords for function declaration.

Function declaration           | Called from | Executed on
__device__ int func()          | device      | device
__global__ void func()         | host        | device
__host__ double func()         | host        | host

The keyword "__host__" indicates that a function is a host function; if no keyword is used, "__host__" is the default. Host functions run on the CPUs, and they can only be called from other host functions. A function declared with "__global__" is a kernel function, called from the host (CPUs) and running on the device (GPUs); the return type of such a function must be void. A function declared with "__device__" is a device function, which can only be called from the device (kernel functions) and runs only on the device.

Table 1.2: The CUDA keywords for variable declaration.

Variable declaration                   | Memory          | Scope  | Lifetime
int local_var                          | register        | thread | thread
__device__ int global_var              | global memory   | grid   | application
__device__ __constant__ int const_var  | constant memory | grid   | application
__device__ __shared__ int shared_var   | shared memory   | block  | block

The CUDA keywords for variable declaration are shown in Table 1.2. The keyword __shared__ declares a variable in shared memory that can be shared by all the threads in the same thread block. Shared memory is on-chip, and therefore much faster than local and global memory. Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory, and threads can access data in shared memory that was loaded from global memory by other threads within the same thread block.
The keyword "__constant__" is used to declare a constant variable, which can only be read by the CUDA threads running on the GPUs. A kernel function is called with the syntax:

kernelFunc<<< gridDim, blockDim >>>(args...)

gridDim and blockDim are the execution configuration parameters. Each can be one-, two- or three-dimensional, as illustrated by Figure 1.5.

Figure 1.5: Grid of Thread Blocks.

Chapter 2: Implementation

2.1 Implementation on Intel® Xeon Phi™ Coprocessor

2.1.1 Dgemm and offload on Intel® Xeon Phi™ Coprocessor

As discussed in Section 1.2, in order to avoid excessive data transfer between the host and the MIC, a buffer residing only on the MIC is used to store the result tensor R at the very beginning of each task. A code example is shown below.

if (is_phi_offload) then
  allocate(a_c(1))
  size_max_thres = 56800800
!dir$ offload_transfer target(mic) in(a_c:          &
      length(10*size_max_thres)                     &
      alloc_if(.TRUE.) free_if(.FALSE.))
endif ! is_phi_offload

a_c in the code above holds R. We allocate a memory size of 10*size_max_thres on the MIC and keep it there for future use. To calculate $\langle \Phi_{ijk}^{abc}|V_N T_1|\Phi\rangle$ and $\langle \Phi_{ijk}^{abc}|V_N T_2|\Phi\rangle$, we need to transfer V and T from the host to the MIC; a dgemm call on the MIC is then used. A code example is shown below.

aoff=>dbl_mb(k_a_sort:k_a_sort+dima_sort*dim_common-1)
boff=>dbl_mb(k_b_sort:k_b_sort+dimb_sort*dim_common-1)
!dir$ offload begin target(mic:0) nocopy(a_c: length(0) &
      alloc_if(.FALSE.) free_if(.FALSE.)) in(dima_sort)  &
      in(dimb_sort), in(dim_common),                     &
      in(dbeta), in(ia6), in(dimc) in(aoff:length(       &
      dima_sort*dim_common)                              &
      alloc_if(.TRUE.) free_if(.TRUE.)) in(boff:         &
      length(dimb_sort*dim_common)                       &
      alloc_if(.TRUE.) free_if(.TRUE.))
CALL DGEMM('T','N',dima_sort,dimb_sort,dim_common,1.0d0,aoff, &
     dim_common,boff,dim_common,dbeta,                   &
     a_c((ia6-1)*dimc+1),                                &
     dima_sort)
!dir$ end offload
nullify(aoff, boff)

aoff and boff are used to transfer V and T from the host to the MIC.
dima_sort*dim_common and dimb_sort*dim_common are the sizes of the data to be transferred. After the dgemm call finishes, the memory for aoff and boff on the MIC is freed.

2.1.2 OpenMP thread optimization

The Intel OpenMP runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY and OMP_NUM_THREADS environment variables with compiler versions 13.1.0 and newer. Three types of affinity, compact, scatter and balanced, are used in this work. They have the following properties. compact: pack threads close to each other; scatter: round-robin threads across cores; balanced: keep OMP thread ids consecutive (MIC only). Fig. 2.1 illustrates how the three affinity types work. Depending on the application, topology and operating system, the OpenMP thread affinity can have a significant effect on application performance.

Figure 2.1: Scheme of OpenMP thread affinity control for compact, scatter and balanced. compact: pack threads close to each other; scatter: round-robin threads to cores; balanced: keep OMP thread ids consecutive (MIC only).

In this work, we add an OpenMP implementation to LOOP1 and LOOP2, described in Section 1.2, to efficiently calculate $\sum_{i<j<k,\,a<b<c} \langle \Phi|T_1^+ V_N|\Phi_{ijk}^{abc}\rangle$ and $\sum_{i<j<k,\,a<b<c} \langle \Phi|(T_2^+ V_N)|\Phi_{ijk}^{abc}\rangle$ and accumulate them into the total contribution $\delta E^{\mathrm{CCSD(T)}}$ of CCSD(T). Both LOOP1 and LOOP2 contain a 6-dimensional loop nest; we tested and confirmed that an OpenMP parallel implementation with collapse(3) gives the best performance.

2.2 Implementation on CUDA

2.2.1 Dgemm on CUDA

The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Units (GPUs). In this work, we use the function cublasDgemm to perform dgemm on the GPU.

2.2.2 Explicit implementation of LOOP1 and LOOP2 on CUDA

For LOOP1 and LOOP2, we need to map 6-dimensional loops to thread blocks.
In CUDA, all thread blocks and threads share the same values of the kernel function arguments; the differentiation between the work performed by different threads and thread blocks must be derived from their location within the thread block and the thread block grid. The dimensions of threads and thread blocks are denoted blockDim and gridDim; each can be one-, two- or three-dimensional. In LOOP1 and LOOP2, each dimension corresponds to an index of the tensor R[a,b,c,i,j,k]. In NWChem, the 6-dimensional tensor R[a,b,c,i,j,k] is stored in one huge one-dimensional array. The elements of R are assigned to CUDA threads by recovering the index of R[a,b,c,i,j,k] as in the following code snippet.

int idt = id + i*tnt;
int re;
p6 = idt % range_p6;  re = idt / range_p6;
p5 = re % range_p5;   re = re / range_p5;
p4 = re % range_p4;   re = re / range_p4;
h2 = re % range_h2;   re = re / range_h2;
h1 = re % range_h1;   re = re / range_h1;
h3 = re % range_h3;

idt is the flattened index of R. The index (a,b,c,i,j,k) is expressed as (h1, h2, h3, p4, p5, p6) in the code, and range_h1, range_h2, range_h3, range_p4, range_p5, range_p6 are the sizes of the respective dimensions. After assigning indices to each CUDA thread, the calculation summing up the nine buffers can be done by the CUDA threads.
To do that, we tile: we copy a memory block coalescingly from the buffer to CUDA shared memory, and then write it back coalescingly to the variable S. The following pseudocode illustrates how to obtain coalesced memory access to both buf1 and S.

    // blockDim.x = range_p4 and blockDim.y = range_p6
    // range_p4 is the size in dimension p4.
    __shared__ double sbuf[range_p4*range_p6];
    // make tx = p4 and ty = p6
    ty = bid % range_p6;
    int re = bid / range_p6;
    int p5 = re % range_p5;  re = re / range_p5;
    tx = re % range_p4;      re = re / range_p4;
    int h2 = re % range_h2;  re = re / range_h2;
    int h1 = re % range_h1;  re = re / range_h1;
    int h3 = re % range_h3;
    // read from the buffer into shared memory
    sbuf[ty * range_p4 + tx] = buf1[tx, h1, p5, ty, h2, h3];
    __syncthreads();
    // make threadIdx.x = p6 and threadIdx.y = p4
    tx = threadIdx.x % range_p6;
    ty = threadIdx.x / range_p6;
    // write to the variable S
    S[tx,p5,ty,h2,h1,h3] = sbuf[tx*range_p4 + ty];

Since buf1, buf2 and buf3 have the same fastest iterating index, one tiling pass can serve all three of them at the same time. Similarly, buf4, buf5 and buf6 can be handled in one pass. buf7, buf8 and buf9 have the same fastest iterating index as the variable S, so we get coalesced memory access directly, without tiling.

For LOOP2, we found that directly summing up the 9 buffers in one pass is faster than tiling with multiple passes. LOOP2 has the following structure.

    D[p6,p5,p4,h2,h1,h3] = - buf1[p5,h1,p4,p6,h2,h3]
                           + buf2[p5,h2,p4,p6,h1,h3]
                           - buf3[p5,h3,p4,p6,h1,h2]
                           - buf4[p6,h1,p5,p4,h2,h3]
                           + buf5[p6,h2,p5,p4,h1,h3]
                           - buf6[p6,h3,p5,p4,h1,h2]
                           + buf7[p6,h1,p4,p5,h2,h3]
                           - buf8[p6,h2,p4,p5,h1,h3]
                           + buf9[p6,h3,p4,p5,h1,h2]

buf1, buf2 and buf3 need tiling to achieve coalesced memory access. However, buf4 - buf9 have the same fastest iterating index as D, so no tiling is needed for those cases.
If we tile buf1, buf2 and buf3, we need at least 2 passes to sum up the buffers: one pass for buf1, buf2 and buf3, and another pass for buf4 - buf9. We also need extra space to store D, and D must later be read back to calculate the total energy. If we do not tile, the only thing we lose is the tiling for 3 buffers; however, we can do the summation in one pass and directly use the result D to accumulate into the final energy. In other words, we do not need to write and read D in the extra space, which saves both time and space. The experiments confirm that no tiling gives the better result.

Chapter 3: Performance Analysis

3.1 Performance for the Intel Xeon Phi Coprocessor

In this section, we analyze the performance of the Intel Xeon Phi Coprocessor implementation. The numbers reported here are based on two Intel Xeon Phi Coprocessors: 3120A/P and 5110P. The detailed specifications of the two coprocessors are given in Table 3.1.

Table 3.1: Architectural comparison of the Intel Xeon Phi Coprocessors 3120A/P and 5110P.

    Product Name               Coprocessor 3120A/P   Coprocessor 5110P
    Cache                      28.5MB L2 Cache       30MB L2 Cache
    # of Cores                 57                    60
    Processor Base Frequency   1.1GHz                1.052GHz
    Memory Size                6GB                   8GB
    Memory Bandwidth           240GB/s               320GB/s

The evaluation of the Intel Xeon Phi Coprocessor 3120A/P was performed on an Intel Xeon CPU E5-2650 v2 (2.60GHz) host containing 8 cores. Since this part focuses only on the performance of the Intel Xeon Phi Coprocessor, we use just one CPU core to run the experiment and communicate with the coprocessor.

3.1.1 Tilesize

The tiling scheme of NWChem corresponds to a partitioning of the spin-orbital domain into smaller subsets containing spin-orbitals of the same spin and spatial symmetries (the so-called tiles).
This partitioning of the spin-orbital domain entails the blocking of all tensors corresponding to one- and two-electron integrals, cluster amplitudes, and all recursive intermediates, into smaller blocks whose size is defined by the size of the tile (or tilesize for short). The tilesize also defines the local memory requirements of CCSD(T).

Figure 3.1: Total time for dgemm calculation and data transfer for the singles and doubles parts of a small system (2 ozone) at different tilesizes.

The performance of the Intel Xeon Phi coprocessor implementation depends on the tilesize. For each dgemm calculation, we need to offload the 4-dimensional data T[l,k,c,b] and V[l,a,i,j] from the host to the MIC. The smaller the tilesize, the smaller the data transferred per dgemm and the more transfers are needed for all the calculations. The total number of operations does not depend on the tilesize. However, the standard Intel MKL DGEMM matrix multiplication performance depends on the matrix size: generally speaking, the larger the matrices in the dgemm calculation, the higher the performance we can obtain. Fig. 3.1 gives the measurements of the dgemm calculation time and the data transfer time for the singles part and the doubles part of a small system (2 ozone), for different tilesizes. The experiments are performed on the Intel Xeon Phi coprocessor 3120A/P. Both the dgemm calculation time and the data transfer time decrease when the tilesize increases.

At this point, the data transfer time seems too large: the estimated bandwidth is only 0.41GB/s, while the theoretical CPU-MIC transfer bandwidth is 6.7GB/s. To figure out why the bandwidth is so low, we use a simple dgemm example code to profile data transfer bandwidth versus matrix size. Fig. 3.2 shows the data transfer speed vs. the data size for a simple MIC offload dgemm code, run on the Intel Xeon Phi Coprocessor 3120A/P.
The red dot indicates the data transfer for the NWChem MIC offload doubles dgemm at tilesize = 40, which is close to the performance of the simple offload dgemm experiment. Therefore, the low transfer speed is due to the small size of the transferred data. With a larger system and a larger tilesize, we will get better host-MIC data transfer speed.

Because of the relatively large amount of data offloaded in the two 4-dimensional arrays, good performance is obtained if we set the environment variable MIC_USE_2MB_BUFFERS. The offload infrastructure then uses 2MB pages to store data that is transferred to the coprocessor. A reduction of the time spent on data transfer is expected, due to far fewer page faults on the coprocessor and fewer misses in the Translation Lookaside Buffer.

Figure 3.2: MIC offload dgemm data transfer speed vs. data size for an example dgemm code. The red dot indicates the data transfer speed for the NWChem MIC offload doubles dgemm, tilesize = 40 case.

Table 3.2 shows the doubles part dgemm calculation time and data transfer time for tilesizes of 40 and 50 with MIC_USE_2MB_BUFFERS = 16K, 32K and 64K. The measurements in Table 3.2 were performed on the Intel Xeon Phi 5110P and show better results than the Intel Xeon Phi 3120A/P.

Table 3.2: Doubles part dgemm calculation time and data transfer time for tilesizes of 40 and 50 with MIC_USE_2MB_BUFFERS = 16K, 32K and 64K.

    MIC_USE_2MB_BUFFERS                              16K       32K       64K
    tilesize=40, dgemm calculation time (second)     174.349   161.435   161.186
    tilesize=40, data transfer time (second)         72.525    74.030    74.119
    tilesize=50, dgemm calculation time (second)     127.54    128.955   128.466
    tilesize=50, data transfer time (second)         30.1334   30.809    30.1582

As we can see, when a tilesize of 50 is chosen, we get a small data transfer time (about 30 seconds).

3.1.2 OMP threads affinity

In this part, we examine how the number of threads and the thread affinity affect the performance. The Intel OpenMP runtime library has the ability to bind OpenMP threads to physical processing units.
Two decisions must be made for OpenMP threading and affinity: first, the number of threads to utilize, and second, how to bind threads to specific processor cores. The Intel Xeon Phi Coprocessor supports 4 thread contexts per core; the Intel Xeon Phi Coprocessors 3120A/P and 5110P have 57 and 60 cores, respectively. The best advice for an offload program is to try different numbers of threads from N − 1 to 4 × (N − 1), where N is the number of physical cores on the processor. N − 1 is used instead of N because of OS overhead: it is inefficient to schedule worker threads on cores where OS threads are contending for cycles. The next decision is the thread affinity, which controls how threads are placed on the cores of the coprocessor. As discussed in Section 2.1.2, there are three types of OpenMP thread affinity. compact: pack threads close to each other; scatter: round-robin threads across cores; balanced: keep OMP thread ids consecutive (MIC only).

Figure 3.3: OpenMP threads affinity experiment showing how performance depends on the number of OpenMP threads and the threads affinity. This experiment is done on the Intel Xeon Phi Coprocessor 5110P. A system with 3 ozone is calculated.

Fig. 3.3 shows how the number of OpenMP threads and the way threads are placed affect the total performance of CCSD(T). This experiment calculates a larger system (3 ozone) and is performed on the Intel Xeon Phi Coprocessor 5110P. Three sets of data are collected and error bars are shown on the graph; the errors are small, so the graph reflects the true performance of each setting. The total time for CCSD(T) decreases as we increase the number of OpenMP threads. With the maximum number of OpenMP threads, the different affinities give similar performance. However, when the number of threads is less than the maximum, the thread affinity affects the performance more.
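Taken together, an offload environment for the best-performing settings discussed in this chapter might look as follows. This is a sketch, not the exact configuration used: 236 = 4 × (60 − 1) assumes the 5110P with one core reserved for the OS, and MIC_ENV_PREFIX is used to forward the MIC_-prefixed variables to the coprocessor environment.

```shell
# Forward MIC_-prefixed variables to the coprocessor environment.
export MIC_ENV_PREFIX=MIC
# 4 hardware threads per core on 59 of the 60 cores: 4 x (60 - 1) = 236.
export MIC_OMP_NUM_THREADS=236
# Affinity: one of compact, scatter, balanced (balanced is MIC-only).
export MIC_KMP_AFFINITY=balanced
# Back offload buffers of 64K or larger with 2MB pages (Section 3.1.1).
export MIC_USE_2MB_BUFFERS=64K
```

A sweep over thread counts and affinity types then only requires varying MIC_OMP_NUM_THREADS and MIC_KMP_AFFINITY between runs.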
LOOP1 and LOOP2 contain a 6-dimensional loop; we implemented the OpenMP parallelism scheme as follows.

    !$omp parallel do collapse(3)
    do h3 = 0, range_h3-1
      do h1 = 0, range_h1-1
        do h2 = 0, range_h2-1
          do p4 = 0, range_p4-1
            do p5 = 0, range_p5-1
              do p6 = 0, range_p6-1
                ......

Different collapse depths were tested, and we confirm that collapse(3) provides the best performance.

3.1.3 Comparison between the MIC implementation and the original CPU version

In this section, we compare the performance of the Intel Xeon Phi implementation with the CPU version. The CPU used in the benchmark is an Intel Xeon E5-2670 v2. We evaluated the sequential implementation on the CPU (one CPU core) as it appears in NWChem, and the Intel Xeon Phi offload mode for the CCSD(T) part of NWChem on one MIC card (Intel Xeon Phi Coprocessor 5110P).

Figure 3.4 shows the performance comparison between CPU and MIC with several thread affinities for the four main parts of the CCSD(T) calculation: singles part dgemm, doubles part dgemm, LOOP1 and LOOP2. The CPU performance has no clear relation with the tilesize, while the Intel Xeon Phi Coprocessor performs better at larger tilesizes. A large tilesize causes more data to be transferred per dgemm offload, which leads to better data transfer bandwidth. We obtained an 8x speedup for the dgemm calculation and about a 2x speedup for the loops. Figure 3.5 illustrates the overall performance comparison between CPU and MIC for different tilesizes. The CPU performance has no clear relation with the tilesize, but the MIC gives better performance for larger tilesizes. For tilesize = 30 and 40, the MIC provides a speedup of about 4x.

3.2 Performance for the CUDA implementation

In this section, we evaluate the performance of the CUDA implementation of CCSD(T). The GPU used in this evaluation is an NVIDIA Tesla K40, which has a memory size of 12GB and 2880 CUDA cores. For the dgemm part, we use the CUDA BLAS library.
cuBLAS automatically decides the kernel configuration for dgemm, so we do not need to tune the dgemm calls.

Figure 3.4: The performance of CPU and MIC with several thread affinities for the four main parts of CCSD(T): singles part dgemm, doubles part dgemm, LOOP1 and LOOP2.

The singles part dgemm and doubles part dgemm generally take 13.95s and 210.20s, respectively. The numbers of operations for the singles dgemm and doubles dgemm are proportional to T^6 and (no+nv)T^6, where the no and nv parameters refer to the total numbers of occupied and unoccupied tiles, respectively.

Figure 3.5: Overall performance comparison between CPU and MIC for different tilesizes.

3.2.1 Tuning gridDim and blockDim

We need to tune gridDim and blockDim for LOOP1 and LOOP2. For LOOP1 we have the parameters (NB, NT) to tune, where NB stands for the number of blocks (gridDim) and NT stands for the number of threads per block (blockDim). Figure 3.6 shows the tuning of NB and NT for LOOP1 at tilesize = 30; the x axis is the number of blocks and the y axis is the wall time in seconds for LOOP1. NB = 2048 and NT = 256 gives the best performance. Different tilesizes may give different results, but we confirmed that they do not generate significant differences.

Figure 3.6: Tuning of NB, NT for LOOP1 at tilesize = 30. NB is the number of blocks, NT is the number of threads per block. The x axis is NB, the y axis is the wall time (seconds) for LOOP1.

Analogous tuning is done for LOOP2. Figure 3.7 shows the wall time for LOOP2 with different NB and NT at tilesize = 30. When NT is greater than or equal to 256, the total work done by the threads in one block starts to saturate; when NT is smaller than 64, performance is not good either. NB = 4096 and NT = 128 give the best performance.

3.2.2 Comparison between the CUDA implementation and the original CPU version

In this part, we compare the results to the original version (CPU) for the four main parts of the CCSD(T) algorithm: dgemm_s (dgemm for the singles part), dgemm_d (dgemm for the doubles part), LOOP1 and LOOP2.
The CPU used in the experiment is an Intel Xeon E5-2670 v2. Table 3.3 shows the wall time in seconds for the four main parts of the CCSD(T) algorithm on CPU and GPU for different tilesizes. Generally speaking, the GPU gives about 16x, 20x, 9x and 13x speedups for dgemm_s, dgemm_d, LOOP1 and LOOP2, respectively. The dgemm parts obtain much higher speedups than the loops.

Figure 3.7: Tuning of NB, NT for LOOP2 at tilesize = 30. NB is the number of blocks, NT is the number of threads per block. The x axis is NB, the y axis is the wall time (seconds) for LOOP2.

Table 3.3: Wall time (seconds) for the 4 main parts of CCSD(T) on CPU and GPU for different tilesizes.

                dgemm_s          dgemm_d          LOOP1            LOOP2
    tilesize    CPU     GPU      CPU     GPU      CPU     GPU      CPU     GPU
    20          303.68  13.94    4236.3  209.92   1405.4  148.19   1378.0  90.59
    25          375.37  17.18    5118.2  214.87   1645.5  175.85   1993.7  108.5
    30          325.05  14.61    4272.6  163.28   1404.9  145.76   1413.3  94.17
    35          287.01  17.20    4757.5  160.35   1658.0  162.73   1589.2  113.61
    40          270.70  17.31    4741.2  155.24   1703.9  164.12   1569.8  117.76

Figure 3.8 shows the overall CCSD(T) wall time in seconds for the original CPU version and the GPU implementation for different tilesizes. The GPU provides 9.3x, 11x, 12x, 13.5x and 14.1x speedups for tilesizes of 20, 25, 30, 35 and 40, respectively. The GPU performance is better when the tilesize is larger, due to the better performance of the CUDA BLAS dgemm with larger tilesizes. The performance of the original code has no clear relation with the tilesize.

Figure 3.8: Overall CCSD(T) wall time in seconds for the original CPU version and the GPU implementation for different tilesizes.

3.3 Discussion

In this section, we discuss several performance-related analyses. It is useful to know the performance for multiple CPU cores, so that we know the effective MIC/GPU speedup relative to a multicore baseline. Table 3.4 and Table 3.5 show the wall time in seconds for the dgemm parts and the LOOP parts for different numbers of cores and different tilesizes. The experiments are done with 1, 2, 4 and 8 cores.
We can see that each part gets a speedup with more cores. Generally, 8 cores give a 5.7x speedup for the singles dgemm and a 7x speedup for the doubles dgemm, LOOP1 and LOOP2.

Table 3.4: Wall time (seconds) for the singles dgemm and doubles dgemm of CCSD(T) on multiple cores for different tilesizes.

                dgemm_s                           dgemm_d
    tilesize    1core   2core   4core   8core     1core    2core    4core    8core
    20          303.6   153.1   79.38   42.95     4236.3   2128.7   1100.3   591.5
    25          654.2   190.6   99.39   53.24     5484.5   2575.5   1332.6   713.6
    30          325.0   163.2   84.87   45.82     4757.4   2150.7   1108.4   595.4
    35          287.0   145.2   76.26   45.32     4757.4   2382.4   1235.7   668.2
    40          270.6   135.6   71.36   46.96     4741.1   2385.1   1234.1   677.1

Table 3.5: Wall time (seconds) for LOOP1 and LOOP2 of CCSD(T) on multiple cores for different tilesizes.

                LOOP1                             LOOP2
    tilesize    1core   2core   4core   8core     1core    2core    4core    8core
    20          1405.3  708.3   363.2   195.8     1378.0   694.8    358.3    194.0
    25          2997.5  829.0   428.5   230.4     3067.8   851.2    444.4    239.9
    30          1404.8  709.2   363.9   196.8     1413.3   713.8    369.0    201.4
    35          1658.0  830.9   429.3   255.2     1589.2   795.5    412.8    247.2
    40          1703.9  852.9   439.5   288.5     1569.7   784.9    406.6    264.7

Figure 3.9 shows the overall wall time in seconds for multiple cores. Except for tilesizes of 25 and 40, 8 cores give only about a 3x overall speedup, which is much lower than the speedup for the individual parts. This is due to the MPI communications.

Figure 3.9: Overall CCSD(T) wall time in seconds for multiple CPU cores for different tilesizes.

GFLOPS numbers are provided in Table 3.6. We do not have MIC data for tilesize 40, since the MIC memory size did not satisfy the memory requirement for tilesize 40. We can see that for dgemm_s, the CPU, MIC and GPU all obtain better GFLOPS at larger tilesizes. This is because the data size for dgemm_s is usually small, and the CPU, MIC and GPU all need larger data sizes to achieve better performance. However, the CPU GFLOPS for dgemm_d is independent of the tilesize, while the MIC/GPU perform better with larger tilesizes. Therefore, if we ran a larger system, we would expect to get better GFLOPS for dgemm_d.
The data transfer time is included in the wall time of each part, and it also affects the GFLOPS (e.g. a larger data transfer time leads to lower GFLOPS). LOOP2 has better GFLOPS than LOOP1. This is because LOOP1 only merges nine buffers into one buffer, whereas LOOP2 does extra work to calculate the CCSD(T) energy from the buffer result; the energy calculation reuses a lot of data and hence improves the GFLOPS.

Table 3.6: GFLOPS for the 4 parts on CPU, MIC and GPU.

                dgemm_s              dgemm_d              LOOP1                 LOOP2
    tilesize    CPU    MIC   GPU     CPU   MIC    GPU     CPU     MIC    GPU    CPU    MIC    GPU
    20          0.65   2.0   14.2    2.8   11.2   56.8    0.47    1.27   2.23   1.26   1.73   13.7
    25          0.62   2.9   13.7    2.8   16.0   66.1    0.482   1.22   2.26   1.23   2.01   13.7
    30          0.61   3.5   13.6    2.7   20.1   72.0    0.472   1.12   2.28   1.23   2.29   13.2
    35          0.76   4.9   12.8    2.7   29.8   82.2    0.45    1.02   2.29   1.23   2.62   12.3
    40          0.81   -     12.7    2.8   -      84.9    0.439   -      2.28   1.25   -      11.9

Table 3.7 lists the cache misses of LOOP1 and LOOP2 for CPU and GPU for different tilesizes. The CPU and GPU cache misses are obtained from Intel VTune and NVIDIA nvprof, respectively. Since we use a Tesla K40, global loads do not go through the L1 cache by default; only the L2 cache is accessed for global loads. For LOOP1 we only have global loads and no data reuse, so we only list the L2 miss rate. For LOOP2 we not only have global loads, but also reuse the data to calculate the CCSD(T) energy; therefore there are L1 cache accesses for LOOP2.

Table 3.7: Cache misses of LOOP1 and LOOP2 for CPU and GPU.

                LOOP1                           LOOP2
                CPU                GPU          CPU                  GPU
    tilesize    L1 miss  L2 miss   L2 miss      L1 miss   L2 miss    L1 miss   L2 miss
    20          0.4301   0.2085    0.2697       0.0944    0.5483     0.0448    0.2803
    25          0.3135   0.2176    0.2904       0.1015    0.7066     0.0926    0.3513
    30          0.3106   0.2139    0.2928       0.0869    0.7075     0.0942    0.3502
    35          0.3106   0.2139    0.3469       0.0971    0.7565     0.1749    0.3275
    40          0.3436   0.4001    0.3477       0.1003    0.7578     0.1770    0.3484

Chapter 4: Conclusion and Future Work

The CCSD(T) method provides high accuracy in describing the instantaneous interactions between electrons in molecules.
The availability of an efficient parallel CCSD(T) implementation on the Intel Many Integrated Core (MIC) architecture and NVIDIA GPUs will have a significant impact on the application of high-accuracy methods. In this work, the CCSD(T) code of NWChem is mapped to the Intel Xeon Phi Coprocessor and the NVIDIA CUDA accelerator. The CCSD(T) method performs tensor contractions, which are carried out using BLAS dgemm on the coprocessor/accelerator. OpenMP is used in the post-processing loops to bind threads to the physical processing units of the Intel Xeon Phi coprocessor. An algorithm is designed to map the post-processing loops to the GPU threads. The experiments show a 4x speedup for the Intel Xeon Phi Coprocessor implementation and a 9x ∼ 14x speedup for the NVIDIA GPU accelerator implementation, compared to one CPU core. Performance characteristics, such as cache misses and GFLOPS, are analyzed for each part of the algorithm.

The CCSD(T) algorithm has a singles part and a doubles part. Each part contains a dgemm calculation and a post-processing loop (LOOP1 and LOOP2). In the post-processing loop, nine buffers resulting from the dgemm calculation are summed into one buffer. The singles part dgemm takes much less time than the doubles part dgemm; however, summing the nine buffers into one buffer takes similar time for the singles part (LOOP1) and the doubles part (LOOP2), and those times are relatively large. Considering the relatively large amount of time spent on the singles part, future work should try using plain loops instead of dgemm for the singles part. The singles part needs far fewer operations than the doubles part, and we expect simple loops to perform better than dgemm plus a post-processing loop for the singles part.