<<

Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Bobo Shi, M.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2016

Master’s Examination Committee:

Dr. P. Sadayappan, Advisor

Dr. Louis-Noel Pouchet

© Copyright by

Bobo Shi

2016

Abstract

CCSD(T), part of the coupled cluster (CC) family of methods, is one of the most accurate methods applicable to reasonably large molecules in computational chemistry. The availability of an efficient parallel CCSD(T) implementation will have a significant impact on the application of high-accuracy methods. The Intel Xeon Phi coprocessor and NVIDIA GPUs are among the most important coprocessors/accelerators, offering powerful compute capability due to their massively parallel many-core architectures. In this work, the CCSD(T) code is implemented on the Intel Xeon Phi coprocessor and on NVIDIA GPUs.

The CCSD(T) method performs tensor contractions. To obtain an efficient implementation, we allocate the result tensor only on the coprocessor/accelerator and keep it there to receive the sequence of partial results from the tensor contractions performed on the device. The input tensors are offloaded from the host to the coprocessor/accelerator for each tensor contraction. After all the tensor contractions are finished, the final result is accumulated on the coprocessor/accelerator to avoid a large data transfer from the device back to the host.

The tensor contractions are performed using BLAS dgemm on the coprocessor/accelerator.

Then the result is post-processed using a 6-dimensional loop. For the Intel Xeon Phi implementation, OpenMP is used to bind threads to physical processing units on the coprocessor, and the OpenMP thread affinity is tuned to obtain the best performance. For the GPU, an algorithm is designed to map the 6-dimensional post-processing loop to CUDA threads, and gridDim and blockDim are tuned for the best performance. Overall speedups of 4x and 9x-13x are obtained for the Intel Xeon Phi and GPU implementations, respectively.

This is dedicated to my parents, my sister and my wife

Acknowledgments

Firstly, I would like to take this opportunity to express my sincere appreciation to my master's thesis adviser, Dr. P. (Saday) Sadayappan, for his continuous support, patience, and immense knowledge throughout my master's studies and thesis work. I met Saday when I was in his class, Introduction to Parallel Computing. His teaching skills, patience with students, and broad knowledge of the field impressed me. My major was Biophysics at that time, and the class motivated my interest in computer science, especially in the high performance computing field. Most importantly, I started to learn more about computer science and finally decided to get a master's degree in it. Saday is the one who opened another gate through which I can start a new adventure.

I would like to give my sincere gratitude to Dr. Sriram Krishnamoorthy, who provided direct advice on my thesis work. Sriram had many video chats with Saday and me, and I appreciate the insightful discussions with him. He always gave good ideas, which helped my thesis work.

I also want to thank the member of my thesis committee, Prof. Louis-Noel Pouchet, for participating in my master's exam and taking the time to read my thesis.

I would like to thank the lab members, Venmugil Elango, Weilei Bao, Changwan Hong, Prashant Singh Rawat, and John Eisenlohr, for their help during my thesis work.

Vita

2010 ...... B.S. Physics, Fudan University
2015 ...... M.S. Biophysics, Ohio State University
2010-present ...... Ph.D. student in the Biophysics Program; Graduate Teaching Associate, Graduate Research Associate, NSEC fellow, Ohio State University
2015-present ...... Master's student in Computer Science, Ohio State University

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 CCSD(T) method
      1.1.1 Algebraic structure of the CCSD(T) approach
      1.1.2 Non-iterative CCSD(T) correction
   1.2 General structure of the algorithm
   1.3 Introduction to the Intel Xeon Phi
      1.3.1 Intel Xeon Phi coprocessor architecture
      1.3.2 Offload mode
   1.4 GPU and CUDA
      1.4.1 CUDA C

2. Implementation
   2.1 Implementation on the Intel Xeon Phi Coprocessor
      2.1.1 Dgemm and offload on the Intel Xeon Phi Coprocessor
      2.1.2 OpenMP optimization
   2.2 Implementation on CUDA
      2.2.1 Dgemm on CUDA
      2.2.2 Explicit implementation of LOOP1 and LOOP2 on CUDA

3. Performance Analysis
   3.1 Performance for the Intel Xeon Phi Coprocessor
      3.1.1 Tilesize
      3.1.2 OMP threads affinity
      3.1.3 Comparison between MIC implementation and original CPU version
   3.2 Performance for CUDA implementation
      3.2.1 Tuning for gridDim and blockDim
      3.2.2 Comparison between CUDA implementation and the original CPU version
   3.3 Discussion

4. Conclusion and Future work

List of Tables


1.1 The CUDA keywords for function declaration
1.2 The CUDA keywords for variable declaration
3.1 Architectural comparison of Intel Xeon Phi Coprocessors 3120A/P and 5110P
3.2 Doubles part dgemm calculation time and data transfer time for tilesizes of 40 and 50 with MIC_USE_2MB_BUFFERS = 16K, 32K and 64K
3.3 Wall time (seconds) for the 4 main parts of CCSD(T) on CPU and GPU for different tilesizes
3.4 Wall time (seconds) for singles dgemm and doubles dgemm of CCSD(T) on multiple cores for different tilesizes
3.5 Wall time (seconds) for LOOP1 and LOOP2 of CCSD(T) on multiple cores for different tilesizes
3.6 GFLOPS for the 4 parts on CPU, MIC and GPU
3.7 Cache misses of LOOP1 and LOOP2 for CPU and GPU

List of Figures


1.1 Diagram to calculate C[i,j]. Matrix A and B are symmetric. Blue blocks show the values we actually use in the calculation; red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we store only half of the elements of A and B to save memory and, as a result, the blue blocks are used instead.

1.2 Diagram to calculate C[j,i]. Matrix A and B are symmetric. Blue blocks show the values we actually use in the calculation; red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we store only half of the elements of A and B to save memory and, as a result, the blue blocks are used instead. After we obtain C[i,j] and C[j,i], the symmetrized result is C*[i,j] = (C[i,j] + C[j,i]) / 2.

1.3 High-level architecture of the Intel Xeon Phi coprocessor with cores and ring interconnect [1]

1.4 Architecture overview of an Intel MIC architecture core [2]

1.5 Grid of thread blocks

2.1 Scheme of OpenMP thread affinity control for compact, scatter and balanced. compact: pack threads close to each other; scatter: round-robin threads to cores; balanced: keep OMP thread ids consecutive (MIC only)

3.1 Total time for dgemm calculation and data transfer for the singles and doubles parts of a small system (2 ozone) at different tilesizes

3.2 MIC offload dgemm data transfer speed vs. data size for an example dgemm code. The red dot indicates the data transfer speed for the NWChem MIC offload doubles dgemm tilesize = 40 case

3.3 OpenMP thread affinity experiment showing how performance depends on the number of OpenMP threads and the thread affinity. The experiment is done on the Intel Xeon Phi Coprocessor 5110P; a system with 3 ozone molecules is calculated

3.4 Performance comparison between CPU and MIC with several thread affinities for the four main parts of CCSD(T): singles part dgemm, doubles part dgemm, LOOP1 and LOOP2

3.5 Overall performance comparison between CPU and MIC for different tilesizes

3.6 Tuning of NB and NT for LOOP1 for tilesize = 30. NB is the number of blocks; NT is the number of threads per block. The x axis is NB; the y axis is the wall time (seconds) for LOOP1

3.7 Tuning of NB and NT for LOOP2 for tilesize = 30. NB is the number of blocks; NT is the number of threads per block. The x axis is NB; the y axis is the wall time (seconds) for LOOP2

3.8 Overall CCSD(T) wall time in seconds for the original CPU version and the GPU implementation for different tilesizes

3.9 Overall CCSD(T) wall time in seconds for multiple CPU cores for different tilesizes

Chapter 1: Introduction

Computational chemistry always requires highly accurate methods to describe the instantaneous interactions between electrons, i.e., the correlation effects in molecules. From Hartree-Fock [3] and density functional theory [4] to coupled cluster [5] and configuration interaction [6] methods, each new method improves the accuracy of computational chemistry and bridges the gap between theory and experiment. Among the many methods that describe correlation effects, the coupled cluster (CC) method [5, 7–9] has proved to be, and is widely used as, a very accurate method for solving the electronic Schrödinger equation. The CCSD(T) [10] method, part of the coupled cluster (CC) family, is often called the "gold standard" of computational chemistry, since it is one of the most accurate methods applicable to reasonably large molecules.

Many modern coprocessors/accelerators are capable of exploiting data-level parallelism through the use of Single-Instruction-Multiple-Data (SIMD) execution. SIMD execution is a power-efficient way of boosting peak performance. The most important coprocessors/accelerators include the Intel Xeon Phi and NVIDIA GPUs. The Intel Xeon Phi is a recently released high-performance coprocessor which features 57/60/61 cores, each supporting 4 hardware threads. NVIDIA GPUs have powerful parallel computing ability due to their massively parallel many-core architecture. The availability of an efficient parallel CCSD(T) implementation on the Intel Many Integrated Core (MIC) architecture and the NVIDIA CUDA architecture will have a significant impact on the application of high-accuracy methods.

In this work, we map the CCSD(T) code to the Intel Many Integrated Core (MIC) architecture and NVIDIA GPUs. NWChem [11], a computational chemistry tool, is extended, and its CCSD(T) code is revised to add Intel Xeon Phi coprocessor / NVIDIA CUDA entry directives.

1.1 CCSD(T) method

The CCSD(T) approach [8] and its analogs have been explored in Ref. [5]. Here, we highlight only the basic theoretical threads necessary to understand its parallel implementation [12].

1.1.1 Algebraic structure of the CCSD(T) approach

The Coupled-Cluster theory is predicated on the assumption that there exists a judicious choice of a single Slater determinant $|\Phi\rangle$, referred to as the reference function, which is capable of providing a zeroth-order description of the exact ground-state wavefunction $|\Psi\rangle$. In most cases, the reference function is chosen as the Hartree-Fock (HF) determinant, although other choices have been discussed in the literature. The Coupled-Cluster formulation is based on the exponential parameterization of the correlated wavefunction

$$|\Psi\rangle = e^{T} |\Phi\rangle, \qquad (1.1)$$

where $T$ refers to the so-called cluster operator. The perturbative analysis, based on the linked cluster theorem, shows that only connected diagrams are included in the cluster operator. The most common approximation used in routine high-precision calculations is the CCSD method (CC with singles and doubles), where the cluster operator is defined by singly ($T_1$) and doubly ($T_2$) excited many-body components. This leads to the following representation of the CCSD wavefunction

$$|\Psi_{\mathrm{CCSD}}\rangle = e^{T_1 + T_2} |\Phi\rangle, \qquad (1.2)$$

where $T_1$ and $T_2$ are represented by singly- ($t^i_a$) and doubly-excited ($t^{ij}_{ab}$) cluster amplitudes and corresponding strings of creation/annihilation operators ($X^+_p / X_p$):

$$T_1 = \sum_{i,a} t^i_a X^+_a X_i, \qquad (1.3)$$

$$T_2 = \frac{1}{4} \sum_{i,j,a,b} t^{ij}_{ab} X^+_a X^+_b X_j X_i. \qquad (1.4)$$

As always, the $i, j, \ldots$ ($a, b, \ldots$) indices refer to occupied (unoccupied) spin-orbital indices in the reference function $|\Phi\rangle$. Standard CCSD equations for the cluster amplitudes are obtained from the connected form of the Schrödinger equation defined by projections on singly ($|\Phi^a_i\rangle = X^+_a X_i |\Phi\rangle$) and doubly ($|\Phi^{ab}_{ij}\rangle = X^+_a X^+_b X_j X_i |\Phi\rangle$) excited configurations:

$$\langle \Phi^a_i | e^{-T} H e^{T} | \Phi \rangle = 0 \quad \forall i, a, \qquad (1.5)$$

$$\langle \Phi^{ab}_{ij} | e^{-T} H e^{T} | \Phi \rangle = 0 \quad \forall i, j, a, b, \qquad (1.6)$$

where $H$ designates the electronic Hamiltonian operator.

Once the cluster amplitudes are determined from Equations (1.5) and (1.6), the energy is calculated from the expression

$$E = \langle \Phi | e^{-T} H e^{T} | \Phi \rangle. \qquad (1.8)$$

Using diagrammatic techniques one can easily determine the algebraic structure of Equations (1.5) and (1.6) and the corresponding numerical complexity of the equations, which is proportional to $n_o^2 n_u^4$ ($n_o$ and $n_u$ stand for the number of occupied and unoccupied spin-orbitals, respectively). Unfortunately, the accuracy obtained with the CCSD formalism is in many cases not sufficient to provide the so-called chemical accuracy, which is typically defined as an error below 1 kcal/mol. To achieve chemical accuracy, the inclusion of triply excited effects is necessary.

1.1.2 Non-iterative CCSD(T) correction

The direct inclusion of the triply excited $T_3$ effects in the cluster operator results in high numerical scaling ($\sim n_o^3 n_u^5$) and huge memory demands ($\sim n_o^3 n_u^3$), which prohibit CCSDT (CC with singles, doubles, and triples) calculations even for relatively small molecular systems. In order to reduce the scaling of the CCSDT method without significant loss of accuracy, several methods have been introduced in the past in which the $T_3$ amplitudes are estimated perturbatively [13, 14]. The most popular method in this class of formalisms is the CCSD(T) method, in which the ground-state energy is represented as the sum of the CCSD energy ($E^{\mathrm{CCSD}}$) and a non-iterative CCSD(T) correction, which combines elements of the fourth and fifth order of the standard many-body perturbation theory (MBPT) expansion containing triply excited intermediate states:

$$E^{\mathrm{CCSD(T)}} = E^{\mathrm{CCSD}} + \langle \Phi | (T_2^+ V_N) R_3^{(0)} V T_2 | \Phi \rangle + \langle \Phi | (T_1^+ V_N) R_3^{(0)} V T_2 | \Phi \rangle, \qquad (1.9)$$

where $V_N$ is the two-body part of the electronic Hamiltonian in normal product form. Currently, the CCSD(T) method is the most frequently employed CC approach, especially in studies of spectroscopic properties, geometry optimization, and chemical reactions.

Using the definition of the 3-body resolvent $R_3^{(0)}$, the CCSD(T) energy can be re-written as

$$E^{\mathrm{CCSD(T)}} = E^{\mathrm{CCSD}} + \sum_{\substack{i<j<k \\ a<b<c}} \frac{\langle \Phi | (T_2^+ V_N) | \Phi^{abc}_{ijk} \rangle \, \langle \Phi^{abc}_{ijk} | V T_2 | \Phi \rangle}{\epsilon_i + \epsilon_j + \epsilon_k - \epsilon_a - \epsilon_b - \epsilon_c}. \qquad (1.10)$$

The most expensive contribution (scaling as $n_o^3 n_u^4$) is associated with the presence of the $\langle \Phi^{abc}_{ijk} | V T_2 | \Phi \rangle$ term, which is defined by the expression:

$$\langle \Phi^{abc}_{ijk} | V_N T_2 | \Phi \rangle =$$
$$\quad v^{ij}_{ma} t^{mk}_{bc} - v^{ij}_{mb} t^{mk}_{ac} + v^{ij}_{mc} t^{mk}_{ab} - v^{ik}_{ma} t^{mj}_{bc} + v^{ik}_{mb} t^{mj}_{ac} - v^{ik}_{mc} t^{mj}_{ab}$$
$$\; + v^{jk}_{ma} t^{mi}_{bc} - v^{jk}_{mb} t^{mi}_{ac} + v^{jk}_{mc} t^{mi}_{ab} - v^{ei}_{ab} t^{jk}_{ec} + v^{ei}_{ac} t^{jk}_{eb} - v^{ei}_{bc} t^{jk}_{ea}$$
$$\; + v^{ej}_{ab} t^{ik}_{ec} - v^{ej}_{ac} t^{ik}_{eb} + v^{ej}_{bc} t^{ik}_{ea} - v^{ek}_{ab} t^{ij}_{ec} + v^{ek}_{ac} t^{ij}_{eb} - v^{ek}_{bc} t^{ij}_{ea}, \qquad (1.11)$$
$$(i < j < k, \; a < b < c)$$

where $v^{pq}_{rs}$ is the tensor of two-electron integrals. Equation (1.11) can be separated into terms defined by contractions over occupied indices ($A^{abc}_{ijk}$: the first nine terms on the right hand side of Equation (1.11)) and terms corresponding to contractions over unoccupied indices ($B^{abc}_{ijk}$: the remaining nine terms on the right hand side of Equation (1.11)).

1.2 General structure of the algorithm

The original code calculates the non-iterative CCSD(T) correction using the tensor contraction shown below:

R[a,b,c,i,j,k] -= T[l,k,c,b] * V[l,a,i,j]

A 6-dimensional loop is used to calculate the result tensor R. In this work, we implement CCSD(T) on the Intel Xeon Phi coprocessor and the GPU based on an improved algorithm structure. Instead of executing a 6-dimensional loop, we replace it with dgemm; the BLAS dgemm is expected to be faster than the simple loops. We also need to permute indices to enable the dgemm calculations. Because of the properties of the underlying quantum chemistry [15], the tensors T[l,k,c,b] and V[l,a,i,j] are symmetric. The product of two symmetric tensors is not necessarily symmetric. However, quantum chemistry [15] requires the result tensor R[a,b,c,i,j,k] to be symmetric, and R can be obtained by symmetrization. Therefore, the symmetry property lets us calculate R in a different way. We use a simple 2-dimensional example to illustrate the calculation:

C[i,j] = A[i,k] * B[k,j]

Suppose A and B are symmetric. Figure 1.1 shows how we calculate C[i,j].

Figure 1.1: Diagram to calculate C[i,j]. Matrix A and B are symmetric. Blue blocks show the values we actually use in the calculation. Red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we store only half of the elements of A and B to save memory and, as a result, the blue blocks are used instead.

The blue blocks show the values we actually use in the calculation. The red blocks and their complementary blue blocks together show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we can store only half of the elements of A and B to save memory. As a consequence, the blue blocks are used instead.

Figure 1.2: Diagram to calculate C[j,i]. Matrix A and B are symmetric. Blue blocks show the values we actually use in the calculation. Red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we store only half of the elements of A and B to save memory and, as a result, the blue blocks are used instead. After we obtain C[i,j] and C[j,i], the symmetrized result is C*[i,j] = (C[i,j] + C[j,i]) / 2.

Although A and B are symmetric, their product C is not necessarily symmetric. The symmetric counterpart C[j,i] can be calculated in a similar way, as shown in Figure 1.2. After we obtain both C[i,j] and C[j,i], the symmetrized result can be obtained by C*[i,j] = (C[i,j] + C[j,i]) / 2.
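As a minimal illustration of this two-dimensional case (a sketch with hypothetical names, not the NWChem routine), the following C function forms C = A * B with a naive triple loop and then builds the symmetrized result C*[i,j] = (C[i,j] + C[j,i]) / 2:

#include <stdlib.h>

/* Sketch: symmetrized product of two symmetric n x n matrices (row-major).
 * csym[i][j] = (C[i][j] + C[j][i]) / 2, where C = A * B. */
void symmetrized_product(size_t n, const double *A, const double *B, double *csym)
{
    double *C = malloc(n * n * sizeof *C);   /* unsymmetrized product */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
    /* symmetrize: average each element with its transpose partner */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            csym[i * n + j] = 0.5 * (C[i * n + j] + C[j * n + i]);
    free(C);
}

In the actual implementation the same idea is applied to the 6-dimensional tensor R, with dgemm producing the permuted components that are combined afterwards.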

In our system, we calculate the analogous higher-dimensional matrix multiplication using the same steps. The symmetrized result tensor R[a,b,c,i,j,k] is calculated by symmetrizing nine components, as required by the quantum chemistry properties [15]. In the algorithm, we use dgemm to calculate each symmetry-related component and store the results in nine buffers, for both the singles part and the doubles part.

After we recast the tensor contractions as dgemm, we are able to call the BLAS dgemm procedures [16]. T and V each need local memory of (tilesize)^4 and R needs local memory of (tilesize)^6. In this work, to avoid excessive data transfer between the host and the MIC/GPU, we store R[a,b,c,i,j,k] in a buffer that lives only on the MIC/GPU: it is allocated at the very beginning of each task cycle and kept until the end of the task, where the calculation of the energy contribution from CCSD(T) for that task is also performed on the MIC/GPU. For each dgemm calculation on the MIC/GPU, we only transfer T[l,k,c,b] and V[l,a,i,j] from the host to the MIC/GPU.

The following pseudocode shows the algorithm structure used in this work to calculate E^{CCSD(T)}:

do t_p4b = noab+1, noab+nvab
  do t_p5b = t_p4b, noab+nvab
    do t_p6b = t_p5b, noab+nvab
      do t_h1b = 1, noab
        do t_h2b = t_h1b, noab
          do t_h3b = t_h2b, noab

            ! allocate memory space r_buf on MIC/GPU

            ! calculate singles part using BLAS dgemm and
            ! store 9 results in 9 buffers of r_buf on MIC/GPU
            call ccsd_t_singles(...)

            ! LOOP1: a 6-dimensional loop to sum the 9 buffers
            ! into one buffer

            ! calculate doubles part using BLAS dgemm and
            ! store 9 results in 9 buffers of r_buf on MIC/GPU
            call ccsd_t_doubles(...)

            ! LOOP2: a 6-dimensional loop to sum the 9 buffers
            ! into one buffer, then accumulate the contribution
            ! to E^{CCSD(T)}

            ! free r_buf on MIC/GPU

          enddo
        enddo
      enddo
    enddo
  enddo
enddo

The noab and nvab parameters refer to the total number of occupied and unoccupied tiles, respectively. Inside the 6-dimensional loop nest, a large memory space is allocated on the MIC or GPU and is kept on the device for each iteration. Then the function ccsd_t_singles(...) is called to calculate the singles part using dgemm. After that, LOOP1 is used to sum the nine buffers into one buffer. The doubles part is similar: dgemm and LOOP2 are performed, and the results are accumulated into E^{CCSD(T)}.

1.3 Introduction to the Intel Xeon Phi

1.3.1 Intel Xeon Phi coprocessor architecture

The Intel Xeon Phi coprocessor is connected to an Intel Xeon processor, also known as the host, through a PCI Express (PCIe) bus. Since the Intel Xeon Phi coprocessor runs a Linux operating system, a virtualized TCP/IP stack can be implemented over the PCIe bus, allowing the user to access the coprocessor as a network node. Thus, any user can connect to the coprocessor through a secure shell and directly run individual jobs or submit batch jobs to it. The coprocessor also supports heterogeneous applications wherein a part of the application executes on the host while a part executes on the coprocessor.

Figure 1.3: High-level architecture of the Intel Xeon Phi coprocessor with cores and ring interconnect [1]

Fig. 1.3 shows a high-level overview of the Intel Many Integrated Core architecture and the Intel Xeon Phi coprocessor. In this work, we use the 5110P model of the coprocessor, which offers 60 general-purpose cores with in-order execution and a fixed clock speed of 1.053GHz. The MIC card has 8GB of on-card GDDR5 memory with 16 memory channels. The Intel MIC architecture is based on the x86 ISA, extended with 64-bit addressing and new 512-bit wide SIMD vector instructions and registers. Each core supports hyper-threading with a four-way round-robin scheduling of hardware threads. Each core has a 32KB L1 data cache, a 32KB L1 instruction cache, and a 512KB L2 cache. The L2 caches of all cores are interconnected with each other and the memory controllers via a bidirectional ring bus, effectively creating a shared last-level cache of up to 32MB. The peak floating-point performance is 1010.88 GFLOPS for double precision (DP) or 2021.76 GFLOPS for single precision (SP). Fig. 1.3 shows that the cores are connected together by the ring interconnect. TD in Fig. 1.3 is the distributed duplicate tag directory used for cross-snooping the L2 caches of all cores. The L2 caches are kept fully coherent with each other by the TDs, which are referenced after an L2 cache miss.

Figure 1.4: Architecture overview of an Intel MIC architecture core [2]

1.3.2 Offload mode

The Intel Xeon Phi coprocessor provides native and offload modes for programming. Native mode runs applications directly on the Intel Xeon Phi coprocessor without offloading from a host system. Offload code deals with two levels of memory blocking: one to fit the input data into the coprocessor and another, within the offload code, to fit within the processor caches. Based on our design, we only calculate the CCSD(T) part on the coprocessor, so we use offload mode. Offload mode has to be coded explicitly: we need to add offload directives to the code rather than just compiling it as host code. For example, to offload a single statement, we write

!dir$ offload target (mic[:n])
<statement>

where <statement> is

call subroutine_name(args)

or

ret_val = function_name(args)

To offload a code block, we enclose it in the following directives:

!dir$ offload begin target (mic[:n])
...
!dir$ end offload
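For comparison, a rough C analog of these Fortran offload directives, using the Intel compiler's Language Extensions for Offload (LEO), is sketched below; the array names and the vector-add body are illustrative only and are not taken from the NWChem source:

/* Sketch: offload a code block to the coprocessor with a LEO pragma. */
void offload_add(const double *a, const double *b, double *c, int n)
{
    /* copy a and b to the coprocessor, run the block there, copy c back */
    #pragma offload target(mic:0) in(a, b : length(n)) out(c : length(n))
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }
}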

1.4 GPU and CUDA

The Graphics Processing Unit (GPU) was introduced by NVIDIA in 1999. It allows for faster graphics processing and has powerful parallel computing ability due to its massively parallel many-core architecture. CUDA [17] is a proprietary framework developed by NVIDIA for developing general purpose applications on Graphics Processing Units. In this work, we use an NVIDIA Tesla K40 GPU for the tests and benchmarks. The NVIDIA Tesla K40 has 12GB of memory and 2880 CUDA cores; its peak single and double precision floating point performance is 4.29 TFLOPS and 1.43 TFLOPS, respectively.

1.4.1 CUDA C

In 2006, NVIDIA introduced CUDA, a general purpose parallel computing platform and programming model that allows programmers to code in high level languages without considering the graphics interfaces of GPUs [18]. CUDA extends the traditional C/C++ programming languages, and it provides some new keywords and API functions to the programmer. NVCC (the NVIDIA C Compiler) is used to compile CUDA source programs. The NVCC compiler uses keywords added by programmers to identify the host code and the device code. Table 1.1 shows the keywords for function declaration and Table 1.2 shows the keywords for variable declaration.

Table 1.1: The CUDA keywords for function declaration.

Function declaration        Called from    Executed on
__device__ int func()       device         device
__global__ void func()      host           device
__host__ double func()      host           host

The keyword “__host__” indicates that a function is a host function; if no keyword is used, “__host__” is the default. Host functions run on the CPU and can only be called from other host functions. A function declared with “__global__” is a kernel function, called from the host (CPU) and running on the device (GPU); the return type of such a function must be void. A function declared with “__device__” is a device function, which can only be called from device code (kernel functions) and runs only on the device.
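A small, self-contained illustration of these qualifiers (the function names are hypothetical and not from the thesis code):

// Sketch: CUDA function qualifiers.
__device__ double square(double x) { return x * x; }   // callable only from device code

__global__ void square_all(const double *in, double *out, int n)  // kernel: launched by host, runs on device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(in[i]);
}

__host__ void launch_square_all(const double *d_in, double *d_out, int n)  // ordinary host function
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    square_all<<<blocks, threads>>>(d_in, d_out, n);
}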

Table 1.2: The CUDA keywords for variable declaration.

Variable declaration                      Memory             Scope     Lifetime
int local_var                             register           thread    thread
__device__ int global_var                 global memory      grid      application
__device__ __constant__ int const_var     constant memory    grid      application
__device__ __shared__ int shared_var      shared memory      block     block

The CUDA keywords for variable declaration are shown in Table 1.2. The keyword “__shared__” is used to declare a variable in shared memory that can be shared by all the threads in the same thread block. Shared memory is on-chip, and therefore much faster than local and global memory. Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory; threads can access data in shared memory that was loaded from global memory by other threads within the same thread block. The keyword “__constant__” is used to declare a constant variable, which can only be read by the CUDA threads running on the GPU.

A kernel function is called with the syntax:

kernelFunc<<< gridDim, blockDim >>>(args...)

gridDim and blockDim are the execution configuration parameters. Each can be one-dimensional, two-dimensional or three-dimensional, as illustrated by Figure 1.5.
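As a hedged example of a multi-dimensional launch configuration (the kernel name and problem sizes are invented for illustration), a two-dimensional grid of two-dimensional blocks can be set up with dim3:

// Sketch: a 2-D execution configuration built with dim3.
__global__ void scale2d(double *a, int nx, int ny, double s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny) a[y * nx + x] *= s;
}

void launch_scale2d(double *d_a, int nx, int ny)
{
    dim3 block(16, 16);                               // 16 x 16 = 256 threads per block
    dim3 grid((nx + block.x - 1) / block.x,           // enough blocks to cover nx x ny
              (ny + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(d_a, nx, ny, 2.0);
}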

Figure 1.5: Grid of Thread Blocks.

Chapter 2: Implementation

2.1 Implementation on the Intel Xeon Phi Coprocessor

2.1.1 Dgemm and offload on the Intel Xeon Phi Coprocessor

As discussed in Section 1.2, in order to avoid excessive data transfer between the host and the MIC, a buffer resident only on the MIC is used to store the result tensor R; it is allocated at the very beginning of each task. A code example is shown below.

      if (is_phi_offload) then
        allocate(a_c(1))
        size_max_thres = 56800800

!dir$ offload_transfer target(mic) in(a_c:         &
        length(10*size_max_thres)                  &
        alloc_if(.TRUE.) free_if(.FALSE.))

      endif ! is_phi_offload

a_c in the code above holds R. We allocate a memory region of size 10*size_max_thres on the MIC and keep it there for future use.

To calculate $\langle \Phi^{abc}_{ijk} | V_N T_1 | \Phi \rangle$ and $\langle \Phi^{abc}_{ijk} | V_N T_2 | \Phi \rangle$, we need to transfer V and T from the host to the MIC, and then a dgemm call on the MIC is used. A code example is shown below.

      aoff=>dbl_mb(k_a_sort:k_a_sort+dima_sort*dim_common-1)
      boff=>dbl_mb(k_b_sort:k_b_sort+dimb_sort*dim_common-1)

!dir$ offload begin target(mic:0) nocopy(a_c: length(0)    &
        alloc_if(.FALSE.) free_if(.FALSE.)) in(dima_sort)  &
        in(dimb_sort), in(dim_common),                      &
        in(dbeta), in(ia6), in(dimc) in(aoff:length(        &
        dima_sort*dim_common)                               &
        alloc_if(.TRUE.) free_if(.TRUE.)) in(boff:          &
        length(dimb_sort*dim_common)                        &
        alloc_if(.TRUE.) free_if(.TRUE.))

      CALL DGEMM('T','N',dima_sort,dimb_sort,dim_common,1.0d0,aoff, &
                 dim_common,boff,dim_common,dbeta,                  &
                 a_c((ia6-1)*dimc+1),                               &
                 dima_sort)

!dir$ end offload
      nullify(aoff, boff)

aoff and boff are used to transfer V and T from the host to the MIC; dima_sort*dim_common and dimb_sort*dim_common are the sizes of the data to be transferred. After the dgemm call finishes, the memory for aoff and boff on the MIC is freed.

2.1.2 OpenMP thread optimization

The Intel OpenMP runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY and OMP_NUM_THREADS environment variables with compiler versions 13.1.0 and newer. Three types of affinity, compact, scatter and balanced, are used in this work. They have the following properties: compact packs threads close to each other; scatter round-robins threads across cores; balanced keeps OMP thread ids consecutive (MIC only). Fig. 2.1 illustrates how the three affinity types work. Depending on the application, topology and operating system, the OpenMP thread affinity can have a significant effect on application performance.

Figure 2.1: Scheme of OpenMP thread affinity control for compact, scatter and balanced. compact: pack threads close to each other; scatter: round-robin threads to cores; balanced: keep OMP thread ids consecutive (MIC only).

In this work, we add an OpenMP implementation to LOOP1 and LOOP2, described in Section 1.2, to efficiently sum the singles ($T_1$) and doubles ($T_2$) contributions over $i<j<k$, $a<b<c$ and accumulate them into the total contribution $\delta E^{CCSD(T)}$ of CCSD(T). Both LOOP1 and LOOP2 contain a 6-dimensional loop nest; we test and confirm that an OpenMP parallel implementation with collapse(3) gives the best performance (as sketched below).
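A minimal C sketch of this parallelization pattern is shown below: collapse(3) over the three outermost indices of a 6-deep loop nest that sums nine partial-result buffers into one. The buffer layout, signs and names are simplified placeholders for the real NWChem arrays, which additionally apply per-buffer index permutations.

#include <omp.h>

/* Sketch: sum nine flattened 6-D buffers into one output buffer.
 * range[0..5] holds the extents of (h3, h1, h2, p4, p5, p6). */
void sum_nine_buffers(const long range[6], double *buf[9], double *out)
{
    long n5 = range[5];          /* stride of p5 */
    long n4 = range[4] * n5;     /* stride of p4 */
    long n3 = range[3] * n4;     /* stride of h2 */
    long n2 = range[2] * n3;     /* stride of h1 */
    long n1 = range[1] * n2;     /* stride of h3 */

    #pragma omp parallel for collapse(3)
    for (long h3 = 0; h3 < range[0]; h3++)
      for (long h1 = 0; h1 < range[1]; h1++)
        for (long h2 = 0; h2 < range[2]; h2++)
          for (long p4 = 0; p4 < range[3]; p4++)
            for (long p5 = 0; p5 < range[4]; p5++)
              for (long p6 = 0; p6 < range[5]; p6++) {
                  long idx = h3*n1 + h1*n2 + h2*n3 + p4*n4 + p5*n5 + p6;
                  double s = 0.0;
                  for (int b = 0; b < 9; b++)
                      s += buf[b][idx];   /* real code uses signs and permuted indices */
                  out[idx] = s;
              }
}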

2.2 Implementation on CUDA

2.2.1 Dgemm on CUDA

The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Units (GPUs). In this work, we use the function cublasDgemm to perform dgemm on the GPU.
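A hedged sketch of such a call is shown below (the dimensions follow the Fortran DGEMM('T','N',...) call from Section 2.1.1; the cuBLAS handle and device pointers are assumed to be set up elsewhere, and error checking is omitted):

#include <cublas_v2.h>

/* Sketch: C = alpha * A^T * B + beta * C on the GPU via cuBLAS.
 * A is k x m, B is k x n, C is m x n, all column-major in device memory. */
void device_dgemm(cublasHandle_t handle, int m, int n, int k,
                  const double *d_A, const double *d_B, double *d_C, double beta)
{
    const double alpha = 1.0;
    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, k,
                        d_B, k,
                &beta,  d_C, m);
}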

2.2.2 Explicit implementation of LOOP1 and LOOP2 on CUDA

For LOOP1 and LOOP2, we need to map the 6-dimensional loops to thread blocks. In CUDA, all thread blocks and threads share the same values of the kernel function arguments; the differentiation between the work to be performed by the different threads and thread blocks has to be derived from their location in the thread block and in the thread block grid. The dimensions of threads and thread blocks are denoted blockDim and gridDim, and each can be one-dimensional, two-dimensional or three-dimensional. In LOOP1 and LOOP2, each dimension corresponds to one index of the tensor R[a,b,c,i,j,k]. In NWChem, the 6-dimensional tensor R[a,b,c,i,j,k] is stored in one huge one-dimensional array. The elements of R are assigned to the CUDA threads by decoding the index of R[a,b,c,i,j,k] as in the following code snippet.

int idt = id + i*tnt;
int re;
p6 = idt % range_p6;  re = idt / range_p6;
p5 = re  % range_p5;  re = re  / range_p5;
p4 = re  % range_p4;  re = re  / range_p4;
h2 = re  % range_h2;  re = re  / range_h2;
h1 = re  % range_h1;  re = re  / range_h1;
h3 = re  % range_h3;

idt is the flattened index of R. The index (a,b,c,i,j,k) is expressed as (h1, h2, h3, p4, p5, p6) in the code, and range_h1, range_h2, range_h3, range_p4, range_p5 and range_p6 are the sizes of the corresponding dimensions. After assigning indices to each CUDA thread, the summation over the nine buffers can be done by the CUDA threads.

To improve CUDA memory access performance, we need tiling to obtain coalesced memory accesses. We use LOOP1 as an example to explain. LOOP1 has the following structure:

S[p6,p5,p4,h2,h1,h3] =
  + buf1[p4,h1,p5,p6,h2,h3] - buf2[p4,h2,p5,p6,h1,h3] + buf3[p4,h3,p5,p6,h1,h2]
  - buf4[p5,h1,p4,p6,h2,h3] + buf5[p5,h2,p4,p6,h1,h3] - buf6[p5,h3,p4,p6,h1,h2]
  + buf7[p6,h1,p4,p5,h2,h3] - buf8[p6,h2,p4,p5,h1,h3] + buf9[p6,h3,p4,p5,h1,h2]

Since the main code is written in Fortran, the leftmost index is the fastest iterating index. We need to access both S and the nine buffers coalescingly. To do that we tile: we copy a memory block coalescingly from the buffer to CUDA shared memory and then write it back coalescingly to the variable S. The following pseudocode (using multi-dimensional index notation for brevity) illustrates how to obtain coalesced memory access to both buf1 and S.

// blockDim.x = range_p4 and blockDim.y = range_p6
// range_p4 is the size in dimension p4.
__shared__ double sbuf[range_p4 * range_p6];

// make tx = p4 and ty = p6
ty = bid % range_p6;     int re = bid / range_p6;
int p5 = re % range_p5;  re = re / range_p5;
tx = re % range_p4;      re = re / range_p4;
int h2 = re % range_h2;  re = re / range_h2;
int h1 = re % range_h1;  re = re / range_h1;
int h3 = re % range_h3;

// read from the buffer into shared memory
sbuf[ty * range_p4 + tx] = buf1[tx, h1, p5, ty, h2, h3];

__syncthreads();

// make threadIdx.x = p6 and threadIdx.y = p4
tx = threadIdx.x % range_p6;
ty = threadIdx.x / range_p6;

// write to the variable S
S[tx, p5, ty, h2, h1, h3] = sbuf[tx * range_p4 + ty];

Since buf1, buf2 and buf3 have the same fastest iterating index, tiling can be applied to all of them at the same time. Similarly, buf4, buf5 and buf6 can be processed at the same time. buf7, buf8 and buf9 have the same fastest iterating index as variable S, so we can directly have coalesced memory access without tiling.

For LOOP2, we found that directly summing the 9 buffers in one pass is faster than tiling with multiple passes. LOOP2 has the following structure:

D[p6,p5,p4,h2,h1,h3] =
  - buf1[p5,h1,p4,p6,h2,h3] + buf2[p5,h2,p4,p6,h1,h3] - buf3[p5,h3,p4,p6,h1,h2]
  - buf4[p6,h1,p5,p4,h2,h3] + buf5[p6,h2,p5,p4,h1,h3] - buf6[p6,h3,p5,p4,h1,h2]
  + buf7[p6,h1,p4,p5,h2,h3] - buf8[p6,h2,p4,p5,h1,h3] + buf9[p6,h3,p4,p5,h1,h2]

buf1, buf2 and buf3 need tiling to achieve coalesced memory access. However, buf4-buf9 have the same fastest iterating index as D, so no tiling is needed for them. If we tile buf1, buf2 and buf3, we need at least two passes: one pass for buf1, buf2 and buf3, and another pass for buf4-buf9. We also need extra space to store D, and D is then used to calculate the total energy later. If we do not tile, the only thing we lose is the tiling for 3 buffers; however, we can do the summation in one pass and directly use the result D to accumulate into the final energy. In other words, we do not need to write and read D in the extra space, which saves both time and space. The experiments confirm that the no-tiling variant gives the better result; a sketch of this one-pass summation follows.
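A simplified CUDA sketch of this one-pass variant is shown below; the flattened indexing and buffer names are placeholders, since the real kernel also applies the per-buffer index permutations and feeds D into the energy accumulation:

// Sketch: one-pass summation of nine buffers without shared-memory tiling.
// Each thread processes elements of the flattened 6-D result in a grid-stride loop.
__global__ void loop2_direct_sum(const double *b0, const double *b1, const double *b2,
                                 const double *b3, const double *b4, const double *b5,
                                 const double *b6, const double *b7, const double *b8,
                                 double *d, long n)
{
    long stride = (long)blockDim.x * gridDim.x;
    for (long i = (long)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        // signs follow the LOOP2 structure above
        d[i] = -b0[i] + b1[i] - b2[i]
               -b3[i] + b4[i] - b5[i]
               +b6[i] - b7[i] + b8[i];
    }
}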

Chapter 3: Performance Analysis

3.1 Performance for the Intel Xeon Phi Coprocessor

In this section, we analyze the performance of the Intel Xeon Phi coprocessor implementation. The numbers reported here are based on two Intel Xeon Phi coprocessors: 3120A/P and 5110P. The detailed specifications of the two coprocessors are given in Table 3.1.

Table 3.1: Architectural comparison of Intel Xeon Phi Coprocessors 3120A/P and 5110P

Product Name                 Intel Xeon Phi Coprocessor 3120A/P    Intel Xeon Phi Coprocessor 5110P
Cache                        28.5MB L2 Cache                       30MB L2 Cache
# of Cores                   57                                    60
Processor Base Frequency     1.1GHz                                1.052GHz
Memory Size                  6GB                                   8GB
Memory Bandwidth             240GB/s                               320GB/s

The evaluation of the Intel Xeon Phi Coprocessor 3120A/P was performed on an Intel Xeon CPU E5-2650 v2 (2.60GHz) host which contains 8 cores. Since this part focuses only on the performance of the Intel Xeon Phi coprocessor, we use only one CPU core to run the experiment and communicate with the coprocessor.

3.1.1 Tilesize

The tiling scheme of NWChem corresponds to a partitioning of the spin-orbital domain into smaller subsets containing the spin-orbitals of the same spin and spatial symmetries (the so-called tiles). This partitioning of the spin-orbital domain entails the blocking of all tensors corresponding to one- and two-electron integrals, cluster amplitudes, and all recursive intermediates, into smaller blocks of the size defined by the size of the tile (or tilesize for short). The size of the tiles (tilesize) also defines the local memory requirements in CCSD(T).

Figure 3.1: Total time for dgemm calculation and data transfer for singles and doubles part of a small system (2 ozone) at different tilesize.

The performance of the Intel Xeon Phi coprocessor implementation depends on the tilesize. For each dgemm calculation, we need to offload the 4-dimensional data T[l,k,c,b] and V[l,a,i,j] from the host to the MIC. The smaller the tilesize, the smaller the data transferred for each dgemm and the more transfers we need for all the calculations. The total number of operations does not depend on the tilesize. However, the performance of the standard Intel MKL DGEMM matrix multiplication depends on the matrix size: generally speaking, the larger the matrix size for the dgemm calculation, the higher the performance we can obtain. Fig. 3.1 gives measurements of the dgemm calculation time and data transfer time for the singles part and the doubles part of a small system (2 ozone molecules), for different tilesizes. The experiments are performed on the Intel Xeon Phi coprocessor 3120A/P. Both the dgemm calculation time and the data transfer time decrease when the tilesize increases.

At this point, the data transfer time seems too large: the estimated bandwidth is only 0.41GB/s, while the theoretical CPU-to-MIC transfer bandwidth is 6.7GB/s. To figure out why the bandwidth is so low, we use a simple dgemm example code to profile data transfer bandwidth against data size. Fig. 3.2 shows the data transfer speed vs. the data size for a simple MIC offload dgemm code, run on the Intel Xeon Phi Coprocessor 3120A/P. The red dot indicates the data transfer speed for the NWChem MIC offload doubles dgemm with tilesize = 40, which is close to the performance of the simple offload dgemm experiment. Therefore, the low transfer speed is due to the small data size transferred. If we had a larger system and a larger tilesize, we would obtain a better host-to-MIC data transfer speed.

Because of the relatively large amount of data offloaded in the two 4-dimensional arrays, good performance is obtained if we set the environment variable MIC_USE_2MB_BUFFERS. The offload infrastructure then uses 2MB pages to store data that is transferred to the coprocessor. A reduction of the time spent in data transfer is expected, due to far fewer page faults on the coprocessor and fewer misses in the Translation Lookaside Buffer. Table 3.2 shows the doubles part dgemm calculation time and data transfer time for tilesizes of 40 and 50 with MIC_USE_2MB_BUFFERS = 16K, 32K and 64K. The experiments in Table 3.2 were performed on the Intel Xeon Phi 5110P and show better results than the Intel Xeon Phi 3120A/P.

Figure 3.2: MIC offload dgemm data transfer speed vs. data size for an example dgemm code. The red dot indicates the data transfer speed for the NWChem MIC offload doubles dgemm tilesize = 40 case.

Table 3.2: Doubles part dgemm calculation time and data transfer time for tilesizes of 40 and 50 with MIC_USE_2MB_BUFFERS = 16K, 32K and 64K.

MIC_USE_2MB_BUFFERS                              16K        32K        64K
tilesize=40, dgemm calculation time (second)     174.349    161.435    161.186
tilesize=40, data transfer time (second)         72.525     74.030     74.119
tilesize=50, dgemm calculation time (second)     127.54     128.955    128.466
tilesize=50, data transfer time (second)         30.1334    30.809     30.1582

As we can see, when a tilesize of 50 is chosen, we get a small data transfer time (about 30 seconds).

3.1.2 OMP threads affinity

In this part, we examine how the number of threads and the thread affinity affect performance. The Intel OpenMP runtime library has the ability to bind OpenMP threads to physical processing units. Two aspects should be considered for OpenMP threading and affinity: firstly, determining the number of threads to utilize, and secondly, how to bind threads to specific processor cores. The Intel Xeon Phi coprocessor supports 4 thread contexts per core, and the Intel Xeon Phi Coprocessors 3120A/P and 5110P have 57 and 60 cores, respectively. The best advice for an offload program is to try different numbers of threads from N − 1 to 4 × (N − 1), where N is the number of physical cores on the coprocessor. Using N − 1 instead of N is because of OS overhead: it is inefficient to schedule worker threads on cores where OS threads are contending for cycles. The next step is to choose the thread affinity, which controls how threads are placed on the cores of the Intel Xeon Phi coprocessor. As discussed in Section 2.1.2, we use three types of OpenMP thread affinity: compact (pack threads close to each other), scatter (round-robin threads across cores), and balanced (keep OMP thread ids consecutive; MIC only).

Figure 3.3: OpenMP thread affinity experiment showing how performance depends on the number of OpenMP threads and the thread affinity. This experiment is done on the Intel Xeon Phi Coprocessor 5110P; a system with 3 ozone molecules is calculated.

Fig. 3.3 shows how the number of OpenMP threads and the way threads are placed affect the total performance of CCSD(T). This experiment calculates a larger system (3 ozone molecules) and is performed on the Intel Xeon Phi Coprocessor 5110P. Three sets of data are collected and error bars are shown on the graph; the errors are small, so the plot reflects the true performance for each setting. The total time for CCSD(T) decreases when we increase the number of OpenMP threads. When we choose the maximum number of OpenMP threads, the different affinities give similar performance. However, when the number of threads is less than the maximum value, the thread affinity affects the performance more.

LOOP1 and LOOP2 contain a 6-dimensional loop nest; we implemented the OpenMP parallelism scheme as follows:

!$omp parallel do collapse(3)
do h3 = 0, range_h3-1
  do h1 = 0, range_h1-1
    do h2 = 0, range_h2-1
      do p4 = 0, range_p4-1
        do p5 = 0, range_p5-1
          do p6 = 0, range_p6-1
            ......

Different collapse depths are tested and we confirm that collapse(3) provides the best performance.

3.1.3 Comparison between MIC implementation and original CPU version

In this section, we compare the performance of the Intel Xeon Phi implementation with the CPU version. The CPU used in the benchmark is an Intel Xeon E5 2670 V2. We evaluated the sequential implementation on the CPU (one CPU core), as it appears in NWChem, and the Intel Xeon Phi offload mode for the CCSD(T) part of NWChem on one MIC card (Intel Xeon Phi Coprocessor 5110P).

Figure 3.4 shows the performance comparison between CPU and MIC with several thread affinities for the four main parts of the CCSD(T) calculation: singles part dgemm, doubles part dgemm, LOOP1 and LOOP2. The performance of the CPU has no clear relation to the tilesize, while the Intel Xeon Phi coprocessor gives better performance for larger tilesizes. A larger tilesize means more data transferred for each dgemm offload, which leads to a better data transfer bandwidth. We obtained an 8x speedup for the dgemm calculation and about a 2x speedup for the loops.

Figure 3.4: Performance comparison between CPU and MIC with several thread affinities for the four main parts of CCSD(T): singles part dgemm, doubles part dgemm, LOOP1 and LOOP2.

Figure 3.5 illustrates the overall performance comparison between CPU and MIC for different tilesizes. The CPU performance has no clear relation to the tilesize, but the MIC gives better performance for larger tilesizes. For tilesizes of 30 and 40, the MIC provides a speedup of about 4x.

3.2 Performance for CUDA implementation

In this section, we evaluate the performance of the CUDA implementation of CCSD(T). The GPU used in this evaluation is an NVIDIA Tesla K40, which has 12GB of memory and 2880 CUDA cores. For the dgemm part, we use the CUDA BLAS library. CUDA automatically decides the configuration for dgemm, so we do not need to tune the dgemm calls. The singles part dgemm and the doubles part dgemm generally take 13.95s and 210.20s, respectively. The number of operations for the singles dgemm and the doubles dgemm is proportional to $T^6$ and $(n_o + n_v)T^6$, respectively, where the $n_o$ and $n_v$ parameters refer to the total number of occupied and unoccupied tiles.

30 Figure 3.5: Overall performance comparison between CPU and MIC for different tilesizes.

3.2.1 Tuning for gridDim and blockDim

We need to tune gridDim and blockDim for LOOP1 and LOOP2. For LOOP1 we have the parameters (NB, NT) to tune, where NB stands for the number of blocks (gridDim) and NT stands for the number of threads per block (blockDim). Figure 3.6 shows the tuning of NB and NT for LOOP1 at tilesize = 30; the x axis is the number of blocks and the y axis is the wall time in seconds for LOOP1. NB = 2048 and NT = 256 gives the best performance. Different tilesizes may give different results, but we confirm that they do not generate a significant difference.

31 Figure 3.6: Tuning of NB, NT for LOOP1 for tilesize = 30. NB is number of blocks. NT is number of threads per block. x axis is NB, y axis is wall time (second) for LOOP1.

Analogous tuning is done for LOOP2. Figure 3.7 shows the wall time for LOOP2 with different NB and NT for tilesize = 30. When NT is greater than or equal to 256, the total work done by the threads in one block starts to saturate; when NT is smaller than 64, performance is not good either. NB = 4096 and NT = 128 give the best performance.

3.2.2 Comparison between CUDA implementation and the original CPU version

In this part, we compare the results to the original (CPU) version for the four main parts of the CCSD(T) algorithm: dgemm_s (dgemm for the singles part), dgemm_d (dgemm for the doubles part), LOOP1 and LOOP2. The CPU used in the experiment is an Intel Xeon E5 2670 V2. Table 3.3 shows the wall time in seconds for the four main parts of the CCSD(T) algorithm for CPU and GPU for different tilesizes. Generally speaking, the GPU gives about 16x, 20x, 9x and 13x speedup for dgemm_s, dgemm_d, LOOP1 and LOOP2, respectively; the dgemm parts achieve a much higher speedup than the loops.

32 Figure 3.7: Tuning of NB, NT for LOOP2 for tilesize = 30. NB is number of blocks. NT is number of threads per block. x axis is NB, y axis is wall time (second) for LOOP2.

Table 3.3: Wall time (second) for the 4 main parts of CCSD(T) for CPU and GPU for different tilesizes.

             dgemm_s            dgemm_d            LOOP1              LOOP2
tilesize     CPU      GPU       CPU      GPU       CPU      GPU       CPU      GPU
20           303.68   13.94     4236.3   209.92    1405.4   148.19    1378.0   90.59
25           375.37   17.18     5118.2   214.87    1645.5   175.85    1993.7   108.5
30           325.05   14.61     4272.6   163.28    1404.9   145.76    1413.3   94.17
35           287.01   17.20     4757.5   160.35    1658.0   162.73    1589.2   113.61
40           270.70   17.31     4741.2   155.24    1703.9   164.12    1569.8   117.76

Figure 3.8 shows the overall CCSD(T) wall time in seconds for the original CPU version and the GPU implementation for different tilesizes. The GPU provides 9.3x, 11x, 12x, 13.5x and 14.1x speedup for tilesizes of 20, 25, 30, 35 and 40, respectively. GPU performance is better when the tilesize is larger; this is due to the good performance of CUDA BLAS dgemm with larger tilesizes. The performance of the original code has no clear relation to the tilesize.

Figure 3.8: Overall CCSD(T) wall time in seconds for the original CPU version and the GPU implementation for different tilesizes.

3.3 Discussion

In this section, we discuss several performance-related analyses. It is useful to know the performance of multiple CPU cores so that we know the effective MIC/GPU speedup compared to a multicore CPU. Table 3.4 and Table 3.5 show the wall time in seconds for the dgemm parts and the LOOP parts for different core counts and different tilesizes. The experiments are done with 1, 2, 4 and 8 cores. We can see that each part speeds up with more cores. Generally, 8 cores give a 5.7x speedup for the singles dgemm and a 7x speedup for the doubles dgemm, LOOP1 and LOOP2.

Table 3.4: Wall time (second) for singles dgemm and doubles dgemm of CCSD(T) on multiple cores for different tilesizes.

             dgemm_s                                 dgemm_d
tilesize     1 core   2 cores  4 cores  8 cores      1 core   2 cores  4 cores  8 cores
20           303.6    153.1    79.38    42.95        4236.3   2128.7   1100.3   591.5
25           654.2    190.6    99.39    53.24        5484.5   2575.5   1332.6   713.6
30           325.0    163.2    84.87    45.82        4757.4   2150.7   1108.4   595.4
35           287.0    145.2    76.26    45.32        4757.4   2382.4   1235.7   668.2
40           270.6    135.6    71.36    46.96        4741.1   2385.1   1234.1   677.1

Table 3.5: Wall time (second) for LOOP1 and LOOP2 of CCSD(T) on multiple cores for different tilesizes.

             LOOP1                                   LOOP2
tilesize     1 core   2 cores  4 cores  8 cores      1 core   2 cores  4 cores  8 cores
20           1405.3   708.3    363.2    195.8        1378.0   694.8    358.3    194.0
25           2997.5   829.0    428.5    230.4        3067.8   851.2    444.4    239.9
30           1404.8   709.2    363.9    196.8        1413.3   713.8    369.0    201.4
35           1658.0   830.9    429.3    255.2        1589.2   795.5    412.8    247.2
40           1703.9   852.9    439.5    288.5        1569.7   784.9    406.6    264.7

Figure 3.9 shows the overall wall time in seconds for multiple cores. Except for tilesizes of 25 and 40, 8 cores give about a 3x speedup, which is much lower than the speedup for each individual part. This is due to the MPI communication.

Figure 3.9: Overall CCSD(T) wall time in seconds for multiple CPU cores for different tilesizes.

GFLOPS numbers are provided in Table 3.6. We do not have the MIC data for tilesize 40 since the MIC memory size did not satisfy the memory requirement for that tilesize. We can see that for dgemm_s, the CPU, MIC and GPU all achieve better GFLOPS for larger tilesizes. This is because the data size for dgemm_s is usually small, and the CPU, MIC and GPU still need larger data sizes to reach better performance. However, the CPU GFLOPS for dgemm_d is independent of the tilesize, while the MIC/GPU have better performance with larger tilesizes. Therefore, if we ran a larger system, we would expect better GFLOPS for dgemm_d. The data transfer time is included in the wall time for each part, and it also affects the GFLOPS (e.g., a larger data transfer time leads to lower GFLOPS). LOOP2 has better GFLOPS than LOOP1 because LOOP1 only merges nine buffers into one buffer, while LOOP2 does extra work to calculate the CCSD(T) energy from the buffer result; the energy calculation reuses much data and hence improves the GFLOPS.

Table 3.6: GFLOPS for the 4 parts on CPU, MIC and GPU.

             dgemm_s            dgemm_d            LOOP1                LOOP2
tilesize     CPU   MIC   GPU    CPU   MIC   GPU    CPU     MIC    GPU   CPU    MIC    GPU
20           0.65  2.0   14.2   2.8   11.2  56.8   0.47    1.27   2.23  1.26   1.73   13.7
25           0.62  2.9   13.7   2.8   16.0  66.1   0.482   1.22   2.26  1.23   2.01   13.7
30           0.61  3.5   13.6   2.7   20.1  72.0   0.472   1.12   2.28  1.23   2.29   13.2
35           0.76  4.9   12.8   2.7   29.8  82.2   0.45    1.02   2.29  1.23   2.62   12.3
40           0.81  -     12.7   2.8   -     84.9   0.439   -      2.28  1.25   -      11.9

Table 3.7 lists the cache misses for LOOP1 and LOOP2 for CPU and GPU for different tilesizes. The CPU and GPU cache misses are obtained from Intel VTune and NVIDIA nvprof, respectively. Since we use a Tesla K40, global loads do not use the L1 cache by default; only the L2 cache is accessed for global loads. For LOOP1, we only have global loads and there is no data reuse, therefore we only list the L2 miss rate. For LOOP2, we not only have global loads but also reuse the data to calculate the CCSD(T) energy; therefore there are L1 cache accesses for LOOP2.

Table 3.7: Cache misses of LOOP1 and LOOP2 for CPU and GPU.

             LOOP1                           LOOP2
             CPU                  GPU        CPU                  GPU
tilesize     L1 miss   L2 miss    L2 miss    L1 miss   L2 miss    L1 miss   L2 miss
20           0.4301    0.2085     0.2697     0.0944    0.5483     0.0448    0.2803
25           0.3135    0.2176     0.2904     0.1015    0.7066     0.0926    0.3513
30           0.3106    0.2139     0.2928     0.0869    0.7075     0.0942    0.3502
35           0.3106    0.2139     0.3469     0.0971    0.7565     0.1749    0.3275
40           0.3436    0.4001     0.3477     0.1003    0.7578     0.1770    0.3484

Chapter 4: Conclusion and Future work

The CCSD(T) method provides high accuracy in describing the instantaneous interactions between electrons in molecules. The availability of an efficient parallel CCSD(T) implementation on the Intel Many Integrated Core (MIC) architecture and on NVIDIA GPUs will have a significant impact on the application of high-accuracy methods. In this work, the CCSD(T) code of NWChem is mapped to the Intel Xeon Phi coprocessors and the NVIDIA CUDA accelerator. The CCSD(T) method performs tensor contractions, which are executed using BLAS dgemm on the coprocessor/accelerator. OpenMP is used in the post-processing loops to bind threads to the physical processing units of the Intel Xeon Phi coprocessor, and an algorithm is designed to map the post-processing loops to GPU threads. The experiments show a 4x speedup for the Intel Xeon Phi implementation and a 9x-13x speedup for the NVIDIA GPU implementation, compared to one CPU core. Performance metrics, such as cache misses and GFLOPS, are analyzed for each part of the algorithm.

The CCSD(T) algorithm has a singles part and a doubles part. Each part contains a dgemm calculation and a post-processing loop (LOOP1 and LOOP2). In the post-processing loop, the nine buffers that result from the dgemm calculation are summed into one buffer. The singles part dgemm takes much less time than the doubles part dgemm; however, summing the nine buffers into one buffer takes a similar amount of time for the singles part (LOOP1) and the doubles part (LOOP2), and those time costs are relatively large. Considering the relatively large amount of time spent on the singles part, in future work we should try using loops instead of dgemm for the singles part. The singles part needs far fewer operations than the doubles part, and we expect that simple loops will perform better than dgemm plus a post-processing loop for the singles part.

Bibliography

[1] Jim Jeffers and James Reinders. Intel Xeon Phi Coprocessor High Performance Programming.

[2] Intel Xeon Phi Coprocessor Developer's Quick Start Guide, version 1.8.

[3] Charlotte Froese Fischer. General Hartree-Fock program. Computer Physics Communications, 43(3):355–365, February 1987.

[4] Tanja van Mourik and Robert J. Gdanitz. A critical note on density functional theory studies on rare-gas dimers. The Journal of Chemical Physics, 116(22):9620–9623, June 2002.

[5] Rodney J. Bartlett and Monika Musial. Coupled-cluster theory in quantum chemistry. Reviews of Modern Physics, 79(1):291–352, February 2007.

[6] C. David Sherrill and Henry F. Schaefer III. The Configuration Interaction Method: Advances in Highly Correlated Approaches. In Per-Olov Löwdin, John R. Sabin, Michael C. Zerner, and Erkki Brändas, editors, Advances in Quantum Chemistry, volume 34, pages 143–269. Academic Press, 1999.

[7] Jiri Cizek. On the Correlation Problem in Atomic and Molecular Systems. Calculation of Wavefunction Components in Ursell-Type Expansion Using Quantum-Field Theoretical Methods. The Journal of Chemical Physics, 45(11):4256–4266, December 1966.

[8] S. A. Moszkowski. Short-Range Correlations in Nuclear Matter. Physical Review, 140(2B):B283–B286, October 1965.

[9] J. Paldus, J. Cizek, and I. Shavitt. Correlation Problems in Atomic and Molecular Systems. IV. Extended Coupled-Pair Many-Electron Theory and Its Application to the BH3 Molecule. Physical Review A, 5(1):50–67, January 1972.

[10] Jan Rezac, Lucia Simova, and Pavel Hobza. CCSD[T] Describes Noncovalent Interactions Better than the CCSD(T), CCSD(TQ), and CCSDT Methods. Journal of Chemical Theory and Computation, 9(1):364–369, January 2013.

[11] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus, and W. A. de Jong. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications, 181(9):1477–1489, September 2010.

[12] E. Apra, M. Klemm, and K. Kowalski. Efficient Implementation of Many-Body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor. In SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 674–684, November 2014.

[13] Yannick J. Bomble, John F. Stanton, Mihaly Kallay, and Jurgen Gauss. Coupled-cluster methods including noniterative corrections for quadruple excitations. The Journal of Chemical Physics, 123(5):054101, August 2005.

[14] Steven R. Gwaltney, Edward F. C. Byrd, Troy Van Voorhis, and Martin Head-Gordon. A perturbative correction to the quadratic coupled-cluster doubles method for higher excitations. Chemical Physics Letters, 353(5-6):359–367, February 2002.

[15] Gustavo E. Scuseria, Curtis L. Janssen, and Henry F. Schaefer III. An efficient reformulation of the closed-shell coupled cluster single and double excitation (CCSD) equations. The Journal of Chemical Physics, 89(12):7382–7387, December 1988.

[16] An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Transactions on Mathematical Software, 28(2):135–151, June 2002.

[17] NVIDIA. NVIDIA CUDA Programming Guide, version 3.0, 2010.

[18] David B. Kirk and Wen-Mei W. Hwu. Programming Massively Parallel Processors. Morgan Kaufmann Publishers, 2013.
