<<

Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor and NVIDIA GPU Accelerator

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Bobo Shi, M.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2016

Master’s Examination Committee:

Dr. P. Sadayappan, Advisor

Dr. Louis-Noel Pouchet

© Copyright by

Bobo Shi

2016

Abstract

CCSD(T), part of the coupled cluster (CC) family of methods, is one of the most accurate methods applicable to reasonably large molecules in computational chemistry. The availability of an efficient parallel CCSD(T) implementation will have a significant impact on the application of high-accuracy methods. The Intel Xeon Phi coprocessor and NVIDIA GPUs are among the most important coprocessors/accelerators, offering powerful compute capability due to their massively parallel many-core architectures. In this work, the CCSD(T) code is implemented on the Intel Xeon Phi coprocessor and on NVIDIA GPUs.

The CCSD(T) method performs tensor contractions. To obtain an efficient implementation, we allocate the result tensor only on the coprocessor/accelerator and keep it there to receive the sequence of partial results from the tensor contractions performed on the device. The input tensors are offloaded from the host to the coprocessor/accelerator for each tensor contraction. After all the tensor contractions are finished, the final result is accumulated on the coprocessor/accelerator to avoid a large data transfer from the device back to the host.

The tensor contractions are performed using BLAS dgemm on the coprocessor/accelerator.

Then the result is post-processed using a 6-dimensional loop. For the Intel Xeon Phi implementation, OpenMP is used to bind threads to physical processing units on the coprocessor, and the OpenMP thread affinity is tuned to obtain the best performance. For the GPU, an algorithm is designed to map the 6-dimensional post-processing loop to CUDA threads, and gridDim and blockDim are tuned for the best performance. Overall speedups of 4x and 9x-13x are obtained for the Intel Xeon Phi and GPU implementations, respectively.

This is dedicated to my parents, my sister and my wife

Acknowledgments

Firstly, I would like to take this opportunity to express my sincere appreciation to my master's thesis adviser, Dr. P. (Saday) Sadayappan, for his continuous support, patience, and immense knowledge throughout my master's studies and thesis work. I met Saday when I was in his class, Introduction to Parallel Computing. His teaching skills, patience with students, and broad knowledge of the field impressed me. My major was Biophysics at that time, and the class motivated my interest in computer science, especially in the high performance computing field. Most importantly, I started to learn more about computer science and finally decided to get a master's degree in it. Saday is the one who opened another gate through which I can start a new adventure.

I would like to give my sincere gratitude to Dr. Sriram Krishnamoorthy, who provided direct advice on my thesis work. Sriram had many video chats with Saday and me, and I appreciate the insightful discussions with him. He always gave good ideas, which helped my thesis work.

I also want to thank the member of my thesis committee, Prof. Louis-Noel Pouchet, for participating in my master's exam and taking the time to read my thesis.

I would like to thank the lab members, Venmugil Elango, Weilei Bao, Changwan Hong, Prashant Singh Rawat, and John Eisenlohr, for their help during my thesis work.

Vita

2010 ...... B.S. Physics, Fudan University
2015 ...... M.S. Biophysics, Ohio State University
2010-present ...... Ph.D. student in the Biophysics Program; Graduate Teaching Associate, Graduate Research Associate, NSEC fellow, Ohio State University
2015-present ...... Master's student in Computer Science, Ohio State University

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 CCSD(T) method
      1.1.1 Algebraic structure of the CCSD(T) approach
      1.1.2 Non-iterative CCSD(T) correction
   1.2 General structure of the algorithm
   1.3 Introduction to the Intel Xeon Phi
      1.3.1 Intel Xeon Phi coprocessor architecture
      1.3.2 Offload mode
   1.4 GPU and CUDA
      1.4.1 CUDA C

2. Implementation
   2.1 Implementation on the Intel Xeon Phi Coprocessor
      2.1.1 Dgemm and offload on the Intel Xeon Phi Coprocessor
      2.1.2 OpenMP optimization
   2.2 Implementation on CUDA
      2.2.1 Dgemm on CUDA
      2.2.2 Explicit implementation of LOOP1 and LOOP2 on CUDA

3. Performance Analysis
   3.1 Performance for the Intel Xeon Phi Coprocessor
      3.1.1 Tilesize
      3.1.2 OMP threads affinity
      3.1.3 Comparison between MIC implementation and original CPU version
   3.2 Performance for CUDA implementation
      3.2.1 Tuning for gridDim and blockDim
      3.2.2 Comparison between CUDA implementation and the original CPU version
   3.3 Discussion

4. Conclusion and Future work

List of Tables


1.1 The CUDA keywords for function declaration
1.2 The CUDA keywords for variable declaration
3.1 Architectural comparison of Intel Xeon Phi Coprocessors 3120A/P and 5110P
3.2 Doubles part dgemm calculation time and data transfer time for tilesizes of 40 and 50 with MIC_USE_2MB_BUFFERS = 16K, 32K and 64K
3.3 Wall time (seconds) for the 4 main parts of CCSD(T) on CPU and GPU for different tilesizes
3.4 Wall time (seconds) for singles dgemm and doubles dgemm of CCSD(T) on multiple cores for different tilesizes
3.5 Wall time (seconds) for LOOP1 and LOOP2 of CCSD(T) on multiple cores for different tilesizes
3.6 GFLOPS for the 4 parts on CPU, MIC and GPU
3.7 Cache misses of LOOP1 and LOOP2 for CPU and GPU

List of Figures


1.1 Diagram to calculate C[i,j]. Matrix A and B are symmetric. Blue blocks show the values we actually use in the calculation; red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we store only half of the elements of A and B to save memory and, as a result, the blue blocks are used instead.

1.2 Diagram to calculate C[j,i]. Matrix A and B are symmetric. Blue blocks show the values we actually use in the calculation; red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we store only half of the elements of A and B to save memory and, as a result, the blue blocks are used instead. After we obtain C[i,j] and C[j,i], the symmetrized result is C*[i,j] = (C[i,j] + C[j,i]) / 2.

1.3 High-level architecture of the Intel Xeon Phi coprocessor with cores and ring interconnect [1]

1.4 Architecture overview of an Intel MIC architecture core [2]

1.5 Grid of thread blocks

2.1 Scheme of OpenMP thread affinity control for compact, scatter and balanced. compact: pack threads close to each other; scatter: round-robin threads to cores; balanced: keep OMP thread ids consecutive (MIC only)

3.1 Total time for dgemm calculation and data transfer for the singles and doubles parts of a small system (2 ozone) at different tilesizes

3.2 MIC offload dgemm data transfer speed vs. data size for an example dgemm code. The red dot indicates the data transfer speed for the NWChem MIC offload doubles dgemm tilesize = 40 case

3.3 OpenMP thread affinity experiment showing how performance depends on the number of OpenMP threads and the thread affinity. The experiment is done on the Intel Xeon Phi Coprocessor 5110P; a system with 3 ozone molecules is calculated

3.4 Performance comparison between CPU and MIC with several thread affinities for the four main parts of CCSD(T): singles part dgemm, doubles part dgemm, LOOP1 and LOOP2

3.5 Overall performance comparison between CPU and MIC for different tilesizes

3.6 Tuning of NB and NT for LOOP1 for tilesize = 30. NB is the number of blocks; NT is the number of threads per block. The x axis is NB; the y axis is the wall time (seconds) for LOOP1

3.7 Tuning of NB and NT for LOOP2 for tilesize = 30. NB is the number of blocks; NT is the number of threads per block. The x axis is NB; the y axis is the wall time (seconds) for LOOP2

3.8 Overall CCSD(T) wall time in seconds for the original CPU version and the GPU implementation for different tilesizes

3.9 Overall CCSD(T) wall time in seconds for multiple CPU cores for different tilesizes

Chapter 1: Introduction

Computational chemistry always requires highly accurate methods to describe the instantaneous interactions between electrons, i.e., the correlation effects in molecules. From Hartree-Fock [3] and density functional theory [4] to coupled cluster [5] and configuration interaction [6] methods, each new method improves the accuracy of computational chemistry and bridges the gap between theory and experiment. Among the many methods that describe correlation effects, the coupled cluster (CC) method [5, 7–9] has proved to be, and is widely used as, a very accurate method for solving the electronic Schrödinger equation. The CCSD(T) [10] method, part of the coupled cluster (CC) family, is often called the "gold standard" of computational chemistry, since it is one of the most accurate methods applicable to reasonably large molecules.

Many modern coprocessors/accelerators are capable of exploiting data-level parallelism through the use of Single-Instruction-Multiple-Data (SIMD) execution. SIMD execution is a power-efficient way of boosting peak performance. The most important coprocessors/accelerators include the Intel Xeon Phi and NVIDIA GPUs. The Intel Xeon Phi is a recently released high-performance coprocessor which features 57/60/61 cores, each supporting 4 hardware threads. NVIDIA GPUs have powerful parallel computing ability due to their massively parallel many-core architecture. The availability of an efficient parallel CCSD(T) implementation on the Intel Many Integrated Core (MIC) architecture and the NVIDIA CUDA architecture will have a significant impact on the application of high-accuracy methods.

In this work, we map the CCSD(T) code to the Intel Many Integrated Core (MIC) architecture and NVIDIA GPUs. NWChem [11], a computational chemistry tool, is extended, and its CCSD(T) code is revised to add Intel Xeon Phi coprocessor / NVIDIA CUDA entry directives.

1.1 CCSD(T) method

The CCSD(T) approach [8] and its analogs have been explored in Ref. [5]. Here, we highlight only the basic theoretical threads necessary to understand its parallel implementation [12].

1.1.1 Algebraic structure of the CCSD(T) approach

The Coupled-Cluster theory is predicated on the assumption that there exists a judicious choice of a single Slater determinant $|\Phi\rangle$, referred to as the reference function, which is capable of providing a zeroth-order description of the exact ground-state wavefunction $|\Psi\rangle$. In most cases, the reference function is chosen as the Hartree-Fock (HF) determinant, although other choices have been discussed in the literature. The Coupled-Cluster formulation is based on the exponential parameterization of the correlated wavefunction

$$|\Psi\rangle = e^{T} |\Phi\rangle, \qquad (1.1)$$

where $T$ refers to the so-called cluster operator. The perturbative analysis, based on the linked cluster theorem, shows that only connected diagrams are included in the cluster operator. The most common approximation used in routine high-precision calculations is the CCSD method (CC with singles and doubles), where the cluster operator is defined by singly ($T_1$) and doubly ($T_2$) excited many-body components. This leads to the following representation of the CCSD wavefunction

$$|\Psi_{\mathrm{CCSD}}\rangle = e^{T_1 + T_2} |\Phi\rangle, \qquad (1.2)$$

where $T_1$ and $T_2$ are represented by singly- ($t^i_a$) and doubly-excited ($t^{ij}_{ab}$) cluster amplitudes and corresponding strings of creation/annihilation operators ($X^+_p / X_p$):

$$T_1 = \sum_{i,a} t^i_a X^+_a X_i, \qquad (1.3)$$

$$T_2 = \frac{1}{4} \sum_{i,j,a,b} t^{ij}_{ab} X^+_a X^+_b X_j X_i. \qquad (1.4)$$

As always, the $i, j, \ldots$ ($a, b, \ldots$) indices refer to occupied (unoccupied) spin-orbital indices in the reference function $|\Phi\rangle$. Standard CCSD equations for the cluster amplitudes are obtained from the connected form of the Schrödinger equation defined by projections on singly ($|\Phi^a_i\rangle = X^+_a X_i |\Phi\rangle$) and doubly ($|\Phi^{ab}_{ij}\rangle = X^+_a X^+_b X_j X_i |\Phi\rangle$) excited configurations:

$$\langle \Phi^a_i | e^{-T} H e^{T} | \Phi \rangle = 0 \quad \forall i, a, \qquad (1.5)$$

$$\langle \Phi^{ab}_{ij} | e^{-T} H e^{T} | \Phi \rangle = 0 \quad \forall i, j, a, b, \qquad (1.6)$$

where $H$ designates the electronic Hamiltonian operator.

Once the cluster amplitudes are determined from Equations (1.5) and (1.6), the energy is calculated from the expression

$$E = \langle \Phi | e^{-T} H e^{T} | \Phi \rangle. \qquad (1.8)$$

Using diagrammatic techniques one can easily determine the algebraic structure of Equations (1.5) and (1.6) and the corresponding numerical complexity of the equations, which is proportional to $n_o^2 n_u^4$ ($n_o$ and $n_u$ stand for the number of occupied and unoccupied spin-orbitals, respectively). Unfortunately, the accuracy obtained with the CCSD formalism is in many cases not sufficient to provide the so-called chemical accuracy, which is typically defined as an error below 1 kcal/mol. To achieve chemical accuracy, the inclusion of triply excited effects is necessary.

1.1.2 Non-iterative CCSD(T) correction

The direct inclusion of the triply excited $T_3$ effects in the cluster operator results in high numerical scaling ($\sim n_o^3 n_u^5$) and huge memory demands ($\sim n_o^3 n_u^3$), which prohibit CCSDT (CC with singles, doubles, and triples) calculations even for relatively small molecular systems. In order to reduce the scaling of the CCSDT method without significant loss of accuracy, several methods have been introduced in the past in which the $T_3$ amplitudes are estimated perturbatively [13, 14]. The most popular method in this class of formalisms is the CCSD(T) method, in which the ground-state energy is represented as the sum of the CCSD energy ($E^{\mathrm{CCSD}}$) and a non-iterative CCSD(T) correction, which combines elements of the fourth and fifth order of the standard many-body perturbation theory (MBPT) expansion containing triply excited intermediate states:

$$E^{\mathrm{CCSD(T)}} = E^{\mathrm{CCSD}} + \langle \Phi | (T_2^+ V_N) R_3^{(0)} V T_2 | \Phi \rangle + \langle \Phi | (T_1^+ V_N) R_3^{(0)} V T_2 | \Phi \rangle, \qquad (1.9)$$

where $V_N$ is the two-body part of the electronic Hamiltonian in normal product form. Currently, the CCSD(T) method is the most frequently employed CC approach, especially in studies of spectroscopic properties, geometry optimization, and chemical reactions.

Using the definition of the 3-body resolvent $R_3^{(0)}$, the CCSD(T) energy can be re-written as

$$E^{\mathrm{CCSD(T)}} = E^{\mathrm{CCSD}} + \sum_{\substack{i<j<k \\ a<b<c}} \frac{\langle \Phi | (T_2^+ V_N) | \Phi^{abc}_{ijk} \rangle \, \langle \Phi^{abc}_{ijk} | V T_2 | \Phi \rangle}{\epsilon_i + \epsilon_j + \epsilon_k - \epsilon_a - \epsilon_b - \epsilon_c}. \qquad (1.10)$$

The most expensive contribution (scaling as $n_o^3 n_u^4$) is associated with the presence of the $\langle \Phi^{abc}_{ijk} | V T_2 | \Phi \rangle$ term, which is defined by the expression:

$$\langle \Phi^{abc}_{ijk} | V_N T_2 | \Phi \rangle =$$
$$\quad v^{ij}_{ma} t^{mk}_{bc} - v^{ij}_{mb} t^{mk}_{ac} + v^{ij}_{mc} t^{mk}_{ab} - v^{ik}_{ma} t^{mj}_{bc} + v^{ik}_{mb} t^{mj}_{ac} - v^{ik}_{mc} t^{mj}_{ab}$$
$$\; + v^{jk}_{ma} t^{mi}_{bc} - v^{jk}_{mb} t^{mi}_{ac} + v^{jk}_{mc} t^{mi}_{ab} - v^{ei}_{ab} t^{jk}_{ec} + v^{ei}_{ac} t^{jk}_{eb} - v^{ei}_{bc} t^{jk}_{ea}$$
$$\; + v^{ej}_{ab} t^{ik}_{ec} - v^{ej}_{ac} t^{ik}_{eb} + v^{ej}_{bc} t^{ik}_{ea} - v^{ek}_{ab} t^{ij}_{ec} + v^{ek}_{ac} t^{ij}_{eb} - v^{ek}_{bc} t^{ij}_{ea}, \qquad (1.11)$$
$$(i < j < k, \; a < b < c)$$

where $v^{pq}_{rs}$ is the tensor of two-electron integrals. Equation (1.11) can be separated into terms defined by contractions over occupied indices ($A^{abc}_{ijk}$: the first nine terms on the right hand side of Equation (1.11)) and terms corresponding to contractions over unoccupied indices ($B^{abc}_{ijk}$: the remaining nine terms on the right hand side of Equation (1.11)).

1.2 General structure of the algorithm

The original code calculates the non-iterative CCSD(T) correction using the tensor contraction shown below:

R[a,b,c,i,j,k] -= T[l,k,c,b] * V[l,a,i,j]

A 6-dimensional loop is used to calculate the result tensor R. In this work, we implement CCSD(T) on the Intel Xeon Phi coprocessor and the GPU based on an improved algorithm structure. Instead of executing a 6-dimensional loop, we replace it with dgemm; the BLAS dgemm is expected to be faster than the simple loops. We also need to permute indices to enable the dgemm calculations. Because of the properties of the underlying quantum chemistry [15], the tensors T[l,k,c,b] and V[l,a,i,j] are symmetric. The product of two symmetric tensors is not necessarily symmetric. However, quantum chemistry [15] requires the result tensor R[a,b,c,i,j,k] to be symmetric, and R can be obtained by symmetrization. Therefore, the symmetry property lets us calculate R in a different way. We use a simple 2-dimensional example to illustrate the calculation:

C[i,j] = A[i,k] * B[k,j]

Suppose A and B are symmetric. Figure 1.1 shows how we calculate C[i,j].

Figure 1.1: Diagram to calculate C[i,j]. Matrix A and B are symmetric. Blue blocks show the values we actually use in the calculation. Red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we store only half of the elements of A and B to save memory and, as a result, the blue blocks are used instead.

The blue blocks show the values we actually use in the calculation. The red blocks and their complementary blue blocks together show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we can store only half of the elements of A and B to save memory. As a consequence, the blue blocks are used instead.

Figure 1.2: Diagram to calculate C[j,i]. Matrix A and B are symmetric. Blue blocks show the values we actually use in the calculation. Red blocks show what the regular matrix multiplication algorithm would use. Because of the symmetry of A and B, we store only half of the elements of A and B to save memory and, as a result, the blue blocks are used instead. After we obtain C[i,j] and C[j,i], the symmetrized result is C*[i,j] = (C[i,j] + C[j,i]) / 2.

Although A and B are symmetric, their product C is not necessarily symmetric. The symmetric counterpart C[j,i] can be calculated in a similar way, as shown in Figure 1.2. After we obtain both C[i,j] and C[j,i], the symmetrized result can be obtained by C*[i,j] = (C[i,j] + C[j,i]) / 2.
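As a minimal illustration of this two-dimensional case (a sketch with hypothetical names, not the NWChem routine), the following C function forms C = A * B with a naive triple loop and then builds the symmetrized result C*[i,j] = (C[i,j] + C[j,i]) / 2:

#include <stdlib.h>

/* Sketch: symmetrized product of two symmetric n x n matrices (row-major).
 * csym[i][j] = (C[i][j] + C[j][i]) / 2, where C = A * B. */
void symmetrized_product(size_t n, const double *A, const double *B, double *csym)
{
    double *C = malloc(n * n * sizeof *C);   /* unsymmetrized product */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
    /* symmetrize: average each element with its transpose partner */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            csym[i * n + j] = 0.5 * (C[i * n + j] + C[j * n + i]);
    free(C);
}

In the actual implementation the same idea is applied to the 6-dimensional tensor R, with dgemm producing the permuted components that are combined afterwards.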

In our system, we calculate the analogous higher-dimensional matrix multiplication using the same steps. The symmetrized result tensor R[a,b,c,i,j,k] is calculated by symmetrizing nine components, as required by the quantum chemistry properties [15]. In the algorithm, we use dgemm to calculate each symmetry-related component and store the results in nine buffers, for both the singles part and the doubles part.

After we recast the tensor contractions as dgemm, we are able to call the BLAS dgemm procedures [16]. T and V each need local memory of (tilesize)^4 and R needs local memory of (tilesize)^6. In this work, to avoid excessive data transfer between the host and the MIC/GPU, we store R[a,b,c,i,j,k] in a buffer that lives only on the MIC/GPU: it is allocated at the very beginning of each task cycle and kept until the end of the task, where the calculation of the energy contribution from CCSD(T) for that task is also performed on the MIC/GPU. For each dgemm calculation on the MIC/GPU, we only transfer T[l,k,c,b] and V[l,a,i,j] from the host to the MIC/GPU.

The following pseudocode shows the algorithm structure used in this work to calculate E^{CCSD(T)}:

do t_p4b = noab+1, noab+nvab
  do t_p5b = t_p4b, noab+nvab
    do t_p6b = t_p5b, noab+nvab
      do t_h1b = 1, noab
        do t_h2b = t_h1b, noab
          do t_h3b = t_h2b, noab

            ! allocate memory space r_buf on MIC/GPU

            ! calculate singles part using BLAS dgemm and
            ! store 9 results in 9 buffers of r_buf on MIC/GPU
            call ccsd_t_singles(...)

            ! LOOP1: a 6-dimensional loop to sum the 9 buffers
            ! into one buffer

            ! calculate doubles part using BLAS dgemm and
            ! store 9 results in 9 buffers of r_buf on MIC/GPU
            call ccsd_t_doubles(...)

            ! LOOP2: a 6-dimensional loop to sum the 9 buffers
            ! into one buffer, then accumulate the contribution
            ! to E^{CCSD(T)}

            ! free r_buf on MIC/GPU

          enddo
        enddo
      enddo
    enddo
  enddo
enddo

The noab and nvab parameters refer to the total number of occupied and unoccupied tiles, respectively. Inside the 6-dimensional loop nest, a large memory space is allocated on the MIC or GPU and is kept on the device for each iteration. Then the function ccsd_t_singles(...) is called to calculate the singles part using dgemm. After that, LOOP1 is used to sum the nine buffers into one buffer. The doubles part is similar: dgemm and LOOP2 are performed, and the results are accumulated into E^{CCSD(T)}.

1.3 Introduction to the Intel Xeon Phi

1.3.1 Intel Xeon Phi coprocessor architecture

The Intel Xeon Phi coprocessor is connected to an Intel Xeon processor, also known as the host, through a PCI Express (PCIe) bus. Since the Intel Xeon Phi coprocessor runs a Linux operating system, a virtualized TCP/IP stack can be implemented over the PCIe bus, allowing the user to access the coprocessor as a network node. Thus, any user can connect to the coprocessor through a secure shell and directly run individual jobs or submit batch jobs to it. The coprocessor also supports heterogeneous applications wherein a part of the application executes on the host while a part executes on the coprocessor.

Figure 1.3: High-level architecture of the Intel Xeon Phi coprocessor with cores and ring interconnect [1]

Fig. 1.3 shows a high-level overview of the Intel Many Integrated Core architecture and the Intel Xeon Phi coprocessor. In this work, we use the 5110P model of the coprocessor, which offers 60 general-purpose cores with in-order execution and a fixed clock speed of 1.053GHz. The MIC card has 8GB of on-card GDDR5 memory with 16 memory channels. The Intel MIC architecture is based on the x86 ISA, extended with 64-bit addressing and new 512-bit wide SIMD vector instructions and registers. Each core supports hyper-threading with a four-way round-robin scheduling of hardware threads. Each core has a 32KB L1 data cache, a 32KB L1 instruction cache, and a 512KB L2 cache. The L2 caches of all cores are interconnected with each other and the memory controllers via a bidirectional ring bus, effectively creating a shared last-level cache of up to 32MB. The peak floating-point performance is 1010.88 GFLOPS for double precision (DP) or 2021.76 GFLOPS for single precision (SP). Fig. 1.3 shows that the cores are connected together by the ring interconnect. TD in Fig. 1.3 is the distributed duplicate tag directory used for cross-snooping the L2 caches of all cores. The L2 caches are kept fully coherent with each other by the TDs, which are referenced after an L2 cache miss.

Figure 1.4: Architecture overview of an Intel MIC architecture core [2]

1.3.2 Offload mode

The Intel Xeon Phi coprocessor provides native and offload modes for programming. Native mode runs applications directly on the Intel Xeon Phi coprocessor without offloading from a host system. Offload code deals with two levels of memory blocking: one to fit the input data into the coprocessor and another, within the offload code, to fit within the processor caches. Based on our design, we only calculate the CCSD(T) part on the coprocessor, so we use offload mode. Offload mode has to be coded explicitly: we need to add offload directives to the code rather than just compiling it as host code. For example, to offload a single statement, we write

!dir$ offload target (mic[:n])
<statement>

where <statement> is

call subroutine_name(args)

or

ret_val = function_name(args)

To offload a code block, we enclose it in the following directives:

!dir$ offload begin target (mic[:n])
...
!dir$ end offload
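For comparison, a rough C analog of these Fortran offload directives, using the Intel compiler's Language Extensions for Offload (LEO), is sketched below; the array names and the vector-add body are illustrative only and are not taken from the NWChem source:

/* Sketch: offload a code block to the coprocessor with a LEO pragma. */
void offload_add(const double *a, const double *b, double *c, int n)
{
    /* copy a and b to the coprocessor, run the block there, copy c back */
    #pragma offload target(mic:0) in(a, b : length(n)) out(c : length(n))
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }
}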

1.4 GPU and CUDA

The Graphics Processing Unit (GPU) was introduced by NVIDIA in 1999. It allows for faster graphics processing and has powerful parallel computing ability due to its massively parallel many-core architecture. CUDA [17] is a proprietary framework developed by NVIDIA for developing general purpose applications on Graphics Processing Units. In this work, we use an NVIDIA Tesla K40 GPU for the tests and benchmarks. The NVIDIA Tesla K40 has 12GB of memory and 2880 CUDA cores; its peak single and double precision floating point performance is 4.29 TFLOPS and 1.43 TFLOPS, respectively.

1.4.1 CUDA C

In 2006, NVIDIA introduced CUDA, a general purpose parallel computing platform and programming model that allows programmers to code in high level languages without considering the graphics interfaces of GPUs [18]. CUDA extends the traditional C/C++ programming languages, and it provides some new keywords and API functions to the programmer. NVCC (the NVIDIA C Compiler) is used to compile CUDA source programs. The NVCC compiler uses keywords added by programmers to identify the host code and the device code. Table 1.1 shows the keywords for function declaration and Table 1.2 shows the keywords for variable declaration.

Table 1.1: The CUDA keywords for function declaration.

Function declaration        Called from    Executed on
__device__ int func()       device         device
__global__ void func()      host           device
__host__ double func()      host           host

The keyword “__host__” indicates that a function is a host function; if no keyword is used, “__host__” is the default. Host functions run on the CPU and can only be called from other host functions. A function declared with “__global__” is a kernel function, called from the host (CPU) and running on the device (GPU); the return type of such a function must be void. A function declared with “__device__” is a device function, which can only be called from device code (kernel functions) and runs only on the device.
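A small, self-contained illustration of these qualifiers (the function names are hypothetical and not from the thesis code):

// Sketch: CUDA function qualifiers.
__device__ double square(double x) { return x * x; }   // callable only from device code

__global__ void square_all(const double *in, double *out, int n)  // kernel: launched by host, runs on device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(in[i]);
}

__host__ void launch_square_all(const double *d_in, double *d_out, int n)  // ordinary host function
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    square_all<<<blocks, threads>>>(d_in, d_out, n);
}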

Table 1.2: The CUDA keywords for variable declaration.

Variable declaration                      Memory             Scope     Lifetime
int local_var                             register           thread    thread
__device__ int global_var                 global memory      grid      application
__device__ __constant__ int const_var     constant memory    grid      application
__device__ __shared__ int shared_var      shared memory      block     block

The CUDA keywords for variable declaration are shown in Table 1.2. The keyword “__shared__” is used to declare a variable in shared memory that can be shared by all the threads in the same thread block. Shared memory is on-chip, and therefore much faster than local and global memory. Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory; threads can access data in shared memory that was loaded from global memory by other threads within the same thread block. The keyword “__constant__” is used to declare a constant variable, which can only be read by the CUDA threads running on the GPU.

A kernel function is called with the syntax:

kernelFunc<<< gridDim, blockDim >>>(args...)

gridDim and blockDim are the execution configuration parameters. Each can be one-dimensional, two-dimensional or three-dimensional, as illustrated by Figure 1.5.
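As a hedged example of a multi-dimensional launch configuration (the kernel name and problem sizes are invented for illustration), a two-dimensional grid of two-dimensional blocks can be set up with dim3:

// Sketch: a 2-D execution configuration built with dim3.
__global__ void scale2d(double *a, int nx, int ny, double s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny) a[y * nx + x] *= s;
}

void launch_scale2d(double *d_a, int nx, int ny)
{
    dim3 block(16, 16);                               // 16 x 16 = 256 threads per block
    dim3 grid((nx + block.x - 1) / block.x,           // enough blocks to cover nx x ny
              (ny + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(d_a, nx, ny, 2.0);
}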

Figure 1.5: Grid of Thread Blocks.

Chapter 2: Implementation

2.1 Implementation on the Intel Xeon Phi Coprocessor

2.1.1 Dgemm and offload on the Intel Xeon Phi Coprocessor

As discussed in Section 1.2, in order to avoid excessive data transfer between the host and the MIC, a buffer resident only on the MIC is used to store the result tensor R; it is allocated at the very beginning of each task. A code example is shown below.

      if (is_phi_offload) then
        allocate(a_c(1))
        size_max_thres = 56800800

!dir$ offload_transfer target(mic) in(a_c:         &
        length(10*size_max_thres)                  &
        alloc_if(.TRUE.) free_if(.FALSE.))

      endif ! is_phi_offload

a_c in the code above holds R. We allocate a memory region of size 10*size_max_thres on the MIC and keep it there for future use.

To calculate $\langle \Phi^{abc}_{ijk} | V_N T_1 | \Phi \rangle$ and $\langle \Phi^{abc}_{ijk} | V_N T_2 | \Phi \rangle$, we need to transfer V and T from the host to the MIC, and then a dgemm call on the MIC is used. A code example is shown below.

      aoff=>dbl_mb(k_a_sort:k_a_sort+dima_sort*dim_common-1)
      boff=>dbl_mb(k_b_sort:k_b_sort+dimb_sort*dim_common-1)

!dir$ offload begin target(mic:0) nocopy(a_c: length(0)    &
        alloc_if(.FALSE.) free_if(.FALSE.)) in(dima_sort)  &
        in(dimb_sort), in(dim_common),                      &
        in(dbeta), in(ia6), in(dimc) in(aoff:length(        &
        dima_sort*dim_common)                               &
        alloc_if(.TRUE.) free_if(.TRUE.)) in(boff:          &
        length(dimb_sort*dim_common)                        &
        alloc_if(.TRUE.) free_if(.TRUE.))

      CALL DGEMM('T','N',dima_sort,dimb_sort,dim_common,1.0d0,aoff, &
                 dim_common,boff,dim_common,dbeta,                  &
                 a_c((ia6-1)*dimc+1),                               &
                 dima_sort)

!dir$ end offload
      nullify(aoff, boff)

aoff and boff are used to transfer V and T from the host to the MIC; dima_sort*dim_common and dimb_sort*dim_common are the sizes of the data to be transferred. After the dgemm call finishes, the memory for aoff and boff on the MIC is freed.

2.1.2 OpenMP thread optimization

The Intel OpenMP runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY and OMP_NUM_THREADS environment variables with compiler versions 13.1.0 and newer. Three types of affinity, compact, scatter and balanced, are used in this work. They have the following properties: compact packs threads close to each other; scatter round-robins threads across cores; balanced keeps OMP thread ids consecutive (MIC only). Fig. 2.1 illustrates how the three affinity types work. Depending on the application, topology and operating system, the OpenMP thread affinity can have a significant effect on application performance.

Figure 2.1: Scheme of OpenMP thread affinity control for compact, scatter and balanced. compact: pack threads close to each other; scatter: round-robin threads to cores; balanced: keep OMP thread ids consecutive (MIC only).

In this work, we add an OpenMP implementation to LOOP1 and LOOP2, described in Section 1.2, to efficiently sum the singles ($T_1$) and doubles ($T_2$) contributions over $i<j<k$, $a<b<c$ and accumulate them into the total contribution $\delta E^{CCSD(T)}$ of CCSD(T). Both LOOP1 and LOOP2 contain a 6-dimensional loop nest; we test and confirm that an OpenMP parallel implementation with collapse(3) gives the best performance (as sketched below).
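A minimal C sketch of this parallelization pattern is shown below: collapse(3) over the three outermost indices of a 6-deep loop nest that sums nine partial-result buffers into one. The buffer layout, signs and names are simplified placeholders for the real NWChem arrays, which additionally apply per-buffer index permutations.

#include <omp.h>

/* Sketch: sum nine flattened 6-D buffers into one output buffer.
 * range[0..5] holds the extents of (h3, h1, h2, p4, p5, p6). */
void sum_nine_buffers(const long range[6], double *buf[9], double *out)
{
    long n5 = range[5];          /* stride of p5 */
    long n4 = range[4] * n5;     /* stride of p4 */
    long n3 = range[3] * n4;     /* stride of h2 */
    long n2 = range[2] * n3;     /* stride of h1 */
    long n1 = range[1] * n2;     /* stride of h3 */

    #pragma omp parallel for collapse(3)
    for (long h3 = 0; h3 < range[0]; h3++)
      for (long h1 = 0; h1 < range[1]; h1++)
        for (long h2 = 0; h2 < range[2]; h2++)
          for (long p4 = 0; p4 < range[3]; p4++)
            for (long p5 = 0; p5 < range[4]; p5++)
              for (long p6 = 0; p6 < range[5]; p6++) {
                  long idx = h3*n1 + h1*n2 + h2*n3 + p4*n4 + p5*n5 + p6;
                  double s = 0.0;
                  for (int b = 0; b < 9; b++)
                      s += buf[b][idx];   /* real code uses signs and permuted indices */
                  out[idx] = s;
              }
}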

2.2 Implementation on CUDA

2.2.1 Dgemm on CUDA

The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Units (GPUs). In this work, we use the function cublasDgemm to perform dgemm on the GPU.
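A hedged sketch of such a call is shown below (the dimensions follow the Fortran DGEMM('T','N',...) call from Section 2.1.1; the cuBLAS handle and device pointers are assumed to be set up elsewhere, and error checking is omitted):

#include <cublas_v2.h>

/* Sketch: C = alpha * A^T * B + beta * C on the GPU via cuBLAS.
 * A is k x m, B is k x n, C is m x n, all column-major in device memory. */
void device_dgemm(cublasHandle_t handle, int m, int n, int k,
                  const double *d_A, const double *d_B, double *d_C, double beta)
{
    const double alpha = 1.0;
    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, k,
                        d_B, k,
                &beta,  d_C, m);
}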

2.2.2 Explicit implementation of LOOP1 and LOOP2 on CUDA

For LOOP1 and LOOP2, we need to map the 6-dimensional loops to thread blocks. In CUDA, all thread blocks and threads share the same values of the kernel function arguments; the differentiation between the work to be performed by the different threads and thread blocks has to be derived from their location in the thread block and in the thread block grid. The dimensions of threads and thread blocks are denoted blockDim and gridDim, and each can be one-dimensional, two-dimensional or three-dimensional. In LOOP1 and LOOP2, each dimension corresponds to one index of the tensor R[a,b,c,i,j,k]. In NWChem, the 6-dimensional tensor R[a,b,c,i,j,k] is stored in one huge one-dimensional array. The elements of R are assigned to the CUDA threads by decoding the index of R[a,b,c,i,j,k] as in the following code snippet.

int idt = id + i*tnt;
int re;
p6 = idt % range_p6;  re = idt / range_p6;
p5 = re  % range_p5;  re = re  / range_p5;
p4 = re  % range_p4;  re = re  / range_p4;
h2 = re  % range_h2;  re = re  / range_h2;
h1 = re  % range_h1;  re = re  / range_h1;
h3 = re  % range_h3;

idt is the flattened index of R. The index (a,b,c,i,j,k) is expressed as (h1, h2, h3, p4, p5, p6) in the code, and range_h1, range_h2, range_h3, range_p4, range_p5 and range_p6 are the sizes of the corresponding dimensions. After assigning indices to each CUDA thread, the summation over the nine buffers can be done by the CUDA threads.

To improve CUDA memory access performance, we need tiling to obtain coalesced memory accesses. We use LOOP1 as an example to explain. LOOP1 has the following structure:

S[p6,p5,p4,h2,h1,h3] =
  + buf1[p4,h1,p5,p6,h2,h3] - buf2[p4,h2,p5,p6,h1,h3] + buf3[p4,h3,p5,p6,h1,h2]
  - buf4[p5,h1,p4,p6,h2,h3] + buf5[p5,h2,p4,p6,h1,h3] - buf6[p5,h3,p4,p6,h1,h2]
  + buf7[p6,h1,p4,p5,h2,h3] - buf8[p6,h2,p4,p5,h1,h3] + buf9[p6,h3,p4,p5,h1,h2]

Since the main code is written in Fortran, the leftmost index is the fastest iterating index. We need to access both S and the nine buffers coalescingly. To do that we tile: we copy a memory block coalescingly from the buffer to CUDA shared memory and then write it back coalescingly to the variable S. The following pseudocode (using multi-dimensional index notation for brevity) illustrates how to obtain coalesced memory access to both buf1 and S.

// blockDim.x = range_p4 and blockDim.y = range_p6
// range_p4 is the size in dimension p4.
__shared__ double sbuf[range_p4 * range_p6];

// make tx = p4 and ty = p6
ty = bid % range_p6;     int re = bid / range_p6;
int p5 = re % range_p5;  re = re / range_p5;
tx = re % range_p4;      re = re / range_p4;
int h2 = re % range_h2;  re = re / range_h2;
int h1 = re % range_h1;  re = re / range_h1;
int h3 = re % range_h3;

// read from the buffer into shared memory
sbuf[ty * range_p4 + tx] = buf1[tx, h1, p5, ty, h2, h3];

__syncthreads();

// make threadIdx.x = p6 and threadIdx.y = p4
tx = threadIdx.x % range_p6;
ty = threadIdx.x / range_p6;

// write to the variable S
S[tx, p5, ty, h2, h1, h3] = sbuf[tx * range_p4 + ty];

Since buf1, buf2 and buf3 have the same fastest iterating index, tiling can be applied to all of them at the same time. Similarly, buf4, buf5 and buf6 can be processed at the same time. buf7, buf8 and buf9 have the same fastest iterating index as variable S, so we can directly have coalesced memory access without tiling.

For LOOP2, we found that directly summing the 9 buffers in one pass is faster than tiling with multiple passes. LOOP2 has the following structure:

D[p6,p5,p4,h2,h1,h3] =
  - buf1[p5,h1,p4,p6,h2,h3] + buf2[p5,h2,p4,p6,h1,h3] - buf3[p5,h3,p4,p6,h1,h2]
  - buf4[p6,h1,p5,p4,h2,h3] + buf5[p6,h2,p5,p4,h1,h3] - buf6[p6,h3,p5,p4,h1,h2]
  + buf7[p6,h1,p4,p5,h2,h3] - buf8[p6,h2,p4,p5,h1,h3] + buf9[p6,h3,p4,p5,h1,h2]

buf1, buf2 and buf3 need tiling to achieve coalesced memory access. However, buf4-buf9 have the same fastest iterating index as D, so no tiling is needed for them. If we tile buf1, buf2 and buf3, we need at least two passes: one pass for buf1, buf2 and buf3, and another pass for buf4-buf9. We also need extra space to store D, and D is then used to calculate the total energy later. If we do not tile, the only thing we lose is the tiling for 3 buffers; however, we can do the summation in one pass and directly use the result D to accumulate into the final energy. In other words, we do not need to write and read D in the extra space, which saves both time and space. The experiments confirm that the no-tiling variant gives the better result; a sketch of this one-pass summation follows.
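A simplified CUDA sketch of this one-pass variant is shown below; the flattened indexing and buffer names are placeholders, since the real kernel also applies the per-buffer index permutations and feeds D into the energy accumulation:

// Sketch: one-pass summation of nine buffers without shared-memory tiling.
// Each thread processes elements of the flattened 6-D result in a grid-stride loop.
__global__ void loop2_direct_sum(const double *b0, const double *b1, const double *b2,
                                 const double *b3, const double *b4, const double *b5,
                                 const double *b6, const double *b7, const double *b8,
                                 double *d, long n)
{
    long stride = (long)blockDim.x * gridDim.x;
    for (long i = (long)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        // signs follow the LOOP2 structure above
        d[i] = -b0[i] + b1[i] - b2[i]
               -b3[i] + b4[i] - b5[i]
               +b6[i] - b7[i] + b8[i];
    }
}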

Chapter 3: Performance Analysis

3.1 Performance for the Intel Xeon Phi Coprocessor

In this section, we analyze the performance of the Intel Xeon Phi coprocessor implementation. The numbers reported here are based on two Intel Xeon Phi coprocessors: 3120A/P and 5110P. The detailed specifications of the two coprocessors are given in Table 3.1.

Table 3.1: Architectural comparison of Intel Xeon Phi Coprocessors 3120A/P and 5110P

Product Name                 Intel Xeon Phi Coprocessor 3120A/P    Intel Xeon Phi Coprocessor 5110P
Cache                        28.5MB L2 Cache                       30MB L2 Cache
# of Cores                   57                                    60
Processor Base Frequency     1.1GHz                                1.052GHz
Memory Size                  6GB                                   8GB
Memory Bandwidth             240GB/s                               320GB/s

The evaluation of the Intel Xeon Phi Coprocessor 3120A/P was performed on an Intel Xeon CPU E5-2650 v2 (2.60GHz) host which contains 8 cores. Since this part focuses only on the performance of the Intel Xeon Phi coprocessor, we use only one CPU core to run the experiment and communicate with the coprocessor.

3.1.1 Tilesize

The tiling scheme of NWChem corresponds to a partitioning of the spin-orbital domain into smaller subsets containing the spin-orbitals of the same spin and spatial symmetries (the so-called tiles). This partitioning of the spin-orbital domain entails the blocking of all tensors corresponding to one- and two-electron integrals, cluster amplitudes, and all recursive intermediates, into smaller blocks of the size defined by the size of the tile (or tilesize for short). The size of the tiles (tilesize) also defines the local memory requirements in CCSD(T).

Figure 3.1: Total time for dgemm calculation and data transfer for singles and doubles part of a small system (2 ozone) at different tilesize.

The performance of the Intel Xeon Phi coprocessor implementation depends on the tilesize. For each dgemm calculation, we need to offload the 4-dimensional data T[l,k,c,b] and V[l,a,i,j] from the host to the MIC. The smaller the tilesize, the smaller the data transferred for each dgemm and the more transfers we need for all the calculations. The total number of operations does not depend on the tilesize. However, the performance of the standard Intel MKL DGEMM matrix multiplication depends on the matrix size: generally speaking, the larger the matrix size for the dgemm calculation, the higher the performance we can obtain. Fig. 3.1 gives measurements of the dgemm calculation time and data transfer time for the singles part and the doubles part of a small system (2 ozone molecules), for different tilesizes. The experiments are performed on the Intel Xeon Phi coprocessor 3120A/P. Both the dgemm calculation time and the data transfer time decrease when the tilesize increases.

At this point, the data transfer time seems too large: the estimated bandwidth is only 0.41GB/s, while the theoretical CPU-to-MIC transfer bandwidth is 6.7GB/s. To figure out why the bandwidth is so low, we use a simple dgemm example code to profile data transfer bandwidth against data size. Fig. 3.2 shows the data transfer speed vs. the data size for a simple MIC offload dgemm code, run on the Intel Xeon Phi Coprocessor 3120A/P. The red dot indicates the data transfer speed for the NWChem MIC offload doubles dgemm with tilesize = 40, which is close to the performance of the simple offload dgemm experiment. Therefore, the low transfer speed is due to the small data size transferred. If we had a larger system and a larger tilesize, we would obtain a better host-to-MIC data transfer speed.

Because of the relatively large amount of data offloaded in the two 4-dimensional arrays, good performance is obtained if we set the environment variable MIC_USE_2MB_BUFFERS. The offload infrastructure then uses 2MB pages to store data that is transferred to the coprocessor. A reduction of the time spent in data transfer is expected, due to far fewer page faults on the coprocessor and fewer misses in the Translation Lookaside Buffer. Table 3.2 shows the doubles part dgemm calculation time and data transfer time for tilesizes of 40 and 50 with MIC_USE_2MB_BUFFERS = 16K, 32K and 64K. The experiments in Table 3.2 were performed on the Intel Xeon Phi 5110P and show better results than the Intel Xeon Phi 3120A/P.

Figure 3.2: MIC offload dgemm data transfer speed vs. data size for an example dgemm code. The red dot indicates the data transfer speed for the NWChem MIC offload doubles dgemm tilesize = 40 case.

Table 3.2: Doubles part dgemm calculation time and data transfer time for tilesizes of 40 and 50 with MIC_USE_2MB_BUFFERS = 16K, 32K and 64K.

MIC_USE_2MB_BUFFERS                              16K        32K        64K
tilesize=40, dgemm calculation time (second)     174.349    161.435    161.186
tilesize=40, data transfer time (second)         72.525     74.030     74.119
tilesize=50, dgemm calculation time (second)     127.54     128.955    128.466
tilesize=50, data transfer time (second)         30.1334    30.809     30.1582

As we can see, when a tilesize of 50 is chosen, we get a small data transfer time (about 30 seconds).

3.1.2 OMP threads affinity

In this part, we examine how the number of threads and the thread affinity affect performance. The Intel OpenMP runtime library has the ability to bind OpenMP threads to physical processing units. Two aspects should be considered for OpenMP threading and affinity: firstly, determining the number of threads to utilize, and secondly, how to bind threads to specific processor cores. The Intel Xeon Phi coprocessor supports 4 thread contexts per core, and the Intel Xeon Phi Coprocessors 3120A/P and 5110P have 57 and 60 cores, respectively. The best advice for an offload program is to try different numbers of threads from N − 1 to 4 × (N − 1), where N is the number of physical cores on the coprocessor. Using N − 1 instead of N is because of OS overhead: it is inefficient to schedule worker threads on cores where OS threads are contending for cycles. The next step is to choose the thread affinity, which controls how threads are placed on the cores of the Intel Xeon Phi coprocessor. As discussed in Section 2.1.2, we use three types of OpenMP thread affinity: compact (pack threads close to each other), scatter (round-robin threads across cores), and balanced (keep OMP thread ids consecutive; MIC only).

Figure 3.3: OpenMP thread affinity experiment showing how performance depends on the number of OpenMP threads and the thread affinity. This experiment is done on the Intel Xeon Phi Coprocessor 5110P; a system with 3 ozone molecules is calculated.

Fig. 3.3 shows how the number of OpenMP threads and the way threads are placed affect the total performance of CCSD(T). This experiment calculates a larger system (3 ozone molecules) and is performed on the Intel Xeon Phi Coprocessor 5110P. Three sets of data are collected and error bars are shown on the graph; the errors are small, so the plot reflects the true performance for each setting. The total time for CCSD(T) decreases when we increase the number of OpenMP threads. When we choose the maximum number of OpenMP threads, the different affinities give similar performance. However, when the number of threads is less than the maximum value, the thread affinity affects the performance more.

LOOP1 and LOOP2 contain a 6-dimensional loop nest; we implemented the OpenMP parallelism scheme as follows:

!$omp parallel do collapse(3)
do h3 = 0, range_h3-1
  do h1 = 0, range_h1-1
    do h2 = 0, range_h2-1
      do p4 = 0, range_p4-1
        do p5 = 0, range_p5-1
          do p6 = 0, range_p6-1
            ......

Different collapse depths are tested and we confirm that collapse(3) provides the best performance.

3.1.3 Comparison between MIC implementation and original CPU version

In this section, we compare the performance of the Intel Xeon Phi implementation with the CPU version. The CPU used in the benchmark is an Intel Xeon E5 2670 V2. We evaluated the sequential implementation on the CPU (one CPU core), as it appears in NWChem, and the Intel Xeon Phi offload mode for the CCSD(T) part of NWChem on one MIC card (Intel Xeon Phi Coprocessor 5110P).

Figure 3.4 shows the performance comparison between CPU and MIC with several thread affinities for the four main parts of the CCSD(T) calculation: singles part dgemm, doubles part dgemm, LOOP1 and LOOP2. The performance of the CPU has no clear relation to the tilesize, while the Intel Xeon Phi coprocessor gives better performance for larger tilesizes. A larger tilesize means more data transferred for each dgemm offload, which leads to a better data transfer bandwidth. We obtained an 8x speedup for the dgemm calculation and about a 2x speedup for the loops.

Figure 3.4: Performance comparison between CPU and MIC with several thread affinities for the four main parts of CCSD(T): singles part dgemm, doubles part dgemm, LOOP1 and LOOP2.

Figure 3.5 illustrates the overall performance comparison between CPU and MIC for different tilesizes. The CPU performance has no clear relation to the tilesize, but the MIC gives better performance for larger tilesizes. For tilesizes of 30 and 40, the MIC provides a speedup of about 4x.

3.2 Performance for CUDA implementation

In this section, we evaluate the performance of the CUDA implementation of CCSD(T). The GPU used in this evaluation is an NVIDIA Tesla K40, which has 12GB of memory and 2880 CUDA cores. For the dgemm part, we use the CUDA BLAS library. CUDA automatically decides the configuration for dgemm, so we do not need to tune the dgemm calls. The singles part dgemm and the doubles part dgemm generally take 13.95s and 210.20s, respectively. The number of operations for the singles dgemm and the doubles dgemm is proportional to $T^6$ and $(n_o + n_v)T^6$, respectively, where the $n_o$ and $n_v$ parameters refer to the total number of occupied and unoccupied tiles.

30 Figure 3.5: Overall performance comparison between CPU and MIC for different tilesizes.

3.2.1 Tuning for gridDim and blockDim

We need to tune gridDim and blockDim for LOOP1 and LOOP2. For LOOP1 we have the parameters (NB, NT) to tune, where NB stands for the number of blocks (gridDim) and NT stands for the number of threads per block (blockDim). Figure 3.6 shows the tuning of NB and NT for LOOP1 at tilesize = 30; the x axis is the number of blocks and the y axis is the wall time in seconds for LOOP1. NB = 2048 and NT = 256 gives the best performance. Different tilesizes may give different results, but we confirm that they do not generate a significant difference.

31 Figure 3.6: Tuning of NB, NT for LOOP1 for tilesize = 30. NB is number of blocks. NT is number of threads per block. x axis is NB, y axis is wall time (second) for LOOP1.

Analogous tuning is done for LOOP2. Figure 3.7 shows the wall time for LOOP2 with different NB and NT for tilesize = 30. When NT is greater than or equal to 256, the total work done by the threads in one block starts to saturate; when NT is smaller than 64, performance is not good either. NB = 4096 and NT = 128 give the best performance.

3.2.2 Comparison between CUDA implementation and the original CPU version

In this part, we compare the results to the original (CPU) version for the four main parts of the CCSD(T) algorithm: dgemm_s (dgemm for the singles part), dgemm_d (dgemm for the doubles part), LOOP1 and LOOP2. The CPU used in the experiment is an Intel Xeon E5 2670 V2. Table 3.3 shows the wall time in seconds for the four main parts of the CCSD(T) algorithm for CPU and GPU for different tilesizes. Generally speaking, the GPU gives about 16x, 20x, 9x and 13x speedup for dgemm_s, dgemm_d, LOOP1 and LOOP2, respectively; the dgemm parts achieve a much higher speedup than the loops.

32 Figure 3.7: Tuning of NB, NT for LOOP2 for tilesize = 30. NB is number of blocks. NT is number of threads per block. x axis is NB, y axis is wall time (second) for LOOP2.

Table 3.3: Wall time (second) for the 4 main parts of CCSD(T) for CPU and GPU for different tilesizes.

             dgemm_s            dgemm_d            LOOP1              LOOP2
tilesize     CPU      GPU       CPU      GPU       CPU      GPU       CPU      GPU
20           303.68   13.94     4236.3   209.92    1405.4   148.19    1378.0   90.59
25           375.37   17.18     5118.2   214.87    1645.5   175.85    1993.7   108.5
30           325.05   14.61     4272.6   163.28    1404.9   145.76    1413.3   94.17
35           287.01   17.20     4757.5   160.35    1658.0   162.73    1589.2   113.61
40           270.70   17.31     4741.2   155.24    1703.9   164.12    1569.8   117.76

Figure 3.8 shows the overall CCSD(T) wall time in seconds for the original CPU version and the GPU implementation for different tilesizes. The GPU provides 9.3x, 11x, 12x, 13.5x and 14.1x speedup for tilesizes of 20, 25, 30, 35 and 40, respectively. GPU performance is better when the tilesize is larger; this is due to the good performance of CUDA BLAS dgemm with larger tilesizes. The performance of the original code has no clear relation to the tilesize.

Figure 3.8: Overall CCSD(T) wall time in seconds for the original CPU version and the GPU implementation for different tilesizes.

3.3 Discussion

In this section, we discuss several performance-related analyses. It is useful to know the performance of multiple CPU cores so that we know the effective MIC/GPU speedup compared to a multicore CPU. Table 3.4 and Table 3.5 show the wall time in seconds for the dgemm parts and the LOOP parts for different core counts and different tilesizes. The experiments are done with 1, 2, 4 and 8 cores. We can see that each part speeds up with more cores. Generally, 8 cores give a 5.7x speedup for the singles dgemm and a 7x speedup for the doubles dgemm, LOOP1 and LOOP2.

Table 3.4: Wall time (second) for singles dgemm and doubles dgemm of CCSD(T) on multiple cores for different tilesizes.

             dgemm_s                                 dgemm_d
tilesize     1 core   2 cores  4 cores  8 cores      1 core   2 cores  4 cores  8 cores
20           303.6    153.1    79.38    42.95        4236.3   2128.7   1100.3   591.5
25           654.2    190.6    99.39    53.24        5484.5   2575.5   1332.6   713.6
30           325.0    163.2    84.87    45.82        4757.4   2150.7   1108.4   595.4
35           287.0    145.2    76.26    45.32        4757.4   2382.4   1235.7   668.2
40           270.6    135.6    71.36    46.96        4741.1   2385.1   1234.1   677.1

Table 3.5: Wall time (second) for LOOP1 and LOOP2 of CCSD(T) on multiple cores for different tilesizes.

             LOOP1                                   LOOP2
tilesize     1 core   2 cores  4 cores  8 cores      1 core   2 cores  4 cores  8 cores
20           1405.3   708.3    363.2    195.8        1378.0   694.8    358.3    194.0
25           2997.5   829.0    428.5    230.4        3067.8   851.2    444.4    239.9
30           1404.8   709.2    363.9    196.8        1413.3   713.8    369.0    201.4
35           1658.0   830.9    429.3    255.2        1589.2   795.5    412.8    247.2
40           1703.9   852.9    439.5    288.5        1569.7   784.9    406.6    264.7

Figure 3.9 shows the overall wall time in seconds for multiple cores. Except for tilesizes of 25 and 40, 8 cores give about a 3x speedup, which is much lower than the speedup for each individual part. This is due to the MPI communication.

Figure 3.9: Overall CCSD(T) wall time in seconds for multiple CPU cores for different tilesizes.

GFLOPS numbers are provided in Table 3.6. We do not have the MIC data for tilesize 40 since the MIC memory size did not satisfy the memory requirement for that tilesize. We can see that for dgemm_s, the CPU, MIC and GPU all achieve better GFLOPS for larger tilesizes. This is because the data size for dgemm_s is usually small, and the CPU, MIC and GPU still need larger data sizes to reach better performance. However, the CPU GFLOPS for dgemm_d is independent of the tilesize, while the MIC/GPU have better performance with larger tilesizes. Therefore, if we ran a larger system, we would expect better GFLOPS for dgemm_d. The data transfer time is included in the wall time for each part, and it also affects the GFLOPS (e.g., a larger data transfer time leads to lower GFLOPS). LOOP2 has better GFLOPS than LOOP1 because LOOP1 only merges nine buffers into one buffer, while LOOP2 does extra work to calculate the CCSD(T) energy from the buffer result; the energy calculation reuses much data and hence improves the GFLOPS.

Table 3.6: GFLOPS for the 4 parts on CPU, MIC and GPU.

             dgemm_s            dgemm_d            LOOP1                LOOP2
tilesize     CPU   MIC   GPU    CPU   MIC   GPU    CPU     MIC    GPU   CPU    MIC    GPU
20           0.65  2.0   14.2   2.8   11.2  56.8   0.47    1.27   2.23  1.26   1.73   13.7
25           0.62  2.9   13.7   2.8   16.0  66.1   0.482   1.22   2.26  1.23   2.01   13.7
30           0.61  3.5   13.6   2.7   20.1  72.0   0.472   1.12   2.28  1.23   2.29   13.2
35           0.76  4.9   12.8   2.7   29.8  82.2   0.45    1.02   2.29  1.23   2.62   12.3
40           0.81  -     12.7   2.8   -     84.9   0.439   -      2.28  1.25   -      11.9

Table 3.7 lists the cache misses for LOOP1 and LOOP2 for CPU and GPU for different tilesizes. The CPU and GPU cache misses are obtained from Intel VTune and NVIDIA nvprof, respectively. Since we use a Tesla K40, global loads do not use the L1 cache by default; only the L2 cache is accessed for global loads. For LOOP1, we only have global loads and there is no data reuse, therefore we only list the L2 miss rate. For LOOP2, we not only have global loads but also reuse the data to calculate the CCSD(T) energy; therefore there are L1 cache accesses for LOOP2.

Table 3.7: Cache misses of LOOP1 and LOOP2 for CPU and GPU.

             LOOP1                           LOOP2
             CPU                  GPU        CPU                  GPU
tilesize     L1 miss   L2 miss    L2 miss    L1 miss   L2 miss    L1 miss   L2 miss
20           0.4301    0.2085     0.2697     0.0944    0.5483     0.0448    0.2803
25           0.3135    0.2176     0.2904     0.1015    0.7066     0.0926    0.3513
30           0.3106    0.2139     0.2928     0.0869    0.7075     0.0942    0.3502
35           0.3106    0.2139     0.3469     0.0971    0.7565     0.1749    0.3275
40           0.3436    0.4001     0.3477     0.1003    0.7578     0.1770    0.3484

Chapter 4: Conclusion and Future work

The CCSD(T) method provides high accuracy in describing the instantaneous interactions between electrons in molecules. The availability of an efficient parallel CCSD(T) implementation on the Intel Many Integrated Core (MIC) architecture and on NVIDIA GPUs will have a significant impact on the application of high-accuracy methods. In this work, the CCSD(T) code of NWChem is mapped to the Intel Xeon Phi coprocessors and the NVIDIA CUDA accelerator. The CCSD(T) method performs tensor contractions, which are executed using BLAS dgemm on the coprocessor/accelerator. OpenMP is used in the post-processing loops to bind threads to the physical processing units of the Intel Xeon Phi coprocessor, and an algorithm is designed to map the post-processing loops to GPU threads. The experiments show a 4x speedup for the Intel Xeon Phi implementation and a 9x-13x speedup for the NVIDIA GPU implementation, compared to one CPU core. Performance metrics, such as cache misses and GFLOPS, are analyzed for each part of the algorithm.

The CCSD(T) algorithm has a singles part and a doubles part. Each part contains a dgemm calculation and a post-processing loop (LOOP1 and LOOP2). In the post-processing loop, the nine buffers that result from the dgemm calculation are summed into one buffer. The singles part dgemm takes much less time than the doubles part dgemm; however, summing the nine buffers into one buffer takes a similar amount of time for the singles part (LOOP1) and the doubles part (LOOP2), and those time costs are relatively large. Considering the relatively large amount of time spent on the singles part, in future work we should try using loops instead of dgemm for the singles part. The singles part needs far fewer operations than the doubles part, and we expect that simple loops will perform better than dgemm plus a post-processing loop for the singles part.

Bibliography

[1] Jim Jeffers and James Reinders. Intel Xeon Phi Coprocessor High Performance Programming.

[2] Intel Xeon Phi Coprocessor Developer's Quick Start Guide, version 1.8.

[3] Charlotte Froese Fischer. General Hartree-Fock program. Computer Physics Communications, 43(3):355–365, February 1987.

[4] Tanja van Mourik and Robert J. Gdanitz. A critical note on density functional theory studies on rare-gas dimers. The Journal of Chemical Physics, 116(22):9620–9623, June 2002.

[5] Rodney J. Bartlett and Monika Musial. Coupled-cluster theory in quantum chemistry. Reviews of Modern Physics, 79(1):291–352, February 2007.

[6] C. David Sherrill and Henry F. Schaefer III. The Configuration Interaction Method: Advances in Highly Correlated Approaches. In Per-Olov Löwdin, John R. Sabin, Michael C. Zerner, and Erkki Brändas, editors, Advances in Quantum Chemistry, volume 34, pages 143–269. Academic Press, 1999.

[7] Jiri Cizek. On the Correlation Problem in Atomic and Molecular Systems. Calculation of Wavefunction Components in Ursell-Type Expansion Using Quantum-Field Theoretical Methods. The Journal of Chemical Physics, 45(11):4256–4266, December 1966.

[8] S. A. Moszkowski. Short-Range Correlations in Nuclear Matter. Physical Review, 140(2B):B283–B286, October 1965.

[9] J. Paldus, J. Cizek, and I. Shavitt. Correlation Problems in Atomic and Molecular Systems. IV. Extended Coupled-Pair Many-Electron Theory and Its Application to the BH3 Molecule. Physical Review A, 5(1):50–67, January 1972.

[10] Jan Rezac, Lucia Simova, and Pavel Hobza. CCSD[T] Describes Noncovalent Interactions Better than the CCSD(T), CCSD(TQ), and CCSDT Methods. Journal of Chemical Theory and Computation, 9(1):364–369, January 2013.

[11] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus, and W. A. de Jong. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications, 181(9):1477–1489, September 2010.

[12] E. Apra, M. Klemm, and K. Kowalski. Efficient Implementation of Many-Body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor. In SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 674–684, November 2014.

[13] Yannick J. Bomble, John F. Stanton, Mihaly Kallay, and Jurgen Gauss. Coupled-cluster methods including noniterative corrections for quadruple excitations. The Journal of Chemical Physics, 123(5):054101, August 2005.

[14] Steven R. Gwaltney, Edward F. C. Byrd, Troy Van Voorhis, and Martin Head-Gordon. A perturbative correction to the quadratic coupled-cluster doubles method for higher excitations. Chemical Physics Letters, 353(5-6):359–367, February 2002.

[15] Gustavo E. Scuseria, Curtis L. Janssen, and Henry F. Schaefer III. An efficient reformulation of the closed-shell coupled cluster single and double excitation (CCSD) equations. The Journal of Chemical Physics, 89(12):7382–7387, December 1988.

[16] An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Transactions on Mathematical Software, 28(2):135–151, June 2002.

[17] NVIDIA. NVIDIA CUDA Programming Guide, version 3.0, 2010.

[18] David B. Kirk and Wen-Mei W. Hwu. Programming Massively Parallel Processors. Morgan Kaufmann Publishers, 2013.
