A Framework for Performance Optimization of Tensor Contraction Expressions
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By Pai-Wei Lai, M.S. Graduate Program in Computer Science and Engineering
The Ohio State University 2014
Dissertation Committee:
P. Sadayappan, Advisor
Gagan Agrawal
Atanas Rountev

© Copyright by Pai-Wei Lai, 2014

ABSTRACT
Attaining high performance and productivity in the evaluation of scientific applications is a challenging task for computer scientists, and is often critical in the advancement of many scientific disciplines. In this dissertation, we focus on the development of high performance, scalable parallel programs for a class of scientific computations in quantum chemistry — tensor contraction expressions.
Tensor contraction expressions are generalized forms of multi-dimensional matrix-matrix operations, which form the fundamental computational constructs in electronic structure modeling. Tensors in these computations exhibit various types of symmetry and sparsity. Contractions on such tensors are highly irregular and incur significant computation and communication costs if data locality is not considered in the implementation. Prior efforts have focused on implementing tensor contractions using block-sparse representations. Many parallel programs for tensor contractions have been implemented successfully; however, their performance is unsatisfactory on emerging computer systems.
In this work, we investigate several performance bottlenecks of previous approaches, and present corresponding techniques to optimize operations, parallelism, workload balance, and data locality. We exploit symmetric properties of tensors to minimize the operation count of tensor contraction expressions through algebraic transformation. Rules are formulated to discover symmetric properties of intermediate tensors; cost models and algorithms are developed to reduce operation counts. Our approach achieves significant operation count reductions, compared to other state-of-the-art computational chemistry software, using real-world tensor contraction expressions from the coupled cluster methods.
In order to achieve high performance and scalability, multiple programming models are often used in a single application. We design a domain-specific framework which uses the partitioned global address space programming model for data management and inter-node communication, and the task parallel execution model for dynamic load balancing. Tensor contraction expressions are decomposed into a collection of computational tasks operating on tensor tiles. We eliminate most synchronization steps by executing independent tensor contractions concurrently, and present mechanisms to improve their data locality. Our framework shows improved performance and scalability for tensor contraction expressions from representative coupled cluster methods.
To Li-Lun, for the laughter every day;
To Aaron, for the crying every night;
And to Ola, for the purring once in a while.
ACKNOWLEDGMENTS
I have relied on the help of many people in this effort. I would like to express my greatest appreciation to my advisor, Dr. P. Sadayappan, for his continuous guidance and support over the past five years. I feel very fortunate to have worked with him and have learned a great deal from him. I am grateful to my dissertation committee, Dr. Gagan Agrawal and Dr. Atanas Rountev, and the graduate faculty representative, Dr. Christopher Miller, for their valuable suggestions to improve my work.
I am sincerely grateful to have had the opportunity to work so closely with many of the best minds in computer science and chemistry. I thank my mentor, Dr. Sriram Krishnamoorthy, for two memorable summer internships at Pacific Northwest National Laboratory, and Dr. Karol Kowalski, Dr. Edward Valeev, Dr. Marcel Nooijen, and Dr. Dmitry Lyakh, for helping me understand the chemistry aspects of my research. Special thanks to Dr. Albert Hartono and Dr. Huaijian Zhang, for their assistance in the enhancement of OpMin; and Dr. Wenjing Ma, for providing important insights into the development of DLTC.
I am thankful to all my friends in Columbus for making my life here truly amazing and enjoyable: Qingpeng Niu, Humayun Arafat, Naznin Fauzia, Kevin Stock, Mahesh Ravishankar, Sanket Tavarageri, Martin Kong, Justin Holewinski, Tom Henretty, Samyam Rajbhandari, Akshay Nikam, Venmugil Elango, and many others. I'm also particularly grateful for the constant support of Yu-Keng Shih, Chun-Ming Chen, En-Hsiang Tseng, Debbie Lee, Ko-Chih Wang, Kang-Che Lee, Tzu-Hsuan Wei, my Saturday morning pickup game friends, and countless friends from the Taiwanese Student Association.
Finally, I am deeply grateful for the love and support of my family: my inspiring parents, Feng-Wei and Mei-Chen; my lovely wife, Li-Lun; my adorable son, Aaron; my brother, Pai-Ching; and my cat, Ola. Thank you for always backing me up.
Words cannot express how much I love you all.
Pai-Wei Lai
Columbus, Ohio
August 25, 2014
VITA
January 9, 1984 ...... Born: Taipei, Taiwan
June 2001 ...... B.S. Computer Science, National Tsing Hua University, Hsinchu, Taiwan
June 2005 ...... M.S. Computer Science, National Tsing Hua University, Hsinchu, Taiwan
2009 — present ...... Graduate Research Associate, The Ohio State University, Columbus, OH, USA
Summer 2011 ...... Ph.D. Intern, Pacific Northwest National Lab, Richland, WA, USA
Summer 2012 ...... Ph.D. Intern, Pacific Northwest National Lab, Richland, WA, USA
PUBLICATIONS
Qingpeng Niu, Pai-Wei Lai, S.M. Faisal, Srinivasan Parthasarathy, and P. Sadayappan: “A Fast Implementation of MLR-MCL Algorithm on Multi-core Processors”. To appear in International Conference on High Performance Computing (HiPC’14), Goa, India, December 17–20, 2014.

Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, Sriram Krishnamoorthy, and P. Sadayappan: “Communication-Optimal Framework for Contracting Distributed Tensors”. To appear in Supercomputing (SC’14), New Orleans, LA, USA, November 16–21, 2014.

Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, Sriram Krishnamoorthy, and P. Sadayappan: “CAST: Contraction Algorithms for Symmetric Tensors”. To appear in International Conference on Parallel Processing (ICPP’14), Minneapolis, MN, USA, September 9–12, 2014.

Pai-Wei Lai, Humayun Arafat, Venmugil Elango, and P. Sadayappan: “Accelerating Strassen-Winograd’s Algorithm on GPUs”. In International Conference on High Performance Computing (HiPC’13), Bengaluru (Bangalore), India, December 18–21, 2013.

Pai-Wei Lai, Kevin Stock, Samyam Rajbhandari, Sriram Krishnamoorthy, and P. Sadayappan: “A Framework for Load Balancing of Tensor Contraction Expressions via Dynamic Task Partitioning”. In Supercomputing (SC’13), Denver, CO, USA, November 17–22, 2013.

Pai-Wei Lai, Huaijian Zhang, Samyam Rajbhandari, Edward Valeev, Karol Kowalski, and P. Sadayappan: “Effective Utilization of Tensor Symmetry in Operation Optimization of Tensor Contraction Expressions”. In International Conference on Computational Science (ICCS’12), Omaha, NE, USA, June 4–6, 2012.
FIELDS OF STUDY
Major Field: Computer Science and Engineering
Studies in:
High Performance Computing: Prof. P. Sadayappan
Software Engineering: Prof. Atanas Rountev
Artificial Intelligence: Prof. Eric Fosler-Lussier
TABLE OF CONTENTS
Page
Abstract ...... ii
Dedication ...... iv
Acknowledgments ...... v
Vita ...... vii
List of Figures ...... xiv
List of Tables ...... xvii
List of Algorithms ...... xix
List of Listings ...... xx
Chapters:
1. Introduction ...... 1
2. Background ...... 6
2.1 Coupled Cluster Theory ...... 6
2.2 Tensor Contraction Expressions ...... 9
2.3 Tensor Contraction Engine ...... 11
2.4 Partitioned Global Address Space Programming Models ...... 13
2.5 Task Parallel Programming Models ...... 15
2.6 Domain Specific Languages ...... 17
3. Overview ...... 19
3.1 Operation Minimizer (OpMin) ...... 21
3.1.1 Language parser ...... 21
3.1.2 Operation optimizer ...... 22
3.1.3 Code generator ...... 23
3.2 Dynamic Load-balanced Tensor Contractions (DLTC) ...... 23
3.2.1 Dynamic task partitioning ...... 24
3.2.2 Dynamic task execution ...... 25
4. Operation Minimization on Symmetric Tensors ...... 27
4.1 Introduction ...... 27
4.2 Symmetry Properties of Tensors ...... 29
4.2.1 Antisymmetry ...... 29
4.2.2 Vertex symmetry ...... 30
4.3 Methods ...... 31
4.3.1 Derivation rules ...... 32
4.3.2 Cost models ...... 34
4.3.3 Operation minimization algorithms ...... 36
4.4 Results and Discussion ...... 42
4.4.1 Experimental setup ...... 42
4.4.2 Importance of symmetry properties ...... 43
4.4.3 Performance evaluation of OpMin algorithms ...... 47
4.4.4 Performance comparison of OpMin vs. NWChem ...... 48
4.4.5 Performance comparison of OpMin vs. PSI3 ...... 50
4.4.6 Performance comparison of OpMin vs. Genetic ...... 52
4.4.7 Choosing a small set of optimized forms ...... 52
4.5 Summary ...... 55
5. Dynamic Load-balanced Tensor Contractions ...... 57
5.1 Introduction ...... 57
5.2 Problem ...... 60
5.3 Methods ...... 63
5.3.1 Domain-specific primitives ...... 64
5.3.2 NWChem proxy application ...... 75
5.3.3 Exploiting inter-contraction parallelism ...... 76
5.3.4 Dynamic task partitioning ...... 78
5.3.5 Task distribution and execution ...... 80
5.3.6 Double buffering ...... 81
5.4 Results and Discussion ...... 83
5.4.1 Experimental setup ...... 83
5.4.2 Performance comparison of DLTC vs. CTF ...... 85
5.4.3 Performance comparison of DLTC vs. NWChem ...... 90
5.5 Summary ...... 90
6. Data Locality Enhancement of DLTC via Caching ...... 92
6.1 Introduction ...... 92
6.2 Preliminary ...... 94
6.2.1 Block-sparse representation ...... 94
6.2.2 Subroutines generated by TCE ...... 98
6.2.3 Target equations ...... 101
6.3 Methods ...... 103
6.3.1 Communication traffic analysis ...... 104
6.3.2 Communication optimization ...... 107
6.4 Results and Discussion ...... 108
6.4.1 Experimental setup ...... 108
6.4.2 Hit rate analysis ...... 109
6.4.3 Communication volume analysis ...... 111
6.4.4 Caching performance ...... 111
6.5 Summary ...... 113
7. Related Work ...... 115
7.1 Matrix Multiplication Algorithms ...... 115 7.2 Optimizing TCE ...... 117 7.3 Other Implementations for Tensor Contractions ...... 120
8. Future Work ...... 123
9. Conclusion ...... 125
Appendices:
A. Tensor Contraction Expressions ...... 126
A.1 Input equations to OpMin ...... 126
A.2 Equations implemented in the DLTC ...... 156
Bibliography ...... 158
LIST OF FIGURES
Figure Page
2.1 The workflow of the Tensor Contraction Engine...... 12
2.2 The PGAS programming model...... 14
3.1 Overview of system...... 20
4.1 Operation count for V/O = 40 optimized at different ratios...... 54
4.2 Percentage difference from optimal operation count for different V/O ratios optimized at V/O = 5 and V/O = 40...... 54
5.1 DAG representation of CCD-T2...... 62
5.2 Iteration space of iterators...... 71
5.3 Layered DAG representation of CCD-T2 in DLTC...... 77
5.4 Influence of tile size selection. The profiling is performed on CCD. Problem size: 100. Number of cores: 512...... 78
5.5 Partitioning independent expressions into a number of tasks in a task pool...... 80
5.6 Double buffering...... 82
5.7 Performance results for CCD. Number of cores: 512...... 86
5.8 Performance results for CCSD. Number of cores: 512...... 87
5.9 Total execution time breakdown of DLTC for CCSD on 512 cores. Problem size: 100...... 87
5.10 Strong scaling comparison for CCD. Speedups are based on the best time from DLTC-GA/Scioto on 128 cores. Dotted line indicates ideal speedup. Problem size: 100...... 88
5.11 Strong scaling comparison for CCSD. Speedups are based on the best time from DLTC-GA/Scioto on 128 cores. Dotted line indicates ideal speedup. Problem size: 100...... 88
5.12 Performance comparison of DLTC vs. NWChem for CCSD on Uracil. 89
5.13 Performance comparison of DLTC vs. NWChem for CCSD on GFP. . 89
6.1 DAG representation of CCSD-T1...... 96
6.2 Block-sparse representation of a 2-D tensor...... 97
6.3 TCE implementation of CCSD-T1...... 99
6.4 Layered DAG representation of CCSD-T1 in DLTC...... 101
6.5 Viewing the F integral as four sub-tensors...... 102
6.6 The memory model and task execution model in the TCE...... 104
6.7 Total execution time breakdown of CCD-T2 on 256 cores. Problem size: 100...... 105
6.8 Communication time breakdown of CCD-T2 on 256 cores for 2-D and 4-D tiles. Problem size: 100...... 106
6.9 Communication time breakdown of CCD-T2 on 256 cores for multi- plication and addition expressions. Problem size: 100...... 106
6.10 Communication time breakdown of 4-D tiles for different tensors. The profiling is performed for CCD-T2 on 256 cores. Problem size: 100...... 106
6.11 The memory model and task execution model in DLTC with data replication and caching...... 108
6.12 Best hit rate achievable of DLTC for CCD-T2. Problem size: 100. ... 109
6.13 Best hit rate achievable of DLTC for CCSD-T2. Problem size: 100. .. 110
6.14 Communication volume of DLTC for CCD-T2 and CCSD-T2. Prob- lem size: 100...... 110
6.15 Communication time of DLTC for CCD-T2. Problem size: 100. .... 112
6.16 Communication time of DLTC for CCSD-T2. Problem size: 100. ... 112
6.17 Total execution time of DLTC for CCD-T2. Problem size: 100...... 113
6.18 Total execution time of DLTC for CCSD-T2. Problem size: 100. .... 114
LIST OF TABLES
Table Page
4.1 Derivation rules for addition expressions...... 32
4.2 Derivation rules for multiplication expressions...... 33
4.3 Characteristics of input equations to OpMin for performance evaluation...... 44
4.4 Operation count comparison of dense vs. symmetric tensor contractions...... 44
4.5 Performance evaluation of OpMin with different O and V values for UCCSD-T2 (V = a ⋅ O , a and b in the range of 1 to 3)...... 45
4.6 Performance evaluation of combining STO, FACT, and CSE algorithms in OpMin...... 46
4.7 O and V values of H2O for different basis sets...... 48
4.8 Operation count comparison of OpMin vs. NWChem for CCSD and CCSDT equations. Various combinations of basis sets (cc-pVDZ/cc-pVTZ) and symmetry point groups (c1/c2v) are tested for H2O. ... 49
4.9 Operation count comparison of OpMin vs. NWChem for EOMCCSD equations. Two basis sets (cc-pVDZ/cc-pVTZ) are tested for H2O. .. 50
4.10 Operation count comparison of OpMin vs. PSI3 for CCSD equations. Various combinations of basis sets (cc-pVDZ/cc-pVTZ) and symmetry point groups (c1/c2v) are tested for H2O...... 51
4.11 Operation count comparison of OpMin vs. Genetic for CCSD and CCSDT equations. Two basis sets (cc-pVDZ/cc-pVTZ) are tested for H2O using the c1 point symmetry group. Numbers for Genetic are extracted from [24, 28]...... 52
5.1 Tensor contraction expressions of CCD-T2...... 61
5.2 SLOC comparison of DLTC vs. NWChem TCE...... 87
6.1 Tensor contraction expressions of CCSD-T1...... 95
6.2 Input and output tensors of CCD-T2 and CCSD-T2 equations. .... 103
LIST OF ALGORITHMS
Algorithm Page
2.1 Task parallel execution model...... 15
4.1 Single term optimization ...... 38
4.2 Heuristic factorization ...... 40
4.3 Common sub-expression elimination ...... 42
6.1 TCE implementation ...... 100
LIST OF LISTINGS
1.1 Example of a tensor contraction expression from the CC theory of quantum chemistry...... 2
2.1 A simple loop nest implementation of S[a, b, i, j] += A[a, k] ∗ B[b, l] ∗ C[k, l, i, j]...... 10
3.1 Example of an input to OpMin...... 22
5.1 Example of specifying a tensor contraction: I[h5, h1] += T[p6, p7, h1, h8] ∗ V[h5, h8, p6, p7]...... 68
5.2 Example of 2-D matrix-matrix multiplication using three loops. ... 72
5.3 Creating a single iterator for 2-D matrix-matrix multiplication. .... 72
5.4 Creating two iterators for 2-D matrix-matrix multiplication by dividing on dimension K...... 72
5.5 Task function for 2-D matrix-matrix multiplication...... 73
5.6 Task function for computing multiplication expressions...... 73
5.7 Task function for computing addition expressions...... 74
5.8 NXTVAL: A centralized load balancing scheme using a global shared counter...... 76
5.9 Dynamic task partitioning...... 81
5.10 Task function for multiplication with double buffering...... 83
CHAPTER 1
Introduction
Computational methods which model the structures and interactions among molecules play an important role in many research fields, such as the chemical, physical, and biological sciences. Computationally intensive components in electronic structure calculations are often expressible as a set of tensor contraction expressions (or simply tensor contractions). They consume a large fraction of computer resources at supercomputer centers nationwide. Many implementations for computing tensor contraction expressions have limitations on the problem sizes that are solvable, due to constrained memory and time.
Just as a vector is described by an array, a tensor is a generalized matrix, often represented by a multi-dimensional array in a computer system. A tensor contraction is a sum-of-products of tensor elements. Listing 1.1 shows an example of a tensor contraction expression from the Coupled Cluster (CC) [7] theory. CC theory is one of the most prevalent methods used for solving chemical problems, providing an accurate quantum-mechanical description of ground and excited states of chemical systems. The accuracy provided by the CC methods comes at a great computation cost. The computation cost for the example shown in Listing 1.1 is
Listing 1.1 Example of a tensor contraction expression from the CC theory of quantum chemistry.
r_vo[p1,h1] = f_vo[p1,h1]
  + f_vv[p1,p2]*t_vo[p2,h1]
  - f_oo[h2,h1]*t_vo[p1,h2]
  + 2.0 * f_ov[h2,p2]*t_vvoo[p1,p2,h1,h2]
  - f_ov[h2,p2]*t_vvoo[p2,p1,h1,h2]
  - v_vovv[p1,h2,p2,p3]*t_vvoo[p3,p2,h1,h2]
  + 2.0 * v_vovv[p1,h2,p2,p3]*t_vvoo[p2,p3,h1,h2]
  - v_vovo[p1,h2,p2,h1]*t_vo[p2,h2]
  + 2.0 * v_ovvo[h2,p1,p2,h1]*t_vo[p2,h2]
  + v_oovo[h2,h3,p2,h1]*t_vvoo[p1,p2,h2,h3]
  - 2.0 * v_oovo[h2,h3,p2,h1]*t_vvoo[p1,p2,h3,h2]
  - f_ov[h2,p2]*t_vo[p2,h1]*t_vo[p1,h2]
  + 2.0 * v_vovv[p1,h2,p2,p3]*t_vo[p2,h1]*t_vo[p3,h2]
  - v_vovv[p1,h2,p2,p3]*t_vo[p3,h1]*t_vo[p2,h2]
  + v_oovv[h2,h3,p2,p3]*t_vo[p2,h1]*t_vvoo[p1,p3,h3,h2]
  - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p2,h1]*t_vvoo[p1,p3,h2,h3]
  + v_oovv[h2,h3,p2,p3]*t_vo[p1,h2]*t_vvoo[p3,p2,h1,h3]
  - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p1,h2]*t_vvoo[p2,p3,h1,h3]
  + 4.0 * v_oovv[h2,h3,p2,p3]*t_vo[p2,h2]*t_vvoo[p1,p3,h1,h3]
  - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p2,h2]*t_vvoo[p3,p1,h1,h3]
  - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p3,h2]*t_vvoo[p1,p2,h1,h3]
  + v_oovv[h2,h3,p2,p3]*t_vo[p3,h2]*t_vvoo[p2,p1,h1,h3]
  - 2.0 * v_oovo[h2,h3,p2,h1]*t_vo[p2,h2]*t_vo[p1,h3]
  + v_oovo[h2,h3,p2,h1]*t_vo[p1,h2]*t_vo[p2,h3]
  - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p2,h1]*t_vo[p1,h2]*t_vo[p3,h3]
  + v_oovv[h2,h3,p2,p3]*t_vo[p2,h1]*t_vo[p3,h2]*t_vo[p1,h3];
O(N^6), where N is the range of each tensor index. When N is sufficiently large, it is fairly common to run the computation on thousands of computer cores for many hours. Therefore, improving the efficiency and scalability of tensor contraction expressions is critical for scientific progress.
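To make the notation concrete, the sketch below evaluates a single term of Listing 1.1 with NumPy's `einsum`. This is an illustrative example only, not code from this dissertation: the array shapes, the small O and V values, and the use of dense arrays (ignoring symmetry and sparsity) are all assumptions for demonstration.

```python
import numpy as np

# One term of Listing 1.1: r_vo[p1,h1] += 2.0 * f_ov[h2,p2] * t_vvoo[p1,p2,h1,h2].
# h-indices range over O occupied orbitals, p-indices over V virtual orbitals.
# Index mapping for einsum: h1 -> i, h2 -> j, p1 -> a, p2 -> b.
O, V = 4, 8
rng = np.random.default_rng(0)
f_ov = rng.random((O, V))            # f_ov[j, b]
t_vvoo = rng.random((V, V, O, O))    # t_vvoo[a, b, i, j]

# Sum over the contracted indices j (h2) and b (p2); a and i survive.
r_vo = 2.0 * np.einsum('jb,abij->ai', f_ov, t_vvoo)

# Verify against the naive loop nest (O(V^2 * O^2) multiply-adds).
ref = np.zeros((V, O))
for a in range(V):
    for i in range(O):
        for j in range(O):
            for b in range(V):
                ref[a, i] += 2.0 * f_ov[j, b] * t_vvoo[a, b, i, j]
assert np.allclose(r_vo, ref)
```

The full expression in Listing 1.1 is a sum of 26 such terms; the most expensive ones contract over four indices, which is where the O(N^6) cost arises.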
Manually implementing tensor contraction expressions requires a vast amount of time and effort from domain experts. The development process is complicated, tedious, and difficult to debug, especially for methods with very sophisticated formulations. A number of automated approaches have been developed to accelerate the development of tensor contractions. For instance, the Tensor Contraction Engine (TCE) [6, 9, 10, 30, 32] was a collaborative effort by computer scientists and quantum chemists to ease the process of developing efficient implementations of tensor contraction expressions.
The TCE automatically compiles tensor contraction expressions specified in high-level mathematical forms into efficient computer programs. The TCE attempts to form the most efficient sequence of contractions and to minimize memory usage via a series of optimization phases: algebraic transformation, memory minimization, loop fusion, space-time transformation, etc. As a result, the TCE provided the first sequential and parallel implementations for a wide range of many-body methods. These programs are integrated and distributed in production chemistry software, such as NWChem [74], one of the best-known computational chemistry software suites. With the help of the TCE, chemists can focus their efforts on developing their algorithms, instead of spending inordinate amounts of time on parallel programming and debugging.
However, as both computational methods and computer architectures advance and become increasingly complex, the performance and scalability of the programs generated by the TCE have become less satisfactory. Significant performance tuning is required to scale these programs beyond a few hundred to a few thousand cores. Several performance bottlenecks in these implementations have been discovered. For example, parallelism is only exploited within each individual contraction. The load balancing scheme is simple, and incurs high communication overhead on large numbers of cores. Data locality is not taken into consideration in the TCE.
Symmetry properties that exist among one or multiple index groups of tensors provide significant opportunities for saving memory and reducing computation. To date, the data representation, computation algorithms, and communication patterns for efficient parallel contraction of tensor expressions with symmetry properties are not well understood. Most existing studies make use of tensor symmetries only to a limited extent, and perform redundant work and communication.
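A minimal NumPy sketch (illustrative only, not this dissertation's data representation) of why symmetry saves memory: a 2-D antisymmetric tensor A, with A[i,j] = -A[j,i], is fully determined by its strictly upper-triangular part, so only n(n-1)/2 of its n^2 entries need to be stored or computed.

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)
M = rng.random((n, n))
A = M - M.T                      # construct an antisymmetric tensor
assert np.allclose(A, -A.T)

iu = np.triu_indices(n, k=1)     # indices of the unique (strictly upper) entries
packed = A[iu]                   # packed storage: n*(n-1)/2 = 15 values, not 36

# Reconstruct the full tensor from the packed form.
B = np.zeros((n, n))
B[iu] = packed
B = B - B.T                      # lower triangle is the negated transpose
assert np.allclose(A, B)
```

The same idea generalizes to the higher-dimensional symmetric and antisymmetric tensors that arise in coupled cluster methods, where the savings grow with the number of symmetric index groups.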
In this dissertation, we discuss several efforts that target performance and productivity enhancements for implementing tensor contraction expressions. Two main aspects are explored. In the first part, we present Operation Minimizer (OpMin), a domain-specific compiler which exploits tensor symmetries to perform effective algebraic transformations of tensor contraction expressions. In the second part, we introduce Dynamic Load-balanced Tensor Contractions (DLTC), a framework built on top of the Partitioned Global Address Space (PGAS) model and the task parallel execution model, for high-performance and scalable execution of tensor contraction expressions. Our system focuses on optimizations pertaining to operation minimization, load balancing, synchronization reduction, communication minimization, and data locality enhancement.
The major contributions of this dissertation are:
• Utilization of symmetry properties of tensors in the operation minimization problem of tensor contraction expressions.

• Design and implementation of a new domain-specific library that supports task-based parallel execution.

• Various performance and scalability enhancements that are demonstrated to be faster than other contemporary approaches for real-world chemistry problems.
The rest of the dissertation is organized as follows:
Chapter 2 briefly introduces background information and the patterns of tensor contractions targeted in this work, along with the programming models the system is built upon.
Chapter 3 gives an overview of the developed framework.
Chapter 4 explains how OpMin makes use of tensor symmetries in addressing the operation minimization problem.
Chapter 5 describes DLTC abstractions for task-based computation, and strategies for improving load balancing and application performance.
Chapter 6 discusses further enhancements of DLTC, focusing on communication and data locality.
Chapter 7 describes related work.
Chapter 8 points out possible future directions that can be explored.
Chapter 9 concludes the dissertation.
CHAPTER 2
Background
In this chapter, we begin by introducing the computational context of interest — tensor contraction expressions from the Coupled Cluster (CC) theory. Next, we review a prior system, the Tensor Contraction Engine (TCE), and point out performance improvement opportunities that motivate our work. Our approach is mainly built on top of two programming models: the Partitioned Global Address Space (PGAS) programming model and the task parallel programming model. We will introduce key features offered by these two programming models that provide the data management, communication operations, and dynamic load balancing functions used by our system.
2.1 Coupled Cluster Theory
We focus on tensor contraction expressions that occur in the Coupled Cluster (CC) [7] theory from the field of quantum chemistry. CC theory has evolved into a popular method for computing an approximate solution to the time-independent Schrödinger equation of the form
H|Ψ⟩ = E|Ψ⟩

where H is the Hamiltonian, |Ψ⟩ is the wave function, and E is the energy of the ground state.
The Schrödinger equation [64] is a partial differential equation that describes how the quantum state of a physical system changes with time. The Hamiltonian operator represents the energy of the nuclei and electrons in a molecule. The wave function is described in exponential form

|Ψ⟩ = e^T̂ |Φ⟩

where T̂ is the cluster operator that generates a linear combination of excited determinants from the reference wave function, |Φ⟩.
The cluster operator T̂ in CC is written in the form

T̂ = T̂_1 + T̂_2 + T̂_3 + ⋯ + T̂_n

where T̂_1 is the operator of all single excitations, T̂_2 is the operator of all double excitations, and so on. Each T̂_k is computed by a series of tensor contractions on tensors of dimension d ∈ {2, 4, ..., 2n}.
Variants of CC methods are derived depending on the number of excitations modeled. For example, one of the most popular methods — Coupled Cluster Singles and Doubles (CCSD) — truncates the operator at
T̂ = T̂_1 + T̂_2

with

T̂_1|Φ⟩ = ∑_{i,a} t_i^a |Φ_i^a⟩   and   T̂_2|Φ⟩ = ∑_{i<j, a<b} t_{ij}^{ab} |Φ_{ij}^{ab}⟩

where the indices i, j range over the O occupied orbitals and a, b range over the V virtual orbitals. If we set O = V = N, the computation cost for CCSD is O(N^6).
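As a back-of-the-envelope sketch of where this asymptotic cost comes from (a hypothetical helper for illustration, not OpMin's cost model from Chapter 4): the multiply-add count of a single contraction term is the product of the ranges of all distinct indices appearing in it, and the most expensive CCSD terms touch 2 occupied and 4 virtual indices.

```python
# Hypothetical helper: operation count of one contraction term as the
# product of all distinct index ranges it touches. A term with 2 occupied
# and 4 virtual indices costs O(O^2 * V^4); with O = V = N that is O(N^6).
def contraction_ops(num_occ, num_virt, O, V):
    return (O ** num_occ) * (V ** num_virt)

O, V = 10, 40
print(contraction_ops(2, 4, O, V))           # 10^2 * 40^4 = 256000000
print(contraction_ops(2, 4, 50, 50) == 50**6)  # True: O = V = N gives N^6
```

In practice V is usually several times larger than O, which is why the split between occupied and virtual index ranges (rather than a single N) matters for the cost models discussed in Chapter 4.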
The abbreviation for a particular Coupled Cluster method usually begins with the letters “CC”, followed by:

• S — for single excitations (singles)
• D — for double excitations (doubles)
• T — for triple excitations (triples)
• Q — for quadruple excitations (quadruples), and so forth.
The complexity and computation cost of an implementation of an equation increase sharply with the highest level of excitation. For many applications, sufficient accuracy can be obtained with CCSD [71]. The more accurate but more expensive CCSD(T) [60] improves upon CCSD by adding a perturbative estimate of the energy contributed by the T̂_3 operator, denoted by the parentheses. CCSD(T), which answers many of the questions that arise in studies of chemical systems, is recognized by many as the gold standard for its good compromise between accuracy and computation cost. Even more complex variants, CCSDT [52, 53, 65], CCSDTQ [39], and CCSDTQP [50], are extremely computation and memory intensive, and are only used for very high accuracy calculations on small molecules.
In Chapter 4, we evaluate the performance of OpMin using tensor contraction expressions from CCSD, CCSDT, and the Equation-of-Motion Coupled Cluster Singles and Doubles (EOM-CCSD) method, an extension of CCSD for modeling excited states. In Chapter 5 and Chapter 6, we assess DLTC using the Coupled Cluster Doubles (CCD) and CCSD equations.
2.2 Tensor Contraction Expressions
CC methods can be expressed in terms of tensor contraction expressions, which are often represented as a collection of sums of products of multi-dimensional arrays. A tensor is a generalized matrix with any number of indices. Here we explain tensor indices and their ranges using a matrix (2-D tensor) example. Let M be a matrix denoted as

M[r, c]

where r is an index of range R, and c is an index of range C. An index name simply refers to a dimension of the tensor, without any implication about the tensor's actual data layout in memory. Tensor reordering may be necessary when accessing a tensor in a contraction.
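A minimal NumPy sketch of such a reordering (illustrative only; the tensor names and shapes are assumptions, not this work's implementation): an index name says nothing about memory layout, so a tensor may need to be permuted before a contraction can be computed as a plain matrix multiplication.

```python
import numpy as np

# Contract S[r, k] = sum_c M[r, c] * T[k, c]. The contraction index c is
# the last dimension of T, so T must be reordered before using a matrix
# product that expects the summed index first.
R, C, K = 3, 4, 5
rng = np.random.default_rng(0)
M = rng.random((R, C))          # M[r, c]
T = rng.random((K, C))          # T[k, c]

T_ck = T.transpose(1, 0)        # reorder: T_ck[c, k]
S = M @ T_ck                    # S[r, k] = sum_c M[r, c] * T_ck[c, k]

# einsum performs the same contraction without an explicit reorder.
assert np.allclose(S, np.einsum('rc,kc->rk', M, T))
```

For high-dimensional tensors the same permute-then-multiply pattern applies, and the cost of the reordering itself becomes a factor in choosing a contraction order.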
Consider the following example of a tensor contraction expression