
A Framework for Performance Optimization of Contraction Expressions

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Pai-Wei Lai, M.S.

Graduate Program in Computer Science and Engineering

The Ohio State University 2014

Dissertation Committee:

P. Sadayappan, Advisor

Gagan Agrawal

Atanas Rountev

© Copyright by Pai-Wei Lai, 2014

ABSTRACT

Attaining high performance and productivity in the evaluation of scientific applications is a challenging task for computer scientists, and is often critical to the advancement of many scientific disciplines. In this dissertation, we focus on the development of high performance, scalable parallel programs for a class of scientific computations in quantum chemistry — tensor contraction expressions.

Tensor contraction expressions are generalized forms of multi-dimensional matrix operations, which form the fundamental computational constructs in electronic structure modeling. Tensors in these computations exhibit various types of symmetry and sparsity. Contractions on such tensors are highly irregular, with significant computation and communication cost if data locality is not considered in the implementation. Prior efforts have focused on implementing tensor contractions using block-sparse representations. Many parallel programs for tensor contractions have been successfully implemented; however, their performance is unsatisfactory on emerging computer systems.

In this work, we investigate several performance bottlenecks of previous approaches, and present corresponding techniques to optimize operations, parallelism, workload balance, and data locality. We exploit symmetric properties of tensors to minimize the operation count of tensor contraction expressions through algebraic transformation. Rules are formulated to discover symmetric properties of intermediate tensors; cost models and algorithms are developed to reduce operation counts. Our approaches result in significant operation count reductions compared to many other state-of-the-art computational chemistry software packages, using real-world tensor contraction expressions from coupled cluster methods.

In order to achieve high performance and scalability, multiple programming models are often used in a single application. We design a domain-specific framework which utilizes the partitioned global address space programming model for data management and inter-node communication. We employ the task parallel execution model for dynamic load balancing. Tensor contraction expressions are decomposed into a collection of computational tasks operating on tensor tiles. We eliminate most of the synchronization steps by executing independent tensor contractions concurrently, and present mechanisms to improve their data locality. Our framework shows improved performance and scalability for tensor contraction expressions from representative coupled cluster methods.

To Li-Lun, for the laughter every day;

To Aaron, for the crying every night;

And to Ola, for the purring once in a while.

ACKNOWLEDGMENTS

I have relied on the help of many people in this effort. I would like to express my greatest appreciation to my advisor, Dr. P. Sadayappan, for his continuous guidance and support over the past five years. I feel very fortunate to have worked with him and have learned a great deal from him. I am grateful to my dissertation committee, Dr. Gagan Agrawal and Dr. Atanas Rountev, and the graduate faculty representative, Dr. Christopher Miller, for their valuable suggestions to improve my work.

I am sincerely grateful to have had the opportunity to work so closely with many of the best minds in computer science and chemistry. I thank my mentor, Dr. Sriram Krishnamoorthy, for two memorable summer internships at Pacific Northwest National Laboratory, and Dr. Karol Kowalski, Dr. Edward Valeev, Dr. Marcel Nooijen, and Dr. Dmitry Lyakh, for helping me understand the chemistry aspects of my research. Special thanks to Dr. Albert Hartono and Dr. Huaijian Zhang, for their assistance in the enhancement of OpMin; and Dr. Wenjing Ma, for providing important insights into the development of DLTC.

I am thankful to all my friends in Columbus for making my life here truly amazing and enjoyable: Qingpeng Niu, Humayun Arafat, Naznin Fauzia, Kevin Stock, Mahesh Ravishankar, Sanket Tavarageri, Martin Kong, Justin Holewinski, Tom Henretty, Samyam Rajbhandari, Akshay Nikam, Venmugil Elango, and many others. I'm also particularly grateful for the constant support of Yu-Keng Shih, Chun-Ming Chen, En-Hsiang Tseng, Debbie Lee, Ko-Chih Wang, Kang-Che Lee, Tzu-Hsuan Wei, my Saturday morning pickup game friends, and countless friends from the Taiwanese Student Association.

Finally, I am deeply grateful for the love and support of my family: my inspiring parents, Feng-Wei and Mei-Chen; my lovely wife, Li-Lun; my adorable son, Aaron; my brother, Pai-Ching; and my cat, Ola. Thank you for always backing me up.

Words cannot express how much I love you all.

Pai-Wei Lai
Columbus, Ohio
August 25, 2014

VITA

January 9, 1984 ...... Born: Taipei, Taiwan

June 2001 ...... B.S. Computer Science, National Tsing Hua University, Hsinchu, Taiwan

June 2005 ...... M.S. Computer Science, National Tsing Hua University, Hsinchu, Taiwan

2009 — present ...... Graduate Research Associate, The Ohio State University, Columbus, OH, USA

Summer 2011 ...... Ph.D. Intern, Pacific Northwest National Lab, Richland, WA, USA

Summer 2012 ...... Ph.D. Intern, Pacific Northwest National Lab, Richland, WA, USA

PUBLICATIONS

Qingpeng Niu, Pai-Wei Lai, S.M. Faisal, Srinivasan Parthasarathy, and P. Sadayappan: “A Fast Implementation of MLR-MCL Algorithm on Multi-core Processors”. To appear in International Conference on High Performance Computing (HiPC’14), Goa, India, December 17–20, 2014.

Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, Sriram Krishnamoorthy, and P. Sadayappan: “Communication-Optimal Framework for Contracting Distributed Tensors”. To appear in Supercomputing (SC’14), New Orleans, LA, USA, November 16–21, 2014.

Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, Sriram Krishnamoorthy, and P. Sadayappan: “CAST: Contraction Algorithms for Symmetric Tensors”. To appear in International Conference on Parallel Processing (ICPP’14), Minneapolis, MN, USA, September 9–12, 2014.

Pai-Wei Lai, Humayun Arafat, Venmugil Elango, and P. Sadayappan: “Accelerating Strassen-Winograd’s Algorithm on GPUs”. In International Conference on High Performance Computing (HiPC’13), Bengaluru (Bangalore), India, December 18–21, 2013.

Pai-Wei Lai, Kevin Stock, Samyam Rajbhandari, Sriram Krishnamoorthy, and P. Sadayappan: “A Framework for Load Balancing of Tensor Contraction Expressions via Dynamic Task Partitioning”. In Supercomputing (SC’13), Denver, CO, USA, November 17–22, 2013.

Pai-Wei Lai, Huaijian Zhang, Samyam Rajbhandari, Edward Valeev, Karol Kowalski, and P. Sadayappan: “Effective Utilization of Tensor Symmetry in Operation Optimization of Tensor Contraction Expressions”. In International Conference on Computational Science (ICCS’12), Omaha, NE, USA, June 4–6, 2012.

FIELDS OF STUDY

Major: Computer Science and Engineering

Studies in:
  High Performance Computing (Prof. P. Sadayappan)
  Software Engineering (Prof. Atanas Rountev)
  Artificial Intelligence (Prof. Eric Fosler-Lussier)

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vii

List of Figures ...... xiv

List of Tables ...... xvii

List of Algorithms ...... xix

List of Listings ...... xx

Chapters:

1. Introduction ...... 1

2. Background ...... 6

2.1 Coupled Cluster Theory ...... 6
2.2 Tensor Contraction Expressions ...... 9
2.3 Tensor Contraction Engine ...... 11
2.4 Partitioned Global Address Space Programming Models ...... 13
2.5 Task Parallel Programming Models ...... 15
2.6 Domain Specific Languages ...... 17

3. Overview ...... 19

3.1 Operation Minimizer (OpMin) ...... 21
3.1.1 Language parser ...... 21
3.1.2 Operation optimizer ...... 22
3.1.3 Code generator ...... 23
3.2 Dynamic Load-balanced Tensor Contractions (DLTC) ...... 23
3.2.1 Dynamic task partitioning ...... 24
3.2.2 Dynamic task execution ...... 25

4. Operation Minimization on Symmetric Tensors ...... 27

4.1 Introduction ...... 27
4.2 Symmetry Properties of Tensors ...... 29
4.2.1 Antisymmetry ...... 29
4.2.2 Vertex symmetry ...... 30
4.3 Methods ...... 31
4.3.1 Derivation rules ...... 32
4.3.2 Cost models ...... 34
4.3.3 Operation minimization algorithms ...... 36
4.4 Results and Discussion ...... 42
4.4.1 Experimental setup ...... 42
4.4.2 Importance of Symmetry Properties ...... 43
4.4.3 Performance evaluation of OpMin algorithms ...... 47
4.4.4 Performance comparison of OpMin vs. NWChem ...... 48
4.4.5 Performance comparison of OpMin vs. PSI3 ...... 50
4.4.6 Performance comparison of OpMin vs. Genetic ...... 52
4.4.7 Choosing a small set of optimized forms ...... 52
4.5 Summary ...... 55

5. Dynamic Load-balanced Tensor Contractions ...... 57

5.1 Introduction ...... 57
5.2 Problem ...... 60
5.3 Methods ...... 63
5.3.1 Domain-specific primitives ...... 64
5.3.2 NWChem proxy application ...... 75
5.3.3 Exploiting inter-contraction parallelism ...... 76
5.3.4 Dynamic task partitioning ...... 78
5.3.5 Task distribution and execution ...... 80
5.3.6 Double buffering ...... 81
5.4 Results and Discussion ...... 83
5.4.1 Experimental Setup ...... 83
5.4.2 Performance comparison of DLTC vs. CTF ...... 85
5.4.3 Performance comparison of DLTC vs. NWChem ...... 90
5.5 Summary ...... 90

6. Data Locality Enhancement of DLTC via Caching ...... 92

6.1 Introduction ...... 92
6.2 Preliminary ...... 94
6.2.1 Block-sparse representation ...... 94
6.2.2 Subroutines generated by TCE ...... 98
6.2.3 Target equations ...... 101
6.3 Methods ...... 103
6.3.1 Communication traffic analysis ...... 104
6.3.2 Communication optimization ...... 107
6.4 Results and Discussion ...... 108
6.4.1 Experimental setup ...... 108
6.4.2 Hit rate analysis ...... 109
6.4.3 Communication volume analysis ...... 111
6.4.4 Caching performance ...... 111
6.5 Summary ...... 113

7. Related Work ...... 115

7.1 Matrix Multiplication Algorithms ...... 115
7.2 Optimizing TCE ...... 117
7.3 Other Implementations for Tensor Contractions ...... 120

8. Future Work ...... 123

9. Conclusion ...... 125

Appendices:

A. Tensor Contraction Expressions ...... 126

A.1 Input equations to OpMin ...... 126
A.2 Equations implemented in the DLTC ...... 156

Bibliography ...... 158

LIST OF FIGURES

Figure Page

2.1 The workflow of the Tensor Contraction Engine...... 12

2.2 The PGAS programming model...... 14

3.1 Overview of system...... 20

4.1 Operation count for V/O = 40 optimized at different ratios...... 54

4.2 Percentage difference from optimal operation count for different V/O ratios optimized at V/O = 5 and V/O = 40...... 54

5.1 DAG representation of CCD-T2...... 62

5.2 Iteration space of iterators...... 71

5.3 Layered DAG representation of CCD-T2 in DLTC...... 77

5.4 Influence of tile size selection. The profiling is performed on CCD. Problem size: 100. Number of cores: 512...... 78

5.5 Partitioning independent expressions into a number of tasks in a task pool...... 80

5.6 Double buffering...... 82

5.7 Performance results for CCD. Number of cores: 512...... 86

5.8 Performance results for CCSD. Number of cores: 512...... 87

5.9 Total execution time breakdown of DLTC for CCSD on 512 cores. Problem size: 100...... 87

5.10 Strong scaling comparison for CCD. Speedups are based on the best time from DLTC-GA/Scioto on 128 cores. Dotted line indicates ideal speedup. Problem size: 100...... 88

5.11 Strong scaling comparison for CCSD. Speedups are based on the best time from DLTC-GA/Scioto on 128 cores. Dotted line indicates ideal speedup. Problem size: 100...... 88

5.12 Performance comparison of DLTC vs. NWChem for CCSD on Uracil. 89

5.13 Performance comparison of DLTC vs. NWChem for CCSD on GFP. . 89

6.1 DAG representation of CCSD-T1...... 96

6.2 Block-sparse representation of a 2-D tensor...... 97

6.3 TCE implementation of CCSD-T1...... 99

6.4 Layered DAG representation of CCSD-T1 in DLTC...... 101

6.5 Viewing the F integral as four sub-tensors...... 102

6.6 The memory model and task execution model in the TCE...... 104

6.7 Total execution time breakdown of CCD-T2 on 256 cores. Problem size: 100...... 105

6.8 Communication time breakdown of CCD-T2 on 256 cores for 2-D and 4-D tiles. Problem size: 100...... 106

6.9 Communication time breakdown of CCD-T2 on 256 cores for multiplication and addition expressions. Problem size: 100...... 106

6.10 Communication time breakdown of 4-D tiles for different tensors. The profiling is performed for CCD-T2 on 256 cores. Problem size: 100...... 106

6.11 The memory model and task execution model in DLTC with data replication and caching...... 108

6.12 Best hit rate achievable of DLTC for CCD-T2. Problem size: 100. ... 109

6.13 Best hit rate achievable of DLTC for CCSD-T2. Problem size: 100. .. 110

6.14 Communication volume of DLTC for CCD-T2 and CCSD-T2. Problem size: 100...... 110

6.15 Communication time of DLTC for CCD-T2. Problem size: 100. .... 112

6.16 Communication time of DLTC for CCSD-T2. Problem size: 100. ... 112

6.17 Total execution time of DLTC for CCD-T2. Problem size: 100...... 113

6.18 Total execution time of DLTC for CCSD-T2. Problem size: 100. .... 114

LIST OF TABLES

Table Page

4.1 Derivation rules for addition expressions...... 32

4.2 Derivation rules for multiplication expressions...... 33

4.3 Characteristics of input equations to OpMin for performance evaluation...... 44

4.4 Operation count comparison of dense vs. symmetric tensor contractions...... 44

4.5 Performance evaluation of OpMin with different O and V values for UCCSD-T2 (V = a ⋅ O, a and b in the range of 1 to 3)...... 45

4.6 Performance evaluation of combining STO, FACT, and CSE algorithms in OpMin...... 46

4.7 O and V values of H₂O for different basis sets...... 48

4.8 Operation count comparison of OpMin vs. NWChem for CCSD and CCSDT equations. Various combinations of basis sets (cc-pVDZ/cc-pVTZ) and symmetry point groups (c1/c2v) are tested for H₂O...... 49

4.9 Operation count comparison of OpMin vs. NWChem for EOMCCSD equations. Two basis sets (cc-pVDZ/cc-pVTZ) are tested for H₂O...... 50

4.10 Operation count comparison of OpMin vs. PSI3 for CCSD equations. Various combinations of basis sets (cc-pVDZ/cc-pVTZ) and symmetry point groups (c1/c2v) are tested for H₂O...... 51

4.11 Operation count comparison of OpMin vs. Genetic for CCSD and CCSDT equations. Two basis sets (cc-pVDZ/cc-pVTZ) are tested for H₂O using the c1 point symmetry group. Numbers for Genetic are extracted from [24, 28]...... 52

5.1 Tensor contraction expressions of CCD-T2...... 61

5.2 SLOC comparison of DLTC vs. NWChem TCE...... 87

6.1 Tensor contraction expressions of CCSD-T1...... 95

6.2 Input and output tensors of CCD-T2 and CCSD-T2 equations. .... 103

LIST OF ALGORITHMS

Algorithm Page

2.1 Task parallel execution model...... 15

4.1 Single term optimization ...... 38

4.2 Heuristic factorization ...... 40

4.3 Common sub-expression elimination ...... 42

6.1 TCE implementation ...... 100

LIST OF LISTINGS

1.1 Example of a tensor contraction expression from the CC theory of quantum chemistry...... 2

2.1 A simple loop nest implementation of S[a, b, i, j] += A[a, k] ∗ B[b, l] ∗ C[k, l, i, j]...... 10

3.1 Example of an input to OpMin...... 22

5.1 Example of specifying a tensor contraction: I[h5, h1] += T[p6, p7, h1, h8] ∗ V[h5, h8, p6, p7]...... 68

5.2 Example of 2-D matrix-matrix multiplication using three loops...... 72

5.3 Creating a single iterator for 2-D matrix-matrix multiplication...... 72

5.4 Creating two iterators for 2-D matrix-matrix multiplication by dividing on K...... 72

5.5 Task function for 2-D matrix-matrix multiplication...... 73

5.6 Task function for computing multiplication expressions...... 73

5.7 Task function for computing addition expressions...... 74

5.8 NXTVAL: A centralized load balancing using a global shared counter...... 76

5.9 Dynamic task partitioning...... 81

5.10 Task function for multiplication with double buffering...... 83

CHAPTER 1

Introduction

Computational methods which model the structures and interactions among molecules play an important role in many research fields, such as the chemical, physical, and biological sciences. Computationally intensive components in electronic structure calculations are often expressible as a set of tensor contraction expressions (or simply tensor contractions). They consume a large fraction of computer resources at supercomputer centers nationwide. Many implementations for computing tensor contraction expressions have limitations on the problem sizes that are solvable, due to constrained memory and time.

Just as a vector is described by an array, a tensor is a generalized matrix often represented by a multi-dimensional array in a computer system. A tensor contraction is a sum-of-products of tensor elements. Listing 1.1 shows an example of a tensor contraction expression from the Coupled Cluster (CC) [7] theory. The CC theory is one of the most prevalent methods used for solving chemical problems, providing an accurate quantum-mechanical description of ground and excited states of chemical systems. The accuracy provided by the CC methods comes at a great computation cost. The computation cost for the example shown in Listing 1.1 is

Listing 1.1 Example of a tensor contraction expression from the CC theory of quantum chemistry.

r_vo[p1,h1] = f_vo[p1,h1]
            + f_vv[p1,p2]*t_vo[p2,h1]
            - f_oo[h2,h1]*t_vo[p1,h2]
            + 2.0 * f_ov[h2,p2]*t_vvoo[p1,p2,h1,h2]
            - f_ov[h2,p2]*t_vvoo[p2,p1,h1,h2]
            - v_vovv[p1,h2,p2,p3]*t_vvoo[p3,p2,h1,h2]
            + 2.0 * v_vovv[p1,h2,p2,p3]*t_vvoo[p2,p3,h1,h2]
            - v_vovo[p1,h2,p2,h1]*t_vo[p2,h2]
            + 2.0 * v_ovvo[h2,p1,p2,h1]*t_vo[p2,h2]
            + v_oovo[h2,h3,p2,h1]*t_vvoo[p1,p2,h2,h3]
            - 2.0 * v_oovo[h2,h3,p2,h1]*t_vvoo[p1,p2,h3,h2]
            - f_ov[h2,p2]*t_vo[p2,h1]*t_vo[p1,h2]
            + 2.0 * v_vovv[p1,h2,p2,p3]*t_vo[p2,h1]*t_vo[p3,h2]
            - v_vovv[p1,h2,p2,p3]*t_vo[p3,h1]*t_vo[p2,h2]
            + v_oovv[h2,h3,p2,p3]*t_vo[p2,h1]*t_vvoo[p1,p3,h3,h2]
            - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p2,h1]*t_vvoo[p1,p3,h2,h3]
            + v_oovv[h2,h3,p2,p3]*t_vo[p1,h2]*t_vvoo[p3,p2,h1,h3]
            - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p1,h2]*t_vvoo[p2,p3,h1,h3]
            + 4.0 * v_oovv[h2,h3,p2,p3]*t_vo[p2,h2]*t_vvoo[p1,p3,h1,h3]
            - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p2,h2]*t_vvoo[p3,p1,h1,h3]
            - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p3,h2]*t_vvoo[p1,p2,h1,h3]
            + v_oovv[h2,h3,p2,p3]*t_vo[p3,h2]*t_vvoo[p2,p1,h1,h3]
            - 2.0 * v_oovo[h2,h3,p2,h1]*t_vo[p2,h2]*t_vo[p1,h3]
            + v_oovo[h2,h3,p2,h1]*t_vo[p1,h2]*t_vo[p2,h3]
            - 2.0 * v_oovv[h2,h3,p2,p3]*t_vo[p2,h1]*t_vo[p1,h2]*t_vo[p3,h3]
            + v_oovv[h2,h3,p2,p3]*t_vo[p2,h1]*t_vo[p3,h2]*t_vo[p1,h3];

O(N^6), where N is the size of each dimension of the tensor indices (each term in Listing 1.1 involves at most six distinct indices). When N is sufficiently large, it is fairly common to run the computation on thousands of computer cores for many hours. Therefore, improving the efficiency and scalability of tensor contraction expressions is critical for scientific progress.

Manually implementing tensor contraction expressions requires a vast amount of time and effort from domain experts. The development process is complicated, tedious, and difficult to debug, especially for methods with very sophisticated formulations. A number of automated approaches have been developed to accelerate the development process of tensor contractions. For instance, the Tensor Contraction Engine (TCE) [6, 9, 10, 30, 32] was a collaborative effort by computer scientists and quantum chemists to ease the process of developing efficient implementations of tensor contraction expressions.

The TCE automatically compiles tensor contraction expressions specified in high-level mathematical forms into efficient computer programs. The TCE attempts to form the most efficient sequence of contractions and to minimize memory usage via a series of optimization phases: algebraic transformation, memory minimization, loop fusion, space-time transformation, etc. As a result, the TCE successfully provided the first sequential and parallel computer implementations for a wide range of many-body methods. These programs are integrated and distributed in production chemistry software, such as NWChem [74], one of the best-known computational chemistry software suites. With the help of the TCE, chemists can focus their efforts on developing their algorithms, instead of spending inordinate amounts of time on parallel programming and debugging.

However, as both computational methods and computer architectures advance and become increasingly complex, the performance and scalability of the programs generated by the TCE have become less satisfactory. Significant performance tuning is required to scale these programs to more than a few hundred or thousand cores. Several performance bottlenecks in these implementations have been discovered. For example, the parallelism of tensor contractions is only exploited within each contraction. The load balancing scheme is simple and incurs high communication overhead on large numbers of cores. Data locality is not taken into consideration in the TCE.

Symmetry properties that exist among one or multiple index groups of tensors provide significant opportunities for memory preservation and computation reduction. To date, the data representation, computation algorithms, and communication patterns for efficient parallel contraction of tensor expressions with symmetric properties are not very well understood. Most of the existing studies make use of tensor symmetries to a limited extent and perform redundant work and communication.

In this dissertation, we discuss several efforts that target performance and productivity enhancements for implementing tensor contraction expressions. Two main aspects are explored. In the first part, we present the Operation Minimizer (OpMin), a domain-specific compiler which exploits tensor symmetries for performing effective algebraic transformation of tensor contraction expressions. In the second part, we introduce Dynamic Load-balanced Tensor Contractions (DLTC), a framework built on top of the Partitioned Global Address Space (PGAS) model and the task parallel execution model, for high-performance and scalable execution of tensor contraction expressions. Our system focuses on optimizations pertaining to operation minimization, load balancing, synchronization reduction, communication minimization, and data locality enhancement.

The major contributions of this dissertation are:

• Utilization of symmetry properties of tensors in the operation minimization problem of tensor contraction expressions.

• Design and implementation of a new domain-specific library that supports task-based parallel execution.

• Various performance and scalability enhancements that are demonstrated to be faster than other contemporary approaches for real-world chemistry problems.

The rest of the dissertation is organized as follows:

Chapter 2 briefly introduces background information and the patterns of tensor contractions targeted in this work, along with the programming models the system is built upon.

Chapter 3 gives an overview of the developed framework.

Chapter 4 explains how OpMin makes use of tensor symmetries in addressing the operation minimization problem.

Chapter 5 describes DLTC abstractions for task-based computation, and strategies for improving load balancing and application performance.

Chapter 6 discusses further enhancement of DLTC, focusing on communication and data locality.

Chapter 7 describes related work.

Chapter 8 points out possible future directions that can be explored.

Chapter 9 concludes the dissertation.

CHAPTER 2

Background

In this chapter, we begin by introducing the computational context of interest — tensor contraction expressions from the Coupled Cluster (CC) theory. Next, we review a prior system, the Tensor Contraction Engine (TCE), and point out performance improvement opportunities that motivate our work. Our approach is mainly built on top of two programming models: the Partitioned Global Address Space (PGAS) programming model and the task parallel programming model. We will introduce key features offered by these two programming models that provide data management, communication operations, and dynamic load balancing functions used by our system.

2.1 Coupled Cluster Theory

We focus on tensor contraction expressions that occur in the Coupled Cluster (CC) [7] theory from the field of quantum chemistry. CC theory has evolved into a popular method for computing an approximate solution to the time-independent Schrödinger equation of the form

H|Ψ⟩ = E|Ψ⟩

where H is the Hamiltonian, |Ψ⟩ is the wave function, and E is the energy of the ground state.

The Schrödinger equation [64] is a partial differential equation that describes how the quantum state of some physical system changes with time. The Hamiltonian operator represents the energy of the nuclei and electrons in a molecule. The wave function is described in exponential form

|Ψ⟩ = e^T̂ |Φ⟩

where T̂ is the cluster operator that generates a linear combination of excited determinants from the reference wave function, |Φ⟩.

The cluster operator T̂ in CC is written in the form

T̂ = T̂₁ + T̂₂ + T̂₃ + ⋯ + T̂ₙ

where T̂₁ is the operator of all single excitations, T̂₂ is the operator of all double excitations, and so on. Each T̂ₙ is computed by a series of tensor contractions on tensors of dimension d ∈ {2, 4, ..., 2n}.

Variants of CC methods are derived depending on the number of excitations modeled. For example, one of the most popular methods — Coupled Cluster Singles and Doubles (CCSD) — truncates the operator at

T̂ = T̂₁ + T̂₂

with

T̂₁|Φ⟩ = Σ_{i,a} t_i^a |Φ_i^a⟩   and   T̂₂|Φ⟩ = Σ_{i<j, a<b} t_{ij}^{ab} |Φ_{ij}^{ab}⟩

where the indices i, j run over occupied orbitals and a, b over virtual orbitals; O is the number of occupied orbitals and V is the number of virtual orbitals. If we set O = V = N, the computation cost for CCSD is O(N^6).

The abbreviation for a particular Coupled Cluster method usually begins with the letters “CC”, followed by:

• S — for single excitations (singles)

• D — for double excitations (doubles)

• T — for triple excitations (triples)

• Q — for quadruple excitations (quadruples), and so forth.

The complexity and computation cost of an implementation of an equation increase sharply with the highest level of excitation. For many applications, sufficient accuracy can be obtained with CCSD [71]. The more accurate but more expensive CCSD(T) [60] improves upon CCSD by using an additional perturbation estimate of the energy contributed by the T̂₃ operator, indicated by the parentheses. CCSD(T), which answers many of the questions that arise in studies of chemical systems, is recognized by many to be the gold standard for its good compromise between accuracy and computation cost. Even more complex variants, CCSDT [52, 53, 65], CCSDTQ [39], and CCSDTQP [50], are extremely computation and memory intensive, and are only used for very high accuracy calculations on small molecules.

In Chapter 4, we evaluate the performance of OpMin using tensor contraction expressions from CCSD, CCSDT, and Equation-of-Motion Coupled Cluster Singles and Doubles (EOM-CCSD), an extension of CCSD for modeling excited states. In Chapter 5 and Chapter 6, we assess DLTC using Coupled Cluster Doubles (CCD) and CCSD equations.

2.2 Tensor Contraction Expressions

CC methods can be expressed in terms of tensor contraction expressions, which are often represented as a collection of sum-of-products of multi-dimensional arrays. A tensor is a generalized matrix with any number of indices. Here we explain tensor indices and their ranges using a matrix (2-D tensor) example. Let M be a matrix denoted as

M[r, c]

where r is an index of range R, and c is an index of range C. An index name simply refers to a dimension of the tensor, without any implication on the tensor’s actual data layout in memory. Tensor reordering may be necessary when accessing a tensor in a contraction.

Consider the following example of a tensor contraction expression

S[a, b, i, j] += Σ_{k,l} A[a, k] ∗ B[b, l] ∗ C[k, l, i, j].

For notational brevity, we will use the Einstein summation convention [23] (or simply the Einstein notation) — when an index appears twice in an expression, it implies summation over all the values of that index. In this form, the summation symbol is omitted. The previous expression is expressed in compact notation as

S[a, b, i, j] += A[a, k] ∗ B[b, l] ∗ C[k, l, i, j]

where S is a 4-D tensor computed from two 2-D tensors, A and B, and one 4-D tensor, C.

Listing 2.1 A simple loop nest implementation of S[a, b, i, j] += A[a, k] ∗ B[b, l] ∗ C[k, l, i, j].

for (a = 0; a < N; ++a)
  for (b = 0; b < N; ++b)
    for (i = 0; i < N; ++i)
      for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
          for (l = 0; l < N; ++l)
            S[a,b,i,j] += A[a,k] * B[b,l] * C[k,l,i,j];

The indices a, b, i, and j are referred to as external indices, due to their preservation in the output tensor S after the computation. An external index comes from either one of the input tensors. The indices k and l appear in two of the input tensors (similar to the common summation index in 2-D matrix-matrix multiplication), and are referred to as contracted, or summation, indices. A contracted index appears in exactly two input tensors and vanishes after the contraction, i.e., it does not appear in the result tensor.

Listing 2.1 shows a straightforward implementation of this expression in C-style pseudo code. A loop nest of six loops, one for each index, is used. Since all the indices are in the same range N, the memory cost for storing the tensors is O(N^4), and the computation cost for calculating this expression is O(N^6).

A tensor can contain millions to billions of elements, depending on its dimensionality and the size of each dimension (the value of the index range). Tensor indices can be in different ranges and appear in arbitrary order. The typical range of a tensor index is from tens to a few hundred. In CC equations, tensor indices span two types of range: occupied (O) and virtual (V), where O describes the number of electrons in a modeled system, and V the number of modeled excited states, which depends on the accuracy of the solution sought. Higher range values provide better accuracy. In this dissertation, a common problem size N is often used to represent both O and V in our experiments and analyses.

2.3 Tensor Contraction Engine

The long development time of efficient programs for new computational models is often a limiting factor in the rate of scientific research progress. Therefore, automated code synthesis approaches have been developed for many specific domains, such as the SPIRAL project [59] for digital signal processing, and the Tensor Contraction Engine (TCE) [6, 9–11, 32] for many-body methods in quantum chemistry. The TCE is a domain-specific compiler that generates optimized implementations of tensor contraction expressions.

Figure 2.1 illustrates the work flow of the TCE. The input language for the TCE is a high-level mathematical language which allows tensor contraction expressions to be specified in a form natural to chemistry experts. An input tensor contraction expression is parsed and transformed into a simple expression tree form for further optimizations. In Figure 2.1, two major paths represent the structure of the “prototype” TCE [32] (shorter path) and the “optimizing” TCE, which performs additional optimizations (longer path).

[Figure 2.1 depicts the TCE workflow: the original expression is parsed into a simple expression tree; operation minimization, loop fusion, and other simpler optimizations yield an optimized expression tree; code generation then produces the generated code.]

Figure 2.1: The workflow of the Tensor Contraction Engine.

The “prototype” TCE was developed by So Hirata [32] at Pacific Northwest National Laboratory in 2003. It generated parallel implementations of a broad range of quantum chemistry methods. These programs have been integrated and distributed as part of the NWChem [74] and UTChem [77] computational chemistry software packages. Details of the algorithms implemented in the TCE are described in Section 6.2.

The “optimizing” TCE initiated several research studies aimed at improving the performance and functionality of the “prototype” TCE in the following broad categories of problems encountered: (1) operation minimization, (2) memory minimization, (3) space-time transformation, (4) data locality optimization, (5) data distribution and partitioning, and (6) computation optimization. Details on these efforts in the development of the “optimizing” TCE are presented in Section 7.2.

In this dissertation, we revisit the operation minimization problem and focus on new algorithms for optimizing the operation count of tensor contraction expressions with symmetry properties. We also identify performance bottlenecks in the TCE and provide solutions by developing a different approach to implementing tensor contractions. Rather than designing a compiler, we present a domain-specific library for abstracting tensor contraction expressions into tasks.

2.4 Partitioned Global Address Space Programming Models

A Partitioned Global Address Space (PGAS) programming model provides programmers with a global view of shared data and allows asynchronous accesses to data. The PGAS programming model is especially attractive for developing irregular and dynamic applications on distributed memory systems.

As shown in Figure 2.2, PGAS is a distributed-shared memory model, which combines merits of both shared memory and distributed memory. A process can access a local address space and a global address space, which is partitioned and distributed across the memories of multiple nodes. Data stored in the global address space can be accessed through efficient one-sided operations that copy data between the global and local address spaces. Communication takes place when data moves between local and remote memories.

[Figure 2.2 contrasts the distributed memory, shared memory, and PGAS models in terms of how processes view their address spaces.]

Figure 2.2: The PGAS programming model.

The PGAS programming model relaxes conventional two-sided communication semantics, and allows programmers to access remote data without the cooperation of the remote processor. The one-sided communication operations are provided by Remote Direct Memory Access (RDMA) over the network infrastructure. Language extensions and libraries such as Unified Parallel C (UPC) [73], Co-Array Fortran (CAF) [54], Titanium [20], and Global Arrays (GA) [51] are built on the concepts of the PGAS programming model.

GA is a PGAS model in library form used in many scientific applications, such as NWChem [74] and MOLPRO [76]. GA is fully interoperable with MPI and is built on top of the Aggregate Remote Memory Copy Interface (ARMCI), an RDMA communication library that offers a rich set of one-sided communication operations.

Algorithm 2.1: Task parallel execution model.
    Let T be a task pool;
    Add tasks t₁, t₂, t₃, ... into T;
    while t ← GetNextTask(T) do
        Execute t;
    end

In GA programs, one-sided communication operations such as GET, PUT, and ACC (accumulate) allow application developers to read and write data across nodes, using the abstraction of a global shared address space. All needed inter-processor communication is automatically managed by the GA runtime.

In this work, we employ GA to manage the tensor data. We convert tensor contractions into a number of get-compute-accumulate tasks — a process first fetches a portion of tensor data into its local buffer, performs the computation locally, and then writes the result to a remote memory location.
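For illustration, the minimal C sketch below shows what one such get-compute-accumulate step could look like with GA's one-sided operations. The array dimensions, tile size, tile coordinates, and the placeholder "compute" loop are assumptions made for this example; this is not the actual DLTC task body.

#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

#define N 1024                 /* assumed global matrix dimension */
#define T 16                   /* assumed tile edge length        */

/* One get-compute-accumulate step on tile (row, col):
 * fetch a remote tile, operate on it locally, accumulate the result. */
static void tile_task(int g_in, int g_out, int row, int col)
{
    double buf[T * T], one = 1.0;
    int lo[2] = { row * T, col * T };
    int hi[2] = { row * T + T - 1, col * T + T - 1 };
    int ld[1] = { T };

    NGA_Get(g_in, lo, hi, buf, ld);          /* one-sided read of a remote tile */
    for (int i = 0; i < T * T; ++i)          /* stand-in for the local compute  */
        buf[i] *= 2.0;
    NGA_Acc(g_out, lo, hi, buf, ld, &one);   /* one-sided accumulate of result  */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);        /* scratch memory for GA  */

    int dims[2] = { N, N };
    int g_in  = NGA_Create(C_DBL, 2, dims, "in",  NULL);
    int g_out = NGA_Create(C_DBL, 2, dims, "out", NULL);
    GA_Zero(g_in);
    GA_Zero(g_out);

    if (GA_Nodeid() == 0)                    /* rank 0 processes one tile */
        tile_task(g_in, g_out, 0, 0);
    GA_Sync();

    GA_Destroy(g_in);
    GA_Destroy(g_out);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}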

2.5 Task Parallel Programming Models

To create a parallel program, the first step is to identify concurrency and decompose the computation into units of work, or tasks. In general, parallelism can be identified with respect to either the data being processed (data parallelism) or the operations being performed (task parallelism).

Algorithm 2.1 shows a high-level picture of the task parallel execution model. In this model, the application developer first creates tasks and adds them into a task pool. The tasks are then executed in parallel. For irregular and dynamic parallel applications, many task parallel programming models [22, 43] have been explored and shown to achieve high performance and scalability on modern large-scale clusters.

The task parallel execution model is also widely used for dynamic load balancing. Static partitioning of computations can lead to imbalanced workloads. A number of studies, such as Shared Collections of Task Objects (Scioto) [21, 22] and the Task Scheduling Library (Tascel) [43], have shown good performance for irregular and dynamic parallel computations.

In these models, programmers express their computation as a dynamic collec- tion of tasks. A task is the basic unit of work identified by its task descriptor. A task descriptor provides references to the location of input and output data in the global address space. New tasks can be added into and executed from the task pool. Task execution is scheduled by a runtime system that performs dynamic load balancing and provides opportunities for efficient recovery from faults. For each process, the runtime system maintains a local task queue which is split into a reserved segment and a shared segment. The reserved segment allows lock-free local access, and the shared segment allows one-sided access for remote task stealing.

In our system, tensor contraction expressions are broken down into tasks which operate on tensor tiles stored in global address space. The computational tasks can be executed at any process and the runtime system schedules and balances these tasks automatically. We implement our framework and report experimental results using Scioto and Tascel, which are GA-based and MPI-based libraries, respectively.
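The following minimal sketch illustrates, in plain C, the shape of such a task-based setup. It is not the Scioto or Tascel API; the descriptor fields, pool structure, and handle values are hypothetical, and a real runtime would add atomic queue operations, remote stealing, and fault handling.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical task descriptor: names the tensor tiles a task reads and writes. */
typedef struct {
    int g_a, g_b, g_c;   /* global-array handles of the input/output tensors */
    int tile_a, tile_b;  /* which tiles of A and B this task contracts       */
    int tile_c;          /* which tile of C it accumulates into              */
} task_desc;

typedef struct {
    task_desc *tasks;
    size_t     count;
    size_t     next;
} task_pool;

/* Returns the next descriptor, or NULL when the pool is drained. */
static task_desc *get_next_task(task_pool *p)
{
    return (p->next < p->count) ? &p->tasks[p->next++] : NULL;
}

/* Drain the pool, applying the supplied task function to each descriptor. */
static void execute_all(task_pool *p, void (*task_fn)(task_desc *))
{
    for (task_desc *t = get_next_task(p); t; t = get_next_task(p))
        task_fn(t);
}

/* Stand-in for a get-compute-accumulate task body. */
static void print_task(task_desc *t)
{
    printf("contract tile %d x tile %d -> tile %d\n",
           t->tile_a, t->tile_b, t->tile_c);
}

int main(void)
{
    task_desc work[2] = { {1, 2, 3, 0, 0, 0}, {1, 2, 3, 0, 1, 1} };  /* dummy handles */
    task_pool pool = { work, 2, 0 };
    execute_all(&pool, print_task);
    return 0;
}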

2.6 Domain Specific Languages

A domain-specific language (DSL) is a computer language designed specifically for problems and applications in a particular domain. In contrast to general purpose languages, DSLs offer the potential for higher productivity of application developers and enable better portability across different computer hardware platforms. Defining a DSL can be worthwhile when a problem can be expressed clearly at a higher level of abstraction.

A DSL usually takes one of two forms: external or internal (embedded). An external DSL is independent of any other language, which allows greater flexibility in its design, but requires a customized parser and grammar. An embedded DSL lives inside another programming language and inherits characteristics from its host language. The domain-specific components provided by an embedded DSL are fully interoperable with the host language and allow programmers greater flexibility in mixing domain-specific code with other computations not expressible using the domain-specific abstractions.

Both external and embedded DSLs have received widespread attention for enabling solutions to be expressed at a high level of abstraction of the problem domains they cover. The goal is to let domain experts themselves understand, modify, and develop both DSLs and their programs easily, in order to enhance the productivity and portability of the implementations while preserving their performance. All these facts encourage us to adopt the DSL approach for developing tensor contraction implementations.

To define the syntax of a DSL for tensor contractions, we first identify several important characteristics of the problem. Like other programming languages, a DSL for tensor contractions must support domain-specific data structures for storing tensors and provide efficient access to the data. These data structures should also support symmetry properties of tensors, for instance, permutation and spatial symmetries. A DSL for tensor contractions should support operations such as multiplication (contraction) and addition.
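To make these requirements concrete, the sketch below shows one possible shape of such an interface in C. The names and signatures are hypothetical illustrations of the requirements just listed, not the actual DLTC API; the real abstractions are presented in Chapter 5.

/* Hypothetical tensor-contraction interface; illustrative only. */
typedef struct tensor tensor;   /* opaque handle to a distributed, tiled tensor */

/* Create a tensor with the given index ranges and an optional symmetry
 * annotation (e.g. antisymmetric or vertex-symmetric index groups). */
tensor *tensor_create(int ndim, const int *ranges, const char *symmetry);
void    tensor_destroy(tensor *t);

/* C[c_idx] += alpha * A[a_idx] * B[b_idx]; index strings such as "a,k"
 * follow the Einstein convention, so repeated indices are contracted. */
void tensor_contract(double alpha,
                     const tensor *A, const char *a_idx,
                     const tensor *B, const char *b_idx,
                     tensor *C, const char *c_idx);

/* C[c_idx] += alpha * A[a_idx] (tensor addition with index permutation). */
void tensor_add(double alpha,
                const tensor *A, const char *a_idx,
                tensor *C, const char *c_idx);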

Our system consists of two major parts. The first part, OpMin, provides a DSL for specifying tensor contraction expressions for operation minimization. The second part, DLTC, is an embedded DSL in library form for task-based parallel execution of tensor contractions. The DLTC abstractions provide a clean interface for developing tensor contractions, hiding the details of data manipulation and parallelism inside. One can still take full control of the code if needed, because the DLTC is fully compatible with its host language and the other programming models it integrates with.

CHAPTER 3

Overview

In this chapter, we provide an overview of our system. Our goal is to attain high performance and productivity for computing tensor contraction expressions. Our system can be divided into two main parts: the Operation Minimizer (OpMin) and Dynamic Load-balanced Tensor Contractions (DLTC). We give an overview of both components here and present the details of OpMin and DLTC in the next few chapters.

As shown in Figure 3.1, the work flow of the system is divided into three stages: operation minimization, dynamic task partitioning, and dynamic task execution. In the first stage, we revisit the operation minimization problem and present new algorithms for the algebraic transformation of tensor contraction expressions with symmetry properties. The operation minimization tool is implemented in Python and functions as a domain-specific compiler. After the operation counts of the tensor contraction expressions are reduced, the optimized expressions are computed using the domain-specific library we designed in the second stage, which abstracts tensor contraction expressions into fine-grained tasks for parallel execution.

[Figure 3.1 shows the system workflow: the original expression is processed by OpMin (single term optimization, factorization, common sub-expression elimination) into an optimized expression; DLTC then performs dynamic task partitioning (synchronization reduction, inter-expression parallelism) into fine-grained tasks, and task parallel execution (dynamic load balancing, data replication, data caching) produces the results.]

Figure 3.1: Overview of system.

We discuss methodologies for dynamic task partitioning and load balancing to overcome the performance bottlenecks discovered in the implementations generated by the TCE. Lastly, in the third stage, the tasks are executed with several performance optimizations aimed at improving data locality, an important but missing component in the previous TCE approach.

3.1 Operation Minimizer (OpMin)

A key process in the development of tensor contraction expressions is operation minimization, which attempts to optimize the operation count of a tensor computation. We present the Operation Minimizer (OpMin) for the effective algebraic transformation of tensor contraction expressions with symmetric tensors. OpMin comprises three parts: a frontend language parser, an operation optimizer, and a backend code generator.

3.1.1 Language parser

The frontend of OpMin is a high-level language parser, which serves as the closest layer to scientists, allowing tensor contractions to be specified in a form similar to the derived equations, instead of directly coding them in a general-purpose programming language, such as FORTRAN. The input to OpMin is a sequence of tensor contraction expressions. The indices, ranges, tensors, and symmetry properties are defined using a small set of high-level domain-specific language constructs. OpMin supports arbitrary input equations in sum-of-products form. The information extracted from the input is used for algebraic transformation in the operation optimizer.

As shown in Listing 3.1, range names and values are described using the keyword range. Names of indices are defined by the keyword index. Tensors are declared using the keyword array, with their names, index ranges, and symmetry properties specified. Two symmetry properties are supported in OpMin: antisymmetry is described using round brackets, and vertex symmetry is described using angle brackets

Listing 3.1 Example of an input to OpMin.

range O = 10;
range V = 100;
index h1, h2, h3 = O;
index p1, p2, p3 = V;
array f_oo([O][O]), f_ov([O][V]);
array f_vo([V][O]), f_vv([V][V]);
array v_oovo([O,O][V,O]:(0,1)(2)(3));
array v_oovv([O,O][V,V]:<0,1><2,3>);
array v_ovvo([O,V][V,O]), v_vovo([V,O][V,O]);
array v_vovv([V,O][V,V]);
array t_vvoo([V,V][O,O]:<0,1><2,3>);
array t_vo([V][O]), r_vo([V][O]);
r_vo[p1, h1] = 1.0 * f_vo[p1,h1]
             + 1.0 * f_vv[p1,p2] * t_vo[p2,h1]
             - 1.0 * f_oo[h2,h1] * t_vo[p1,h2]
             + 2.0 * f_ov[h2,p2] * t_vvoo[p1,p2,h1,h2]
             - 1.0 * f_ov[h2,p2] * t_vvoo[p2,p1,h1,h2]
             - 1.0 * v_vovv[p1,h2,p2,p3] * t_vvoo[p3,p2,h1,h2]
             + 2.0 * v_vovv[p1,h2,p2,p3] * t_vvoo[p2,p3,h1,h2]
             - 1.0 * v_vovo[p1,h2,p2,h1] * t_vo[p2,h2]
             + 2.0 * v_ovvo[h2,p1,p2,h1] * t_vo[p2,h2]
             + 1.0 * v_oovo[h2,h3,p2,h1] * t_vvoo[p1,p2,h2,h3]
             - 2.0 * v_oovo[h2,h3,p2,h1] * t_vvoo[p1,p2,h3,h2]
             - 1.0 * f_ov[h2,p2] * t_vo[p2,h1] * t_vo[p1,h2]
             + 2.0 * v_vovv[p1,h2,p2,p3] * t_vo[p2,h1] * t_vo[p3,h2];

on subsets of indices. Finally, tensor contraction expressions are presented using the Einstein notation.

The language parser analyzes the input expressions according to a small set of rules and a simple grammar. An expression is converted into a parse tree which contains the contracting relations of the tensors and the operation cost. Simple error checking is performed to ensure that tensor ranges and indices match in every expression.

3.1.2 Operation optimizer

Operation minimization is one of the most important optimizations for computing tensor contraction expressions. It transforms the input expression into an equivalent form with a reduced arithmetic operation count. In this work, we address the algebraic transformation of tensor contraction expressions with symmetric tensors through the utilization of the commutativity, associativity, and distributivity properties of these expressions. The details of the operation minimization algorithms and experimental results will be presented in Chapter 4.
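As a small illustration of what these transformations buy, the algebra below sketches single-term optimization and factorization on the example contraction from Chapter 2, with all index ranges equal to N; the intermediate T and the tensor D are introduced here only for illustration.

\begin{align*}
&S[a,b,i,j] \mathrel{+}= \textstyle\sum_{k,l} A[a,k]\, B[b,l]\, C[k,l,i,j]
  && \text{direct evaluation: } O(N^6) \text{ operations} \\
&T[b,k,i,j] = \textstyle\sum_{l} B[b,l]\, C[k,l,i,j], \qquad
  S[a,b,i,j] \mathrel{+}= \textstyle\sum_{k} A[a,k]\, T[b,k,i,j]
  && \text{two binary contractions: } O(N^5) \\
&\textstyle\sum_{k} A[a,k]\, B[k,i] + \textstyle\sum_{k} A[a,k]\, D[k,i]
  = \textstyle\sum_{k} A[a,k] \bigl( B[k,i] + D[k,i] \bigr)
  && \text{factorization halves the multiplications}
\end{align*}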

3.1.3 Code generator

The output from the operation optimizer is an optimized sequence of binary contractions and additions in the form of expression trees. The backend of OpMin is a code generator which transforms the optimized expressions from the tree form into code. One can modify the code generator to convert the optimized expressions into programs in any general-purpose programming language.

3.2 Dynamic Load-balanced Tensor Contractions (DLTC)

After the original tensor contraction expressions are optimized through OpMin, the optimized expressions must be transformed into executable code for the computation. Developing scalable application programs for tensor contraction expressions requires an understanding and exploitation of the parallelism and data layout in the computation. We present Dynamic Load-balanced Tensor Contractions (DLTC) for fast computation of tensor contraction expressions. As shown in Figure 3.1, DLTC consists of two stages: dynamic task partitioning and dynamic task execution.

3.2.1 Dynamic task partitioning

Instead of the traditional compiler approach taken by the TCE, the DLTC provides a solution in library form where tensor contraction expressions are partitioned into smaller tasks. A key aspect of this approach is to organize tensors into globally addressable tiles. The computation is defined in terms of sets of computational tasks operating on these tensor tiles. We present a novel scheme to dynamically partition the entire workload into tasks. The details of the DLTC algorithms are described in Chapter 5.

Abstraction of tensor contractions

To increase productivity in software development, we define task functions in a general library form for reuse without loss of efficiency. We build a library-based framework that serves as a middle-layer interface to encapsulate low-level data manipulation and computation, and enables the services of a high-level task scheduler and load balancer. Raising the abstraction level in the development of general tensor contraction libraries significantly decreases the development effort compared to the TCE compiler approach. The program size of the implementation is also reduced. Moreover, it opens a number of opportunities for further performance optimizations. The domain-specific abstractions provided by the DLTC are described in Section 5.3.

Synchronization reduction

The DLTC places the components of tensor contraction expressions into multiple levels. Within each level, computations are independent and their tasks can be executed concurrently. In this fashion, the DLTC increases the amount of parallelism from one expression to multiple expressions. As a result, a large number of barrier synchronization steps enforced in prior approaches are eliminated. The discussion of inter-contraction parallelism and synchronization reduction is elaborated in Section 5.3.

High performance computation kernels

The DLTC employs the PGAS model for data management and communication operations, and highly tuned Basic Linear Algebra Subprograms (BLAS) libraries for matrix-matrix multiplications. The DLTC enables easy hookup to other computation kernels provided by different hardware platforms. New containers, iterators, and algorithms tailored to specific chemical problems can also be developed and added to the DLTC library. The DLTC is operable with several programming models, such as MPI, GA, and Tascel. Utilizing multi-threaded OpenMP index reordering kernels (which are not available in the NWChem TCE module), the DLTC can also make maximal use of the available on-node memory to achieve higher performance.
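As an illustration of this design, the sketch below shows the compute step of a tile-level multiplication task: once tiles have been fetched and index-reordered into matrix layout, the contraction reduces to a BLAS dgemm call that accumulates into the output tile. The row-major layout and the tile shapes are assumptions for this example, not the DLTC internals.

#include <cblas.h>

/* c[m x n] += a[m x k] * b[k x n]; all tiles are dense, row-major local
 * buffers that have already been permuted into matrix form. */
void contract_tiles(const double *a, const double *b, double *c,
                    int m, int n, int k)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, a, k,   /* lda = k for a row-major m x k matrix */
                     b, n,   /* ldb = n for a row-major k x n matrix */
                1.0, c, n);  /* beta = 1.0 accumulates into c        */
}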

3.2.2 Dynamic task execution

After the tasks for tensor contractions are created, they are grouped into a few levels and executed.

Dynamic load balancing

Dynamic load balancing is a challenging problem and has been widely studied in the literature. The computational tasks for tensor contractions must be distributed in a load-balanced fashion for high performance. Since static task distribution is insufficient to achieve effective load balance, automatic re-balancing of workloads at runtime is performed. Chapter 5 discusses dynamic task scheduling and load balancing for tensor contraction expressions using the Scioto and Tascel libraries. Experimental performance data is also reported and discussed.

Data locality enhancement

The manner in which tensors and tasks are partitioned and distributed among the processors of a parallel system has a great impact on performance. Automatic management of task and data distribution is closely coupled with memory, communication, and data locality issues. In the DLTC, we design an execution model where tasks and data are distributed arbitrarily in the computer system. We profile the execution of DLTC and identify the communication traffic among tensors using two examples from the CC methods. A few tensors are selected and replicated to avoid communication on very small tensor tiles. A caching scheme is presented to reuse tensor tiles that have been previously accessed. This topic is discussed in Chapter 6.
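A minimal sketch of the caching idea is given below: before issuing a remote fetch, a task first checks a small, local, direct-mapped tile cache. The slot count, tile size, and hash are assumptions made for illustration; Chapter 6 describes the actual scheme and its hit rates.

#include <stddef.h>

#define CACHE_SLOTS 64
#define TILE_ELEMS  (8 * 8 * 8 * 8)        /* assumed 4-D tile size */

typedef struct {
    int    tensor, tile;                   /* key: which tensor, which tile */
    int    valid;
    double data[TILE_ELEMS];
} cache_slot;

static cache_slot cache[CACHE_SLOTS];

/* Return a pointer to the requested tile, fetching it (e.g. with a
 * one-sided get) only on a cache miss. */
double *get_tile(int tensor, int tile,
                 void (*fetch)(int tensor, int tile, double *buf))
{
    cache_slot *s = &cache[(unsigned)(tensor * 31 + tile) % CACHE_SLOTS];
    if (!s->valid || s->tensor != tensor || s->tile != tile) {
        fetch(tensor, tile, s->data);      /* miss: bring the tile in */
        s->tensor = tensor;
        s->tile   = tile;
        s->valid  = 1;
    }
    return s->data;                        /* hit: no communication   */
}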

CHAPTER 4

Operation Minimization on Symmetric Tensors

4.1 Introduction

Accurate modeling and prediction of electronic structures in quantum chemistry involves computational methods whose equations consist of hundreds to thousands of tensor contraction expressions. Manual implementation of such methods is tedious and error-prone — programs become obsolete quickly due to rapid advances in hardware and methodologies. Therefore, automated derivation, optimization, and implementation of tensor contraction expressions is critical.

The Tensor Contraction Engine (TCE) [6, 9, 32] was a collaborative project between computer scientists and quantum chemists to automate the development of high-performance parallel programs for many-body methods, such as the Coupled Cluster (CC) methods. The CC theory is a numerical technique widely used for solving chemical problems due to its simplicity and high accuracy. Efficient parallel programs are automatically synthesized by specifying the methods in a high-level domain-specific language.

A key process in the fast implementation of tensor contractions is to reduce the total number of arithmetic operations through the process of algebraic transformation — finding an equivalent form of an input expression that minimizes the operation count.

The Operation Minimizer (OpMin) is a software tool for solving the algebraic transformation problem for tensor contractions with symmetric properties. The solution presented in this chapter is an extension of the techniques developed by Sibiryakov [66], Hartono et al. [29, 31], and Zhang [78]. OpMin takes a tensor contraction expression described in a domain-specific language, searches for an equivalent form of the expression with a reduced operation count, and returns the optimized form of the expression.

In this chapter, we focus on how OpMin exploits the symmetry properties of tensors to reduce computation and storage cost. We begin by formulating definitions of two kinds of symmetric tensors. Next, we describe derivation rules and cost models for computing these tensors. There are three operation minimization algorithms employed in OpMin:

1. Single term optimization — to find the best sequence of binary tensor contractions to achieve a multi-tensor contraction.

2. Factorization — to apply the distributive law across multiple terms.

3. Common sub-expression elimination — to reduce the memory requirement and arithmetic operations by reusing intermediate tensors.

At the end, we demonstrate the effectiveness of OpMin on real-world tensor contractions using equations from the CC and the Equation-of-Motion (EOM) CC methods. The rest of the chapter is organized as follows:

Section 4.2 defines two symmetry properties of tensors.

28 Section 4.3 describes rules to derive the symmetries of intermediate tensors, cost models, and three optimization techniques employed in OpMin.

Section 4.4 reports on performance evaluation of OpMin.

Section 4.5 summarizes the chapter.

4.2 Symmetry Properties of Tensors

A tensor contraction expression is formed by a summation of several terms, and each term is formed by a contraction of two or more tensors. Tensors in the equations from CC methods often expose two kinds of symmetry properties: antisymmetry and vertex symmetry. The computation and memory cost can be greatly reduced if these properties are exploited.

4.2.1 Antisymmetry

A tensor is antisymmetric on an index subset (or group) if interchanging any two indices of the subset results in the same value with an alternated sign (+/−). We denote an antisymmetric index subset using the overline symbol. For example, a 2-D tensor (matrix) A that is antisymmetric on indices i and j satisfies

A[i, j] = −A[j, i]

and is denoted by placing an overline over the group (i, j); the indices i and j must be in the same range. For another example, a 4-D tensor A with an antisymmetric group (i, j) indicates that

A[i, j, k, l] = −A[j, i, k, l],

and a 6-D tensor A with an antisymmetric group (i, j, k) means that

A[i, j, k, l, m, n] = −A[j, i, k, l, m, n] = −A[i, k, j, l, m, n] = A[j, k, i, l, m, n] = A[k, i, j, l, m, n] = −A[k, j, i, l, m, n].

The example below shows how the total number of arithmetic operations, or simply the operation count, is reduced by exploiting antisymmetry. Consider a tensor contraction with two input tensors A and B, and an output tensor C, and let all indices be in the same range M:

C += A B

The operation count for computing this expression is 2M^d, where d is the number of distinct indices in the term: one multiplication and one addition for every combination of index values. However, if there is an antisymmetric index group (i, j) in tensor A, only M(M + 1)/2 distinct elements out of all M^2 possible pairs of (i, j) need to be computed. Therefore, the operation count can be reduced from 2M^d to 2 · (M(M + 1)/2) · M^(d−2) = M^d + M^(d−1) ≈ M^d if M is sufficiently large. Roughly half of the computation can be eliminated by obtaining it from the other half with a sign change.

4.2.2 Vertex symmetry

Vertex symmetry applies over two correlated index subsets within a tensor. In-

terchanging any two indices in one subset and the corresponding indices in another

subset results in the same value without changing the sign. We use the widehat sym-

bol to denote the vertex symmetry property in a tensor. For example, given a 4-D

tensor

$V^{\widehat{ij}}_{\widehat{kl}},$

where indices i and j are in the same range, and indices k and l are in the same range. Tensor V is said to be vertex symmetric if

$V^{ij}_{kl} = V^{ji}_{lk}.$

For another example, a 6-D tensor

$V^{\widehat{ijk}}_{\widehat{lmn}}$

means that

$V^{ijk}_{lmn} = V^{jik}_{mln} = V^{ikj}_{lnm} = V^{kji}_{nml} = V^{jki}_{mnl} = V^{kij}_{nlm}.$

Like antisymmetry, vertex symmetry of a tensor can also provide cost reduction.

The reduction factors in operation count for these two symmetries are presented in the next section.

4.3 Methods

In this section, we address three questions:

1. Given a tensor expression with symmetry properties already known, how can

the symmetry properties of intermediate tensors be discovered?

2. What are the reduction factors in the general form if these symmetry proper-

ties are exploited?

3. How can canonical forms be constructed for tensors with symmetries in the

operation minimization algorithms?

Rule   Expression   Reduction Factor

A1   $I^{\overline{ij}}_{kl} = A^{\overline{ij}}_{kl} + B^{\overline{ij}}_{kl}$   1/2!

A2   $I^{\widehat{ij}}_{\widehat{kl}} = A^{\widehat{ij}}_{\widehat{kl}} + B^{\widehat{ij}}_{\widehat{kl}}$   1/2!

A3   $I^{\widehat{ij}}_{\widehat{kl}} = A^{\widehat{ij}}_{\widehat{kl}} + B^{\overline{ij}}_{\overline{kl}}$   1/2!

A4   $I^{\overline{ij}}_{kl} = A^{ij}_{kl} - A^{ji}_{kl}$   1/2!

A5   $I^{\widehat{ij}}_{\widehat{kl}} = A^{ij}_{kl} + A^{ji}_{lk}$   1/2!

Table 4.1: Derivation rules for addition expressions.

4.3.1 Derivation rules

To answer the first question, we formulate derivation rules to deduce the sym- metry of intermediate tensors. To be more precise, given the symmetry properties of input tensors (right hand side) of a binary tensor expression, we want to discover the symmetry properties of its output tensor (left hand side).

Table 4.1 lists the derivation rules and their reduction factors for expressions which sum two tensors. The derivation rules for addition expressions are described below:

• Rule A1: if an antisymmetric index subset exists in both input tensors, the

output tensor remains antisymmetric for that subset.

• Rule A2: if both input tensors possess the same vertex symmetric index sub-

set, the output tensor preserves vertex symmetry in the same subset.

Rule   Expression   Reduction Factor

M1   $I^{\overline{ij}}_{kl} = A^{\overline{ij}}_{mn} \times B^{mn}_{kl}$   1/2!

M2   $I^{\widehat{ij}}_{\widehat{kl}} = A^{\widehat{ij}}_{\widehat{mn}} \times B^{\widehat{mn}}_{\widehat{kl}}$   1/3

M3   $I^{ij}_{\overline{kl}} = A^{ij}_{\overline{mn}} \times B^{\widehat{mn}}_{\widehat{kl}}$   1/2!

M4   $I^{ij}_{kl} = A^{ij}_{\overline{mn}} \times B^{mn}_{kl}$   1/2!

M5   $I^{\widehat{ij}} = A^{i}_{m} \times A^{j}_{m}$   1/2!

Table 4.2: Derivation rules for multiplication expressions.

• Rule A3: if an input tensor is vertex symmetric while the other tensor is anti-

symmetric in both index subsets, the output tensor remains vertex symmetric

but loses the antisymmetry property.

• Rule A4: subtracting two instances of the same tensor, with different orders of

indices that differ exactly in two indices, forms a new antisymmetry property

in the output tensor.

• Rule A5: summing two instances of the same tensor, with both top and bottom

pairs of indices interchanged, forms a new vertex symmetry property in the

output tensor.

Table 4.2 shows the derivation rules and their reduction factors for expressions which multiply two tensors. The descriptions of derivation rules for multiplication expressions are as follows:

• Rule M1: multiplying two input tensors with an antisymmetric group on ex-

ternal indices preserves the antisymmetry property in the output tensor.

• Rule M2: both input tensors must possess vertex symmetry in order to pre-

serve this property in the output tensor.

• Rule M3: if the contraction indices are antisymmetric in one input tensor and

vertex symmetric in the other tensor, a new antisymmetry is formed on the

corresponding indices in the output tensor. In Table 4.2, the index group (k, l)

of the output tensor becomes antisymmetric because the contraction indices

(m, n) are antisymmetric in the first tensor and vertex symmetric in the second

tensor.

• Rule M4: if antisymmetry appears in the contraction indices, no symmetry is

preserved in the output tensor; however, the reduction factor of 1/2! still applies.

• Rule M5: the product of two instances of the same tensor with two indices

forms a vertex symmetric index group in the output tensor.

If the size (the number of indices) of an antisymmetric index subset or a vertex

symmetric index subset is M, the reduction factor is $\frac{1}{M!}$. For instance, the reduction

factors of Rules A1 to A5 are all $\frac{1}{2!}$, because M is 2. The only exception is Rule M2,

where Pulay [58] proposed a technique to reduce the factor to $\frac{1}{3}$. The reader is referred to [78] for detailed mathematical proofs of these derivation rules.

4.3.2 Cost models

Next, we present the cost models for contracting symmetric tensors. Consider a general tensor contraction expression involving three tensors with antisymmetry: I, A, and B. Tensor A has a total of m + k index subsets, and tensor B has a total of k + n index subsets. As a result, the output tensor I has a total of m + n external index groups:

$a_1, a_2, \ldots, a_m$ and $b_1, b_2, \ldots, b_n$.

This expression is written as

$$I^{\overline{a_1}, \overline{a_2}, \ldots, \overline{a_m}}_{\overline{b_1}, \overline{b_2}, \ldots, \overline{b_n}} = \sum_{s_1, s_2, \ldots, s_k} A^{\overline{a_1}, \overline{a_2}, \ldots, \overline{a_m}}_{\overline{s_1}, \overline{s_2}, \ldots, \overline{s_k}} \; B^{\overline{s_1}, \overline{s_2}, \ldots, \overline{s_k}}_{\overline{b_1}, \overline{b_2}, \ldots, \overline{b_n}}.$$

Let the sizes of the external index subsets be

$|a_i| = x_i$

and

$|b_j| = y_j$.

There are k independent antisymmetric contraction index subsets, from $s_1$ to $s_k$, where

$|s_l| = z_l$.

The operation count reduction for this tensor contraction expression is computed as

$$\text{reduction} = \prod_{i=1}^{m} \frac{1}{x_i!} \cdot \prod_{j=1}^{n} \frac{1}{y_j!} \cdot \prod_{l=1}^{k} \frac{1}{z_l!}.$$

Similarly, consider the following tensor contraction expression with a total number of n independent vertex symmetric groups

$$I^{\widehat{a_1}, \widehat{a_2}, \ldots, \widehat{a_n}}_{\widehat{b_1}, \widehat{b_2}, \ldots, \widehat{b_n}} = \sum_{s_1, s_2, \ldots, s_n} A^{\widehat{a_1}, \widehat{a_2}, \ldots, \widehat{a_n}}_{\widehat{s_1}, \widehat{s_2}, \ldots, \widehat{s_n}} \; B^{\widehat{s_1}, \widehat{s_2}, \ldots, \widehat{s_n}}_{\widehat{b_1}, \widehat{b_2}, \ldots, \widehat{b_n}}.$$

Assume that the sizes of the vertex symmetry groups are all the same, that is,

$|a_i| = |b_i| = |s_i| = x$.

The reduction factor for this expression is computed as

$$\text{reduction} = f(x)^{n}$$

where

$$f(x) = \begin{cases} \dfrac{1}{3}, & x = 2 \\[6pt] \dfrac{1}{x!}, & x > 2. \end{cases}$$

The derivation rules and cost models play important roles in creating canonical forms of tensors and expressions, particularly in the factorization and common sub-expression elimination algorithms.
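As a small worked illustration of the antisymmetric cost model above (the group counts and sizes below are chosen here for the example and are not taken from the text):

% Illustrative instantiation of the reduction formula: one external upper
% group, one external lower group, and one contracted group, each an
% antisymmetric pair (m = n = k = 1, x_1 = y_1 = z_1 = 2).
\[
  \text{reduction}
    = \frac{1}{x_1!} \cdot \frac{1}{y_1!} \cdot \frac{1}{z_1!}
    = \frac{1}{2!} \cdot \frac{1}{2!} \cdot \frac{1}{2!}
    = \frac{1}{8},
\]
% i.e., exploiting the three antisymmetric pairs removes roughly seven
% eighths of the arithmetic operations of this contraction.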

4.3.3 Operation minimization algorithms

OpMin applies three operation minimization algorithms to reduce operation

count: (1) single term optimization, (2) factorization, and (3) common sub-expression

elimination, by making effective use of antisymmetry and vertex symmetry proper- ties of tensors.

Single term optimization

A tensor contraction expression is often formed by a summation of several terms.

Each term contracts two or more tensors. The single term optimization problem is

in the same sense as the classic matrix chain multiplication problem — the order in

which we parenthesize a single term of a tensor contraction expression affects the

operation count for computing the term.

Consider the following tensor contraction expression

$D^{pq}_{ij} = \sum_{k,r} A^{r}_{j} \, B^{qr}_{ik} \, C^{p}_{k}$

where A, B, C, and D are tensors with indices i, j and k of range O, and indices p, q and r of range V.

If we directly implement the expression using nested loops (six loops), the operation count for computing this expression is $3O^3V^3$ (two multiplications and one addition for all possible values of indices). However, the operation count can be reduced if we break down the computation into two steps with an intermediate tensor I:

$I^{q}_{ijk} = \sum_{r} A^{r}_{j} B^{qr}_{ik}$ (cost $2O^3V^2$)

$D^{pq}_{ij} = \sum_{k} I^{q}_{ijk} C^{p}_{k}$ (cost $2O^3V^2$)

In this fashion, the total computation cost becomes $4O^3V^2$.

On the contrary, if we select a different contracting order for the same expression, the operation count increases:

$I^{pqr}_{i} = \sum_{k} B^{qr}_{ik} C^{p}_{k}$ (cost $2O^2V^3$)

$R^{pq}_{ij} = \sum_{r} I^{pqr}_{i} A^{r}_{j}$ (cost $2O^2V^3$)

The computation cost becomes $4O^2V^3$, which is larger because V is greater than O.

The above two examples show scenarios where different orders for the same contraction can lead to different operation count. Therefore, it is important to select the best order for contracting the tensors of a term. The single term optimization attempts to minimize the operation count within one term, by searching for the best sequence of binary tensor contractions.

Algorithm 4.1 Single term optimization

input : A single term T1, T2, ..., Tn with n tensors.
output: An optimal contraction sequence value.expr for this input.
begin
    table = ∅
    for i = 1 to n do
        S = all possible subsets of {T1, T2, ..., Tn} with length i
        for each set {s1, s2, ..., si} ∈ S do
            if i = 1 then
                value.expr = (s1)
                value.cost = 0
            else if i = 2 then
                value.expr = (s1 × s2)
                value.cost = computeCost(s1, s2)
            else
                minCost = ∞
                D = set of all possible ways to split {s1, s2, ..., si} into two parts
                for each split {d1, ..., dj}, {dj+1, ..., di} ∈ D do
                    left = table.getValue({d1, ..., dj})
                    right = table.getValue({dj+1, ..., di})
                    currentCost = computeCost(left.expr, right.expr) + left.cost + right.cost
                    if currentCost < minCost then
                        value.expr = (left.expr × right.expr)
                        value.cost = currentCost
                        minCost = currentCost
                    end
                end
            end
            key = {s1, s2, ..., si}
            table.insert(key, value)
        end
    end
    value = table.getValue({T1, T2, ..., Tn})
    return value.expr
end

Algorithm 4.1 shows the single term optimization algorithm employed in Op-

Min. The function computeCost computes the cost of contracting two tensors.
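To illustrate what such a cost function may look like, the sketch below estimates the cost of contracting two tensors as twice the product of the ranges of all distinct indices involved. It is a simplified example written for this chapter; the string-based index representation and the neglect of symmetry-based reduction factors are assumptions, not OpMin's actual implementation.

#include <map>
#include <set>
#include <string>

// Simplified pairwise contraction cost estimate (illustrative, not OpMin code).
// Each tensor is described by the set of its index names; 'range' maps an
// index name to its extent (e.g., the value of O or V). The multiply-add cost
// of contracting the two tensors is 2 * product of the ranges of the union of
// their indices.
double computeCost(const std::set<std::string>& idxA,
                   const std::set<std::string>& idxB,
                   const std::map<std::string, double>& range) {
  std::set<std::string> all(idxA);
  all.insert(idxB.begin(), idxB.end());
  double cost = 2.0;  // one multiplication and one addition per iteration
  for (const std::string& idx : all) {
    cost *= range.at(idx);
  }
  return cost;
}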

Factorization

To further reduce operation count, OpMin performs factorization to find com- mon factors across multiple terms. A number of algorithms have been developed in [31, 66] for factorization of tensor contraction expressions. For example, the ex- haustive search algorithm presented in [66] is guaranteed to find the best factor- ization with minimal operation count for a given input expression. However, it is extremely time consuming in practice, due to the exponentially growing time complexity for tensor contractions with a large number of terms.

Algorithm 4.2 presents a heuristic search approach based on the divide-and- conquer concept. First, we perform single term optimization on each term of the input tensor contraction expression to find the best contraction sequence and cor- responding operation count. Next, we sort these terms according to their operation counts, and divide them into smaller chunks. For each chunk of terms, we perform the direct descent search algorithm developed in [31] to find the best local factor- ization, and then iteratively combine chunks until there is only one chunk left. The search space of this factorization algorithm is controlled by three tuning param- eters: the total number of chunks, the number of terms in a chunk, and a cut-off threshold for the direct descent search function. The readers are referred to [31] for details of the direct descent search algorithm.

Algorithm 4.2 Heuristic factorization

input : A tensor contraction expression E with multiple terms.
output: A factorized form F of E.
begin
    num = starting number of chunks
    maxSize = maximum number of expressions that can be stored in one chunk
    perform single term optimization on every term in E and sort the terms based on cost
    divide the terms in E into num chunks
    C = the set of all chunks
    F = ∅
    while True do
        G = ∅
        for c in C do
            curResult = DirectDescent(c)
            G = G ∪ curResult
        end
        if |G| <= 1 then
            F = F ∪ G
            break
        end
        C = ∅
        Let G = {g1, g2, g3, ...}
        i = 1
        while i < |G| do
            g = combine(gi, gi+1)
            if size(g) > maxSize then
                F = F ∪ g
            else
                C = C ∪ g
            end
            i = i + 2
        end
    end
    return F
end

Common sub-expression elimination

Common sub-expression elimination (CSE) is a classic optimization scheme uti-

lized in traditional compilers. The goal of CSE is to identify intermediate tensors

that can be calculated once and stored for reuse multiple times in the expression.

CSE is also widely used in manual formulations of the CC methods. To make use

of CSE, the first step is to define a canonical form of a sub-expression. We incorpo-

rate index mapping and permutation techniques to recast a sub-expression into its

canonical form. The steps for recasting a sub-expression into its canonical form are

described as follows:

1. Sort the tensors in their alphabetical order.

2. Assign priority to larger coefficients in case of conflicts.

3. Rename all indices based on their order of appearance.

For example, consider the following binary tensor contraction expression

$I^{p}_{j} = c \cdot B^{m}_{j} \cdot A^{p}_{m}$

where c is the coefficient. The sub-expression (right-hand side) is recast as

$I^{e_1}_{e_2} = c \cdot A^{e_1}_{i_1} \cdot B^{i_1}_{e_2}$

where e stands for external indices, and i stands for internal indices.

Algorithm 4.3 describes how OpMin eliminates common sub-expressions by

maintaining a hash table to store the canonical forms of sub-expressions and their

operation count. The input to the CSE algorithm is a set of tensor contraction ex-

pressions. Each sub-expression is transformed into its canonical form and checked

Algorithm 4.3 Common sub-expression elimination

input : A set of n equations E = {e1, e2, ..., en}.
output: A subset F of E with common sub-expressions eliminated.
begin
    table = ∅
    F = ∅
    for i = 1 to n do
        I = the intermediate of ei
        C = the canonical form of ei
        if table.has(C) then
            V = table.getValue(C)
            replace I with V wherever I appears in the RHS of any equation in E
        else
            table.insert(C, I)
            F = F ∪ ei
        end
    end
    return F
end

to determine whether it has been computed previously. If it has, we replace all occurrences of that sub-expression appearing in the right-hand side of other expressions with the intermediate stored in the hash table.
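A minimal sketch of the index-renaming part of canonicalization is shown below. It is an illustrative example written for this chapter: the representation of a term as a list of (tensor name, index list) pairs is an assumption, and the external/internal distinction and coefficient tie-breaking described above are omitted.

#include <algorithm>
#include <map>
#include <string>
#include <vector>

// One factor of a term: a tensor name and its index names in order.
struct Factor {
  std::string tensor;
  std::vector<std::string> indices;
};

// Simplified canonicalization sketch (not OpMin's implementation):
// 1) sort the factors alphabetically by tensor name, then
// 2) rename indices in their order of appearance (i1, i2, ...).
std::vector<Factor> canonicalize(std::vector<Factor> term) {
  std::sort(term.begin(), term.end(),
            [](const Factor& a, const Factor& b) { return a.tensor < b.tensor; });
  std::map<std::string, std::string> rename;
  int next = 1;
  for (Factor& f : term) {
    for (std::string& idx : f.indices) {
      if (rename.find(idx) == rename.end()) {
        rename[idx] = "i" + std::to_string(next++);
      }
      idx = rename[idx];
    }
  }
  return term;
}

Two sub-expressions that differ only in tensor order and index naming then map to the same canonical key, which is what the hash table in Algorithm 4.3 relies on.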

4.4 Results and Discussion

4.4.1 Experimental setup

We evaluated the performance of OpMin using several equations from CC the- ory. The CC equations involve tensors with indices that span two types of ranges: occupied (O) and virtual (V). Range O describes the number of electrons in a mod- eled system, and range V pertains to the accuracy of the solution — higher values provide better accuracy. In practice, V is usually greater than O by a factor of 3 to

30.

The experiments were conducted on a desktop computer with one quad-core

Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz processor. We evaluated OpMin perfor-

mance by comparing its operation count with that from two other computational

chemistry software suites: NWChem (version 6.0) [74] and PSI3 (version 3.4.0) [3].

We also compared the operation counts from OpMin with a genetic algorithm de-

veloped by Engels-Putzka and Hanrath [24, 28] (marked as Genetic).

The operation counts for NWChem and PSI3 were obtained by adding Perfor-

mance Application Programming Interface (PAPI) [2] function calls into their source

code. PAPI is a library designed for accessing hardware performance counters

available on modern micro-processors. The operation counts for Genetic were ex-

tracted from the published papers [24, 28].

Table 4.3 displays the input equations used for performance evaluation. We tested five different methods from CC theory: UCCSD, RCCSD, UEOMCCSD, RE-

OMCCSD, and UCCSDT. The letters ‘U’ and ‘R’ stand for ‘Unrestricted’ and ‘Restricted’ Hartree-Fock theory; ‘EOM’ stands for ‘Equation-of-Motion’; ‘CC’ stands for ‘Coupled Cluster’; and ‘S’, ‘D’, ‘T’ stand for ‘Single’, ‘Double’, and ‘Triple’ excitations. Each method can be further broken down into equations of different excitations. We only tested the triples (T3) in UCCSDT because it is the most dominant part of UCCSDT. The total numbers of terms in these equations are also provided.

4.4.2 Importance of Symmetry Properties

To understand the importance and the cost reduction provided by symmetry

properties of tensors, we removed the symmetry properties of the tensors in the ex-

pressions and performed operation minimization on these tensors without symmetry.

Equation   Description   Excitations   # of Terms

UCCSD      Unrestricted Coupled Cluster Singles and Doubles.                       T1   50
UCCSD      Unrestricted Coupled Cluster Singles and Doubles.                       T2   277

RCCSD      Restricted Coupled Cluster Singles and Doubles.                         T1   25
RCCSD      Restricted Coupled Cluster Singles and Doubles.                         T2   96

UEOMCCSD   Unrestricted Equation-of-Motion Coupled Cluster Singles and Doubles.    X1   84
UEOMCCSD   Unrestricted Equation-of-Motion Coupled Cluster Singles and Doubles.    X2   572

REOMCCSD   Restricted Equation-of-Motion Coupled Cluster Singles and Doubles.      X1   42
REOMCCSD   Restricted Equation-of-Motion Coupled Cluster Singles and Doubles.      X2   198

UCCSDT     Unrestricted Coupled Cluster Singles, Doubles and Triples.              T3   2102

Table 4.3: Characteristics of input equations to OpMin for performance evaluation.

Equation   Original (Dense / Symmetric / Speedup)   Optimized (Dense / Symmetric / Speedup)

UCCSD-T2 3.83 ⋅ 10 2.91 ⋅ 10 1.32 1.58 ⋅ 10 9.51 ⋅ 10 1.66

RCCSD-T2 1.34 ⋅ 10 1.31 ⋅ 10 1.02 6.08 ⋅ 10 3.91 ⋅ 10 1.55

UEOMCCSD-X2 1.01 ⋅ 10 7.76 ⋅ 10 1.30 2.42 ⋅ 10 1.72 ⋅ 10 1.41

REOMCCSD-X2 3.50 ⋅ 10 3.45 ⋅ 10 1.01 1.01 ⋅ 10 7.00 ⋅ 10 1.44

UCCSDT-T3 1.91 ⋅ 10 1.46 ⋅ 10 1.31 2.91 ⋅ 10 7.15 ⋅ 10 4.07

Table 4.4: Operation count comparison of dense vs. symmetric tensor contractions.

O V Original Optimized

10 10 2.99 ⋅ 10 1.05 ⋅ 10

10 20 4.72 ⋅ 10 7.10 ⋅ 10

10 30 2.38 ⋅ 10 2.29 ⋅ 10

10 100 2.91 ⋅ 10 9.51 ⋅ 10

10 200 4.65 ⋅ 10 9.59 ⋅ 10

10 300 2.35 ⋅ 10 3.93 ⋅ 10

10 1000 2.90 ⋅ 10 3.30 ⋅ 10

10 2000 4.64 ⋅ 10 4.76 ⋅ 10

10 3000 2.35 ⋅ 10 3.32 ⋅ 10

Table 4.5: Performance evaluation of OpMin with different O and V values for UCCSD-T2 ($V = a \cdot O^{b}$, with a and b in the range of 1 to 3).

Table 4.4 shows the speedup of symmetric tensor contractions over tensors without symmetry. We observed that the speedup is amplified after the optimization, which shows that symmetry properties of tensors have a large impact on operation minimization because they enable finding more common factors and sub-expressions.

For example, the speedup of UCCSDT-T3 is amplified from 1.31× to 4.07× after the operation minimization.

Equation STO STO + FACT STO + FACT + CSE Speedup

UCCSD-T1 9.87 ⋅ 10 8.76 ⋅ 10 8.56 ⋅ 10 1.15

UCCSD-T2 2.36 ⋅ 10 1.40 ⋅ 10 9.51 ⋅ 10 2.48

RCCSD-T1 6.29 ⋅ 10 3.16 ⋅ 10 2.80 ⋅ 10 2.25

RCCSD-T2 8.62 ⋅ 10 4.56 ⋅ 10 3.91 ⋅ 10 2.20

UEOMCCSD-X1 1.30 ⋅ 10 1.21 ⋅ 10 1.18 ⋅ 10 1.10

UEOMCCSD-X2 4.08 ⋅ 10 2.61 ⋅ 10 1.72 ⋅ 10 2.37

REOMCCSD-X1 8.14 ⋅ 10 4.70 ⋅ 10 4.22 ⋅ 10 1.93

REOMCCSD-X2 1.54 ⋅ 10 8.22 ⋅ 10 7.00 ⋅ 10 2.20

UCCSDT-T3 3.99 ⋅ 10 1.62 ⋅ 10 7.15 ⋅ 10 5.58

Table 4.6: Performance evaluation of combining STO, FACT, and CSE algorithms in OpMin.

4.4.3 Performance evaluation of OpMin algorithms

Table 4.6 shows the operation count resulting from combining single term op-

timization, factorization, and common sub-expression elimination algorithms in

OpMin. In this experiment, we set O to 10 and V to 100 as representative range val- ues. Column ‘STO’ indicates the operation count obtained using only single term optimization; column ‘STO+FACT’ applies both single term optimization and fac- torization; and column ‘STO+FACT+CSE’ shows the operation count of combining all three operation minimization algorithms. The speedup is computed as

Speedup = (Operation count of ‘STO’) / (Operation count of ‘STO+FACT+CSE’).

The FACT and CSE algorithms together provided an extra operation count reduc-

tion of 2.36× on average for these equations, compared to applying only the STO al-

gorithm. The effectiveness of FACT and CSE algorithms is visible, and we observe

larger speedup numbers on higher excitation equations. In particular, the speedup

is 5.58× for UCCSDT-T3, which is the largest equation evaluated (with more than

two thousand terms).

We select UCCSD-T2 as a representative method to test different combinations

of O and V values as

$V = a \cdot O^{b}$

where a and b are integers in the range from 1 to 3. The results are shown in Ta-

ble 4.5. We observed that the operation count reduction becomes more significant

when the ratio of V/O increases. The reduction grows from roughly two orders of

magnitude to four orders of magnitude as the ratio grows from 1 to 300.

Basis O V

cc-pVDZ 5 19

cc-pVTZ 5 53

Table 4.7: O and V values of H2O for different basis sets.

4.4.4 Performance comparison of OpMin vs. NWChem

NWChem is an ab initio computational chemistry software package designed to

run on high-performance parallel supercomputers, developed by the Environmen-

tal Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Lab-

oratory (PNNL). Many implementations in NWChem are generated by the TCE.

We collect the operation count of CCSD, CCSDT, and EOMCCSD equations by

recording the number of double precision floating point operations performed by

NWChem. For each equation, we tested the H2O molecule using two different basis sets: (1) cc-pVDZ, and (2) cc-pVTZ. The basis sets differ in the values of O and

V, as shown in Table 4.7.

Table 4.8 shows the operation count obtained from OpMin and NWChem for

CCSD and CCSDT. We also report operation count of these equations with spa-

tial symmetry exploited (denoted as c2v). Here, the operation count reduction is

presented as speedup, computed as

Speedup = (Operation count of NWChem) / (Operation count of OpMin).

All numbers from OpMin are better than those from NWChem, with speedups

ranging from 1.08× to 4.62×. OpMin achieved an average speedup of 1.68× for

Equation   Basis     Symmetry   NWChem      OpMin       Speedup

UCCSD      cc-pVDZ   c1         1.24 ⋅ 10   8.56 ⋅ 10   1.45
UCCSD      cc-pVDZ   c2v        2.00 ⋅ 10   7.73 ⋅ 10   2.59
UCCSD      cc-pVTZ   c1         2.60 ⋅ 10   2.40 ⋅ 10   1.08
UCCSD      cc-pVTZ   c2v        3.03 ⋅ 10   1.90 ⋅ 10   1.59

RCCSD      cc-pVDZ   c1         9.31 ⋅ 10   3.71 ⋅ 10   2.51
RCCSD      cc-pVDZ   c2v        1.51 ⋅ 10   3.27 ⋅ 10   4.62
RCCSD      cc-pVTZ   c1         2.01 ⋅ 10   8.84 ⋅ 10   2.27
RCCSD      cc-pVTZ   c2v        2.35 ⋅ 10   6.77 ⋅ 10   3.47

UCCSDT     cc-pVDZ   c1         1.36 ⋅ 10   8.38 ⋅ 10   1.62
UCCSDT     cc-pVDZ   c2v        1.27 ⋅ 10   6.65 ⋅ 10   1.91
UCCSDT     cc-pVTZ   c1         8.28 ⋅ 10   4.76 ⋅ 10   1.74
UCCSDT     cc-pVTZ   c2v        5.19 ⋅ 10   3.26 ⋅ 10   1.59

Table 4.8: Operation count comparison of OpMin vs. NWChem for CCSD and CCSDT equations. Various combinations of basis sets (cc-pVDZ/cc-pVTZ) and symmetry point groups (c1/c2v) are tested for H2O.

Equation    Excitation   Basis     NWChem      OpMin       Speedup

UEOMCCSD    X1           cc-pVDZ   1.23 ⋅ 10   3.75 ⋅ 10   3.28
UEOMCCSD    X1           cc-pVTZ   1.48 ⋅ 10   4.57 ⋅ 10   3.24
UEOMCCSD    X2           cc-pVDZ   2.05 ⋅ 10   1.53 ⋅ 10   1.34
UEOMCCSD    X2           cc-pVTZ   3.81 ⋅ 10   3.29 ⋅ 10   1.16

REOMCCSD    X1           cc-pVDZ   6.65 ⋅ 10   1.40 ⋅ 10   4.75
REOMCCSD    X1           cc-pVTZ   7.70 ⋅ 10   1.40 ⋅ 10   5.50
REOMCCSD    X2           cc-pVDZ   1.57 ⋅ 10   7.20 ⋅ 10   2.18
REOMCCSD    X2           cc-pVTZ   2.98 ⋅ 10   1.51 ⋅ 10   1.97

Table 4.9: Operation count comparison of OpMin vs. NWChem for EOMCCSD equations. Two basis sets (cc-pVDZ/cc-pVTZ) are tested for H2O.

UCCSD, 3.22× for RCCSD, and 1.72× for UCCSDT. For UEOMCCSD and REOM-

CCSD equations, OpMin attained an average speedup of 2.26× and 3.6×, as shown in Table 4.9.

4.4.5 Performance comparison of OpMin vs. PSI3

Table 4.10 compares the operation count of OpMin and PSI3, an open-source

software package of ab initio quantum chemistry programs. We report numbers

only for UCCSD and RCCSD since PSI3 has not yet implemented other methods.

The performance of OpMin is competitive with PSI3, if not superior. The average

speedup is 1.19× for UCCSD, but 0.84× for RCCSD. PSI3 performed better on the

Equation   Basis     Symmetry   PSI3        OpMin       Speedup

UCCSD      cc-pVDZ   c1         9.80 ⋅ 10   8.56 ⋅ 10   1.14
UCCSD      cc-pVDZ   c2v        1.10 ⋅ 10   7.73 ⋅ 10   1.42
UCCSD      cc-pVTZ   c1         2.42 ⋅ 10   2.40 ⋅ 10   1.01
UCCSD      cc-pVTZ   c2v        2.23 ⋅ 10   1.90 ⋅ 10   1.17

RCCSD      cc-pVDZ   c1         2.32 ⋅ 10   3.71 ⋅ 10   0.63
RCCSD      cc-pVDZ   c2v        3.67 ⋅ 10   3.27 ⋅ 10   1.12
RCCSD      cc-pVTZ   c1         5.31 ⋅ 10   8.84 ⋅ 10   0.60
RCCSD      cc-pVTZ   c2v        6.82 ⋅ 10   6.77 ⋅ 10   1.01

Table 4.10: Operation count comparison of OpMin vs. PSI3 for CCSD equations. Various combinations of basis sets (cc-pVDZ/cc-pVTZ) and symmetry point groups (c1/c2v) are tested for H2O.

Equation   Basis     Genetic     OpMin       Speedup

UCCSD      cc-pVDZ   3.58 ⋅ 10   8.56 ⋅ 10   4.18
UCCSD      cc-pVTZ   1.02 ⋅ 10   2.40 ⋅ 10   4.25

UCCSDT     cc-pVDZ   1.88 ⋅ 10   8.38 ⋅ 10   2.24
UCCSDT     cc-pVTZ   1.60 ⋅ 10   4.76 ⋅ 10   3.36

Table 4.11: Operation count comparison of OpMin vs. Genetic for CCSD and CCSDT equations. Two basis sets (cc-pVDZ/cc-pVTZ) are tested for H2O using the c1 point symmetry group. Numbers for Genetic are extracted from [24, 28].

RCCSD equation because it implemented a manually optimized equation. How- ever, in the case that spatial symmetry is exploited, the operation count is compa- rable (average of 1.07×).

4.4.6 Performance comparison of OpMin vs. Genetic

Table 4.11 compares the operation count of OpMin and Genetic. We present numbers only for UCCSD and UCCSDT because these were the only data available in their document for the equations we evaluated. The average speedup is 4.22× for UCCSD, and 2.8× for UCCSDT.

4.4.7 Choosing a small set of optimized forms

OpMin is capable of optimizing tensor contractions of any value of O and V. In practice, it is time consuming to generate optimal expression for all possible combi- nations of O and V. We show how to select a few combinations of O and V to produce tensor expressions within 3% of the optimal form for all O and V of interest.

The rationale is that the relative cost contribution to the total operation count of

any two forms of the same order in a tensor expression remains the same, no matter

what values of O and V are, as long as the ratio between O and V does not change.

Consider a tensor expression of order n. The computation cost of two different

forms of this tensor expression can be represented as

$O^{a} \cdot V^{\,n-a}$

and

$O^{b} \cdot V^{\,n-b}.$

Assume that the ratio

V/O = c.

Without loss of generality, we set

O = x and

V = c ⋅ x where x is some non-negative integer.

The relative cost contribution between these terms is computed as

$$\frac{O^{a} \cdot V^{\,n-a}}{O^{b} \cdot V^{\,n-b}} = \frac{x^{a} \cdot (c \cdot x)^{\,n-a}}{x^{b} \cdot (c \cdot x)^{\,n-b}} = c^{\,b-a}$$

As we can see, the result depends only on a, b, and c, but not on x. This outcome

implies that the optimal tensor expression resulting from different pairs of O and V

that have the same ratio should be similar, because the relative cost contribution of

the highest order terms are the same.
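For instance (an illustrative instantiation of the expression above, with the order, exponents, and ratio chosen here rather than taken from the text):

% Example: two forms of an order-6 expression, O^2 V^4 (a = 2) and
% O^4 V^2 (b = 4), at a ratio V/O = c = 10.
\[
  \frac{O^{2} \cdot V^{4}}{O^{4} \cdot V^{2}}
    = c^{\,b-a}
    = 10^{2}
    = 100,
\]
% so the O^2 V^4 form dominates the total cost by the same factor of 100
% for every (O, V) pair with V/O = 10, which is why forms optimized at the
% same ratio tend to coincide.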

Figure 4.1: Operation count for V/O = 40 optimized at different ratios.

Figure 4.2: Percentage difference from optimal operation count for different V/O ratios op- timized at V/O = 5 and V/O = 40.

We examined the above equation using UCCSD-T2 and reported the result in

Figure 4.1 and Figure 4.2. Figure 4.1 shows the ratio of the operation count opti- mized at various V/O to the optimal operation count optimized at V/O = 40. The x-axis represents various V/O ratios, and the y-axis indicates the ratio of operation count optimized at these ratios with respect to the optimal operation count. For the ratio V/O from 1 to 15, the operation counts are roughly 10-15% more than the optimal. As the ratio of V/O becomes closer to 40, the ratio is close to 1, implying that the operation count is similar to the optimally optimized case. Furthermore, the operation count ratio changed only after a large increase in the V/O ratio.

Figure 4.2 shows the difference of operation count in percentage between the optimal form to other forms of different V/O ratio. We can see that using two pairs of V/O ratio (5 and 40) is sufficient to span the entire V and O space for O ranging from 10 to 80 and V ranging from 10 to 400, while staying within 3% of the optimal operation count. In Figure 4.1, the ratio of V/O = 10 served as a good transition point for deciding which V/O ratio to use. For all V/O smaller than 10, we can use the tensor expression optimized from V/O = 5; for V/O greater than 10, the tensor expression generated from V/O = 40 can be used.

4.5 Summary

In this chapter, we presented OpMin for operation minimization of arbitrary tensor expressions with two different kinds of symmetry properties. We demon- strated the effectiveness of OpMin using many equations from CC theory. OpMin is especially effective in optimizing complex equations, such as CCSDT and EOM-

CCSD, which are difficult to implement manually. The equations optimized by

OpMin require fewer operations than those from NWChem, PSI3, and a genetic algorithm implementation. Furthermore, we have shown that tensor contraction expressions optimized for a few combinations of O and V can be used as effective solutions for all possible combinations of O and V of interest.

CHAPTER 5

Dynamic Load-balanced Tensor Contractions

5.1 Introduction

Tensor contractions are generalized higher dimensional matrix-matrix multipli-

cation operations that often arise in the area of quantum chemistry, such as the Cou-

pled Cluster (CC) methods [7], one of the popular many-body methods. Rapid de- velopment of high-performance implementations for new and sophisticated com- putational methods is crucial to accelerate the speed of scientific progress. As man- ual development of these calculations is very tedious and time-consuming, auto- mated approaches, such as the Tensor Contraction Engine (TCE) [6, 9, 32], exist

and allow quantum chemistry experts to specify tensor contraction expressions in

a high-level form, from which efficient parallel programs are automatically syn-

thesized. For instance, the production computational chemistry software package

NWChem [74] contains more than two million lines of FORTRAN code generated

by the TCE.

The TCE utilizes a block-sparse data representation to exploit symmetric and

spatial properties of tensors. Only tensor blocks (or tiles) containing nonzero ele-

ments that cannot be derived by transposing other symmetric blocks are explicitly

stored, using the Global Arrays (GA) [51]. In NWChem, each tensor contraction

is translated into a set of nested loops that perform computation on these tensor

blocks. The nested loops are parallelized over multiple processors, using a central-

ized dynamic load balancing scheme. In the past, exploiting intra-contraction par-

allelism was sufficient for computing tensor contractions. However, a few recent

studies [33, 56] have reported scalability challenges with strong scaling on hun- dreds and thousands of cores.

The original TCE is a domain-specific compiler that translates high-level expres- sions representing tensor contractions into actual programs with fixed parameters.

In NWChem, CC methods are generated via the TCE as several subroutines of FOR-

TRAN code. For instance, there are 19 subroutines for the single excitation (T1) and

44 subroutines for the double excitations (T2) of the Coupled Cluster Singles and

Doubles (CCSD) method. All of these subroutine programs exhibit similar code structure and patterns. The large volume of code causes long compilation time and inconvenience for manipulation and performance tuning of these programs, especially for methods with high excitations. Moreover, manual performance optimizations conducted for the generated code have resulted in significant differences between the code generated by the original TCE and the current code of some methods in

NWChem after hand-tuning.

To address all the issues mentioned, we present Dynamic Load-balanced Tensor Contractions (DLTC), a lightweight, domain-specific library for task parallel execution of tensor contraction expressions on distributed memory systems. DLTC exploits inter-contraction parallelism to remove many synchronization steps and enhance load balancing among processors — we are not aware of previous efforts

that address automatic parallelization of multiple contractions. DLTC provides domain-specific primitives that enable dynamic decomposition of tensor contrac- tions into fine-grained units of work (or tasks). These tasks are scheduled, executed, and dynamically load-balanced by a task parallel execution runtime. DLTC allows developers to express arbitrary tensor contractions in a high-level, domain-specific abstraction, while the details of intra- and inter-contraction parallelization, data distribution, and load balancing remain hidden.

The following are the principal contributions of DLTC:

• DLTC is a sustainable library for implementing arbitrary tensor contraction

expressions on distributed memory systems. The library approach improves

programming productivity and reduces code volume, compared to the com-

piler approach used by the TCE.

• DLTC provides an embedded domain-specific language for users to express

their tensor contractions. Task decomposition, data distribution, and dynamic

load-balanced execution are all taken care of automatically.

• By executing tasks from multiple independent contractions concurrently, DLTC

eliminates many barrier synchronization steps in the code generated by the

TCE, leading to a better performance.

• DLTC presents a scheme to overlap computation and communication, by us-

ing two sets of local buffers in each process, further reducing the total execu-

tion time.

We evaluate DLTC and demonstrate its efficiency and scalability by comparing

it with other implementations, using examples from CC methods. The rest of the

chapter is organized as follows:

Section 5.2 describes the problem being addressed.

Section 5.3 describes the domain-specific primitives in DLTC, and explains the

actions taken by DLTC step-by-step.

Section 5.4 discusses performance evaluation.

Section 5.5 summarizes this chapter.

5.2 Problem

A typical CC method consists of tens to hundreds of tensor contraction expres-

sions. The input to DLTC is a set of binary tensor expressions whose total num-

ber of arithmetic operations has been reduced through algebraic transformation.

Throughout this chapter, we will use the double excitation (T2) of the Coupled

Cluster Doubles (CCD) method to explain DLTC algorithms.

Table 5.1 lists the tensor contraction expressions of CCD-T2 obtained from the

NWChem TCE module. An index name beginning with either letter p or letter

h designates that it is either a virtual orbital of range V or an occupied orbital of

range O. In practice, V is usually greater than O by a factor of tens to hundreds.

For conciseness, we remove the coefficients of these expressions. Table 5.1 also

illustrates the computation cost in big-O notation for each expression. We use a

problem size N to represent both O and V such that N = O = V.

Figure 5.1 illustrates the Directed Acyclic Graph (DAG) representation of the CCD-

T2 tensor contraction expressions in Table 5.1. A vertex (or node) indicates either

Node   Expression                                                     Cost

1    R[p3, p4, h1, h2] += V[p3, p4, h1, h2]                           O(N^4)

2    I[h5, h1] += F[h5, h1]                                           O(N^2)

3    I[h5, h1] += T[p6, p7, h1, h8] ∗ V[h5, h8, p6, p7]               O(N^5)

4    R[p3, p4, h1, h2] += T[p3, p4, h1, h5] ∗ I[h5, h2]               O(N^5)

5    I[p3, p5] += F[p3, p5]                                           O(N^2)

6    I[p3, p5] += T[p3, p6, h7, h8] ∗ V[h7, h8, p5, p6]               O(N^5)

7    R[p3, p4, h1, h2] += T[p3, p5, h1, h2] ∗ I[p4, p5]               O(N^5)

8    I[h7, h9, h1, h2] += V[h7, h9, h1, h2]                           O(N^4)

9    I[h7, h9, h1, h2] += T[p5, p6, h1, h2] ∗ V[h7, h9, p5, p6]       O(N^6)

10   R[p3, p4, h1, h2] += T[p3, p4, h7, h9] ∗ I[h7, h9, h1, h2]       O(N^6)

11   I[h6, p3, h1, p5] += V[h6, p3, h1, p5]                           O(N^4)

12   I[h6, p3, h1, p5] += T[p3, p7, h1, h8] ∗ V[h6, h8, p5, p7]       O(N^6)

13   R[p3, p4, h1, h2] += T[p3, p5, h1, h6] ∗ I[h6, p4, h2, p5]       O(N^6)

14   R[p3, p4, h1, h2] += T[p5, p6, h1, h2] ∗ V[p3, p4, p5, p6]       O(N^6)

Table 5.1: Tensor contraction expressions of CCD-T2.


Figure 5.1: DAG representation of CCD-T2.

a binary tensor contraction (C += A ⋅ B) or a unary tensor addition (X += Y). An edge between two nodes stands for the constraint that the predecessor node must be computed before the successor, due to dependence on intermediate tensors. A darker node has a higher arithmetic complexity than a lighter node.

The TCE compiler converts tensor expressions into parallel programs by first finding a topological ordering of this input DAG, based on the dependences among the contractions. Each node is translated into a separate subroutine program with parallelized nested loops. The nodes are computed serially, one after another. A shared global counter is used for assigning tasks and balancing tasks among pro- cessors for each contraction.

One drawback of the TCE approach is that all processors are constrained to par- ticipate in the computation of each contraction, regardless of the computational complexity. The computational workload of a larger contraction can be orders of magnitude higher than that of a small contraction. For example, node 14 in Table

5.1 has a computational cost of O(N^6) arithmetic operations while node 2 only has a cost of O(N^2). As the computationally dominant contractions are well-optimized and scalable, the less expensive ones may become performance bottlenecks due to communication and synchronization overheads. For example, Ozog et al. [55, 56] reported a noticeable overhead due to the use of the shared global counter in the

NWChem TCE programs, limiting the scalability on large scale systems.

DLTC addresses these issues by partitioning the input DAG of expressions into a sequence of levels. Each level consists of a number of independent nodes that can be executed concurrently because they only depend on tasks at some previous level. Tasks are dynamically partitioned from these nodes, based on their problem size and available resources. Through a task parallel execution runtime, a process steals tasks from other processes whenever it runs out of tasks. A double buffering technique is employed to overlap communication with computation.

5.3 Methods

In this section, we discuss the design and implementation of a task-based exe- cution library in the form of an embedded domain-specific language (DSL), to com- pute a set of tensor contraction expressions. DLTC integrates with existing PGAS programming models and task parallel programming models, and addresses the computation of tensor contraction expressions by dynamically dividing the work- load into tasks operating on data stored in a global address space. DLTC defines a domain-specific programming model designed for high performance and produc- tivity.

We first introduce the domain-specific primitives provided by DLTC. Next, we present a proxy application replicating the functionality of the NWChem TCE module to serve as a baseline for performance characterization. The optimizations in

DLTC are then explained step-by-step with simplified code snippets.

5.3.1 Domain-specific primitives

Domain-specific languages (DSL) have gained popularity in recent years for their ability to offer a high-level abstraction to application developers while en- abling high performance on diverse computer architectures. Hence, we design

DLTC as a domain-specific library so that any modification to it can be immediately effective in production code. Such an approach can reduce a substantial amount of development time and code size for complex tensor expressions. DLTC provides a small set of domain-specific primitives for users to specify tensor contraction expressions.

Most of the domain-specific primitives are defined as C++ classes. Before we present these primitives, a few data structures are defined:

/* Define two types of index ranges */
enum IndexRange { O, V };

/* Define a number of index names */
enum IndexName { h1, h2, h3, ..., p1, p2, p3, ... };

/* Define a data structure for index-value mapping */
typedef std::map<IndexName, int> IndexValueMap;

Tensor The Tensor class provides collective functions for creating and destroying tensors, as well as non-collective functions for getting, putting, and accumulating a particular block (or tile) of the tensor. The core functions are presented in the

C-style code snippet below:

/***
 * Abstraction for tensors
 */
class Tensor {
 public:
  /* Default constructor */
  Tensor();

  /* Default destructor */
  ~Tensor();

  /* Constructs a Tensor instance of the specified dimensionality and ranges */
  Tensor(const std::vector<IndexRange>& dim, int id);

  /* Return the ID of the tensor */
  int id();

  /* Return the dimensionality of the tensor */
  int dim();

  /* Set index names of each dimension when specifying a contraction */
  void setIndexName(const std::vector<IndexName>& in);

  /* Fetch a particular tile of the tensor to buffer */
  void getTile(double* buf, int bufSize, const IndexValueMap& ivm);

  /* Copy the data in buffer to a particular tile of tensor */
  void putTile(double* buf, int bufSize, const IndexValueMap& ivm);

  /* Accumulate the data in buffer to a particular tile of tensor */
  void accTile(double* buf, int bufSize, const IndexValueMap& ivm, double coef);

  ...
};

/* Convenient functions for creating tensors */
Tensor Tensor2(int id, IndexRange i1, IndexRange i2);
Tensor Tensor4(int id, IndexRange i1, IndexRange i2, IndexRange i3, IndexRange i4);
...

/* Convenient functions for setting index names */
void setIndexName2(Tensor& t, IndexName i1, IndexName i2);
void setIndexName4(Tensor& t, IndexName i1, IndexName i2, IndexName i3, IndexName i4);
...

Note that the getTile and accTile member functions are interfaced with PGAS programming models, wrapping the one-sided communication operations provided by the PGAS model. DLTC also provides other customized functions to handle permutation, spin, and spatial symmetry properties of tensors. The setIndexName function is used to specify the contracting indices.

Two types of tensor expressions are supported in the current implementation of DLTC: (1) Addition, for unary tensor addition, and (2) Multiplication, for binary tensor contraction.

Addition The Addition class defines a unary summation of X += Y. The core func- tions are shown below:

/***
 * Abstraction for unary addition: X += coef * Y
 */
class Addition {
 public:
  /* Default constructor */
  Addition();

  /* Default destructor */
  ~Addition();

  /* Constructs an Addition instance of the specified unary addition */
  Addition(const Tensor& tX, const Tensor& tY, double coef);

  /* Return the LHS tensor of the addition */
  Tensor tX();

  /* Return the RHS tensor of the addition */
  Tensor tY();

  /* Return the coefficient of the addition */
  double coef();

  /* Return the index names of the addition (same for both input and output tensor) */
  std::vector<IndexName> extIds();

  /* Create an iterator that cycles through external indices */
  Iterator createExtIter();
};

Multiplication The Multiplication class defines a binary multiplication of C +=

A ⋅ B, as shown below:

/***
 * Abstraction for binary multiplication: C += coef * A * B
 */
class Multiplication {
 public:
  /* Default constructor */
  Multiplication();

  /* Default destructor */
  ~Multiplication();

  /* Constructs a Multiplication instance of the specified binary contraction */
  Multiplication(const Tensor& tC, const Tensor& tA, const Tensor& tB, double coef);

  /* Return the LHS tensor of the multiplication */
  Tensor tC();

  /* Return the first RHS tensor of the multiplication */
  Tensor tA();

  /* Return the second RHS tensor of the multiplication */
  Tensor tB();

  /* Return the coefficient of the multiplication */
  double coef();

  /* Return the external index names of the multiplication */
  std::vector<IndexName> extIds();

  /* Return the summation index names of the multiplication */
  std::vector<IndexName> sumIds();

  ...

  /* Create an iterator that cycles through external indices */
  Iterator createExtIter();

  /* Create an iterator that cycles through summation indices */
  Iterator createSumIter();

  ...
};

The functions createExtIter and createSumIter are used to generate iterators that cycle through external and summation indices, respectively. DLTC also provides

many other useful functions, such as functions to return the external indices from input tensors, functions to return indices of tensors in the order of memory layout

(for the use of index reordering), etc.

Listing 5.1 Example of specifying a tensor contraction: I[h5, h1] += T[p6, p7, h1, h8] ∗ V[h5, h8, p6, p7].

/* Define tensor names */
enum TensorName {
  R_VVOO,                      // Result
  F_OO, F_VV,                  // F integrals
  T_VO, T_VVOO,                // T amplitudes
  V_OOOO, V_OOOV, V_OOVO, ...  // V integrals
  I_1, I_2, ...                // intermediates
};
/* Creating tensor instances */
Tensor i = Tensor2(I_1, O, O);
Tensor t = Tensor4(T_VVOO, V, V, O, O);
Tensor v = Tensor4(V_OOVV, O, O, V, V);
/* Specifying contraction */
setIndexName2(i, h5, h1);
setIndexName4(t, p6, p7, h1, h8);
setIndexName4(v, h5, h8, p6, p7);
double coef = 1.0;
/* Creating a multiplication instance */
Multiplication m = Multiplication(i, t, v, coef);

Listing 5.1 is an example of a code snippet showing how node 3 in Table 5.1 is specified. The mapping of high-level tensor expressions into domain-specific primitives is achieved via an automated code generator. The tensors are created by specifying their names, number of dimensions, and their ranges. For symmetric tensors, the creation involves an extra step that determines whether a particular block will be nonzero or symmetric with respect to an existing block, via a num- ber of logical tests on the value of tile indices. The data of each tensor is evenly distributed to all processes in the global address space.

An addition or multiplication node is created by specifying its input and output tensors, along with a coefficient (default is 1.0 if not provided). Setting the names of indices determines which indices are external and contracted. Index reordering may be required before calling matrix-matrix multiplication kernels for computa- tion. For instance, a contraction of

C[i, j, m, n] += A[i, j, k, l] ∗ B[k, l, m, n] is essentially the same as a 2-D matrix-matrix multiplication, treating a consecutive pair of tensor indices as a “macro” index that runs over a range corresponding to the product of the ranges of the tensor indices. However, tensor A in the following contraction using the same tensors

C[i, j, m, n] += A[i, k, j, l] ∗ B[k, l, m, n] will require an index reordering (layout transform) between indices k and j. In DLTC, index reordering is implemented as a number of concurrent multi-threaded functions using OpenMP.
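A possible shape of such a multi-threaded reordering routine is sketched below for the permutation in the example (swapping the second and third dimensions). It is an illustrative sketch written for this chapter, not the DLTC implementation; the row-major layout, dimension arguments, and function name are assumptions.

#include <cstddef>

// Illustrative OpenMP sketch (not DLTC code): reorder a 4-D tile stored in
// row-major order from layout A[i][k][j][l] to At[i][j][k][l], so that (i,j)
// and (k,l) become contiguous "macro" indices suitable for a DGEMM call.
void reorder_ikjl_to_ijkl(const double* A, double* At,
                          std::size_t ni, std::size_t nk,
                          std::size_t nj, std::size_t nl) {
#pragma omp parallel for collapse(2)
  for (std::size_t i = 0; i < ni; ++i) {
    for (std::size_t j = 0; j < nj; ++j) {
      for (std::size_t k = 0; k < nk; ++k) {
        for (std::size_t l = 0; l < nl; ++l) {
          At[((i * nj + j) * nk + k) * nl + l] =
              A[((i * nk + k) * nj + j) * nl + l];
        }
      }
    }
  }
}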

After the tensors and expressions are defined, DLTC decomposes each expres- sion into smaller computational tasks represented by iterators.

Iterator Each iterator can be viewed as a subset of the computation in the original expression. An iterator traverses through a range of the iteration space of the in- dices in the original loop nest. In DLTC, an iterator represents a task to be executed on any process.

/***
 * Abstraction for iterator
 */
class Iterator {
 public:
  /* Default constructor */
  Iterator();

  /* Default destructor */
  ~Iterator();

  /* Constructs an Iterator instance */
  Iterator(const std::vector<IndexName>& ids, const std::vector<int>& lo, const std::vector<int>& hi);

  /* Return the index-value mapping of the next iteration */
  bool next(IndexValueMap* ivm);

  /* Check if the iteration ends */
  bool hasNext();

  /* Reset the iterator */
  void reset();

  ...
};

/* Convenient function for creating an iterator */
Iterator genIter(int dim, IndexName ids[], int lo[], int hi[]);

We use standard 2-D matrix-matrix multiplication as an example. Given the fol-

lowing expression

C[i, j] += A[i, k] ∗ B[k, j] where C, A, and B are 2-D matrices, and i, j, and k are indices of the same range N. A straightforward implementation for this expression is to use a loop nest with three loops, as shown in Listing 5.2.

Listing 5.3 displays an identical computation by creating a single iterator that is

capable of cycling through the entire range of all three indices. An iterator treats

each iteration within the iteration space as a lattice point if plotted explicitly as

a graph. Figure 5.2 shows the concept of using iterators for representing nested

loops. The first iterator cycles through the entire iteration space for the three loops.


Figure 5.2: Iteration space of iterators.

Listing 5.2 Example of 2-D matrix-matrix multiplication using three loops.

for (i = 0; i < N; ++i) {
  for (j = 0; j < N; ++j) {
    for (k = 0; k < N; ++k) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}

Listing 5.3 Creating a single iterator for 2-D matrix-matrix multiplication.

enum IndexName { I, J, K };
IndexName id[] = { I, J, K };
int lo[] = { 0, 0, 0 };
int hi[] = { N, N, N };
Iterator it = genIter(3, id, lo, hi);

However, we can break down the workload by partitioning the iteration space.

Figure 5.2 also shows three different ways of breaking down the iterator on different dimensions. One can view them as different ways of slicing the 3-D iteration space into three 2-D planes. Each of the new iterators now cycles through a portion of the original iteration space, and together their work equals the original workload.

Listing 5.4 shows the code snippet to break down the original iterator into two on dimension K.

Listing 5.4 Creating two iterators for 2-D matrix-matrix multiplication by dividing on dimension K.

enum IndexName { I, J, K };
IndexName id[] = { I, J, K };
int lo1[] = { 0, 0, 0 };
int hi1[] = { N, N, N / 2 };
int lo2[] = { 0, 0, N / 2 };
int hi2[] = { N, N, N };
Iterator it1 = genIter(3, id, lo1, hi1);
Iterator it2 = genIter(3, id, lo2, hi2);

Listing 5.5 Task function for 2-D matrix-matrix multiplication.

matrix_multiply(Iterator* it) {
  IndexValueMap ivm;
  while (it->next(&ivm)) {
    int i = ivm[I];
    int j = ivm[J];
    int k = ivm[K];
    C[i][j] += A[i][k] * B[k][j];
  }
}

Given an iterator and the corresponding expression it is generated from, a task function defines how the actual computation is performed on a process. Listing

5.5 shows the task function which takes the original iterator and the generating ex- pression as input parameters to perform the actual computation. Within the task function, a loop calls the next function provided by the iterator as long as it returns true. The mapping of index names and their values for a particular iteration is stored in map ivm. The entire computation can be partitioned arbitrarily into mul- tiple iterators with different iteration spaces.
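One way the next function can be realized is as an odometer over the index bounds, as in the sketch below. This is an illustrative example written for this chapter, not the DLTC implementation; it reuses the IndexName and IndexValueMap types defined earlier, and the free-function form and parameter names are assumptions.

#include <map>
#include <vector>

// Illustrative sketch (not DLTC code): advance an odometer over the half-open
// ranges [lo[d], hi[d]) and write the current position into 'ivm'. Returns
// false once every point has been visited. 'cur' holds the iterator state
// between calls and must start equal to 'lo'; 'done' must start as false.
bool nextPoint(const std::vector<IndexName>& ids,
               const std::vector<int>& lo, const std::vector<int>& hi,
               std::vector<int>& cur, bool& done, IndexValueMap* ivm) {
  if (done) return false;
  for (std::size_t d = 0; d < ids.size(); ++d) (*ivm)[ids[d]] = cur[d];
  // Advance the rightmost dimension first, carrying over like an odometer.
  for (std::size_t d = ids.size(); d-- > 0;) {
    if (++cur[d] < hi[d]) return true;
    cur[d] = lo[d];
  }
  done = true;  // wrapped around: the point just reported was the last one
  return true;
}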

Listing 5.6 Task function for computing multiplication expressions.

task_mul(Multiplication* m, Iterator* it) {
  ... /* allocate local buffers: bufA, bufB, bufC */
  IndexValueMap ivm;
  while (it->next(&ivm)) {
    m->tA().getTile(bufA, bufAsize, ivm);
    TRANS(bufA, bufAt, m);
    m->tB().getTile(bufB, bufBsize, ivm);
    TRANS(bufB, bufBt, m);
    DGEMM(bufAt, bufBt, bufCt);
    TRANS(bufCt, bufC, m);
    m->tC().accTile(bufC, bufCsize, ivm, m->coef());
  }
  ... /* free local buffers */
}

In DLTC, computation is organized in terms of tile (or block) indices rather than individual tensor elements. Listing 5.6 illustrates the code snippet of a task function for multiplying two dense tensors in DLTC. In this task function, local buffers (bufA, bufB, and bufC) are allocated for temporary storage of input and output tensor tiles.

The iterator it cycles through an iteration space of tile indices for the tensor con- traction. The index-to-value mapping is stored in ivm. The information provided by the iterator it and the expression determines which tensor tiles to fetch, compute, and store the results.

For each particular iteration, two input tensor tiles are fetched (copied) to the local buffers using one-sided getTile operations provided by DLTC tensors. An in- dex reordering is required if the memory layout of this particular tensor block is not suitable for computation, by making calls to TRANS. We use vendor-provided Basic

Linear Algebra Subroutine (BLAS) libraries to perform matrix multiplication DGEMM.

The calculated result is subsequently transposed (if needed), and accumulated to a particular tile of the output tensor (accTile).

Listing 5.7 Task function for computing addition expressions.

task_add(Addition* a, Iterator* it) {
  ... /* allocate local buffers bufY, bufYt */
  IndexValueMap ivm;
  while (it->next(&ivm)) {
    a->tY().getTile(bufY, bufYsize, ivm);
    TRANS(bufY, bufYt, a);
    a->tX().accTile(bufYt, bufYsize, ivm, a->coef());
  }
  ... /* free local buffers */
}

Listing 5.7 illustrates a task function for tensor addition. Similarly, within each

iteration of the addition task function, an input tile is fetched and added to an out-

put tile. The cost for an addition task is usually much smaller than a multiplication

task.

5.3.2 NWChem proxy application

By using DLTC to exploit only the intra-expression parallelism, we can design a

proxy application which models the load-balancing algorithm implemented in the TCE,

often referred to as the NXTVAL scheme. DLTC implements the NXTVAL scheme using the following procedures:

1. Create an iterator that cycles through the entire iteration space for a particular

expression.

2. Replicate this iterator on every process.

3. Execute a customized task function with the NXTVAL function embedded in

the loop.

Listing 5.8 shows the pseudo code of this model in DLTC. All processes exe-

cute this task function at the same time with the replicated iterator mentioned in

the previous steps. In the NXTVAL scheme, every process performs an atomic get- and-increment of a shared global counter to determine the next task it should exe- cute, by comparing the global counter with a local counter. The value of the global counter uniquely identifies the work to be done in the current iteration. All pro- cesses continue this procedure until no tasks remain to be executed.

Listing 5.8 NXTVAL: A centralized load balancing scheme using a global shared counter.

task_nxtval(Expression* exp, Iterator* it) {
  ... /* allocate local buffers */
  int count = 0;
  int next = NXTVAL();
  IndexValueMap ivm;
  while (it->next(&ivm)) {
    if (next == count) {
      ... /* compute task */
      next = NXTVAL();
    }
    count += 1;
  }
  ... /* free local buffers */
}

5.3.3 Exploiting inter-contraction parallelism

To exploit parallelism across different contractions, we begin by dividing the

nodes into different groups, so that tasks in the same group are independent, i.e.,

can be executed concurrently. Partitioning the vertices of a DAG graph into layers,

or levels, can be formalized as a variation of the layering problem [8, 72], which is

described as follows:

Consider a DAG of tensor expressions G = (V, E) with a set of vertices V and a set of directed edges E. Let $L = \{L_1, L_2, \ldots, L_h\}$ be a partition of the vertex set of

G into $h \geq 1$ subsets such that for each directed edge $(u, v) \in E$, where u is from $L_i$ and v is

from $L_j$, we have $i < j$. The partitioning process is called layering, and the partitioned

sets $L_1, L_2, \ldots, L_h$ are referred to as layers (or levels) of G. We seek a layering such

that the number of layers, or height, is minimized, while the total extra memory

required for intermediate tensors in each layer does not exceed an upper bound

M. Assume that the memory is sufficient for all of the intermediates; a longest-

path algorithm can find the layering with minimal height in linear time[49]. The


Figure 5.3: Layered DAG representation of CCD-T2 in DLTC.

Coffman-Graham algorithm [19] is a greedy algorithm shown to be effective for

solving a layering problem with a bounded number of vertices in each layer. The

worst-case time complexity of the Coffman-Graham algorithm is O(|V|^2). Rather

than checking the number of vertices, we modified the algorithm so that it checks

the total amount of extra memory, as shown in the following steps.

1. Represent the partial order of the input DAG by removing all of the transitive

edges.

2. Construct a topological ordering for the reduced graph.

3. For each vertex v in the reverse of the topological ordering, add v to the lowest

level that is at least one level higher than the highest level of any outgoing

vertices of v and does not exceed the extra memory requirement M.

By layering the DAG into a minimal number of levels, many of the synchroniza- tion points that would otherwise exist between the expressions can be eliminated.

Figure 5.3 shows a minimal-height layering for CCD-T2.
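A compact sketch of the level-assignment idea is given below. It processes nodes in forward topological order using predecessor levels, which mirrors the longest-path view of the layering (the modified Coffman-Graham steps above work on the reverse order); it is an illustrative example written for this chapter, and the graph representation, memory model, and function name are assumptions rather than DLTC code.

#include <algorithm>
#include <vector>

// Illustrative sketch (not DLTC code): assign each DAG node, given in a
// topological order, to the lowest level compatible with its predecessors,
// bumping it to a later level when the extra intermediate memory placed on
// the chosen level would exceed the bound M. 'preds[v]' lists the
// predecessors of v and 'mem[v]' is the extra memory of its intermediate.
std::vector<int> assignLevels(const std::vector<int>& topoOrder,
                              const std::vector<std::vector<int>>& preds,
                              const std::vector<double>& mem,
                              double M) {
  std::vector<int> level(topoOrder.size(), 0);
  std::vector<double> used;  // extra memory already placed on each level
  for (int v : topoOrder) {
    int lv = 0;
    for (int p : preds[v]) lv = std::max(lv, level[p] + 1);        // dependence constraint
    while (lv < (int)used.size() && used[lv] + mem[v] > M) ++lv;   // memory constraint
    if (lv >= (int)used.size()) used.resize(lv + 1, 0.0);
    used[lv] += mem[v];
    level[v] = lv;
  }
  return level;
}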


Figure 5.4: Influence of tile size selection. The profiling is performed on CCD. Problem size: 100. Number of cores: 512.

5.3.4 Dynamic task partitioning

The idea of dynamic task partitioning is to break down the iteration space into

smaller sub-spaces represented by multiple iterators. Given a particular expres-

sion, the TCE compiler approach produces a concrete translation from the expres-

sion into a subroutine program. The task partitioning is fixed, and the only con-

trollable parameter is the tile size set at the beginning of program execution that

is shared across all expressions. This rigid characteristic is one drawback of using

the compiler approach, because the overall performance is highly dependent upon

a proper tile size selection.

Figure 5.4 shows how different tile sizes affect the performance of a static task partitioning scheme. The profiling is performed using the NWChem proxy application on CCD-T2 with a problem size of N = 100, running on 64 nodes (512 cores) of

an x86/InfiniBand cluster. The total execution time of each process is split into several categories: (1) matrix multiplication (DGEMM), (2) tile transposition (TRANS),

(3) communication (COMM), (4) synchronization (SYNC), and the rest (Other).

We report the average execution time of all processors. The time breakdown is presented in a stacked bar graph. The x-axis of Figure 5.4 shows the ratio of

N over the tile width chosen, and the y-axis shows the execution time in seconds.

As we can see from the figure, while choosing a larger tile size (2X) requires less communication time, an insufficient number of tasks leads to load imbalance. On the other hand, choosing a smaller block size (6X) improves workload balance, but the costs for DGEMM, COMM, and TRANS increase due to a larger number of smaller tasks and tiles. Predicting the best tile size is often difficult, because there is a tension between having tile sizes large enough for local BLAS efficiency, but small enough to avoid inefficiency due to load imbalance. The best execution time is obtained by a careful selection of tile size which compromises between the overhead of communication and synchronization. Even for the best execution time, the synchronization time is nearly 20% of the total execution time.

To address this issue, the granularity of a task is determined by both iteration space of the iterator and the tile sizes chosen in DLTC. As shown in Figure 5.5, rather than a static task decomposition scheme for all expressions, DLTC attempts to split work in a fashion such that all tasks have similar amount of work. DLTC decides the task granularity at runtime based on the information provided from the expressions and the problem size.

This dynamic task partitioning scheme helps to maintain a better load balancing among processes. The pseudo code in Listing 5.9 shows a dynamic task partitioning


Figure 5.5: Partitioning independent expressions into a number of tasks in a task pool.

scheme that allows three different kinds of granularities, given an expression to partition. Based on the computation cost of the expression, different thresholds determine whether or not the entire expression forms a single iterator, or multiple iterators decomposed at the external or contracted index level. Tasks partitioned at the external level produce iterators that iterate over contracted indices. If tasks are decomposed at the contracted index level, each iterator is essentially a single block-block computation. The threshold can be set as the number of operations or simply the complexity of expressions. More complex partitioning schemes can be developed on the basis of architecture-specific or empirically driven performance models, which require further investigation and profiling of test executions.

5.3.5 Task distribution and execution

In a task parallel programming model, the computation is expressed as a set of tasks to be executed concurrently. A task descriptor is used to obtain the location

Listing 5.9 Dynamic task partitioning.

    task_partitioning(Expression* exp, TaskPool* tp) {
        vector<Iterator*> its;
        if (exp->cost < threshold_one_task) {
            ... /* create a single task */
        }
        else if (exp->cost < threshold_ext_tasks) {
            ... /* create external index level tasks */
        }
        else if (exp->cost < threshold_sum_tasks) {
            ... /* create contracted index level tasks */
        }
        else ...
        ...
        for (int i = 0; i < its.size(); ++i) {
            tp->addTask(its[i]);
        }
    }

of input and output data in the global address space. In DLTC, iterators serve as the task descriptors to identify tasks. For each layer of expressions, DLTC creates a separate task pool. The iterators from the expressions in the same layer go to the same task pool. The tasks in a task pool are initially distributed in a round-robin fashion, and are executed concurrently without taking data locality into account.

The task parallel execution runtime performs task stealing when any process runs out of tasks. As a result, few synchronization steps are needed between the layers.

No global locks are used for task distribution and load-balancing. In DLTC, we use random victim selection and steal-half strategies for dynamic load balancing.
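The sketch below illustrates the random-victim, steal-half policy in a simplified shared-memory form; the TaskQueue and tryStealing names, the lock-based queues, and the probe limit are assumptions made for this illustration and do not reflect the Scioto or Tascel runtime internals.

    #include <deque>
    #include <mutex>
    #include <random>
    #include <vector>

    struct Task { /* iterator / task descriptor */ };

    // Per-process task queue (simplified, lock-based for illustration only).
    struct TaskQueue {
      std::deque<Task> q;
      std::mutex m;

      bool pop(Task& t) {
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        t = q.front(); q.pop_front();
        return true;
      }

      // Steal-half: the thief takes half of the victim's remaining tasks.
      std::vector<Task> stealHalf() {
        std::lock_guard<std::mutex> g(m);
        std::vector<Task> stolen;
        size_t n = q.size() / 2;
        for (size_t i = 0; i < n; ++i) { stolen.push_back(q.back()); q.pop_back(); }
        return stolen;
      }
    };

    // When a process runs out of work, it picks a random victim and steals
    // half of that victim's tasks; no global lock is involved.
    bool tryStealing(std::vector<TaskQueue>& queues, int self, std::mt19937& rng) {
      std::uniform_int_distribution<int> pick(0, (int)queues.size() - 1);
      for (int attempt = 0; attempt < 8; ++attempt) {   // a few random probes
        int victim = pick(rng);
        if (victim == self) continue;
        std::vector<Task> got = queues[victim].stealHalf();
        if (!got.empty()) {
          std::lock_guard<std::mutex> g(queues[self].m);
          for (const Task& t : got) queues[self].q.push_back(t);
          return true;
        }
      }
      return false;                                     // no work found
    }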

5.3.6 Double buffering

With the support of PGAS models for one-sided, non-blocking communication operations, we present a scheme to overlap the data movement with computation.

Figure 5.6: Double buffering.

Figure 5.6 shows a double buffering technique for hiding communication. By allocating two sets of local buffers in memory, a process fetches data into the first set of local buffers and then swaps to process the second set, without waiting for the fetch to be completed. Swapping between buffer sets is repeated until the end of the iteration. No adaptation to the layering process is required to support double buffering.

Clearly, this approach results in faster computation because the time originally spent on waiting is used to perform computation on a separate buffer. Although this technique requires twice the amount of memory for local buffers, the cost is relatively small compared to the memory needed to store tensors. The pseudo code for multiplication with double buffering is shown in Listing 5.10. The pointer p indicates which set of local buffers is processed in the current iteration.

Listing 5.10 Task function for multiplication with double buffering.

    task_mul_db(Multiplication* m, Iterator* it) {
        IndexValueMap ivm[2];
        ... /* allocate two sets of local buffers */
        bool p = 0;        // a pointer to current buffer
        bool first = true;
        while (it->next(ivm[p])) {
            m->tA.nbGetTile(p, bufA[p], bufAsize, ivm[p]);
            m->tB.nbGetTile(p, bufB[p], bufBsize, ivm[p]);
            p = !p;        // toggle
            if (first) {
                first = false;
            }
            else {
                m->tA.wait(p);
                m->tB.wait(p);
                TRANS(bufA[p], bufAt[p], m);
                TRANS(bufB[p], bufBt[p], m);
                DGEMM(bufAt[p], bufBt[p], bufCt[p]);
                TRANS(bufCt[p], bufC[p], m);
                m->tC.accTile(bufC[p], bufCsize, ivm[p], m->coef);
            }
        }
        if (!first) {      // last iteration
            p = !p;        // toggle
            m->tA.wait(p);
            m->tB.wait(p);
            TRANS(bufA[p], bufAt[p], m);
            TRANS(bufB[p], bufBt[p], m);
            DGEMM(bufAt[p], bufBt[p], bufCt[p]);
            TRANS(bufCt[p], bufC[p], m);
            m->tC.accTile(bufC[p], bufCsize, ivm[p], m->coef);
        }
        ... /* free local buffers */
    }

5.4 Results and Discussion

5.4.1 Experimental Setup

We have evaluated DLTC on an x86/InfiniBand cluster at The Ohio State University. The cluster is configured with 160 nodes, each node with two quad-core

Intel Xeon CPU E5640 @ 2.67GHz processors and 12GB of memory. All nodes

are connected to a full fat-tree QDR InfiniBand (40Gbps) network. The system runs the MVAPICH2 (version 1.9) [1] MPI implementation on Red Hat Enterprise Linux (version 5.4). DLTC is implemented in C/C++ and compiled using the Intel Compiler (version 13.0). The Intel Math Kernel Library (MKL) is used for DGEMM calls for multiply-

ing matrices. The tensor transposition TRANS functions are implemented and par-

allelized using OpenMP. We report the experimental results of DLTC using both

GA/Scioto and Tascel programming models. DLTC-GA/Scioto utilizes GA (ver-

sion 5.0) [51] for PGAS model and GA-based Scioto [22] for task parallel execution

runtime. DLTC-Tascel employs MPI-based Tascel (version 2.0) [43] for both PGAS

model and task parallel execution runtime. Different settings for performance tun-

ing were tested, and we report the best execution time obtained for each implemen-

tation. Although the CC methods are computed iteratively, we only focus on the

performance of a single iteration.

The NXTVAL scheme described in Listing 5.8 serves as the baseline for perfor-

mance evaluation. It serves as a representative for NWChem’s dynamic behavior

on tensor contractions because it employs exactly the same task distribution and

load-balancing scheme based on GA (marked as NWChem-Proxy). The difference

is that NWChem has static subroutines (generated by the TCE) for every expression,

while NWChem-Proxy runs on the same task function with iterators dynamically

generated for every expression on-the-fly. The overhead for runtime parameter

generation is very small (within 3% of the total execution time).

5.4.2 Performance comparison of DLTC vs. CTF

We compared the parallel performance of DLTC against the Cyclops Tensor

Framework (CTF) [68, 69], a framework designed by Solomonik et al. for contract- ing distributed tensors. CTF employs a cyclic data representation to preserve ten- sor structures and to produce regular tensor decomposition. The CTF execution is split into a computation phase using the Scalable Universal Matrix Multiplication

(SUMMA) [75] algorithm, and a data redistribution phase between contractions.

CTF has been shown to be effective on computing single permutation symmetric contraction expressions, especially on torus networks, due to its organized com- munication and topology mapping of tensor data onto processors. However, all processors are forced to participate in every contraction, and redistribution of ten- sors is needed between contractions in CTF. We compare the performance of DLTC with CTF on dense tensor contractions, as spin and spatial symmetries in tensors are not yet implemented in CTF.

In the first set of experiments, we compared the parallel performance of CCD and CCSD implemented in DLTC against implementations in CTF and NWChem-Proxy. Figure 5.7 and Figure 5.8 show the results for CCD and CCSD computation as the problem size is varied on 64 nodes (512 cores). For the problem size of 100,

DLTC-GA/Scioto completed a CCD computation in 41 seconds and CCSD in 67 seconds — 2× and 2.4× faster than CTF, respectively. For smaller problems, the speedup is more evident due to the smaller percentage of time spent in computa- tion.

The execution time breakdown analysis for the problem size of 100 with the

CCSD calculation shows that both DLTC-GA/Scioto and CTF have roughly the


Figure 5.7: Performance results for CCD. Number of cores: 512.

same DGEMM time. However, DGEMM accounted for 54% of the total time for DLTC-GA/Scioto but only 18% for CTF, possibly due to a higher overhead in synchronization and data redistribution in CTF. The time breakdown for DLTC-GA/Scioto is shown in Figure 5.9. DLTC-Tascel has a similar time breakdown as DLTC-GA/Scioto, but a higher data movement overhead due to the active-message-based communication functions provided by Tascel.

Figure 5.10 and Figure 5.11 show the results of parallel strong scaling. We col- lected the data by running a fixed problem size N = 100 as we scale the number of cores. The speedup is computed on the basis of the best time obtained from DLTC-

GA/Scioto on 16 nodes (128 cores). DLTC-GA/Scioto achieved a higher speedup over CTF — up to 2.6× for CCD and over 3× for CCSD. Both figures depict better scalability of DLTC, and show that NWChem-Proxy scaled poorly as the number of cores increases.


Figure 5.8: Performance results for CCSD. Number of cores: 512.


Figure 5.9: Total execution time breakdown of DLTC for CCSD on 512 cores. Problem size: 100.

Method      NWChem TCE    DLTC

CCSD-T1       2775         304

CCSD-T2       9561         471

Table 5.2: SLOC comparison of DLTC vs. NWChem TCE.


Figure 5.10: Strong scaling comparison for CCD. Speedups are based on the best time from DLTC-GA/Scioto on 128 cores. Dotted line indicates ideal speedup. Problem size: 100.


Figure 5.11: Strong scaling comparison for CCSD. Speedups are based on the best time from DLTC-GA/Scioto on 128 cores. Dotted line indicates ideal speedup. Problem size: 100.


Figure 5.12: Performance comparison of DLTC vs. NWChem for CCSD on Uracil.


Figure 5.13: Performance comparison of DLTC vs. NWChem for CCSD on GFP.

5.4.3 Performance comparison of DLTC vs. NWChem

We implemented customized iterators which are capable of cycling through in-

dices with permutation, spin, and spatial symmetric properties of tensors. The full

implementation of CCSD was generated and tested on two real-world molecule

simulations: Uracil and green fluorescent protein (GFP). Figure 5.12 and Figure

5.13 report the total execution time for Uracil and GFP. DLTC completed the com- putation with speedup of 1.7× and 2.5× over NWChem on Uracil and GFP, respec-

tively. Table 5.2 shows the number of source lines of code (SLOC) generated by

the TCE and DLTC for CCSD-T1 and T2 expressions. The volume of source code is

significantly reduced, because the numerous contractions are performed using the

same task functions in DLTC.

5.5 Summary

In this chapter, we introduced DLTC, a library-based framework for dynamic

task partitioning of tensor contraction expressions. DLTC maps tensor contractions

into a number of tasks with more regular granularity by taking advantage of several

domain-specific abstractions. These tasks are executed in a dynamic load-balanced

fashion with a double buffering technique for hiding communication latency.

We have evaluated DLTC’s performance in comparison with NWChem and CTF

using CC methods on an x86/InfiniBand cluster. Our experiments show that DLTC

achieves improved overall performance, and suggest that viewing the entire set

of tensor expressions as a whole and exploiting parallelism across independent

expressions offers better efficiency and scalability. Furthermore, DLTC reduces the

source code volume and reduces the amount of work needed in programming and performance tuning for application developers.

The prototype DLTC is capable of implementing arbitrary tensor contraction expressions without any restriction on the number of dimensions (as long as they fit in the memory for execution). The parallel performance of DLTC is expected to be further enhanced by taking data locality into account. For example, because tensors are partitioned into a collection of dense tiles, it is feasible to co-locate tasks with the tiles they operate on, to enhance data locality and reduce communication cost. Data locality enhancement is the topic of the next chapter.

CHAPTER 6

Data Locality Enhancement of DLTC via Caching

6.1 Introduction

Irregular applications cover a wide range of fields that often present fine-grained synchronization and communication on large data sets. The parallelism in irregu- lar applications is difficult to exploit due to their complex behavior, such as unpre- dictable memory access patterns, irregular communication patterns, complex con- trol structures, and so on. Current supercomputers are designed for optimal data locality for regular applications; thus developing irregular applications demands a substantial effort, and often results in unsatisfactory performance. Maximizing the performance of irregular applications on current and emerging computer architec- tures becomes a key requirement for future systems.

In this chapter, we seek to explore solutions for supporting efficient execution of a specific irregular application — tensor contractions. The data and computation of tensors are often unstructured and irregular due to their symmetric properties by nature. Therefore, it is usual to incur load imbalance and irregular communi- cation patterns in the implementation. In particular, our motivating examples are the quantum many-body methods such as the Coupled Cluster (CC) methods from

the quantum chemistry field, which are implemented in NWChem, a well-known computational chemistry software suite. NWChem utilizes the Tensor Contraction

Engine (TCE) [6, 9, 10, 30, 32] to automatically generate computer programs for ten- sor contraction expressions. To parallelize and compute the tensor contractions,

NWChem uses the Global Arrays (GA) [51], a partitioned global address space

(PGAS) model, to allow processes to fetch tensor tiles via one-sided communication operations. The tensor tiles may be distributed physically on different processors and fetched on-demand during execution.

The DLTC is a domain-specific, task parallel execution library that improves the

TCE approach on distributed systems by eliminating many synchronization steps

The rest of this chapter is organized as follows:

Section 6.2 briefly introduces the background knowledge of existing applica- tions being addressed, and points out their optimization opportunities.

Section 6.3 presents the optimization techniques for enhancing data locality used in DLTC.

Section 6.4 presents performance evaluation results.

Section 6.5 summarizes the chapter.

6.2 Preliminary

In order to identify opportunities of performance improvement, it is crucial to

understand how tensor contractions are implemented on distributed memory sys-

tems by the NWChem TCE module and DLTC. In NWChem, the tensor contraction

expressions of CC methods are compiled into a number of computation kernels by

the TCE. All of these kernels expose similar code structure formed by nested loops

which comprise two types of operations: (1) Calls to GA functions that fetch and store tensor tiles over the network (communication), and (2) Calls to subroutines that compute on the tensor tiles (computation).

The iteration space of the loop nest is parameterized by the input data to NWChem,

and thus not determined at compile time. However, the values of the parameters that affect the behavior of the loop nest are constant during the execution of the program.

The body of the nested loops is reconstructed into tasks.

In this section, we will use CCSD-T1 as our example to explain the TCE ap-

proach. Table 6.1 illustrates the CCSD-T1 equations extracted from NWChem source

code, which consists of 19 tensor contraction expressions.

6.2.1 Block-sparse representation

To keep local memory usage at a manageable level, the indices of

tensors are rearranged into subsets that contain indices with the same symmetry

properties. Figure 6.2 explains the block-sparse representation step-by-step, using

a matrix example. Figure 6.2(a) is the original form of an 8×8 matrix (2-D tensor).

Figure 6.2(b) shows the same tensor after index range tiling — the indices are di- vided into groups and now the numbers represent the tile indices of the matrix.

Node   Expression                                                     Cost

 1     R[p2, h1] += F[p2, h1]                                         O(N^2)

 2     I[h7, h1] += F[h7, h1]                                         O(N^2)

 3     I[h7, p3] += F[h7, p3]                                         O(N^2)

 4     I[h7, p3] += −1.0 ∗ T[p5, h6] ∗ V[h6, h7, p3, p5]              O(N^4)

 5     I[h7, h1] += T[p3, h1] ∗ I[h7, p3]                             O(N^3)

 6     I[h7, h1] += −1.0 ∗ T[p4, h5] ∗ V[h5, h7, h1, p4]              O(N^4)

 7     I[h7, h1] += −0.5 ∗ T[p3, p4, h1, h5] ∗ V[h5, h7, p3, p4]      O(N^5)

 8     R[p2, h1] += −1.0 ∗ T[p2, h7] ∗ I[h7, h1]                      O(N^3)

 9     I[p2, p3] += F[p2, p3]                                         O(N^2)

10     I[p2, p3] += −1.0 ∗ T[p4, h5] ∗ V[h5, p2, p3, p4]              O(N^4)

11     R[p2, h1] += T[p3, h1] ∗ I[p2, p3]                             O(N^3)

12     R[p2, h1] += −1.0 ∗ T[p3, h4] ∗ V[h4, p2, h1, p3]              O(N^4)

13     I[h8, p7] += F[h8, p7]                                         O(N^2)

14     I[h8, p7] += T[p5, h6] ∗ V[h6, h8, p5, p7]                     O(N^4)

15     R[p2, h1] += T[p2, p7, h1, h8] ∗ I[h8, p7]                     O(N^4)

16     I[h4, h5, h1, p3] += V[h4, h5, h1, p3]                         O(N^4)

17     I[h4, h5, h1, p3] += −1.0 ∗ T[p6, h1] ∗ V[h4, h5, p3, p6]      O(N^5)

18     R[p2, h1] += −0.5 ∗ T[p2, p3, h4, h5] ∗ I[h4, h5, h1, p3]      O(N^5)

19     R[p2, h1] += −0.5 ∗ T[p3, p4, h1, h5] ∗ V[h5, p2, p3, p4]      O(N^5)

Table 6.1: Tensor contraction expressions of CCSD-T1.


Figure 6.1: DAG representation of CCSD-T1.

If permutation symmetry is exploited, only roughly half of the tiles are stored in the memory because the other half can be obtained by transposition (Figure 6.2(c)).

The blue color of tiles indicates the ones that are kept. Finally, if spatial symmetry is exploited, only a subset of tiles of tensors are stored (Figure 6.2(d)).

In NWChem, a logical test is used to check whether a particular tile is nonzero using an XOR operation on the tile indices. A tile contains nonzero elements only if the result of the XOR on the tile indices is zero. Spatial symmetry is exploited to reduce both computation and communication cost. The data movement and computation are carried out at the tile level. This block-sparse representation of tensors is critical and determines the parallel structure of the TCE-generated code. The size of the tile defines the local memory usage and the task granularity.
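As a concrete illustration of this test, the sketch below mirrors the tile-level check at a high level; the irrep table passed in is a hypothetical stand-in for the per-tile symmetry information kept by the real code.

    #include <vector>

    // A tile is nonzero only if the XOR of the irreducible-representation
    // (irrep) labels of its tile indices is zero.  The irrep table is an
    // illustrative stand-in for the per-tile symmetry data.
    bool tileIsNonzero(const std::vector<int>& tileIndices,
                       const std::vector<int>& irrepOfTile) {
      int x = 0;
      for (int idx : tileIndices) x ^= irrepOfTile[idx];
      return x == 0;
    }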

(a) Original    (b) Index range tiling    (c) Permutation symmetry    (d) Spatial symmetry

Figure 6.2: Block-sparse representation of a 2-D tensor.

6.2.2 Subroutines generated by TCE

In the NWChem TCE, tensor contractions are executed serially one after the other, introducing many synchronizations between contractions, and therefore lim- iting the overall performance. As shown in Figure 6.3, the TCE generates a separate subroutine for computing each expression. A subroutine is basically a chain of get- compute-accumulate operations dynamically parallelized through a global shared variable.

Algorithm 6.1 illustrates the subroutine for computing node 7 of CCSD-T1 in

Table 6.1:

I[h7, h1] += −0.5 ∗ T[p3, p4, h1, h5] ∗ V[h5, h7, p3, p4] where tensors are tiled over their dimensions and distributed across the system in globally addressable arrays, denoted as I, T, V. I, T, and V are implemented as 1-D global arrays to support permutation, spin, and spatial symmetries of the blocks.

The TCE generates nested loops for all the indices involved in the contraction; in this case, they are h7, h1, p3, p4, and h5. The indices given for the nested loops, global arrays, and local buffers are tile-level indices. The NXTVAL function is an operation that queries and increments a global counter gc, implemented using GA (as described previously in Section 5.3.2). Each process also iterates over all possible tasks using a local counter lc. Whenever the value of the local counter lc matches the next task assigned to a process, it executes the innermost code block.

The Symmetry function is a condensation of several logical tests for checking whether a tensor tile is nonzero or not. Each tile index represents a set of contiguous indices, and the operations are performed on tensor tiles instead of single elements.


Figure 6.3: TCE implementation of CCSD-T1.

In the innermost code block, the GA_GET operations are called to fetch tensor tiles from global arrays into local buffers. Since the data layout in a global array may not always be suitable for local computation, the TCE_SORT function is called to reorder tensor tiles. A vendor-provided BLAS routine is used in the DGEMM step for matrix-matrix multiplication. After the computation, the result tensor block is reordered if needed and accumulated via TCE_SORTACC, a function that combines TCE_SORT and GA_ACC for efficiency.

Similar to the TCE, DLTC also uses a block-sparse representation of distributed tensors in global address space. However, DLTC exploits parallelism across con- tractions by automatically organizing the expressions into layers, as shown in Fig- ure 6.4. The details on how DLTC dynamically create tasks from expressions were covered previously in Section 5.3. Here, we will focus on the further enhancements

Algorithm 6.1 TCE implementation
input: Tiled Global Arrays: I, T, V
begin
    Allocate local buffers i, t, v
    Initialize global counter gc = 0 (collective call)
    Initialize local counter lc = 0
    next = NXTVAL(gc)
    for h7, h1 ∈ O do
        if next == lc then
            if Symmetry(h7, h1) = true then
                for p3, p4 ∈ V do
                    for h5 ∈ O do
                        if Symmetry(p3, p4, h1, h5) = true then
                            if Symmetry(h5, h7, p3, p4) = true then
                                GA_GET: fetch T(p3, p4, h1, h5) into t
                                GA_GET: fetch V(h5, h7, p3, p4) into v
                                TCE_SORT: reorder t as T(h1, p3, p4, h5)
                                TCE_SORT: reorder v as V(p3, p4, h5, h7)
                                DGEMM: compute i(h1, h7) += t(h1, p3, p4, h5) ∗ v(p3, p4, h5, h7)
                            end
                        end
                    end
                end
                TCE_SORTACC: reorder and accumulate i into I(h7, h1)
            end
            next = NXTVAL(gc)
        end
        lc = lc + 1
    end
    Deallocate local buffers i, t, v
    GA_SYNC: synchronize all processes and proceed to next subroutine
end


Figure 6.4: Layered DAG representation of CCSD-T1 in DLTC.

to reduce data movement between processes, because both the TCE and the previously described DLTC do not consider data locality while scheduling and executing tasks.

6.2.3 Target equations

In this chapter, we discuss techniques of performance optimization on two ex-

amples of the CC methods: CCD-T2 and CCSD-T2. We focus on the T2 excitations

of both equations because they are the most computation intensive part. As de-

scribed in Table 5.1, CCD-T2 consists of 14 expressions (9 multiplications and 5 summations) and 13 tensors. CCSD-T2 involves 44 expressions (30 multiplications and 14 summations) and 28 tensors.

Tensors of the CC methods can be categorized into F and V integrals, the T amplitudes, the result tensor R, and a number of intermediate tensors (denoted I1, I2, I3, . . .).

The F integral is a 2-D tensor which can be viewed as an array of size

(O + V) × (O + V),


Figure 6.5: Viewing the F integral as four sub-tensors.

and the V integral is viewed as an array of size

(O + V) × (O + V) × (O + V) × (O + V).

We can assume a simpler form of data layout where the integrals are stored separately as different sub-tensors. As shown in Figure 6.5, F can be represented by the sub-tensors F_OO, F_OV, F_VO, and F_VV. Similarly, V is formed by sub-tensors such as V_OOOO, V_OOVV, V_VVVV, and so on.
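As a small illustration of such a blocked layout, the sketch below addresses F through four separately stored sub-blocks; the FBlocks structure, its at() accessor, and the convention that occupied indices are numbered before virtual indices are assumptions made for this example, not the DLTC data structure.

    #include <vector>

    // The F integral stored as four dense sub-tensors, selected by whether
    // each index falls in the occupied (O) or virtual (V) range.
    struct FBlocks {
      int O, V;
      std::vector<double> oo, ov, vo, vv;   // row-major blocks

      FBlocks(int O_, int V_)
          : O(O_), V(V_), oo(O_ * O_), ov(O_ * V_), vo(V_ * O_), vv(V_ * V_) {}

      // Element access: p and q are global indices in [0, O+V).
      double& at(int p, int q) {
        bool pv = p >= O, qv = q >= O;      // is the index virtual?
        int i = pv ? p - O : p, j = qv ? q - O : q;
        if (!pv && !qv) return oo[i * O + j];
        if (!pv &&  qv) return ov[i * V + j];
        if ( pv && !qv) return vo[i * O + j];
        return vv[i * V + j];
      }
    };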

The CC methods are iterative computations, where the computation results, the R tensors, are used to update the T tensors. For instance, the result of computing the T1 equation, R_VO, is used to update the value of T_VO for input to the next iteration. Similarly, T_VVOO is updated by the R_VVOO computed in the T2 equation.

Table 6.2 categorizes the tensors in CCD-T2 and CCSD-T2 into either 2-D or 4-D tensors. We focus on the tensors that are unchanged in the computation. Again, we use a problem size N for O and V for conciseness of our analysis. However, the

Equation     Dimension    Tensors

CCD-T2       2-D          two F sub-tensors

             4-D          five V sub-tensors, T, R

CCSD-T2      2-D          three F sub-tensors, T

             4-D          nine V sub-tensors, T, R

Table 6.2: Input and output tensors of CCD-T2 and CCSD-T2 equations.

principles for optimization are the same for different values of the index ranges O and V.

6.3 Methods

Figure 6.6 shows the memory model and task execution model employed in the

TCE. The memory of a process is divided into two parts: a global part and a local part. The global parts of the memories of all processes are aggregated using a PGAS model, such as Global Arrays. All tensors are created and initially distributed in the global address space. Each process maintains its own local address space as a workspace for task execution.

The tensor contraction expressions are broken down into smaller tasks. Each task involves copying input tensor tiles from the global address space, computing these tiles, and then accumulating the result back to the global address space. For


Figure 6.6: The memory model and task execution model in the TCE.

example, as illustrated in Figure 6.6, process 0 fetches two input tiles A and B, com- putes in the local address space, and then puts/accumulates the result tile C back to the global address space.

There are several opportunities of communication reduction in this task exe- cution model. First, the copying of a tile can be avoided if it is already located in the same process, such as the tile A in Figure 6.6. Second, the tiles fetched may

be accessed again in other tasks. It is beneficial to store the tiles obtained from GET

calls in the local address space for possible reuse in the future. Moreover, it may

be worthwhile to replicate some tiles that are frequently accessed in order to avoid

substantial data transfer, or tiles that are small in order to avoid communication of

very small messages.

6.3.1 Communication traffic analysis

To effectively reduce the communication cost in DLTC, first we need to under-

stand the communication traffic in the execution of tensor contraction tasks. Figure

6.7 shows the total execution time breakdown for computing CCD-T2 on 32 nodes


Figure 6.7: Total execution time breakdown of CCD-T2 on 256 cores. Problem size: 100.

(256 cores) of an Infiniband cluster. The profiling was performed using problem

size N = 100. The data are the average time across all the processes.

Approximately half of the total time was spent on computation, and roughly one

third of the total time was spent on data transfer. Figure 6.8 illustrates the further

breakdown of communication time on 2-D and 4-D tensor tiles. We observe that

more than 96% of time is spent on fetching 4-D tensor tiles. Figure 6.9 shows that

multiplication expressions dominate the communication time for data transfer.

There are a total of seven 4-D tensors in the CCD-T2 equation, as shown in Ta-

ble 6.2. We collect the communication time spent on these tensors and plot their percentage in a pie chart in Figure 6.10. Among all tensors, about half of the time

is spent on fetching tiles of T_VVOO. T_VVOO takes the most time partly because it is the

most frequently accessed 4-D tensor in CCD-T2. From the communication traffic

analysis, we observed that most of the communication time is spent on multiplica-

tion expressions and 4-D tensors.


Figure 6.8: Communication time breakdown of CCD-T2 on 256 cores for 2-D and 4-D tiles. Problem size: 100.


Figure 6.9: Communication time breakdown of CCD-T2 on 256 cores for multiplication and addition expressions. Problem size: 100.


Figure 6.10: Communication time breakdown of 4-D tiles for different tensors. The profiling is performed for CCD-T2 on 256 cores. Problem size: 100.

6.3.2 Communication optimization

We present two schemes to reduce communication cost.

Replicating 2-D tensors

Although the communication cost for 2-D tensors is small, they are great candidates for data replication. Since the memory cost of a 2-D tensor is smaller than that of a 4-D tensor by a factor of N^2, it is often possible to replicate these small tensors. For a concrete example, if the problem size N for CCSD-T2 is 100, the memory required for a 4-D tensor in the equation is approximately 763 MB (double precision floating point format); however, it is only 78 KB for a 2-D tensor. Thus, we can easily replicate the 2-D tensors at the beginning of the program, and eliminate all the communication time for transferring 2-D tensor tiles.
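The figures quoted above follow directly from the index ranges; as a quick check, assuming 8 bytes per double-precision element:

    4-D tensor:  N^4 × 8 B = 100^4 × 8 B = 8 × 10^8 B ≈ 763 MB,
    2-D tensor:  N^2 × 8 B = 100^2 × 8 B = 8 × 10^4 B ≈ 78 KB.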

Caching 4-D tensor tiles

For 4-D tensor tiles, we propose a software caching scheme to maximize the usage of the local memory space by storing the accessed tiles in a cache. Figure 6.11 illustrates the new memory model and task execution model in DLTC, with data replication and caching. In the new model, the local memory space of a process is divided into three separate parts. The first part is for storing replicated tensors; in our case, they are 2-D tensor tiles. The second part serves as the workspace for executing tasks. It can be relatively small compared to the tensors, because we only require sufficient local buffer sizes for fetching tensor tiles. The rest of the memory is used for caching — when a process needs a tile of a tensor, it first looks into its cache in the local address space. No communication is needed if the cache contains the tile. DLTC employs the Least Recently Used (LRU) cache replacement algorithm.
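A minimal sketch of such a software tile cache is given below, assuming tiles are identified by an integer ID and stored as contiguous double-precision buffers; the TileCache class and its byte-budget interface are illustrative and are not the actual DLTC implementation.

    #include <cstddef>
    #include <list>
    #include <unordered_map>
    #include <vector>

    // LRU cache of tensor tiles keyed by a tile ID, bounded by a byte budget.
    class TileCache {
      struct Entry { long key; std::vector<double> data; };
      size_t budget_, used_ = 0;
      std::list<Entry> lru_;                                  // front = most recent
      std::unordered_map<long, std::list<Entry>::iterator> map_;

     public:
      explicit TileCache(size_t budgetBytes) : budget_(budgetBytes) {}

      // Returns the cached tile, or nullptr on a miss (the caller then
      // performs a GET from the global address space).
      const std::vector<double>* lookup(long key) {
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);          // mark most recent
        return &it->second->data;
      }

      // Insert a freshly fetched tile, evicting the least recently used
      // tiles until the byte budget is respected.
      void insert(long key, std::vector<double> tile) {
        if (map_.count(key)) return;                          // already cached
        size_t bytes = tile.size() * sizeof(double);
        while (used_ + bytes > budget_ && !lru_.empty()) {
          used_ -= lru_.back().data.size() * sizeof(double);
          map_.erase(lru_.back().key);
          lru_.pop_back();
        }
        if (used_ + bytes > budget_) return;                  // tile too large
        lru_.push_front(Entry{key, std::move(tile)});
        map_[key] = lru_.begin();
        used_ += bytes;
      }
    };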


Figure 6.11: The memory model and task execution model in DLTC with data replication and caching.

6.4 Results and Discussion

6.4.1 Experimental setup

We performed all the profiling and experiments on the x86/InfiniBand cluster "RI" at The Ohio State University, with 160 dual-socket quad-core Intel Xeon E5640 @ 2.67GHz nodes and 12GB of memory per node. The compute nodes are connected with a full fat-tree QDR InfiniBand (40Gbps) network. The system runs the MVAPICH2 (version 1.9) [1] MPI implementation on Red Hat Enterprise Linux (version

5.4). The communication libraries used were ARMCI from Global Arrays (version

5.2), which is heavily optimized for InfiniBand clusters. We employ Intel Math

Kernel Library (MKL) for BLAS. In our experiments, we launched one process per node and utilized all the cores for multi-threaded matrix-matrix multiplication and tensor transposition.


Figure 6.12: Best hit rate achievable of DLTC for CCD-T2. Problem size: 100.

6.4.2 Hit rate analysis

In the first set of experiments, we assess the best hit rate achievable by the

caching scheme. In other words, we are interested in the best possible performance

gain if every process can store all the accessed tiles in its local memory.

Figure 6.12 and Figure 6.13 show the best possible hit rate percentage for CCD-

T2 and CCSD-T2 on different number of cores. The hit rate numbers are computed by only recording the IDs of the accessed tiles without actually storing them.

As expected, the best hit rate declines as the number of cores grows larger. The effectiveness of caching decreases with core count because of a smaller fraction of total data being local. The hit rate of 2-D tiles is larger than 4-D tiles, showing a better data reuse on smaller tiles.

109 100

80

2-D tiles 60 4-D tiles Best hit rate (%) 40

64 128 256 512 1024 Number of cores

Figure 6.13: Best hit rate achievable of DLTC for CCSD-T2. Problem size: 100.


Figure 6.14: Communication volume of DLTC for CCD-T2 and CCSD-T2. Problem size: 100.

6.4.3 Communication volume analysis

We are also concerned about the communication volume of each process be-

cause it can give us an idea of the cache size needed to store all the accessed tensor

tiles. To obtain the amount of data transferred, we record and accumulate the size

of tiles whenever a call to getTile is performed. Figure 6.14 shows the average com-

munication volume, computed by gathering the numbers from all processes for the

CCD-T2 and CCSD-T2 equations. From the figure, we observe that the communi-

cation volume roughly halves when the core count is doubled. The total amount of

communication remains the same. The maximum cache size can then be estimated

by computing the volume of cold misses:

Size = (1 − HitRate) × Volume.
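As a purely illustrative example with hypothetical numbers (not read off the figures): a process observing a 70% hit rate on a per-process communication volume of 4 GB would need roughly

    Size = (1 − 0.7) × 4 GB = 1.2 GB

of cache to hold all the distinct tiles it accesses.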

6.4.4 Caching performance

Figure 6.15 and Figure 6.16 depict the communication time spent for CCD-T2

and CCSD-T2 equations on problem size N = 100 as we scale the number of cores.

The enhancement by caching is most evident for smaller numbers of cores, with an improvement of up to 2.8×. The speedup declines as the number of cores grows, indicating that the effectiveness of caching decreases at larger core counts, with 1.3× remaining on 1024 cores. For CCSD-T2, the communication speedup ranges from 4.9× down to 1.4× as the core count grows. This result is expected because the rate of tile reuse decreases when a smaller volume of communication is performed.

The total execution times for both equations are illustrated in Figure 6.17 and Figure 6.18. We observe performance improvement ranging from 1.3× to 1.5× for


Figure 6.15: Communication time of DLTC for CCD-T2. Problem size: 100.


Figure 6.16: Communication time of DLTC for CCSD-T2. Problem size: 100.


Figure 6.17: Total execution time of DLTC for CCD-T2. Problem size: 100.

CCD-T2, and from 1.2× to 1.9× for CCSD-T2. Although the speedup for commu- nication is smaller for larger core count, we still obtain performance improvement because the percentage of communication is higher for larger core counts. More- over, when the communication time is reduced, the workload of a task becomes smaller, which also improves the workload balance.

6.5 Summary

In this chapter, we have presented a data locality enhancement for DLTC using data replication and caching techniques. We have profiled and analyzed the com- munication traffic, providing meaningful insights by breaking down the numbers on different types of tensors and expressions in the equations. Our experimental results show that there is considerable room for performance improvement of the prototype DLTC by communication optimization. Caching is especially effective on smaller number of cores, given that tensor tiles are more likely to be reused.


Figure 6.18: Total execution time of DLTC for CCSD-T2. Problem size: 100.

CHAPTER 7

Related Work

In this chapter, we survey earlier and recent work related to ours, both to point

out the contributions of other researchers and to place our contributions in the

proper context. We discuss the related work in several areas: matrix multiplica-

tion algorithms, optimizing Tensor Contraction Engine, and other contemporary

implementations of tensor contraction expressions.

7.1 Matrix Multiplication Algorithms

Tensor contractions are generalized forms of matrix-matrix multiplication. There

is a wealth of literature that investigates fast implementations of matrix-matrix mul-

tiplication on distributed-memory parallel systems.

Probably the most classic parallel algorithm for matrix multiplication (C = A×B)

is the Cannon’s algorithm [13] developed in 1969, based upon a checkerboard block decomposition of matrices. In Cannon’s algorithm, the processors are arranged in a 2-D grid. Matrix tiles are moved around so that every processor multiplies a pair of input tiles and contributes the result to an output tile. Tiles of matrix A are cycled horizontally, while tiles of matrix B are cycled vertically. The Cannon’s algorithm is often recognized as roll-roll-compute because of the way it executes.

In 1988, Fox's algorithm [25] was developed in a broadcast-roll-compute fashion.

The tiles of matrix A are broadcast horizontally instead of cycled; the tiles of matrix B remain cycled vertically. Both Cannon's and Fox's algorithms lack generality because their grids and matrix tiles are partitioned as squares — removing this constraint is nontrivial.

In 1997, a more flexible algorithm named Scalable Universal Matrix Multipli- cation Algorithm (SUMMA) [75] was designed to enable a better generalization of matrix-matrix multiplication. The SUMMA algorithm employs a broadcast-broadcast- compute technique to transmit a column of matrix A horizontally, and a row of ma- trix B vertically along the processor grid. Each processor computes on the received data and contributes the result to a portion of matrix C. Further optimizations such as blocking (transferring a panel of rows / columns) and pipelining (overlapping broadcasting and computation) can be employed to improve the performance of

SUMMA. It is a more general algorithm and is highly tunable to trade communication cost against memory requirement, making it practical for use in many linear algebra software libraries.
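For reference, a minimal sketch of the broadcast-broadcast-compute structure of SUMMA is shown below; it assumes a square process grid, square local blocks, and row/column communicators created so that a process's rank within its row communicator equals its column index (and vice versa). It is a simplified illustration of the algorithm in [75], not the library code used in this work.

    #include <mpi.h>
    #include <vector>

    // SUMMA for C = A * B on a gridDim x gridDim process grid.  Each rank
    // owns an n x n block of A, B, and C; blocks are broadcast step by step.
    void summa(int n, int gridDim, int myRow, int myCol,
               MPI_Comm rowComm, MPI_Comm colComm,
               const std::vector<double>& A, const std::vector<double>& B,
               std::vector<double>& C) {
      std::vector<double> Abuf(n * n), Bbuf(n * n);
      for (int k = 0; k < gridDim; ++k) {
        // Owner of the k-th block column broadcasts its A block along the row.
        if (myCol == k) Abuf = A;
        MPI_Bcast(Abuf.data(), n * n, MPI_DOUBLE, k, rowComm);
        // Owner of the k-th block row broadcasts its B block along the column.
        if (myRow == k) Bbuf = B;
        MPI_Bcast(Bbuf.data(), n * n, MPI_DOUBLE, k, colComm);
        // Local block multiply-accumulate (a DGEMM call in practice).
        for (int i = 0; i < n; ++i)
          for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int t = 0; t < n; ++t) s += Abuf[i * n + t] * Bbuf[t * n + j];
            C[i * n + j] += s;
          }
      }
    }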

Cannon’s, Fox’s, and SUMMA algorithms are considered as 2-D algorithms be- cause their processor topology is organized as a 2-D grid, and communication oc- curs among processors in the same row and column. Each processor performs

(2N)/P computation, communicates Θ((N)/S) along the row (or column), and re- quires 3(N)/P local memory to store data, where N is the dimension of the matri- ces and P is the number of processors. However, the communication cost can be reduced if there is extra memory available. 3-D algorithms [4] are developed at the

116 cost of a factor of P/ additional memory, providing a factor of P/ less data move-

ment than 2-D algorithms, by viewing the processor grid as a S×S ×S mesh (where

S = P/). Very recent work by Solomonik and Demmel presented 2.5-D algorithms

[67] that view processors as a S × S × C mesh, where C can be chosen in the range of

1 to S (if C = 1, 2.5-D becomes 2-D; if C = S, 2.5-D becomes 3-D).

Strassen's well-known algorithm computes matrix-matrix multiplication in O(N^2.807) operations and has been implemented in many studies. In

particular, Lipshitz et al. [44] focused on a communication-avoiding parallel im-

plementation of Strassen’s algorithm on distributed memory systems.

Even smaller exponents of N for multiplying matrices have been found and proved in newer algorithms, such as the famous Coppersmith-Winograd algorithm with O(N^2.376) operations. However, they are not useful in practice because they pay off only for matrices far too large to handle on present computers. The numerical stability of these algorithms is also reduced compared to conventional O(N^3) algorithms. In this dissertation, we focus on utilizing matrix-matrix multiplication kernels of O(N^3) algorithms and implementations.

7.2 Optimizing TCE

In this section, a number of prior efforts that sought to improve upon the pro-

totype TCE are presented, and categorized into the following sub-problems:

Operation minimization: Manually optimized factorizations have been published

for tensor contraction expressions such as CCSD [71], and implemented, for ex-

ample, in the PSI3 [3] software. The TCE [6, 9, 32] developed automated algo-

rithms for operation minimization of tensor contraction expressions. An opera-

tion minimization heuristic exploiting antisymmetry in tensors was implemented

by Hirata in the prototype TCE developed at Pacific Northwest National Labora-

tory (PNNL) [32]. Many parallel programs have been derived and incorporated into the NWChem quantum chemistry suite [74]. Recent studies by Engels-Putzka

and Hanrath [24, 28] have presented genetic algorithms for performing operation

minimization of tensor contraction expressions. In Section 4.4, we compared the

operation counts obtained by OpMin and demonstrated improvement over these

solutions. Operation minimization of tensor contraction expressions previously

addressed by Sibiryakov [66] and Hartono et al. [29–31] did not model symmetry

properties of tensors. Our work was built on top of their work and presented in

Chapter 4.

Memory minimization: Loop structure transformations such as loop fusion can

reduce memory requirements during the computation without increasing the num-

ber of operations. Lam et al. [41] modeled the relationship between loop fusion and

memory usage, and presented an algorithm to eliminate some dimensions of inter-

mediate tensors for reducing memory usage. Cociorva et al. [15] explored possible

space-time trade-offs to generate implementations that fit within a specified mem-

ory limit for tensor contraction expressions in the operation minimized form after

the algebraic transformation.

Space-time transformation: Allam et al. [5] developed a fast integer linear programming formulation that encodes legal rules for loop fusion through logical constraints and effective approximation of memory usage. Lam et al. [40, 42] presented efficient algorithms for finding a memory-optimal evaluation order for a given n-

node expression tree of tensor contractions.

Data locality optimization: It is important to maximize the data reuse and min-

imize the data movement between disk and main memory as well as memory and

cache. Cociorva et al. [17, 18] provided integrated approaches of appropriate loop

tiling for data locality enhancement, and loop fusion for reducing memory usage of

intermediate tensors, while keeping the total memory usage within a given limit.

Gao et al. [26] explored integrated fusion and tiling transformations to minimize

the traffic between the memory and disk. Effective strategies to prune the search

space of loop structures are also presented in [27]. These approaches focused on

selecting proper tile sizes, and merging suitable loops.

Data distribution and partitioning: Several studies have focused on effective data partitioning and distribution to reduce communication time. Cociorva et al. [14]

investigated the interactions between data distribution and memory minimization

via loop fusion, and presented strategies for communication minimization under

memory constraints. They also developed an approach to minimize inter-processor

communication for tensor contraction expressions without exceeding memory limit

[16]. Krishnamoorthy et al. [35] described a static task distribution scheme based on

the hypergraph partitioning. The relationships between tasks are modeled as a hy-

pergraph: nodes represent tasks, and edges represent common data blocks shared

by the tasks. The goal in their scheme is to find an optimal graph partitioning based

on the weights of nodes and edges with the workload balance maintained.

Computation Optimization: A classic way of implementing tensor contractions

is to reconstruct them into matrix-matrix multiplications (computation) via index

reordering (transpose). Many algorithms have been developed to improve the per-

formance of computation and transpose kernels. Krishnamoorthy et al. [34, 37, 38]

presented fast algorithms for parallel out-of-core transposition and redistribution

of the tensors from the available format provided in the TCE into another form for

computation. Furthermore, they developed efficient transformation algorithms for

the blocked layout of tensors [35] and extra supports for disk-resident arrays [36].

Lu et al. [46, 47] proposed performance models that empirically measure the cost

of matrix-matrix multiplications and index permutations. The cost models pro-

vide suggestions to employ data layout optimization, select library calls, and lay-

out transformation, leading to reduced overall execution time. Hartono et al. [30]

reduced the time for index reordering using the CCSD(T) method as an example.

Ma et al. [48] designed and implemented a fast graphical processing unit (GPU) implementation of the (T)-part of CCSD(T).

7.3 Other Implementations for Tensor Contractions

Several fundamentally different approaches to the TCE and DLTC targeted at

optimizing communication for a single contraction have been proposed. Solomonik

et al. created the first implementation of communication optimized algorithm for

distributed tensor contractions — Cyclops Tensor Framework (CTF) [70]. CTF de-

composes tensors into a regular, cyclic data representation, with the objective to

overcome load imbalance and irregular communication, particularly for torus networks. CTF focuses on improving single-contraction performance and requires redistributing tensors between two contractions.

Similarly, Rajbhandari et al. [61, 62] developed communication efficient algo- rithms and cost models for contracting distributed tensors, namely Contraction

Algorithms for Symmetric Tensors (CAST). However, the relative advantages of these approaches and DLTC depend upon the hardware architectures and the de- gree of tensor sparsity. DLTC has advantages in dynamic load balancing and extra parallelism across different expressions, while CTF and CAST exhibit organized communication pattern and topological mapping of tasks.

An Inspector-Executor approach developed by Ozog et al. [56] provides a two-step framework to improve TCE performance. In the inspection step, the irregularity of the tensor computation is discovered and stored as a list of tasks. Then, in the execution step, these tasks are distributed based on a cost estimation model.

The Inspector-Executor method improves the overall performance of the TCE by eliminating the global shared counter and providing a better initial task distribution. However, the task granularity remains the same as in the original TCE approach, and neither dynamic load balancing nor inter-contraction execution is performed. Ghosh et al. [57] conducted an analysis of the communication patterns of the NWChem TCE programs and provided communication improvements by reordering one-sided communications and by a caching scheme that shares similarities with our work. However, they do not exploit parallelism across contractions, so the reuse of tensor tiles is limited.

Another effective implementation of the CC methods is developed in ACES III

[45]. In ACES III, the CC methods are described in a domain-specific language called Super Instruction Assembly Language (SIAL) [63]. SIAL provides similar functionalities as the TCE but with a higher level of abstraction for tensor contrac- tions. The programs are then executed by the Super Instruction Processor (SIP), a parallel virtual machine that schedules the tasks and handles data transfer between tasks.

Very recent work by McCraw et al. discussed a dataflow-based programming model, Parallel Runtime Scheduling and Execution Control (ParSEC) [12], and presented an example to demonstrate its applicability to CC methods. Although their result was preliminary and limited to only one expression, it indicated the potential usefulness of dataflow-based execution of tensor contraction expressions, which can eliminate the remaining synchronization steps in our system.

CHAPTER 8

Future Work

In this chapter, we point out possible directions for improvement of the ap- proaches presented.

A primary issue for high scalability is communication. In DLTC, computations are described by a set of tasks that can be executed on any processor. Each task involves reading, computing, and writing on few tensor tiles. Tensors are stored in the global address space; communication is required if a task and the tensor tiles it operates on are not co-located. The default settings of DLTC assume that every tensor is partitioned and distributed across all processors. Similarly, computational tasks are also distributed across all processors without any constraint on how tasks are scheduled and moved. The following can be explored to further improve data locality and to reduce the communication overhead:

• Divide processors into smaller groups and partition tensors and expressions

within each group. A process first searches for victims within the same group

before searching for processes outside of the group.

• Tensor tile IDs in the cache can be used to guide task stealing so that a pro-

cessor selects victim tasks based on data locality considerations.

• Attaining high performance on modern large-scale computing platforms has become more and more challenging due to both the increasing complexity and heterogeneity of the systems. The dataflow-based programming model

can be a possible solution to eliminate all the synchronization steps between

layers of expressions introduced in DLTC.

Finally, as described in Chapter 7, another important approach to high-performance parallel tensor contraction implementations views each expression as a whole. For example, CTF [70] and the CAST algorithm [61] use SUMMA-like algorithms to organize communication and computation. The relative merits of DLTC and CAST depend upon the details of the computer architecture and the degree of tensor irregularity. It is a promising direction to merge the best from both sides. We are designing a hybrid system where the larger contractions are computed using CAST, assisted by the task scheduling and dynamic load balancing of DLTC.

CHAPTER 9

Conclusion

The work presented in this dissertation has exploited symmetry properties of tensors in the operation minimization, and has developed a task parallel execu- tion framework with dynamic task partitioning, load balancing, and data locality optimization for tensor contraction expressions.

The OpMin work provided important insights for automated algebraic trans- formation of tensor contraction expressions involving symmetric tensors. The op- timized expressions from OpMin require substantially fewer operations for a num- ber of popular CC methods, compared to other state of the art implementations.

The DLTC work provided a domain-specific framework for efficient and produc- tive implementation of tensor contraction expressions. DLTC has shown speedup over other real-world chemistry software with improved parallelism, workload bal- ance, and data locality.

Our efforts can allow chemists to focus on developing high-level theories, while implementation details can be taken care of automatically. We have made some advances towards performance optimization of tensor contraction expressions. Di- rections for further work have also been proposed.

APPENDIX A

Tensor Contraction Expressions

This appendix contains tensor contraction expressions used in the dissertation.

A.1 Input equations to OpMin

The following tensor contraction expressions are the input equations to OpMin for operation minimization. These expressions are in their raw forms before oper- ation reduction. UCCSDT-T3 is not included due to its large size.

UCCSD-T1

1 ra_vo[p1a,h1a] = 2 1.0 * fa_vo[p1a,h1a]

3 + 1.0 * fb_ov[h2b,p2b] * tab_vvoo[p1a,p2b,h1a,h2b] 4 + 1.0 * fa_ov[h2a,p2a] * taa_vvoo[p1a,p2a,h1a,h2a]

5 - 1.0 * vab_ooov[h2a,h3b,h1a,p2b] * tab_vvoo[p1a,p2b,h2a,h3b] 6 + 0.5 * vaa_oovo[h2a,h3a,p2a,h1a] * taa_vvoo[p1a,p2a,h2a,h3a]

7 + 1.0 * vab_vovv[p1a,h2b,p2a,p3b] * tab_vvoo[p2a,p3b,h1a,h2b] 8 + 0.5 * vaa_vovv[p1a,h2a,p2a,p3a] * taa_vvoo[p2a,p3a,h1a,h2a]

9 + 1.0 * fa_vv[p1a,p2a] * ta_vo[p2a,h1a] 10 - 1.0 * fa_oo[h2a,h1a] * ta_vo[p1a,h2a]

11 + 1.0 * vab_voov[p1a,h2b,h1a,p2b] * tb_vo[p2b,h2b] 12 - 1.0 * vaa_vovo[p1a,h2a,p2a,h1a] * ta_vo[p2a,h2a]

13 + 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p2b,h2b] * tab_vvoo[p1a,p3b,h1a,h3b] 14 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * tb_vo[p3b,h3b] * taa_vvoo[p1a,p2a,h1a,h2a]

15 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h2a] * tab_vvoo[p1a,p3b,h1a,h3b] 16 + 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p2a,h2a] * taa_vvoo[p1a,p3a,h1a,h3a]

17 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h1a] * tab_vvoo[p1a,p3b,h2a,h3b] 18 - 0.5 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p2a,h1a] * taa_vvoo[p1a,p3a,h2a,h3a]

19 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p1a,h2a] * tab_vvoo[p2a,p3b,h1a,h3b]

20 - 0.5 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p1a,h2a] * taa_vvoo[p2a,p3a,h1a,h3a] 21 - 1.0 * fa_ov[h2a,p2a] * ta_vo[p2a,h1a] * ta_vo[p1a,h2a]

22 + 1.0 * vab_vovv[p1a,h2b,p2a,p3b] * ta_vo[p2a,h1a] * tb_vo[p3b,h2b] 23 + 1.0 * vaa_vovv[p1a,h2a,p2a,p3a] * ta_vo[p2a,h1a] * ta_vo[p3a,h2a]

24 - 1.0 * vab_ooov[h2a,h3b,h1a,p2b] * ta_vo[p1a,h2a] * tb_vo[p2b,h3b] 25 + 1.0 * vaa_oovo[h2a,h3a,p2a,h1a] * ta_vo[p1a,h2a] * ta_vo[p2a,h3a]

26 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h1a] * ta_vo[p1a,h2a] * tb_vo[p3b,h3b] 27 - 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p2a,h1a] * ta_vo[p1a,h2a] * ta_vo[p3a,h3a];

28 29 rb_vo[p1b,h1b] =

30 1.0 * fb_vo[p1b,h1b]

31 + 1.0 * fb_ov[h2b,p2b] * tbb_vvoo[p1b,p2b,h1b,h2b] 32 + 1.0 * fa_ov[h2a,p2a] * tab_vvoo[p2a,p1b,h2a,h1b]

33 + 0.5 * vbb_oovo[h2b,h3b,p2b,h1b] * tbb_vvoo[p1b,p2b,h2b,h3b] 34 - 1.0 * vab_oovo[h2a,h3b,p2a,h1b] * tab_vvoo[p2a,p1b,h2a,h3b]

35 + 0.5 * vbb_vovv[p1b,h2b,p2b,p3b] * tbb_vvoo[p2b,p3b,h1b,h2b] 36 + 1.0 * vab_ovvv[h2a,p1b,p2a,p3b] * tab_vvoo[p2a,p3b,h2a,h1b]

37 + 1.0 * fb_vv[p1b,p2b] * tb_vo[p2b,h1b] 38 - 1.0 * fb_oo[h2b,h1b] * tb_vo[p1b,h2b]

39 - 1.0 * vbb_vovo[p1b,h2b,p2b,h1b] * tb_vo[p2b,h2b] 40 + 1.0 * vab_ovvo[h2a,p1b,p2a,h1b] * ta_vo[p2a,h2a]

41 + 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p2b,h2b] * tbb_vvoo[p1b,p3b,h1b,h3b] 42 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * tb_vo[p3b,h3b] * tab_vvoo[p2a,p1b,h2a,h1b]

43 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h2a] * tbb_vvoo[p1b,p3b,h1b,h3b] 44 + 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p2a,h2a] * tab_vvoo[p3a,p1b,h3a,h1b]

45 - 0.5 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p2b,h1b] * tbb_vvoo[p1b,p3b,h2b,h3b] 46 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * tb_vo[p3b,h1b] * tab_vvoo[p2a,p1b,h2a,h3b]

47 - 0.5 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p1b,h2b] * tbb_vvoo[p2b,p3b,h1b,h3b] 48 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * tb_vo[p1b,h3b] * tab_vvoo[p2a,p3b,h2a,h1b]

49 - 1.0 * fb_ov[h2b,p2b] * tb_vo[p2b,h1b] * tb_vo[p1b,h2b] 50 + 1.0 * vbb_vovv[p1b,h2b,p2b,p3b] * tb_vo[p2b,h1b] * tb_vo[p3b,h2b]

51 + 1.0 * vab_ovvv[h2a,p1b,p2a,p3b] * ta_vo[p2a,h2a] * tb_vo[p3b,h1b] 52 + 1.0 * vbb_oovo[h2b,h3b,p2b,h1b] * tb_vo[p1b,h2b] * tb_vo[p2b,h3b]

53 - 1.0 * vab_oovo[h2a,h3b,p2a,h1b] * ta_vo[p2a,h2a] * tb_vo[p1b,h3b] 54 - 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p2b,h1b] * tb_vo[p1b,h2b] * tb_vo[p3b,h3b]

55 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h2a] * tb_vo[p3b,h1b] * tb_vo[p1b,h3b];

UCCSD-T2

1 rab_vvoo[p1a,p2b,h1a,h2b] = 2 1.0 * vab_vvoo[p1a,p2b,h1a,h2b]

3 + 1.0 * fa_vv[p1a,p3a] * tab_vvoo[p3a,p2b,h1a,h2b] 4 + 1.0 * fb_vv[p2b,p3b] * tab_vvoo[p1a,p3b,h1a,h2b]

5 - 1.0 * fa_oo[h3a,h1a] * tab_vvoo[p1a,p2b,h3a,h2b] 6 - 1.0 * fb_oo[h3b,h2b] * tab_vvoo[p1a,p2b,h1a,h3b]

7 + 1.0 * vab_vvvv[p1a,p2b,p3a,p4b] * tab_vvoo[p3a,p4b,h1a,h2b] 8 + 1.0 * vab_voov[p1a,h3b,h1a,p3b] * tbb_vvoo[p2b,p3b,h2b,h3b]

9 - 1.0 * vaa_vovo[p1a,h3a,p3a,h1a] * tab_vvoo[p3a,p2b,h3a,h2b] 10 - 1.0 * vab_vovo[p1a,h3b,p3a,h2b] * tab_vvoo[p3a,p2b,h1a,h3b]

11 - 1.0 * vab_ovov[h3a,p2b,h1a,p3b] * tab_vvoo[p1a,p3b,h3a,h2b] 12 - 1.0 * vbb_vovo[p2b,h3b,p3b,h2b] * tab_vvoo[p1a,p3b,h1a,h3b]

13 + 1.0 * vab_ovvo[h3a,p2b,p3a,h2b] * taa_vvoo[p1a,p3a,h1a,h3a] 14 + 1.0 * vab_oooo[h3a,h4b,h1a,h2b] * tab_vvoo[p1a,p2b,h3a,h4b]

15 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p1a,p3b,h1a,h2b] * tbb_vvoo[p2b,p4b,h3b,h4b] 16 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p1a,p4b,h1a,h2b] * tab_vvoo[p3a,p2b,h3a,h4b]

17 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h1a,h2b] * tab_vvoo[p1a,p4b,h3a,h4b] 18 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * tab_vvoo[p3a,p2b,h1a,h2b] * taa_vvoo[p1a,p4a,h3a,h4a]

19 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p1a,p2b,h1a,h3b] * tbb_vvoo[p3b,p4b,h2b,h4b] 20 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p1a,p2b,h1a,h4b] * tab_vvoo[p3a,p4b,h3a,h2b]

21 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h1a,h4b] * tab_vvoo[p1a,p2b,h3a,h2b] 22 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p3a,p4a,h1a,h3a] * tab_vvoo[p1a,p2b,h4a,h2b]

23 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h1a,h2b] * tab_vvoo[p1a,p2b,h3a,h4b]

24 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p1a,p3b,h1a,h3b] * tbb_vvoo[p2b,p4b,h2b,h4b] 25 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p1a,p4b,h1a,h4b] * tab_vvoo[p3a,p2b,h3a,h2b]

26 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p1a,p3a,h1a,h3a] * tbb_vvoo[p2b,p4b,h2b,h4b] 27 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p1a,p3a,h1a,h3a] * tab_vvoo[p4a,p2b,h4a,h2b]

28 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h1a,h4b] * tab_vvoo[p1a,p4b,h3a,h2b] 29 + 1.0 * vab_vvov[p1a,p2b,h1a,p3b] * tb_vo[p3b,h2b]

30 + 1.0 * vab_vvvo[p1a,p2b,p3a,h2b] * ta_vo[p3a,h1a] 31 - 1.0 * vab_vooo[p1a,h3b,h1a,h2b] * tb_vo[p2b,h3b]

32 - 1.0 * vab_ovoo[h3a,p2b,h1a,h2b] * ta_vo[p1a,h3a] 33 - 1.0 * fa_ov[h3a,p3a] * ta_vo[p1a,h3a] * tab_vvoo[p3a,p2b,h1a,h2b]

34 - 1.0 * fb_ov[h3b,p3b] * tb_vo[p2b,h3b] * tab_vvoo[p1a,p3b,h1a,h2b] 35 - 1.0 * fa_ov[h3a,p3a] * ta_vo[p3a,h1a] * tab_vvoo[p1a,p2b,h3a,h2b]

36 - 1.0 * fb_ov[h3b,p3b] * tb_vo[p3b,h2b] * tab_vvoo[p1a,p2b,h1a,h3b] 37 + 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * tb_vo[p4b,h3b] * tab_vvoo[p3a,p2b,h1a,h2b]

38 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h3a] * tab_vvoo[p4a,p2b,h1a,h2b] 39 - 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h3b] * tab_vvoo[p1a,p4b,h1a,h2b]

40 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p3a,h3a] * tab_vvoo[p1a,p4b,h1a,h2b] 41 - 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * tb_vo[p3b,h4b] * tab_vvoo[p1a,p2b,h3a,h2b]

42 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p3a,h3a] * tab_vvoo[p1a,p2b,h4a,h2b] 43 - 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p3b,h3b] * tab_vvoo[p1a,p2b,h1a,h4b]

44 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p3a,h3a] * tab_vvoo[p1a,p2b,h1a,h4b] 45 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * tb_vo[p2b,h3b] * tab_vvoo[p3a,p4b,h1a,h2b]

46 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p1a,h3a] * tab_vvoo[p3a,p4b,h1a,h2b] 47 + 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * ta_vo[p3a,h1a] * tbb_vvoo[p2b,p4b,h2b,h3b]

48 + 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * tab_vvoo[p4a,p2b,h3a,h2b] 49 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * tb_vo[p4b,h2b] * tab_vvoo[p3a,p2b,h1a,h3b]

50 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p3a,h1a] * tab_vvoo[p1a,p4b,h3a,h2b]

51 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h2b] * tab_vvoo[p1a,p4b,h1a,h3b] 52 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * tb_vo[p4b,h2b] * taa_vvoo[p1a,p3a,h1a,h3a]

53 - 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * ta_vo[p1a,h3a] * tbb_vvoo[p2b,p3b,h2b,h4b] 54 + 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p1a,h3a] * tab_vvoo[p3a,p2b,h4a,h2b]

55 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p1a,h3a] * tab_vvoo[p3a,p2b,h1a,h4b] 56 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * tb_vo[p2b,h4b] * tab_vvoo[p1a,p3b,h3a,h2b]

57 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p2b,h3b] * tab_vvoo[p1a,p3b,h1a,h4b] 58 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * tb_vo[p2b,h4b] * taa_vvoo[p1a,p3a,h1a,h3a]

59 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * tb_vo[p3b,h2b] * tab_vvoo[p1a,p2b,h3a,h4b] 60 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p3a,h1a] * tab_vvoo[p1a,p2b,h3a,h4b]

61 + 1.0 * vab_vvvv[p1a,p2b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p4b,h2b]

62 - 1.0 * vab_voov[p1a,h3b,h1a,p3b] * tb_vo[p3b,h2b] * tb_vo[p2b,h3b] 63 - 1.0 * vab_vovo[p1a,h3b,p3a,h2b] * ta_vo[p3a,h1a] * tb_vo[p2b,h3b]

64 - 1.0 * vab_ovov[h3a,p2b,h1a,p3b] * ta_vo[p1a,h3a] * tb_vo[p3b,h2b] 65 - 1.0 * vab_ovvo[h3a,p2b,p3a,h2b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a]

66 + 1.0 * vab_oooo[h3a,h4b,h1a,h2b] * ta_vo[p1a,h3a] * tb_vo[p2b,h4b] 67 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * tb_vo[p4b,h4b] * tab_vvoo[p3a,p2b,h1a,h2b]

68 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p1a,h3a] * ta_vo[p3a,h4a] * tab_vvoo[p4a,p2b,h1a,h2b] 69 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p2b,h3b] * tb_vo[p3b,h4b] * tab_vvoo[p1a,p4b,h1a,h2b]

70 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p2b,h4b] * tab_vvoo[p1a,p4b,h1a,h2b] 71 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p4b,h4b] * tab_vvoo[p1a,p2b,h3a,h2b]

72 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h3a] * tab_vvoo[p1a,p2b,h4a,h2b] 73 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p4b,h3b] * tab_vvoo[p1a,p2b,h1a,h4b]

74 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p4b,h2b] * tab_vvoo[p1a,p2b,h1a,h4b] 75 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * tb_vo[p2b,h4b] * tab_vvoo[p3a,p4b,h1a,h2b]

76 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * tbb_vvoo[p2b,p4b,h2b,h4b] 77 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * tab_vvoo[p4a,p2b,h4a,h2b]

78 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * tb_vo[p4b,h2b] * tab_vvoo[p3a,p2b,h1a,h4b] 79 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p2b,h4b] * tab_vvoo[p1a,p4b,h3a,h2b]

80 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p2b,h3b] * tab_vvoo[p1a,p4b,h1a,h4b] 81 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h2b] * tb_vo[p2b,h4b] * taa_vvoo[p1a,p3a,h1a,h3a]

82 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p4b,h2b] * tab_vvoo[p1a,p2b,h3a,h4b] 83 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p4b,h2b] * tb_vo[p2b,h3b]

84 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * tb_vo[p4b,h2b] 85 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * ta_vo[p1a,h3a] * tb_vo[p3b,h2b] * tb_vo[p2b,h4b]

86 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * tb_vo[p2b,h4b] 87 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * tb_vo[p4b,h2b] * tb_vo[p2b,h4b];

89 raa_vvoo[p1a,p2a,h1a,h2a] =

90 1.0 * vaa_vvoo[p1a,p2a,h1a,h2a]

91 - 1.0 * fa_vv[p1a,p3a] * taa_vvoo[p2a,p3a,h1a,h2a] 92 + 1.0 * fa_vv[p2a,p3a] * taa_vvoo[p1a,p3a,h1a,h2a]

93 + 1.0 * fa_oo[h3a,h1a] * taa_vvoo[p1a,p2a,h2a,h3a] 94 - 1.0 * fa_oo[h3a,h2a] * taa_vvoo[p1a,p2a,h1a,h3a]

95 + 0.5 * vaa_vvvv[p1a,p2a,p3a,p4a] * taa_vvoo[p3a,p4a,h1a,h2a] 96 + 1.0 * vab_voov[p1a,h3b,h1a,p3b] * tab_vvoo[p2a,p3b,h2a,h3b]

97 - 1.0 * vaa_vovo[p1a,h3a,p3a,h1a] * taa_vvoo[p2a,p3a,h2a,h3a] 98 - 1.0 * vab_voov[p1a,h3b,h2a,p3b] * tab_vvoo[p2a,p3b,h1a,h3b]

99 + 1.0 * vaa_vovo[p1a,h3a,p3a,h2a] * taa_vvoo[p2a,p3a,h1a,h3a] 100 - 1.0 * vab_voov[p2a,h3b,h1a,p3b] * tab_vvoo[p1a,p3b,h2a,h3b]

101 + 1.0 * vaa_vovo[p2a,h3a,p3a,h1a] * taa_vvoo[p1a,p3a,h2a,h3a]

102 + 1.0 * vab_voov[p2a,h3b,h2a,p3b] * tab_vvoo[p1a,p3b,h1a,h3b] 103 - 1.0 * vaa_vovo[p2a,h3a,p3a,h2a] * taa_vvoo[p1a,p3a,h1a,h3a]

104 + 0.5 * vaa_oooo[h3a,h4a,h1a,h2a] * taa_vvoo[p1a,p2a,h3a,h4a] 105 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p1a,p3a,h1a,h2a] * tab_vvoo[p2a,p4b,h3a,h4b]

106 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p1a,p3a,h1a,h2a] * taa_vvoo[p2a,p4a,h3a,h4a] 107 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p2a,p3a,h1a,h2a] * tab_vvoo[p1a,p4b,h3a,h4b]

108 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p2a,p3a,h1a,h2a] * taa_vvoo[p1a,p4a,h3a,h4a] 109 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p1a,p2a,h1a,h3a] * tab_vvoo[p3a,p4b,h2a,h4b]

110 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p1a,p2a,h1a,h3a] * taa_vvoo[p3a,p4a,h2a,h4a] 111 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h1a,h4b] * taa_vvoo[p1a,p2a,h2a,h3a]

112 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p3a,p4a,h1a,h3a] * taa_vvoo[p1a,p2a,h2a,h4a]

113 + 0.25 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p3a,p4a,h1a,h2a] * taa_vvoo[p1a,p2a,h3a,h4a] 114 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p1a,p3b,h1a,h3b] * tab_vvoo[p2a,p4b,h2a,h4b]

115 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p1a,p4b,h1a,h4b] * taa_vvoo[p2a,p3a,h2a,h3a] 116 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p1a,p3a,h1a,h3a] * tab_vvoo[p2a,p4b,h2a,h4b]

117 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p1a,p3a,h1a,h3a] * taa_vvoo[p2a,p4a,h2a,h4a] 118 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p2a,p3b,h1a,h3b] * tab_vvoo[p1a,p4b,h2a,h4b]

119 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p2a,p4b,h1a,h4b] * taa_vvoo[p1a,p3a,h2a,h3a] 120 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p2a,p3a,h1a,h3a] * tab_vvoo[p1a,p4b,h2a,h4b]

121 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p2a,p3a,h1a,h3a] * taa_vvoo[p1a,p4a,h2a,h4a] 122 - 1.0 * vaa_vvvo[p1a,p2a,p3a,h1a] * ta_vo[p3a,h2a]

123 + 1.0 * vaa_vvvo[p1a,p2a,p3a,h2a] * ta_vo[p3a,h1a] 124 - 1.0 * vaa_vooo[p1a,h3a,h1a,h2a] * ta_vo[p2a,h3a]

125 + 1.0 * vaa_vooo[p2a,h3a,h1a,h2a] * ta_vo[p1a,h3a] 126 + 1.0 * fa_ov[h3a,p3a] * ta_vo[p1a,h3a] * taa_vvoo[p2a,p3a,h1a,h2a]

127 - 1.0 * fa_ov[h3a,p3a] * ta_vo[p2a,h3a] * taa_vvoo[p1a,p3a,h1a,h2a] 128 + 1.0 * fa_ov[h3a,p3a] * ta_vo[p3a,h1a] * taa_vvoo[p1a,p2a,h2a,h3a]

129 - 1.0 * fa_ov[h3a,p3a] * ta_vo[p3a,h2a] * taa_vvoo[p1a,p2a,h1a,h3a] 130 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * tb_vo[p4b,h3b] * taa_vvoo[p2a,p3a,h1a,h2a]

131 + 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h3a] * taa_vvoo[p2a,p4a,h1a,h2a] 132 + 1.0 * vab_vovv[p2a,h3b,p3a,p4b] * tb_vo[p4b,h3b] * taa_vvoo[p1a,p3a,h1a,h2a]

133 - 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p3a,h3a] * taa_vvoo[p1a,p4a,h1a,h2a] 134 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * tb_vo[p3b,h4b] * taa_vvoo[p1a,p2a,h2a,h3a]

135 + 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p3a,h3a] * taa_vvoo[p1a,p2a,h2a,h4a] 136 - 1.0 * vab_ooov[h3a,h4b,h2a,p3b] * tb_vo[p3b,h4b] * taa_vvoo[p1a,p2a,h1a,h3a]

137 - 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p3a,h3a] * taa_vvoo[p1a,p2a,h1a,h4a] 138 - 0.5 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p2a,h3a] * taa_vvoo[p3a,p4a,h1a,h2a]

139 + 0.5 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p1a,h3a] * taa_vvoo[p3a,p4a,h1a,h2a] 140 + 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * ta_vo[p3a,h1a] * tab_vvoo[p2a,p4b,h2a,h3b]

141 + 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * taa_vvoo[p2a,p4a,h2a,h3a]

142 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * ta_vo[p3a,h2a] * tab_vvoo[p2a,p4b,h1a,h3b] 143 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h2a] * taa_vvoo[p2a,p4a,h1a,h3a]

144 - 1.0 * vab_vovv[p2a,h3b,p3a,p4b] * ta_vo[p3a,h1a] * tab_vvoo[p1a,p4b,h2a,h3b] 145 - 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * taa_vvoo[p1a,p4a,h2a,h3a]

146 + 1.0 * vab_vovv[p2a,h3b,p3a,p4b] * ta_vo[p3a,h2a] * tab_vvoo[p1a,p4b,h1a,h3b] 147 + 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p3a,h2a] * taa_vvoo[p1a,p4a,h1a,h3a]

148 - 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * ta_vo[p1a,h3a] * tab_vvoo[p2a,p3b,h2a,h4b] 149 + 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p1a,h3a] * taa_vvoo[p2a,p3a,h2a,h4a]

150 + 1.0 * vab_ooov[h3a,h4b,h2a,p3b] * ta_vo[p1a,h3a] * tab_vvoo[p2a,p3b,h1a,h4b] 151 - 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p1a,h3a] * taa_vvoo[p2a,p3a,h1a,h4a]

152 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * ta_vo[p2a,h3a] * tab_vvoo[p1a,p3b,h2a,h4b]

153 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p2a,h3a] * taa_vvoo[p1a,p3a,h2a,h4a] 154 - 1.0 * vab_ooov[h3a,h4b,h2a,p3b] * ta_vo[p2a,h3a] * tab_vvoo[p1a,p3b,h1a,h4b]

155 + 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p2a,h3a] * taa_vvoo[p1a,p3a,h1a,h4a] 156 - 0.5 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p3a,h2a] * taa_vvoo[p1a,p2a,h3a,h4a]

157 + 0.5 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p3a,h1a] * taa_vvoo[p1a,p2a,h3a,h4a] 158 + 1.0 * vaa_vvvv[p1a,p2a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a]

159 + 1.0 * vaa_vovo[p1a,h3a,p3a,h1a] * ta_vo[p3a,h2a] * ta_vo[p2a,h3a] 160 - 1.0 * vaa_vovo[p1a,h3a,p3a,h2a] * ta_vo[p3a,h1a] * ta_vo[p2a,h3a]

161 - 1.0 * vaa_vovo[p2a,h3a,p3a,h1a] * ta_vo[p3a,h2a] * ta_vo[p1a,h3a] 162 + 1.0 * vaa_vovo[p2a,h3a,p3a,h2a] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a]

163 + 1.0 * vaa_oooo[h3a,h4a,h1a,h2a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a]

164 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * tb_vo[p4b,h4b] * taa_vvoo[p2a,p3a,h1a,h2a] 165 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p1a,h3a] * ta_vo[p3a,h4a] * taa_vvoo[p2a,p4a,h1a,h2a]

166 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p2a,h3a] * tb_vo[p4b,h4b] * taa_vvoo[p1a,p3a,h1a,h2a] 167 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p2a,h3a] * ta_vo[p3a,h4a] * taa_vvoo[p1a,p4a,h1a,h2a]

168 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p4b,h4b] * taa_vvoo[p1a,p2a,h2a,h3a] 169 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h3a] * taa_vvoo[p1a,p2a,h2a,h4a]

170 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h2a] * tb_vo[p4b,h4b] * taa_vvoo[p1a,p2a,h1a,h3a] 171 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h2a] * ta_vo[p4a,h3a] * taa_vvoo[p1a,p2a,h1a,h4a]

172 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a] * taa_vvoo[p3a,p4a,h1a,h2a] 173 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * tab_vvoo[p2a,p4b,h2a,h4b]

174 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * taa_vvoo[p2a,p4a,h2a,h4a] 175 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h2a] * ta_vo[p1a,h3a] * tab_vvoo[p2a,p4b,h1a,h4b]

176 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h2a] * ta_vo[p1a,h3a] * taa_vvoo[p2a,p4a,h1a,h4a] 177 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p2a,h3a] * tab_vvoo[p1a,p4b,h2a,h4b]

178 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p2a,h3a] * taa_vvoo[p1a,p4a,h2a,h4a] 179 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h2a] * ta_vo[p2a,h3a] * tab_vvoo[p1a,p4b,h1a,h4b]

180 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h2a] * ta_vo[p2a,h3a] * taa_vvoo[p1a,p4a,h1a,h4a] 181 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a] * taa_vvoo[p1a,p2a,h3a,h4a]

182 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a] * ta_vo[p2a,h3a] 183 + 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a] * ta_vo[p1a,h3a]

184 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p3a,h2a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a] 185 + 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a]

186 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a];

188 rbb_vvoo[p1b,p2b,h1b,h2b] = 189 1.0 * vbb_vvoo[p1b,p2b,h1b,h2b]

190 - 1.0 * fb_vv[p1b,p3b] * tbb_vvoo[p2b,p3b,h1b,h2b] 191 + 1.0 * fb_vv[p2b,p3b] * tbb_vvoo[p1b,p3b,h1b,h2b]

192 + 1.0 * fb_oo[h3b,h1b] * tbb_vvoo[p1b,p2b,h2b,h3b]

193 - 1.0 * fb_oo[h3b,h2b] * tbb_vvoo[p1b,p2b,h1b,h3b] 194 + 0.5 * vbb_vvvv[p1b,p2b,p3b,p4b] * tbb_vvoo[p3b,p4b,h1b,h2b]

195 - 1.0 * vbb_vovo[p1b,h3b,p3b,h1b] * tbb_vvoo[p2b,p3b,h2b,h3b] 196 + 1.0 * vab_ovvo[h3a,p1b,p3a,h1b] * tab_vvoo[p3a,p2b,h3a,h2b]

197 + 1.0 * vbb_vovo[p1b,h3b,p3b,h2b] * tbb_vvoo[p2b,p3b,h1b,h3b] 198 - 1.0 * vab_ovvo[h3a,p1b,p3a,h2b] * tab_vvoo[p3a,p2b,h3a,h1b]

199 + 1.0 * vbb_vovo[p2b,h3b,p3b,h1b] * tbb_vvoo[p1b,p3b,h2b,h3b] 200 - 1.0 * vab_ovvo[h3a,p2b,p3a,h1b] * tab_vvoo[p3a,p1b,h3a,h2b]

201 - 1.0 * vbb_vovo[p2b,h3b,p3b,h2b] * tbb_vvoo[p1b,p3b,h1b,h3b] 202 + 1.0 * vab_ovvo[h3a,p2b,p3a,h2b] * tab_vvoo[p3a,p1b,h3a,h1b]

203 + 0.5 * vbb_oooo[h3b,h4b,h1b,h2b] * tbb_vvoo[p1b,p2b,h3b,h4b]

204 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p1b,p3b,h1b,h2b] * tbb_vvoo[p2b,p4b,h3b,h4b] 205 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h3a,h4b] * tbb_vvoo[p1b,p4b,h1b,h2b]

206 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p2b,p3b,h1b,h2b] * tbb_vvoo[p1b,p4b,h3b,h4b] 207 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p1b,h3a,h4b] * tbb_vvoo[p2b,p4b,h1b,h2b]

208 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p1b,p2b,h1b,h3b] * tbb_vvoo[p3b,p4b,h2b,h4b] 209 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h3a,h2b] * tbb_vvoo[p1b,p2b,h1b,h4b]

210 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p3b,p4b,h1b,h3b] * tbb_vvoo[p1b,p2b,h2b,h4b] 211 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h3a,h1b] * tbb_vvoo[p1b,p2b,h2b,h4b]

212 + 0.25 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p3b,p4b,h1b,h2b] * tbb_vvoo[p1b,p2b,h3b,h4b] 213 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p1b,p3b,h1b,h3b] * tbb_vvoo[p2b,p4b,h2b,h4b]

214 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h3a,h2b] * tbb_vvoo[p1b,p4b,h1b,h4b]

215 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p1b,h3a,h1b] * tbb_vvoo[p2b,p4b,h2b,h4b] 216 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * tab_vvoo[p3a,p1b,h3a,h1b] * tab_vvoo[p4a,p2b,h4a,h2b]

217 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p2b,p3b,h1b,h3b] * tbb_vvoo[p1b,p4b,h2b,h4b] 218 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p1b,h3a,h2b] * tbb_vvoo[p2b,p4b,h1b,h4b]

219 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h3a,h1b] * tbb_vvoo[p1b,p4b,h2b,h4b] 220 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * tab_vvoo[p3a,p2b,h3a,h1b] * tab_vvoo[p4a,p1b,h4a,h2b]

221 - 1.0 * vbb_vvvo[p1b,p2b,p3b,h1b] * tb_vo[p3b,h2b] 222 + 1.0 * vbb_vvvo[p1b,p2b,p3b,h2b] * tb_vo[p3b,h1b]

223 - 1.0 * vbb_vooo[p1b,h3b,h1b,h2b] * tb_vo[p2b,h3b] 224 + 1.0 * vbb_vooo[p2b,h3b,h1b,h2b] * tb_vo[p1b,h3b]

225 + 1.0 * fb_ov[h3b,p3b] * tb_vo[p1b,h3b] * tbb_vvoo[p2b,p3b,h1b,h2b] 226 - 1.0 * fb_ov[h3b,p3b] * tb_vo[p2b,h3b] * tbb_vvoo[p1b,p3b,h1b,h2b]

227 + 1.0 * fb_ov[h3b,p3b] * tb_vo[p3b,h1b] * tbb_vvoo[p1b,p2b,h2b,h3b] 228 - 1.0 * fb_ov[h3b,p3b] * tb_vo[p3b,h2b] * tbb_vvoo[p1b,p2b,h1b,h3b]

229 + 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p3b,h3b] * tbb_vvoo[p2b,p4b,h1b,h2b] 230 - 1.0 * vab_ovvv[h3a,p1b,p3a,p4b] * ta_vo[p3a,h3a] * tbb_vvoo[p2b,p4b,h1b,h2b]

231 - 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h3b] * tbb_vvoo[p1b,p4b,h1b,h2b] 232 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p3a,h3a] * tbb_vvoo[p1b,p4b,h1b,h2b]

233 + 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p3b,h3b] * tbb_vvoo[p1b,p2b,h2b,h4b] 234 + 1.0 * vab_oovo[h3a,h4b,p3a,h1b] * ta_vo[p3a,h3a] * tbb_vvoo[p1b,p2b,h2b,h4b]

235 - 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p3b,h3b] * tbb_vvoo[p1b,p2b,h1b,h4b] 236 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p3a,h3a] * tbb_vvoo[p1b,p2b,h1b,h4b]

237 - 0.5 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p2b,h3b] * tbb_vvoo[p3b,p4b,h1b,h2b] 238 + 0.5 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p1b,h3b] * tbb_vvoo[p3b,p4b,h1b,h2b]

239 + 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * tbb_vvoo[p2b,p4b,h2b,h3b] 240 + 1.0 * vab_ovvv[h3a,p1b,p3a,p4b] * tb_vo[p4b,h1b] * tab_vvoo[p3a,p2b,h3a,h2b]

241 - 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p3b,h2b] * tbb_vvoo[p2b,p4b,h1b,h3b] 242 - 1.0 * vab_ovvv[h3a,p1b,p3a,p4b] * tb_vo[p4b,h2b] * tab_vvoo[p3a,p2b,h3a,h1b]

243 - 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * tbb_vvoo[p1b,p4b,h2b,h3b]

244 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * tb_vo[p4b,h1b] * tab_vvoo[p3a,p1b,h3a,h2b] 245 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h2b] * tbb_vvoo[p1b,p4b,h1b,h3b]

246 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * tb_vo[p4b,h2b] * tab_vvoo[p3a,p1b,h3a,h1b] 247 + 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p1b,h3b] * tbb_vvoo[p2b,p3b,h2b,h4b]

248 - 1.0 * vab_oovo[h3a,h4b,p3a,h1b] * tb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h2b] 249 - 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p1b,h3b] * tbb_vvoo[p2b,p3b,h1b,h4b]

250 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * tb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h1b] 251 - 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p2b,h3b] * tbb_vvoo[p1b,p3b,h2b,h4b]

252 + 1.0 * vab_oovo[h3a,h4b,p3a,h1b] * tb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h2b] 253 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p2b,h3b] * tbb_vvoo[p1b,p3b,h1b,h4b]

254 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * tb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h1b]

255 - 0.5 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p3b,h2b] * tbb_vvoo[p1b,p2b,h3b,h4b] 256 + 0.5 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p3b,h1b] * tbb_vvoo[p1b,p2b,h3b,h4b]

257 + 1.0 * vbb_vvvv[p1b,p2b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] 258 + 1.0 * vbb_vovo[p1b,h3b,p3b,h1b] * tb_vo[p3b,h2b] * tb_vo[p2b,h3b]

259 - 1.0 * vbb_vovo[p1b,h3b,p3b,h2b] * tb_vo[p3b,h1b] * tb_vo[p2b,h3b] 260 - 1.0 * vbb_vovo[p2b,h3b,p3b,h1b] * tb_vo[p3b,h2b] * tb_vo[p1b,h3b]

261 + 1.0 * vbb_vovo[p2b,h3b,p3b,h2b] * tb_vo[p3b,h1b] * tb_vo[p1b,h3b] 262 + 1.0 * vbb_oooo[h3b,h4b,h1b,h2b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b]

263 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p1b,h3b] * tb_vo[p3b,h4b] * tbb_vvoo[p2b,p4b,h1b,h2b] 264 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p1b,h4b] * tbb_vvoo[p2b,p4b,h1b,h2b]

265 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p2b,h3b] * tb_vo[p3b,h4b] * tbb_vvoo[p1b,p4b,h1b,h2b]

266 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p2b,h4b] * tbb_vvoo[p1b,p4b,h1b,h2b] 267 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h3b] * tbb_vvoo[p1b,p2b,h2b,h4b]

268 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p4b,h1b] * tbb_vvoo[p1b,p2b,h2b,h4b] 269 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p4b,h3b] * tbb_vvoo[p1b,p2b,h1b,h4b]

270 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p4b,h2b] * tbb_vvoo[p1b,p2b,h1b,h4b] 271 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b] * tbb_vvoo[p3b,p4b,h1b,h2b]

272 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p1b,h3b] * tbb_vvoo[p2b,p4b,h2b,h4b] 273 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h1b] * tb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h2b]

274 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p1b,h3b] * tbb_vvoo[p2b,p4b,h1b,h4b] 275 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h2b] * tb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h1b]

276 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p2b,h3b] * tbb_vvoo[p1b,p4b,h2b,h4b] 277 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h1b] * tb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h2b]

278 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p2b,h3b] * tbb_vvoo[p1b,p4b,h1b,h4b] 279 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h2b] * tb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h1b]

280 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] * tbb_vvoo[p1b,p2b,h3b,h4b] 281 - 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] * tb_vo[p2b,h3b]

282 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] * tb_vo[p1b,h3b] 283 - 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p3b,h2b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b]

284 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p3b,h1b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b] 285 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b];

RCCSD-T1

1 r_vo[p1, h1] = 2 1.0 * f_vo[p1,h1]

3 + 1.0 * f_vv[p1,p2] * t_vo[p2,h1] 4 - 1.0 * f_oo[h2,h1] * t_vo[p1,h2]

5 + 2.0 * f_ov[h2,p2] * t_vvoo[p1,p2,h1,h2] 6 - 1.0 * f_ov[h2,p2] * t_vvoo[p2,p1,h1,h2]

7 - 1.0 * v_vovv[p1,h2,p2,p3] * t_vvoo[p3,p2,h1,h2] 8 + 2.0 * v_vovv[p1,h2,p2,p3] * t_vvoo[p2,p3,h1,h2]

9 - 1.0 * v_vovo[p1,h2,p2,h1] * t_vo[p2,h2] 10 + 2.0 * v_ovvo[h2,p1,p2,h1] * t_vo[p2,h2]

11 + 1.0 * v_oovo[h2,h3,p2,h1] * t_vvoo[p1,p2,h2,h3] 12 - 2.0 * v_oovo[h2,h3,p2,h1] * t_vvoo[p1,p2,h3,h2]

13 - 1.0 * f_ov[h2,p2] * t_vo[p2,h1] * t_vo[p1,h2] 14 + 2.0 * v_vovv[p1,h2,p2,p3] * t_vo[p2,h1] * t_vo[p3,h2]

15 - 1.0 * v_vovv[p1,h2,p2,p3] * t_vo[p3,h1] * t_vo[p2,h2] 16 + 1.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * t_vvoo[p1,p3,h3,h2]

17 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * t_vvoo[p1,p3,h2,h3] 18 + 1.0 * v_oovv[h2,h3,p2,p3] * t_vo[p1,h2] * t_vvoo[p3,p2,h1,h3]

19 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p1,h2] * t_vvoo[p2,p3,h1,h3] 20 + 4.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h2] * t_vvoo[p1,p3,h1,h3]

21 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h2] * t_vvoo[p3,p1,h1,h3] 22 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p3,h2] * t_vvoo[p1,p2,h1,h3]

23 + 1.0 * v_oovv[h2,h3,p2,p3] * t_vo[p3,h2] * t_vvoo[p2,p1,h1,h3]

24 - 2.0 * v_oovo[h2,h3,p2,h1] * t_vo[p2,h2] * t_vo[p1,h3] 25 + 1.0 * v_oovo[h2,h3,p2,h1] * t_vo[p1,h2] * t_vo[p2,h3]

26 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * t_vo[p1,h2] * t_vo[p3,h3] 27 + 1.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * t_vo[p3,h2] * t_vo[p1,h3];

RCCSD-T2

1 r_vvoo[p1,p2,h1,h2] = 2 1.0 * v_vvoo[p1,p2,h1,h2]

3 + 1.0 * f_vv[p1,p3] * t_vvoo[p3,p2,h1,h2] 4 + 1.0 * f_vv[p2,p3] * t_vvoo[p1,p3,h1,h2]

5 - 1.0 * f_oo[h3,h1] * t_vvoo[p2,p1,h2,h3] 6 - 1.0 * f_oo[h3,h2] * t_vvoo[p1,p2,h1,h3]

7 + 1.0 * v_vvvv[p1,p2,p3,p4] * t_vvoo[p3,p4,h1,h2] 8 + 1.0 * v_vvvo[p1,p2,p3,h2] * t_vo[p3,h1]

9 + 1.0 * v_vvvo[p2,p1,p3,h1] * t_vo[p3,h2] 10 - 1.0 * v_vovo[p1,h3,p3,h1] * t_vvoo[p2,p3,h2,h3]

11 - 1.0 * v_vovo[p2,h3,p3,h2] * t_vvoo[p1,p3,h1,h3] 12 - 1.0 * v_vovo[p1,h3,p3,h2] * t_vvoo[p3,p2,h1,h3]

13 - 1.0 * v_vovo[p2,h3,p3,h1] * t_vvoo[p3,p1,h2,h3] 14 + 2.0 * v_ovvo[h3,p1,p3,h1] * t_vvoo[p2,p3,h2,h3]

15 + 2.0 * v_ovvo[h3,p2,p3,h2] * t_vvoo[p1,p3,h1,h3] 16 - 1.0 * v_ovvo[h3,p1,p3,h1] * t_vvoo[p3,p2,h2,h3]

17 - 1.0 * v_ovvo[h3,p2,p3,h2] * t_vvoo[p3,p1,h1,h3] 18 - 1.0 * v_ovoo[h3,p2,h1,h2] * t_vo[p1,h3]

19 - 1.0 * v_vooo[p1,h3,h1,h2] * t_vo[p2,h3] 20 + 1.0 * v_oooo[h3,h4,h1,h2] * t_vvoo[p1,p2,h3,h4]

21 - 1.0 * f_ov[h3,p3] * t_vo[p3,h1] * t_vvoo[p2,p1,h2,h3] 22 - 1.0 * f_ov[h3,p3] * t_vo[p3,h2] * t_vvoo[p1,p2,h1,h3]

23 - 1.0 * f_ov[h3,p3] * t_vo[p1,h3] * t_vvoo[p3,p2,h1,h2]

24 - 1.0 * f_ov[h3,p3] * t_vo[p2,h3] * t_vvoo[p1,p3,h1,h2] 25 + 1.0 * v_vvvv[p1,p2,p3,p4] * t_vo[p3,h1] * t_vo[p4,h2]

26 + 2.0 * v_vovv[p1,h3,p3,p4] * t_vo[p3,h1] * t_vvoo[p2,p4,h2,h3] 27 + 2.0 * v_vovv[p2,h3,p3,p4] * t_vo[p3,h2] * t_vvoo[p1,p4,h1,h3]

28 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p3,h1] * t_vvoo[p4,p2,h2,h3] 29 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p3,h2] * t_vvoo[p4,p1,h1,h3]

30 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p4,h1] * t_vvoo[p2,p3,h2,h3] 31 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p4,h2] * t_vvoo[p1,p3,h1,h3]

32 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p4,h2] * t_vvoo[p3,p2,h1,h3] 33 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p4,h1] * t_vvoo[p3,p1,h2,h3]

34 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p2,h3] * t_vvoo[p3,p4,h1,h2] 35 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p1,h3] * t_vvoo[p4,p3,h1,h2]

36 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p3,h3] * t_vvoo[p4,p2,h1,h2] 37 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p3,h3] * t_vvoo[p1,p4,h1,h2]

38 + 2.0 * v_vovv[p1,h3,p3,p4] * t_vo[p4,h3] * t_vvoo[p3,p2,h1,h2] 39 + 2.0 * v_vovv[p2,h3,p3,p4] * t_vo[p4,h3] * t_vvoo[p1,p3,h1,h2]

40 - 1.0 * v_vovo[p1,h3,p3,h2] * t_vo[p3,h1] * t_vo[p2,h3] 41 - 1.0 * v_vovo[p2,h3,p3,h1] * t_vo[p3,h2] * t_vo[p1,h3]

42 - 1.0 * v_ovvo[h3,p1,p3,h1] * t_vo[p3,h2] * t_vo[p2,h3] 43 - 1.0 * v_ovvo[h3,p2,p3,h2] * t_vo[p3,h1] * t_vo[p1,h3]

44 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p2,h1,h2] * t_vvoo[p1,p4,h4,h3] 45 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p3,h1,h2] * t_vvoo[p2,p4,h4,h3]

46 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p2,h1,h2] * t_vvoo[p1,p4,h3,h4] 47 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p3,h1,h2] * t_vvoo[p2,p4,h3,h4]

48 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p4,h1,h3] * t_vvoo[p2,p1,h2,h4] 49 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p2,h1,h3] * t_vvoo[p4,p3,h2,h4]

50 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p4,p3,h1,h3] * t_vvoo[p2,p1,h2,h4]

51 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p2,h1,h3] * t_vvoo[p3,p4,h2,h4] 52 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p4,h1,h2] * t_vvoo[p1,p2,h3,h4]

53 + 4.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p3,h1,h3] * t_vvoo[p2,p4,h2,h4] 54 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p3,h1,h3] * t_vvoo[p4,p2,h2,h4]

55 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p1,h1,h3] * t_vvoo[p2,p4,h2,h4] 56 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p1,h1,h3] * t_vvoo[p4,p2,h2,h4]

57 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p4,h1,h3] * t_vvoo[p2,p3,h2,h4] 58 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p4,h1,h3] * t_vvoo[p3,p2,h2,h4]

59 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p4,p1,h1,h3] * t_vvoo[p2,p3,h2,h4] 60 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p4,p2,h1,h3] * t_vvoo[p3,p1,h2,h4]

61 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p3,h2] * t_vvoo[p1,p2,h4,h3]

62 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p3,h1] * t_vvoo[p1,p2,h3,h4] 63 - 2.0 * v_oovo[h3,h4,p3,h1] * t_vo[p1,h4] * t_vvoo[p2,p3,h2,h3]

64 - 2.0 * v_oovo[h3,h4,p3,h2] * t_vo[p2,h4] * t_vvoo[p1,p3,h1,h3] 65 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p1,h4] * t_vvoo[p3,p2,h2,h3]

66 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p2,h4] * t_vvoo[p3,p1,h1,h3] 67 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p1,h3] * t_vvoo[p2,p3,h2,h4]

68 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p2,h3] * t_vvoo[p1,p3,h1,h4] 69 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p2,h3] * t_vvoo[p3,p1,h2,h4]

70 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p1,h3] * t_vvoo[p3,p2,h1,h4] 71 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p3,h4] * t_vvoo[p2,p1,h2,h3]

72 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p3,h4] * t_vvoo[p1,p2,h1,h3] 73 - 2.0 * v_oovo[h3,h4,p3,h1] * t_vo[p3,h3] * t_vvoo[p2,p1,h2,h4]

74 - 2.0 * v_oovo[h3,h4,p3,h2] * t_vo[p3,h3] * t_vvoo[p1,p2,h1,h4] 75 + 1.0 * v_oooo[h3,h4,h1,h2] * t_vo[p1,h3] * t_vo[p2,h4]

76 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p3,h1] * t_vo[p4,h2] * t_vo[p2,h3] 77 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p4,h1] * t_vo[p3,h2] * t_vo[p1,h3]

78 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p4,h2] * t_vvoo[p1,p2,h3,h4] 79 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p1,h3] * t_vvoo[p2,p4,h2,h4]

80 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p2,h3] * t_vvoo[p1,p4,h1,h4] 81 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p1,h3] * t_vvoo[p4,p2,h2,h4]

82 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p2,h3] * t_vvoo[p4,p1,h1,h4] 83 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p1,h4] * t_vvoo[p2,p4,h2,h3]

84 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p2,h4] * t_vvoo[p1,p4,h1,h3] 85 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p2,h4] * t_vvoo[p4,p1,h2,h3]

86 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p1,h4] * t_vvoo[p4,p2,h1,h3] 87 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p4,h3] * t_vvoo[p2,p1,h2,h4]

88 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p4,h3] * t_vvoo[p1,p2,h1,h4] 89 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p4,h4] * t_vvoo[p2,p1,h2,h3]

90 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p4,h4] * t_vvoo[p1,p2,h1,h3]

91 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p1,h3] * t_vo[p2,h4] * t_vvoo[p3,p4,h1,h2] 92 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p1,h3] * t_vo[p3,h4] * t_vvoo[p4,p2,h1,h2]

93 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p2,h3] * t_vo[p3,h4] * t_vvoo[p1,p4,h1,h2] 94 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p1,h3] * t_vo[p4,h4] * t_vvoo[p3,p2,h1,h2]

95 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p2,h3] * t_vo[p4,h4] * t_vvoo[p1,p3,h1,h2] 96 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p3,h2] * t_vo[p2,h3] * t_vo[p1,h4]

97 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p3,h1] * t_vo[p1,h3] * t_vo[p2,h4] 98 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p4,h2] * t_vo[p1,h3] * t_vo[p2,h4];

UEOMCCSD-X1

1 ra_vo[p1a,h1a] = 2 1.0 * fb_ov[h2b,p2b] * xab_vvoo[p1a,p2b,h1a,h2b]

3 + 1.0 * fa_ov[h2a,p2a] * xaa_vvoo[p1a,p2a,h1a,h2a] 4 - 1.0 * vab_ooov[h2a,h3b,h1a,p2b] * xab_vvoo[p1a,p2b,h2a,h3b]

5 + 0.5 * vaa_oovo[h2a,h3a,p2a,h1a] * xaa_vvoo[p1a,p2a,h2a,h3a] 6 + 1.0 * vab_vovv[p1a,h2b,p2a,p3b] * xab_vvoo[p2a,p3b,h1a,h2b]

7 + 0.5 * vaa_vovv[p1a,h2a,p2a,p3a] * xaa_vvoo[p2a,p3a,h1a,h2a] 8 + 1.0 * fa_vv[p1a,p2a] * xa_vo[p2a,h1a]

9 - 1.0 * fa_oo[h2a,h1a] * xa_vo[p1a,h2a] 10 + 1.0 * vab_voov[p1a,h2b,h1a,p2b] * xb_vo[p2b,h2b]

11 - 1.0 * vaa_vovo[p1a,h2a,p2a,h1a] * xa_vo[p2a,h2a] 12 + 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * xb_vo[p2b,h2b] * tab_vvoo[p1a,p3b,h1a,h3b]

13 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xb_vo[p3b,h3b] * taa_vvoo[p1a,p2a,h1a,h2a] 14 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xa_vo[p2a,h2a] * tab_vvoo[p1a,p3b,h1a,h3b]

15 + 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * xa_vo[p2a,h2a] * taa_vvoo[p1a,p3a,h1a,h3a] 16 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xa_vo[p2a,h1a] * tab_vvoo[p1a,p3b,h2a,h3b]

17 - 0.5 * vaa_oovv[h2a,h3a,p2a,p3a] * xa_vo[p2a,h1a] * taa_vvoo[p1a,p3a,h2a,h3a] 18 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xa_vo[p1a,h2a] * tab_vvoo[p2a,p3b,h1a,h3b]

19 - 0.5 * vaa_oovv[h2a,h3a,p2a,p3a] * xa_vo[p1a,h2a] * taa_vvoo[p2a,p3a,h1a,h3a] 20 - 1.0 * fa_ov[h2a,p2a] * xa_vo[p2a,h1a] * ta_vo[p1a,h2a]

21 + 1.0 * vab_vovv[p1a,h2b,p2a,p3b] * xa_vo[p2a,h1a] * tb_vo[p3b,h2b] 22 + 1.0 * vaa_vovv[p1a,h2a,p2a,p3a] * xa_vo[p2a,h1a] * ta_vo[p3a,h2a]

23 - 1.0 * vab_ooov[h2a,h3b,h1a,p2b] * xa_vo[p1a,h2a] * tb_vo[p2b,h3b]

24 + 1.0 * vaa_oovo[h2a,h3a,p2a,h1a] * xa_vo[p1a,h2a] * ta_vo[p2a,h3a] 25 + 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p2b,h2b] * xab_vvoo[p1a,p3b,h1a,h3b]

26 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * tb_vo[p3b,h3b] * xaa_vvoo[p1a,p2a,h1a,h2a] 27 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h2a] * xab_vvoo[p1a,p3b,h1a,h3b]

28 + 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p2a,h2a] * xaa_vvoo[p1a,p3a,h1a,h3a] 29 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h1a] * xab_vvoo[p1a,p3b,h2a,h3b]

30 - 0.5 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p2a,h1a] * xaa_vvoo[p1a,p3a,h2a,h3a] 31 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p1a,h2a] * xab_vvoo[p2a,p3b,h1a,h3b]

32 - 0.5 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p1a,h2a] * xaa_vvoo[p2a,p3a,h1a,h3a] 33 - 1.0 * fa_ov[h2a,p2a] * ta_vo[p2a,h1a] * xa_vo[p1a,h2a]

34 + 1.0 * vab_vovv[p1a,h2b,p2a,p3b] * ta_vo[p2a,h1a] * xb_vo[p3b,h2b] 35 + 1.0 * vaa_vovv[p1a,h2a,p2a,p3a] * ta_vo[p2a,h1a] * xa_vo[p3a,h2a]

36 - 1.0 * vab_ooov[h2a,h3b,h1a,p2b] * ta_vo[p1a,h2a] * xb_vo[p2b,h3b] 37 + 1.0 * vaa_oovo[h2a,h3a,p2a,h1a] * ta_vo[p1a,h2a] * xa_vo[p2a,h3a]

38 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xa_vo[p2a,h1a] * ta_vo[p1a,h2a] * tb_vo[p3b,h3b] 39 - 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * xa_vo[p2a,h1a] * ta_vo[p1a,h2a] * ta_vo[p3a,h3a]

40 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h1a] * xa_vo[p1a,h2a] * tb_vo[p3b,h3b] 41 - 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p2a,h1a] * xa_vo[p1a,h2a] * ta_vo[p3a,h3a]

42 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h1a] * ta_vo[p1a,h2a] * xb_vo[p3b,h3b] 43 - 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p2a,h1a] * ta_vo[p1a,h2a] * xa_vo[p3a,h3a];

45 rb_vo[p1b,h1b] =

46 1.0 * fb_ov[h2b,p2b] * xbb_vvoo[p1b,p2b,h1b,h2b] 47 + 1.0 * fa_ov[h2a,p2a] * xab_vvoo[p2a,p1b,h2a,h1b]

48 + 0.5 * vbb_oovo[h2b,h3b,p2b,h1b] * xbb_vvoo[p1b,p2b,h2b,h3b] 49 - 1.0 * vab_oovo[h2a,h3b,p2a,h1b] * xab_vvoo[p2a,p1b,h2a,h3b]

50 + 0.5 * vbb_vovv[p1b,h2b,p2b,p3b] * xbb_vvoo[p2b,p3b,h1b,h2b]

51 + 1.0 * vab_ovvv[h2a,p1b,p2a,p3b] * xab_vvoo[p2a,p3b,h2a,h1b] 52 + 1.0 * fb_vv[p1b,p2b] * xb_vo[p2b,h1b]

53 - 1.0 * fb_oo[h2b,h1b] * xb_vo[p1b,h2b] 54 - 1.0 * vbb_vovo[p1b,h2b,p2b,h1b] * xb_vo[p2b,h2b]

55 + 1.0 * vab_ovvo[h2a,p1b,p2a,h1b] * xa_vo[p2a,h2a] 56 + 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * xb_vo[p2b,h2b] * tbb_vvoo[p1b,p3b,h1b,h3b]

57 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xb_vo[p3b,h3b] * tab_vvoo[p2a,p1b,h2a,h1b] 58 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xa_vo[p2a,h2a] * tbb_vvoo[p1b,p3b,h1b,h3b]

59 + 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * xa_vo[p2a,h2a] * tab_vvoo[p3a,p1b,h3a,h1b] 60 - 0.5 * vbb_oovv[h2b,h3b,p2b,p3b] * xb_vo[p2b,h1b] * tbb_vvoo[p1b,p3b,h2b,h3b]

61 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xb_vo[p3b,h1b] * tab_vvoo[p2a,p1b,h2a,h3b]

62 - 0.5 * vbb_oovv[h2b,h3b,p2b,p3b] * xb_vo[p1b,h2b] * tbb_vvoo[p2b,p3b,h1b,h3b] 63 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xb_vo[p1b,h3b] * tab_vvoo[p2a,p3b,h2a,h1b]

64 - 1.0 * fb_ov[h2b,p2b] * xb_vo[p2b,h1b] * tb_vo[p1b,h2b] 65 + 1.0 * vbb_vovv[p1b,h2b,p2b,p3b] * xb_vo[p2b,h1b] * tb_vo[p3b,h2b]

66 + 1.0 * vab_ovvv[h2a,p1b,p2a,p3b] * xa_vo[p2a,h2a] * tb_vo[p3b,h1b] 67 + 1.0 * vbb_oovo[h2b,h3b,p2b,h1b] * xb_vo[p1b,h2b] * tb_vo[p2b,h3b]

68 - 1.0 * vab_oovo[h2a,h3b,p2a,h1b] * xa_vo[p2a,h2a] * tb_vo[p1b,h3b] 69 + 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p2b,h2b] * xbb_vvoo[p1b,p3b,h1b,h3b]

70 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * tb_vo[p3b,h3b] * xab_vvoo[p2a,p1b,h2a,h1b] 71 + 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h2a] * xbb_vvoo[p1b,p3b,h1b,h3b]

72 + 1.0 * vaa_oovv[h2a,h3a,p2a,p3a] * ta_vo[p2a,h2a] * xab_vvoo[p3a,p1b,h3a,h1b] 73 - 0.5 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p2b,h1b] * xbb_vvoo[p1b,p3b,h2b,h3b]

74 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * tb_vo[p3b,h1b] * xab_vvoo[p2a,p1b,h2a,h3b] 75 - 0.5 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p1b,h2b] * xbb_vvoo[p2b,p3b,h1b,h3b]

76 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * tb_vo[p1b,h3b] * xab_vvoo[p2a,p3b,h2a,h1b] 77 - 1.0 * fb_ov[h2b,p2b] * tb_vo[p2b,h1b] * xb_vo[p1b,h2b]

78 + 1.0 * vbb_vovv[p1b,h2b,p2b,p3b] * tb_vo[p2b,h1b] * xb_vo[p3b,h2b] 79 + 1.0 * vab_ovvv[h2a,p1b,p2a,p3b] * ta_vo[p2a,h2a] * xb_vo[p3b,h1b]

80 + 1.0 * vbb_oovo[h2b,h3b,p2b,h1b] * tb_vo[p1b,h2b] * xb_vo[p2b,h3b] 81 - 1.0 * vab_oovo[h2a,h3b,p2a,h1b] * ta_vo[p2a,h2a] * xb_vo[p1b,h3b]

82 - 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * xb_vo[p2b,h1b] * tb_vo[p1b,h2b] * tb_vo[p3b,h3b] 83 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * xa_vo[p2a,h2a] * tb_vo[p3b,h1b] * tb_vo[p1b,h3b]

84 - 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p2b,h1b] * xb_vo[p1b,h2b] * tb_vo[p3b,h3b] 85 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h2a] * xb_vo[p3b,h1b] * tb_vo[p1b,h3b]

86 - 1.0 * vbb_oovv[h2b,h3b,p2b,p3b] * tb_vo[p2b,h1b] * tb_vo[p1b,h2b] * xb_vo[p3b,h3b] 87 - 1.0 * vab_oovv[h2a,h3b,p2a,p3b] * ta_vo[p2a,h2a] * tb_vo[p3b,h1b] * xb_vo[p1b,h3b];

UEOMCCSD-X2

1 rab_vvoo[p1a,p2b,h1a,h2b] = 2 1.0 * fa_vv[p1a,p3a] * xab_vvoo[p3a,p2b,h1a,h2b]

3 + 1.0 * fb_vv[p2b,p3b] * xab_vvoo[p1a,p3b,h1a,h2b] 4 - 1.0 * fa_oo[h3a,h1a] * xab_vvoo[p1a,p2b,h3a,h2b]

5 - 1.0 * fb_oo[h3b,h2b] * xab_vvoo[p1a,p2b,h1a,h3b] 6 + 1.0 * vab_vvvv[p1a,p2b,p3a,p4b] * xab_vvoo[p3a,p4b,h1a,h2b]

7 + 1.0 * vab_voov[p1a,h3b,h1a,p3b] * xbb_vvoo[p2b,p3b,h2b,h3b] 8 - 1.0 * vaa_vovo[p1a,h3a,p3a,h1a] * xab_vvoo[p3a,p2b,h3a,h2b]

9 - 1.0 * vab_vovo[p1a,h3b,p3a,h2b] * xab_vvoo[p3a,p2b,h1a,h3b] 10 - 1.0 * vab_ovov[h3a,p2b,h1a,p3b] * xab_vvoo[p1a,p3b,h3a,h2b]

11 - 1.0 * vbb_vovo[p2b,h3b,p3b,h2b] * xab_vvoo[p1a,p3b,h1a,h3b] 12 + 1.0 * vab_ovvo[h3a,p2b,p3a,h2b] * xaa_vvoo[p1a,p3a,h1a,h3a]

13 + 1.0 * vab_oooo[h3a,h4b,h1a,h2b] * xab_vvoo[p1a,p2b,h3a,h4b] 14 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * xab_vvoo[p1a,p3b,h1a,h2b] * tbb_vvoo[p2b,p4b,h3b,h4b]

15 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p1a,p4b,h1a,h2b] * tab_vvoo[p3a,p2b,h3a,h4b] 16 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p2b,h1a,h2b] * tab_vvoo[p1a,p4b,h3a,h4b]

17 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * xab_vvoo[p3a,p2b,h1a,h2b] * taa_vvoo[p1a,p4a,h3a,h4a] 18 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * xab_vvoo[p1a,p2b,h1a,h3b] * tbb_vvoo[p3b,p4b,h2b,h4b]

19 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p1a,p2b,h1a,h4b] * tab_vvoo[p3a,p4b,h3a,h2b] 20 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p4b,h1a,h4b] * tab_vvoo[p1a,p2b,h3a,h2b]

21 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * xaa_vvoo[p3a,p4a,h1a,h3a] * tab_vvoo[p1a,p2b,h4a,h2b] 22 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p4b,h1a,h2b] * tab_vvoo[p1a,p2b,h3a,h4b]

23 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xab_vvoo[p1a,p3b,h1a,h3b] * tbb_vvoo[p2b,p4b,h2b,h4b]

24 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p1a,p4b,h1a,h4b] * tab_vvoo[p3a,p2b,h3a,h2b] 25 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xaa_vvoo[p1a,p3a,h1a,h3a] * tbb_vvoo[p2b,p4b,h2b,h4b]

26 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xaa_vvoo[p1a,p3a,h1a,h3a] * tab_vvoo[p4a,p2b,h4a,h2b] 27 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p2b,h1a,h4b] * tab_vvoo[p1a,p4b,h3a,h2b]

28 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p1a,p3b,h1a,h2b] * xbb_vvoo[p2b,p4b,h3b,h4b] 29 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p1a,p4b,h1a,h2b] * xab_vvoo[p3a,p2b,h3a,h4b]

30 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h1a,h2b] * xab_vvoo[p1a,p4b,h3a,h4b] 31 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * tab_vvoo[p3a,p2b,h1a,h2b] * xaa_vvoo[p1a,p4a,h3a,h4a]

32 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p1a,p2b,h1a,h3b] * xbb_vvoo[p3b,p4b,h2b,h4b] 33 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p1a,p2b,h1a,h4b] * xab_vvoo[p3a,p4b,h3a,h2b]

34 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h1a,h4b] * xab_vvoo[p1a,p2b,h3a,h2b] 35 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p3a,p4a,h1a,h3a] * xab_vvoo[p1a,p2b,h4a,h2b]

36 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h1a,h2b] * xab_vvoo[p1a,p2b,h3a,h4b] 37 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p1a,p3b,h1a,h3b] * xbb_vvoo[p2b,p4b,h2b,h4b]

38 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p1a,p4b,h1a,h4b] * xab_vvoo[p3a,p2b,h3a,h2b] 39 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p1a,p3a,h1a,h3a] * xbb_vvoo[p2b,p4b,h2b,h4b]

40 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p1a,p3a,h1a,h3a] * xab_vvoo[p4a,p2b,h4a,h2b] 41 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h1a,h4b] * xab_vvoo[p1a,p4b,h3a,h2b]

42 + 1.0 * vab_vvov[p1a,p2b,h1a,p3b] * xb_vo[p3b,h2b] 43 + 1.0 * vab_vvvo[p1a,p2b,p3a,h2b] * xa_vo[p3a,h1a]

44 - 1.0 * vab_vooo[p1a,h3b,h1a,h2b] * xb_vo[p2b,h3b] 45 - 1.0 * vab_ovoo[h3a,p2b,h1a,h2b] * xa_vo[p1a,h3a]

46 - 1.0 * fa_ov[h3a,p3a] * xa_vo[p1a,h3a] * tab_vvoo[p3a,p2b,h1a,h2b] 47 - 1.0 * fb_ov[h3b,p3b] * xb_vo[p2b,h3b] * tab_vvoo[p1a,p3b,h1a,h2b]

48 - 1.0 * fa_ov[h3a,p3a] * xa_vo[p3a,h1a] * tab_vvoo[p1a,p2b,h3a,h2b] 49 - 1.0 * fb_ov[h3b,p3b] * xb_vo[p3b,h2b] * tab_vvoo[p1a,p2b,h1a,h3b]

50 + 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * xb_vo[p4b,h3b] * tab_vvoo[p3a,p2b,h1a,h2b]

51 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * xa_vo[p3a,h3a] * tab_vvoo[p4a,p2b,h1a,h2b] 52 - 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * xb_vo[p3b,h3b] * tab_vvoo[p1a,p4b,h1a,h2b]

53 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * xa_vo[p3a,h3a] * tab_vvoo[p1a,p4b,h1a,h2b] 54 - 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * xb_vo[p3b,h4b] * tab_vvoo[p1a,p2b,h3a,h2b]

55 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * xa_vo[p3a,h3a] * tab_vvoo[p1a,p2b,h4a,h2b] 56 - 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * xb_vo[p3b,h3b] * tab_vvoo[p1a,p2b,h1a,h4b]

57 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * xa_vo[p3a,h3a] * tab_vvoo[p1a,p2b,h1a,h4b] 58 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * xb_vo[p2b,h3b] * tab_vvoo[p3a,p4b,h1a,h2b]

59 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * xa_vo[p1a,h3a] * tab_vvoo[p3a,p4b,h1a,h2b] 60 + 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * xa_vo[p3a,h1a] * tbb_vvoo[p2b,p4b,h2b,h3b]

61 + 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * xa_vo[p3a,h1a] * tab_vvoo[p4a,p2b,h3a,h2b]

62 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * xb_vo[p4b,h2b] * tab_vvoo[p3a,p2b,h1a,h3b] 63 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * xa_vo[p3a,h1a] * tab_vvoo[p1a,p4b,h3a,h2b]

64 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * xb_vo[p3b,h2b] * tab_vvoo[p1a,p4b,h1a,h3b] 65 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * xb_vo[p4b,h2b] * taa_vvoo[p1a,p3a,h1a,h3a]

66 - 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * xa_vo[p1a,h3a] * tbb_vvoo[p2b,p3b,h2b,h4b] 67 + 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * xa_vo[p1a,h3a] * tab_vvoo[p3a,p2b,h4a,h2b]

68 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * xa_vo[p1a,h3a] * tab_vvoo[p3a,p2b,h1a,h4b] 69 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * xb_vo[p2b,h4b] * tab_vvoo[p1a,p3b,h3a,h2b]

70 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * xb_vo[p2b,h3b] * tab_vvoo[p1a,p3b,h1a,h4b] 71 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * xb_vo[p2b,h4b] * taa_vvoo[p1a,p3a,h1a,h3a]

72 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * xb_vo[p3b,h2b] * tab_vvoo[p1a,p2b,h3a,h4b] 73 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * xa_vo[p3a,h1a] * tab_vvoo[p1a,p2b,h3a,h4b]

74 + 1.0 * vab_vvvv[p1a,p2b,p3a,p4b] * xa_vo[p3a,h1a] * tb_vo[p4b,h2b] 75 - 1.0 * vab_voov[p1a,h3b,h1a,p3b] * xb_vo[p3b,h2b] * tb_vo[p2b,h3b]

76 - 1.0 * vab_vovo[p1a,h3b,p3a,h2b] * xa_vo[p3a,h1a] * tb_vo[p2b,h3b] 77 - 1.0 * vab_ovov[h3a,p2b,h1a,p3b] * xa_vo[p1a,h3a] * tb_vo[p3b,h2b]

78 - 1.0 * vab_ovvo[h3a,p2b,p3a,h2b] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] 79 + 1.0 * vab_oooo[h3a,h4b,h1a,h2b] * xa_vo[p1a,h3a] * tb_vo[p2b,h4b]

80 - 1.0 * fa_ov[h3a,p3a] * ta_vo[p1a,h3a] * xab_vvoo[p3a,p2b,h1a,h2b] 81 - 1.0 * fb_ov[h3b,p3b] * tb_vo[p2b,h3b] * xab_vvoo[p1a,p3b,h1a,h2b]

82 - 1.0 * fa_ov[h3a,p3a] * ta_vo[p3a,h1a] * xab_vvoo[p1a,p2b,h3a,h2b] 83 - 1.0 * fb_ov[h3b,p3b] * tb_vo[p3b,h2b] * xab_vvoo[p1a,p2b,h1a,h3b]

84 + 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * tb_vo[p4b,h3b] * xab_vvoo[p3a,p2b,h1a,h2b] 85 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h3a] * xab_vvoo[p4a,p2b,h1a,h2b]

86 - 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h3b] * xab_vvoo[p1a,p4b,h1a,h2b] 87 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p3a,h3a] * xab_vvoo[p1a,p4b,h1a,h2b]

88 - 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * tb_vo[p3b,h4b] * xab_vvoo[p1a,p2b,h3a,h2b] 89 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p3a,h3a] * xab_vvoo[p1a,p2b,h4a,h2b]

90 - 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p3b,h3b] * xab_vvoo[p1a,p2b,h1a,h4b]

91 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p3a,h3a] * xab_vvoo[p1a,p2b,h1a,h4b] 92 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * tb_vo[p2b,h3b] * xab_vvoo[p3a,p4b,h1a,h2b]

93 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p1a,h3a] * xab_vvoo[p3a,p4b,h1a,h2b] 94 + 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * ta_vo[p3a,h1a] * xbb_vvoo[p2b,p4b,h2b,h3b]

95 + 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * xab_vvoo[p4a,p2b,h3a,h2b] 96 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * tb_vo[p4b,h2b] * xab_vvoo[p3a,p2b,h1a,h3b]

97 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p3a,h1a] * xab_vvoo[p1a,p4b,h3a,h2b] 98 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h2b] * xab_vvoo[p1a,p4b,h1a,h3b]

99 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * tb_vo[p4b,h2b] * xaa_vvoo[p1a,p3a,h1a,h3a] 100 - 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * ta_vo[p1a,h3a] * xbb_vvoo[p2b,p3b,h2b,h4b]

101 + 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p1a,h3a] * xab_vvoo[p3a,p2b,h4a,h2b]

102 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p1a,h3a] * xab_vvoo[p3a,p2b,h1a,h4b] 103 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * tb_vo[p2b,h4b] * xab_vvoo[p1a,p3b,h3a,h2b]

104 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p2b,h3b] * xab_vvoo[p1a,p3b,h1a,h4b] 105 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * tb_vo[p2b,h4b] * xaa_vvoo[p1a,p3a,h1a,h3a]

106 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * tb_vo[p3b,h2b] * xab_vvoo[p1a,p2b,h3a,h4b] 107 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p3a,h1a] * xab_vvoo[p1a,p2b,h3a,h4b]

108 + 1.0 * vab_vvvv[p1a,p2b,p3a,p4b] * ta_vo[p3a,h1a] * xb_vo[p4b,h2b] 109 - 1.0 * vab_voov[p1a,h3b,h1a,p3b] * tb_vo[p3b,h2b] * xb_vo[p2b,h3b]

110 - 1.0 * vab_vovo[p1a,h3b,p3a,h2b] * ta_vo[p3a,h1a] * xb_vo[p2b,h3b] 111 - 1.0 * vab_ovov[h3a,p2b,h1a,p3b] * ta_vo[p1a,h3a] * xb_vo[p3b,h2b]

112 - 1.0 * vab_ovvo[h3a,p2b,p3a,h2b] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a]

113 + 1.0 * vab_oooo[h3a,h4b,h1a,h2b] * ta_vo[p1a,h3a] * xb_vo[p2b,h4b] 114 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p1a,h3a] * tb_vo[p4b,h4b] * tab_vvoo[p3a,p2b,h1a,h2b]

115 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p1a,h3a] * ta_vo[p3a,h4a] * tab_vvoo[p4a,p2b,h1a,h2b] 116 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p2b,h3b] * tb_vo[p3b,h4b] * tab_vvoo[p1a,p4b,h1a,h2b]

117 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h3a] * tb_vo[p2b,h4b] * tab_vvoo[p1a,p4b,h1a,h2b] 118 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h1a] * tb_vo[p4b,h4b] * tab_vvoo[p1a,p2b,h3a,h2b]

119 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p4a,h3a] * tab_vvoo[p1a,p2b,h4a,h2b] 120 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h2b] * tb_vo[p4b,h3b] * tab_vvoo[p1a,p2b,h1a,h4b]

121 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h3a] * tb_vo[p4b,h2b] * tab_vvoo[p1a,p2b,h1a,h4b] 122 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p1a,h3a] * tb_vo[p2b,h4b] * tab_vvoo[p3a,p4b,h1a,h2b]

123 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] * tbb_vvoo[p2b,p4b,h2b,h4b] 124 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] * tab_vvoo[p4a,p2b,h4a,h2b]

125 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p1a,h3a] * tb_vo[p4b,h2b] * tab_vvoo[p3a,p2b,h1a,h4b] 126 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h1a] * tb_vo[p2b,h4b] * tab_vvoo[p1a,p4b,h3a,h2b]

127 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h2b] * tb_vo[p2b,h3b] * tab_vvoo[p1a,p4b,h1a,h4b] 128 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xb_vo[p4b,h2b] * tb_vo[p2b,h4b] * taa_vvoo[p1a,p3a,h1a,h3a]

129 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h1a] * tb_vo[p4b,h2b] * tab_vvoo[p1a,p2b,h3a,h4b] 130 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * xa_vo[p3a,h1a] * tb_vo[p4b,h2b] * tb_vo[p2b,h3b]

131 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] * tb_vo[p4b,h2b] 132 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * xa_vo[p1a,h3a] * tb_vo[p3b,h2b] * tb_vo[p2b,h4b]

133 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] * tb_vo[p2b,h4b] 134 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * xb_vo[p4b,h4b] * tab_vvoo[p3a,p2b,h1a,h2b]

135 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p1a,h3a] * xa_vo[p3a,h4a] * tab_vvoo[p4a,p2b,h1a,h2b] 136 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p2b,h3b] * xb_vo[p3b,h4b] * tab_vvoo[p1a,p4b,h1a,h2b]

137 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * xb_vo[p2b,h4b] * tab_vvoo[p1a,p4b,h1a,h2b] 138 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * xb_vo[p4b,h4b] * tab_vvoo[p1a,p2b,h3a,h2b]

139 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p4a,h3a] * tab_vvoo[p1a,p2b,h4a,h2b] 140 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * xb_vo[p4b,h3b] * tab_vvoo[p1a,p2b,h1a,h4b]

141 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * xb_vo[p4b,h2b] * tab_vvoo[p1a,p2b,h1a,h4b]

142 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * xb_vo[p2b,h4b] * tab_vvoo[p3a,p4b,h1a,h2b] 143 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a] * tbb_vvoo[p2b,p4b,h2b,h4b]

144 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a] * tab_vvoo[p4a,p2b,h4a,h2b] 145 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * xb_vo[p4b,h2b] * tab_vvoo[p3a,p2b,h1a,h4b]

146 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * xb_vo[p2b,h4b] * tab_vvoo[p1a,p4b,h3a,h2b] 147 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * xb_vo[p2b,h3b] * tab_vvoo[p1a,p4b,h1a,h4b]

148 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h2b] * xb_vo[p2b,h4b] * taa_vvoo[p1a,p3a,h1a,h3a] 149 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * xb_vo[p4b,h2b] * tab_vvoo[p1a,p2b,h3a,h4b]

150 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * ta_vo[p3a,h1a] * xb_vo[p4b,h2b] * tb_vo[p2b,h3b] 151 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a] * tb_vo[p4b,h2b]

152 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * ta_vo[p1a,h3a] * xb_vo[p3b,h2b] * tb_vo[p2b,h4b]

153 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a] * tb_vo[p2b,h4b] 154 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * tb_vo[p4b,h4b] * xab_vvoo[p3a,p2b,h1a,h2b]

155 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p1a,h3a] * ta_vo[p3a,h4a] * xab_vvoo[p4a,p2b,h1a,h2b] 156 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p2b,h3b] * tb_vo[p3b,h4b] * xab_vvoo[p1a,p4b,h1a,h2b]

157 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p2b,h4b] * xab_vvoo[p1a,p4b,h1a,h2b] 158 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p4b,h4b] * xab_vvoo[p1a,p2b,h3a,h2b]

159 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h3a] * xab_vvoo[p1a,p2b,h4a,h2b] 160 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p4b,h3b] * xab_vvoo[p1a,p2b,h1a,h4b]

161 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p4b,h2b] * xab_vvoo[p1a,p2b,h1a,h4b] 162 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * tb_vo[p2b,h4b] * xab_vvoo[p3a,p4b,h1a,h2b]

163 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * xbb_vvoo[p2b,p4b,h2b,h4b]

164 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * xab_vvoo[p4a,p2b,h4a,h2b] 165 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * tb_vo[p4b,h2b] * xab_vvoo[p3a,p2b,h1a,h4b]

166 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p2b,h4b] * xab_vvoo[p1a,p4b,h3a,h2b] 167 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p2b,h3b] * xab_vvoo[p1a,p4b,h1a,h4b]

168 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h2b] * tb_vo[p2b,h4b] * xaa_vvoo[p1a,p3a,h1a,h3a] 169 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p4b,h2b] * xab_vvoo[p1a,p2b,h3a,h4b]

170 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p4b,h2b] * xb_vo[p2b,h3b] 171 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * xb_vo[p4b,h2b]

172 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * ta_vo[p1a,h3a] * tb_vo[p3b,h2b] * xb_vo[p2b,h4b] 173 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * xb_vo[p2b,h4b]

174 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] * tb_vo[p4b,h2b] * tb_vo[p2b,h4b] 175 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a] * tb_vo[p4b,h2b] * tb_vo[p2b,h4b]

176 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * xb_vo[p4b,h2b] * tb_vo[p2b,h4b] 177 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * tb_vo[p4b,h2b] * xb_vo[p2b,h4b];

179 raa_vvoo[p1a,p2a,h1a,h2a] =

180 - 1.0 * fa_vv[p1a,p3a] * xaa_vvoo[p2a,p3a,h1a,h2a] 181 + 1.0 * fa_vv[p2a,p3a] * xaa_vvoo[p1a,p3a,h1a,h2a]

182 + 1.0 * fa_oo[h3a,h1a] * xaa_vvoo[p1a,p2a,h2a,h3a] 183 - 1.0 * fa_oo[h3a,h2a] * xaa_vvoo[p1a,p2a,h1a,h3a]

184 + 0.5 * vaa_vvvv[p1a,p2a,p3a,p4a] * xaa_vvoo[p3a,p4a,h1a,h2a] 185 + 1.0 * vab_voov[p1a,h3b,h1a,p3b] * xab_vvoo[p2a,p3b,h2a,h3b]

186 - 1.0 * vaa_vovo[p1a,h3a,p3a,h1a] * xaa_vvoo[p2a,p3a,h2a,h3a] 187 - 1.0 * vab_voov[p1a,h3b,h2a,p3b] * xab_vvoo[p2a,p3b,h1a,h3b]

188 + 1.0 * vaa_vovo[p1a,h3a,p3a,h2a] * xaa_vvoo[p2a,p3a,h1a,h3a] 189 - 1.0 * vab_voov[p2a,h3b,h1a,p3b] * xab_vvoo[p1a,p3b,h2a,h3b]

190 + 1.0 * vaa_vovo[p2a,h3a,p3a,h1a] * xaa_vvoo[p1a,p3a,h2a,h3a] 191 + 1.0 * vab_voov[p2a,h3b,h2a,p3b] * xab_vvoo[p1a,p3b,h1a,h3b]

192 - 1.0 * vaa_vovo[p2a,h3a,p3a,h2a] * xaa_vvoo[p1a,p3a,h1a,h3a]

193 + 0.5 * vaa_oooo[h3a,h4a,h1a,h2a] * xaa_vvoo[p1a,p2a,h3a,h4a] 194 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xaa_vvoo[p1a,p3a,h1a,h2a] * tab_vvoo[p2a,p4b,h3a,h4b]

195 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * xaa_vvoo[p1a,p3a,h1a,h2a] * taa_vvoo[p2a,p4a,h3a,h4a] 196 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xaa_vvoo[p2a,p3a,h1a,h2a] * tab_vvoo[p1a,p4b,h3a,h4b]

197 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * xaa_vvoo[p2a,p3a,h1a,h2a] * taa_vvoo[p1a,p4a,h3a,h4a] 198 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xaa_vvoo[p1a,p2a,h1a,h3a] * tab_vvoo[p3a,p4b,h2a,h4b]

199 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * xaa_vvoo[p1a,p2a,h1a,h3a] * taa_vvoo[p3a,p4a,h2a,h4a] 200 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p4b,h1a,h4b] * taa_vvoo[p1a,p2a,h2a,h3a]

201 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * xaa_vvoo[p3a,p4a,h1a,h3a] * taa_vvoo[p1a,p2a,h2a,h4a] 202 + 0.25 * vaa_oovv[h3a,h4a,p3a,p4a] * xaa_vvoo[p3a,p4a,h1a,h2a] * taa_vvoo[p1a,p2a,h3a,h4a]

203 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xab_vvoo[p1a,p3b,h1a,h3b] * tab_vvoo[p2a,p4b,h2a,h4b]

204 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p1a,p4b,h1a,h4b] * taa_vvoo[p2a,p3a,h2a,h3a] 205 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xaa_vvoo[p1a,p3a,h1a,h3a] * tab_vvoo[p2a,p4b,h2a,h4b]

206 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xaa_vvoo[p1a,p3a,h1a,h3a] * taa_vvoo[p2a,p4a,h2a,h4a] 207 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xab_vvoo[p2a,p3b,h1a,h3b] * tab_vvoo[p1a,p4b,h2a,h4b]

208 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p2a,p4b,h1a,h4b] * taa_vvoo[p1a,p3a,h2a,h3a] 209 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xaa_vvoo[p2a,p3a,h1a,h3a] * tab_vvoo[p1a,p4b,h2a,h4b]

210 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xaa_vvoo[p2a,p3a,h1a,h3a] * taa_vvoo[p1a,p4a,h2a,h4a] 211 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p1a,p3a,h1a,h2a] * xab_vvoo[p2a,p4b,h3a,h4b]

212 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p1a,p3a,h1a,h2a] * xaa_vvoo[p2a,p4a,h3a,h4a] 213 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p2a,p3a,h1a,h2a] * xab_vvoo[p1a,p4b,h3a,h4b]

214 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p2a,p3a,h1a,h2a] * xaa_vvoo[p1a,p4a,h3a,h4a]

215 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p1a,p2a,h1a,h3a] * xab_vvoo[p3a,p4b,h2a,h4b] 216 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p1a,p2a,h1a,h3a] * xaa_vvoo[p3a,p4a,h2a,h4a]

217 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h1a,h4b] * xaa_vvoo[p1a,p2a,h2a,h3a] 218 - 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p3a,p4a,h1a,h3a] * xaa_vvoo[p1a,p2a,h2a,h4a]

219 + 0.25 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p3a,p4a,h1a,h2a] * xaa_vvoo[p1a,p2a,h3a,h4a] 220 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p1a,p3b,h1a,h3b] * xab_vvoo[p2a,p4b,h2a,h4b]

221 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p1a,p4b,h1a,h4b] * xaa_vvoo[p2a,p3a,h2a,h3a] 222 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p1a,p3a,h1a,h3a] * xab_vvoo[p2a,p4b,h2a,h4b]

223 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p1a,p3a,h1a,h3a] * xaa_vvoo[p2a,p4a,h2a,h4a] 224 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tab_vvoo[p2a,p3b,h1a,h3b] * xab_vvoo[p1a,p4b,h2a,h4b]

225 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p2a,p4b,h1a,h4b] * xaa_vvoo[p1a,p3a,h2a,h3a] 226 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * taa_vvoo[p2a,p3a,h1a,h3a] * xab_vvoo[p1a,p4b,h2a,h4b]

227 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * taa_vvoo[p2a,p3a,h1a,h3a] * xaa_vvoo[p1a,p4a,h2a,h4a] 228 - 1.0 * vaa_vvvo[p1a,p2a,p3a,h1a] * xa_vo[p3a,h2a]

229 + 1.0 * vaa_vvvo[p1a,p2a,p3a,h2a] * xa_vo[p3a,h1a] 230 - 1.0 * vaa_vooo[p1a,h3a,h1a,h2a] * xa_vo[p2a,h3a]

231 + 1.0 * vaa_vooo[p2a,h3a,h1a,h2a] * xa_vo[p1a,h3a] 232 + 1.0 * fa_ov[h3a,p3a] * xa_vo[p1a,h3a] * taa_vvoo[p2a,p3a,h1a,h2a]

233 - 1.0 * fa_ov[h3a,p3a] * xa_vo[p2a,h3a] * taa_vvoo[p1a,p3a,h1a,h2a] 234 + 1.0 * fa_ov[h3a,p3a] * xa_vo[p3a,h1a] * taa_vvoo[p1a,p2a,h2a,h3a]

235 - 1.0 * fa_ov[h3a,p3a] * xa_vo[p3a,h2a] * taa_vvoo[p1a,p2a,h1a,h3a] 236 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * xb_vo[p4b,h3b] * taa_vvoo[p2a,p3a,h1a,h2a]

237 + 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * xa_vo[p3a,h3a] * taa_vvoo[p2a,p4a,h1a,h2a] 238 + 1.0 * vab_vovv[p2a,h3b,p3a,p4b] * xb_vo[p4b,h3b] * taa_vvoo[p1a,p3a,h1a,h2a]

239 - 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * xa_vo[p3a,h3a] * taa_vvoo[p1a,p4a,h1a,h2a] 240 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * xb_vo[p3b,h4b] * taa_vvoo[p1a,p2a,h2a,h3a]

241 + 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * xa_vo[p3a,h3a] * taa_vvoo[p1a,p2a,h2a,h4a] 242 - 1.0 * vab_ooov[h3a,h4b,h2a,p3b] * xb_vo[p3b,h4b] * taa_vvoo[p1a,p2a,h1a,h3a]

243 - 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * xa_vo[p3a,h3a] * taa_vvoo[p1a,p2a,h1a,h4a]

244 - 0.5 * vaa_vovv[p1a,h3a,p3a,p4a] * xa_vo[p2a,h3a] * taa_vvoo[p3a,p4a,h1a,h2a] 245 + 0.5 * vaa_vovv[p2a,h3a,p3a,p4a] * xa_vo[p1a,h3a] * taa_vvoo[p3a,p4a,h1a,h2a]

246 + 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * xa_vo[p3a,h1a] * tab_vvoo[p2a,p4b,h2a,h3b] 247 + 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * xa_vo[p3a,h1a] * taa_vvoo[p2a,p4a,h2a,h3a]

248 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * xa_vo[p3a,h2a] * tab_vvoo[p2a,p4b,h1a,h3b] 249 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * xa_vo[p3a,h2a] * taa_vvoo[p2a,p4a,h1a,h3a]

250 - 1.0 * vab_vovv[p2a,h3b,p3a,p4b] * xa_vo[p3a,h1a] * tab_vvoo[p1a,p4b,h2a,h3b] 251 - 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * xa_vo[p3a,h1a] * taa_vvoo[p1a,p4a,h2a,h3a]

252 + 1.0 * vab_vovv[p2a,h3b,p3a,p4b] * xa_vo[p3a,h2a] * tab_vvoo[p1a,p4b,h1a,h3b] 253 + 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * xa_vo[p3a,h2a] * taa_vvoo[p1a,p4a,h1a,h3a]

254 - 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * xa_vo[p1a,h3a] * tab_vvoo[p2a,p3b,h2a,h4b]

255 + 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * xa_vo[p1a,h3a] * taa_vvoo[p2a,p3a,h2a,h4a] 256 + 1.0 * vab_ooov[h3a,h4b,h2a,p3b] * xa_vo[p1a,h3a] * tab_vvoo[p2a,p3b,h1a,h4b]

257 - 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * xa_vo[p1a,h3a] * taa_vvoo[p2a,p3a,h1a,h4a] 258 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * xa_vo[p2a,h3a] * tab_vvoo[p1a,p3b,h2a,h4b]

259 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * xa_vo[p2a,h3a] * taa_vvoo[p1a,p3a,h2a,h4a] 260 - 1.0 * vab_ooov[h3a,h4b,h2a,p3b] * xa_vo[p2a,h3a] * tab_vvoo[p1a,p3b,h1a,h4b]

261 + 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * xa_vo[p2a,h3a] * taa_vvoo[p1a,p3a,h1a,h4a] 262 - 0.5 * vaa_oovo[h3a,h4a,p3a,h1a] * xa_vo[p3a,h2a] * taa_vvoo[p1a,p2a,h3a,h4a]

263 + 0.5 * vaa_oovo[h3a,h4a,p3a,h2a] * xa_vo[p3a,h1a] * taa_vvoo[p1a,p2a,h3a,h4a] 264 + 1.0 * vaa_vvvv[p1a,p2a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p4a,h2a]

265 + 1.0 * vaa_vovo[p1a,h3a,p3a,h1a] * xa_vo[p3a,h2a] * ta_vo[p2a,h3a]

266 - 1.0 * vaa_vovo[p1a,h3a,p3a,h2a] * xa_vo[p3a,h1a] * ta_vo[p2a,h3a] 267 - 1.0 * vaa_vovo[p2a,h3a,p3a,h1a] * xa_vo[p3a,h2a] * ta_vo[p1a,h3a]

268 + 1.0 * vaa_vovo[p2a,h3a,p3a,h2a] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] 269 + 1.0 * vaa_oooo[h3a,h4a,h1a,h2a] * xa_vo[p1a,h3a] * ta_vo[p2a,h4a]

270 + 1.0 * fa_ov[h3a,p3a] * ta_vo[p1a,h3a] * xaa_vvoo[p2a,p3a,h1a,h2a] 271 - 1.0 * fa_ov[h3a,p3a] * ta_vo[p2a,h3a] * xaa_vvoo[p1a,p3a,h1a,h2a]

272 + 1.0 * fa_ov[h3a,p3a] * ta_vo[p3a,h1a] * xaa_vvoo[p1a,p2a,h2a,h3a] 273 - 1.0 * fa_ov[h3a,p3a] * ta_vo[p3a,h2a] * xaa_vvoo[p1a,p2a,h1a,h3a]

274 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * tb_vo[p4b,h3b] * xaa_vvoo[p2a,p3a,h1a,h2a] 275 + 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h3a] * xaa_vvoo[p2a,p4a,h1a,h2a]

276 + 1.0 * vab_vovv[p2a,h3b,p3a,p4b] * tb_vo[p4b,h3b] * xaa_vvoo[p1a,p3a,h1a,h2a] 277 - 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p3a,h3a] * xaa_vvoo[p1a,p4a,h1a,h2a]

278 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * tb_vo[p3b,h4b] * xaa_vvoo[p1a,p2a,h2a,h3a] 279 + 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p3a,h3a] * xaa_vvoo[p1a,p2a,h2a,h4a]

280 - 1.0 * vab_ooov[h3a,h4b,h2a,p3b] * tb_vo[p3b,h4b] * xaa_vvoo[p1a,p2a,h1a,h3a] 281 - 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p3a,h3a] * xaa_vvoo[p1a,p2a,h1a,h4a]

282 - 0.5 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p2a,h3a] * xaa_vvoo[p3a,p4a,h1a,h2a] 283 + 0.5 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p1a,h3a] * xaa_vvoo[p3a,p4a,h1a,h2a]

284 + 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * ta_vo[p3a,h1a] * xab_vvoo[p2a,p4b,h2a,h3b] 285 + 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * xaa_vvoo[p2a,p4a,h2a,h3a]

286 - 1.0 * vab_vovv[p1a,h3b,p3a,p4b] * ta_vo[p3a,h2a] * xab_vvoo[p2a,p4b,h1a,h3b] 287 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h2a] * xaa_vvoo[p2a,p4a,h1a,h3a]

288 - 1.0 * vab_vovv[p2a,h3b,p3a,p4b] * ta_vo[p3a,h1a] * xab_vvoo[p1a,p4b,h2a,h3b] 289 - 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * xaa_vvoo[p1a,p4a,h2a,h3a]

290 + 1.0 * vab_vovv[p2a,h3b,p3a,p4b] * ta_vo[p3a,h2a] * xab_vvoo[p1a,p4b,h1a,h3b] 291 + 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p3a,h2a] * xaa_vvoo[p1a,p4a,h1a,h3a]

292 - 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * ta_vo[p1a,h3a] * xab_vvoo[p2a,p3b,h2a,h4b] 293 + 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p1a,h3a] * xaa_vvoo[p2a,p3a,h2a,h4a]

294 + 1.0 * vab_ooov[h3a,h4b,h2a,p3b] * ta_vo[p1a,h3a] * xab_vvoo[p2a,p3b,h1a,h4b]

295 - 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p1a,h3a] * xaa_vvoo[p2a,p3a,h1a,h4a] 296 + 1.0 * vab_ooov[h3a,h4b,h1a,p3b] * ta_vo[p2a,h3a] * xab_vvoo[p1a,p3b,h2a,h4b]

297 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p2a,h3a] * xaa_vvoo[p1a,p3a,h2a,h4a] 298 - 1.0 * vab_ooov[h3a,h4b,h2a,p3b] * ta_vo[p2a,h3a] * xab_vvoo[p1a,p3b,h1a,h4b]

299 + 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p2a,h3a] * xaa_vvoo[p1a,p3a,h1a,h4a] 300 - 0.5 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p3a,h2a] * xaa_vvoo[p1a,p2a,h3a,h4a]

301 + 0.5 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p3a,h1a] * xaa_vvoo[p1a,p2a,h3a,h4a] 302 + 1.0 * vaa_vvvv[p1a,p2a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p4a,h2a]

303 + 1.0 * vaa_vovo[p1a,h3a,p3a,h1a] * ta_vo[p3a,h2a] * xa_vo[p2a,h3a] 304 - 1.0 * vaa_vovo[p1a,h3a,p3a,h2a] * ta_vo[p3a,h1a] * xa_vo[p2a,h3a]

305 - 1.0 * vaa_vovo[p2a,h3a,p3a,h1a] * ta_vo[p3a,h2a] * xa_vo[p1a,h3a]

306 + 1.0 * vaa_vovo[p2a,h3a,p3a,h2a] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a] 307 + 1.0 * vaa_oooo[h3a,h4a,h1a,h2a] * ta_vo[p1a,h3a] * xa_vo[p2a,h4a]

308 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p1a,h3a] * tb_vo[p4b,h4b] * taa_vvoo[p2a,p3a,h1a,h2a] 309 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p1a,h3a] * ta_vo[p3a,h4a] * taa_vvoo[p2a,p4a,h1a,h2a]

310 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p2a,h3a] * tb_vo[p4b,h4b] * taa_vvoo[p1a,p3a,h1a,h2a] 311 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p2a,h3a] * ta_vo[p3a,h4a] * taa_vvoo[p1a,p4a,h1a,h2a]

312 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h1a] * tb_vo[p4b,h4b] * taa_vvoo[p1a,p2a,h2a,h3a] 313 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p4a,h3a] * taa_vvoo[p1a,p2a,h2a,h4a]

314 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h2a] * tb_vo[p4b,h4b] * taa_vvoo[p1a,p2a,h1a,h3a] 315 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h2a] * ta_vo[p4a,h3a] * taa_vvoo[p1a,p2a,h1a,h4a]

316 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p1a,h3a] * ta_vo[p2a,h4a] * taa_vvoo[p3a,p4a,h1a,h2a]

317 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] * tab_vvoo[p2a,p4b,h2a,h4b] 318 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] * taa_vvoo[p2a,p4a,h2a,h4a]

319 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h2a] * ta_vo[p1a,h3a] * tab_vvoo[p2a,p4b,h1a,h4b] 320 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h2a] * ta_vo[p1a,h3a] * taa_vvoo[p2a,p4a,h1a,h4a]

321 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h1a] * ta_vo[p2a,h3a] * tab_vvoo[p1a,p4b,h2a,h4b] 322 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p2a,h3a] * taa_vvoo[p1a,p4a,h2a,h4a]

323 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h2a] * ta_vo[p2a,h3a] * tab_vvoo[p1a,p4b,h1a,h4b] 324 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h2a] * ta_vo[p2a,h3a] * taa_vvoo[p1a,p4a,h1a,h4a]

325 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p4a,h2a] * taa_vvoo[p1a,p2a,h3a,h4a] 326 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p4a,h2a] * ta_vo[p2a,h3a]

327 + 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p4a,h2a] * ta_vo[p1a,h3a] 328 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * xa_vo[p3a,h2a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a]

329 + 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * xa_vo[p3a,h1a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a] 330 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * xb_vo[p4b,h4b] * taa_vvoo[p2a,p3a,h1a,h2a]

331 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p1a,h3a] * xa_vo[p3a,h4a] * taa_vvoo[p2a,p4a,h1a,h2a] 332 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p2a,h3a] * xb_vo[p4b,h4b] * taa_vvoo[p1a,p3a,h1a,h2a]

333 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p2a,h3a] * xa_vo[p3a,h4a] * taa_vvoo[p1a,p4a,h1a,h2a] 334 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * xb_vo[p4b,h4b] * taa_vvoo[p1a,p2a,h2a,h3a]

335 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p4a,h3a] * taa_vvoo[p1a,p2a,h2a,h4a] 336 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h2a] * xb_vo[p4b,h4b] * taa_vvoo[p1a,p2a,h1a,h3a]

337 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h2a] * xa_vo[p4a,h3a] * taa_vvoo[p1a,p2a,h1a,h4a] 338 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p1a,h3a] * xa_vo[p2a,h4a] * taa_vvoo[p3a,p4a,h1a,h2a]

339 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a] * tab_vvoo[p2a,p4b,h2a,h4b] 340 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a] * taa_vvoo[p2a,p4a,h2a,h4a]

341 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h2a] * xa_vo[p1a,h3a] * tab_vvoo[p2a,p4b,h1a,h4b] 342 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h2a] * xa_vo[p1a,h3a] * taa_vvoo[p2a,p4a,h1a,h4a]

343 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * xa_vo[p2a,h3a] * tab_vvoo[p1a,p4b,h2a,h4b] 344 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p2a,h3a] * taa_vvoo[p1a,p4a,h2a,h4a]

345 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h2a] * xa_vo[p2a,h3a] * tab_vvoo[p1a,p4b,h1a,h4b]

346 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h2a] * xa_vo[p2a,h3a] * taa_vvoo[p1a,p4a,h1a,h4a] 347 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p4a,h2a] * taa_vvoo[p1a,p2a,h3a,h4a]

348 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p4a,h2a] * ta_vo[p2a,h3a] 349 + 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p4a,h2a] * ta_vo[p1a,h3a]

350 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p3a,h2a] * xa_vo[p1a,h3a] * ta_vo[p2a,h4a] 351 + 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p3a,h1a] * xa_vo[p1a,h3a] * ta_vo[p2a,h4a]

352 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p1a,h3a] * tb_vo[p4b,h4b] * xaa_vvoo[p2a,p3a,h1a,h2a] 353 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p1a,h3a] * ta_vo[p3a,h4a] * xaa_vvoo[p2a,p4a,h1a,h2a]

354 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p2a,h3a] * tb_vo[p4b,h4b] * xaa_vvoo[p1a,p3a,h1a,h2a] 355 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p2a,h3a] * ta_vo[p3a,h4a] * xaa_vvoo[p1a,p4a,h1a,h2a]

356 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * tb_vo[p4b,h4b] * xaa_vvoo[p1a,p2a,h2a,h3a]

357 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h3a] * xaa_vvoo[p1a,p2a,h2a,h4a] 358 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h2a] * tb_vo[p4b,h4b] * xaa_vvoo[p1a,p2a,h1a,h3a]

359 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h2a] * ta_vo[p4a,h3a] * xaa_vvoo[p1a,p2a,h1a,h4a] 360 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a] * xaa_vvoo[p3a,p4a,h1a,h2a]

361 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * xab_vvoo[p2a,p4b,h2a,h4b] 362 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * xaa_vvoo[p2a,p4a,h2a,h4a]

363 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h2a] * ta_vo[p1a,h3a] * xab_vvoo[p2a,p4b,h1a,h4b] 364 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h2a] * ta_vo[p1a,h3a] * xaa_vvoo[p2a,p4a,h1a,h4a]

365 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h1a] * ta_vo[p2a,h3a] * xab_vvoo[p1a,p4b,h2a,h4b] 366 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p2a,h3a] * xaa_vvoo[p1a,p4a,h2a,h4a]

367 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h2a] * ta_vo[p2a,h3a] * xab_vvoo[p1a,p4b,h1a,h4b]

368 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h2a] * ta_vo[p2a,h3a] * xaa_vvoo[p1a,p4a,h1a,h4a] 369 + 0.5 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a] * xaa_vvoo[p1a,p2a,h3a,h4a]

370 - 1.0 * vaa_vovv[p1a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a] * xa_vo[p2a,h3a] 371 + 1.0 * vaa_vovv[p2a,h3a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a] * xa_vo[p1a,h3a]

372 - 1.0 * vaa_oovo[h3a,h4a,p3a,h1a] * ta_vo[p3a,h2a] * ta_vo[p1a,h3a] * xa_vo[p2a,h4a] 373 + 1.0 * vaa_oovo[h3a,h4a,p3a,h2a] * ta_vo[p3a,h1a] * ta_vo[p1a,h3a] * xa_vo[p2a,h4a]

374 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xa_vo[p3a,h1a] * ta_vo[p4a,h2a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a] 375 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * xa_vo[p4a,h2a] * ta_vo[p1a,h3a] * ta_vo[p2a,h4a]

376 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a] * xa_vo[p1a,h3a] * ta_vo[p2a,h4a] 377 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * ta_vo[p3a,h1a] * ta_vo[p4a,h2a] * ta_vo[p1a,h3a] * xa_vo[p2a,h4a];

379 rbb_vvoo[p1b,p2b,h1b,h2b] =

380 - 1.0 * fb_vv[p1b,p3b] * xbb_vvoo[p2b,p3b,h1b,h2b] 381 + 1.0 * fb_vv[p2b,p3b] * xbb_vvoo[p1b,p3b,h1b,h2b]

382 + 1.0 * fb_oo[h3b,h1b] * xbb_vvoo[p1b,p2b,h2b,h3b] 383 - 1.0 * fb_oo[h3b,h2b] * xbb_vvoo[p1b,p2b,h1b,h3b]

384 + 0.5 * vbb_vvvv[p1b,p2b,p3b,p4b] * xbb_vvoo[p3b,p4b,h1b,h2b] 385 - 1.0 * vbb_vovo[p1b,h3b,p3b,h1b] * xbb_vvoo[p2b,p3b,h2b,h3b]

386 + 1.0 * vab_ovvo[h3a,p1b,p3a,h1b] * xab_vvoo[p3a,p2b,h3a,h2b] 387 + 1.0 * vbb_vovo[p1b,h3b,p3b,h2b] * xbb_vvoo[p2b,p3b,h1b,h3b]

388 - 1.0 * vab_ovvo[h3a,p1b,p3a,h2b] * xab_vvoo[p3a,p2b,h3a,h1b] 389 + 1.0 * vbb_vovo[p2b,h3b,p3b,h1b] * xbb_vvoo[p1b,p3b,h2b,h3b]

390 - 1.0 * vab_ovvo[h3a,p2b,p3a,h1b] * xab_vvoo[p3a,p1b,h3a,h2b] 391 - 1.0 * vbb_vovo[p2b,h3b,p3b,h2b] * xbb_vvoo[p1b,p3b,h1b,h3b]

392 + 1.0 * vab_ovvo[h3a,p2b,p3a,h2b] * xab_vvoo[p3a,p1b,h3a,h1b] 393 + 0.5 * vbb_oooo[h3b,h4b,h1b,h2b] * xbb_vvoo[p1b,p2b,h3b,h4b]

394 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * xbb_vvoo[p1b,p3b,h1b,h2b] * tbb_vvoo[p2b,p4b,h3b,h4b] 395 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p2b,h3a,h4b] * tbb_vvoo[p1b,p4b,h1b,h2b]

396 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * xbb_vvoo[p2b,p3b,h1b,h2b] * tbb_vvoo[p1b,p4b,h3b,h4b]

397 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p1b,h3a,h4b] * tbb_vvoo[p2b,p4b,h1b,h2b] 398 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * xbb_vvoo[p1b,p2b,h1b,h3b] * tbb_vvoo[p3b,p4b,h2b,h4b]

399 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p4b,h3a,h2b] * tbb_vvoo[p1b,p2b,h1b,h4b] 400 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * xbb_vvoo[p3b,p4b,h1b,h3b] * tbb_vvoo[p1b,p2b,h2b,h4b]

401 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p4b,h3a,h1b] * tbb_vvoo[p1b,p2b,h2b,h4b] 402 + 0.25 * vbb_oovv[h3b,h4b,p3b,p4b] * xbb_vvoo[p3b,p4b,h1b,h2b] * tbb_vvoo[p1b,p2b,h3b,h4b]

403 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xbb_vvoo[p1b,p3b,h1b,h3b] * tbb_vvoo[p2b,p4b,h2b,h4b] 404 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p2b,h3a,h2b] * tbb_vvoo[p1b,p4b,h1b,h4b]

405 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p1b,h3a,h1b] * tbb_vvoo[p2b,p4b,h2b,h4b] 406 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xab_vvoo[p3a,p1b,h3a,h1b] * tab_vvoo[p4a,p2b,h4a,h2b]

407 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xbb_vvoo[p2b,p3b,h1b,h3b] * tbb_vvoo[p1b,p4b,h2b,h4b]

408 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p1b,h3a,h2b] * tbb_vvoo[p2b,p4b,h1b,h4b] 409 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xab_vvoo[p3a,p2b,h3a,h1b] * tbb_vvoo[p1b,p4b,h2b,h4b]

410 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * xab_vvoo[p3a,p2b,h3a,h1b] * tab_vvoo[p4a,p1b,h4a,h2b] 411 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p1b,p3b,h1b,h2b] * xbb_vvoo[p2b,p4b,h3b,h4b]

412 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h3a,h4b] * xbb_vvoo[p1b,p4b,h1b,h2b] 413 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p2b,p3b,h1b,h2b] * xbb_vvoo[p1b,p4b,h3b,h4b]

414 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p1b,h3a,h4b] * xbb_vvoo[p2b,p4b,h1b,h2b] 415 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p1b,p2b,h1b,h3b] * xbb_vvoo[p3b,p4b,h2b,h4b]

416 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h3a,h2b] * xbb_vvoo[p1b,p2b,h1b,h4b] 417 - 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p3b,p4b,h1b,h3b] * xbb_vvoo[p1b,p2b,h2b,h4b]

418 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p4b,h3a,h1b] * xbb_vvoo[p1b,p2b,h2b,h4b]

419 + 0.25 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p3b,p4b,h1b,h2b] * xbb_vvoo[p1b,p2b,h3b,h4b] 420 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p1b,p3b,h1b,h3b] * xbb_vvoo[p2b,p4b,h2b,h4b]

421 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h3a,h2b] * xbb_vvoo[p1b,p4b,h1b,h4b] 422 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p1b,h3a,h1b] * xbb_vvoo[p2b,p4b,h2b,h4b]

423 + 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * tab_vvoo[p3a,p1b,h3a,h1b] * xab_vvoo[p4a,p2b,h4a,h2b] 424 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tbb_vvoo[p2b,p3b,h1b,h3b] * xbb_vvoo[p1b,p4b,h2b,h4b]

425 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p1b,h3a,h2b] * xbb_vvoo[p2b,p4b,h1b,h4b] 426 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tab_vvoo[p3a,p2b,h3a,h1b] * xbb_vvoo[p1b,p4b,h2b,h4b]

427 - 1.0 * vaa_oovv[h3a,h4a,p3a,p4a] * tab_vvoo[p3a,p2b,h3a,h1b] * xab_vvoo[p4a,p1b,h4a,h2b] 428 - 1.0 * vbb_vvvo[p1b,p2b,p3b,h1b] * xb_vo[p3b,h2b]

429 + 1.0 * vbb_vvvo[p1b,p2b,p3b,h2b] * xb_vo[p3b,h1b] 430 - 1.0 * vbb_vooo[p1b,h3b,h1b,h2b] * xb_vo[p2b,h3b]

431 + 1.0 * vbb_vooo[p2b,h3b,h1b,h2b] * xb_vo[p1b,h3b] 432 + 1.0 * fb_ov[h3b,p3b] * xb_vo[p1b,h3b] * tbb_vvoo[p2b,p3b,h1b,h2b]

433 - 1.0 * fb_ov[h3b,p3b] * xb_vo[p2b,h3b] * tbb_vvoo[p1b,p3b,h1b,h2b] 434 + 1.0 * fb_ov[h3b,p3b] * xb_vo[p3b,h1b] * tbb_vvoo[p1b,p2b,h2b,h3b]

435 - 1.0 * fb_ov[h3b,p3b] * xb_vo[p3b,h2b] * tbb_vvoo[p1b,p2b,h1b,h3b] 436 + 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * xb_vo[p3b,h3b] * tbb_vvoo[p2b,p4b,h1b,h2b]

437 - 1.0 * vab_ovvv[h3a,p1b,p3a,p4b] * xa_vo[p3a,h3a] * tbb_vvoo[p2b,p4b,h1b,h2b] 438 - 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * xb_vo[p3b,h3b] * tbb_vvoo[p1b,p4b,h1b,h2b]

439 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * xa_vo[p3a,h3a] * tbb_vvoo[p1b,p4b,h1b,h2b] 440 + 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * xb_vo[p3b,h3b] * tbb_vvoo[p1b,p2b,h2b,h4b]

441 + 1.0 * vab_oovo[h3a,h4b,p3a,h1b] * xa_vo[p3a,h3a] * tbb_vvoo[p1b,p2b,h2b,h4b] 442 - 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * xb_vo[p3b,h3b] * tbb_vvoo[p1b,p2b,h1b,h4b]

443 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * xa_vo[p3a,h3a] * tbb_vvoo[p1b,p2b,h1b,h4b] 444 - 0.5 * vbb_vovv[p1b,h3b,p3b,p4b] * xb_vo[p2b,h3b] * tbb_vvoo[p3b,p4b,h1b,h2b]

445 + 0.5 * vbb_vovv[p2b,h3b,p3b,p4b] * xb_vo[p1b,h3b] * tbb_vvoo[p3b,p4b,h1b,h2b] 446 + 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * xb_vo[p3b,h1b] * tbb_vvoo[p2b,p4b,h2b,h3b]

447 + 1.0 * vab_ovvv[h3a,p1b,p3a,p4b] * xb_vo[p4b,h1b] * tab_vvoo[p3a,p2b,h3a,h2b]

448 - 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * xb_vo[p3b,h2b] * tbb_vvoo[p2b,p4b,h1b,h3b] 449 - 1.0 * vab_ovvv[h3a,p1b,p3a,p4b] * xb_vo[p4b,h2b] * tab_vvoo[p3a,p2b,h3a,h1b]

450 - 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * xb_vo[p3b,h1b] * tbb_vvoo[p1b,p4b,h2b,h3b] 451 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * xb_vo[p4b,h1b] * tab_vvoo[p3a,p1b,h3a,h2b]

452 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * xb_vo[p3b,h2b] * tbb_vvoo[p1b,p4b,h1b,h3b] 453 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * xb_vo[p4b,h2b] * tab_vvoo[p3a,p1b,h3a,h1b]

454 + 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * xb_vo[p1b,h3b] * tbb_vvoo[p2b,p3b,h2b,h4b] 455 - 1.0 * vab_oovo[h3a,h4b,p3a,h1b] * xb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h2b]

456 - 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * xb_vo[p1b,h3b] * tbb_vvoo[p2b,p3b,h1b,h4b] 457 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * xb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h1b]

458 - 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * xb_vo[p2b,h3b] * tbb_vvoo[p1b,p3b,h2b,h4b]

459 + 1.0 * vab_oovo[h3a,h4b,p3a,h1b] * xb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h2b] 460 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * xb_vo[p2b,h3b] * tbb_vvoo[p1b,p3b,h1b,h4b]

461 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * xb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h1b] 462 - 0.5 * vbb_oovo[h3b,h4b,p3b,h1b] * xb_vo[p3b,h2b] * tbb_vvoo[p1b,p2b,h3b,h4b]

463 + 0.5 * vbb_oovo[h3b,h4b,p3b,h2b] * xb_vo[p3b,h1b] * tbb_vvoo[p1b,p2b,h3b,h4b] 464 + 1.0 * vbb_vvvv[p1b,p2b,p3b,p4b] * xb_vo[p3b,h1b] * tb_vo[p4b,h2b]

465 + 1.0 * vbb_vovo[p1b,h3b,p3b,h1b] * xb_vo[p3b,h2b] * tb_vo[p2b,h3b] 466 - 1.0 * vbb_vovo[p1b,h3b,p3b,h2b] * xb_vo[p3b,h1b] * tb_vo[p2b,h3b]

467 - 1.0 * vbb_vovo[p2b,h3b,p3b,h1b] * xb_vo[p3b,h2b] * tb_vo[p1b,h3b] 468 + 1.0 * vbb_vovo[p2b,h3b,p3b,h2b] * xb_vo[p3b,h1b] * tb_vo[p1b,h3b]

469 + 1.0 * vbb_oooo[h3b,h4b,h1b,h2b] * xb_vo[p1b,h3b] * tb_vo[p2b,h4b]

470 + 1.0 * fb_ov[h3b,p3b] * tb_vo[p1b,h3b] * xbb_vvoo[p2b,p3b,h1b,h2b] 471 - 1.0 * fb_ov[h3b,p3b] * tb_vo[p2b,h3b] * xbb_vvoo[p1b,p3b,h1b,h2b]

472 + 1.0 * fb_ov[h3b,p3b] * tb_vo[p3b,h1b] * xbb_vvoo[p1b,p2b,h2b,h3b] 473 - 1.0 * fb_ov[h3b,p3b] * tb_vo[p3b,h2b] * xbb_vvoo[p1b,p2b,h1b,h3b]

474 + 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p3b,h3b] * xbb_vvoo[p2b,p4b,h1b,h2b] 475 - 1.0 * vab_ovvv[h3a,p1b,p3a,p4b] * ta_vo[p3a,h3a] * xbb_vvoo[p2b,p4b,h1b,h2b]

476 - 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h3b] * xbb_vvoo[p1b,p4b,h1b,h2b] 477 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * ta_vo[p3a,h3a] * xbb_vvoo[p1b,p4b,h1b,h2b]

478 + 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p3b,h3b] * xbb_vvoo[p1b,p2b,h2b,h4b] 479 + 1.0 * vab_oovo[h3a,h4b,p3a,h1b] * ta_vo[p3a,h3a] * xbb_vvoo[p1b,p2b,h2b,h4b]

480 - 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p3b,h3b] * xbb_vvoo[p1b,p2b,h1b,h4b] 481 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * ta_vo[p3a,h3a] * xbb_vvoo[p1b,p2b,h1b,h4b]

482 - 0.5 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p2b,h3b] * xbb_vvoo[p3b,p4b,h1b,h2b] 483 + 0.5 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p1b,h3b] * xbb_vvoo[p3b,p4b,h1b,h2b]

484 + 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * xbb_vvoo[p2b,p4b,h2b,h3b] 485 + 1.0 * vab_ovvv[h3a,p1b,p3a,p4b] * tb_vo[p4b,h1b] * xab_vvoo[p3a,p2b,h3a,h2b]

486 - 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p3b,h2b] * xbb_vvoo[p2b,p4b,h1b,h3b] 487 - 1.0 * vab_ovvv[h3a,p1b,p3a,p4b] * tb_vo[p4b,h2b] * xab_vvoo[p3a,p2b,h3a,h1b]

488 - 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * xbb_vvoo[p1b,p4b,h2b,h3b] 489 - 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * tb_vo[p4b,h1b] * xab_vvoo[p3a,p1b,h3a,h2b]

490 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h2b] * xbb_vvoo[p1b,p4b,h1b,h3b] 491 + 1.0 * vab_ovvv[h3a,p2b,p3a,p4b] * tb_vo[p4b,h2b] * xab_vvoo[p3a,p1b,h3a,h1b]

492 + 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p1b,h3b] * xbb_vvoo[p2b,p3b,h2b,h4b] 493 - 1.0 * vab_oovo[h3a,h4b,p3a,h1b] * tb_vo[p1b,h4b] * xab_vvoo[p3a,p2b,h3a,h2b]

494 - 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p1b,h3b] * xbb_vvoo[p2b,p3b,h1b,h4b] 495 + 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * tb_vo[p1b,h4b] * xab_vvoo[p3a,p2b,h3a,h1b]

496 - 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p2b,h3b] * xbb_vvoo[p1b,p3b,h2b,h4b] 497 + 1.0 * vab_oovo[h3a,h4b,p3a,h1b] * tb_vo[p2b,h4b] * xab_vvoo[p3a,p1b,h3a,h2b]

498 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p2b,h3b] * xbb_vvoo[p1b,p3b,h1b,h4b]

499 - 1.0 * vab_oovo[h3a,h4b,p3a,h2b] * tb_vo[p2b,h4b] * xab_vvoo[p3a,p1b,h3a,h1b] 500 - 0.5 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p3b,h2b] * xbb_vvoo[p1b,p2b,h3b,h4b]

501 + 0.5 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p3b,h1b] * xbb_vvoo[p1b,p2b,h3b,h4b] 502 + 1.0 * vbb_vvvv[p1b,p2b,p3b,p4b] * tb_vo[p3b,h1b] * xb_vo[p4b,h2b]

503 + 1.0 * vbb_vovo[p1b,h3b,p3b,h1b] * tb_vo[p3b,h2b] * xb_vo[p2b,h3b] 504 - 1.0 * vbb_vovo[p1b,h3b,p3b,h2b] * tb_vo[p3b,h1b] * xb_vo[p2b,h3b]

505 - 1.0 * vbb_vovo[p2b,h3b,p3b,h1b] * tb_vo[p3b,h2b] * xb_vo[p1b,h3b] 506 + 1.0 * vbb_vovo[p2b,h3b,p3b,h2b] * tb_vo[p3b,h1b] * xb_vo[p1b,h3b]

507 + 1.0 * vbb_oooo[h3b,h4b,h1b,h2b] * tb_vo[p1b,h3b] * xb_vo[p2b,h4b] 508 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p1b,h3b] * tb_vo[p3b,h4b] * tbb_vvoo[p2b,p4b,h1b,h2b]

509 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h3a] * tb_vo[p1b,h4b] * tbb_vvoo[p2b,p4b,h1b,h2b]

510 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p2b,h3b] * tb_vo[p3b,h4b] * tbb_vvoo[p1b,p4b,h1b,h2b] 511 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h3a] * tb_vo[p2b,h4b] * tbb_vvoo[p1b,p4b,h1b,h2b]

512 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h1b] * tb_vo[p4b,h3b] * tbb_vvoo[p1b,p2b,h2b,h4b] 513 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h3a] * tb_vo[p4b,h1b] * tbb_vvoo[p1b,p2b,h2b,h4b]

514 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h2b] * tb_vo[p4b,h3b] * tbb_vvoo[p1b,p2b,h1b,h4b] 515 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xa_vo[p3a,h3a] * tb_vo[p4b,h2b] * tbb_vvoo[p1b,p2b,h1b,h4b]

516 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p1b,h3b] * tb_vo[p2b,h4b] * tbb_vvoo[p3b,p4b,h1b,h2b] 517 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h1b] * tb_vo[p1b,h3b] * tbb_vvoo[p2b,p4b,h2b,h4b]

518 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xb_vo[p4b,h1b] * tb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h2b] 519 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h2b] * tb_vo[p1b,h3b] * tbb_vvoo[p2b,p4b,h1b,h4b]

520 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xb_vo[p4b,h2b] * tb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h1b]

521 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h1b] * tb_vo[p2b,h3b] * tbb_vvoo[p1b,p4b,h2b,h4b] 522 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xb_vo[p4b,h1b] * tb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h2b]

523 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h2b] * tb_vo[p2b,h3b] * tbb_vvoo[p1b,p4b,h1b,h4b] 524 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * xb_vo[p4b,h2b] * tb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h1b]

525 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h1b] * tb_vo[p4b,h2b] * tbb_vvoo[p1b,p2b,h3b,h4b] 526 - 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * xb_vo[p3b,h1b] * tb_vo[p4b,h2b] * tb_vo[p2b,h3b]

527 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * xb_vo[p3b,h1b] * tb_vo[p4b,h2b] * tb_vo[p1b,h3b] 528 - 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * xb_vo[p3b,h2b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b]

529 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * xb_vo[p3b,h1b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b] 530 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p1b,h3b] * xb_vo[p3b,h4b] * tbb_vvoo[p2b,p4b,h1b,h2b]

531 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * xb_vo[p1b,h4b] * tbb_vvoo[p2b,p4b,h1b,h2b] 532 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p2b,h3b] * xb_vo[p3b,h4b] * tbb_vvoo[p1b,p4b,h1b,h2b]

533 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * xb_vo[p2b,h4b] * tbb_vvoo[p1b,p4b,h1b,h2b] 534 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * xb_vo[p4b,h3b] * tbb_vvoo[p1b,p2b,h2b,h4b]

535 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * xb_vo[p4b,h1b] * tbb_vvoo[p1b,p2b,h2b,h4b] 536 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * xb_vo[p4b,h3b] * tbb_vvoo[p1b,p2b,h1b,h4b]

537 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * xb_vo[p4b,h2b] * tbb_vvoo[p1b,p2b,h1b,h4b] 538 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p1b,h3b] * xb_vo[p2b,h4b] * tbb_vvoo[p3b,p4b,h1b,h2b]

539 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * xb_vo[p1b,h3b] * tbb_vvoo[p2b,p4b,h2b,h4b] 540 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h1b] * xb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h2b]

541 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * xb_vo[p1b,h3b] * tbb_vvoo[p2b,p4b,h1b,h4b] 542 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h2b] * xb_vo[p1b,h4b] * tab_vvoo[p3a,p2b,h3a,h1b]

543 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * xb_vo[p2b,h3b] * tbb_vvoo[p1b,p4b,h2b,h4b] 544 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h1b] * xb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h2b]

545 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * xb_vo[p2b,h3b] * tbb_vvoo[p1b,p4b,h1b,h4b] 546 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h2b] * xb_vo[p2b,h4b] * tab_vvoo[p3a,p1b,h3a,h1b]

547 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * xb_vo[p4b,h2b] * tbb_vvoo[p1b,p2b,h3b,h4b] 548 - 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * xb_vo[p4b,h2b] * tb_vo[p2b,h3b]

549 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * xb_vo[p4b,h2b] * tb_vo[p1b,h3b]

550 - 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p3b,h2b] * xb_vo[p1b,h3b] * tb_vo[p2b,h4b] 551 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p3b,h1b] * xb_vo[p1b,h3b] * tb_vo[p2b,h4b]

552 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p1b,h3b] * tb_vo[p3b,h4b] * xbb_vvoo[p2b,p4b,h1b,h2b] 553 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p1b,h4b] * xbb_vvoo[p2b,p4b,h1b,h2b]

554 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p2b,h3b] * tb_vo[p3b,h4b] * xbb_vvoo[p1b,p4b,h1b,h2b] 555 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p2b,h4b] * xbb_vvoo[p1b,p4b,h1b,h2b]

556 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h3b] * xbb_vvoo[p1b,p2b,h2b,h4b] 557 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p4b,h1b] * xbb_vvoo[p1b,p2b,h2b,h4b]

558 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p4b,h3b] * xbb_vvoo[p1b,p2b,h1b,h4b] 559 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * ta_vo[p3a,h3a] * tb_vo[p4b,h2b] * xbb_vvoo[p1b,p2b,h1b,h4b]

560 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b] * xbb_vvoo[p3b,p4b,h1b,h2b]

561 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p1b,h3b] * xbb_vvoo[p2b,p4b,h2b,h4b] 562 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h1b] * tb_vo[p1b,h4b] * xab_vvoo[p3a,p2b,h3a,h2b]

563 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p1b,h3b] * xbb_vvoo[p2b,p4b,h1b,h4b] 564 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h2b] * tb_vo[p1b,h4b] * xab_vvoo[p3a,p2b,h3a,h1b]

565 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p2b,h3b] * xbb_vvoo[p1b,p4b,h2b,h4b] 566 + 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h1b] * tb_vo[p2b,h4b] * xab_vvoo[p3a,p1b,h3a,h2b]

567 - 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h2b] * tb_vo[p2b,h3b] * xbb_vvoo[p1b,p4b,h1b,h4b] 568 - 1.0 * vab_oovv[h3a,h4b,p3a,p4b] * tb_vo[p4b,h2b] * tb_vo[p2b,h4b] * xab_vvoo[p3a,p1b,h3a,h1b]

569 + 0.5 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] * xbb_vvoo[p1b,p2b,h3b,h4b] 570 - 1.0 * vbb_vovv[p1b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] * xb_vo[p2b,h3b]

571 + 1.0 * vbb_vovv[p2b,h3b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] * xb_vo[p1b,h3b]

572 - 1.0 * vbb_oovo[h3b,h4b,p3b,h1b] * tb_vo[p3b,h2b] * tb_vo[p1b,h3b] * xb_vo[p2b,h4b] 573 + 1.0 * vbb_oovo[h3b,h4b,p3b,h2b] * tb_vo[p3b,h1b] * tb_vo[p1b,h3b] * xb_vo[p2b,h4b]

574 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * xb_vo[p3b,h1b] * tb_vo[p4b,h2b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b] 575 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * xb_vo[p4b,h2b] * tb_vo[p1b,h3b] * tb_vo[p2b,h4b]

576 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] * xb_vo[p1b,h3b] * tb_vo[p2b,h4b] 577 + 1.0 * vbb_oovv[h3b,h4b,p3b,p4b] * tb_vo[p3b,h1b] * tb_vo[p4b,h2b] * tb_vo[p1b,h3b] * xb_vo[p2b,h4b];

REOMCCSD-X1

1 r_vo[p1,h1] = 2 1.0 * f_vv[p1,p2] * x_vo[p2,h1]

3 - 1.0 * f_oo[h2,h1] * x_vo[p1,h2] 4 + 2.0 * f_ov[h2,p2] * x_vvoo[p1,p2,h1,h2]

5 - 1.0 * f_ov[h2,p2] * x_vvoo[p2,p1,h1,h2] 6 - 1.0 * v_vovv[p1,h2,p2,p3] * x_vvoo[p3,p2,h1,h2]

7 + 2.0 * v_vovv[p1,h2,p2,p3] * x_vvoo[p2,p3,h1,h2] 8 - 1.0 * v_vovo[p1,h2,p2,h1] * x_vo[p2,h2]

9 + 2.0 * v_ovvo[h2,p1,p2,h1] * x_vo[p2,h2] 10 + 1.0 * v_oovo[h2,h3,p2,h1] * x_vvoo[p1,p2,h2,h3]

11 - 2.0 * v_oovo[h2,h3,p2,h1] * x_vvoo[p1,p2,h3,h2] 12 - 1.0 * f_ov[h2,p2] * x_vo[p2,h1] * t_vo[p1,h2]

13 + 2.0 * v_vovv[p1,h2,p2,p3] * x_vo[p2,h1] * t_vo[p3,h2] 14 - 1.0 * v_vovv[p1,h2,p2,p3] * x_vo[p3,h1] * t_vo[p2,h2]

15 + 1.0 * v_oovv[h2,h3,p2,p3] * x_vo[p2,h1] * t_vvoo[p1,p3,h3,h2] 16 - 2.0 * v_oovv[h2,h3,p2,p3] * x_vo[p2,h1] * t_vvoo[p1,p3,h2,h3]

17 + 1.0 * v_oovv[h2,h3,p2,p3] * x_vo[p1,h2] * t_vvoo[p3,p2,h1,h3] 18 - 2.0 * v_oovv[h2,h3,p2,p3] * x_vo[p1,h2] * t_vvoo[p2,p3,h1,h3]

19 + 4.0 * v_oovv[h2,h3,p2,p3] * x_vo[p2,h2] * t_vvoo[p1,p3,h1,h3] 20 - 2.0 * v_oovv[h2,h3,p2,p3] * x_vo[p2,h2] * t_vvoo[p3,p1,h1,h3]

21 - 2.0 * v_oovv[h2,h3,p2,p3] * x_vo[p3,h2] * t_vvoo[p1,p2,h1,h3] 22 + 1.0 * v_oovv[h2,h3,p2,p3] * x_vo[p3,h2] * t_vvoo[p2,p1,h1,h3]

23 - 2.0 * v_oovo[h2,h3,p2,h1] * x_vo[p2,h2] * t_vo[p1,h3]

24 + 1.0 * v_oovo[h2,h3,p2,h1] * x_vo[p1,h2] * t_vo[p2,h3] 25 - 1.0 * f_ov[h2,p2] * t_vo[p2,h1] * x_vo[p1,h2]

26 + 2.0 * v_vovv[p1,h2,p2,p3] * t_vo[p2,h1] * x_vo[p3,h2] 27 - 1.0 * v_vovv[p1,h2,p2,p3] * t_vo[p3,h1] * x_vo[p2,h2]

28 + 1.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * x_vvoo[p1,p3,h3,h2] 29 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * x_vvoo[p1,p3,h2,h3]

30 + 1.0 * v_oovv[h2,h3,p2,p3] * t_vo[p1,h2] * x_vvoo[p3,p2,h1,h3] 31 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p1,h2] * x_vvoo[p2,p3,h1,h3]

32 + 4.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h2] * x_vvoo[p1,p3,h1,h3] 33 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h2] * x_vvoo[p3,p1,h1,h3]

34 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p3,h2] * x_vvoo[p1,p2,h1,h3] 35 + 1.0 * v_oovv[h2,h3,p2,p3] * t_vo[p3,h2] * x_vvoo[p2,p1,h1,h3]

36 - 2.0 * v_oovo[h2,h3,p2,h1] * t_vo[p2,h2] * x_vo[p1,h3] 37 + 1.0 * v_oovo[h2,h3,p2,h1] * t_vo[p1,h2] * x_vo[p2,h3]

38 - 2.0 * v_oovv[h2,h3,p2,p3] * x_vo[p2,h1] * t_vo[p1,h2] * t_vo[p3,h3] 39 + 1.0 * v_oovv[h2,h3,p2,p3] * x_vo[p2,h1] * t_vo[p3,h2] * t_vo[p1,h3]

40 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * x_vo[p1,h2] * t_vo[p3,h3] 41 + 1.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * x_vo[p3,h2] * t_vo[p1,h3]

42 - 2.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * t_vo[p1,h2] * x_vo[p3,h3] 43 + 1.0 * v_oovv[h2,h3,p2,p3] * t_vo[p2,h1] * t_vo[p3,h2] * x_vo[p1,h3];

UEOMCCSD-X2

1 r_vvoo[p1,p2,h1,h2] = 2 1.0 * f_vv[p1,p3] * x_vvoo[p3,p2,h1,h2]

3 + 1.0 * f_vv[p2,p3] * x_vvoo[p1,p3,h1,h2] 4 - 1.0 * f_oo[h3,h1] * x_vvoo[p2,p1,h2,h3]

5 - 1.0 * f_oo[h3,h2] * x_vvoo[p1,p2,h1,h3] 6 + 1.0 * v_vvvv[p1,p2,p3,p4] * x_vvoo[p3,p4,h1,h2]

7 + 1.0 * v_vvvo[p1,p2,p3,h2] * x_vo[p3,h1] 8 + 1.0 * v_vvvo[p2,p1,p3,h1] * x_vo[p3,h2]

9 - 1.0 * v_vovo[p1,h3,p3,h1] * x_vvoo[p2,p3,h2,h3] 10 - 1.0 * v_vovo[p2,h3,p3,h2] * x_vvoo[p1,p3,h1,h3]

11 - 1.0 * v_vovo[p1,h3,p3,h2] * x_vvoo[p3,p2,h1,h3] 12 - 1.0 * v_vovo[p2,h3,p3,h1] * x_vvoo[p3,p1,h2,h3]

13 + 2.0 * v_ovvo[h3,p1,p3,h1] * x_vvoo[p2,p3,h2,h3] 14 + 2.0 * v_ovvo[h3,p2,p3,h2] * x_vvoo[p1,p3,h1,h3]

15 - 1.0 * v_ovvo[h3,p1,p3,h1] * x_vvoo[p3,p2,h2,h3] 16 - 1.0 * v_ovvo[h3,p2,p3,h2] * x_vvoo[p3,p1,h1,h3]

17 - 1.0 * v_ovoo[h3,p2,h1,h2] * x_vo[p1,h3] 18 - 1.0 * v_vooo[p1,h3,h1,h2] * x_vo[p2,h3]

19 + 1.0 * v_oooo[h3,h4,h1,h2] * x_vvoo[p1,p2,h3,h4] 20 - 1.0 * f_ov[h3,p3] * x_vo[p3,h1] * t_vvoo[p2,p1,h2,h3]

21 - 1.0 * f_ov[h3,p3] * x_vo[p3,h2] * t_vvoo[p1,p2,h1,h3] 22 - 1.0 * f_ov[h3,p3] * x_vo[p1,h3] * t_vvoo[p3,p2,h1,h2]

23 - 1.0 * f_ov[h3,p3] * x_vo[p2,h3] * t_vvoo[p1,p3,h1,h2]

24 + 1.0 * v_vvvv[p1,p2,p3,p4] * x_vo[p3,h1] * t_vo[p4,h2] 25 + 2.0 * v_vovv[p1,h3,p3,p4] * x_vo[p3,h1] * t_vvoo[p2,p4,h2,h3]

26 + 2.0 * v_vovv[p2,h3,p3,p4] * x_vo[p3,h2] * t_vvoo[p1,p4,h1,h3] 27 - 1.0 * v_vovv[p1,h3,p3,p4] * x_vo[p3,h1] * t_vvoo[p4,p2,h2,h3]

28 - 1.0 * v_vovv[p2,h3,p3,p4] * x_vo[p3,h2] * t_vvoo[p4,p1,h1,h3] 29 - 1.0 * v_vovv[p1,h3,p3,p4] * x_vo[p4,h1] * t_vvoo[p2,p3,h2,h3]

30 - 1.0 * v_vovv[p2,h3,p3,p4] * x_vo[p4,h2] * t_vvoo[p1,p3,h1,h3] 31 - 1.0 * v_vovv[p1,h3,p3,p4] * x_vo[p4,h2] * t_vvoo[p3,p2,h1,h3]

32 - 1.0 * v_vovv[p2,h3,p3,p4] * x_vo[p4,h1] * t_vvoo[p3,p1,h2,h3] 33 - 1.0 * v_vovv[p1,h3,p3,p4] * x_vo[p2,h3] * t_vvoo[p3,p4,h1,h2]

34 - 1.0 * v_vovv[p2,h3,p3,p4] * x_vo[p1,h3] * t_vvoo[p4,p3,h1,h2] 35 - 1.0 * v_vovv[p1,h3,p3,p4] * x_vo[p3,h3] * t_vvoo[p4,p2,h1,h2]

36 - 1.0 * v_vovv[p2,h3,p3,p4] * x_vo[p3,h3] * t_vvoo[p1,p4,h1,h2] 37 + 2.0 * v_vovv[p1,h3,p3,p4] * x_vo[p4,h3] * t_vvoo[p3,p2,h1,h2]

38 + 2.0 * v_vovv[p2,h3,p3,p4] * x_vo[p4,h3] * t_vvoo[p1,p3,h1,h2] 39 - 1.0 * v_vovo[p1,h3,p3,h2] * x_vo[p3,h1] * t_vo[p2,h3]

40 - 1.0 * v_vovo[p2,h3,p3,h1] * x_vo[p3,h2] * t_vo[p1,h3] 41 - 1.0 * v_ovvo[h3,p1,p3,h1] * x_vo[p3,h2] * t_vo[p2,h3]

42 - 1.0 * v_ovvo[h3,p2,p3,h2] * x_vo[p3,h1] * t_vo[p1,h3] 43 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p3,p2,h1,h2] * t_vvoo[p1,p4,h4,h3]

44 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p1,p3,h1,h2] * t_vvoo[p2,p4,h4,h3] 45 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p3,p2,h1,h2] * t_vvoo[p1,p4,h3,h4]

46 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p1,p3,h1,h2] * t_vvoo[p2,p4,h3,h4] 47 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p3,p4,h1,h3] * t_vvoo[p2,p1,h2,h4]

48 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p1,p2,h1,h3] * t_vvoo[p4,p3,h2,h4] 49 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p4,p3,h1,h3] * t_vvoo[p2,p1,h2,h4]

50 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p1,p2,h1,h3] * t_vvoo[p3,p4,h2,h4]

51 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p3,p4,h1,h2] * t_vvoo[p1,p2,h3,h4] 52 + 4.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p1,p3,h1,h3] * t_vvoo[p2,p4,h2,h4]

53 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p1,p3,h1,h3] * t_vvoo[p4,p2,h2,h4] 54 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p3,p1,h1,h3] * t_vvoo[p2,p4,h2,h4]

55 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p3,p1,h1,h3] * t_vvoo[p4,p2,h2,h4] 56 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p1,p4,h1,h3] * t_vvoo[p2,p3,h2,h4]

57 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p1,p4,h1,h3] * t_vvoo[p3,p2,h2,h4] 58 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p4,p1,h1,h3] * t_vvoo[p2,p3,h2,h4]

59 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vvoo[p4,p2,h1,h3] * t_vvoo[p3,p1,h2,h4] 60 + 1.0 * v_oovo[h3,h4,p3,h1] * x_vo[p3,h2] * t_vvoo[p1,p2,h4,h3]

61 + 1.0 * v_oovo[h3,h4,p3,h2] * x_vo[p3,h1] * t_vvoo[p1,p2,h3,h4]

62 - 2.0 * v_oovo[h3,h4,p3,h1] * x_vo[p1,h4] * t_vvoo[p2,p3,h2,h3] 63 - 2.0 * v_oovo[h3,h4,p3,h2] * x_vo[p2,h4] * t_vvoo[p1,p3,h1,h3]

64 + 1.0 * v_oovo[h3,h4,p3,h1] * x_vo[p1,h4] * t_vvoo[p3,p2,h2,h3] 65 + 1.0 * v_oovo[h3,h4,p3,h2] * x_vo[p2,h4] * t_vvoo[p3,p1,h1,h3]

66 + 1.0 * v_oovo[h3,h4,p3,h1] * x_vo[p1,h3] * t_vvoo[p2,p3,h2,h4] 67 + 1.0 * v_oovo[h3,h4,p3,h2] * x_vo[p2,h3] * t_vvoo[p1,p3,h1,h4]

68 + 1.0 * v_oovo[h3,h4,p3,h1] * x_vo[p2,h3] * t_vvoo[p3,p1,h2,h4] 69 + 1.0 * v_oovo[h3,h4,p3,h2] * x_vo[p1,h3] * t_vvoo[p3,p2,h1,h4]

70 + 1.0 * v_oovo[h3,h4,p3,h1] * x_vo[p3,h4] * t_vvoo[p2,p1,h2,h3] 71 + 1.0 * v_oovo[h3,h4,p3,h2] * x_vo[p3,h4] * t_vvoo[p1,p2,h1,h3]

72 - 2.0 * v_oovo[h3,h4,p3,h1] * x_vo[p3,h3] * t_vvoo[p2,p1,h2,h4] 73 - 2.0 * v_oovo[h3,h4,p3,h2] * x_vo[p3,h3] * t_vvoo[p1,p2,h1,h4]

74 + 1.0 * v_oooo[h3,h4,h1,h2] * x_vo[p1,h3] * t_vo[p2,h4] 75 - 1.0 * f_ov[h3,p3] * t_vo[p3,h1] * x_vvoo[p2,p1,h2,h3]

76 - 1.0 * f_ov[h3,p3] * t_vo[p3,h2] * x_vvoo[p1,p2,h1,h3] 77 - 1.0 * f_ov[h3,p3] * t_vo[p1,h3] * x_vvoo[p3,p2,h1,h2]

78 - 1.0 * f_ov[h3,p3] * t_vo[p2,h3] * x_vvoo[p1,p3,h1,h2] 79 + 1.0 * v_vvvv[p1,p2,p3,p4] * t_vo[p3,h1] * x_vo[p4,h2]

80 + 2.0 * v_vovv[p1,h3,p3,p4] * t_vo[p3,h1] * x_vvoo[p2,p4,h2,h3] 81 + 2.0 * v_vovv[p2,h3,p3,p4] * t_vo[p3,h2] * x_vvoo[p1,p4,h1,h3]

82 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p3,h1] * x_vvoo[p4,p2,h2,h3] 83 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p3,h2] * x_vvoo[p4,p1,h1,h3]

84 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p4,h1] * x_vvoo[p2,p3,h2,h3] 85 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p4,h2] * x_vvoo[p1,p3,h1,h3]

86 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p4,h2] * x_vvoo[p3,p2,h1,h3] 87 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p4,h1] * x_vvoo[p3,p1,h2,h3]

88 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p2,h3] * x_vvoo[p3,p4,h1,h2] 89 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p1,h3] * x_vvoo[p4,p3,h1,h2]

90 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p3,h3] * x_vvoo[p4,p2,h1,h2]

91 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p3,h3] * x_vvoo[p1,p4,h1,h2] 92 + 2.0 * v_vovv[p1,h3,p3,p4] * t_vo[p4,h3] * x_vvoo[p3,p2,h1,h2]

93 + 2.0 * v_vovv[p2,h3,p3,p4] * t_vo[p4,h3] * x_vvoo[p1,p3,h1,h2] 94 - 1.0 * v_vovo[p1,h3,p3,h2] * t_vo[p3,h1] * x_vo[p2,h3]

95 - 1.0 * v_vovo[p2,h3,p3,h1] * t_vo[p3,h2] * x_vo[p1,h3] 96 - 1.0 * v_ovvo[h3,p1,p3,h1] * t_vo[p3,h2] * x_vo[p2,h3]

97 - 1.0 * v_ovvo[h3,p2,p3,h2] * t_vo[p3,h1] * x_vo[p1,h3] 98 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p2,h1,h2] * x_vvoo[p1,p4,h4,h3]

99 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p3,h1,h2] * x_vvoo[p2,p4,h4,h3] 100 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p2,h1,h2] * x_vvoo[p1,p4,h3,h4]

101 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p3,h1,h2] * x_vvoo[p2,p4,h3,h4]

102 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p4,h1,h3] * x_vvoo[p2,p1,h2,h4] 103 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p2,h1,h3] * x_vvoo[p4,p3,h2,h4]

104 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p4,p3,h1,h3] * x_vvoo[p2,p1,h2,h4] 105 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p2,h1,h3] * x_vvoo[p3,p4,h2,h4]

106 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p4,h1,h2] * x_vvoo[p1,p2,h3,h4] 107 + 4.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p3,h1,h3] * x_vvoo[p2,p4,h2,h4]

108 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p3,h1,h3] * x_vvoo[p4,p2,h2,h4] 109 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p1,h1,h3] * x_vvoo[p2,p4,h2,h4]

110 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p3,p1,h1,h3] * x_vvoo[p4,p2,h2,h4] 111 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p4,h1,h3] * x_vvoo[p2,p3,h2,h4]

112 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p1,p4,h1,h3] * x_vvoo[p3,p2,h2,h4]

113 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p4,p1,h1,h3] * x_vvoo[p2,p3,h2,h4] 114 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vvoo[p4,p2,h1,h3] * x_vvoo[p3,p1,h2,h4]

115 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p3,h2] * x_vvoo[p1,p2,h4,h3] 116 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p3,h1] * x_vvoo[p1,p2,h3,h4]

117 - 2.0 * v_oovo[h3,h4,p3,h1] * t_vo[p1,h4] * x_vvoo[p2,p3,h2,h3] 118 - 2.0 * v_oovo[h3,h4,p3,h2] * t_vo[p2,h4] * x_vvoo[p1,p3,h1,h3]

119 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p1,h4] * x_vvoo[p3,p2,h2,h3] 120 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p2,h4] * x_vvoo[p3,p1,h1,h3]

121 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p1,h3] * x_vvoo[p2,p3,h2,h4] 122 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p2,h3] * x_vvoo[p1,p3,h1,h4]

123 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p2,h3] * x_vvoo[p3,p1,h2,h4] 124 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p1,h3] * x_vvoo[p3,p2,h1,h4]

125 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p3,h4] * x_vvoo[p2,p1,h2,h3] 126 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p3,h4] * x_vvoo[p1,p2,h1,h3]

127 - 2.0 * v_oovo[h3,h4,p3,h1] * t_vo[p3,h3] * x_vvoo[p2,p1,h2,h4] 128 - 2.0 * v_oovo[h3,h4,p3,h2] * t_vo[p3,h3] * x_vvoo[p1,p2,h1,h4]

129 + 1.0 * v_oooo[h3,h4,h1,h2] * t_vo[p1,h3] * x_vo[p2,h4] 130 - 1.0 * v_vovv[p1,h3,p3,p4] * x_vo[p3,h1] * t_vo[p4,h2] * t_vo[p2,h3]

131 - 1.0 * v_vovv[p2,h3,p3,p4] * x_vo[p4,h1] * t_vo[p3,h2] * t_vo[p1,h3] 132 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h1] * t_vo[p4,h2] * t_vvoo[p1,p2,h3,h4]

133 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h1] * t_vo[p1,h3] * t_vvoo[p2,p4,h2,h4] 134 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h2] * t_vo[p2,h3] * t_vvoo[p1,p4,h1,h4]

135 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h1] * t_vo[p1,h3] * t_vvoo[p4,p2,h2,h4] 136 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h2] * t_vo[p2,h3] * t_vvoo[p4,p1,h1,h4]

137 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h1] * t_vo[p1,h4] * t_vvoo[p2,p4,h2,h3] 138 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h2] * t_vo[p2,h4] * t_vvoo[p1,p4,h1,h3]

139 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h1] * t_vo[p2,h4] * t_vvoo[p4,p1,h2,h3] 140 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h2] * t_vo[p1,h4] * t_vvoo[p4,p2,h1,h3]

141 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h1] * t_vo[p4,h3] * t_vvoo[p2,p1,h2,h4]

142 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h2] * t_vo[p4,h3] * t_vvoo[p1,p2,h1,h4] 143 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h1] * t_vo[p4,h4] * t_vvoo[p2,p1,h2,h3]

144 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h2] * t_vo[p4,h4] * t_vvoo[p1,p2,h1,h3] 145 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p1,h3] * t_vo[p2,h4] * t_vvoo[p3,p4,h1,h2]

146 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p1,h3] * t_vo[p3,h4] * t_vvoo[p4,p2,h1,h2] 147 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p2,h3] * t_vo[p3,h4] * t_vvoo[p1,p4,h1,h2]

148 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vo[p1,h3] * t_vo[p4,h4] * t_vvoo[p3,p2,h1,h2] 149 - 2.0 * v_oovv[h3,h4,p3,p4] * x_vo[p2,h3] * t_vo[p4,h4] * t_vvoo[p1,p3,h1,h2]

150 + 1.0 * v_oovo[h3,h4,p3,h1] * x_vo[p3,h2] * t_vo[p2,h3] * t_vo[p1,h4] 151 + 1.0 * v_oovo[h3,h4,p3,h2] * x_vo[p3,h1] * t_vo[p1,h3] * t_vo[p2,h4]

152 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p3,h1] * x_vo[p4,h2] * t_vo[p2,h3]

153 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p4,h1] * x_vo[p3,h2] * t_vo[p1,h3] 154 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * x_vo[p4,h2] * t_vvoo[p1,p2,h3,h4]

155 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * x_vo[p1,h3] * t_vvoo[p2,p4,h2,h4] 156 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * x_vo[p2,h3] * t_vvoo[p1,p4,h1,h4]

157 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * x_vo[p1,h3] * t_vvoo[p4,p2,h2,h4] 158 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * x_vo[p2,h3] * t_vvoo[p4,p1,h1,h4]

159 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * x_vo[p1,h4] * t_vvoo[p2,p4,h2,h3] 160 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * x_vo[p2,h4] * t_vvoo[p1,p4,h1,h3]

161 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * x_vo[p2,h4] * t_vvoo[p4,p1,h2,h3] 162 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * x_vo[p1,h4] * t_vvoo[p4,p2,h1,h3]

163 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * x_vo[p4,h3] * t_vvoo[p2,p1,h2,h4]

164 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * x_vo[p4,h3] * t_vvoo[p1,p2,h1,h4] 165 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * x_vo[p4,h4] * t_vvoo[p2,p1,h2,h3]

166 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * x_vo[p4,h4] * t_vvoo[p1,p2,h1,h3] 167 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p1,h3] * x_vo[p2,h4] * t_vvoo[p3,p4,h1,h2]

168 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p1,h3] * x_vo[p3,h4] * t_vvoo[p4,p2,h1,h2] 169 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p2,h3] * x_vo[p3,h4] * t_vvoo[p1,p4,h1,h2]

170 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p1,h3] * x_vo[p4,h4] * t_vvoo[p3,p2,h1,h2] 171 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p2,h3] * x_vo[p4,h4] * t_vvoo[p1,p3,h1,h2]

172 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p3,h2] * x_vo[p2,h3] * t_vo[p1,h4] 173 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p3,h1] * x_vo[p1,h3] * t_vo[p2,h4]

174 - 1.0 * v_vovv[p1,h3,p3,p4] * t_vo[p3,h1] * t_vo[p4,h2] * x_vo[p2,h3] 175 - 1.0 * v_vovv[p2,h3,p3,p4] * t_vo[p4,h1] * t_vo[p3,h2] * x_vo[p1,h3]

176 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p4,h2] * x_vvoo[p1,p2,h3,h4] 177 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p1,h3] * x_vvoo[p2,p4,h2,h4]

178 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p2,h3] * x_vvoo[p1,p4,h1,h4] 179 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p1,h3] * x_vvoo[p4,p2,h2,h4]

180 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p2,h3] * x_vvoo[p4,p1,h1,h4] 181 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p1,h4] * x_vvoo[p2,p4,h2,h3]

182 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p2,h4] * x_vvoo[p1,p4,h1,h3] 183 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p2,h4] * x_vvoo[p4,p1,h2,h3]

184 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p1,h4] * x_vvoo[p4,p2,h1,h3] 185 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p4,h3] * x_vvoo[p2,p1,h2,h4]

186 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p4,h3] * x_vvoo[p1,p2,h1,h4] 187 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p4,h4] * x_vvoo[p2,p1,h2,h3]

188 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h2] * t_vo[p4,h4] * x_vvoo[p1,p2,h1,h3] 189 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p1,h3] * t_vo[p2,h4] * x_vvoo[p3,p4,h1,h2]

190 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p1,h3] * t_vo[p3,h4] * x_vvoo[p4,p2,h1,h2] 191 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p2,h3] * t_vo[p3,h4] * x_vvoo[p1,p4,h1,h2]

192 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p1,h3] * t_vo[p4,h4] * x_vvoo[p3,p2,h1,h2]

193 - 2.0 * v_oovv[h3,h4,p3,p4] * t_vo[p2,h3] * t_vo[p4,h4] * x_vvoo[p1,p3,h1,h2] 194 + 1.0 * v_oovo[h3,h4,p3,h1] * t_vo[p3,h2] * t_vo[p2,h3] * x_vo[p1,h4]

195 + 1.0 * v_oovo[h3,h4,p3,h2] * t_vo[p3,h1] * t_vo[p1,h3] * x_vo[p2,h4] 196 + 1.0 * v_oovv[h3,h4,p3,p4] * x_vo[p3,h1] * t_vo[p4,h2] * t_vo[p1,h3] * t_vo[p2,h4]

197 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * x_vo[p4,h2] * t_vo[p1,h3] * t_vo[p2,h4] 198 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p4,h2] * x_vo[p1,h3] * t_vo[p2,h4]

199 + 1.0 * v_oovv[h3,h4,p3,p4] * t_vo[p3,h1] * t_vo[p4,h2] * t_vo[p1,h3] * x_vo[p2,h4];

A.2 Equations implemented in the DLTC

The following equations are obtained from the NWChem TCE module and implemented in the DLTC for performance evaluation.
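
In these listings, following the usual TCE convention, p-type indices run over virtual (particle) orbitals and h-type indices over occupied (hole) orbitals; an index that appears twice on the right-hand side is summed over, and intermediates such as I_1 are formed once and then reused. As an illustration only, and not the DLTC implementation, the sketch below evaluates the first few lines of the CCD-T2 listing as dense NumPy contractions; the dimensions nv and no and the random tensor contents are arbitrary placeholders.

import numpy as np

# Illustrative sketch: the first few lines of the CCD-T2 listing below,
# evaluated as dense contractions with numpy.einsum.
# nv = number of virtual orbitals, no = number of occupied orbitals (placeholders).
nv, no = 8, 4
F_oo   = np.random.rand(no, no)            # F_OO[h5,h1]
T_vvoo = np.random.rand(nv, nv, no, no)    # T_VVOO[p,p',h,h']
V_oovv = np.random.rand(no, no, nv, nv)    # V_OOVV[h,h',p,p']
V_vvoo = np.random.rand(nv, nv, no, no)    # V_VVOO[p,p',h,h']
R_vvoo = np.zeros((nv, nv, no, no))

# Line 1: R_VVOO[p3,p4,h1,h2] += V_VVOO[p3,p4,h1,h2]
R_vvoo += V_vvoo

# Lines 2-3: I_1[h5,h1] = F_OO[h5,h1] + 0.5 * T_VVOO[p6,p7,h1,h8] * V_OOVV[h5,h8,p6,p7]
# (summation over p6, p7, h8)
I_1 = F_oo + 0.5 * np.einsum('efin,mnef->mi', T_vvoo, V_oovv)

# Line 4: R_VVOO[p3,p4,h1,h2] += -1.0 * T_VVOO[p3,p4,h1,h5] * I_1[h5,h2]
# (summation over h5)
R_vvoo += -1.0 * np.einsum('abim,mj->abij', T_vvoo, I_1)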

CCD-T2

1 R_VVOO[p3,p4,h1,h2] += V_VVOO[p3,p4,h1,h2] 2 I_1[h5,h1] += F_OO[h5,h1]

3 I_1[h5,h1] += 0.5 * T_VVOO[p6,p7,h1,h8] * V_OOVV[h5,h8,p6,p7] 4 R_VVOO[p3,p4,h1,h2] += -1.0 * T_VVOO[p3,p4,h1,h5] * I_1[h5,h2]

5 I_2[p3,p5] += F_VV[p3,p5] 6 I_2[p3,p5] += -0.5 * T_VVOO[p3,p6,h7,h8] * V_OOVV[h7,h8,p5,p6]

7 R_VVOO[p3,p4,h1,h2] += T_VVOO[p3,p5,h1,h2] * I_2[p4,p5] 8 I_3[h7,h9,h1,h2] += -1.0 * V_OOOO[h7,h9,h1,h2]

9 I_3[h7,h9,h1,h2] += -0.5 * T_VVOO[p5,p6,h1,h2] * V_OOVV[h7,h9,p5,p6] 10 R_VVOO[p3,p4,h1,h2] += -0.5 * T_VVOO[p3,p4,h7,h9] * I_3[h7,h9,h1,h2]

11 I_4[h6,p3,h1,p5] += V_OVOV[h6,p3,h1,p5] 12 I_4[h6,p3,h1,p5] += -0.5 * T_VVOO[p3,p7,h1,h8] * V_OOVV[h6,h8,p5,p7]

13 R_VVOO[p3,p4,h1,h2] += -1.0 * T_VVOO[p3,p5,h1,h6] * I_4[h6,p4,h2,p5] 14 R_VVOO[p3,p4,h1,h2] += 0.5 * T_VVOO[p5,p6,h1,h2] * V_VVVV[p3,p4,p5,p6]

CCSD-T1

1 R_VO[p2,h1] += F_VO[p2,h1]

2 R_VO[p2,h1] += -1.0 * T_VO[p3,h4] * V_OVOV[h4,p2,h1,p3] 3 R_VO[p2,h1] += -0.5 * T_VVOO[p3,p4,h1,h5] * V_OVVV[h5,p2,p3,p4]

4 I_1[h7,h1] += F_OO[h7,h1] 5 I_1[h7,h1] += -1.0 * T_VO[p4,h5] * V_OOOV[h5,h7,h1,p4]

6 I_1[h7,h1] += -0.5 * T_VVOO[p3,p4,h1,h5] * V_OOVV[h5,h7,p3,p4] 7 I_2[h7,p3] += F_OV[h7,p3]

8 I_2[h7,p3] += -1.0 * T_VO[p5,h6] * V_OOVV[h6,h7,p3,p5] 9 I_1[h7,h1] += T_VO[p3,h1] * I_2[h7,p3]

10 R_VO[p2,h1] += -1.0 * T_VO[p2,h7] * I_1[h7,h1] 11 I_3[p2,p3] += F_VV[p2,p3]

12 I_3[p2,p3] += -1.0 * T_VO[p4,h5] * V_OVVV[h5,p2,p3,p4] 13 R_VO[p2,h1] += T_VO[p3,h1] * I_3[p2,p3]

14 I_4[h8,p7] += F_OV[h8,p7] 15 I_4[h8,p7] += T_VO[p5,h6] * V_OOVV[h6,h8,p5,p7]

16 R_VO[p2,h1] += T_VVOO[p2,p7,h1,h8] * I_4[h8,p7] 17 I_5[h4,h5,h1,p3] += V_OOOV[h4,h5,h1,p3]

18 I_5[h4,h5,h1,p3] += -1.0 * T_VO[p6,h1] * V_OOVV[h4,h5,p3,p6] 19 R_VO[p2,h1] += -0.5 * T_VVOO[p2,p3,h4,h5] * I_5[h4,h5,h1,p3]

CCSD-T2

1 R_VVOO[p3,p4,h1,h2] += V_VVOO[p3,p4,h1,h2] 2 R_VVOO[p3,p4,h1,h2] += 0.5 * T_VVOO[p5,p6,h1,h2] * V_VVVV[p3,p4,p5,p6]

3 I_1[h10,p3,h1,h2] += V_OVOO[h10,p3,h1,h2] 4 I_1[h10,p3,h1,h2] += 0.5 * T_VVOO[p5,p6,h1,h2] * V_OVVV[h10,p3,p5,p6]

5 I_2[h10,h11,h1,h2] += -1.0 * V_OOOO[h10,h11,h1,h2] 6 I_2[h10,h11,h1,h2] += -0.5 * T_VVOO[p7,p8,h1,h2] * V_OOVV[h10,h11,p7,p8]

7 I_3[h10,h11,h1,p5] += V_OOOV[h10,h11,h1,p5] 8 I_3[h10,h11,h1,p5] += -0.5 * T_VO[p6,h1] * V_OOVV[h10,h11,p5,p6]

9 I_2[h10,h11,h1,h2] += T_VO[p5,h1] * I_3[h10,h11,h2,p5] 10 I_1[h10,p3,h1,h2] += 0.5 * T_VO[p3,h11] * I_2[h10,h11,h1,h2]

11 I_4[h10,p3,h1,p5] += V_OVOV[h10,p3,h1,p5] 12 I_4[h10,p3,h1,p5] += -0.5 * T_VO[p6,h1] * V_OVVV[h10,p3,p5,p6]

13 I_1[h10,p3,h1,h2] += -1.0 * T_VO[p5,h1] * I_4[h10,p3,h2,p5] 14 I_5[h10,p5] += F_OV[h10,p5]

15 I_5[h10,p5] += -1.0 * T_VO[p6,h7] * V_OOVV[h7,h10,p5,p6] 16 I_1[h10,p3,h1,h2] += -1.0 * T_VVOO[p3,p5,h1,h2] * I_5[h10,p5]

17 I_6[h7,h10,h1,p9] += V_OOOV[h7,h10,h1,p9] 18 I_6[h7,h10,h1,p9] += T_VO[p5,h1] * V_OOVV[h7,h10,p5,p9]

19 I_1[h10,p3,h1,h2] += T_VVOO[p3,p9,h1,h7] * I_6[h7,h10,h2,p9] 20 R_VVOO[p3,p4,h1,h2] += -1.0 * T_VO[p3,h10] * I_1[h10,p4,h1,h2]

21 I_7[p3,p4,h1,p5] += V_VVOV[p3,p4,h1,p5] 22 I_7[p3,p4,h1,p5] += -0.5 * T_VO[p6,h1] * V_VVVV[p3,p4,p5,p6]

23 R_VVOO[p3,p4,h1,h2] += -1.0 * T_VO[p5,h1] * I_7[p3,p4,h2,p5]

24 I_8[h9,h1] += F_OO[h9,h1] 25 I_8[h9,h1] += -1.0 * T_VO[p6,h7] * V_OOOV[h7,h9,h1,p6]

26 I_8[h9,h1] += -0.5 * T_VVOO[p6,p7,h1,h8] * V_OOVV[h8,h9,p6,p7] 27 I_9[h9,p8] += F_OV[h9,p8]

28 I_9[h9,p8] += T_VO[p6,h7] * V_OOVV[h7,h9,p6,p8] 29 I_8[h9,h1] += T_VO[p8,h1] * I_9[h9,p8]

30 R_VVOO[p3,p4,h1,h2] += -1.0 * T_VVOO[p3,p4,h1,h9] * I_8[h9,h2] 31 I_10[p3,p5] += F_VV[p3,p5]

32 I_10[p3,p5] += -1.0 * T_VO[p6,h7] * V_OVVV[h7,p3,p5,p6] 33 I_10[p3,p5] += -0.5 * T_VVOO[p3,p6,h7,h8] * V_OOVV[h7,h8,p5,p6]

34 R_VVOO[p3,p4,h1,h2] += T_VVOO[p3,p5,h1,h2] * I_10[p4,p5] 35 I_11[h9,h11,h1,h2] += -1.0 * V_OOOO[h9,h11,h1,h2]

36 I_11[h9,h11,h1,h2] += -0.5 * T_VVOO[p5,p6,h1,h2] * V_OOVV[h9,h11,p5,p6] 37 I_12[h9,h11,h1,p8] += V_OOOV[h9,h11,h1,p8]

38 I_12[h9,h11,h1,p8] += 0.5 * T_VO[p6,h1] * V_OOVV[h9,h11,p6,p8] 39 I_11[h9,h11,h1,h2] += T_VO[p8,h1] * I_12[h9,h11,h2,p8]

40 R_VVOO[p3,p4,h1,h2] += -0.5 * T_VVOO[p3,p4,h9,h11] * I_11[h9,h11,h1,h2] 41 I_13[h6,p3,h1,p5] += V_OVOV[h6,p3,h1,p5]

42 I_13[h6,p3,h1,p5] += -1.0 * T_VO[p7,h1] * V_OVVV[h6,p3,p5,p7] 43 I_13[h6,p3,h1,p5] += -0.5 * T_VVOO[p3,p7,h1,h8] * V_OOVV[h6,h8,p5,p7]

44 R_VVOO[p3,p4,h1,h2] += -1.0 * T_VVOO[p3,p5,h1,h6] * I_13[h6,p4,h2,p5]

157 BIBLIOGRAPHY

[1] Mvapich: Mpi over infiniband, 10gige/iwarp and roce. mvapich.cse.ohio-state.

edu.

[2] Papi: Performance application programming interface. icl.cs.utk.edu/papi.

[3] Psi4: ab initio quantum chemistry. www.psicode.org.

[4] Agarwal, R., Balle, S. M., Gustavson, F. G., Joshi, M., and Palkar, P. A three-

dimensional approach to parallel matrix multiplication. IBM Journal of Research

and Development 39, 5 (1995).

[5] Allam, A., Ramanujam, J., Baumgartner, G., and Sadayappan, P. Memory

minimization for tensor contractions using integer linear programming. In

Proceedings of the 20th International Conference on Parallel and Distributed Process-

ing (2006), IPDPS’06, IEEE Computer Society, p. 382.

[6] Auer, A. A., Baumgartner, G., Bernholdt, D. E., Bibireata, A., Choppella,

V., Cociorva, D., Gao, X., Harrison, R., Krishnamoorthy, S., Krishnan, S.,

Lam, C.-C., Lu, Q., Nooijen, M., Pitzer, R., Ramanujam, J., Sadayappan, P., and

Sibiryakov, A. Automatic code generation for many-body electronic structure

methods: the tensor contraction engine. Molecular 104, 2 (2006), 211–

228.

158 [7] Bartlett, R. Coupled-cluster theory in quantum chemistry. Reviews of Modern

Physics 79, 1 (2007), 291–352.

[8] Bastert, O., and Matuszewski, C. Layered drawings of digraphs. In Drawing

Graphs, vol. 2025 of Lecture Notes in Computer Science. Springer Berlin Heidel-

berg, 2001, pp. 87–120.

[9] Baumgartner, G., Auer, A., Bernholdt, D., Bibireata, A., Choppella, V., Co-

ciorva, D., Gao, X., Harrison, R., Hirata, S., Krishnamoorthy, S., Krishnan,

S., Lam, C., Lu, Q., Nooijen, M., Pitzer, R., Ramanujam, J., Sadayappan, P., and

Sibiryakov, A. Synthesis of high-performance parallel programs for a class of

ab initio quantum chemistry models. In Proceedings of the IEEE (2005), vol. 93,

pp. 276–292.

[10] Baumgartner, G., Bernholdt, D. E., Cociorva, D., Harrison, R., Hirata, S.,

Lam, C.-C., Nooijen, M., Pitzer, R., Ramanujam, J., and Sadayappan, P. A high-

level approach to synthesis of high-performance codes for quantum chemistry.

In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (2002), SC’02,

IEEE Computer Society Press, pp. 1–10.

[11] Baumgartner, G., Bernholdt, D. E., Cociorva, D., Lam, C.-C., Ramanujam, J.,

Harrison, R., Noolijen, M., and Sadayappan, P. A performance optimization

framework for compilation of tensor contraction expressions into parallel pro-

grams. In Proceedings of the 16th International Parallel and Distributed Processing

Symposium (2002), IPDPS ’02, IEEE Computer Society, p. 33.

159 [12] Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Herault, T., and Don-

garra, J. J. Parsec: Exploiting heterogeneity to enhance scalability. Computing

in Science and Engineering 15, 6 (2013), 36–45.

[13] Cannon, L. E. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD

thesis, Montana State University, 1969.

[14] Cociorva, D., Baumgartner, G., Lam, C.-C., Sadayappan, P., and Ramanujam,

J. Memory-constrained communication minimization for a class of array com-

putations. In Proceedings of the 15th International Conference on Languages and

Compilers for Parallel Computing (2005), LCPC’02, Springer-Verlag, pp. 1–15.

[15] Cociorva, D., Baumgartner, G., Lam, C.-C., Sadayappan, P., Ramanujam, J.,

Nooijen, M., Bernholdt, D. E., and Harrison, R. Space-time trade-off op-

timization for a class of electronic structure calculations. In Proceedings of the

ACM SIGPLAN 2002 Conference on Programming Language Design and Implemen-

tation (2002), PLDI ’02, ACM, pp. 177–186.

[16] Cociorva, D., Gao, X., Krishnan, S., Baumgartner, G., Lam, C.-C., Sadayap-

pan, P., and Ramanujam, J. Global communication optimization for tensor

contraction expressions under memory constraints. In Proceedings of the 17th

International Symposium on Parallel and Distributed Processing (2003), IPDPS ’03,

IEEE Computer Society, p. 37.2.

[17] Cociorva, D., Wilkins, J. W., Baumgartner, G., Sadayappan, P., Ramanujam, J.,

Nooijen, M., Bernholdt, D. E., and Harrison, R. J. Towards automatic syn-

thesis of high-performance codes for electronic structure calculations: Data

160 locality optimization. In Proceedings of the 8th International Conference on High

Performance Computing (2001), HiPC ’01, Springer-Verlag, pp. 237–248.

[18] Cociorva, D., Wilkins, J. W., Lam, C.-C., Baumgartner, G., Ramanujam, J., and

Sadayappan, P. Loop optimization for a class of memory-constrained computa-

tions. In Proceedings of International ACM Conference on International Conference

on Supercomputing (2001), ICS’01, pp. 103–113.

[19] Coffman, Jr., E. G., and Graham, R. Optimal scheduling for two-processor

systems. Acta Informatica 1, 3 (1972), 200–213.

[20] Datta, K., Bonachea, D., and Yelick, K. Titanium Performance and Potential: An

NPB Experimental Study, vol. 4339. Springer Berlin Heidelberg, 2006, pp. 200–

214.

[21] Dinan, J., Krishnamoorthy, S., Larkins, D. B., Nieplocha, J., and Sadayappan,

P. Scioto: A framework for global-view task parallelism. In Proceedings of the

37th International Conference on Parallel Processing (2008), ICPP’08, IEEE Com-

puter Society, pp. 586–593.

[22] Dinan, J., Larkins, D. B., Sadayappan, P., Krishnamoorthy, S., and Nieplocha,

J. Scalable work stealing. In Proceedings of the Conference on High Performance

Computing Networking, Storage and Analysis (2009), SC ’09, ACM, pp. 53:1–53:11.

[23] Einstein, A. The Foundation of the General Theory of Relativity. 1916.

[24] Engels-Putzka, A., and Hanrath, M. A fully simultaneously optimizing ge-

netic approach to the highly excited coupled-cluster factorization problem.

The Journal of Chemical Physics 134, 12 (2011).

[25] Fox, G. C., Johnson, M. A., Lyzenga, G. A., Otto, S. W., Salmon, J. K., and

Walker, D. W. Solving Problems on Concurrent Processors. Vol. 1: General Tech-

niques and Regular Problems. Prentice-Hall, Inc., 1988.

[26] Gao, X., Krishnamoorthy, S., Sahoo, S. K., Lam, C.-C., Baumgartner, G.,

Ramanujam, J., and Sadayappan, P. Efficient search-space pruning for inte-

grated fusion and tiling transformations. In Proceedings of the 18th International

Conference on Languages and Compilers for Parallel Computing (2005), LCPC’05,

Springer-Verlag, pp. 215–229.

[27] Gao, X., Krishnamoorthy, S., Sahoo, S. K., Lam, C.-C., Baumgartner, G., Ra-

manujam, J., and Sadayappan, P. Efficient search-space pruning for integrated

fusion and tiling transformations: Research articles. Concurr. Comput. : Pract.

Exper. 19, 18 (2007), 2425–2443.

[28] Hanrath, M., and Engels-Putzka, A. An efficient matrix-matrix multiplica-

tion based contraction engine for general order coupled

cluster. The Journal of Chemical Physics 133, 6 (2010).

[29] Hartono, A., Lu, Q., Gao, X., Krishnamoorthy, S., Nooijen, M., Baumgartner,

G., Bernholdt, D. E., Choppella, V., Pitzer, R. M., Ramanujam, J., Rountev, A.,

and Sadayappan, P. Identifying cost-effective common subexpressions to re-

duce operation count in tensor contraction evaluations. In Proceedings of the 6th

International Conference on Computational Science - Volume Part I (2006), ICCS’06,

Springer-Verlag, pp. 267–275.

[30] Hartono, A., Lu, Q., Henretty, T., Krishnamoorthy, S., Zhang, H., Baum-

gartner, G., Bernholdt, D. E., Nooijen, M., Pitzer, R., Ramanujam, J., and Sa-

dayappan, P. Performance optimization of tensor contraction expressions for

many-body methods in quantum chemistry. The Journal of Physical Chemistry

A 113, 45 (2009), 12715–12723.

[31] Hartono, A., Sibiryakov, A., Nooijen, M., Baumgartner, G., Bernholdt, D. E.,

Hirata, S., Lam, C.-C., Pitzer, R. M., Ramanujam, J., and Sadayappan, P. Auto-

mated operation minimization of tensor contraction expressions in electronic

structure calculations. In Proceedings of the 5th International Conference on Com-

putational Science - Volume Part I (2005), ICCS’05, Springer-Verlag, pp. 155–164.

[32] Hirata, S. Tensor contraction engine: abstraction and automated parallel im-

plementation of configuration-interaction, coupled-cluster, and many-body

perturbation theories. The Journal of Physical Chemistry A 107, 46 (2003), 9887–

9897.

[33] Kowalski, K., Krishnamoorthy, S., Olson, R. M., Tipparaju, V., and Aprà, E.

Scalable implementations of accurate excited-state coupled cluster theories:

Application of high-level methods to porphyrin-based systems. In Proceedings

of International Conference for High Performance Computing, Networking, Storage

and Analysis (2011), SC ’11, ACM, pp. 72:1–72:10.

[34] Krishnamoorthy, S., Baumgartner, G., Cociorva, D., Lam, C.-C., and Sadayap-

pan, P. Efficient parallel out-of-core matrix transposition. In Proceedings of the 2003

IEEE International Conference on Cluster Computing (2003), CLUSTER’03, IEEE

Computer Society.

[35] Krishnamoorthy, S., Baumgartner, G., Lam, C.-C., Nieplocha, J., and Sadayap-

pan, P. Efficient layout transformation for disk-based multidimensional arrays.

In Proceedings of the 11th International Conference on High Performance Computing

(2004), HiPC’04, Springer-Verlag, pp. 386–398.

[36] Krishnamoorthy, S., Baumgartner, G., Lam, C.-C., Nieplocha, J., and Sadayap-

pan, P. Layout transformation support for the disk resident arrays framework.

J. Supercomput. 36, 2 (2006), 153–170.

[37] Krishnan, S., Krishnamoorthy, S., Baumgartner, G., Cociorva, D., Lam, C.-C.,

Sadayappan, P., Ramanujam, J., Bernholdt, D., and Choppella, V. Data Locality

Optimization for Synthesis of Efficient Out-of-Core Algorithms, vol. 2913. Springer

Berlin Heidelberg, 2003, pp. 406–417.

[38] Krishnan, S., Krishnamoorthy, S., Baumgartner, G., Lam, C.-C., Ramanujam,

J., Sadayappan, P., and Choppella, V. Efficient synthesis of out-of-core algo-

rithms using a nonlinear optimization solver. J. Parallel Distrib. Comput. 66, 5

(2006), 659–673.

[39] Kucharski, S. A., and Bartlett, R. J. The coupled-cluster single, double, triple,

and quadruple excitation method. The Journal of Chemical Physics 97, 6 (1992),

4282–4288.

[40] Lam, C.-C., Cociorva, D., Baumgartner, G., and Sadayappan, P. Memory-

optimal evaluation of expression trees involving large objects. In Proceedings of

the 6th International Conference on High Performance Computing (1999), HiPC’99,

pp. 103–110.

[41] Lam, C.-C., Cociorva, D., Baumgartner, G., and Sadayappan, P. Optimiza-

tion of memory usage requirement for a class of loops implementing multi-

dimensional integrals. In Proceedings of International Conference on Languages

and Compilers for High Performance Computing (1999), LCPC’99, pp. 350–364.

[42] Lam, C.-C., Rauber, T., Baumgartner, G., Cociorva, D., and Sadayappan, P.

Memory-optimal evaluation of expression trees involving large objects. Com-

put. Lang. Syst. Struct. 37, 2 (2011), 63–75.

[43] Lifflander, J., Krishnamoorthy, S., and Kale, L. V. Work stealing and

persistence-based load balancers for iterative overdecomposed applications.

In Proceedings of the 21st International Symposium on High-Performance Parallel

and Distributed Computing (2012), HPDC ’12, ACM, pp. 137–148.

[44] Lipshitz, B., Ballard, G., Demmel, J., and Schwartz, O. Communication-

avoiding parallel strassen: Implementation and performance. In Proceedings of

the International Conference on High Performance Computing, Networking, Storage

and Analysis (2012), SC ’12, IEEE Computer Society Press, pp. 101:1–101:11.

[45] Lotrich, V., Flocke, N., Ponton, M., Yau, A. D., Perera, A., Deumens, E., and

Bartlett, R. J. Parallel implementation of electronic structure energy, gradient,

and hessian calculations. The Journal of Chemical Physics 128, 19 (2008).

[46] Lu, Q., Gao, X., Krishnamoorthy, S., Baumgartner, G., Ramanujam, J., and

Sadayappan, P. Empirical performance-model driven data layout optimization.

In Proceedings of the 17th International Conference on Languages and Compilers for

High Performance Computing (2004), LCPC’04, Springer-Verlag, pp. 72–86.

[47] Lu, Q., Gao, X., Krishnamoorthy, S., Baumgartner, G., Ramanujam, J., and

Sadayappan, P. Empirical performance model-driven data layout optimization

and library call selection for tensor contraction expressions. J. Parallel Distrib.

Comput. 72, 3 (2012), 338–352.

[48] Ma, W., Krishnamoorthy, S., Villa, O., Kowalski, K., and Agrawal, G. Opti-

mizing tensor contraction expressions for hybrid cpu-gpu execution. Cluster

Computing 16, 1 (2013), 131–155.

[49] Mehlhorn, K. Data Structures and Algorithms 2: Graph Algorithms and NP-

Completeness, vol. 2 of Monographs in Theoretical Computer Science. An EATCS

Series. Springer, 1984.

[50] Musial, M., Kucharski, S. A., and Bartlett, R. J. Formulation and implemen-

tation of the full coupled-cluster method through pentuple excitations. The

Journal of Chemical Physics 116, 11 (2002), 4382–4388.

[51] Nieplocha, J., Palmer, B., Tipparaju, V., Krishnan, M., Trease, H., and Aprà, E.

Advances, applications and performance of the global arrays shared memory

programming toolkit. International Journal of High Performance Computing

Applications 20, 2 (2006), 203–231.

[52] Noga, J., and Bartlett, R. J. The full ccsdt model for molecular electronic

structure. The Journal of Chemical Physics 86 (1987), 7041.

[53] Noga, J., and Bartlett, R. J. Erratum: The full ccsdt model for molecular elec-

tronic structure [j. chem. phys. 86, 7041 (1987)]. The Journal of Chemical Physics

89 (1988), 3401.

[54] Numrich, R. W., and Reid, J. Co-array fortran for parallel programming. SIG-

PLAN Fortran Forum 17, 2 (1998), 1–31.

[55] Ozog, D., Hammond, J. R., Dinan, J., Balaji, P., Shende, S., and Malony, A.

Inspector-executor load balancing algorithms for block-sparse tensor contrac-

tions. In Proceedings of the 42nd International Conference on Parallel Processing

(2013), ICPP ’13, IEEE Computer Society, pp. 30–39.

[56] Ozog, D., Shende, S., Malony, A., Hammond, J. R., Dinan, J., and Balaji, P.

Inspector/executor load balancing algorithms for block-sparse tensor contrac-

tions. In Proceedings of the 27th International ACM Conference on International

Conference on Supercomputing (2013), ICS’13, ACM, pp. 483–484.

[57] Ghosh, P., Hammond, J. R., Ghosh, S., and Chapman, B. Performance anal-

ysis of nwchem tce for different communication patterns. In 4th International

Workshop on Performance Modeling, Benchmarking, and Simulation of High Perfor-

mance Computer Systems (2013), PMBS’13.

[58] Pulay, P., Saebø, S., and Meyer, W. An efficient reformulation of the closed-

shell self-consistent electron pair theory. The Journal of Chemical Physics 81

(1984), 1901–1905.

[59] Püschel, M., Moura, J. M. F., Singer, B., Xiong, J., Johnson, J., Padua, D.,

Veloso, M., and Johnson, R. W. Spiral: A generator for platform-adapted libraries

of signal processing algorithms. International Journal of High Performance Com-

puting Applications 18, 1 (2004), 21–45.

[60] Raghavachari, K., Trucks, G. W., Pople, J. A., and Head-Gordon, M. A

fifth-order perturbation comparison of electron correlation theories. Chemi-

cal Physics Letters 157, 6 (1989), 479–483.

[61] Rajbhandari, S., Nikam, A., Lai, P.-W., Stock, K., Krishnamoorthy, S., and

Sadayappan, P. Cast: Contraction algorithm for symmetric tensors. In Proceed-

ings of ICPP14: International Conference on Parallel Processing (2014 in press),

ICPP’14.

[62] Rajbhandari, S., Nikam, A., Lai, P.-W., Stock, K., Krishnamoorthy, S., and Sa-

dayappan, P. Communication-optimal framework for contracting distributed

tensors. In Proceedings of SC14: International Conference for High Performance

Computing, Networking, Storage and Analysis (2014 in press), SC’14.

[63] Sanders, B. A., Deumens, E., Lotrich, V., and Ponton, M. Refactoring a lan-

guage for parallel computational chemistry. In Proceedings of the 2nd Workshop

on Refactoring Tools (2008), WRT ’08, ACM, pp. 11:1–11:4.

[64] Schrödinger, E. An undulatory theory of the mechanics of atoms and

molecules. Phys. Rev. 28 (1926), 1049–1070.

[65] Scuseria, G. E., and Schaefer, H. F., III. A new implementation of the full ccsdt method

for electronic structure calculations. Chemical Physics Letters 152 (1988), 382–

386.

[66] Sibiryakov, A. Operation optimization of tensor contraction expressions. Mas-

ter’s thesis, The Ohio State University, Columbus, OH, 2004.

[67] Solomonik, E., and Demmel, J. Communication-optimal parallel 2.5d matrix

multiplication and lu factorization algorithms. In Euro-Par 2011 Parallel Pro-

cessing, vol. 6853. Springer Berlin Heidelberg, 2011, pp. 90–109.

[68] Solomonik, E., Hammond, J., and Demmel, J. A preliminary analysis of cy-

clops tensor framework. Tech. Rep. UCB/EECS-2012-29, EECS Department,

University of California, Berkeley, 2012.

[69] Solomonik, E., Matthews, D., Hammond, J., and Demmel, J. Cyclops tensor

framework: Reducing communication and eliminating load imbalance in mas-

sively parallel contractions. In Proceedings of the 27th International Symposium

on Parallel and Distributed Processing (2013), IPDPS ’13, IEEE Computer Society,

pp. 813–824.

[70] Solomonik, E., Matthews, D., Hammond, J. R., Stanton, J. F., and Demmel, J.

A massively parallel tensor contraction framework for coupled-cluster com-

putations. Journal of Parallel and Distributed Computing (2014).

[71] Stanton, J., Gauss, J., Watts, J., and Bartlett, R. A direct product decompo-

sition approach for symmetry exploitation in many-body methods. i. energy

calculations. The Journal of Chemical Physics 94, 6 (1991), 4334–4345.

[72] Sugiyama, K., Tagawa, S., and Toda, M. Methods for visual understanding of

hierarchical system structures. IEEE Transactions on Systems, Man and Cyber-

netics 11, 2 (1981), 109–125.

[73] UPC Consortium. Upc language specifications, v1.2. Tech. Rep. LBNL-59208,

Lawrence Berkeley National Lab, 2005.

[74] Valiev, M., Bylaska, E., Govind, N., Kowalski, K., Straatsma, T., Dam, H. V.,

Wang, D., Nieplocha, J., Aprà, E., Windus, T., and de Jong, W. Nwchem:

A comprehensive and scalable open-source solution for large scale molecular

simulations. Computer Physics Communications 181, 9 (2010), 1477–1489.

[75] van de Geijn, R. A., and Watts, J. SUMMA: scalable universal matrix multi-

plication algorithm. Concurrency: Practice and Experience 9, 4 (1997), 255–274.

[76] Werner, H.-J., Knowles, P. J., Knizia, G., Manby, F. R., Schütz, M., et al. Mol-

pro, version 2012.1, a package of ab initio programs, 2012.

[77] Yanai, T., Nakano, H., Nakajima, T., Tsuneda, T., Hirata, S., Kawashima, Y.,

Nakao, Y., Kamiya, M., Sekino, H., and Hirao, K. Utchem: A program for ab

initio quantum chemistry. In Proceedings of the 2003 International Conference on

Computational Science (2003), ICCS’03, Springer-Verlag, pp. 84–95.

[78] Zhang, H. Utilization of symmetry in optimization of tensor contraction ex-

pressions. Master’s thesis, The Ohio State University, Columbus, OH, 2010.
