Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories

J. Phys. Chem. A 2003, 107, 9887-9897 9887

So Hirata*
William R. Wiley Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, P.O. Box 999, Richland, Washington 99352
Received: March 7, 2003; In Final Form: July 16, 2003

We have developed a symbolic manipulation program and program generator (tensor contraction engine or TCE) that abstracts and automates the time-consuming, error-prone processes of deriving the working equations of a well-defined model of second-quantized many-electron theories and synthesizing efficient parallel computer programs on the basis of these equations. Provided an ansatz of a many-electron theory model, TCE performs valid contractions of creation and annihilation operators according to Wick’s theorem, consolidates identical terms, and reduces the expressions into the form of multiple tensor contractions acted upon by permutation operators. It subsequently determines the binary contraction order for each multiple tensor contraction with the minimal operation and memory cost, factorizes common binary contractions (defines intermediate tensors), and identifies reusable intermediates. The resulting ordered list of binary tensor contractions, additions, and index permutations is translated into an optimized program that is combined with the NWCHEM and UTCHEM computational chemistry software packages. The programs synthesized by TCE take advantage of spin symmetry (within the spin-orbital formalisms), real Abelian point-group symmetry, and index permutation symmetry at every stage of the calculations to minimize the number of arithmetic operations and storage requirement, adjust the peak local memory usage by index-range tiling, and support parallel I/O interfaces and dynamic load balancing for parallel executions. We demonstrate the utility of TCE through automatic derivation and implementation of parallel programs for a range of predictive computational methods, namely configuration-interaction theory (CISD, CISDT, CISDTQ), generalized many-body perturbation theory [MBPT(2), MBPT(3), MBPT(4)], and coupled-cluster theory (LCCD, CCD, LCCSD, CCSD, QCISD, CCSDT, and CCSDTQ), some for the first time, and discuss the performance of the implemented programs.

* E-mail: [email protected].
10.1021/jp034596z CCC: $25.00 © 2003 American Chemical Society. Published on Web 10/28/2003. J. Phys. Chem. A, Vol. 107, No. 46, 2003.

1. Introduction

Electronic computers have enabled a complex sequence of arithmetic operations in quantum mechanical calculations of many-electron systems. Equally complex are the symbolic manipulation processes of deriving the working equations of many-electron theories and implementing efficient computer programs on the basis of these equations, which are also subject to abstraction and automation by computers. The objective of this study is to develop a general-purpose computer program that performs both symbolic manipulation processes: a program that manipulates second-quantized operators and derives the working equations of any well-defined second-quantized many-electron theory, analyzes these equations, and translates them into a thoroughly optimized parallel program. The significance of such a program is evident. (1) It expedites time-consuming and error-prone derivation and computer implementation of various many-electron theory models, (2) it facilitates parallelization and other laborious optimization of synthesized programs, which may be tailored to a particular computer architecture, (3) it enhances the portability, maintainability, and extensibility of synthesized programs, and (4) it helps design and test a new many-electron theory model or implements models that are too complex to be hand-coded.

A number of researchers have capitalized on computer-aided formula derivation and program synthesis in the past, and some symbolic manipulation programs with a varied degree of sophistication have been developed (refs 1-7). A pioneering and perhaps the most thorough study was conducted by Janssen and Schaefer (ref 3), who built a computer program that automated the derivation and computer implementation of coupled-cluster models for open-shell systems. There have also been some studies that aimed at performing calculations of various many-electron theory models in a single algorithmic framework, sometimes at the expense of efficiency. The prime examples are determinant-based general-order many-body perturbation (refs 8, 9) and coupled-cluster theories (refs 10-12) and the remarkable string-based general-order coupled-cluster theory of Kállay and Surján (ref 13). These and the computer-aided formula derivation and program synthesis are closely related in the sense that both approaches require a high degree of abstraction of the equations and constituent quantities of many-electron theories (ref 14).

We have developed a symbolic manipulation program of second-quantized operators and a program generator, which we call a tensor contraction engine or TCE (ref 15), adopting the design philosophy of Janssen and Schaefer. TCE inherits various techniques invented by these and other authors (refs 3, 5-7, 13, 14), but its applicability is broadened and its capabilities in equation analysis and program optimization are significantly enhanced to the extent that the computer-synthesized programs can compete with hand-coded programs in terms of operation and memory cost. TCE is based on the second-quantized representation of many-electron theories, which is general and covers a wide spectrum of models ranging from configuration-interaction (CI) theory, many-body perturbation theory (MBPT), and coupled-cluster

(CC) theory (refs 16-19). Given a definition of a many-electron theory as the quasi-vacuum expectation values of normal-ordered second-quantized operators, TCE performs valid contractions of creation and annihilation operators according to Wick’s theorem, consolidates identical terms, and reduces the expressions into the form of multiple tensor contractions acted upon by permutation operators. TCE subsequently performs strength reduction (determines the binary contraction order for each multiple tensor contraction that has the minimal operation and memory cost), factorization (eliminates common binary contractions and defines intermediate tensors), and common subexpression elimination (intermediate reuse). The resulting ordered list (“operation tree”) of binary tensor contractions, additions, and index permutations is translated into an optimized program that is combined with a high-performance quantum chemistry program package tailored to parallel computer environments. The programs synthesized by TCE take advantage of spin, spatial, and index permutation symmetries at every stage of the calculations to reduce the operation cost and storage requirement, adjust the peak local memory usage by index-range tiling, and support multiple parallel I/O interfaces and dynamic load balancing for parallel executions.

In this article, we describe the machinery of TCE and discuss the characteristics of the equations and intermediate quantities of many-electron theories upon which TCE is based. To render TCE widely applicable and to advance its program optimization capabilities, we answer the following questions: (1) What are the adequate representations of tensors, second-quantized operators, permutation operators, and so forth that permit rapid pattern-matching operations? (2) What is the rational definition of intermediate tensors that have desirable index permutation symmetry? (3) How can we take advantage of spin, spatial, and index permutation symmetries simultaneously to minimize the number of arithmetic operations? (4) How can we adjust the peak memory usage without significantly increasing the operation cost? (5) What is the universal storage for tensors compressed by the use of spin, spatial, and index permutation symmetries? (6) How can we effectively parallelize the entire calculation? We demonstrate the computer-aided implementation of high-performance parallel programs for various models of many-electron theories that include configuration-interaction theory (CISD, CISDT, CISDTQ), many-body perturbation theory [MBPT(2), MBPT(3), MBPT(4)], and coupled-cluster theory (LCCD, CCD, LCCSD, CCSD, QCISD, CCSDT, and CCSDTQ).

2. Machinery of Tensor Contraction Engine

2.1. Derivation of Working Equations. The vast majority of modern many-electron theories for electron correlation problems are defined in terms of the expectation values of the nonrelativistic electronic Hamiltonian and other operators for wave functions expanded by Slater determinants. The expectation values, in their simplest forms, are usually tensor algebraic expressions in which the tensors represent certain physical interactions (ref 14). Therefore, the derivation of working equations amounts to evaluating the expectation values of operators for determinantal wave functions and reducing the resulting algebraic expressions into the simplest form of tensor contractions and additions. This may be accomplished in various ways (ref 16), but three widely used approaches are the one based on Slater’s rules, the method of second quantization, and the diagrammatic approach. TCE adopts the method of second quantization that appears to blend the applicability and expediency in the most adequate balance for the purpose of building a universal and efficient symbolic manipulation program.

The process of evaluating operator expectation values with second quantization can be significantly simplified and accelerated by Wick’s theorem, which states that a string of creation and annihilation operators is a sum of all possible partial contractions of the string in the normal order. A normal-ordered string of operators (enclosed by a pair of braces) relative to a quasi-vacuum is defined as the rearrangement of the operators such that all hole annihilation and particle creation operators are to the left of all hole creation and particle annihilation operators. The significance of Wick’s theorem is that the quasi-vacuum expectation value of a normal-ordered string of operators vanishes unless it is fully contracted.

An initial task of TCE (an interpreted, interactive, object-oriented program written in the Python programming language) is therefore to perform all valid contractions of second-quantized operators, provided a definition of a many-electron theory model is expressed in terms of quasi-vacuum expectation values of a product of normal-ordered strings of operators, such as

$$
+\frac{1}{128} V^{g_5 g_6}_{g_7 g_8}\, t^{p_9 p_{10}}_{h_{11} h_{12}}\, t^{p_{13} p_{14}}_{h_{15} h_{16}} \langle 0|\{h_1^\dagger h_2^\dagger p_4 p_3\}\{g_5^\dagger g_6^\dagger g_8 g_7\}\{p_9^\dagger p_{10}^\dagger h_{12} h_{11}\}\{p_{13}^\dagger p_{14}^\dagger h_{16} h_{15}\}|0\rangle \tag{1}
$$

which is a part of the CCSD $T_2$ amplitude equation. Here, $h_n$ denotes a hole index, $p_n$ denotes a particle index, $g_n$ denotes either a hole or a particle index, $V^{g_5 g_6}_{g_7 g_8}$ represents an antisymmetrized two-electron integral, $t^{p_{13} p_{14}}_{h_{15} h_{16}}$ represents an excitation amplitude, and the Einstein convention of implied summation is employed. A valid contraction is the one between a hole creation and a hole annihilation operator and between a particle annihilation and a particle creation operator across different normal-ordered strings (i.e., excluding internal contractions).

TCE performs this procedure iteratively, with each cycle of iteration consisting of the following steps. First, TCE performs all possible contractions of an operator. Because the order of contraction is immaterial, TCE elects to contract the leftmost operator with another operator. After a single contraction of the leftmost operator, eq 1 becomes

$$
\begin{aligned}
&-\frac{1}{128} V^{g_5 g_6}_{g_7 h_1}\, t^{p_9 p_{10}}_{h_{11} h_{12}}\, t^{p_{13} p_{14}}_{h_{15} h_{16}} \langle 0|\{h_2^\dagger p_4 p_3\}\{g_5^\dagger g_6^\dagger g_7\}\{p_9^\dagger p_{10}^\dagger h_{12} h_{11}\}\{p_{13}^\dagger p_{14}^\dagger h_{16} h_{15}\}|0\rangle \\
&+\frac{1}{128} V^{g_5 g_6}_{h_1 g_8}\, t^{p_9 p_{10}}_{h_{11} h_{12}}\, t^{p_{13} p_{14}}_{h_{15} h_{16}} \langle 0|\{h_2^\dagger p_4 p_3\}\{g_5^\dagger g_6^\dagger g_8\}\{p_9^\dagger p_{10}^\dagger h_{12} h_{11}\}\{p_{13}^\dagger p_{14}^\dagger h_{16} h_{15}\}|0\rangle \\
&-\frac{1}{128} V^{g_5 g_6}_{g_7 g_8}\, t^{p_9 p_{10}}_{h_{11} h_1}\, t^{p_{13} p_{14}}_{h_{15} h_{16}} \langle 0|\{h_2^\dagger p_4 p_3\}\{g_5^\dagger g_6^\dagger g_8 g_7\}\{p_9^\dagger p_{10}^\dagger h_{11}\}\{p_{13}^\dagger p_{14}^\dagger h_{16} h_{15}\}|0\rangle \\
&+\frac{1}{128} V^{g_5 g_6}_{g_7 g_8}\, t^{p_9 p_{10}}_{h_1 h_{12}}\, t^{p_{13} p_{14}}_{h_{15} h_{16}} \langle 0|\{h_2^\dagger p_4 p_3\}\{g_5^\dagger g_6^\dagger g_8 g_7\}\{p_9^\dagger p_{10}^\dagger h_{12}\}\{p_{13}^\dagger p_{14}^\dagger h_{16} h_{15}\}|0\rangle \\
&-\frac{1}{128} V^{g_5 g_6}_{g_7 g_8}\, t^{p_9 p_{10}}_{h_{11} h_{12}}\, t^{p_{13} p_{14}}_{h_{15} h_1} \langle 0|\{h_2^\dagger p_4 p_3\}\{g_5^\dagger g_6^\dagger g_8 g_7\}\{p_9^\dagger p_{10}^\dagger h_{12} h_{11}\}\{p_{13}^\dagger p_{14}^\dagger h_{15}\}|0\rangle \\
&+\frac{1}{128} V^{g_5 g_6}_{g_7 g_8}\, t^{p_9 p_{10}}_{h_{11} h_{12}}\, t^{p_{13} p_{14}}_{h_1 h_{16}} \langle 0|\{h_2^\dagger p_4 p_3\}\{g_5^\dagger g_6^\dagger g_8 g_7\}\{p_9^\dagger p_{10}^\dagger h_{12} h_{11}\}\{p_{13}^\dagger p_{14}^\dagger h_{16}\}|0\rangle
\end{aligned} \tag{2}
$$

Even when a contraction gives rise to nonvanishing terms, the inspection of the number of remaining creation and annihilation operators and their types in the strings can indicate that they eventually lead to only vanishing contributions when fully contracted. TCE identifies and erases partially contracted strings that will vanish when fully contracted. It is critical to erase them early to maintain the number of partially contracted strings at a manageable level, which otherwise tends to grow exponentially. In the above example (eq 2), the first two terms vanish when fully contracted.

A contraction can often give rise to the equivalent terms multiple times, which differ merely in the order of tensor parts of operators (e.g., integrals and excitation amplitudes), the order of permutable indices (e.g., indices of an integral tensor), or the labels of common (summation) indices. A simple and efficient way to consolidate these equivalent terms is to recast the expressions in a canonical form (ref 3) so that the expressions of the two equivalent terms become identical character by character and are rapidly merged. The definition of a canonical form is rather arbitrary, and TCE adopts the following: the tensor parts of the operators are in alphabetical order, then in the order of their ranks, then in the order of labels of the indices that are not among the common indices. All common indices are subsequently relabeled and sorted in a unique order. Equation 2 has only two distinct nonvanishing terms and is hence rewritten in the canonical form as

$$
\begin{aligned}
&-\frac{1}{64}\, t^{p_5 p_6}_{h_7 h_1}\, t^{p_8 p_9}_{h_{10} h_{11}}\, V^{g_{12} g_{13}}_{g_{14} g_{15}} \langle 0|\{h_2^\dagger p_4 p_3\}\{g_{12}^\dagger g_{13}^\dagger g_{15} g_{14}\}\{p_5^\dagger p_6^\dagger h_7\}\{p_8^\dagger p_9^\dagger h_{11} h_{10}\}|0\rangle \\
&-\frac{1}{64}\, t^{p_5 p_6}_{h_7 h_1}\, t^{p_8 p_9}_{h_{10} h_{11}}\, V^{g_{12} g_{13}}_{g_{14} g_{15}} \langle 0|\{h_2^\dagger p_4 p_3\}\{g_{12}^\dagger g_{13}^\dagger g_{15} g_{14}\}\{p_8^\dagger p_9^\dagger h_{11} h_{10}\}\{p_5^\dagger p_6^\dagger h_7\}|0\rangle
\end{aligned} \tag{3}
$$

For an ansatz containing 2n second-quantized operators, n cycles of an iterative contraction procedure lead to fully contracted expressions

$$
\begin{aligned}
&-\frac{1}{2} t^{p_5 p_3}_{h_1 h_2} t^{p_6 p_4}_{h_7 h_8} V^{h_7 h_8}_{p_5 p_6}
+\frac{1}{2} t^{p_5 p_4}_{h_1 h_2} t^{p_6 p_3}_{h_7 h_8} V^{h_7 h_8}_{p_5 p_6}
-\frac{1}{2} t^{p_3 p_4}_{h_5 h_1} t^{p_6 p_7}_{h_8 h_2} V^{h_5 h_8}_{p_6 p_7} \\
&+\frac{1}{2} t^{p_3 p_4}_{h_5 h_2} t^{p_6 p_7}_{h_8 h_1} V^{h_5 h_8}_{p_6 p_7}
- t^{p_5 p_4}_{h_6 h_1} t^{p_7 p_3}_{h_8 h_2} V^{h_6 h_8}_{p_5 p_7}
+ t^{p_5 p_3}_{h_6 h_1} t^{p_7 p_4}_{h_8 h_2} V^{h_6 h_8}_{p_5 p_7} \\
&+\frac{1}{4} t^{p_5 p_6}_{h_1 h_2} t^{p_3 p_4}_{h_7 h_8} V^{h_7 h_8}_{p_5 p_6}
+\frac{1}{4} t^{p_3 p_4}_{h_1 h_2} t^{p_5 p_6}_{h_7 h_8} V^{h_7 h_8}_{p_5 p_6}
\end{aligned} \tag{4}
$$

for the above example. They need to be further simplified by virtue of the topological properties of the expressions akin to those of corresponding diagrammatic representations. First, the tensor contraction expressions that correspond to disconnected diagrams may optionally be deleted. The connectedness of tensor contraction expressions can be inferred straightforwardly by chasing the tensor indices. In eq 4, the last term is disconnected because the only contraction takes place between tensor V and one of the two tensors t. Examining the connectedness may also identify “cyclic” tensor contraction expressions, such as

$$
+\frac{1}{8} \left(t^{p_1 p_2}_{h_3 h_4}\right)^{*} t^{p_5 p_6}_{h_3 h_4} V^{p_1 p_2}_{p_5 p_6} \tag{5}
$$

whose corresponding diagram contains a closed loop formed by more than two vertices. A cyclic tensor contraction typically arises when deexcitation operators are employed in the ansatz. The expressions of this type cannot be handled effectively by the canonicalization technique. When TCE detects a cyclic tensor contraction expression, it performs a more rigorous comparison of two expressions by permuting the common indices in all possible ways to examine whether the two are equivalent.

Subsequently, TCE examines index permutation symmetries among the tensor contraction expressions. When there are two or more tensor contraction expressions that are related to each other by index permutation, TCE consolidates them into one tensor contraction expression acted upon by a sum of operators that permute just the indices of output tensors (a permutation of common indices merely gives rise to equivalent expressions). Equation 4 (after the disconnected term is deleted) can thus be simplified to

$$
\begin{aligned}
&+\frac{1}{2}\left(1 - P^{p_3 p_4 h_1 h_2}_{p_4 p_3 h_1 h_2}\right) t^{p_5 p_4}_{h_1 h_2} t^{p_6 p_3}_{h_7 h_8} V^{h_7 h_8}_{p_5 p_6}
-\frac{1}{2}\left(1 - P^{p_3 p_4 h_1 h_2}_{p_3 p_4 h_2 h_1}\right) t^{p_3 p_4}_{h_5 h_1} t^{p_6 p_7}_{h_8 h_2} V^{h_5 h_8}_{p_6 p_7} \\
&+\frac{1}{4} t^{p_5 p_6}_{h_1 h_2} t^{p_3 p_4}_{h_7 h_8} V^{h_7 h_8}_{p_5 p_6}
-\left(1 - P^{p_3 p_4 h_1 h_2}_{p_4 p_3 h_1 h_2}\right) t^{p_5 p_4}_{h_6 h_1} t^{p_7 p_3}_{h_8 h_2} V^{h_6 h_8}_{p_5 p_7}
\end{aligned} \tag{6}
$$

where permutation operators are also expressed in the tensor notation (subscript particle indices of the permutation operator are replaced by the corresponding superscript particle indices, and superscript hole indices are replaced by the corresponding subscript hole indices in the above equation). Occasionally, an index permutation of output tensors acting upon a tensor contraction expression results in an expression equivalent to the original expression. This occurs when there are two or more equivalent tensors in the expression (i.e., the tensors of the same type and rank contracted in the same topological manner). In this situation, TCE prefers to rewrite the expression in a more symmetric form with a permutation operator, for reasons that will become apparent. The final tensor contraction expressions for eq 1 are

$$
\begin{aligned}
&+\frac{1}{2}\left(1 - P^{p_3 p_4 h_1 h_2}_{p_4 p_3 h_1 h_2}\right) t^{p_5 p_4}_{h_1 h_2} t^{p_6 p_3}_{h_7 h_8} V^{h_7 h_8}_{p_5 p_6}
-\frac{1}{2}\left(1 - P^{p_3 p_4 h_1 h_2}_{p_3 p_4 h_2 h_1}\right) t^{p_3 p_4}_{h_5 h_1} t^{p_6 p_7}_{h_8 h_2} V^{h_5 h_8}_{p_6 p_7} \\
&+\frac{1}{4} t^{p_5 p_6}_{h_1 h_2} t^{p_3 p_4}_{h_7 h_8} V^{h_7 h_8}_{p_5 p_6}
-\frac{1}{2}\left(1 - P^{p_3 p_4 h_1 h_2}_{p_4 p_3 h_1 h_2} + P^{p_3 p_4 h_1 h_2}_{p_4 p_3 h_2 h_1} - P^{p_4 p_3 h_1 h_2}_{p_4 p_3 h_2 h_1}\right) t^{p_5 p_4}_{h_6 h_1} t^{p_7 p_3}_{h_8 h_2} V^{h_6 h_8}_{p_5 p_7}
\end{aligned} \tag{7}
$$

Note that the permutation operator of the last term of eq 6 is symmetrized.

2.2. Generation of an Operation Tree. Although multiple tensor contraction expressions, such as eq 7, may be implemented for the purpose of verifying the expressions themselves, they are premature for the synthesis of a high-performance program. They must first undergo (1) the canonicalization of permutation operator expressions, (2) the strength reduction (which determines the order of contractions), (3) the canonicalization of binary tensor contraction expressions, (4) the factorization, and (5) the common subexpression elimination (intermediate reuse). The result of these processes is an operation tree consisting of binary tensor contractions and additions.

The canonicalization of permutation operator expressions serves the dual purposes of guiding the program generator to invoke index permutation symmetry and facilitating the subsequent optimizations. The use of the index permutation symmetry is crucial in a many-electron theory calculation because it dramatically reduces both the storage and operation costs of a tensor contraction and it also enhances the stability of the calculation. Consider the following example taken from the CCSDT $T_3$ amplitude equation:

$$
\chi^{p_4 p_5 p_6}_{h_1 h_2 h_3} = +\frac{1}{2}\left(1 - P^{p_4 p_5 p_6 h_3 h_1 h_2}_{p_4 p_5 p_6 h_2 h_1 h_3} - P^{p_4 p_5 p_6 h_3 h_2 h_1}_{p_4 p_5 p_6 h_1 h_2 h_3}\right) t^{p_4 p_5 p_6}_{h_7 h_8 h_3} V^{h_7 h_8}_{h_1 h_2} \tag{8}
$$
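To give a sense of scale for the storage claim, the following minimal sketch counts the elements of a rank-6 amplitude tensor of the kind appearing in eq 8, stored with and without the index ordering that antisymmetry permits. It assumes antisymmetry within the three particle indices and within the three hole indices, ignores spin and spatial symmetry, and uses hypothetical orbital counts (`n_particle = 40`, `n_hole = 10`) chosen only for illustration. The function names are illustrative and are not part of TCE.

```python
from math import comb


def full_count(n_particle, n_hole, rank=3):
    """Elements stored when every index runs over its full range."""
    return n_particle**rank * n_hole**rank


def unique_count(n_particle, n_hole, rank=3):
    """Elements stored when indices are strictly ordered within each set,
    e.g. p4 < p5 < p6 and h1 < h2 < h3 for an antisymmetric tensor."""
    return comb(n_particle, rank) * comb(n_hole, rank)


# Hypothetical problem size, for illustration only.
n_particle, n_hole = 40, 10
full = full_count(n_particle, n_hole)
unique = unique_count(n_particle, n_hole)
print(f"full storage:        {full}")
print(f"nonredundant storage: {unique}")
print(f"reduction factor:     {full / unique:.1f}")
```

For large index ranges the reduction factor approaches (3!)² = 36 from above; for small ranges it is larger because tuples with repeated indices, which vanish by antisymmetry, are also excluded.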

By virtue of the index permutation symmetry of input and output tensors

$$
t^{p_4 p_5 p_6}_{h_7 h_8 h_3} = -t^{p_4 p_5 p_6}_{h_8 h_7 h_3} = -t^{p_5 p_4 p_6}_{h_7 h_8 h_3} = \cdots = t^{p_6 p_5 p_4}_{h_3 h_8 h_7} \tag{9}
$$

$$
V^{h_7 h_8}_{h_1 h_2} = -V^{h_7 h_8}_{h_2 h_1} = -V^{h_8 h_7}_{h_1 h_2} = V^{h_8 h_7}_{h_2 h_1} \tag{10}
$$

we may store just the nonredundant elements of each tensor, which is equivalent to restricting the ranges of indices as p4

In eq 11, the permutable sets are $\{p_4, p_5, p_6\}$ and $\{h_1, h_2\}$. The canonical expression of a permutation operator is unique and is hence a convenient representation for the subsequent strength reduction, factorization, and common subexpression elimination processes.

Strength reduction refers to the process of finding the order of contractions in a multiple tensor contraction with the minimal number of arithmetic operations. Owing to the associativity and commutativity of a tensor contraction, the outcome of a multiple tensor contraction is invariant to the contraction order, whereas

p

h8p4

A canonical permutation operator maps the index ranges of the tensor it acts upon back onto the original index ranges as it permutes the indices. Consequently, as eq 14 illustrates, it is not necessary to lift the restrictions on the index ranges of $\xi$, and hence it offers the most compact way of performing tensor contractions with index permutation symmetry (see also section

The overall permutation operator ($P_9$ in abbreviated notation) is factorized such that intermediate tensors ($\xi$ and $\kappa$) become antisymmetric to the interchange of any pair of contravariant or covariant indices of the output tensor $\chi$ (the external indices) and to the interchange of any pair of the remaining contravariant or covariant indices (the internal indices). The intermediate h8p4

As this example illustrates, the appropriate form of a component permutation operator is dictated by the form of the binary tensor contraction. Therefore, TCE performs strength reduction by generating component permutation operators on the basis of the binary tensor contractions and subsequently examining the compatibility between the component permutation operators and the overall permutation operator. When the overall permutation operator is not equal to the product of the component permutation operators, TCE is unable to process the equations any further. This might appear to be a limitation of TCE, but this is not the case. Incompatibility occurs only when the ansatz supplied to TCE intrinsically violates the antisymmetry of wave functions and hence the mechanism prevents a faulty ansatz from going undetected.

Judging the identity of two permutation operators deserves caution. For example, apparently distinct permutation operators $P^{h_1 h_2 h_3 p_5 p_6 p_4}_{h_1 h_2 h_3 p_4 p_5 p_6}$ and $-P^{h_1 h_2 h_3 p_6 p_5 p_4}_{h_1 h_2 h_3 p_4 p_5 p_6}$ are equivalent to each other in the context of eq 16 because $\xi^{h_8 p_4 < p_5}_{h_1 < h_2 < h_3}$ is antisymmetric to the interchange of $p_4$ and $p_5$. The canonicalization of permutation operators expedites this judgment by bringing the expressions of two equivalent permutation operators into an identical form. The symmetrization of permutation operators mentioned in section 2.1 also concerns this process. Because the products of component permutation operators are symmetric with respect to the interchange of two equivalent tensors, overall permutation operators must be symmetrized beforehand.

Let us consider the following example of binary tensor contractions and additions that result from the strength reduction process:

$$
\chi^{p_4 p_5 p_6}_{h_1 h_2 h_3} = P_9\, \xi^{h_7 p_4}_{h_1 h_3} t^{p_5 p_6}_{h_2 h_7} - \frac{1}{2} P_9\, \kappa^{h_7 p_6}_{h_1 h_2} t^{p_4 p_5}_{h_3 h_7} \tag{19}
$$

$$
\xi^{h_7 p_4}_{h_1 h_3} = P_2\, t^{p_4 p_8}_{h_3 h_9} V^{h_7 h_9}_{h_1 p_8} \tag{20}
$$

$$
\kappa^{h_7 p_6}_{h_1 h_2} = t^{p_8 p_9}_{h_1 h_2} V^{h_7 p_6}_{p_8 p_9} \tag{21}
$$

Equation 19 is factorizable, and the two binary tensor contractions and an addition can be converted into a binary tensor addition and a contraction by virtue of the distributive and commutative nature of tensor contraction. However, in the above representation of the tensor expressions, the factorizability is inconspicuous, owing to the apparent mismatch of the indices in the common tensors ($t^{p_5 p_6}_{h_2 h_7}$ and $t^{p_4 p_5}_{h_3 h_7}$ in the above equation). The mismatch arises partially from the degrees of freedom in representing a set of tensor contractions that are related to each other by index permutation symmetry. Therefore, prior to factorization, TCE canonicalizes binary tensor contraction expressions so that the two tensors that are contracted are sorted in a unique order and the external indices are in ascending order across the two tensors (the corresponding permutation operator is also re-expressed accordingly). Equations 19-21 are rewritten as

$$
\chi^{p_4 p_5 p_6}_{h_1 h_2 h_3} = -P_9\, t^{p_4 p_5}_{h_1 h_7} \xi^{h_7 p_6}_{h_2 h_3} - \frac{1}{2} P_9\, t^{p_4 p_5}_{h_1 h_7} \kappa^{h_7 p_6}_{h_2 h_3}
= -P_9\, t^{p_4 p_5}_{h_1 h_7} \left( \xi^{h_7 p_6}_{h_2 h_3} + \frac{1}{2} \kappa^{h_7 p_6}_{h_2 h_3} \right) \tag{22}
$$

$$
\xi^{h_7 p_4}_{h_1 h_2} = -P_2\, t^{p_4 p_8}_{h_1 h_9} V^{h_7 h_9}_{h_2 p_8} \tag{23}
$$

$$
\kappa^{h_7 p_4}_{h_1 h_2} = t^{p_8 p_9}_{h_1 h_2} V^{h_7 p_4}_{p_8 p_9} \tag{24}
$$

in the canonical form that exposes the common tensors and lends itself to rapid factorization. (It also manifests the common loop indices and expedites loop fusion and space-time tradeoff optimization (ref 21), which are, however, beyond the scope of this study.) TCE identifies every factorizable pair of binary tensor contractions and replaces each of them by an addition and a contraction, under the tacit assumption that the strength reduction and the factorization are decoupled. Finding an optimal contraction sequence of a coupled strength reduction and factorization problem involves an intractably large search space, and the simple exhaustive search algorithm adopted here for the decoupled problem will not be adequate, although an effective approach to finding a near-optimal solution does seem to exist (e.g., the recursive intermediate factorization scheme of Kucharski and Bartlett (ref 4)). Nevertheless, the gain in performance on going from an optimal sequence of the decoupled problem to that of the coupled one will not be substantial (see section 3).

TCE also supports the common subexpression elimination that permits the precomputation and reuse of the results of equivalent contractions (persistent intermediates as opposed to volatile intermediates) of two input (nonintermediate) tensors at the cost of increased storage space. This optimization, despite its appeal, will not necessarily be invoked in the applications discussed in this article because the reduction in the operation cost by the optimization is relatively insignificant and is largely canceled by increased memory and I/O operation cost in practice.

Tables 1-3 are examples of operation trees generated by TCE, which may be viewed as intermediate codes that describe the operations, the order of execution, and the data dependencies. Each operation that appears in the trees is either a unary tensor substitution, a binary tensor contraction, or a tensor addition by virtue of the strength reduction. Many of the volatile intermediates $\xi$ are defined as a sum of tensors as a consequence of the factorization, and the intermediates represented by ¥ are persistent intermediates that are reused several times. The order of entries in the operation trees reflects the data flow. Internally, TCE stores an operation tree as a data structure analogous to that of a directed acyclic graph (DAG) (ref 22), each node of the tree representing a binary tensor contraction or a unary tensor substitution and an edge connecting two adjacent nodes corresponding to a tensor addition. The leaf nodes of the tree are binary tensor contractions or unary tensor substitutions of input tensors or persistent intermediate tensors. The definition of the

TABLE 1: Operation Tree for the CCSD Energy Equation

$$
\begin{aligned}
(\xi_1)^{h_6}_{p_5} &= +f^{h_6}_{p_5} + \frac{1}{2} t^{p_3}_{h_4} V^{h_4 h_6}_{p_3 p_5} \\
e &= +t^{p_5}_{h_6} (\xi_1)^{h_6}_{p_5} + \frac{1}{4} t^{p_1 p_2}_{h_3 h_4} V^{h_3 h_4}_{p_1 p_2}
\end{aligned}
$$

TABLE 2: Operation Tree for the CCSD T1 Amplitude Equation

$$
\begin{aligned}
(¥_1)^{h_7}_{p_3} &= -t^{p_5}_{h_6} V^{h_6 h_7}_{p_3 p_5} \\
(\xi_{22})^{h_7}_{p_3} &= +f^{h_7}_{p_3} + (¥_1)^{h_7}_{p_3} \\
(\xi_2)^{h_7}_{h_1} &= +f^{h_7}_{h_1} + t^{p_3}_{h_1} (\xi_{22})^{h_7}_{p_3} - t^{p_4}_{h_5} V^{h_5 h_7}_{h_1 p_4} - \frac{1}{2} t^{p_3 p_4}_{h_1 h_5} V^{h_5 h_7}_{p_3 p_4} \\
(\xi_3)^{p_2}_{p_3} &= +f^{p_2}_{p_3} - t^{p_4}_{h_5} V^{h_5 p_2}_{p_3 p_4} \\
(\xi_5)^{h_8}_{p_7} &= +f^{h_8}_{p_7} + (¥_1)^{h_8}_{p_7} \\
(\xi_6)^{h_4 h_5}_{h_1 p_3} &= +V^{h_4 h_5}_{h_1 p_3} - t^{p_6}_{h_1} V^{h_4 h_5}_{p_3 p_6} \\
r^{p_2}_{h_1} &= +f^{p_2}_{h_1} - t^{p_2}_{h_7} (\xi_2)^{h_7}_{h_1} + t^{p_3}_{h_1} (\xi_3)^{p_2}_{p_3} - t^{p_3}_{h_4} V^{h_4 p_2}_{h_1 p_3} + t^{p_2 p_7}_{h_1 h_8} (\xi_5)^{h_8}_{p_7} \\
&\quad - \frac{1}{2} t^{p_2 p_3}_{h_4 h_5} (\xi_6)^{h_4 h_5}_{h_1 p_3} - \frac{1}{2} t^{p_3 p_4}_{h_1 h_5} V^{h_5 p_2}_{p_3 p_4}
\end{aligned}
$$
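The exhaustive search over binary contraction orders can be illustrated with a minimal sketch. It takes the three-tensor term $t^{p_5 p_4}_{h_1 h_2} t^{p_6 p_3}_{h_7 h_8} V^{h_7 h_8}_{p_5 p_6}$ of eq 6 and ranks the possible orders by a simple operation count (the product of the ranges of all indices involved in each binary contraction). This is an illustration under stated assumptions, not TCE's own code: the cost model ignores permutation, spin, and spatial symmetries as well as memory cost, the orbital counts are hypothetical, and the names `best_order` and `extent` are introduced only for this example.

```python
from itertools import combinations


def extent(index, n_hole, n_particle):
    """Range of one index: labels starting with 'h' are holes, 'p' are particles."""
    return n_hole if index.startswith("h") else n_particle


def best_order(tensors, external, n_hole, n_particle):
    """Exhaustively search binary contraction orders.

    tensors  : dict mapping a tensor name to a frozenset of its index labels
    external : set of index labels carried by the final output tensor
    Returns (total multiply count, list of contracted name pairs).
    """
    if len(tensors) == 1:
        return 0, []
    best = None
    for x, y in combinations(sorted(tensors), 2):
        rest = {k: v for k, v in tensors.items() if k not in (x, y)}
        # An index survives if the output or another remaining tensor needs it.
        needed = set(external).union(*rest.values())
        joined = tensors[x] | tensors[y]
        cost = 1
        for index in joined:
            cost *= extent(index, n_hole, n_particle)
        product = {f"({x}*{y})": frozenset(joined & needed)}
        sub_cost, sub_steps = best_order(
            {**rest, **product}, external, n_hole, n_particle
        )
        if best is None or cost + sub_cost < best[0]:
            best = (cost + sub_cost, [(x, y)] + sub_steps)
    return best


tensors = {
    "t1": frozenset({"p5", "p4", "h1", "h2"}),
    "t2": frozenset({"p6", "p3", "h7", "h8"}),
    "v": frozenset({"h7", "h8", "p5", "p6"}),
}
external = {"p3", "p4", "h1", "h2"}

# Hypothetical orbital counts, for illustration only.
total, steps = best_order(tensors, external, n_hole=10, n_particle=40)
print(total, steps)
```

With these ranges, contracting the second amplitude with the integral first yields a rank-2 intermediate and is cheaper by orders of magnitude than either alternative, which is the kind of difference strength reduction is meant to exploit.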

TABLE 3: Operation Tree for the CCSD T2 Amplitude takes place under the above listed conditions. The conditions Equation preclude any contraction except for that between an excitation h7h10 )+p5Vh7h10 amplitude tensor and an integral tensor and that between an (¥1)h p th p p 1 9 1 5 9 excitation amplitude tensor and an intermediate tensor. Because h10h11 )-1 p7p8Vh10h11 (¥2)h h th h p p 1 2 2 1 2 7 8 an integral tensor is encompassed by eq 25, we can confine h10p3 )-1 p6Vh10p3 our analysis to the binary tensor contraction of the form (¥3)h p th p p 1 5 2 1 5 6 h10 )-p6Vh7h10 <‚‚‚< <‚‚‚< < <‚‚‚< <‚‚‚< < <‚‚‚< (¥4)p th p p e1 ea,i1 ic cc+1 cd eb+1 eg ce+1 cf 5 7 5 6 Pê <‚‚‚< <‚‚‚< < <‚‚‚< t <‚‚‚< < + <‚‚‚< (26) ea+1 eb,id+1 ie ce+1 cf eg+1 eh cc 1 cd h10h11 )+Vh10h11 + 1 h10h11 (ê222)h p h p (¥1)h p 1 5 1 5 2 1 5 h10h11 h10h11 p5 h10h11 h10h11 (ê22) )-V + P2t (ê222) + (¥2) where P is a sum of permutation operators that makes the tensor h1h2 h1h2 h1 h2p5 h1h2 h10p3 )+Vh10p3 + h10p3 it acts upon antisymmetric to the interchange of any pair of (ê23)h p h p (¥3)h p 1 5 1 5 1 5 contravariant or covariant external indices. Notice that the h10 )+ h10 + h10 (ê24)p f p (¥4)p 5 5 5 common (summation) indices are among the internal indices h7h10 )+Vh7h10 + h7h10 (ê25)h p h p (¥1)h p of the intermediate tensor ê and that an index of the excitation 1 9 1 9 1 1 9 h10p3 )+Vh10p3 + p3 h10h11 - p5 h10p3 amplitude tensor t is either a common index or an external index. (ê2)h h h h th (ê22)h h P2th (ê23)h p 1 2 1 2 2 11 1 2 1 2 5 The contraction of eq 26 is performed in the following three 1 - tp3p5(ê )h10 + P tp3p9(ê )h7h10 + tp5p6Vh10p3 distinct steps. First, we “decompress” the tensors by lifting some h1h2 24 p5 2 h1h7 25 h2p9 h1h2 p5p6 2 of the restrictions on the index ranges p3p4 )+Vp3p4 - 1 p6Vp3p4 (ê3)h p h p th p p 1 5 1 5 2 1 5 6 e <‚‚‚

p3p5 p4 1 p3p4 h9h11 p3p5 h6p4 1 p5p6 p3p4 <‚‚‚< <‚‚‚< <‚‚‚< + P t (ê ) - t (ê ) - P t (ê ) + t V e1 ea,i1 ic,eb+1 eg ) - - 2 h1h2 5 p5 h9h11 6 h1h2 4 h1h6 7 h2p5 h1h2 p5p6 η <‚‚‚< <‚‚‚< <‚‚‚< (d c)!(f e)! 2 2 ea+1 eb,id+1 ie,eg+1 eh

e1<‚‚‚

Figure 1. Overview of a subroutine automatically generated by TCE for tensor contraction $\chi^{p_4 p_5 p_6}_{h_1 h_2 h_3} = \frac{1}{2} P_3\, t^{p_4 p_7 p_8}_{h_1 h_2 h_3} V^{p_5 p_6}_{p_7 p_8}$. Sorted and unsorted tiles of tensors are distinguished by primes.
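The tile-level symmetry screening on which a subroutine of this kind relies can be sketched in a few lines. The example below is a deliberately reduced illustration, not the generated subroutine: it multiplies two matrices rather than contracting higher-rank tensors, it treats only a spatial-symmetry-like label (one hypothetical irrep label per index, with an element nonzero only when both labels agree), and it leaves out index permutation symmetry, tile sorting, and parallel execution. All names are assumptions made for this sketch.

```python
import random


def make_tiles(irreps, tile_size):
    """Partition index positions into tiles that each hold a single irrep."""
    tiles = []
    for irrep in sorted(set(irreps)):
        members = [i for i, g in enumerate(irreps) if g == irrep]
        for start in range(0, len(members), tile_size):
            tiles.append((irrep, members[start:start + tile_size]))
    return tiles


def symmetric_matrix(irreps, rng):
    """Random matrix whose elements vanish unless row and column irreps agree."""
    n = len(irreps)
    return [[rng.random() if irreps[i] == irreps[j] else 0.0
             for j in range(n)] for i in range(n)]


def tiled_product(a, b, tiles):
    """C = A B, testing symmetry once per tile instead of once per element."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    tile_products = 0
    for g_row, rows in tiles:
        for g_col, cols in tiles:
            if g_row != g_col:        # output tile vanishes by symmetry
                continue
            for g_sum, inner in tiles:
                if g_sum != g_row:    # input tiles vanish by symmetry
                    continue
                tile_products += 1
                for i in rows:
                    for j in cols:
                        c[i][j] += sum(a[i][k] * b[k][j] for k in inner)
    return c, tile_products


def full_product(a, b):
    """Reference product over the full index ranges, with no screening."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


irreps = [0, 0, 0, 1, 1, 0, 1, 1, 1, 0]   # hypothetical irrep label per index
rng = random.Random(0)
a = symmetric_matrix(irreps, rng)
b = symmetric_matrix(irreps, rng)
tiles = make_tiles(irreps, tile_size=2)
c_tiled, tile_products = tiled_product(a, b, tiles)
c_full = full_product(a, b)
max_diff = max(abs(x - y) for row_t, row_f in zip(c_tiled, c_full)
               for x, y in zip(row_t, row_f))
print(len(tiles), tile_products, max_diff)
```

The innermost loops contain no conditionals, which is the point of moving the symmetry tests to the tile level, and the tile size is a free parameter that trades the number of tiles against the memory needed per tile.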

Executing conditionals 31 and 32 or imposing the index-range restrictions element by element in the process of tensor contraction is inefficient owing to the substantial overhead of having IF statements in a performance-critical section (i.e., matrix multiplications and matrix sorts) of the calculation. Alternatively, one might decompress tensors (cf. eqs 27 and 28) and rearrange their elements according to the spin and spatial symmetry to expose a long uninterrupted sequence of arithmetic operations. However, such an algorithm tends to require a large memory space to accommodate the decompressed tensor that contains many redundant elements. To overcome this algorithmic dilemma, we resort to index-range tiling (also called index-range blocking) that partitions indices into tiles (blocks) with each tile consisting of indices with the same hole or particle type, spin symmetry, and spatial symmetry, and we exploit these symmetries at the tile level (rather than at the element level). [...] without explicitly decompressing (compressing) the tensor. Furthermore, the tiling also exposes an adequate granularity of parallelism (the tile-level matrix multiplications and matrix sorts can be performed in parallel) and offers a means to control the peak memory usage by adjusting the tile sizes at run time.

Figure 1 illustrates a program automatically synthesized by TCE for tensor contraction $\chi^{p4p5p6}_{h1h2h3} = \tfrac{1}{2} P_3\, t^{p4p7p8}_{h1h2h3} V^{p5p6}_{p7p8}$. The nested outer loops (line 1) run over the tiles of tensor $\chi$ with the tile range restrictions $p5t \le p6t$ and $h1t \le h2t \le h3t$. Although tensor $\chi$ is stored with the full restrictions of $p4t \le p5t \le p6t$ and $h1t \le h2t \le h3t$, we cannot impose $p4t \le p5t$ in the contraction stage because $-t^{p5t\,p7t \le p8t}_{h1t \le h2t \le h3t} V^{p4t \le p6t}_{p7t \le p8t}$ and $t^{p6t\,p7t \le p8t}_{h1t \le h2t \le h3t} V^{p4t \le p5t}_{p7t \le p8t}$ as well as $t^{p4t\,p7t \le p8t}_{h1t \le h2t \le h3t} V^{p5t \le p6t}_{p7t \le p8t}$ contribute to $\chi^{p4t \le p5t \le p6t}_{h1t \le h2t \le h3t}$ by virtue of the permutation operator. Immediately within the loops, we execute a conditional of spin and spatial symmetry to know whether the tile of tensor $\chi$ has any nonvanishing element (line 2). When the tile is nonzero, the operation enters the inner nested loops over the common indices $p7t \le p8t$ (line 3), which are followed by a conditional (line 4) to test whether the tile of tensor V is p4 [...] p8t (lines 11 and 12). Finally, the tile of tensor $\chi$ is mapped to a symmetrically unique tile and is added to storage (lines 14-16). Note that this mapping corresponds to the reverse operation of the element canonicalized permutation and also that the conditionals (lines 14-16) overlap with one another because the permutation applies to diagonal tiles also.

The nested outer loops (line 1) are parallelized, and the computational workload (the tile retrieval, tile-level tensor sorts and contractions, and tile accumulation) is balanced between processors by dynamical workload allocation. Therefore, each processor performs asynchronously the workload dynamically assigned to it within its local memory space that accommodates the tiles of the tensors, and synchronization between processors occurs only at the conclusion of the tensor contraction process (line 21). Dynamical load balancing is essential because tile sizes can be significantly different from one another and it is hard to distribute the workload statically evenly across processors within the present algorithmic framework. Although they are inconspicuous, interprocessor interactions exist in the tile retrieval and accumulation processes as overlapping I/O operations to the tensor storage.

TCE generates subroutines (Figure 1) for unary tensor substitution and binary tensor contraction and also caller subroutines (Figure 2). We have interfaced these subroutines with the computational chemistry program suite UTCHEM (ref 24) (in Fortran90) for sequential executions and with NWCHEM (ref 25) (in Fortran77) for sequential and parallel executions on the basis of the following parallel I/O schemes: (1) a shared file algorithm based on a global file system using the shared file library (ref 26), (2) a replicated file algorithm based on a distributed or global file system using the exclusive access file library (ref 26), and (3) an incore algorithm using the global array library (ref 26). In the shared file algorithm, each tensor is stored in a file on a global file system accessed by all processors, whereas in the replicated algorithm each processor has its own copy of a file and updates the copy asynchronously. In the latter algorithm, the copies of the file must be reconciled at the end of tensor contraction so that each copy contains a complete intermediate tensor. In the incore algorithm, a global memory space residing across all computer nodes is employed in lieu of a file system, and hence I/O operations are replaced by interprocessor communications.

Tensors are stored as a 1D array of symmetrically unique tiles, and the elements of each tile are stored consecutively. The positions (offsets) of the tiles in the storage are precomputed and stored in memory for each intermediate tensor by a subroutine generated by TCE. Figure 2 illustrates a sequence of operations for evaluating the CCSD energy equation (Table 1). In lines 1-3, the program precomputes the offsets, size, and name of a file that reserves the storage space for intermediate $\xi_1$. Subsequently, it issues subroutine calls for unary tensor substitution $(\xi_1)^{h6}_{p5} = f^{h6}_{p5}$ (line 4) ($f^{h6}_{p5}$ denotes a Fock matrix element) and binary tensor contraction $(\xi_1)^{h6}_{p5} = (\xi_1)^{h6}_{p5} + \tfrac{1}{2} t^{p3}_{h4} V^{h4h6}_{p3p5}$ (line 5). When a replicated file algorithm is employed, tensor $\xi_1$, written in fragments in exclusive access files, is consolidated into a complete tensor, which is then broadcast to each processor (line 6). Tensor $\xi_1$ is used as one of the input tensors in the subsequent tensor contraction $e = t^{p5}_{h6} (\xi_1)^{h6}_{p5}$ (line 7) and is destroyed immediately after its use (line 8).

Figure 2. Caller subroutine for the CCSD energy equation generated by TCE.

3. Applications of the Tensor Contraction Engine

As an initial application, we have employed TCE to derive the equations of various models of CI, CC, and MBPT automatically and implement them into parallel computer programs within the computational chemistry program suites of NWCHEM (ref 25) and UTCHEM (ref 24). The models that have been implemented were spin-unrestricted CISD, CISDT, and CISDTQ, spin-unrestricted CCD, LCCD, CCSD, LCCSD, QCISD, CCSDT, and CCSDTQ, and noncanonical spin-unrestricted MBPT(2), MBPT(3), and MBPT(4). For details of these models, see refs 17-19.

The ansatz of CI theory (e.g., that of configuration-interaction single, double, and triple substitutions (CISDT)) may be given as

$$C = C_1 + C_2 + C_3 \tag{33}$$
$$E = Q_0 H_N (1 + C) Q_0 \tag{34}$$
$$0 = Q_1 H_N (1 + C) Q_0 - E Q_1 C Q_0 \tag{35}$$
$$0 = Q_2 H_N (1 + C) Q_0 - E Q_2 C Q_0 \tag{36}$$
$$0 = Q_3 H_N (1 + C) Q_0 - E Q_3 C Q_0 \tag{37}$$

where $Q_n$ is a projection operator onto the [...] of n-tuply substituted determinants from the reference wave function, $C_n$ is an n-fold excitation operator, $H_N$ is the normal-ordered Hamiltonian operator, and E is the correlation energy. Equation 34 can be expressed in an explicit normal-ordered second-quantized form as

$$
\begin{aligned}
E = {} & + f^{g1}_{g2} c^{p3}_{h4} \langle 0|\{g_1^\dagger g_2\}\{p_3^\dagger h_4\}|0\rangle
 + \tfrac{1}{4} V^{g1g2}_{g3g4} c^{p5}_{h6} \langle 0|\{g_1^\dagger g_2^\dagger g_4 g_3\}\{p_5^\dagger h_6\}|0\rangle \\
& + \tfrac{1}{4} f^{g1}_{g2} c^{p3p4}_{h5h6} \langle 0|\{g_1^\dagger g_2\}\{p_3^\dagger p_4^\dagger h_6 h_5\}|0\rangle
 + \tfrac{1}{16} V^{g1g2}_{g3g4} c^{p5p6}_{h7h8} \langle 0|\{g_1^\dagger g_2^\dagger g_4 g_3\}\{p_5^\dagger p_6^\dagger h_8 h_7\}|0\rangle \\
& + \tfrac{1}{36} f^{g1}_{g2} c^{p3p4p5}_{h6h7h8} \langle 0|\{g_1^\dagger g_2\}\{p_3^\dagger p_4^\dagger p_5^\dagger h_8 h_7 h_6\}|0\rangle
 + \tfrac{1}{144} V^{g1g2}_{g3g4} c^{p5p6p7}_{h8h9h10} \langle 0|\{g_1^\dagger g_2^\dagger g_4 g_3\}\{p_5^\dagger p_6^\dagger p_7^\dagger h_{10} h_9 h_8\}|0\rangle
\end{aligned}
\tag{38}
$$

Equations such as this can be parsed and processed by TCE, which subsequently synthesizes parallel computer programs. The synthesized programs can then be compiled and executed without any manual modification once appropriate interface programs to a computational chemistry program package are in place.

The ansatz of CC theory such as the following (coupled-cluster singles and doubles or CCSD)

$$T = T_1 + T_2 \tag{39}$$
$$E = Q_0 \left[ H_N \left( 1 + T + \tfrac{1}{2!} T^2 \right) \right]_C Q_0 \tag{40}$$
$$0 = Q_1 \left[ H_N \left( 1 + T + \tfrac{1}{2!} T^2 + \tfrac{1}{3!} T^3 \right) \right]_C Q_0 \tag{41}$$
$$0 = Q_2 \left[ H_N \left( 1 + T + \tfrac{1}{2!} T^2 + \tfrac{1}{3!} T^3 + \tfrac{1}{4!} T^4 \right) \right]_C Q_0 \tag{42}$$

can be processed in the same manner by TCE except that immediately after the contraction step disconnected terms need to be eliminated ($[\cdots]_C$ means that all terms in the second-quantized form of $H_N$ and T must be connected by common indices, and $T_n$ is an n-fold excitation operator). The operation trees obtained from the CCSD ansatz are shown in Tables 1-3.

In this article, we confine ourselves to “noncanonical” or “generalized” MBPT that is invariant to unitary transformations within the just-occupied and just-virtual spaces without resorting to Wigner’s 2n + 1 rule. Noncanonical MBPT in its tensor formulation is considered to be the most fundamental representation of the theory (ref 27) and is essential for linear scaling algorithms using local molecular orbitals (ref 14). As usual, we partition the Hamiltonian operator as a sum of the Fock operator F and the fluctuation potential V and apply Rayleigh-Schrödinger perturbation theory. Hence, the ansatz of second-order many-body perturbation theory [MBPT(2)] is

$$E^{(2)} = Q_0 V (T_1 + T_2) Q_0 \tag{43}$$
$$0 = Q_1 F (T_1 + T_2) Q_0 + Q_1 V Q_0 \tag{44}$$
$$0 = Q_2 F (T_1 + T_2) Q_0 + Q_2 V Q_0 \tag{45}$$

where $T_1$ and $T_2$ are the first-order single- and double-excitation operators, respectively. When evaluated by TCE, this ansatz leads to a tensor formulation of MBPT(2) (see also ref 14) that reads

$$E^{(2)} = +\tfrac{1}{4} t^{p1p2}_{h3h4} V^{h3h4}_{p1p2} \tag{46}$$
$$0 = -f^{h2}_{h9} t^{p1}_{h2} + f^{p1}_{p2} t^{p2}_{h9} + f^{h2}_{p3} t^{p3p1}_{h2h9} \tag{47}$$
$$
\begin{aligned}
0 = {} & -\left(1 - P^{p1p2h9h10}_{p2p1h9h10} - P^{p2p1h9h10}_{p2p1h10h9} + P^{p1p2h9h10}_{p2p1h10h9}\right) f^{p2}_{h9} t^{p1}_{h10}
 - \left(1 - P^{p1p2h9h10}_{p1p2h10h9}\right) f^{h3}_{h9} t^{p1p2}_{h3h10} \\
& - \left(1 - P^{p1p2h9h10}_{p2p1h9h10}\right) f^{p2}_{p3} t^{p3p1}_{h9h10} + V^{p1p2}_{h9h10}
\end{aligned}
\tag{48}
$$

which may be solved iteratively. It may be noticed that because a Hartree-Fock (HF) reference wave function satisfies the Brillouin theorem ($f^{p1}_{h3} = f^{h4}_{p2} = 0$) $T_1$ is zero. Employing this fact, we can write the ansatz of MBPT(3) as

$$E^{(3)} = Q_0 V (U_1 + U_2) Q_0 \tag{49}$$
$$0 = Q_1 F (U_1 + U_2) Q_0 + Q_1 V T_2 Q_0 \tag{50}$$
$$0 = Q_2 F (U_1 + U_2) Q_0 + Q_2 V T_2 Q_0 \tag{51}$$

where $U_1$ and $U_2$ are the second-order single- and double-excitation operators. Provided this ansatz, TCE generates the following coupled equations:

$$E^{(3)} = +\tfrac{1}{4} u^{p1p2}_{h3h4} V^{h3h4}_{p1p2} \tag{52}$$
$$0 = -f^{h2}_{h9} u^{p1}_{h2} + f^{p1}_{p2} u^{p2}_{h9} + \tfrac{1}{2} t^{p2p1}_{h3h4} V^{h3h4}_{h9p2} + \tfrac{1}{2} t^{p2p3}_{h4h9} V^{h4p1}_{p2p3} \tag{53}$$
$$
\begin{aligned}
0 = {} & -\left(1 - P^{p1p2h9h10}_{p1p2h10h9}\right) f^{h3}_{h9} u^{p1p2}_{h3h10}
 - \left(1 - P^{p1p2h9h10}_{p2p1h9h10}\right) f^{p2}_{p3} u^{p3p1}_{h9h10}
 + \tfrac{1}{2} t^{p1p2}_{h3h4} V^{h3h4}_{h9h10} \\
& + \left(1 - P^{p2p1h10h9}_{p1p2h10h9} - P^{p1p2h10h9}_{p1p2h9h10} + P^{p2p1h10h9}_{p1p2h9h10}\right) t^{p3p1}_{h4h10} V^{h4p2}_{h9p3}
 + \tfrac{1}{2} t^{p3p4}_{h9h10} V^{p1p2}_{p3p4}
\end{aligned}
\tag{54}
$$

The unrestricted CI and CC models can in principle take any reference wave function such as a restricted HF (RHF), an unrestricted HF (UHF), a restricted open-shell HF (ROHF), and a restricted/unrestricted Kohn-Sham wave function, whereas the unrestricted MBPT models are based either on an RHF or a UHF reference wave function. When an RHF wave function is used as a reference of a closed-shell system, the programs generated by TCE map β orbitals to the α orbitals having the same spatial part and avoid any redundant computation and storage associated with the all-β spin-components of tensors. For example, among the three independent spin-components of the $T_2$ tensor, $t^{\alpha\alpha}_{\alpha\alpha}$, $t^{\alpha\beta}_{\alpha\beta}$, and $t^{\beta\beta}_{\beta\beta}$, the CCD program generated by TCE processes and stores only the $t^{\alpha\alpha}_{\alpha\alpha}$ and $t^{\alpha\beta}_{\alpha\beta}$ spin components for an RHF reference. (However, this must be distinguished from spin-adapted CCD, which deals with just the $t^{\alpha\beta}_{\alpha\beta}$ spin components.) The coupled equations of the CI, CC, and MBPT models are solved iteratively by a common driver subroutine that updates excitation amplitude tensors by a combination of the Jacobi iteration and the DIIS (direct inversion in the iterative subspace) extrapolation.

It may be instructive to compare the theoretical operation counts of the CCSD program automatically generated by TCE with an equivalent program hand-coded by a group of experts in the [...]. According to Stanton et al. (ref 28), the number of arithmetic operations of their unrestricted CCSD program is approximately $(5/4)O^2V^4 + 20\,O^3V^3 + (5/2)O^4V^2$ in the leading order, where $O \approx O_\alpha \approx O_\beta$ is the number of α- or β-spin occupied orbitals and $V \approx V_\alpha \approx V_\beta$ is the number of α- or β-spin virtual orbitals. The operation count of the TCE-generated CCSD code (Table 3) is approximately $(5/4)O^2V^4 + (45/2)O^3V^3 + (25/2)O^4V^2$. The prefactors of the $O^2V^4$ and $O^3V^3$ terms, which usually dominate the operation count because $V \gg O$ in most CCSD applications, are the same or only slightly greater in the TCE-generated program than in the program of Stanton et al. Therefore, the TCE-generated CCSD program can compete with, albeit not outperform, the equivalent, carefully hand-coded program. The greater operation count of the former than that of the latter is caused by the decoupling of the strength reduction and factorization processes in TCE.

Table 4 compares the CPU time for a single CC iteration performed on a Hewlett-Packard Longs Peak Linux cluster consisting of Intel Itanium-2 1-GHz dual processors. A comparison of the first two rows of the Table attests to the tremendous speedup of the CCSDT calculation brought about by the use of point-group symmetry. The speedup that is actually observed usually does not exceed 70% of the theoretical maximum speedup of $h^2$, with h being the order of the point

group (ref 28), primarily because the partitioning of orbitals according to irreducible representations is irregular. We consider the speedup achieved by the TCE-generated program to be satisfactory.

TABLE 4: CPU Time (min) for a Single Coupled-Cluster Iteration on 1, 2, 4, 8, and 16 Intel Itanium-2 1-GHz Dual Processors of a Hewlett-Packard Longs Peak Linux Cluster

                                                        number of processors
molecule     symmetry   theory          I/O             1      2      4      8      16
CH2          C1         CCSDT/cc-pVTZ   global array    6.9    3.7    2.0    1.5    1.1
CH2          C2v        CCSDT/cc-pVTZ   global array    1.6    0.63   0.37   0.24   0.17
CH2          C2v        CCSDT/cc-pVTZ   replicated      1.6    0.68   0.35   0.20   0.13
C10H8+ (a)   C2v        CCSD/cc-pVDZ    replicated      7.2    4.2    2.6    2.1    1.7
NC4H5+ (b)   C2v        CCSDT/cc-pVDZ   replicated      550    320    190    140    110

(a) The azulene radical cation. (b) The pyrrole radical cation.

Stanton et al. (ref 28) also pointed out that point-group symmetry naturally subdivides a tensor contraction (matrix multiplication) into h independent operations that can be performed in parallel, although they did not actually parallelize their CCSD program. This is realized in the TCE-generated programs by virtue of the tiling algorithm and the dynamic load balancing scheme. The tiles can either coincide with symmetry blocks of orbitals or subdivide them and are therefore adjustable to available memory resources. The tile-level matrix multiplications and matrix sorts are performed in parallel within the local memory space of each processor with vectorized kernels (the DGEMM subroutine of the BLAS library). Although parallel performance falls off gradually as the number of processors increases, both the global array and replicated algorithms exhibit reasonable scalability for methylene. Expectedly, the replicated algorithm sustains slightly better parallel performance than the global array algorithm, owing to the greater communication overhead in the latter. The global array algorithm, however, is expected to be scalable with respect to problem size, but the replicated algorithm is not. If the fall off were primarily a consequence of small serial components within the CC iteration, then the scalability would improve for a more time-consuming calculation because the calculation would have an increased proportion of parallel components. The fact that the scalability does not improve for the azulene or pyrrole radical cation may therefore indicate that the overhead arising from the overlapping I/O operations is not negligible or the dynamical setting for the workload balancing is not entirely successful in decomposing the parallel portion of the problem evenly. It is not known at this moment whether stagnant scalability is inherent to many-electron theories that suffer from complex data dependency or if it can be overcome by an alternative parallel strategy that also maintains the universal applicability of TCE.

The results of the applications of the electron correlation methods implemented by TCE to the problem of the singlet-triplet separation of methylene are available in the Supporting Information.

4. Concluding Remarks

This article has described the symbolic manipulation program TCE that enables automatic derivation and program synthesis for general many-electron theories and has demonstrated its utility through its applications to the core theories (CI, CC, MBPT) of quantum chemistry.

The questions raised in the Introduction are answered as follows: (1) The canonicalization of expressions performed at various stages of the symbolic manipulation process ensures rapid pattern matching. (2, 5) Under certain conditions imposed on the ansatz of many-electron theories, we can assume that an intermediate tensor has two disjoint permutable sets of covariant indices and two disjoint permutable sets of contravariant indices. (3-6) The index-range tiling algorithm achieves the best compromise between the conflicting demand from the minimization of the number of arithmetic operations and the minimization of the memory requirement. It moves the conditionals to exploit spin, spatial, and index permutation symmetries to the suburb of the performance-critical section of the calculation without necessitating large temporary memory space to accommodate decompressed tensors. It also exposes an adequate granularity of parallelism and offers a means to control the peak memory usage by adjusting the tile sizes.

Despite its proven practical usefulness, the TCE program presented here is an initial prototype, and its capabilities are currently being actively enhanced in a multi-disciplinary, multi-institution project that involves quantum chemists and computer scientists. A list of enhancements that are being considered for TCE’s program generator includes storage-space minimization via the loop fusion and space-time tradeoff techniques (ref 21) and the development of a universal interface to various quantum chemistry software packages. Needless to say, TCE offers an expedient, exact, and perhaps even pedagogical means to derive and implement existing and new models of many-electron theories for the accurate theoretical treatment of electronic, magnetic, mechanical, and spectroscopic properties of atomic and molecular systems, which may include but are not limited to equation-of-motion CC and MBPT for excited states, relativistic correlation theories, analytical energy derivatives, combined CC and MBPT such as CCSD(T), and multireference CI, CC, and MBPT.

Acknowledgment. I thank the members of the TCE project (Professor Robert J. Harrison, Professor Marcel Nooijen, Dr. Alexander A. Auer, Dr. David E. Bernholdt, Dr. Venkatesh Choppella, Professor P. Sadayappan, Professor Gerald Baumgartner, Dr. Daniel Cociorva, Professor Russell Pitzer, and Professor J. Ramanujam) for valuable suggestions and encouragements throughout the study. I am indebted to Dr. Jarek Nieplocha and Dr. Theresa L. Windus for insightful discussions on the parallel I/O strategies, Dr. Takeshi Yanai and Professor Kimihiko Hirao for generously providing the UTCHEM program system, Dr. Michel Dupuis for a critical reading of the manuscript prior to publication, and Professor Rodney J. Bartlett for introducing me to advanced second-quantization techniques and diagrammatics for many-electron theories and many useful suggestions on the manuscript. This work has been funded by the U.S. Department of Energy, Office of Basic Energy Science and Office of Biological and Environmental Research, Office of Science under contract DE-AC06-76RLO 1830 with Battelle Memorial Institute. This work used the Molecular Science Computing Facility in the William R. Wiley Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, which has been supported by the Office of Biological and Environmental Research.

Supporting Information Available: The results of the applications of the electron correlation methods implemented by TCE to the problem of the singlet-triplet separation of methylene. This material is available free of charge via the Internet at http://pubs.acs.org.

References and Notes

(1) (a) Wong, H. C.; Paldus, J. Comput. Phys. Commun. 1973, 6, 1. (b) Wong, H. C.; Paldus, J. Comput. Phys. Commun. 1973, 6, 9.
(2) (a) Bartlett, R. J.; Purvis, G. D. Int. J. Quantum Chem. 1978, 14, 561. (b) Purvis, G. D., III; Bartlett, R. J. J. Chem. Phys. 1982, 76, 1910.
(3) Janssen, C. L.; Schaefer, H. F., III. Theor. Chim. Acta 1991, 79, 1.
(4) Kucharski, S. A.; Bartlett, R. J. Theor. Chim. Acta 1991, 80, 387.
(5) Li, X.; Paldus, J. J. Chem. Phys. 1994, 101, 8812.
(6) Harris, F. E. Int. J. Quantum Chem. 1999, 75, 593.
(7) (a) Nooijen, M.; Lotrich, V. J. Chem. Phys. 2000, 113, 494. (b) Nooijen, M.; Lotrich, V. J. Chem. Phys. 2000, 113, 4549. (c) Nooijen, M.; Lotrich, V. J. Mol. Struct.: THEOCHEM 2001, 547, 253.
(8) Laidig, W. D.; Fitzgerald, G.; Bartlett, R. J. Chem. Phys. Lett. 1985, 113, 151.
(9) Knowles, P. J.; Somasundram, K.; Handy, N. C.; Hirao, K. Chem. Phys. Lett. 1985, 113, 8.
(10) Hirata, S.; Bartlett, R. J. Chem. Phys. Lett. 2000, 321, 216.
(11) Kállay, M.; Surján, P. R. J. Chem. Phys. 2000, 113, 1359.
(12) Olsen, J. J. Chem. Phys. 2000, 113, 7140.
(13) Kállay, M.; Surján, P. R. J. Chem. Phys. 2001, 115, 2945.
(14) Head-Gordon, M.; Maslen, P. E.; White, C. A. J. Chem. Phys. 1998, 108, 616.
(15) Hirata, S. Tensor Contraction Engine: A Symbolic Manipulation Program for Quantum-Mechanical Many-Electron Theories, version 1.0; Pacific Northwest National Laboratory: Richland, WA, 2002.
(16) Shavitt, I.; Bartlett, R. J. Many-Body Methods in Quantum Chemistry, unpublished.
(17) Bartlett, R. J. Annu. Rev. Phys. Chem. 1981, 32, 359.
(18) Bartlett, R. J.; Stanton, J. F. Rev. Comput. Chem. 1994, 5, 65.
(19) Bartlett, R. J. Modern Electronic Structure Theory, Part II; Yarkony, D. P., Ed.; World Scientific: Singapore, 1995; p 1047.
(20) It is assumed that the first binary contraction involves an integral tensor and that the subsequent contractions occur sequentially; ABCDE (with A being an integral tensor) may be evaluated as (((AB)C)D)E, (((AD)E)C)B... but not as (((BC)A)D)E, (AB)((CD)E)... This assumption is usually correct for an operation cost minimal solution to the strength reduction of an individual multiple tensor contraction, when the conditions discussed in section 2.3 are met.
(21) Baumgartner, G.; Bernholdt, D. E.; Cociorva, D.; Harrison, R.; Hirata, S.; Lam, C.-C.; Nooijen, M.; Pitzer, R.; Ramanujam, J.; Sadayappan, P. Proceedings of Supercomputing 2002; Baltimore, MD, 2002.
(22) Dowd, K.; Severance, C. High Performance Computing, 2nd ed.; O’Reilly & Associates: Sebastopol, CA, 1998.
(23) The conditions imposed are readily satisfied by CC and CI theories, but they may become a severe constraint on MBPT, in which a deexcitation tensor enters the formalisms when Wigner’s 2n + 1 rule is invoked. It is possible to relax these conditions at the expense of some of the restrictions in the index ranges or the flexibility in the strength reduction. This issue is deferred for future investigations and will not be discussed any further in this article.
(24) Yanai, T.; et al. UTChem: A Program Suite for Ab Initio Electronic Structure Theories, A Development Version; University of Tokyo: Tokyo, Japan, 2002.
(25) Harrison, R. J.; et al. NWChem: A Computational Chemistry Package for Parallel Computers, A Development Version; Pacific Northwest National Laboratory: Richland, WA, 2002.
(26) Nieplocha, J.; et al. ParSoft: Parallel Computing Libraries and Tools Software, A Development Version; Pacific Northwest National Laboratory: Richland, WA, 2002.
(27) Lauderdale, W. J.; Stanton, J. F.; Gauss, J.; Watts, J. D.; Bartlett, R. J. Chem. Phys. Lett. 1991, 187, 21.
(28) Stanton, J. F.; Gauss, J.; Watts, J. D.; Bartlett, R. J. J. Chem. Phys. 1991, 94, 4334.