Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch

Year: 2015

Forces and stress in second order Møller-Plesset perturbation theory for condensed phase systems within the resolution-of-identity Gaussian and plane waves approach

Del Ben, Mauro ; Hutter, Jürg ; VandeVondele, Joost

Abstract: The forces acting on the atoms as well as the stress tensor are crucial ingredients for calculating the structural and dynamical properties of systems in the condensed phase. Here, these derivatives of the total energy are evaluated for the second-order Møller-Plesset perturbation energy (MP2) in the framework of the resolution of identity Gaussian and plane waves method, in a way that is fully consistent with how the total energy is computed. This consistency is non-trivial, given the different ways employed to compute Coulomb, exchange, and canonical four center integrals, and allows, for example, for energy conserving dynamics in various ensembles. Based on this formalism, a massively parallel algorithm has been developed for finite and extended systems. The designed parallel algorithm displays, with respect to the system size, cubic, quartic, and quintic requirements, respectively, for the memory, communication, and computation. All these requirements are reduced with an increasing number of processes, and the measured performance shows excellent parallel scalability and efficiency up to thousands of nodes. Additionally, the computationally more demanding quintic scaling steps can be accelerated by employing graphics processing units (GPUs) showing, for large systems, a gain of almost a factor two compared to the standard central processing unit-only case. In this way, the evaluation of the derivatives of the RI-MP2 energy can be performed within a few minutes for systems containing hundreds of atoms and thousands of basis functions. With good time to solution, the implementation thus opens the possibility to perform molecular dynamics (MD) simulations in various ensembles (microcanonical ensemble and isobaric-isothermal ensemble) at the MP2 level of theory. Geometry optimization, full cell relaxation, and energy conserving MD simulations have been performed for a variety of molecular crystals including NH3, CO2, formic acid, and benzene.

DOI: https://doi.org/10.1063/1.4919238

Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-114301
Journal Article, Published Version

Originally published at: Del Ben, Mauro; Hutter, Jürg; VandeVondele, Joost (2015). Forces and stress in second order Møller-Plesset perturbation theory for condensed phase systems within the resolution-of-identity Gaussian and plane waves approach. Journal of Chemical Physics, 143(10):102803. DOI: https://doi.org/10.1063/1.4919238

Citation: The Journal of Chemical Physics 143, 102803 (2015); doi: 10.1063/1.4919238. Published by AIP Publishing.


Reuse of AIP Publishing content is subject to the terms: https://publishing.aip.org/authors/rights-and-permissions. Downloaded to IP: 130.60.47.22 On: Thu, 12 May 2016 10:40:26 THE JOURNAL OF CHEMICAL PHYSICS 143, 102803 (2015)

Forces and stress in second order Møller-Plesset perturbation theory for condensed phase systems within the resolution-of-identity Gaussian and plane waves approach

Mauro Del Ben,1 Jürg Hutter,1 and Joost VandeVondele2
1 Department of Chemistry, University of Zürich, Winterthurerstrasse 190, CH-8057 Zürich, Switzerland
2 Department of Materials, ETH Zürich, Wolfgang-Pauli-Strasse 27, CH-8093 Zürich, Switzerland

(Received 13 December 2014; accepted 22 January 2015; published online 4 May 2015)

© 2015 AIP Publishing LLC. [http://dx.doi.org/10.1063/1.4919238]

I. INTRODUCTION

The energy evaluated with second-order Møller-Plesset (MP2) perturbation theory represents an effective way to improve the Hartree-Fock (HF) ground state by including electron correlation effects [1,2]. In this respect, MP2, also referred to as second-order many-body perturbation theory (MBPT(2)), offers many appealing features, such as size consistency and the capability to correctly account for dispersion interactions [3]. In addition, MP2 is an ab initio method that can accurately describe hydrogen-bond, covalent, and ionic interactions from first principles. Moreover, among the correlated electronic structure methods, MP2 is probably the one with the simplest and most compact form. For these reasons, MP2 is often used as a reference for testing and benchmarking new approximate methods, and MP2-like correlation has also been included in Density Functional Theory (DFT) with the introduction of double-hybrid density functionals [4,5]. However, despite the advantages of MP2, the inherent computational cost has limited its use. This is due to the quintic growth of the computational effort with respect to system size. Additionally, the basis set size must be large in order to represent the electron coalescence cusp and converge the MP2 energy [6,7]. During the past decades, several groups have contributed to improving this situation and to extending the applicability of MP2 in various ways [8].

Several approaches have been proposed in order to reduce the formal $O(N^5)$ scaling, and they can be classified as Laplace-transformed MP2 [9–16], local MP2 (LMP2) [17–26], and stochastic [27–30] methods, while explicitly correlated schemes can be used for accelerating the convergence of the MP2 energy with respect to basis set size (F12-MP2) [31–33]. Furthermore, the Resolution of Identity (RI) [34–42] approximation, sometimes referred to as Density Fitting (DF), has been shown to greatly speed up the evaluation of the MP2 energy, giving almost an order of magnitude reduction of the computational cost without significant loss of accuracy [43–45]. In addition, parallel computing has become of prime importance as a tool for reducing the time to solution for these calculations. In this respect, several parallel algorithms have been proposed [46–59], showing an efficiency growing at the same rate as the increase of the computational power. Recently, for the related direct random phase approximation method, we have demonstrated the feasibility of computing the correlation energy of more than a thousand atoms, with good basis sets, within hours [60].

Thanks to all these improvements, the applicability of MP2 theory has increased continuously over time, and recently an RI-MP2 Monte Carlo (MC) simulation of bulk liquid water under ambient conditions has been reported [61,62], demonstrating the feasibility of computing tens of thousands of configurations of 64 molecules within a reasonable time. The advantage of the MC scheme is that only the total energy is required in order to calculate ensemble averages. On the other hand, efficient MC sampling needs sufficient knowledge of the system under study to define "smart" trial moves. This makes the use of MC methods less straightforward than Molecular Dynamics (MD), for which the ensemble averages are obtained by integrating the classical equations of motion. In this case, the forces acting on the atoms have to be computed, obtained as partial derivatives of the total energy with respect to the atomic positions. Furthermore, MC does not give access to truly dynamical properties, i.e., those derived from time correlation functions, such as diffusion constants and vibrational spectra. To obtain those, accurate energy-conserving (microcanonical ensemble (NVE)) simulations have to be performed, requiring consistent forces.

The evaluation of the derivatives at the MP2 level is more intricate than their computation at the HF or DFT level. This is because, contrary to the HF or DFT cases, the correlation energy obtained from perturbation theory is not stationary with respect to the molecular orbital (MO) expansion coefficients, implying that the first-order orbital response has to be computed. The theory and equations for calculating the energy derivatives at the MP2 [63–66] and RI-MP2 [67] level have been derived and reported by many authors, together with many serial [68–70] and parallel [71–74] implementations.

Here, the equations for evaluating the derivatives of the second-order Møller-Plesset perturbation energy in the framework of the Resolution of Identity Gaussian and Plane Waves (RI-GPW) method are presented. The derivatives are evaluated consistently with the way the RI-MP2 energy is computed [59] and are of general validity for both finite and extended systems. The central idea in the RI approximation is the introduction of an atom-centered Gaussian auxiliary basis used for fitting pairwise products of atomic orbital basis functions. In addition to the representation in terms of Gaussian functions, the RI-fitting densities within the RI-GPW method are also expressed in an auxiliary basis of Plane Waves (PWs). This choice allows for rapid conversion between direct and reciprocal space representations of the density by employing fast Fourier transformations (FFTs). In this way, the treatment of the Coulomb interactions is efficiently accomplished by integrating the electrostatic potential associated with each RI-fitting density over the pairs of primary basis functions, where the electrostatic potential is obtained in a plane wave basis set after the solution of the Poisson equation in Fourier space. As a drawback, the GPW method requires smooth densities, implying that pseudopotentials have to be employed. All-electron calculations are possible within the Gaussian and Augmented Plane Wave (GAPW) scheme [75,76]; however, the actual implementation is currently limited to the GPW method.

An implementation of the analytical energy gradients at the MP2 level for extended systems has been reported by Hirata and Iwata [77]. In this case, the formulation is based on the orbital theory, implying that two-electron integrals are obtained by k-point sampling in the first Brillouin zone. Moreover, the reported applications are limited to polymers (periodic 1D) with small basis sets. The difference compared to the method presented here lies in the way the two-electron integrals are computed. In our current implementation, the sampling of the first Brillouin zone is restricted to the Gamma point only, and thus the approach converges to the same value as obtained from full k-point sampling if a sufficiently large supercell is chosen. Since our aim is to enable the study of large and disordered systems, our approach is suitable for such applications.

For the presented scheme, a massively parallel algorithm has been designed and implemented in CP2K [78]. The parallel algorithm displays, with respect to the system size, cubic, quartic, and quintic effort, respectively, for the memory, communication, and computation. All these requirements are reduced with an increasing number of processes, and the measured performance displays excellent parallel scalability and efficiency up to thousands of nodes. Moreover, in the actual implementation, the computationally more demanding part, that is, the quintic scaling steps, can be accelerated by employing graphics processing units (GPUs). Compared to the standard central processing unit (CPU) only case, this leads, in general, to a speedup of a factor greater than 4 for the $O(N^5)$ parts of the algorithm, resulting, for the largest cases, in an almost factor 2 reduction in the overall time for the calculation.

Several benchmark calculations are reported with a particular focus on molecular crystals including NH3, CO2, formic acid, and benzene. In general, it has been observed that the calculation of the derivatives at the RI-MP2 level is between 4 and 5 times more expensive than computing only the energy.

II. THEORY

In this section, the basic equations necessary for implementing the first derivatives of the RI-MP2 energy are briefly presented, referring to the original works for more details [66,67,70,74]. More information is reported in cases for which the general theory is combined with the GPW approach. The following index notation has been adopted: $i, j, k, \ldots$ refer to canonical occupied MOs, $a, b, c, \ldots$ to canonical virtual MOs, $p, q, r, \ldots$ to general canonical MOs, $\mu, \nu, \lambda, \ldots$ to primary atomic orbital basis set functions (AO), and $P, Q, R, \ldots$ to auxiliary AO basis set functions (AUX). The one-electron MO, primary AO, and auxiliary AO functions are symbolized, respectively, with $\psi$, $\phi$, and $\chi$. The number of occupied and virtual orbitals is denoted by $o$ and $v$, while the total number of primary and auxiliary basis functions is denoted by $n$ and $N_a$. In order to express, in general, the system size, the symbol $N$ is used. Given a perturbation parameter $x$, e.g., a nuclear displacement, the superscript $x$ represents the derivative with respect to $x$, while $(x)$ denotes the skeleton derivative, that is, derivatives of the AO integrals only (i.e., without considering the derivatives of the expansion coefficients of the MOs).
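To make the distinction between the full derivative (superscript $x$) and the skeleton derivative (superscript $(x)$) concrete, consider the overlap matrix in the MO basis, $S_{pq} = \sum_{\mu\nu} C_{\mu p} S_{\mu\nu} C_{\nu q}$. The worked equation below is an illustration added here, not taken from the original derivation; the symbol $C^{x}_{\mu p}$ for the derivative of an MO coefficient is introduced only for this purpose.

```latex
\begin{align}
S^{x}_{pq}
  &= \underbrace{\sum_{\mu\nu} C_{\mu p}\, S^{x}_{\mu\nu}\, C_{\nu q}}_{S^{(x)}_{pq}}
   + \sum_{\mu\nu} \left( C^{x}_{\mu p}\, S_{\mu\nu}\, C_{\nu q}
   + C_{\mu p}\, S_{\mu\nu}\, C^{x}_{\nu q} \right)
\end{align}
```

For orthonormal MOs, $S_{pq} = \delta_{pq}$ holds for every value of $x$, so the left-hand side vanishes and the coefficient-response term equals $-S^{(x)}_{pq}$. The skeleton derivative therefore only requires derivatives of AO integrals, while all dependence on the orbital response is collected in the second sum.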


A. The MP2 energy within the RI-GPW method

In second-order Møller-Plesset perturbation theory, the correlation energy $E^{(2)}$ for a closed-shell restricted Hartree-Fock reference wave function is given by

$$E^{(2)} = -\sum_{i \le j}^{\mathrm{occ}} (2 - \delta_{ij}) \sum_{ab}^{\mathrm{virt}} \frac{(ia|jb)\,[2(ia|jb) - (ib|ja)]}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}, \tag{1}$$

where $\epsilon_p$ are the orbital energies, $\delta_{ij}$ is the Kronecker delta, and $(ia|jb)$ is a two-electron repulsion integral (ERI) over MOs in Mulliken notation,

$$(ia|jb) = \iint \psi_i(\mathbf{r}_1)\psi_a(\mathbf{r}_1)\, \frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\, \psi_j(\mathbf{r}_2)\psi_b(\mathbf{r}_2)\, d\mathbf{r}_1 d\mathbf{r}_2. \tag{2}$$

In a standard canonical MP2 energy algorithm, the computation of the $(ia|jb)$ integrals is performed via four consecutive integral transformations of the ERIs over AOs, $(\mu\nu|\lambda\sigma)$,

$$(ia|jb) = \sum_{\mu} C_{\mu i} \sum_{\nu} C_{\nu a} \sum_{\lambda} C_{\lambda j} \sum_{\sigma} C_{\sigma b}\, (\mu\nu|\lambda\sigma), \tag{3}$$

where the $C_{\kappa p}$ represent the elements of the MO coefficient matrix. Each of the four quarter transformations has a formal computational effort that grows as $O(N^5)$, which determines the asymptotic scaling of the MP2 energy evaluation [79].

The resolution of identity approximation [80,81] is an effective technique for accelerating the evaluation of the $(ia|jb)$ ERIs. It consists in the introduction of an auxiliary Gaussian basis set $\{\chi_P\}$ used to factorize the $(ia|jb)$ integrals according to

$$(ia|jb)_{RI} = \sum_{P} B^{P}_{ia} B^{P}_{jb}, \tag{4}$$

where $B$ is a matrix with $ov$ rows and $N_a$ columns given by

$$B^{P}_{ia} = \sum_{Q} (ia|Q)\, V^{-1/2}_{QP}, \tag{5}$$

and $V^{-1/2}_{QP}$ are the matrix elements of the inverse square root of the Coulomb metric [82] $V_{QP} = (Q|P)$,

$$(Q|P) = \iint \chi_Q(\mathbf{r}_1)\, \frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\, \chi_P(\mathbf{r}_2)\, d\mathbf{r}_1 d\mathbf{r}_2. \tag{6}$$

Since the three-center integrals $(ia|Q)$ are computed starting from integrals over AOs,

$$(\mu\nu|Q) = \iint \phi_\mu(\mathbf{r}_1)\phi_\nu(\mathbf{r}_1)\, \frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\, \chi_Q(\mathbf{r}_2)\, d\mathbf{r}_1 d\mathbf{r}_2, \tag{7}$$

the final expression for the $B^{P}_{ia}$ elements reads

$$B^{P}_{ia} = \sum_{\nu} C_{\nu a} \sum_{\mu} C_{\mu i} \sum_{Q} (\mu\nu|Q)\, V^{-1/2}_{QP}. \tag{8}$$

The RI approximation to the ERIs over MOs has many practical advantages, such as the following:

• The $(ia|jb)_{RI}$ ERIs can be evaluated without significant loss of accuracy even when employing an auxiliary basis that is only 2–4 times larger than the primary basis [43,45,67].
• The effort for the integral computation is strongly reduced, since four-index integrals over AOs $(\mu\nu|\lambda\sigma)$ are never generated and only three-center $(\mu\nu|Q)$ and two-center $(Q|P)$ ERIs are required. This means that the integral computation formally requires $O(N^3)$ operations, while the integral transformations (Eq. (8)) scale asymptotically as $O(N^4)$.
• As shown by Eq. (4), the whole set of four-index ERIs over MOs $(ia|jb)_{RI}$ can be efficiently evaluated from the three-index intermediates $B^{P}$ by matrix-matrix multiplications.
• Since only the matrix $B$ has to be stored for the generation of the $(ia|jb)_{RI}$, the required memory grows as $O(N^3)$.

The application of the RI approximation to the MP2 energy calculation is straightforward [35]. It consists simply in the replacement of the $(ia|jb)$ integrals in Eq. (1) with the approximated $(ia|jb)_{RI}$ given in Eq. (4). The computation of the $(ia|jb)_{RI}$ requires $O(o^2 v^2 N_a)$ operations, implying that the RI-MP2 method also scales as $O(N^5)$. Thus, the advantage of RI-MP2, compared to a standard MP2 implementation, lies in the reduced memory requirement and prefactor associated with the computation of the $(ia|jb)$ via Eq. (4) instead of Eq. (3).

From what has been shown so far, it is clear that applying the RI approximation to the MO-ERIs requires the computation of three-center $(\mu\nu|Q)$ and two-center $(Q|P)$ ERIs. In particular, for condensed phase systems, for which periodic boundary conditions (PBCs) are mandatory, these intermediates have to account for the requirement that the simulation cell is infinitely replicated in all directions in space. In order to accomplish this task, the Gaussian and plane waves method [83,84] has been extended to handle integrals that arise in wave function correlation methods [58,59].

In general, the GPW method is an efficient approach for treating Coulomb interactions between Gaussian basis elements and any electrostatic density $\rho$ that fulfills the PBC of the considered system,

$$(\mu\nu|\rho) = \iint \phi_\mu(\mathbf{r}_1)\phi_\nu(\mathbf{r}_1)\, \frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\, \rho(\mathbf{r}_2)\, d\mathbf{r}_1 d\mathbf{r}_2. \tag{9}$$

Here, if $\rho$ is considered as the total electronic density, then the form of the last equation is essentially identical to the one used to compute matrix elements of the Hartree potential [83]. Thus, the highly efficient implementation of that operation in CP2K [78] can be directly used, and we refer to Ref. 84 for a detailed discussion.

For the RI case in particular, the three-center integrals are computed, Eq. (8), starting from the integrals over AOs, which are subsequently transformed with the two matrices $C$ and $V^{-1/2}$. Employing the GPW method, Eq. (9), the index transformation over the auxiliary basis can be avoided, since it is possible to directly compute half-transformed integrals for an associated density $\rho^{P}$ as

$$\begin{aligned}
B^{P}_{\mu\nu} &= \sum_{Q} (\mu\nu|Q)\, V^{-1/2}_{QP} \\
&= \iint \phi_\mu(\mathbf{r}_1)\phi_\nu(\mathbf{r}_1)\, \frac{1}{r_{12}} \Big[ \sum_{Q} \chi_Q(\mathbf{r}_2)\, V^{-1/2}_{QP} \Big]\, d\mathbf{r}_1 d\mathbf{r}_2 \\
&= \int \phi_\mu(\mathbf{r}_1)\phi_\nu(\mathbf{r}_1) \Big[ \int \frac{\rho^{P}(\mathbf{r}_2)}{r_{12}}\, d\mathbf{r}_2 \Big]\, d\mathbf{r}_1 \\
&= \int \phi_\mu(\mathbf{r}_1)\phi_\nu(\mathbf{r}_1)\, v^{P}(\mathbf{r}_1)\, d\mathbf{r}_1.
\end{aligned} \tag{10}$$
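The RI factorization of Eqs. (4) and (5) and the energy expression of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration on random model data, not the massively parallel CP2K algorithm discussed in this paper: the function names (`inverse_sqrt`, `build_B`, `ri_mp2_energy`), the array layout `B[i, a, P]`, and the random "integrals" and orbital energies are all assumptions made for the example.

```python
import numpy as np


def inverse_sqrt(V):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, U = np.linalg.eigh(V)
    # Scale each eigenvector column by w^{-1/2}, then rotate back
    return (U / np.sqrt(w)) @ U.T


def build_B(ia_Q, V):
    """Eq. (5): B[i, a, P] = sum_Q (ia|Q) V^{-1/2}[Q, P]."""
    return ia_Q @ inverse_sqrt(V)


def ri_mp2_energy(B, eps_occ, eps_virt):
    """Closed-shell RI-MP2 correlation energy, Eq. (1) with Eq. (4)."""
    n_occ = B.shape[0]
    e2 = 0.0
    for i in range(n_occ):
        for j in range(i, n_occ):
            # Eq. (4) for fixed i, j: block[a, b] = (ia|jb)_RI
            block = B[i] @ B[j].T
            # denom[a, b] = eps_a + eps_b - eps_i - eps_j
            denom = (eps_virt[:, None] + eps_virt[None, :]
                     - eps_occ[i] - eps_occ[j])
            # (2 - delta_ij) weight of the i <= j summation
            weight = 1.0 if i == j else 2.0
            # block.T[a, b] = (ib|ja)_RI
            e2 -= weight * np.sum(block * (2.0 * block - block.T) / denom)
    return e2


# Model data (hypothetical): random three-center quantities and a random
# symmetric positive-definite "Coulomb metric".
rng = np.random.default_rng(0)
n_occ, n_virt, n_aux = 3, 5, 20
ia_Q = rng.standard_normal((n_occ, n_virt, n_aux))
M = rng.standard_normal((n_aux, n_aux))
V = M @ M.T + n_aux * np.eye(n_aux)
eps_occ = -1.0 - rng.random(n_occ)
eps_virt = 1.0 + rng.random(n_virt)

B = build_B(ia_Q, V)
print("model RI-MP2 correlation energy:", ri_mp2_energy(B, eps_occ, eps_virt))
```

Looping over pairs $i \le j$ means that only one $v \times v$ block of $(ia|jb)_{RI}$ exists at any time, so the dominant storage is the three-index tensor $B$, in line with the $O(N^3)$ memory argument above, while each block is produced by a single matrix-matrix multiplication.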

The same approach holds for the (P|Q) integrals with the In the above expression, for each summation, a common only difference that the potential is calculated from the density structure can be recognized, that is, the contraction of terms x x (x) (x) associated to a single Gaussian auxiliary basis function. involving AO derivatives (µν|Q) ,(P|Q) , Fpq , Spq, with el- P Q PQ (2) (2) The key aspect in GPW is that the density ρ is expressed ements of the intermediates Γµν,Γ , Ppq,Wpq. The contri- on a regular grid, or, in equivalent terms, ρP is expanded in an (2) bution to the derivatives of ERI coming from the first two auxiliary basis of plane waves (PW), summations, referred as non-separable part, is specific to the RI-MP2 method. It involves the contraction of 3- and 2-center 1 ⃗ · ⃗ P ⃗ ≈ P ⃗ iG R x x ρ (R) Ω  ρ (G)e , (11) RI integral derivative (µν|Q) ,(P|Q) with 3- and 2-index G⃗ quantities, namely, non-separable correction to the 2-particle ΓQ ΓPQ where ρP(G⃗) are the Fourier coefficients of the density, Ω is density matrix (2-PDM), µν, and . These two specific the volume of the simulation cell and the sum over the recip- quantities are given by ⃗ rocal lattice vectors G is determined by the size S of the PW occ virt ΓQ ΓQ basis. FFTs allow for switching representation between real µν =  Ci µ  Cνa ia, (14) P ⃗ P ⃗ space (ρ (R)) and reciprocal (ρ (G)) space with an associated i a computational effort that grows only as O (S log S). In this AUX − P ΓQ P 1/2 way, the electrostatic potential v in Eq. (10) can be efficiently ia =  YiaVPQ , (15) obtained in a plane waves basis set after solving the Poisson P 2(ia| jb) − (ib| ja) equation in Fourier space, Y P = BP , (16) ia  ϵ + ϵ − ϵ − ϵ jb 4π jb i j a b v P(G⃗) = ρP(G⃗), (12) G2 AUX ΓPQ = ΓP BR V −1/2. −1   ia ia RQ (17) while an additional back FFT (FFT ) will yield the potential R ia in real space. 
Q PQ Once Γµν and Γ are made available, the non-separable An extensive description, together with the implementa- x (2) x tion details, of the RI-GPW method can found in Ref. 59, here contribution to ERI is obtained by contraction with (µν|Q) x only the most important features are highlighted as follows: and (P|Q) , which are computed consistently to (µν|Q) and (P|Q) by employing the same GPW scheme. This leads, for • P The accuracy of the calculated Bµν integrals in Eq. (10) the 3-center case, to can be systematically improved by increasing the PW AUX AO basis set size (resolution of the grid).58 This is conven- ΓQ (µν|Q)x tionally done by specifying the energy cutoff that limits   µν Q µν the kinetic energy of the PWs. AUX AO AUX AO • PW auxiliary basis is a natural choice for periodic sys- ΓQ x x ΓQ x =   µν [(µ ν|Q) + (µν |Q)] + ( µνφµφν|Q ) tems, but it can equally be used for gas phase or surface Q µν Q µν calculations.85–87 AUX AO AUX • ΓQ x Q x Q ΓQ x All-electron calculations are not possible and pseu- =   µν (µ ν|ρ ) + (µν |ρ ) + (ρ |Q ) dopotentials have to be employed.88 Q µν   Q • P AUX AO For each electrostatic potential v , all matrix elements Q x x Q = Γ φ (⃗r)φν(⃗r) + φµ(⃗r)φ (⃗r) v (⃗r)d⃗r that are non-zero within a given threshold (ϵgrid) can be   µν  µ ν H obtained in linear scaling time.84 Q µν   • P AUX Bµν are transform from the AO basis to the MO ba- x ⃗ ΓQ ⃗ ⃗ P +  χQ(r)vH (r)dr, (18) sis (Bia) via two consecutive matrix-matrix multiplica-  BP C† BP C C C Q tions, MO = o AO v, with o and v being, respec- ΓQ tively, the occupied and virtual parts of the coefficient where vH is the electrostatic potential related to the matrix. The multiplication by C can exploit the spar- AO o Q P ΓQ ⃗ ⃗ B µνφµ(r)φν(r) density, while v is the counterpart associ- sity of AO, implying an O(no) scaling for each P, while  H the final multiplication cannot exploit sparsity and is µν ated to the single auxiliary Gaussian function χ (⃗r). 
For the asymptotically dominant, scaling as O(onv). Q 2-center case, exploiting the fact that ΓPQ is symmetric, the analogous approach gives B. The analytic derivatives for RI-MP2 AUX AUX ( ) The analytic derivative of the RI-MP2 energy E 2 with ΓPQ(P|Q)x = 2 ΓPQ χx (⃗r)vQ(⃗r)d⃗r RI    P H respect to a perturbation parameter x, for a closed-shell re- PQ PQ 67,70,74 stricted Hartree-Fock wave function, is given by AUX x ⃗ ΓQ ⃗ ⃗ (2) AUX AO AUX = 2  χQ(r)vH (r)dr, (19) (2)x dE  E = RI = 4 ΓQ (µν|Q)x − 2 ΓPQ(P|Q)x Q RI dx   µν  Q µν PQ Q where v Γ is the potential obtained from the ΓPQ χ (⃗r) den- MO H  P (2) (x) (2) (x) P + − Q 2  PpqFpq WpqSpq . (13) sity, while v is the same as in Eq. (18). The two formulations pq   H Reuse of AIP Publishing content is subject to the terms: https://publishing.aip.org/authors/rights-and-permissions. Downloaded to IP: 130.60.47.22 On: Thu, 12 May 2016 10:40:26 102803-5 Del Ben, Hutter, and VandeVondele J. Chem. Phys. 143, 102803 (2015)

giveninEq.(19)areequivalentand,evenifthelatteroffersmore the sake of completeness, it has to be mentioned that in the advantages in terms of computational efficiency, the former is GPW scheme the total Hartree potential includes an additional Q preferred since vH can be reused in Eq. (18). term that comes from the introduction of a Gaussian charge The last summation in Eq. (13) consists in the contraction distribution at each nucleus ρc(⃗r). This is commonly done in (2) of Ppq, the MP2 correction to the 1-particle density matrix Ewald sum methods in order to decouple the long and short (2) (1-PDM), and Wpq, the MP2 correction to the energy-weighted range treatment of the electrostatic interactions. The contribu- (2)x density matrix, with the skeleton derivatives of the Fock and tion to ERI associated to this additional term is accounted by ⃗ (2) overlap matrix elements integrating the ρc(r) derivative with the previously defined vH AO potential (x) x HF x Fpq = Cµp hµν + Pλσ(µν|λσ)   x (2) µν  2 ρ (⃗r)v (⃗r)d⃗r (23)  λσ  c H  1  84 − PHF (µλ|νσ)x C , (20) as likewise done for the similar term in standard HF method. 2  λσ νq λσ  For efficiency reasons, in order to avoid the recomputations  (2) (2) AO  of integrals derivatives, the contraction of Pµν and Wµν is (x) x  performed, when possible, at the same time with PHF and W HF Spq =  CµpSµνCνq. (21) µν µν µν matrices, i.e., simultaneously during the calculation of the HF x x energy derivatives. In Eq. (20), hµν and (µν|λσ) are, respectively, the derivatives of the one-electron Hamiltonian integrals and the 4-index ERIs At this stage, the only missing quantities that remain to be occ defined are P(2) and W (2). These matrices are usually calculated HF pq pq over AO, while Pµν = 2  CµiCνi is the Hartree-Fock density in the MO basis, and they are the result of the composition i matrix obtained from the converged SCF procedure. 
In order to of terms that have a different definition according to which take advantages from sparsity, the update of the E(2) derivative block of the matrix they refer, namely, occupied-occupied RI (occ-occ), virtual-virtual (virt-virt), and occupied-virtual (occ- is performed in the AO basis, previous back transformation of (2) (2) (2) virt). Concerning the MP2 correction to the 1-PDM Ppq, the Ppq and Wpq from the MO basis. x occ-occ and virt-virt blocks are defined as In the framework of the GPW method, hµν contains the derivative of the matrix element of the electronic kinetic en- virt occ (ja|kb) (2) = − ab ergy, short range part of the local pseudopotential, and the non- Pi j   tik , (24) ϵ j + ϵ k − ϵ a − ϵ b local pseudopotential. These terms are computed analytically ab k occ virt and explicit formulas can be found in Ref. 84. The exact HF (2) ac (ib| jc) Pab =   ti j , (25) exchange contributions (last summation inside the squared ϵ i + ϵ j − ϵ b − ϵ c bracket in Eq. (20)) are calculated consistently, via 4-index i j c Γ ab ERI derivatives, with the -point implementation based on where ti j are the MP2 amplitudes, which in the restricted a short range (truncated) Coulomb operator in the case of closed shell HF case, take the form 89,90 PBC. Due to the dual representation of the density in GPW, 2(ia| jb) − (ib| ja) special care has to be taken for the derivative of the Hartree tab = . (26) i j ϵ + ϵ − ϵ − ϵ matrix elements (second term inside the squared bracket in i j a b Eq. (20)). 
In particular, it is convenient to reformulate the The virt-occ block of P(2) contains information related to the contribution coming from the Hartree energy in terms of elec- orbital relaxation caused by the perturbation x64 (i.e., first trostatic densities; this is accomplished by exploiting the sym- order response of the MO coefficients).91 It is computed as the HF (2) 92 metry of Pµν , Pµν, and (µν|λσ) derivatives solution of the Z-vector equations P(2)PHF (µν|λσ)x virt occ  µν λσ δ δ (ϵ − ϵ ) + A P(2) = −L , (27) µνλσ   i j ab a i aib j ai b j a i (2) x HF   = 2 Pµν(µ ν| P φλφσ)   λσ where Aaib j is an element of the orbital Hessian matrix µν λσ HF (2) x Aaib j = 4(ai|bj) − (ab|ij) − (a j|bi), (28) + 2  Pλσ( Pµνφµφν|λ σ) λσ µν and L is a specific RI-MP2 Lagrangian matrix given by (2) x HF HF (2) x = 2  Pµν(µ ν|ρ ) + 2  Pλσ(ρ |λ σ) virt AUX occ AUX Q Q µν λσ Γ − Γ Lb j = 2  (ba|Q) ja 2  (ij|Q) ib (2) x ⃗ ⃗ HF ⃗ ⃗ a Q i Q = 2 Pµν φµ(r)φν(r)vH (r)dr   virt occ µν (2) (2) +  Pac Aacb j +  Pik Aikb j. (29) HF x (2) + 2 P φ (⃗r)φ (⃗r)v (⃗r)d⃗r, (22) ac ik  λσ  λ σ H λσ The first two terms in Eq. (29), namely, Lb j(1) and Lb j(2), HF (2) are computed within the mixed Lagrangian formalism,70 that where the vH and vH are the Hartree potentials associated to HF HF ⃗ ⃗ (2) (2) ⃗ ⃗ is, starting from the counterpart Lµ j(1) and Lbν(2) in a mixed ρ =  Pλσφλ(r)φσ(r) and ρ =  Pµνφµ(r)φν(r). For λσ µν AO/MO basis

Reuse of AIP Publishing content is subject to the terms: https://publishing.aip.org/authors/rights-and-permissions. Downloaded to IP: 130.60.47.22 On: Thu, 12 May 2016 10:40:26 102803-6 Del Ben, Hutter, and VandeVondele J. Chem. Phys. 143, 102803 (2015)

$$
L_{jb}(1) = \sum_{\mu}^{\mathrm{AO}} C_{\mu b}\, L_{\mu j}(1),
\tag{30}
$$

$$
L_{jb}(2) = \sum_{\nu}^{\mathrm{AO}} L_{b\nu}(2)\, C_{\nu j},
\tag{31}
$$

where

$$
L_{\mu j}(1) = 2\sum_{\nu}^{\mathrm{AO}} \sum_{Q}^{\mathrm{AUX}} (\mu\nu|Q)\,\Gamma^{Q}_{j\nu},
\tag{32}
$$

$$
L_{b\nu}(2) = -2\sum_{i}^{\mathrm{occ}} \sum_{Q}^{\mathrm{AUX}} (i\nu|Q)\,\Gamma^{Q}_{ib}.
\tag{33}
$$

The above reformulation allows the contributions to $L_{\mu j}(1)$ and $L_{b\nu}(2)$ to be accumulated together with the contraction of $\Gamma^{Q}_{\mu\nu}$ and $\Gamma^{PQ}$ with the integral derivatives $(\mu\nu|Q)^{x}$ and $(P|Q)^{x}$. This choice is thus particularly convenient from a computational standpoint, since many intermediates, such as $\Gamma^{Q}_{j\nu}$, are common to both updates and do not need to be recomputed. Moreover, the $(\mu\nu|Q)$ integrals have to be recalculated, and this is performed at the same time as the computation of the corresponding derivatives, allowing a further saving of time since all grid operations, such as FFTs, are performed once for both terms.

The off-diagonal elements of $P^{(2)}_{ij}$ and $P^{(2)}_{ab}$, defined in Eqs. (24) and (25), can be equivalently computed (within a canonical reformulation67,74,93) from $L_{ij}(1) = \sum_{\mu} C_{\mu j} L_{\mu i}(1)$ and $L_{ab}(2) = \sum_{\nu} L_{a\nu}(2) C_{\nu b}$ as

$$
P^{(2)}_{ij} = \frac{1}{2}\, \frac{L_{ij}(1) - L_{ji}(1)}{\epsilon_j - \epsilon_i},
\tag{34}
$$

$$
P^{(2)}_{ab} = -\frac{1}{2}\, \frac{L_{ab}(2) - L_{ba}(2)}{\epsilon_b - \epsilon_a}.
\tag{35}
$$

This choice suffers from numerical instability in the case $\epsilon_i \approx \epsilon_j$ or $\epsilon_a \approx \epsilon_b$, but, contrary to the case of Eqs. (24) and (25), it offers a way of computing $P^{(2)}_{ij}$ and $P^{(2)}_{ab}$ that does not require $O(N^5)$ operations except for intermediates that are already available. Moreover, in a parallel implementation, where the workload is usually distributed over independent $ij$ pairs, the usage of Eq. (34) drastically reduces the algorithmic complexity and avoids the recomputation of the MP2 amplitudes.74

Due to the large size of the orbital Hessian matrix $A$ ($ov \times ov$), the linear system of Eq. (27) is commonly solved by iterative techniques.63,94–96 According to these methods, rather than calculating and storing the full $A$, which is computationally inaccessible even for relatively small systems, at each iteration the matrix-vector product $\sum_{ia}^{\mathrm{MO}} X_{ai} A_{aibj}$ is computed, with $X$ being a trial solution. In this respect, it has to be noted that the orbital Hessian is made of a Coulomb part, the first term in Eq. (28), and an exchange part, the last two terms in Eq. (28).91 These two updates of the matrix-vector product thus have to be computed consistently with the way the Coulomb and exchange contributions to the Fock matrix are calculated during the SCF procedure. In the present case, this means that the former is obtained within the GPW scheme and the latter via 4-index ERIs. Again, for efficiency reasons, the AO representation is preferred so that sparsity can be exploited.

This leads to the following matrix-vector update for the Coulomb part:

$$
\begin{aligned}
\sum_{ia} X_{ai}\,(ai|bj)
&= \sum_{ai} X_{ai} \sum_{\mu\nu\lambda\sigma}^{\mathrm{AO}} C_{\mu a} C_{\nu i}\, (\mu\nu|\lambda\sigma)\, C_{\lambda b} C_{\sigma j} \\
&= \sum_{\mu\nu\lambda\sigma} \Big[\sum_{ai} C_{\mu a} X_{ai} C_{\nu i}\Big] (\mu\nu|\lambda\sigma)\, C_{\lambda b} C_{\sigma j} \\
&= \sum_{\mu\nu\lambda\sigma} Y_{\mu\nu}\, (\mu\nu|\lambda\sigma)\, C_{\lambda b} C_{\sigma j} \\
&= \sum_{\lambda\sigma} C_{\lambda b} C_{\sigma j} \Big(\sum_{\mu\nu} Y_{\mu\nu}\phi_{\mu}\phi_{\nu} \Big| \lambda\sigma\Big) \\
&= \sum_{\lambda\sigma} C_{\lambda b} C_{\sigma j}\, (\rho^{Y}|\lambda\sigma) \\
&= \sum_{\lambda\sigma} C_{\lambda b} C_{\sigma j} \int \phi_{\lambda}(\vec r)\,\phi_{\sigma}(\vec r)\, v^{Y}_{H}(\vec r)\, d\vec r ,
\end{aligned}
\tag{36}
$$

where $v^{Y}_{H}$ is the electrostatic potential obtained from the density $\rho^{Y}(\vec r) = \sum_{\mu\nu}^{\mathrm{AO}} Y_{\mu\nu}\phi_{\mu}(\vec r)\phi_{\nu}(\vec r)$ and $Y_{\mu\nu}$ is the back transformed matrix associated with the current trial solution $X_{ai}$. The required update for the exchange part via 4-index ERIs97 reads

$$
\begin{aligned}
-\sum_{ia} X_{ai}\,[(ab|ij) + (aj|bi)]
&= -\sum_{ai} X_{ai} \sum_{\mu\nu\lambda\sigma} C_{\mu a} C_{\nu i}\, [(\mu\lambda|\nu\sigma) + (\mu\sigma|\lambda\nu)]\, C_{\lambda b} C_{\sigma j} \\
&= -\sum_{\mu\nu\lambda\sigma} \Big[\sum_{ai} C_{\mu a} X_{ai} C_{\nu i}\Big] [(\mu\lambda|\nu\sigma) + (\mu\sigma|\lambda\nu)]\, C_{\lambda b} C_{\sigma j} \\
&= -\sum_{\mu\nu\lambda\sigma} Y_{\mu\nu}\, [(\mu\lambda|\nu\sigma) + (\mu\sigma|\lambda\nu)]\, C_{\lambda b} C_{\sigma j} \\
&= -\sum_{\mu\nu\lambda\sigma} C_{\lambda b} C_{\sigma j}\, (\mu\lambda|\nu\sigma)\, [Y_{\mu\nu} + Y_{\nu\mu}],
\end{aligned}
\tag{37}
$$

where the permutation symmetry of the AO-ERIs has been exploited.

Finally, the MP2 correction to the energy-weighted density matrix, $W^{(2)}_{pq}$, is calculated as follows:

Occupied-occupied block,

$$
W^{(2)}_{ij} = \frac{1}{2}\left\{ W^{(2)}_{ij}[\mathrm{I}] + W^{(2)}_{ij}[\mathrm{II}] + W^{(2)}_{ij}[\mathrm{III}] \right\},
\tag{38}
$$

$$
W^{(2)}_{ij}[\mathrm{I}] = 2\sum_{a}^{\mathrm{virt}} \sum_{Q}^{\mathrm{AUX}} (ja|Q)\,\Gamma^{Q}_{ia} = \sum_{\mu}^{\mathrm{AO}} C_{\mu j}\, L_{\mu i}(1),
\tag{39}
$$

$$
W^{(2)}_{ij}[\mathrm{II}] = (\epsilon_i + \epsilon_j)\, P^{(2)}_{ij},
\tag{40}
$$

$$
W^{(2)}_{ij}[\mathrm{III}] = \sum_{pq}^{\mathrm{MO}} P^{(2)}_{pq} A_{pqij}.
\tag{41}
$$

Virtual-virtual block,

$$
W^{(2)}_{ab} = \frac{1}{2}\left\{ W^{(2)}_{ab}[\mathrm{I}] + W^{(2)}_{ab}[\mathrm{II}] \right\},
\tag{42}
$$

$$
W^{(2)}_{ab}[\mathrm{I}] = 2\sum_{i}^{\mathrm{occ}} \sum_{Q}^{\mathrm{AUX}} (ib|Q)\,\Gamma^{Q}_{ia} = -\sum_{\nu}^{\mathrm{AO}} C_{\nu b}\, L_{a\nu}(2),
\tag{43}
$$

$$
W^{(2)}_{ab}[\mathrm{II}] = (\epsilon_a + \epsilon_b)\, P^{(2)}_{ab}.
\tag{44}
$$
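The matrix-free strategy described above for the Z-vector equations, Eq. (27), can be illustrated with a small, self-contained sketch. It is not the implementation discussed in this work: the three-index factors `b_oo`, `b_vv`, and `b_vo`, the helper names, and the use of a plain conjugate-gradient solver are all assumptions made for illustration, with random numbers standing in for real integrals. The point is only that the product of the orbital Hessian with a trial vector, Coulomb-like and exchange-like parts included, can be formed without ever storing $A_{aibj}$.

```python
import numpy as np


def make_toy_system(n_occ=3, n_virt=4, n_aux=6, seed=0):
    """Random stand-ins for orbital energies, RI-like factors and L_bj (all hypothetical)."""
    rng = np.random.default_rng(seed)
    eps_occ = np.sort(rng.uniform(-2.0, -1.0, n_occ))
    eps_virt = np.sort(rng.uniform(1.0, 2.0, n_virt))
    # Toy factors B[Q, p, q] such that (pq|rs) ~ sum_Q B[Q, p, q] B[Q, r, s].
    b_oo = 0.1 * rng.standard_normal((n_aux, n_occ, n_occ))
    b_oo = 0.5 * (b_oo + b_oo.transpose(0, 2, 1))
    b_vv = 0.1 * rng.standard_normal((n_aux, n_virt, n_virt))
    b_vv = 0.5 * (b_vv + b_vv.transpose(0, 2, 1))
    b_vo = 0.1 * rng.standard_normal((n_aux, n_virt, n_occ))
    lagrangian = rng.standard_normal((n_virt, n_occ))  # stand-in for L_bj
    return eps_occ, eps_virt, b_oo, b_vv, b_vo, lagrangian


def apply_hessian(x, eps_occ, eps_virt, b_oo, b_vv, b_vo):
    """Return sum_ai [delta_ij delta_ab (eps_a - eps_i) + A_aibj] x_ai without building A."""
    diagonal = (eps_virt[:, None] - eps_occ[None, :]) * x
    # Coulomb-like part, 4 (ai|bj) x_ai: contract x to the auxiliary index first.
    coulomb = 4.0 * np.einsum("Qbj,Q->bj", b_vo, np.einsum("Qai,ai->Q", b_vo, x))
    # Exchange-like parts, -(ab|ij) x_ai and -(aj|bi) x_ai.
    exchange = -np.einsum("Qab,Qij,ai->bj", b_vv, b_oo, x)
    exchange -= np.einsum("Qaj,Qbi,ai->bj", b_vo, b_vo, x)
    return diagonal + coulomb + exchange


def conjugate_gradient(matvec, rhs, tol=1e-10, max_iter=200):
    """Plain conjugate gradient for a symmetric positive definite operator."""
    x = np.zeros_like(rhs)
    r = rhs - matvec(x)
    p = r.copy()
    rr = np.vdot(r, r)
    for _ in range(max_iter):
        if np.sqrt(rr) < tol:
            break
        ap = matvec(p)
        alpha = rr / np.vdot(p, ap)
        x += alpha * p
        r -= alpha * ap
        rr_new = np.vdot(r, r)
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x


if __name__ == "__main__":
    eps_o, eps_v, b_oo, b_vv, b_vo, lag = make_toy_system()
    matvec = lambda x: apply_hessian(x, eps_o, eps_v, b_oo, b_vv, b_vo)
    p_vo = conjugate_gradient(matvec, -lag)  # right-hand side is -L_bj, as in Eq. (27)
    print("residual norm:", np.linalg.norm(matvec(p_vo) + lag))
```

In this toy model the large occupied-virtual gap keeps the operator well conditioned, so a few iterations suffice; in a real calculation the number of iterations depends on the system, and the Coulomb and exchange products would be evaluated as in Eqs. (36) and (37) rather than from RI-like factors.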

Occupied-virtual block,

$$
W^{(2)}_{ai} = \frac{1}{2}\left\{ W^{(2)}_{ai}[\mathrm{I}] + W^{(2)}_{ai}[\mathrm{II}] \right\},
\tag{45}
$$

$$
W^{(2)}_{ai}[\mathrm{I}] = 2\sum_{j}^{\mathrm{occ}} \sum_{Q}^{\mathrm{AUX}} (ji|Q)\,\Gamma^{Q}_{ja} = -\sum_{\nu}^{\mathrm{AO}} C_{\nu i}\, L_{a\nu}(2),
\tag{46}
$$

$$
W^{(2)}_{ai}[\mathrm{II}] = 2\epsilon_i P^{(2)}_{ai}.
\tag{47}
$$

The methodology presented so far is of general validity for any perturbation parameter $x$. In particular, for the calculation of the forces acting on the ions, the gradients of $E^{(2)}_{\mathrm{RI}}$ with respect to the atomic positions have to be computed. Thus, within the GPW scheme, for which densities are represented in terms of both atom centered Gaussians and plane waves, only the derivatives of the former have to be considered, since the latter are originless functions and do not depend on the atomic positions.

The RI-MP2 contribution to the total stress tensor is calculated according to98–100

$$
\Pi^{(2)}_{\alpha\beta} = -\frac{1}{3V} \sum_{\gamma=1}^{3} \frac{\partial E^{(2)}_{\mathrm{RI}}}{\partial h_{\alpha\gamma}}\, h^{T}_{\gamma\beta},
\tag{48}
$$

where $h_{\alpha\gamma}$ are the elements of the matrix of the cell vectors (Bravais lattice vectors) given by $\mathbf{a}_1$, $\mathbf{a}_2$, and $\mathbf{a}_3$, that is, $h = [\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3]$. According to this, a change in $h_{\alpha\gamma}$ not only results in a scaling of all atom coordinates but also affects the grid points over which the electrostatic densities and potentials are defined within the GPW scheme. The calculation of the integral derivatives in Eq. (13) thus has to account for this effect, resulting in additional terms to be considered. Note that the number of grid points is kept fixed in simulations employing a variable cell.

Again, the computation of $\Pi^{(2)}$ can be split into two terms,

$$
\Pi^{(2)} = \Pi^{(2)\text{-NS}} + \Pi^{(2)\text{-S}},
\tag{49}
$$

that is, the non-separable part $\Pi^{(2)\text{-NS}}$, specific to the RI-MP2 method and associated with the first two summations in Eq. (13), and the separable part $\Pi^{(2)\text{-S}}$, giving the additional contribution derived from the contraction of the MP2 relaxed density matrices with the stress derivative of the Fock and overlap matrix elements, the last summation in Eq. (13).

Since in the non-separable part of $E^{(2)x}_{\mathrm{RI}}$ only integrals involving Coulomb interactions are required, $\Pi^{(2)\text{-NS}}$ is obtained with an approach similar to that used for calculating the stress tensor of the Hartree energy,101 for which the grid dependent contributions are evaluated following the work of Corso and Resta.102 This leads, respectively, for the 3- and 2-center contributions ($\Pi^{(2)\text{-NS}}_{\alpha\beta} = 4\,\Pi^{(2)\text{-NS-3c}}_{\alpha\beta} - 2\,\Pi^{(2)\text{-NS-2c}}_{\alpha\beta}$) to

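$$
\begin{aligned}
\Pi^{(2)\text{-NS-3c}}_{\alpha\beta} = -\frac{1}{3V} \Bigg(
& \delta_{\alpha\beta} \sum_{Q}^{\mathrm{AUX}} \sum_{\mu\nu}^{\mathrm{AO}} \Gamma^{Q}_{\mu\nu}\, (\mu\nu|Q)
 + \sum_{Q}^{\mathrm{AUX}} \sum_{\mu\nu}^{\mathrm{AO}} \left(\Gamma^{Q}_{\mu\nu} + \Gamma^{Q}_{\nu\mu}\right) \int (R_{I\beta} - r_{\beta})\, \nabla_{I\alpha}\phi_{\mu}(\vec r)\,\phi_{\nu}(\vec r)\, v^{Q}_{H}(\vec r)\, d\vec r \\
& + \sum_{Q}^{\mathrm{AUX}} \int (R_{I\beta} - r_{\beta})\, \nabla_{I\alpha}\chi_{Q}(\vec r)\, v^{\Gamma^{Q}}_{H}(\vec r)\, d\vec r
 + \sum_{Q}^{\mathrm{AUX}} \iint \rho_{Q}(\vec r)\,\rho^{\Gamma^{Q}}(\vec r\,')\, \frac{(r_{\alpha} - r'_{\alpha})(r_{\beta} - r'_{\beta})}{|\vec r - \vec r\,'|^{3}}\, d\vec r\, d\vec r\,' \Bigg),
\end{aligned}
\tag{50}
$$

$$
\begin{aligned}
\Pi^{(2)\text{-NS-2c}}_{\alpha\beta} = -\frac{1}{3V} \Bigg(
& \delta_{\alpha\beta} \sum_{PQ}^{\mathrm{AUX}} \Gamma^{PQ}\, (P|Q)
 + 2\sum_{Q}^{\mathrm{AUX}} \int (R_{I\beta} - r_{\beta})\, \nabla_{I\alpha}\chi_{Q}(\vec r)\, v^{\Gamma^{Q}}_{H}(\vec r)\, d\vec r \\
& + \sum_{Q}^{\mathrm{AUX}} \iint \rho_{Q}(\vec r)\,\rho^{\Gamma^{Q}}(\vec r\,')\, \frac{(r_{\alpha} - r'_{\alpha})(r_{\beta} - r'_{\beta})}{|\vec r - \vec r\,'|^{3}}\, d\vec r\, d\vec r\,' \Bigg),
\end{aligned}
\tag{51}
$$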

where $V$ is the volume of the cell, $\delta_{\alpha\beta}$ is the Kronecker delta, $\nabla_{I\alpha}$ is the $\alpha$ component of the gradient with respect to the atomic position, and $R_{I\beta}$ refers to the $\beta$ component of the atom coordinate. All other terms appearing in the above expressions have the same definitions given in Eqs. (18) and (19); note that the quantities labeled with the $\Gamma^{Q}$ superscript are computed differently for $\Pi^{(2)\text{-NS-3c}}_{\alpha\beta}$ and $\Pi^{(2)\text{-NS-2c}}_{\alpha\beta}$. In both cases, the first term arises from the scaling of the system's volume while the last is associated with the derivative of the electrostatic potential $v_H$.99 The remaining components are associated with the derivatives of the Gaussian basis functions.103,104

The RI-MP2 stress tensor is completed with the separable part $\Pi^{(2)\text{-S}}$. This final update is performed together with the calculation of the stress components of the Hartree-Fock energy. The approach is relatively straightforward and is accomplished with a methodology similar to the one previously explained for the general derivative case. Again, special care has to be taken in the case of the Hartree energy term, for which additional terms arise due to the dual representation of the density in GPW. These additional contributions are obtained with an approach similar to that used for the non-separable part but starting from Eqs. (22) and (23).

In this section, the general equations necessary for calculating the RI-MP2 energy derivatives have been presented with a particular focus on the way each term is calculated in the GPW framework. The presented approach has been applied to the calculation of the forces acting on the nuclei and the stress tensor components. In summary, among all intermediates, only a few quantities can be recognized as fundamental and need to be constructed in order to compute all the others, that is,


• $\Gamma^{P}_{ia}$, $\Gamma^{PQ}$ → RI-MP2 non-separable correction to the 2-particle density matrix.
• $P^{(2)}_{ij}$, $P^{(2)}_{ab}$ → Occupied-occupied and virtual-virtual blocks of the MP2 correction to the 1-particle density matrix.
• $L_{\mu j}(1)$, $L_{b\nu}(2)$ → Occupied and virtual Lagrangian in the mixed AO-MO representation.

III. IMPLEMENTATION

The general flowchart of the algorithm for the calculation of the RI-MP2 energy derivatives can be summarized as follows:

1. Calculation of $(P|Q)$ and its inverse square root $V^{-1/2}_{PQ}$, and subsequent evaluation of $B^{P}_{ia}$. These intermediates are evaluated within the RI-GPW approach.
2. Formation of the $(ia|jb)_{\mathrm{RI}}$ integrals from $B^{P}_{ia}$ (Eq. (4)), calculation of $E_{\mathrm{RI\text{-}MP2}}$, assembly of $Y^{P}_{ia}$ and $P^{(2)}_{ab}$ according to Eqs. (16) and (25), and evaluation of $P^{(2)}_{ij}$ via Eq. (24) for the diagonal and almost degenerate ($\epsilon_i \approx \epsilon_j$) elements.
3. Generation of $\Gamma^{P}_{ia}$ and $\Gamma^{PQ}$ from $Y^{P}_{ia}$, $B^{P}_{ia}$, and $V^{-1/2}_{PQ}$ (Eqs. (15) and (17)), evaluation of the non-separable contributions to $E^{(2)x}_{\mathrm{RI}}$, and assembly of $L_{\mu j}(1)$ and $L_{b\nu}(2)$, the occupied and virtual Lagrangian in the mixed AO-MO representation (Eqs. (32) and (33)).
4. Completion of $P^{(2)}_{ij}$ with $L_{ij}(1)$ for the non-singular elements (Eq. (34)). Construction of the RI-MP2 specific Lagrangian $L_{bj}$ and solution of the Z-vector equations. Assembly of $P^{(2)}_{pq}$ and $W^{(2)}_{pq}$ and final evaluation of $E^{(2)x}_{\mathrm{RI}}$ by contraction with $F^{(x)}_{pq}$ and $S^{(x)}_{pq}$.

The detailed description of each of these steps is reported in the Appendix, with a particular focus on the parallelization strategy.

IV. BENCHMARK CALCULATIONS

A. Computational details

The RI-GPW methods as implemented in CP2K78 have been employed for all calculations in this manuscript. The correlation energy calculations are based on pseudopotentials of the form suggested by Goedecker, Teter, and Hutter (GTH) in Ref. 88 but specifically parameterized for the methods employed to converge the wave function at the SCF level (HF or DFT). The same primary and auxiliary bases used in our previous works have been adopted.58,59 These are labeled as cc-DZVP, cc-TZVP, and cc-QZVP, denoting double, triple, and quadruple zeta quality, respectively. They consist of valence-only correlation consistent type105,106 basis sets, generated for use with these pseudopotentials. The Hartree-Fock exchange calculations have been performed employing our robust Γ-point implementation89,90 that allows for stable calculations in the condensed phase.90,107 The Schwarz screening threshold for the HF calculations is in the range of $10^{-8}$ to $10^{-10}$ for the energy, while for the related derivatives, the threshold is, in general, relaxed by one order of magnitude. Periodic calculations require a truncated Coulomb operator,90 using approximately half the length of the smallest edge of the simulation cell as truncation radius. The threshold for the SCF convergence was $10^{-6}$ or tighter. The PW cutoff for the HF and DFT (Perdew-Burke-Ernzerhof, PBE108) parts of the calculations was $E_{\mathrm{cut}} = 800$ and $E_{\mathrm{cut}} = 1200$ Ry, respectively, to guarantee convergence of the exchange-correlation term. The correlation energy calculations employed high quality PW cutoffs of $E_{\mathrm{cut}} = 300$ Ry, $E^{\mathrm{rel}}_{\mathrm{cut}} = 50$ Ry, $\epsilon_{\mathrm{filter}} = 10^{-12}$, and $\epsilon_{\mathrm{grid}} = 10^{-8}$,58,59 unless mentioned otherwise. The convergence threshold for the Z-vector equations, measured as the norm of the residual vector, was $10^{-5}$ or tighter. Cluster boundary conditions have been adopted for solving the Poisson equation87 in the case of gas phase systems, with cubic cells with edges between 15 and 20 Å depending on molecule size. The pseudopotential and basis set parameters employed in this work can be found in Ref. 59.

For all the considered crystals, supercells have been generated by replicating the unit cell, so that the smallest edge was larger than 9 Å, in order for the Γ-point approximation to be reasonable. The experimental geometries of the molecular crystals have been retrieved from the Cambridge Structural Database (CSD),109 and the structural data, together with the supercell used in the calculation and the CSD refcode, can be recovered from our previous works.58,59 For both geometry and cell optimizations (Opt's), no symmetry constraints have been considered, and the latter have been performed under ambient pressure. The convergence thresholds have been set to $3.0 \times 10^{-3}$ and $1.5 \times 10^{-3}$ bohr, respectively, for the maximum and root mean square (RMS) of the geometry changes, and to $4.5 \times 10^{-4}$ and $3.0 \times 10^{-4}$ hartree bohr$^{-1}$, respectively, for the maximum and RMS of the forces, while a pressure tolerance of 100 bar has been considered for the cell optimization.

The counterpoise (CP) corrected cohesive energy per molecule at a given volume $V$ and for a given basis has been computed as110,111

$$
E^{\mathrm{CP}}_{\mathrm{coh}}(V) = \frac{E_{\mathrm{supercell}}(V)}{N_{\mathrm{mol}}} - E^{\mathrm{gas}}_{\mathrm{mol}} - E^{\mathrm{crystal}}_{\mathrm{mol+ghost}}(V) + E^{\mathrm{crystal}}_{\mathrm{mol}}(V).
\tag{52}
$$

Here, $N_{\mathrm{mol}}$ is the number of molecules per supercell, $E_{\mathrm{supercell}}(V)$ the total energy of the supercell, and $E^{\mathrm{crystal}}_{\mathrm{mol+ghost}}(V)$, $E^{\mathrm{crystal}}_{\mathrm{mol}}(V)$, and $E^{\mathrm{gas}}_{\mathrm{mol}}$ the total energy of an isolated molecule in either the crystal geometry ($E^{\mathrm{crystal}}_{\mathrm{mol+ghost}}(V)$ and $E^{\mathrm{crystal}}_{\mathrm{mol}}(V)$) or a gas phase geometry ($E^{\mathrm{gas}}_{\mathrm{mol}}$). $E^{\mathrm{crystal}}_{\mathrm{mol+ghost}}(V)$ includes ghost atoms from the 12 nearest neighbor molecules for NH3 and CO2, and from the first coordination shell in all other cases. The gas phase geometries have been obtained by relaxation at the RI-MP2 level. Note that all geometry and cell relaxations have been performed without counterpoise correction.

B. Validation

In order to validate that forces and stress are computed consistently with the way the RI-MP2 energy is calculated, a series of short Born-Oppenheimer molecular dynamics (BOMD) simulations have been run with different time steps ∆t, employing the velocity-Verlet algorithm for the integration of the equations of motion. The simulations have been performed in


the NVE ensemble, Figure 1(a), and in the isobaric-isothermal ensemble (NpT), Figure 1(b). In the former, only the forces acting on the atoms have to be computed, while in the latter the calculation of the stress tensor is also required. The model system is made of 4 NH3 molecules in a cubic box with periodic boundary conditions, employing the cc-DZVP basis.

FIG. 1. Energy fluctuation with respect to the average during a sequence of Born-Oppenheimer molecular dynamics simulations with periodic boundary conditions as a function of the time step ∆t. The results are obtained in the NVE and NpT ensembles for (a) and (b), respectively. In both cases, the system is made of 4 molecules of NH3 in a cubic box with the cc-DZVP basis.

Within the velocity-Verlet integration scheme, the total energy of an equilibrated system fluctuates around the average value with a standard deviation $\sigma_E$ that is expected to be proportional to the square of the time step employed in the simulation, $\sigma_E \propto \Delta t^2$, meaning that, if ∆t is halved, then $\sigma_E$ is reduced roughly by a factor of four. This of course holds only if the forces, from which the accelerations are obtained, are computed as exact derivatives of the potential energy. The energies obtained from the BOMD trajectories are reported in Figure 1, qualitatively showing that the magnitude of the fluctuations is reduced roughly by a factor of four every time ∆t is halved. More precisely, the values of $\sigma_E$ calculated for the NVE and NpT runs are 0.31, 1.2, 4.7 and 0.18, 0.72, 2.9 microhartree, respectively, for time steps of 0.05, 0.1, and 0.2 fs. These results thus confirm the correctness of the RI-MP2 energy derivative implementation. We find that this approach is a stronger check than the mere comparison with numerical derivatives. For example, a large set of configurations is sampled, making it possible to track the propagation of possible small errors that may not be detected by numerical differentiation.

C. Performance of the methods

The parallel performance of the algorithm for calculating the RI-MP2 energy gradients and stress has been measured for a system made of 64 water molecules in a cubic box with PBC at experimental density. The cc-TZVP basis has been employed, resulting in 256 occupied orbitals, and 3648 primary and 8704 auxiliary basis functions. The measured time includes all operations described in the Appendix, excluding only the contraction of $P^{(2)}_{pq}$ and $W^{(2)}_{pq}$ with the skeleton derivatives of the Fock and overlap matrix elements, which is considered as part of the calculation of the HF energy derivatives. This means that the solution of the Z-vector equation has also been traced. In this respect, due to the limited amount of memory available for the smaller run, the AO-ERIs computed at the SCF level could not be kept in core during the calculation of the RI-MP2 specific quantities, and their recomputation is thus necessary before solving the Z-vector equation.

The speedup and efficiency measured on a CRAY-XC30 machine are reported in Figure 2. This machine mounts a GPU on each node, but for these measurements the devices have not been used. The algorithm displays good parallel scalability, resulting in an efficiency higher than 80% for almost the whole range. At the full scale-out (32 768 processes), the time for computing the RI-MP2 energy gradients and stress is 106 s. The relatively large drop in efficiency observed in going from 3072 to 4096 nodes is related to the small number of ij pairs processed by each Message Passing Interface (MPI) task in the latter, such that the time spent in computation becomes of the same order as the overheads related to communication.

In Table I, the timing for different benchmark calculations, obtained employing 512 nodes of a CRAY-XC30 machine, is reported; in this case, the impact of the usage of the GPUs has also been considered. In general, for the present implementation, the GPUs have been used to accelerate all the steps of the algorithm that are performed as matrix multiplications. This is of particular advantage for the RI-MP2 method since the part expected to be most computationally intensive, i.e., the $O(N^5)$ steps, is entirely accomplished in this way.

For the sake of completeness, Table I also reports the time $t_{\mathrm{tot}}$ necessary for the evaluation of the energy gradients and stress of the total energy (HF + RI-MP2). At the Hartree-Fock level, the most expensive operations are related to the update of the Fock matrix with the exact exchange contributions, which involve the calculation of the AO-ERIs and their derivatives.


FIG. 2. Speedup (a) and efficiency (b) with respect to 64 nodes for the calculation of the RI-MP2 energy gradients and stress of 64 bulk water molecules (cc-TZVP basis). Calculation performed on a CRAY-XC30 machine, each node consists of 8 processes.
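For readers less familiar with the quantities plotted in Figure 2, the following sketch shows one common convention for speedup and parallel efficiency relative to a reference run (here the smallest node count). The function name and the timings are hypothetical and chosen only for illustration; they are not the measured data of this work, and the exact normalization used in the figure may differ.

```python
def speedup_and_efficiency(nodes, times, ref_index=0):
    """Speedup and parallel efficiency relative to the run at ref_index.

    speedup(n)    = t_ref / t_n
    efficiency(n) = speedup(n) * n_ref / n   (1.0 means ideal scaling)
    """
    n_ref = nodes[ref_index]
    t_ref = times[ref_index]
    speedup = [t_ref / t for t in times]
    efficiency = [s * n_ref / n for s, n in zip(speedup, nodes)]
    return speedup, efficiency


if __name__ == "__main__":
    # Hypothetical wall times in seconds, NOT taken from the paper.
    nodes = [64, 128, 256, 512, 1024]
    times_s = [1000.0, 510.0, 262.0, 137.0, 74.0]
    speedup, efficiency = speedup_and_efficiency(nodes, times_s)
    for n, s, e in zip(nodes, speedup, efficiency):
        print(f"{n:5d} nodes: speedup {s:6.2f}, efficiency {100 * e:5.1f}%")
```

With this convention an efficiency above 80% at a given node count means that the run is at most 25% slower than ideal linear scaling from the reference would predict.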

For the reported benchmarks, computing the derivatives of the RI-MP2 energy takes a percentage of the total time that grows systematically with increasing system size, up to 75% for the largest case. The exception is the 64 bulk water case, for which the Schwarz screening, in particular for the $(\mu\nu|\lambda\sigma)$ derivatives, is particularly effective, resulting in a small time spent at the HF level.

The total time necessary for calculating the RI-MP2 energy gradients and stress ($t_D$), reported in Table I, is of the order of minutes for all the cases and turns out to be a factor between 4 and 5 times larger ($t_D/t_E$) than what is required for the calculation of the RI-MP2 energy only. According to the analysis done in Subsection 2 of the Appendix, in the limit of very large systems, i.e., when the $O(N^5)$ steps are by far the most time consuming part of the total computation, the ratio $t_D/t_E$ is expected to be between 3 and 4. This is just the theoretical limit since the calculation of the integral derivatives as well as the solution of the Z-vector equations will always give a non-negligible overhead to the calculation.

The relative time spent in each part of the algorithm, for the different benchmark calculations, is reported in Figure 3. For all cases, except the ammonia crystal, the time spent calculating the $O(N^5)$ intermediates is more than 50% of the total time, reaching almost 80% for the largest case (cyclotrimethylene-trinitramine (CT)). The computation of the RI specific quantities, $V^{-1/2}_{PQ}$ and $B^{P}_{ia}$, is the cheapest operation, requiring less than 10% of the total computational effort for all calculations. The evaluation of the non-separable contributions to $E^{(2)x}_{\mathrm{RI}}$ is dominated by the calculation of the 3-center integrals and associated derivatives, and turns out to be roughly a factor of 3 more expensive than the computation of $V^{-1/2}_{PQ}$ and $B^{P}_{ia}$. In this respect, the computation of the non-separable contributions

TABLE I. Benchmark calculations for the RI-MP2 energy gradients and stress, time in min. on a CRAY-XC30, 4096 processes, 512 GPUs. U = urea, B = benzene, FA = formic acid, SA = succinic anhydride, D = 2,3-diazanaphthalene, PD = pyromellitic dianhydride, CT = cyclotrimethylene-trinitramine, H2O = 64 bulk water molecules, NH3 = ammonia crystal (32 molecules), CO2 = carbon dioxide crystal (32 molecules). o, n, and Na represent the number of occupied orbitals, basis functions, and auxiliary basis functions, respectively. The reported timings represent: ttot = total time for computing HF and RI-MP2 energy, gradients, and stress; tD = time for computing RI-MP2 energy, gradients, and stress; tD/tE = ratio between tD and the time for computing only the RI-MP2 energy; tD(GPU) = the same as tD but employing GPUs; tD/tD(GPU) = observed speedup when using GPUs.

System   o     n      Na      ttot    tD      tD/tE   tD(GPU)   tD/tD(GPU)
NH3      128   2272   5312    3.15    1.53    4.20    1.47      1.04
U        192   2752   6784    5.97    3.58    4.59    2.89      1.24
FA       216   2760   6912    5.83    3.87    4.28    2.95      1.31
D        192   2992   7520    12.84   5.27    5.15    4.26      1.24
CO2      256   2784   7296    7.94    4.99    4.15    3.50      1.43
H2O      256   3648   8704    10.17   9.34    4.00    5.85      1.60
B        240   4128   10176   23.01   13.77   4.45    8.81      1.56
PD       312   3936   10208   28.96   17.48   4.13    9.80      1.78
SA       304   4144   10432   27.00   19.29   4.26    10.94     1.76
CT       336   4152   10560   29.71   22.30   4.16    11.97     1.86
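The derived columns of Table I can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the tabulated numbers: the dictionary layout and function names are assumptions made for illustration, while the timings are copied from the table. It recomputes the GPU speedup tD/tD(GPU), the fraction of the total time spent in the RI-MP2 derivatives (tD/ttot), and the implied energy-only time tE = tD / (tD/tE).

```python
# Values copied from Table I: (ttot, tD, tD/tE, tD(GPU), reported tD/tD(GPU)), times in minutes.
TABLE_I = {
    "NH3": (3.15, 1.53, 4.20, 1.47, 1.04),
    "U": (5.97, 3.58, 4.59, 2.89, 1.24),
    "FA": (5.83, 3.87, 4.28, 2.95, 1.31),
    "D": (12.84, 5.27, 5.15, 4.26, 1.24),
    "CO2": (7.94, 4.99, 4.15, 3.50, 1.43),
    "H2O": (10.17, 9.34, 4.00, 5.85, 1.60),
    "B": (23.01, 13.77, 4.45, 8.81, 1.56),
    "PD": (28.96, 17.48, 4.13, 9.80, 1.78),
    "SA": (27.00, 19.29, 4.26, 10.94, 1.76),
    "CT": (29.71, 22.30, 4.16, 11.97, 1.86),
}


def derived_quantities(t_tot, t_d, ratio_d_e, t_d_gpu):
    """Return (GPU speedup, RI-MP2 fraction of the total time, energy-only time in min)."""
    gpu_speedup = t_d / t_d_gpu
    mp2_fraction = t_d / t_tot
    t_energy_only = t_d / ratio_d_e
    return gpu_speedup, mp2_fraction, t_energy_only


if __name__ == "__main__":
    for name, (t_tot, t_d, ratio, t_gpu, reported) in TABLE_I.items():
        speedup, fraction, t_e = derived_quantities(t_tot, t_d, ratio, t_gpu)
        print(f"{name:4s} speedup {speedup:4.2f} (reported {reported:4.2f}), "
              f"tD/ttot {100 * fraction:5.1f}%, tE ~ {t_e:5.2f} min")
```

The recomputed speedups agree with the last column to within rounding, and the largest system (CT) gives tD/ttot of about 75%, while the 64 bulk water case stands out with a much larger fraction, as discussed in the text.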


FIG. 3. Relative time, expressed as a percentage, spent in each of the most relevant parts of the algorithm for the same benchmark calculations reported in Table I. The labels in the legend stand for: Integrals = evaluation of $V^{-1/2}_{PQ}$ and $B^{P}_{ia}$ (Subsection 1 of the Appendix); $O(N^5)$ = evaluation of the $O(N^5)$ scaling intermediates (Subsection 2 of the Appendix); Non-Sep. = evaluation of the non-separable contributions to $E^{(2)x}_{\mathrm{RI}}$ (Subsection 3 of the Appendix); Z-vector = solution of the Z-vector equations and assembly of $P^{(2)}_{pq}$ and $W^{(2)}_{pq}$ (Subsection 4 of the Appendix).

to the RI-MP2 stress tensor takes around 30% of the time spent in this part, while the rest is related to the evaluation of the forces. The remaining part is associated with the solution of the Z-vector equations, which can require a variable percentage of the overall time according to the number of iterations necessary to reach convergence. For all the reported cases, the amount of memory was enough to keep in core the AO-ERIs computed at the SCF level during the evaluation of the RI-MP2 specific quantities. This avoided their recomputation for the solution of the Z-vector equations, making this operation less computationally demanding.

Finally, the time for the calculation of the RI-MP2 energy gradients and stress obtained by exploiting the GPUs for the operations performed as matrix multiplications is reported in Table I, labeled as $t^{\mathrm{GPU}}_{D}$, together with the observed speedup compared to the CPU only case ($t_D/t^{\mathrm{GPU}}_{D}$). As shown in the table, the speedup is modest for the smaller cases while approaching a factor of 2 for the larger ones. Focusing on the largest benchmark calculation (CT), the observed speedup for the overall calculation is roughly 1.9, while specifically for the steps performed as matrix multiplications, the observed speedups are in general greater than 4.

D. System size scaling

In order to verify the cost models presented in the Appendix, the time for each of the individual steps of the algorithm has been measured for increasing system sizes. The test system is based on a supercell containing 32 water molecules with a cc-DZVP basis set that has been replicated in one dimension up to 5 times. The results are reported in Figure 4, where the obtained timings have been fitted with the function $y = bx^{a}$, yielding the measured scaling exponent $a$ associated with each different step.

FIG. 4. Time spent in the various significant parts of the algorithm for the calculation of the RI-MP2 energy gradients, as a function of the number of replicas of the supercell, containing 32, 64, 96, 128, and 160 molecules of H2O, respectively (cc-DZVP basis). Timing measured on a CRAY-XK7 machine employing 2400 processes without GPUs. Lines represent a linear two-parameter fit of the form $y = bx^{a}$. The values of $a$ for each operation are reported in the legend.

In Figure 4, the labels "Integrals" and "Non-Sep." refer to all operations described in Subsections 1 and 3 of the Appendix, respectively. The evaluation of the $O(N^5)$ scaling intermediates (Subsection 2 of the Appendix) has been traced in more detail, reporting separately the measured timings for the generation of the $(ia|jb)_{\mathrm{RI}}$ integrals, the update of $Y^{P}_{ia}$ and $P^{(2)}_{ab}$, and communication. Note that the solution of the Z-vector equations in this case, due to the combination of the system topology and small basis, takes a negligible time with respect to the total time and thus has not been reported in the plot.

The observed scaling for "Integrals" and "Non-Sep." is 2.1 and 2.3, respectively, showing that the integration of the electrostatic potential over pairs of basis elements $\mu\nu$ is the dominant part within the tested sizes. This operation is in fact expected to scale as $O(N^2)$, while the additional parts, such as the integral transformation, scaling as $O(N^4)$, make the exponent $a$ slightly larger than 2. This effect is more pronounced for the


latter compared to the former due to the higher number of $O(N^4)$ steps performed in the update of the non-separable part of $E^{(2)x}_{\mathrm{RI}}$.

For the evaluation of the $O(N^5)$ intermediates ($(ia|jb)_{\mathrm{RI}}$, $Y^{P}_{ia}$, and $P^{(2)}_{ab}$), the observed scaling is in all cases 4.8, while communication has a measured $a$ of 4.2. This is in agreement with the performance models derived for these operations, that is, 5 for the former and 4 for the latter. From the comparison of the timings of the individual $O(N^5)$ steps, it is observed that the update of $Y^{P}_{ia}$ takes roughly a factor of 2 longer than the generation of $(ia|jb)_{\mathrm{RI}}$, while the latter is two times more expensive than the update of $P^{(2)}_{ab}$. The reason for the first observation is that, for a given $ij$ pair, the update of $Y^{P}_{ia}$ and the generation of $(ia|jb)_{\mathrm{RI}}$ both require $O(v^2 N_a)$ operations, but for the former ($Y^{P}_{ia}$) this is performed 2 times (for $i$ and $j$, respectively). The update of $P^{(2)}_{ab}$ requires $O(2v^3)$ operations for each $ij$ pair, and since in this case $N_a \simeq 4v$, the observed time scales as $O(v^2 N_a/2)$, that is, half of what is needed for generating $(ia|jb)_{\mathrm{RI}}$.

E. Applications

1. Solid NH3 and CO2

Ammonia and carbon dioxide molecular crystals represent two simple benchmark systems useful for judging the performance of a method. The dominant interactions in the two cases are very different in nature, being weak hydrogen bonds for NH3 and purely van der Waals for CO2. These systems have been extensively investigated both experimentally and theoretically. Concerning the theoretical studies, many of them are based on MP2 theory, such as periodic-canonical MP2,58 periodic-LMP2,110 incrementally corrected LMP2,118 embedded many-body expansion,119,120 and the hybrid Quantum Mechanics/Molecular Mechanics (QM/MM) fragment method.121

Two approaches can be used for calculating the equilibrium lattice parameter ($a$) and cohesive energy ($E_{\mathrm{coh}}$) of these crystals. The first approach (direct method) is to perform a cell optimization followed by the calculation of $E_{\mathrm{coh}}$ for the equilibrium structure. The second one (indirect method) consists of optimizing the geometry at various fixed volumes, from which the equilibrium quantities $a$ and $E_{\mathrm{coh}}$ are derived by fitting, e.g., a third order Birch-Murnaghan equation. The former is computationally more efficient since a single optimization has to be carried out; moreover, it gives more information on the local molecular structure at equilibrium. On the other hand, this approach suffers from basis set superposition error (BSSE), which can be particularly large within MP2 theory. This drawback can be remedied by increasing the basis set, or, in the indirect method, by considering counterpoise corrected energies. Both approaches have been considered employing the cc-TZVP and cc-QZVP bases; the computed equilibrium properties are summarized in Table II.

As shown in Figure 5, at the triple zeta level, the lattice parameter optimization curves are calculated both with and without CP correction. The position of the obtained minima, indicated by the crosses in the plots, clearly shows the large discrepancies between the two approaches. As expected, for both systems, without CP correction the crystals are overbound, with a shorter equilibrium lattice parameter and a larger cohesive energy. The CP correction fixes this issue, giving values for $a$ closer to the experimental one but $E_{\mathrm{coh}}$ in general higher.

The cell relaxation provides converged structures that preserve the cubic symmetry of the crystals within the numerical accuracy of the method. With the cc-TZVP basis, the lattice parameters obtained from Opt are essentially the same as those obtained from the non-CP curve optimizations (Fit-Opt) and thus substantially shorter than those evaluated with the CP correction (Fit-Opt-CP). This divergence is effectively reduced when using the cc-QZVP basis, showing a clear trend in the convergence. In fact, by inspection of Figure 5, it can be seen that the Fit-Opt-CP approach tends to converge, with respect to the basis set, from larger values of $a$, opposite to the case of the cell optimization.

At the quadruple zeta level, the lattice parameters obtained from the cell optimization are 5.01 and 5.48 Å, respectively, for NH3 and CO2, with associated cohesive energies of −33.2 and −24.9 kJ/mol. With the same basis, the CP curve optimization procedure, Fit-Opt-CP(QZ), gives similar results for $E_{\mathrm{coh}}$, but slightly larger values of $a$, being, respectively, 5.05 and 5.52 Å. From these observations, it can be concluded that, for the supercell considered in this work, the complete basis set limit for the equilibrium lattice parameter should be within these values, that is, between 5.01–5.05 and 5.48–5.52 Å, respectively, for NH3 and CO2.

For ammonia, a good agreement is found with the values reported by Maschio et al.110 obtained with the aug(d,f)-TZPP basis, that is, $a$ = 5.02 Å and $E_{\mathrm{coh}}$ = −36.6 kJ/mol, while a larger deviation in the lattice parameter is observed for carbon dioxide ($a$ = 5.59 Å and $E_{\mathrm{coh}}$ = −26.6 kJ/mol). A better agreement for the lattice constant $a$ of the CO2 crystal is obtained when comparing with the values of 5.52 and 5.46 Å reported, respectively, by Bygrave et al.119 and Sode et al.,120 calculated with a CP augmented triple-zeta basis for the former and an augmented quadruple-zeta basis for the latter. In this case, the optimized C-O bond length also matches the values reported by these authors.

As a comparison, the results obtained with the PBE functional including the Grimme D3 correction122 have also been reported. For ammonia, a good agreement is found between RI-MP2 and PBE-D3 in the lattice constant, while the cohesive energies display a large discrepancy. On the other hand, for CO2, $E_{\mathrm{coh}}$ is estimated to be roughly the same but the value of $a$ is around 5% larger than that calculated with RI-MP2.

Since many effects, such as temperature dependence and zero-point vibrational energies, are neglected, caution has to be exercised when comparing the obtained results with experiments. For both crystals, at the QZ level, the lattice parameters are less than 2% shorter than the experimental ones, while the cohesive energies display larger deviations.

A result of the analysis reported here is that, due to the slow convergence of the MP2 energy with respect to basis set size, the bulk structures are also subject to BSSE. Here, in order to remedy this issue, counterpoise correction and larger basis sets have been used. An alternative, which requires significant additional development and is thus not tested here, is to use explicitly correlated treatments such as F12-MP2,31,32

Reuse of AIP Publishing content is subject to the terms: https://publishing.aip.org/authors/rights-and-permissions. Downloaded to IP: 130.60.47.22 On: Thu, 12 May 2016 10:40:26 102803-13 Del Ben, Hutter, and VandeVondele J. Chem. Phys. 143, 102803 (2015)

FIG. 5. Location of the minima for NH3 (a) and CO2 (b), computed at the RI-MP2 level of theory with different basis sets obtained with different approaches. The lattice parameter optimization curves have been fitted with a third order Birch-Murnaghan equation; the crosses represent the location of the minimum point for each curve. CP means that the cohesive energies have been counterpoise corrected.

which is also possible in the condensed phase.33 In addition, in previous work,58,59 the effect of the supercell size on the cohesive energy has been studied in detail, showing that an $L^{-3}$ extrapolation, with $L$ the edge of the unit cell, can be used to converge the system size effect inherent to the $\Gamma$-point approximation employed.

2. Molecular crystals

Geometry and cell optimization at the RI-MP2 level employing the cc-TZVP basis have been carried out for a set of molecular crystals. This set includes the crystals of urea (U), formic acid (FA), benzene (B), pyromellitic dianhydride (PD), succinic anhydride (SA), and cyclotrimethylene-trinitramine (CT). The dominant intermolecular interactions for these crystals cover a large range, from hydrogen-bond to dipole-dipole to purely van der Waals. The size of the molecules across the investigated set is also quite different, going from a minimum of 5 atoms (FA) up to 21 atoms for the largest case (CT). For the relaxed structures, the counterpoise corrected cohesive energy has been computed; the obtained results, compared with the experimental values, are reported in Table III.

For all cases, the cell optimization preserves the orthorhombic symmetry of the crystals, roughly keeping the experimental aspect ratio. The obtained lattice parameters are in all cases underestimated compared to experiment, with deviations ranging from 1% up to 7%. This leads to much larger discrepancies for the cell volumes, for which a maximum deviation of 15% is observed for the benzene crystal. A smaller error, around 8% for the volume, is observed for the urea and formic acid crystals, cases for which the intermolecular interactions are mainly of dipole-dipole and hydrogen-bond types.

TABLE II. Equilibrium cohesive energy (Ecoh in (kJ/mol)/ molecule) and structural properties (lattice parameter a and bond length in Å, angles in degree) for the NH3 and CO2 crystals calculated employing different methods. Except when specified otherwise, the basis set is cc-TZVP and the level of theory RI-MP2 (QZ stands for cc-QZVP basis). CP means that Ecoh is counterpoise corrected, Opt denotes full cell optimization while Fit-Opt refers to the results obtained by fitting the curves shown in Figure 5 (third order Birch-Murnaghan equation). Experimental values from Ref. 110 (see also Refs. 112–114) and Refs. 115–117, and references therein for NH3 and CO2, respectively.

                   NH3                                        CO2
Method             a       r(N–H)      ∠H–N–H   Ecoh          a           r(C–O)        Ecoh
Opt-CP (PBE-D3)    5.00    1.027       107.5    −44.97        5.73        1.171         −26.50
Fit-Opt            4.98    ...         ...      −37.76        5.44        ...           −31.04
Fit-Opt-CP         5.08    ...         ...      −31.54        5.54        ...           −22.26
Fit-Opt-CP (QZ)    5.05    ...         ...      −33.25        5.52        ...           −24.96
Opt-CP             4.98    1.017       107.4    −31.32        5.44        1.166         −21.90
Opt-CP (QZ)        5.01    1.017       107.2    −33.23        5.48        1.168         −24.88
Expt.              5.048   1.01-1.06   107.5    −36.3         5.55-5.62   1.155-1.12    −31.1
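The Fit-Opt entries rely on fitting energy-volume points with a third order Birch-Murnaghan equation and locating the minimum. The following is a minimal sketch of that idea, not the authors' fitting code: it uses the fact that the third order Birch-Murnaghan form is an exact cubic polynomial in $V^{-2/3}$, so an ordinary polynomial fit suffices. The function names, the parameter values, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def birch_murnaghan(V, E0, V0, B0, B0p):
    """Third order Birch-Murnaghan energy-volume relation."""
    x = (V0 / V) ** (2.0 / 3.0)
    return E0 + 9.0 * V0 * B0 / 16.0 * (
        (x - 1.0) ** 3 * B0p + (x - 1.0) ** 2 * (6.0 - 4.0 * x)
    )

def fit_minimum(volumes, energies):
    """Fit E(V) as a cubic in t = (V / V_ref)**(-2/3); return (V_min, E_min)."""
    volumes = np.asarray(volumes, dtype=float)
    energies = np.asarray(energies, dtype=float)
    v_ref = volumes.mean()                      # rescaling keeps the fit well conditioned
    t = (volumes / v_ref) ** (-2.0 / 3.0)
    coeffs = np.polyfit(t, energies, 3)
    roots = np.roots(np.polyder(coeffs))        # stationary points of the cubic
    roots = roots[np.isreal(roots)].real
    curvature = np.polyval(np.polyder(coeffs, 2), roots)
    inside = (roots >= t.min()) & (roots <= t.max())
    candidates = roots[(curvature > 0.0) & inside]
    if candidates.size == 0:
        raise ValueError("no minimum inside the sampled volume range")
    t_min = candidates[0]
    return v_ref * t_min ** (-1.5), float(np.polyval(coeffs, t_min))

# Synthetic cubic-cell example (hypothetical parameters, not data from the paper).
a_grid = np.linspace(4.8, 5.3, 9)               # lattice parameters in Angstrom
V_grid = a_grid ** 3
E_grid = birch_murnaghan(V_grid, -33.2, 5.05 ** 3, 0.5, 5.0)
V_min, E_min = fit_minimum(V_grid, E_grid)
a_min = V_min ** (1.0 / 3.0)                    # cubic cell: a = V**(1/3)
```

For a cubic cell the equilibrium lattice parameter follows directly from the fitted volume; for CP corrected curves the same fit would simply be applied to the counterpoise corrected energies.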


This difference can be rationalized by inspection of the S22 set,123 for which it is shown that MP2 in general performs poorly for complexes with a predominant dispersion contribution, such as the case of benzene, while the results are usually better for hydrogen bonded complexes.123,124

As in the cases of solid NH3 and CO2, the cell optimization at the triple-zeta level is clearly affected by the BSSE. In order to estimate to which extent the observed error in the converged lattice parameters is due to BSSE or to an intrinsic limitation of MP2 theory, a full cell relaxation employing the cc-QZVP basis has been performed for the formic acid crystal. The obtained lattice constants at the quadruple-zeta level are a = 10.20, b = 3.41, and c = 5.33 Å, while the computed cohesive energy is −59.3 kJ/mol. Increasing the basis set thus reduces the error in the equilibrium volume to 4.2%. In particular, while for the a and c vectors the agreement is fairly good, a large deviation is observed for the b lattice parameter, such that the discrepancy in the volume is almost completely determined by this underestimation. Solid formic acid consists of infinite chains of molecules linked by hydrogen bonds, while inter-chain interactions are dominated by dispersion and weak C–H···O contacts. The weak intermolecular interactions act along the cell vectors a and b, while the hydrogen bonded formic acid molecules form infinite chains approximately oriented along cell vector c.125,126 The compression observed along the b vector can be interpreted as the result of the overbinding of dispersion interactions at the MP2 level. Note that the cell optimization for FA at the PBE level with the cc-TZVP basis gives a = 11.19, b = 4.20, and c = 5.24 Å, in good agreement with a = 10.91, b = 4.11, and c = 5.28 Å, reported by Tosoni et al.125 and obtained employing the Ahlrichs' TZP basis.

The CP corrected cohesive energies $E_{\mathrm{coh}}$ reported in Table III have been computed for the relaxed structures obtained after geometry optimization at experimental volume and after cell optimization. Even if a strong structural relaxation takes place in the cell optimization, this is not reflected in a large variation of $E_{\mathrm{coh}}$ compared to that obtained after geometry optimization at fixed experimental volume. This is a direct consequence of the weak binding interactions that dominate in these systems, giving rise to relatively flat potential energy surfaces with respect to the cell parameters. Moreover, the shrinking of the volume during the cell relaxation leads to structures of the individual molecules that are usually less stable when extracted from the relaxed geometries obtained from Cell-Opt than those obtained from Geo-Opt. This effect partially compensates the gain in energy due to the optimization of the cell parameters.

When comparing $E_{\mathrm{coh}}$ with experimental sublimation enthalpies ∆H(s), it can be noted that, at the MP2 level, a good agreement is found when the crystals are bound by mixed electrostatic-dispersion interactions, as in the case of urea, succinic anhydride, and cyclotrimethylene-trinitramine.127 For crystals such as benzene and pyromellitic dianhydride, bound purely by van der Waals dispersion interactions, large deviations are observed, resulting in general in an overestimation of the cohesive energy.

TABLE III. Counterpoise corrected cohesive energy Ecoh (kJ/mol), equilibrium volume V (Å3), and lattice parameters abc (Å) for different molecular crystals calculated after structural relaxation at the RI-MP2 level of theory employing the cc-TZVP basis. The meaning of the labels is U = urea, FA = formic acid, B = benzene, PD = pyromellitic dianhydride, SA = succinic anhydride, and CT = cyclotrimethylene-trinitramine. Geo-Opt refers to geometry optimization at experimental volume while Cell-Opt stands for full cell relaxation. The experimental Ecoh are obtained from sublimation enthalpies ∆H(s) with opposite sign. For the experimental data, see Refs. 58 and 111, as well as http://webbook.nist.gov/chemistry/ and the Cambridge Structural Database.109

        Geo-Opt                                 Cell-Opt                                Expt.
        Ecoh      a, b, c                V      Ecoh      a, b, c                V      Ecoh
U       −97.1     5.45, 5.45, 4.64       138    −96.6     5.65, 5.65, 4.70       150    −92
FA      −55.6     10.06, 3.36, 5.31      179    −54.9     10.24, 3.54, 5.36      194    −68
B       −58.8     7.14, 8.78, 6.39       400    −63.0     7.40, 9.44, 6.78       473    −45
PD      −125.7    10.16, 10.18, 7.29     754    −130.4    10.79, 10.79, 7.41     863    −83
SA      −82.6     5.14, 6.57, 11.31      382    −84.8     5.43, 6.97, 11.72      443    −81
CT      −116.6    12.88, 11.05, 10.21    1452   −115.8    13.18, 11.57, 10.71    1634   −112

V. CONCLUSIONS

In this work, the equations for calculating the derivatives of the MP2 energy in the framework of the resolution of identity Gaussian and plane waves method have been derived and presented in detail. The central aspect in the computation of the derivatives of the correlation energy within the RI-GPW approach is the dual representation of the RI fitting density in terms of Gaussian and plane waves auxiliary functions. The latter representation is equivalent to expressing the electrostatic densities over regular grids in space. This allows the straightforward conversion of these quantities into the associated potentials by solving the Poisson equation in G space and exploiting Fourier transformations for switching between direct and reciprocal representations. In this way, the evaluation of the integral derivatives is accomplished consistently with the way the energy is calculated. This approach is of general validity and it has been applied to the calculation of the forces acting on the atoms (gradients) as well as to the derivative with respect to the cell volume (stress tensor).

For the presented scheme a massively parallel algorithm has been designed displaying, with respect to the system size, cubic, quartic, and quintic requirements, respectively, for the memory, communication, and computation. All these requirements are reduced with an increasing number of processes. The implementation is based on a hybrid OpenMP/MPI scheme for


which the parallelization is achieved by distributing the work over subgroups of processes rather than over a single task. This allowed a more flexible memory management and reduced communication without loss of computational efficiency. The measured performance displays excellent parallel scalability and efficiency up to thousands of nodes. Moreover, in the actual implementation the computationally more demanding part, that is, the $O(N^5)$ steps, is accelerated by employing GPU's, showing a gain of almost a factor two compared to the standard CPU only case for large systems.

Several benchmark calculations have been reported for validating the theoretical and methodological aspects. It is shown that the presented scheme is efficient, accurate, and robust, especially for systems in the condensed phase. The calculation of the derivatives at the RI-MP2 level is between 4 and 5 times more expensive than computing only the energy. Geometry optimization as well as full cell relaxation has been performed for a variety of molecular crystals. The obtained results are in general in good agreement with both previously reported calculations and experimental data. Furthermore, the actual implementation allows the computational power of new generation supercomputers to be fully exploited, such that the derivatives of the RI-MP2 energy can be evaluated within minutes for systems containing hundreds of atoms and thousands of basis functions. This opens the possibility to perform structural relaxation or even molecular dynamics at the MP2 level of theory for condensed phase systems with accurate basis sets, and our recent study on the relative stability of the different phases of ice XV (Ref. 128) is a case in point.

The methodology presented here is a general framework that can be extended to the calculation of the energy derivatives evaluated at the double-hybrid density functional129–131 and random-phase approximation132 level, for which $O(N^4)$ scaling implementations have been reported.

ACKNOWLEDGMENTS

J. V. acknowledges financial support by the European Union FP7 in the form of an ERC Starting Grant under Contract No. 277910. This research was partly supported by NCCR MARVEL funded by the Swiss National Science Foundation. We acknowledge that the results of this research have been achieved using the PRACE Research Infrastructure resource Hermit based in Germany at Stuttgart (HLRS). Additional calculations were enabled by the Swiss National Supercomputer Centre (CSCS) under Project ID ch5. The research leading to these results has received funding from the Swiss University Conference through the High Performance and High Productivity Computing (HP2C) Programme and the Platform for Advanced Scientific Computing (PASC) programme.

APPENDIX: PARALLEL ALGORITHM

The parallel algorithm described here is a scalable implementation of the RI-MP2 energy derivatives displaying, with respect to the system size, cubic, quartic, and quintic requirements, respectively, for the memory, communication, and computation.

1. Evaluation of $V^{-1/2}_{PQ}$ and $B^P_{ia}$ with the RI-GPW Method

The parallel algorithm for the evaluation of $V^{-1/2}_{PQ}$ and $B^P_{ia}$ within the RI-GPW approach has been presented in great detail in Ref. 59. Here, only the most important features of the parallel implementation are recalled in order to help the description of the algorithm in Subsections 2–5 of this Appendix.

The parallelization strategy is based on a two level distribution of the workload, obtained by splitting the total $N_p$ processes available into $N_G$ groups, each consisting of $N_w$ processes ($N_p = N_G N_w$). The first level is associated to the work performed for a single given auxiliary basis function $\chi_P$ or vector $|P) = \sum_Q \chi_Q(\vec{r}) V^{-1/2}_{QP}$. The parallelization at this level is obtained within the $N_w$ processes of each group based on a hybrid OpenMP/MPI scheme involving, for example, parallel FFTs, halo-exchanges, numerical integration of the basis functions over the electrostatic potential, and sparse matrix multiplications. The second level corresponds to a distribution of these nearly independent calculations among the different groups. This is achieved by splitting the total number of auxiliary basis functions $N_a$ into $N_G$ ranges $[P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$, each of them labeled with a given $n_P$ coordinate and assigned to the corresponding group. Additionally, each of the $N_w$ processes within a group is given an index $n_w$, so that a process is uniquely identified by its coordinate $(n_P, n_w)$. Finally, the virtual index $a$ is split in $N_w$ ranges $[a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$, while the index $i$ is kept over the full set of occupied orbitals.

The workload distribution described so far allows for a scalable parallel implementation of the integral computation. In fact, the communication intense steps are restricted to within the group, made of a small number of tasks, while each group works independently for the different $\chi_P$ or $|P)$ associated to its range $[P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$.

Focusing on the calculation of $B^P_{ia}$, for each $P \in [P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$, the computational procedure can be summarized as follows:

• Evaluation of the density $\rho^P(\vec{r}) = \sum_Q \chi_Q(\vec{r}) V^{-1/2}_{QP}$ on the real space grid.
• Calculation of the electrostatic potential $v^P_H(\vec{r})$ associated to $\rho^P(\vec{r})$. This is obtained by first transferring $\rho^P(\vec{r})$ from real to reciprocal space via FFT, solving the Poisson equation in Fourier space, and finally back transferring the potential, with an additional FFT, from reciprocal to real space.
• Integration of the potential over the pairs of basis set elements, $B^P_{\mu\nu} = \int \phi_\mu(\vec{r}) \phi_\nu(\vec{r}) v^P_H(\vec{r}) \, d\vec{r}$.
• Transformation of $B^P_{\mu\nu}$ from the AO to the MO basis by two consecutive matrix-matrix multiplications, that is, $B^P_{i\nu} = \sum_\mu C_{\mu i} B^P_{\mu\nu}$ and finally $B^P_{ia} = \sum_\nu B^P_{i\nu} C_{\nu a}$.

The asymptotically dominating part of this procedure is associated to the last index transformation, which has a formal scaling of $O(ovnN_a/N_p)$, while the integration of the potential has a cost that grows only quadratically with the system size. Nevertheless, due to the small prefactor associated to the former, the latter is usually more computationally demanding, even for relatively large systems.59
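The second step of the procedure above, solving the Poisson equation in reciprocal space, can be illustrated with a minimal NumPy sketch. It is a serial illustration under stated assumptions, not the distributed implementation described in the paper: an orthorhombic periodic cell, the convention $\nabla^2 v = -4\pi\rho$, and a discarded $G = 0$ component (equivalent to a neutralizing background). The function name `hartree_potential` and the test density are hypothetical.

```python
import numpy as np

def hartree_potential(rho, cell_lengths):
    """Solve nabla^2 v = -4*pi*rho on a periodic orthorhombic grid via FFT."""
    # Reciprocal lattice components G = 2*pi*m/L along each axis.
    g_axes = [
        2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
        for n, length in zip(rho.shape, cell_lengths)
    ]
    gx, gy, gz = np.meshgrid(*g_axes, indexing="ij")
    g2 = gx ** 2 + gy ** 2 + gz ** 2

    rho_g = np.fft.fftn(rho)                 # real space -> reciprocal space
    v_g = np.zeros_like(rho_g)
    nonzero = g2 > 0.0
    # In reciprocal space the Poisson equation is algebraic: v(G) = 4*pi*rho(G)/G^2.
    # The G = 0 term is dropped here (assumption: neutralizing background).
    v_g[nonzero] = 4.0 * np.pi * rho_g[nonzero] / g2[nonzero]
    return np.fft.ifftn(v_g).real            # reciprocal space -> real space

# Single plane wave density: the potential is the same wave scaled by 4*pi/G^2.
n_grid, box = 16, 10.0
x = np.arange(n_grid) * box / n_grid
rho_test = np.cos(2.0 * np.pi * x / box)[:, None, None] * np.ones((n_grid,) * 3)
v_test = hartree_potential(rho_test, (box, box, box))
```

The cost of this step is dominated by the two FFTs, which is why it contributes with near-linear scaling per auxiliary function while the subsequent index transformations dominate asymptotically.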


At the end of this step, each process stores the elements of $B^P_{ia}$ for all $i$, $P \in [P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$, and $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$.

2. Evaluation of the $O(N^5)$ scaling intermediates

The term $O(N^5)$ scaling intermediates denotes all the quantities whose construction requires a quintic computational effort. Within the RI-MP2 method, these intermediates are $(ia|jb)_{\mathrm{RI}}$, $Y^P_{ia}$, $P^{(2)}_{ab}$, and $P^{(2)}_{ij}$, evaluated, respectively, with Eqs. (4), (16), (25), and (24), for which the formal computational effort grows as $O(o^2 v^2 N_a)$, $O(o^2 v^2 N_a)$, $O(o^2 v^3)$, and $O(o^3 v^2)$. The efficient construction of these intermediates is of prime importance since they are the asymptotically most expensive calculations in the evaluation of the RI-MP2 energy derivatives.

While in a serial algorithm their computation is relatively straightforward, a parallel implementation has to face the problems connected with the distributed storage of the precursors, such as $B^P_{ia}$, as well as the balancing of the workload over processes. Regarding the latter issue, a distribution of independent $ij$ pairs ($i \le j$) is of particular convenience for the evaluation of $(ia|jb)_{\mathrm{RI}}$, $Y^P_{ia}$, and $P^{(2)}_{ab}$, but not for $P^{(2)}_{ij}$, for which the distribution of the $ab$ pairs would be preferred. In order to overcome this complication, while retaining the ease of the $ij$ distribution as well as avoiding additional communication events, the $P^{(2)}_{ij}$ are evaluated via Eq. (24) only for the diagonal and almost degenerate elements ($|\epsilon_i - \epsilon_j| < t_{\mathrm{sing}}$), while Eq. (34) is employed for all the others. This approach is referred to as semi-canonical, and $t_{\mathrm{sing}}$ is a threshold for discriminating which of the $ij$ pairs have to be treated as almost degenerate, expected anyway to be a small fraction of the total.

The pseudocode for the parallel evaluation of $(ia|jb)_{\mathrm{RI}}$, $Y^P_{ia}$, $P^{(2)}_{ab}$, and the diagonal elements $P^{(2)}_{ii}$ is sketched in Figure 6, while the update of $P^{(2)}_{ij}$ for the almost degenerate pairs is shown in Figure 7. The completion of $P^{(2)}_{ij}$ for the remaining elements is performed later, when $L_{\mu j}(1)$ is made available. In the pseudocodes, the expected computational effort, expressed as an order of magnitude, is reported for the most important operations.

The algorithm shown in Figure 6 closely resembles the structure employed for the RI-MP2 energy evaluation described in Ref. 59. As mentioned in Subsection 1 of the Appendix, the $B^P_{ia}$ intermediate is distributed such that each process stores elements for all occupied $i$, $P \in [P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$, and $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$. That is, the virtual index $a$ is distributed over a small number of MPI tasks within the group G, while the auxiliary index $P$ is distributed over the large number $N_G$ of groups. The same kind of distribution has been adopted for $Y^P_{ia}$, while $P^{(2)}_{ab}$ and $P^{(2)}_{ij}$ are created in a replicated form within each group; only the virtual index $a$ of the former is distributed over the elements of the group with the usual $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$.

The parallelization is achieved by distributing independent $ij$ pairs over the $N_G$ groups. Each group, for a given $ij$ pair, performs the following set of operations:

1. $B^P_{ia}$ is redistributed such that the full range of the auxiliary index $P$, for $i$ and $j$, is collected on local buffers from all other groups, while keeping the virtual index distribution, $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$.
2. The $(ia|jb)_{\mathrm{RI}}$ integrals are generated for the actual $ij$ pair in a matrix-multiplication fashion (Eq. (4)).
3. The amplitudes $t^{ab}_{ij}$ are formed from $(ia|jb)_{\mathrm{RI}}$ and $\Delta^{ab}_{ij} = \epsilon_i + \epsilon_j - \epsilon_a - \epsilon_b$.
4. The contributions to $P^{(2)}_{ab}$ and $P^{(2)}_{ii}$ are accumulated into the relative local buffers.
5. The contributions to $Y^P_{ia}$ are accumulated into two intermediates, namely, $\Xi^i_{aP}$ and $\Lambda^j_{aP}$, distributed within the group such that each process stores the full auxiliary index $P$ and $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$.
6. $\Xi^i_{aP}$ and $\Lambda^j_{aP}$ are redistributed over all groups and accumulated into the local $Y^P_{ia}$.

In the above procedure, the first and the last steps involve inter-group communication. They can be considered isomorphic, with the difference that in the former, for the actual $ij$ pair, each process collects the full range of auxiliary functions for $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$, while in the latter each task collects all the $ij$ indices processed by all other groups for its preassigned range of $P \in [P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$ and $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$. All other operations are performed locally within the group, that is, only a small amount of communication takes place, restricted to the members of the group.

The main source of communication of the parallel algorithm is thus related to the inter-group redistribution steps mentioned above. Following the implementation designed for the RI-MP2 energy, three key aspects are considered in order to enhance the efficiency of these operations.

• $B^P_{ia}$ is replicated into $\beta^P_{ia}$ distributed within subgroups (the replication groups R). This restricts the number of processes that have to communicate at each cycle.
• The messages are exchanged employing point-to-point communication. This avoids global synchronization of processes while keeping a low memory usage.
• The $ij$ pairs are communicated in batches, reducing the overall number of messages.

While the first point is more a technical ingredient, the other two are mandatory in order to obtain a scalable implementation, in which the required communication is reduced by increasing the number of processes.

In more detail, following the pseudocode given in Figure 6, as a first stage, according to the available memory, the size $N_r$ of the replication groups R is defined and the elements of $B^P_{ia}$ are replicated into $\beta^P_{ia}$ distributed among the members of R. $N_R$ being the number of replication groups, a ratio $N_R/N_G$ of ∼0.1 has been shown to be a good compromise between the time necessary for the replication and the gain in communication in the subsequent steps. The remaining memory available per process is then used to define the maximum batch size $B_S$, and the total number of $IJ$ batches ($I \le J$) is then distributed statically over the $N_G$ groups. To achieve the best possible load balance, the number of $IJ$ batches is restricted to be a multiple of the number of groups $N_G$, and the remaining single $ij$ pairs are again statically distributed over groups.
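The core of the per-pair work (steps 2 and 3 of the list above) can be sketched serially with NumPy. This is an illustration under stated assumptions rather than the paper's algorithm: it omits all distribution, batching, and the density matrix and $Y^P_{ia}$ updates, takes the amplitudes in the common form $t^{ab}_{ij} = (ia|jb)_{\mathrm{RI}}/\Delta^{ab}_{ij}$, and accumulates the standard closed-shell MP2 energy expression, since Eq. (4) and the related equations are not reproduced in this section. The function name and array layout `B[P, i, a]` are hypothetical.

```python
import numpy as np

def rimp2_pair_loop(B, eps_occ, eps_virt):
    """Closed-shell RI-MP2 energy accumulated over independent ij pairs (i <= j).

    B[P, i, a] holds the three-index intermediate B^P_ia.
    """
    n_occ = len(eps_occ)
    e2 = 0.0
    for i in range(n_occ):
        for j in range(i, n_occ):
            # (ia|jb)_RI = sum_P B^P_ia B^P_jb: one matrix multiplication per pair.
            K = B[:, i, :].T @ B[:, j, :]            # shape (v, v), indexed [a, b]
            # Delta^ab_ij = eps_i + eps_j - eps_a - eps_b (negative for a gapped system).
            delta = (eps_occ[i] + eps_occ[j]
                     - eps_virt[:, None] - eps_virt[None, :])
            t = K / delta                             # assumed amplitude definition
            pair_energy = np.sum(t * (2.0 * K - K.T))
            # Off-diagonal pairs count twice because only i <= j is visited.
            e2 += pair_energy if i == j else 2.0 * pair_energy
    return e2

# Random test data (hypothetical sizes) compared against a direct four-index sum.
rng = np.random.default_rng(0)
n_aux, n_o, n_v = 12, 3, 4
B_test = rng.standard_normal((n_aux, n_o, n_v))
eo = -1.0 - rng.random(n_o)
ev = 1.0 + rng.random(n_v)
e2_pairs = rimp2_pair_loop(B_test, eo, ev)
K4 = np.einsum("Pia,Pjb->iajb", B_test, B_test)
D4 = (eo[:, None, None, None] + eo[None, None, :, None]
      - ev[None, :, None, None] - ev[None, None, None, :])
e2_ref = np.sum(K4 * (2.0 * K4 - K4.transpose(0, 3, 2, 1)) / D4)
```

Because each pair only needs the slices of $B^P_{ia}$ for its own $i$ and $j$, the pairs are independent, which is what makes the static distribution over groups described above possible.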


FIG. 6. Pseudocode of the parallel algorithm for computing $E^{(2)}$, $P^{(2)}_{ab}$, $Y^P_{ia}$, and the diagonal elements of $P^{(2)}_{ij}$. All statements labeled with (∗) involve in-group communication; these operations are shown explicitly in the algorithm only for the calculation of $I_{ab}$, while in all other cases the procedure is performed likewise.

At this point, each group loops over its preassigned $IJ$ batches and, as a first task, collects from all other members of the replication group R the elements $A^i_{aP} = \beta^P_{ia}$ and $E^j_{aP} = \beta^P_{ja}$. For all $ij$ pairs in the actual $IJ$ batch, the operations between 2 and 5 previously described are performed. This allows to calculate the contribution coming from the actual $ij$ to $P^{(2)}_{ab}$, $P^{(2)}_{ii}$, $P^{(2)}_{jj}$ and $\Xi^i_{aP}$, $\Lambda^j_{aP}$. Before moving to the next batch, the second inter-group communication step takes place, redistributing $\Xi^i_{aP}$ and $\Lambda^j_{aP}$ within the members of the replication group. This corresponds in the pseudocode to lines 10

FIG. 7. Pseudocode of the parallel algorithm for computing the off diagonal elements of $P^{(2)}_{ij}$ for almost degenerate $ij$ pairs ($\epsilon_i \approx \epsilon_j$). The meaning of the symbols is the same as in Figure 6.


and 11, for which the intermediate $X^P_{ia}$ is introduced. This quantity is to $Y^P_{ia}$ as $\beta^P_{ja}$ is to $B^P_{ja}$, i.e., $X^P_{ia}$ represents the replicated version of $Y^P_{ia}$ collecting the contributions coming from all the $ij$ pairs processed by the groups G in the replication group R. For this reason, at the end of the loop over $IJ$ batches, an additional step of decomposition of $X^P_{ia}$ is required in order to generate $Y^P_{ia}$ in its final form, that is, distributed such that each process stores the elements for all occupied $i$, $P \in [P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$, and $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$. The two final steps are the global summation of the elements of $P^{(2)}_{ab}$ and of the diagonal of $P^{(2)}_{ij}$, with the difference that the latter is summed over all processes while the former is summed only across those that share the same virtual index range $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$, i.e., those labeled with the same coordinate $n_w$.

At this stage, what remains to be done is the update of the elements of $P^{(2)}_{ij}$ for the potentially singular $ij$ pairs. This is accomplished first by checking the total number $d$ of almost degenerate $ij$ pairs, i.e., the pairs for which $|\epsilon_i - \epsilon_j| < t_{\mathrm{sing}}$. For each of these, a loop over all occupied $k$ is performed, the $(ia|kb)_{\mathrm{RI}}$ and $(ja|kb)_{\mathrm{RI}}$ are generated, and the $P^{(2)}_{ij}$ element is updated according to Eq. (24).

As shown by the pseudocode in Figure 7, the parallelization of these steps is obtained in a very similar way as done for the computation of $P^{(2)}_{ab}$, with the main difference that in this case the $ijk$ triplets are distributed over the $N_G$ groups. The number of $ijk$ triplets ($d \times o$) is usually small compared to the total number of $ij$ pairs ($o^2$); for this reason, the communication scheme employing batches is not exploited, since it may lead to a poor balance of the workload. As in the previous case, the procedure is finalized with a global summation over all processes of the almost degenerate elements of $P^{(2)}_{ij}$. It has to be noted that, in the case that no potentially singular $ij$ pairs are detected (as in most of the cases), this part of the algorithm is completely skipped.

In summary, the parallel algorithm described in this section can be split into two relevant parts: communication and computation of the $O(N^5)$ scaling intermediates. The first part has a cost that can be estimated to be $O(o^2 v N_a/(N_p B_S))$, which is derived by considering that the total number of messages exchanged by each process is $O(o^2/(B_S^2 N_G))$ while the time required for each communication event (considered to be proportional to the message size) is $O(v N_a B_S/N_w)$. This implies that communication is eventually an $O(N^4)$ operation whose effort scales with the number of processes $N_p = N_w N_G$. It has to be noted that, compared to the RI-MP2 energy algorithm, this operation is expected to be roughly two times more expensive since it involves not only the redistribution of $\beta^P_{ia}$ (line 1 in the pseudocode of Figure 6), but also the similar operation for $\Xi^i_{aP}$ and $\Lambda^j_{aP}$ (line 10). Concerning the computation of the $O(N^5)$ intermediates, these are reported in the pseudocode of Figure 6: at line 2, the generation of the $(ia|jb)_{\mathrm{RI}}$ integrals; at lines 4 and 7, the update of $P^{(2)}_{ab}$; and at lines 6 and 9, the update of $\Xi^i_{aP}$ and $\Lambda^j_{aP}$. Again, a comparison with the RI-MP2 energy algorithm, for which only the generation of the $(ia|jb)_{\mathrm{RI}}$ is required, leads to the conclusion that, for the actual implementation, the $O(N^5)$ part is expected to be roughly 3–4 times more expensive. As an additional remark, all these tasks are accomplished as matrix multiplications, and thus the performance of highly optimized routines, such as DGEMM, can be exploited, and the tasks can be accelerated by employing a hybrid implementation that utilizes GPU's.

3. Non-separable contribution to $E^{(2)x}_{\mathrm{RI}}$, assembly of $L_{\mu j}(1)$ and $L_{b\nu}(2)$

The calculation of the non-separable contribution to $E^{(2)x}_{\mathrm{RI}}$, $L_{\mu j}(1)$, and $L_{b\nu}(2)$ is performed within the same procedure since many intermediates, deriving, for example, from the back-transformation of $\Gamma^P_{ia}$, are in common between these evaluations and can be computed within a loop over the auxiliary basis function index $P$. In general, the non-separable contributions to $E^{(2)x}_{\mathrm{RI}}$ are calculated by contraction of 3- and 2-center RI integral derivatives with the non-separable correction to the 2-particle density matrix, computed in the framework of the RI-GPW approach (Eqs. (18) and (19)), while $L_{\mu j}(1)$ and $L_{b\nu}(2)$ are evaluated with Eqs. (32) and (33). The pseudocode of this procedure is reported in Figure 8.

FIG. 8. Pseudocode of the parallel algorithm for computing the mixed AO-MO Lagrangian $L_{\mu i}(1)$, $L_{a\nu}(2)$, and the non-separable contribution to $E^{(2)x}_{\mathrm{RI}}$ with the RI-GPW approach.

Prior to the calculation of these quantities it is thus necessary to assemble $\Gamma^P_{ia}$ and $\Gamma^{PQ}$ (3- and 2-index non-separable correction to the 2-PDM). This is achieved from the $B^P_{ia}$, $Y^P_{ia}$, and $V^{-1/2}_{PQ}$ intermediates by a sequence of parallel matrix multiplications (lines 1, 2, and 3 in Figure 8). Since both $B^P_{ia}$ and $Y^P_{ia}$ are distributed such that each process stores the elements for all $i$, $P \in [P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$, and $a \in [a^{n_w}_{\mathrm{start}}, a^{n_w}_{\mathrm{end}}]$, a redistribution step is required in order to reorganize the data into the form of a parallel distributed matrix suitable for performing the multiplications efficiently. A second redistribution step is then performed for the computed $\Gamma^P_{ia}$ and $\Gamma^{QP}$ such that each group G stores, for $P \in [P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$, all $ia$ of the former and all $Q$ of the latter. The elements of $\Gamma^P_{ia}$ reshuffled in this way are further organized within the group in the form of a parallel distributed matrix.

At this point, the actual computation of the non-separable part of $E^{(2)x}_{\mathrm{RI}}$ as well as $L_{\mu j}(1)$ and $L_{b\nu}(2)$ is performed by accumulating the contributions to these quantities associated to each auxiliary element $P$. As done for the calculation of $B^P_{ia}$, the parallelization is achieved by letting each group G work on its preassigned range of $P \in [P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$. Again, all operations associated to a given $P$ are performed in parallel within the members of the group.

The series of required operations for each $P$ is shown in detail in the pseudocode reported in Figure 8. As a first step, according to the RI-GPW scheme, the electrostatic potential $v^P_H(\vec{r})$ associated to the single auxiliary basis function $\chi_P(\vec{r})$ is evaluated and made available on the real space grid. The potential is then integrated over the auxiliary basis function derivatives $\chi^x_Q(\vec{r})$ for all $Q$ and subsequently contracted with the relative elements of $\Gamma^{QP}$, giving the non-separable contribution to $E^{(2)x}_{\mathrm{RI}}$ from the 2-center ERI's.

At this point, two steps of back-transformation of $\Gamma^P_{ia}$ are performed, obtaining both $\Gamma^P_{i\nu}$ and $\Gamma^P_{\mu\nu}$. The previously calculated potential $v^P_H(\vec{r})$ is now integrated over the pairs of primary basis functions, $\phi_\mu(\vec{r})\phi_\nu(\vec{r}) \to I^P_{\mu\nu}$, and the associated derivatives, $[\phi_\mu(\vec{r})\phi_\nu(\vec{r})]^x \to D^P_{\mu\nu}$. The integral derivatives $D^P_{\mu\nu}$ are contracted with the fully back-transformed $\Gamma^P_{\mu\nu}$, giving the first non-separable contribution to $E^{(2)x}_{\mathrm{RI}}$ from the 3-center ERI's, while the plain integrals $I^P_{\mu\nu}$ are multiplied with $\Gamma^P_{i\nu}$ and accumulated into a local buffer $L^G_{\mu j}(1)$ of $L_{\mu j}(1)$. This update can exploit the sparsity of $I^P_{\mu\nu}$, making this step $O(on)$ for each $P$. The contribution to $L_{b\nu}(2)$ is calculated by first transforming the first index of $I^P_{\mu\nu}$ to the occupied MO, giving $I^P_{i\nu}$, and then performing the update with $\Gamma^P_{ia}$, again obtained as a matrix multiplication and accumulated on the relative local buffer $L^G_{b\nu}(2)$. In this case, the matrices are not sparse, and thus the associated cost is $O(ovn)$ for each $P$. Finally, the second non-separable contribution to $E^{(2)x}_{\mathrm{RI}}$ from the 3-center ERI's is computed by integrating the potential $v^{\Gamma^P}_H(\vec{r})$, associated to the electrostatic density $\sum_{\mu\nu} \Gamma^P_{\mu\nu} \phi_\mu(\vec{r})\phi_\nu(\vec{r})$, with the auxiliary basis function derivative $\chi^x_P(\vec{r})$ only for the actual $P$.

At the end of the loop over the auxiliary index $P$, each group stores the two buffers $L^G_{\mu j}(1)$ and $L^G_{b\nu}(2)$ containing the contribution to $L_{\mu j}(1)$ and $L_{b\nu}(2)$ associated to the $P \in [P^{n_P}_{\mathrm{start}}, P^{n_P}_{\mathrm{end}}]$.

Similarly to the case of the computation of $B^P_{ia}$, the asymptotically dominating steps of this procedure scale as $O(N^4)$. As shown in the pseudocode of Figure 8, these are associated to the calculation of $\Gamma^P_{ia}$ and $\Gamma^{PQ}$, the update of $L_{b\nu}(2)$, and the index transformations AO ↔ MO. These operations display a relatively small prefactor since they are performed as matrix multiplications. On the other hand, the calculation of the integrals and their derivatives is linear scaling for each $P$, since only pairs of overlapping Gaussians need to be considered and only a finite number of grid points within a spherical region around the center of the primitive Gaussian functions is required. This makes the overall effort in the integral computation $O(N^2)$. Nevertheless, this operation displays a quite large prefactor and accounts for a large fraction of the total time (30%-40%) even for relatively large systems.
In order to obtain Lµ j(1) and Lbν(2) in their final form, i.e., defined over all process, a redistribution step This part of the algorithm is specific to the computation is required, for which each process receives and accumulates of the RI-MP2 energy derivatives, meaning that it constitutes from all others the data associated with its new local portion an overhead that is not necessary in the case for which only of the two matrices. the energy is required. Even if the structure of the described

Reuse of AIP Publishing content is subject to the terms: https://publishing.aip.org/authors/rights-and-permissions. Downloaded to IP: 130.60.47.22 On: Thu, 12 May 2016 10:40:26 102803-20 Del Ben, Hutter, and VandeVondele J. Chem. Phys. 143, 102803 (2015)

P (2) procedure is similar to that employed for the evaluation of Bia, virt-virt and occ-occ blocks of Ppq with the integrals generated in this case not only the integrals are computed but also their by coupled-perturbed Hartree-Fock (CPHF) theory (the Apqr s derivatives. This implies a cost for integral computation that is matrix). This contraction is often referred as CPHF-like update roughly double than that associated to the relative operation in and has a computational cost that is similar to the update of the P the calculation of Bia. Fock matrix in the standard SCF procedure. Equations (36) and In the case the stress tensor has to be computed, additional (37) display the operations that have to be performed for each operations have to be considered. These operations are not CPHF-like update, note that in the actual case, the summation reported in Figure 8, but they can be derived by inspection ranges are over virt-virt and occ-occ orbitals, respectively, for (2) (2) of Eqs. (50) and (51). Note that in this case also elements the contraction with Pab and Pi j . of the type (RI β − rβ)∇Iαφµ(⃗r)φν(⃗r) have to be integrated, Once Lb j is assembled the Z-vector equations are solved resulting in an additional overhead roughly equivalent to the employing the Pople method.63 From a computational stand- computation of the integral derivatives. point, this is equivalent to solving a large system of linear equations with an iterative technique, for which, at each iter- ation, only the matrix-vector product (CPHF-like update) has (2)x to be performed. The parallelization of the CPHF-like update 4. Final evaluation of ERI closely follows the scheme employed in CP2K for evaluation The evaluation of the RI-MP2 energy derivatives is com- of the Fock matrix elements84,90 and will not be described (2) pleted by a series of operations that allow to generate the Ppq further here. (2) (2) (2) and Wpq in their final form. 
Once Ppq and Wpq are made Note that, for dense systems with large basis, the compu- available, their contraction with the skeleton derivatives of the tation of the 4-index ERI’s over AO, necessary to calculate Fock and overlap matrix elements is performed at the same the exchange part of the CPHF-like update, is by far the most time with the evaluation of the derivatives of the HF energy demanding task of this procedure. Since these integrals are leading to the final result. the same as those employed in the SCF procedure, if enough The sequence of these operations is summarized in the memory is available, they can be stored in core, avoiding their (2) pseudocode of Figure 9. The virtual-virtual block of Ppq is recomputation. This can greatly speed up the solution of the already available from the procedure described in Subsection Z-vector equations. 2 of the Appendix, while the occupied-occupied part has to be The RI-MP2 correction to the energy-weighted density (2) (2) completed for the non-singular elements according to Eq. (34). matrix Wpq is finally generated from Ppq, Lµ j(1), and Lbν(2) The occupied-virtual block is instead obtained as the solution according to Eqs. (38)–(47), for which an additional CPHF- of the Z-vector Eq. (27). like update is required for the occupied-occupied block. With In order to do so, first the RI-MP2 specific Lagrangian these matrices defined, previous a step of back-transformation Lb j has to be assembled. As shown in Eq. (29), four terms from the AO to the MO basis, the derivatives of the total contribute to Lb j. The first two are calculated from Lµ j(1) and energy (RI-MP2 + HF) can be finalized by contraction with the (x) (x) Lbν(2) just by transforming the indices from the AO to the MO skeleton derivatives of the Fock Fµν and overlap Sµν matrix basis. The remaining two are computed by contraction of the elements.
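The remark that the Z-vector equations only require a matrix-vector product per iteration can be illustrated with a minimal sketch. It is not the Pople method cited above: a plain conjugate-gradient iteration is swapped in as a stand-in, and the toy operator `apply_A` (orbital-energy gaps plus a small symmetric coupling), the orbital energies, and the random right-hand side are all hypothetical placeholders for the CPHF-like update and for $L_{bj}$.

```python
import numpy as np

def solve_z_vector(apply_A, rhs, tol=1e-10, max_iter=200):
    """Conjugate-gradient solve of A z = rhs using only products A @ v.

    Stand-in for the iterative solver; assumes A is symmetric positive definite.
    """
    z = np.zeros_like(rhs)
    r = rhs - apply_A(z)          # initial residual
    p = r.copy()                  # initial search direction
    rr = r @ r
    for it in range(max_iter):
        if np.sqrt(rr) < tol:
            return z, it
        Ap = apply_A(p)           # the only place the operator is needed
        alpha = rr / (p @ Ap)
        z += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return z, max_iter

# Hypothetical toy problem: 3 occupied and 5 virtual orbitals.
rng = np.random.default_rng(0)
eps_occ = np.array([-1.0, -0.8, -0.6])
eps_virt = np.array([0.3, 0.5, 0.7, 0.9, 1.1])
gaps = (eps_virt[:, None] - eps_occ[None, :]).ravel()   # diagonal part
dim = gaps.size
K = rng.standard_normal((dim, dim))
K = 0.01 * (K + K.T)                                    # small symmetric coupling

def apply_A(vec):
    """Toy replacement for the CPHF-like update (matrix-vector product)."""
    return gaps * vec + K @ vec

L = rng.standard_normal(dim)        # placeholder for the Lagrangian L_bj
z, n_iter = solve_z_vector(apply_A, -L)
residual = np.linalg.norm(apply_A(z) + L)
```

The solver never forms the matrix explicitly; it only calls `apply_A`, which mirrors the statement that each iteration costs one CPHF-like update.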

FIG. 9. Pseudocode of the parallel algorithm for computing $P^{(2)}_{pq}$, $W^{(2)}_{pq}$, and the final contributions to $E^{(2)x}_{\mathrm{RI}}$.
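The in-core storage of the AO integrals used by the CPHF-like update can be sketched as a small cache with a memory budget: batches that fit are kept, the rest are regenerated at every iteration. This is only an illustration under stated assumptions; the class `EriBatchCache`, the function `exchange_like_update`, the batching over the first index, and the random tensor standing in for the integrals are all hypothetical and do not reflect the CP2K implementation.

```python
import numpy as np

class EriBatchCache:
    """Keep as many integral batches in memory as a byte budget allows."""

    def __init__(self, compute_batch, budget_bytes):
        self.compute_batch = compute_batch   # hypothetical integral generator
        self.budget_bytes = budget_bytes
        self.stored = {}
        self.used_bytes = 0
        self.n_computed = 0                  # counts batch evaluations

    def get(self, k):
        if k in self.stored:
            return self.stored[k]
        batch = self.compute_batch(k)
        self.n_computed += 1
        if self.used_bytes + batch.nbytes <= self.budget_bytes:
            self.stored[k] = batch           # fits: keep it for later iterations
            self.used_bytes += batch.nbytes
        return batch                         # otherwise it is recomputed next time

def exchange_like_update(cache, density, blocks):
    """Toy exchange-type contraction: out[m, n] = sum_{l,s} eri[m, l, n, s] * density[l, s]."""
    n = density.shape[0]
    out = np.empty((n, n))
    for k, (start, stop) in enumerate(blocks):
        batch = cache.get(k)                 # shape (stop - start, n, n, n)
        out[start:stop] = np.einsum("mlns,ls->mn", batch, density)
    return out

rng = np.random.default_rng(1)
n = 8
eri = rng.standard_normal((n, n, n, n))      # random stand-in for the integrals
blocks = [(0, 2), (2, 4), (4, 6), (6, 8)]
batch_bytes = eri[0:2].nbytes

def compute_batch(k):
    start, stop = blocks[k]
    return eri[start:stop].copy()

density = rng.standard_normal((n, n))
reference = np.einsum("mlns,ls->mn", eri, density)

counts, results = {}, {}
for label, budget in [("none", 0), ("half", 2 * batch_bytes), ("full", 4 * batch_bytes)]:
    cache = EriBatchCache(compute_batch, budget)
    for _ in range(3):                       # three iterations of the solver
        results[label] = exchange_like_update(cache, density, blocks)
    counts[label] = cache.n_computed
```

In this toy setting the result is independent of the budget, while the number of batch evaluations over three iterations drops from 12 (nothing stored) to 8 (half stored) to 4 (everything stored).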


5. Memory usage

The RI-MP2 method, both for energy and derivatives, displays a memory requirement that grows cubically with the system size. This is related to the storage of the $B^P_{ia}$, $Y^P_{ia}$, and $\Gamma^P_{ia}$ quantities, while all the other intermediates require at most $O(N^2)$ memory. An important feature of a parallel algorithm is that not only the computation but also the required storage space per task is reduced by increasing the number of processes.

Table IV reports the amount of memory that needs to be allocated per MPI task for the storage of the most relevant intermediates. All the cubically demanding quantities are distributed over the total number of processes ($N_p$) or, for $\beta^P_{ia}$ and $X^P_{ia}$, within the large number of members of each replication group. The algorithm is designed such that the computation, at the process level, involves only the allocation of quadratic intermediates. Nevertheless, these $O(N^2)$ quantities can still require a relatively large amount of memory; for example, the generation of the $(ia|jb)_{\mathrm{RI}}$ integrals implies the product of $v \times N_a$ matrices that, even for medium size systems, would need hundreds of MB. That is the reason why the group has been introduced: to share these objects over more processes. In fact, as shown in Table IV, all the quadratically demanding quantities (except for the small $P^{(2)}_{ij}$ matrix) require an amount of memory that is reduced by increasing the group size $N_w$. Moreover, since the actual implementation is based on a hybrid OpenMP/MPI scheme, a similar gain can be achieved by increasing the number of threads employed per MPI task. This leads to more memory per MPI task without significant loss of computational efficiency. Which of the two strategies is to be preferred is not obvious, since it depends on many aspects such as the machine architecture and the implementation of the parallel libraries. As a rule of thumb, using more processes per group in general leads to better workload distribution within the group, while more threads per MPI task give better memory management.

The computation of the exchange contribution for each CPHF-like update requires the contraction with 4-index ERI's over atomic orbitals $(\mu\nu|\lambda\sigma)$. These are usually calculated only at the first cycle of the procedure and then reused for the subsequent steps. For dense systems employing large basis sets, that is, situations in which the integral screening is not very effective, the storage of these integrals can exceed the amount of memory available per process. In these cases only the largest possible number of ERI's is stored, while the remaining part is computed on the fly at each iteration.

TABLE IV. Memory usage in the different parts of the parallel algorithm expressed as "order of" the calculation parameters. $n$ and $N_a$ denote the number of primary and auxiliary basis functions, $o$ and $v$ the number of occupied and virtual orbitals, $S$ the grid size, $N_G$ and $N_w$ the number of groups and the group size, $N_R$ and $N_r$ the number and size of the replication groups ($N_G = N_R N_r$), $B_S$ the batch size for $ij$ pairs, and $N_p$ the number of processes. $N_G$, $N_w$, and $N_p$ are related by $N_p = N_G N_w$. The notation employed for the entries refers to the different algorithms.

| Quantity | Memory |
|---|---|
| Evaluation of $V^{-1/2}_{PQ}$ and $B^P_{ia}$ (Subsection 1 of the Appendix) | |
| $\rho(\vec r)$, $v(\vec r)$, $\rho(\vec G)$, $v(\vec G)$ | $S/N_w$ |
| $V^{-1/2}_{QP}$ | $N_a^2/N_p$ |
| $B^P_{ia}$ | $o v N_a/N_p$ |
| $O(N^5)$ scaling intermediates (Subsection 2 of the Appendix) | |
| $\beta^P_{ia}$, $X^P_{ia}$ | $o v N_a/(N_r N_w)$ |
| $Y^P_{ia}$ | $o v N_a/N_p$ |
| $A_{aP}$, $E_{aP}$, $\Xi^{i}_{aP}$, $\Lambda^{j}_{aP}$ | $v N_a B_S/N_w$ |
| $I_{ab}$, $t_{ab}$, $P^{(2)}_{ab}$ | $v^2/N_w$ |
| $P^{(2)}_{ij}$ | $o^2$ |
| Non-separable $E^{(2)x}_{\mathrm{RI}}$, $L_{\mu j}(1)$ and $L_{b\nu}(2)$ (Subsection 3 of the Appendix) | |
| $\rho(\vec r)$, $v(\vec r)$, $\rho(\vec G)$, $v(\vec G)$ | $S/N_w$ |
| $\Gamma^{PQ}$ | $N_a^2/N_p$ |
| $\Gamma^P_{ia}$ | $o v N_a/N_p$ |
| $\Gamma^P_{\mu\nu}$, $I^P_{\mu\nu}$ | $n/N_w$ |
| $\Gamma^P_{i\nu}$, $I^P_{i\nu}$, $L^G_{\mu j}(1)$ | $on/N_w$ |
| $L^G_{b\nu}(2)$ | $vn/N_w$ |
| $L_{\mu j}(1)$ | $on/N_p$ |
| $L_{b\nu}(2)$ | $vn/N_p$ |
| Final evaluation of $E^{(2)x}_{\mathrm{RI}}$ (Subsection 4 of the Appendix) | |
| $P^{(2)}_{pq}$, $W^{(2)}_{pq}$ | $n^2/N_p$ |
| $P^{(2)}_{ia}$, $L_{jb}$ | $ov/N_p$ |
| CPHF-like update | $N_{\mathrm{AO\text{-}ERI}}/N_p$ |

1. C. Møller and M. S. Plesset, Phys. Rev. 46, 618 (1934).
2. A. Szabo and N. S. Ostlund, Modern Quantum Chemistry (McGraw Hill, New York, 1982).
3. S. Hirata, X. He, M. R. Hermes, and S. Y. Willow, J. Phys. Chem. A 118, 655 (2014).
4. S. Grimme, J. Chem. Phys. 124, 034108 (2006).
5. L. Goerigk and S. Grimme, J. Chem. Theory Comput. 7, 291 (2011).
6. A. Halkier, T. Helgaker, P. Jørgensen, W. Klopper, H. Koch, J. Olsen, and A. K. Wilson, Chem. Phys. Lett. 286, 243 (1998).
7. J. J. Shepherd, A. Grüneis, G. H. Booth, G. Kresse, and A. Alavi, Phys. Rev. B 86, 035111 (2012).
8. D. Cremer, Wiley Interdiscip. Rev.: Comput. Mol. Sci. 1, 509 (2011).
9. J. Almlöf, Chem. Phys. Lett. 181, 319 (1991).
10. M. Häser and J. Almlöf, J. Chem. Phys. 96, 489 (1992).
11. M. Häser, Theor. Chim. Acta 87, 147 (1993).
12. P. Y. Ayala and G. E. Scuseria, J. Chem. Phys. 110, 3660 (1999).
13. D. S. Lambrecht, B. Doser, and C. Ochsenfeld, J. Chem. Phys. 123, 184102 (2005).
14. B. Doser, D. S. Lambrecht, J. Kussmann, and C. Ochsenfeld, J. Chem. Phys. 130, 064107 (2009).
15. P. Y. Ayala, K. N. Kudin, and G. E. Scuseria, J. Chem. Phys. 115, 9698 (2001).
16. Y. Jung, Y. Shao, and M. Head-Gordon, J. Comput. Chem. 28, 1953 (2007).
17. S. Saebø and P. Pulay, Annu. Rev. Phys. Chem. 44, 213 (1993).
18. P. Pulay and S. Saebø, Theor. Chim. Acta 69, 357 (1986).
19. G. Rauhut, P. Pulay, and H.-J. Werner, J. Comput. Chem. 19, 1241 (1998).
20. M. Schütz, G. Hetzer, and H.-J. Werner, J. Chem. Phys. 111, 5691 (1999).
21. G. Hetzer, M. Schütz, H. Stoll, and H.-J. Werner, J. Chem. Phys. 113, 9443 (2000).
22. S. Saebø and P. Pulay, J. Chem. Phys. 115, 3975 (2001).
23. C. Pisani, M. Busso, G. Capecchi, S. Casassa, R. Dovesi, L. Maschio, C. Zicovich-Wilson, and M. Schütz, J. Chem. Phys. 122, 094113 (2005).
24. C. Pisani, L. Maschio, S. Casassa, M. Halo, M. Schütz, and D. Usvyat, J. Comput. Chem. 29, 2113 (2008).
25. P. Maslen, Chem. Phys. Lett. 283, 102 (1998).
26. P. E. Maslen and M. Head-Gordon, J. Chem. Phys. 109, 7093 (1998).
27. S. Y. Willow, K. S. Kim, and S. Hirata, J. Chem. Phys. 137, 204122 (2012).
28. S. Y. Willow, M. R. Hermes, K. S. Kim, and S. Hirata, J. Chem. Theory Comput. 9, 4396 (2013).
29. D. Neuhauser, E. Rabani, and R. Baer, J. Chem. Theory Comput. 9, 24 (2013).
30. Q. Ge, Y. Gao, R. Baer, E. Rabani, and D. Neuhauser, J. Phys. Chem. Lett. 5, 185 (2014).
31. W. Klopper, F. R. Manby, S. Ten-No, and E. F. Valeev, Int. Rev. Phys. Chem. 25, 427 (2006).
32. C. Hättig, W. Klopper, A. Köhn, and D. P. Tew, Chem. Rev. 112, 4 (2012).
33. A. Grüneis, J. J. Shepherd, A. Alavi, D. P. Tew, and G. H. Booth, J. Chem. Phys. 139, 084112 (2013).
34. K. Eichkorn, O. Treutler, H. Öhm, M. Häser, and R. Ahlrichs, Chem. Phys. Lett. 240, 283 (1995).
35. M. Feyereisen, G. Fitzgerald, and A. Komornicki, Chem. Phys. Lett. 208, 359 (1993).


36. H.-J. Werner, F. R. Manby, and P. J. Knowles, J. Chem. Phys. 118, 8149 (2003).
37. A. F. Izmaylov and G. E. Scuseria, Phys. Chem. Chem. Phys. 10, 3421 (2008).
38. L. Maschio, D. Usvyat, F. R. Manby, S. Casassa, C. Pisani, and M. Schütz, Phys. Rev. B 76, 075101 (2007).
39. D. Usvyat, L. Maschio, F. R. Manby, S. Casassa, M. Schütz, and C. Pisani, Phys. Rev. B 76, 075102 (2007).
40. M. Katouda and S. Nagase, J. Chem. Phys. 133, 184103 (2010).
41. Y. Jung, R. C. Lochan, A. D. Dutoi, and M. Head-Gordon, J. Chem. Phys. 121, 9793 (2004).
42. A. M. Burow, M. Sierka, and F. Mohamed, J. Chem. Phys. 131, 214101 (2009).
43. F. Weigend, M. Häser, H. Patzelt, and R. Ahlrichs, Chem. Phys. Lett. 294, 143 (1998).
44. D. E. Bernholdt and R. J. Harrison, J. Chem. Phys. 109, 1593 (1998).
45. F. Weigend, A. Köhn, and C. Hättig, J. Chem. Phys. 116, 3175 (2002).
46. A. C. Limaye and S. R. Gadre, J. Chem. Phys. 100, 1303 (1994).
47. A. M. Marquez and M. Dupuis, J. Comput. Chem. 16, 395 (1995).
48. I. M. B. Nielsen and E. T. Seidl, J. Comput. Chem. 16, 1301 (1995).
49. J. Baker and P. Pulay, J. Comput. Chem. 23, 1150 (2002).
50. K. Ishimura, P. Pulay, and S. Nagase, J. Comput. Chem. 27, 407 (2006).
51. D. E. Bernholdt and R. J. Harrison, Chem. Phys. Lett. 250, 477 (1996).
52. M. Katouda and S. Nagase, Int. J. Quantum Chem. 109, 2121 (2009).
53. B. Doser, D. S. Lambrecht, and C. Ochsenfeld, Phys. Chem. Chem. Phys. 10, 3335 (2008).
54. E. F. Valeev and C. L. Janssen, J. Chem. Phys. 121, 1214 (2004).
55. Y. Nakao and K. Hirao, J. Chem. Phys. 120, 6375 (2004).
56. I. M. B. Nielsen and C. L. Janssen, J. Chem. Theory Comput. 3, 71 (2007).
57. L. Maschio, J. Chem. Theory Comput. 7, 2818 (2011).
58. M. Del Ben, J. Hutter, and J. VandeVondele, J. Chem. Theory Comput. 8, 4177 (2012).
59. M. Del Ben, J. Hutter, and J. VandeVondele, J. Chem. Theory Comput. 9, 2654 (2013).
60. M. Del Ben, O. Schütt, T. Wentz, P. Messmer, J. Hutter, and J. VandeVondele, Comput. Phys. Commun. 187, 120 (2015).
61. M. Del Ben, M. Schönherr, J. Hutter, and J. VandeVondele, J. Phys. Chem. Lett. 4, 3753 (2013).
62. M. Del Ben, M. Schönherr, J. Hutter, and J. VandeVondele, J. Phys. Chem. Lett. 5, 3066 (2014).
63. J. A. Pople, R. Krishnan, H. B. Schlegel, and J. S. Binkley, Int. J. Quantum Chem. 16(S13), 225 (1979).
64. E. Kraka, J. Gauss, and D. Cremer, J. Mol. Struct.: THEOCHEM 234, 95 (1991).
65. T. Helgaker, P. Jørgensen, and N. Handy, Theor. Chim. Acta 76, 227 (1989).
66. C. M. Aikens, S. P. Webb, R. L. Bell, G. D. Fletcher, M. W. Schmidt, and M. S. Gordon, Theor. Chem. Acc. 110, 233 (2003).
67. F. Weigend and M. Häser, Theor. Chem. Acc. 97, 331 (1997).
68. M. J. Frisch, M. Head-Gordon, and J. A. Pople, Chem. Phys. Lett. 166, 275 (1990).
69. M. Head-Gordon, Mol. Phys. 96, 673 (1999).
70. R. A. Distasio, R. P. Steele, Y. M. Rhee, Y. Shao, and M. Head-Gordon, J. Comput. Chem. 28, 839 (2007).
71. I. M. Nielsen, Chem. Phys. Lett. 255, 210 (1996).
72. G. D. Fletcher, A. P. Rendell, and P. Sherwood, Mol. Phys. 91, 431 (1997).
73. K. Ishimura, P. Pulay, and S. Nagase, J. Comput. Chem. 28, 2034 (2007).
74. C. Hättig, A. Hellweg, and A. Köhn, Phys. Chem. Chem. Phys. 8, 1159 (2006).
75. G. Lippert, J. Hutter, and M. Parrinello, Theor. Chem. Acc. 103, 124 (1999).
76. M. Krack and M. Parrinello, Phys. Chem. Chem. Phys. 2, 2105 (2000).
77. S. Hirata and S. Iwata, J. Chem. Phys. 109, 4147 (1998).
78. The CP2K Developers Group, CP2K is freely available from: http://www. .org/, 2014.
79. M. Head-Gordon, J. A. Pople, and M. J. Frisch, Chem. Phys. Lett. 153, 503 (1988).
80. J. L. Whitten, J. Chem. Phys. 58, 4496 (1973).
81. B. I. Dunlap, J. W. D. Connolly, and J. R. Sabin, J. Chem. Phys. 71, 3396 (1979).
82. O. Vahtras, J. Almlöf, and M. Feyereisen, Chem. Phys. Lett. 213, 514 (1993).
83. G. Lippert, J. Hutter, and M. Parrinello, Mol. Phys. 92, 477 (1997).
84. J. VandeVondele, M. Krack, F. Mohamed, M. Parrinello, T. Chassaing, and J. Hutter, Comput. Phys. Commun. 167, 103 (2005).
85. G. J. Martyna and M. E. Tuckerman, J. Chem. Phys. 110, 2810 (1999).
86. P. E. Blöchl, J. Chem. Phys. 103, 7422 (1995).
87. L. Genovese, T. Deutsch, A. Neelov, S. Goedecker, and G. Beylkin, J. Chem. Phys. 125, 074105 (2006).
88. S. Goedecker, M. Teter, and J. Hutter, Phys. Rev. B 54, 1703 (1996).
89. M. Guidon, F. Schiffmann, J. Hutter, and J. VandeVondele, J. Chem. Phys. 128, 214104 (2008).
90. M. Guidon, J. Hutter, and J. VandeVondele, J. Chem. Theory Comput. 5, 3010 (2009).
91. J. Gerratt and I. M. Mills, J. Chem. Phys. 49, 1719 (1968).
92. N. C. Handy and H. F. Schaefer, J. Chem. Phys. 81, 5031 (1984).
93. T. J. Lee, S. C. Racine, J. E. Rice, and A. P. Rendell, Mol. Phys. 85, 561 (1995).
94. Y. Osamura, Y. Yamaguchi, P. Saxe, D. Fox, M. Vincent, and H. Schaefer, J. Mol. Struct.: THEOCHEM 103, 183 (1983).
95. V. Weber and C. Daul, Chem. Phys. Lett. 370, 99 (2003).
96. T. Abe, Y. Sekine, and F. Sato, Chem. Phys. Lett. 557, 176 (2013).
97. H. Weiss, R. Ahlrichs, and M. Häser, J. Chem. Phys. 99, 1262 (1993).
98. O. H. Nielsen and R. M. Martin, Phys. Rev. Lett. 50, 697 (1983).
99. O. H. Nielsen and R. M. Martin, Phys. Rev. B 32, 3780 (1985).
100. D. Marx and J. Hutter, Ab Initio Molecular Dynamics: Theory and Implementation, Modern Methods and Algorithms of Quantum Chemistry (John Von Neumann Institute for Computing, Forschungszentrum Julich, 2000), pp. 329–477.
101. J. Schmidt, J. VandeVondele, I.-F. W. Kuo, D. Sebastiani, J. I. Siepmann, J. Hutter, and C. J. Mundy, J. Phys. Chem. B 113, 11959 (2009).
102. A. Dal Corso and R. Resta, Phys. Rev. B 50, 4327 (1994).
103. L. C. Balbás, J. L. Martins, and J. M. Soler, Phys. Rev. B 64, 165110 (2001).
104. K. Doll, R. Dovesi, and R. Orlando, Theor. Chem. Acc. 112, 394 (2004).
105. T. H. Dunning, J. Chem. Phys. 90, 1007 (1989).
106. D. E. Woon and T. H. Dunning, J. Chem. Phys. 98, 1358 (1993).
107. J. Paier, C. V. Diaconu, G. E. Scuseria, M. Guidon, J. VandeVondele, and J. Hutter, Phys. Rev. B 80, 174114 (2009).
108. J. P. Perdew, K. Burke, and M. Ernzerhof, Phys. Rev. Lett. 77, 3865 (1996).
109. F. H. Allen, Acta Crystallogr., Sect. B: Struct. Sci. 58, 380 (2002).
110. L. Maschio, D. Usvyat, M. Schütz, and B. Civalleri, J. Chem. Phys. 132, 134706 (2010).
111. L. Maschio, B. Civalleri, P. Ugliengo, and A. Gavezzotti, J. Phys. Chem. A 115, 11179 (2011).
112. A. W. Hewat and C. Riekel, Acta Crystallogr., Sect. A: Cryst. Phys., Diffr., Theor. Gen. Crystallogr. 35, 569 (1979).
113. R. Boese, N. Niederprüm, D. Bläser, A. Maulitz, M. Y. Antipin, and P. R. Mallinson, J. Phys. Chem. B 101, 5794 (1997).
114. L. L. Shipman, A. W. Burgess, and H. A. Scheraga, J. Phys. Chem. 80, 52 (1976).
115. W. Keesom and J. Kohler, Physica 1, 655 (1934).
116. A. Curzon, Physica 59, 733 (1972).
117. A. Simon and K. Peters, Acta Crystallogr., Sect. B: Struct. Crystallogr. Cryst. Chem. 36, 2750 (1980).
118. C. Müller and D. Usvyat, J. Chem. Theory Comput. 9, 5590 (2013).
119. P. J. Bygrave, N. L. Allan, and F. R. Manby, J. Chem. Phys. 137, 164102 (2012).
120. O. Sode, M. Keceli, K. Yagi, and S. Hirata, J. Chem. Phys. 138, 074501 (2013).
121. S. Wen and G. J. O. Beran, J. Chem. Theory Comput. 7, 3733 (2011).
122. S. Grimme, J. Antony, S. Ehrlich, and H. Krieg, J. Chem. Phys. 132, 154104 (2010).
123. P. Jurecka, J. Sponer, J. Cerny, and P. Hobza, Phys. Chem. Chem. Phys. 8, 1985 (2006).
124. T. Takatani, E. G. Hohenstein, M. Malagoli, M. S. Marshall, and C. D. Sherrill, J. Chem. Phys. 132, 144104 (2010).
125. S. Tosoni, C. Tuma, J. Sauer, B. Civalleri, and P. Ugliengo, J. Chem. Phys. 127, 154102 (2007).
126. S. Hirata, J. Chem. Phys. 129, 204104 (2008).
127. K. E. Riley and P. Hobza, J. Phys. Chem. A 111, 8257 (2007).
128. M. Del Ben, J. VandeVondele, and B. Slater, J. Phys. Chem. Lett. 5, 4122 (2014).
129. F. Neese, T. Schwabe, and S. Grimme, J. Chem. Phys. 126, 124115 (2007).
130. R. C. Lochan, Y. Shao, and M. Head-Gordon, J. Chem. Theory Comput. 3, 988 (2007).
131. H. Ji, Y. Shao, W. A. Goddard, and Y. Jung, J. Chem. Theory Comput. 9, 1971 (2013).
132. A. M. Burow, J. E. Bates, F. Furche, and H. Eshuis, J. Chem. Theory Comput. 10, 180 (2014).
