Enabling Large-Scale Condensed-Phase Hybrid Density Functional Theory Based Ab Initio I: Theory, Algorithm, and Performance

Hsin-Yu Ko,1, 2 Junteng Jia,1 Biswajit Santra,2, 3 Xifan Wu,3 Roberto Car,2, 4 and Robert A. DiStasio Jr.1, ∗ 1Department of Chemistry and Chemical Biology, Cornell University, Ithaca, NY 14853, USA 2Department of Chemistry, Princeton University, Princeton, NJ 08544, USA 3Department of , Temple University, Philadelphia, PA 19122, USA 4Department of Physics, Princeton University, Princeton, NJ 08544, USA (Dated: May 10, 2021) By including a fraction of exact exchange (EXX), hybrid functionals reduce the self-interaction error in semi-local density functional theory (DFT), and thereby furnish a more accurate and reliable description of the underlying electronic structure in systems throughout biology, chemistry, physics, and materials science. However, the high computational cost associated with the evaluation of all required EXX quantities has limited the applicability of hybrid DFT in the treatment of large and complex condensed-phase materials. To overcome this limitation, we describe a linear-scaling approach that utilizes a local representation of the occupied orbitals (e.g., maximally localized Wannier functions (MLWFs)) to exploit the sparsity in the real-space evaluation of the quantum mechanical exchange interaction in finite-gap systems. In this work, we present a detailed description of the theoretical and algorithmic advances required to perform MLWF-based ab initio molecular dynamics (AIMD) simulations of large-scale condensed-phase systems of interest at the hybrid DFT level. We focus our theoretical discussion on the integration of this approach into the framework of Car-Parrinello AIMD, and highlight the central role played by the MLWF-product potential (i.e., the solution of Poisson’s equation for each corresponding MLWF-product density) in the evaluation of the EXX energy and wavefunction forces. We then provide a comprehensive description of the exx algorithm implemented in the open-source Quantum ESPRESSO program, which employs a hybrid MPI/OpenMP parallelization scheme to efficiently utilize the high-performance computing (HPC) resources available on current- and next-generation supercomputer architectures. This is followed by a critical assessment of the accuracy and parallel performance (e.g., strong and weak scaling) of this approach when performing AIMD simulations of liquid water in the canonical (NVT ) ensemble. With access to HPC resources, we demonstrate that exx enables hybrid DFT based AIMD simulations of condensed-phase systems containing 500−1000 atoms (e.g., (H2O)256) with a walltime cost that is comparable to semi-local DFT. In doing so, exx takes us one step closer to routinely performing AIMD simulations of complex and large-scale condensed-phase systems for sufficiently long timescales at the hybrid DFT level of theory.

I. INTRODUCTION Schrödinger equation.8 In this hierarchical classification of DFT, the first rung is given by the local (spin) density 4,9 LDA In view of its quite favorable balance of accuracy and approximation (LDA), in which the form of Exc is computational cost, Kohn-Sham (KS) density functional obtained from the solution to the homogeneous theory1–4 (DFT) has become the most widely used elec- gas. As such, LDA works particularly well for systems tronic structure method for ab initio molecular dynam- with a (nearly) uniform electron density (ρ(r)), e.g., the ics (AIMD) simulations of large molecules and complex valence in metallic solids. The next rung in- 5–7 cludes xc functionals based on the semi-local generalized condensed-phase materials. Within the framework of 10–12 KS-DFT, the total ground-state energy (E) is given as gradient approximation (GGA), which utilize the gra- the sum of the following contributions: dient of the electron density (∇ρ(r)) to correct the LDA description of systems with spatially varying ρ(r), e.g., E = Ekin + Eext + EH + Exc, (1) molecules and heterogeneous materials. At the current time, GGAs such as the non-empirical Perdew-Burke- in which Ekin is the KS kinetic energy, Eext is the external 12 arXiv:1911.10630v2 [physics.comp-ph] 7 May 2021 potential which accounts for the nuclear-electronic and Ernzerhof (PBE) xc functional are the computational nuclear-nuclear potential energies (as well as any exter- workhorses for AIMD simulations of condensed-phase sys- tems containing 100s–1000s of atoms. In this size regime, nal fields), EH is the Hartree energy, i.e., the average (classical) Coulomb interaction energy of the electrons, GGA-based approaches provide a favorable compromise between accuracy and computational cost, and have been and Exc is the electronic exchange-correlation (xc) energy. Explicit forms for all of the energy contributions in Eq. (1) quite successful in qualitatively (and sometimes even quan- titatively) describing a number of systems and processes are known except Exc, the approximation of which is still the subject of active research to date. of interest throughout chemistry, physics, and materials science. Functional approximations to Exc are often described as the rungs of “Jacob’s Ladder,” which connect the Hartree Despite such widespread success, GGA functionals are world to the exact solution of the time-independent unable to account for non-local electron correlation effects, 2 which are responsible for the ubiquitous class of disper- in which φi and φj represent the occupied KS orbitals sion (or van der Waals) interactions. As such, several and the sum extends over all No states. Defining the approaches have been devised to incorporate these long- orbital-product density as 13–16 range forces into the framework of DFT, and include ∗ 17–21 ρij(r) φ (r)φj(r), (4) effective pairwise models, methods that account for ≡ i many-body dispersion interactions,22–26 as well as non- i.e. local xc functionals.27–29 We note in passing that third- and the corresponding orbital-product potential ( , the Coulomb potential felt by a test charge located at r rung meta-GGA functionals, which incorporate second- 0 2 ρ (r ) derivative information via the Laplacian ( ρ(r)) or the originating from the ij charge distribution) as ∇ kinetic-energy density (τ(r)), are able to account for Z 0 30–36 0 ρij(r ) intermediate-range correlation effects. As such, these vij(r) dr , (5) ≡ r r0 approaches have experienced a resurgence with the recent | − | introduction of the SCAN functional,35 which has shown 37–39,41,42 allows one to express Eq. (3) in the following compact promising results for bulk water systems and form: interfacial water.40,43 X Z Another significant shortcoming associated with GGA Exx = dr ρij(r)vji(r). (6) − (as well as meta-GGA) functionals is their propensity ij to suffer from self-interaction error (SIE), an artifact in approximate xc functionals that manifests as a spurious Evaluation of vji(r) is therefore of central importance in interaction between an electron and itself.44,45 In the pres- EXX calculations. For periodic systems, this quantity ence of SIE, ρ(r) is too delocalized, which in turn often is usually computed through the convolution theorem leads to deleterious effects such as inadequate descriptions (shown here at the Γ-point only), of transition states and charge transfer complexes,46–48 49 fwdFFT underestimation of band gaps, overestimation of lat- ρji(r) ρji(G) tice parameters in a wide variety of solids,50 as well as −−−−→ 51–53 ρji(G) invFFT excessive proton delocalization in liquid water, to vji(G) = 4π vji(r), (7) G 2 −−−−→ name a few. While SIE can be largely eliminated by | | self-interaction correction (SIC) based methods,44,54–56 in which ρji(G) and vji(G) are the Fourier coefficients of the most commonly adopted approach for ameliorating ρji(r) and vji(r), respectively. We note in passing that the SIE present in semi-local KS-DFT is through the the divergence of vji(G) when G = 0 needs to be treated admixture of a fraction of exact exchange (EXX) in the with care when evaluating Exx using reciprocal-space underlying GGA (or meta-GGA) xc functional.57 These methods. In the real-space algorithm described herein, so-called hybrid (or hyper-GGA) xc functionals constitute we sidestep this divergence as vji(G = 0) is implicitly de- the fourth rung in the DFT hierarchy, and can be written termined by the boundary conditions imposed during the as (shown here as a correction to a GGA xc functional): solution of Poisson’s equation (i.e., vji(r ) = 0). The → ∞ hybrid GGA GGA computational scaling associated with both the forward E = axExx + (1 ax) E + E , (2) xc − x c (fwdFFT) and inverse (invFFT) fast Fourier transforms GGA is (NFFT log NFFT), where NFFT is the size of the recip- in which Exx is the EXX energy, Ex is the GGA O GGA rocal space (planewave) grid, which grows linearly with exchange energy, and Ec is the GGA correlation energy. system size. Since the evaluation of Exx in Eq. (6) re- The mixing parameter (ax) in this expression depends on 57–59 quires a sum over the contributions from all No(No + 1)/2 the hybrid xc functional approximation, the optimal unique pairs of occupied orbitals, the overall computa- value of which (for a given system) can be determined 2 60 tional scaling becomes (N NFFT log NFFT). Neglecting from a self-consistent GW calculation. By reducing the O o the logarithmic dependence, the resulting cubic-scaling SIE, hybrid xc functionals are typically more accurate cost makes this reciprocal-space EXX algorithm quite com- than GGA (or meta-GGA) approaches, in particular for 50 putationally demanding and limits routine performance the prediction of lattice parameters, reaction energy 46–48 61 of hybrid DFT based AIMD simulations on large-scale barriers, and band gaps. In this work, we limit our 58 condensed-phase systems. Hence, most condensed-phase focus to the non-empirical PBE0 hybrid xc functional, 1 12 calculations with hybrid DFT still remain limited to pre- in which ax = /4 and the PBE GGA functional is used GGA GGA dicting energetic and structural properties in the absence for E and E . Application of our approach (which x c of thermal effects. is described below) to other popular hybrid xc functionals 11,57 Significant progress has been made to accelerate such as B3LYP is straightforward. condensed-phase EXX calculations by employing the For a closed-shell system with No doubly occupied following theoretical and numerical techniques: range- orbitals (bands), Exx can be written as: separation62 or truncation63 of the underlying Coulomb Z Z ∗ ∗ 0 0 operator, implementation of massively parallel algo- X 0 φi (r)φj (r )φj(r)φi(r ) 64–67 Exx = dr dr , (3) rithms, employment of auxiliary atom-centered (lo- − r r0 68–70 ij | − | calized) basis sets, adaptive compression (low-rank 3 decomposition) of the EXX operator (ACE),71–73 use of MLWFs has already been demonstrated,109–111 making the projected commutator direct inversion of the iterative the computational cost associated with orbital localiza- subspace (PC-DIIS) method to reduce the number of self- tion negligible between AIMD steps. In light of this consistent field (SCF) iterations,74 utilization of sparsity computationally efficient orbital localization scheme, the through localization methods (e.g., maximally localized wide availability of MLWFs, and the promise of a ro- Wannier functions (MLWFs),75–78 recursive subspace bi- bust tool for on-the-fly analytics, we will now focus our section (RSB),79,80 selected columns of the density matrix discussion on the development and implementation of a (SCDM),81–83 and other localized representations84), as linear-scaling (order(N)) MLWF-based EXX algorithm well as combinations thereof.63,68,85–87 which can be used to perform large-scale condensed-phase AIMD simulations at the hybrid DFT level of theory. To enable large-scale hybrid DFT based AIMD simula- In this work, we will focus on Car-Parrinello molecular tions in the condensed phase, the most promising methods 112 dynamics (CPMD) simulations of sufficiently large and for reducing the intrinsic computational cost and scaling finite-gap condensed-phase systems such that the first via associated with EXX exploit sparsity localized repre- Brillouin zone can be accurately sampled at the Γ-point. sentations of the occupied space or density matrix. For 79,80 Extensions to Born-Oppenheimer molecular dynamics example, the RSB method of Gygi and coworkers uses (BOMD) and metallic systems113,114 are possible and will a non-iterative algebraic decomposition of the wavefunc- be discussed in future work. Working at the Γ-point allows tion coefficients, which provides a transformation from the ∗ us to consider real-valued orbitals only, i.e., φi(r) = φ (r), occupied KS eigenstates to a set of localized orbitals that i from which it follows that ρij(r) = ρji(r) and vij(r) = are contained within prescribed domains in real space. vji(r) in Eqs. (4)–(6). Without loss of generality, we This method has already enabled a number of AIMD sim- will also assume that the total wavefunction is closed e.g. ulations using hybrid xc functionals ( , computational shell (spin-unpolarized). Under these conditions, one can investigations into the density of ice at finite tempera- show that the set of MLWFs, which are obtained via an ture,88 ion solvation,89,90 as well as the structural and 51–53 orthogonal (unitary) transformation of the occupied KS vibrational properties of liquid water ) and is partic- eigenstates, i.e., ularly convenient for simulating heterogeneous systems such as solid-liquid interfaces91 due to the ease of selecting X φei(r) = Uijφj(r), (8) the prescribed localization domains. The SCDM method j by Damle, Lin, and Ying exploits the sparsity of the off- diagonal elements of the density matrix,81–83 and does not have a significantly smaller support (or compact domain) rely on an initial guess to iteratively localize the occupied than the entire simulation cell, and are in fact exponen- space. As such, this approach sidesteps issues related to tially localized in real space.75,115–119 These features of gauge invariance and can furnish more robust (i.e., non- the MLWF representation of the occupied space provide iterative) localized orbitals than other optimization-based a theoretical and computational framework for exploit- schemes.92 The MLWF formalism introduced by Marzari ing the natural sparsity in the real-space evaluation of and Vanderbilt75 uses an iterative scheme to obtain a lo- the EXX energy (and wavefunction forces) that we will calized representation of the occupied KS orbitals by min- explore in this work. imizing the total spread functional (e.g., the sum of the To demonstrate that the use of MLWFs leads to a linear- spreads of the individual localized orbitals) and therefore scaling EXX approach, consider the expression for Exx extends the well-known Boys orbital localization scheme93 in Eq. (6). Since this quantity is invariant to orthogonal used in into the condensed phase. ML- transformations of the occupied orbitals (see Sec. II C), WFs have shown great promise as both qualitative and evaluation of Exx can be performed exactly within the quantitative analysis tools due to their similarity to the MLWF representation. The first level of computational orbitals encountered in molecular orbital (MO) theory savings originates from the fact that a given MLWF only (i.e., bonding and lone pairs) and the fact that they allow appreciably overlaps with a subset of neighboring ML- one to obtain molecular multipole moments,94–96 parti- WFs. This makes the number of non-vanishing EXX pair tion the charge density97 and/or electrostatic potential,98 interactions per orbital independent of the system size, and even compute non-bonded dispersion interactions99 and thereby reduces the total number of orbital pairs in complex condensed-phase environments. Numerous al- required in the summation over i and j in Eq. (6). In gorithms (such as wannier90100) for obtaining MLWFs addition, one can further exploit the fact that a numeri- have been incorporated into a number of existing com- cally exact evaluation of Exx only requires that the spatial munity codes such as Quantum ESPRESSO (QE),101,102 integral in Eq. (6) be performed on the support of the SIESTA,103 ABINIT,104 NWChem,105 GPAW,106 CP2K,107 orbital-product density. Since this quantity is sparse in and VASP,108 which makes this localization scheme read- the MLWF representation, this integration can be re- ily available and quite practical for a posteriori analyses stricted to a real-space domain that is also independent of of DFT-based calculations and AIMD simulations. Fur- the system size. Taken together, these observations can thermore, the MLWF localization scheme is particularly be leveraged to construct a computationally efficient and suitable for large-scale hybrid DFT based AIMD sim- linear-scaling MLWF-based algorithm for computing Exx. ulations since a Car-Parrinello-like propagation of the For a more detailed description of the theoretical under- 4 pinnings of this approach and the associated algorithmic and parallel performance (e.g., strong and weak scaling) implementation, see Secs. II C and III, respectively. of our implementation when performing AIMD simula- The initial concept and several pilot algorithms for this tions of liquid water in the NVT ensemble on multiple MLWF-based EXX approach76,78 have already been suc- different HPC architectures. In doing so, we demon- cessfully used to enable a number of large-scale hybrid strate that exx enables hybrid DFT based AIMD simula- DFT based applications, e.g., computational investiga- tions of (H2O)256—a condensed-phase system containing tions into the electronic structure of semi-conducting > 750 atoms—with a walltime cost that is comparable solids,120,121 the structural properties of ambient liquid to semi-local DFT and minimal errors in the EXX con- water,78,122,123 the structure and dynamics of aqueous tribution to the total energy, wavefunction forces, ionic ionic solutions,124,125 as well as the thermal properties of forces, and binding energetics. As such, the work de- the pyridine-I molecular .126 In this manuscript, scribed herein will further enable us to utilize the fourth we build upon our earlier work by presenting a detailed rung of DFT in the study of the structure, properties, description of the theoretical and algorithmic advances and dynamics of a number of important condensed-phase that are required to perform accurate and efficient MLWF- systems, as well as perform hybrid DFT based AIMD based AIMD simulations of large-scale condensed-phase simulations across extended length and time scales which systems at the hybrid DFT level. We focus our theoretical have been prohibitively difficult to access to date. discussion on the integration of this approach into the Although the current version of exx is restricted to CPMD framework by providing a detailed derivation of condensed-phase systems in fixed orthorhombic simula- the EXX contributions to the equations of motion under- tion cells, an extension of this approach that treats gen- lying fixed-cell CPMD simulations in the microcanonical eral Bravais lattices and allows for hybrid DFT based (NVE) and canonical (NVT ) ensembles. In particular, AIMD simulations in the isobaric-isoenthalpic (NpH) and we include an in-depth discussion of a dual-level strategy isobaric-isothermal (NpT ) ensembles will be discussed in which describes how the use of localized orbitals (like the next paper in this series. Since exx is quite mod- MLWFs) can lead to a linear-scaling EXX algorithm by ular, this algorithm can also be incorporated into any exploiting the underlying sparsity in the real-space eval- planewave-based DFT code; when combined with linear- 127,128 129 uation of the exchange interaction. Influenced by the scaling GGA codes such as PARSEC, BigDFT, 130 131 work of Gygi and co-workers,79,80,91 we also introduce the ONETEP, or CONQUEST, exx could also be leveraged concepts of MLWF-orbital and MLWF-product domains, to achieve a fully (overall) linear-scaling hybrid DFT ap- which can be used to design an algorithmic framework proach. We note in passing that the MLWF-based EXX that has the potential to enable accurate and efficient approach described herein also sets the stage for per- hybrid DFT simulations of condensed-phase systems with forming large-scale condensed-phase AIMD simulations widely varying MLWF spreads. based on quantum chemical (i.e., wavefunction theory) In addition to this theoretical discussion, we also pro- methodologies. Since a majority of the theoretical and al- vide a comprehensive description of a massively paral- gorithmic developments presented in this work are directly lel algorithm that extends well beyond our earlier pi- applicable to the iterative solution of the Hartree-Fock lot algorithms and uses this MLWF-based approach to (HF) equations, this approach can be extended to enable compute all of the EXX contributions needed during a hierarchy of post-HF local electron correlation meth- ods. Additional directions also include range-separated hybrid DFT simulations of large-scale condensed-phase 132–136 NVE NVT hybrids (RSH) as well as fifth-rung xc functionals systems in the and ensembles. Recently im- 137,138 plemented in the - and planewave-based (e.g., MLWF-based GW approaches ). open-source QE package,102 the so-called exx module The remainder of the paper is organized as follows. In exploits this dual-level linear-scaling strategy and em- Sec. II, we describe the theoretical framework for perform- ploys a hybrid message-passing interface (MPI) and open ing CPMD simulations at the hybrid DFT level of theory multi-processing (OpenMP) parallelization scheme to ef- within the MLWF representation. Sec. III contains details ficiently utilize high-performance computing (HPC) re- of our massively parallel algorithmic implementation in sources. Compared with earlier pilot versions, exx signif- the open-source QE package. This is followed by a detailed icantly improves the applicability of our MLWF-based ap- systematic analysis of the accuracy and computational proach to large-scale AIMD (i.e., the strong-scaling limit) performance of the current implementation in Sec. IV. by introducing a hybrid parallelization scheme (which al- The paper is then completed in Sec. V, which provides lows users to exploit both internode and intranode compu- some brief conclusions as well as the future outlook of tational resources), a completely revised algorithm (which AIMD simulations using hybrid DFT. balances computation and communication, and reduces the overall memory footprint), as well as a more flexible and general-purpose implementation (which accommo- II. THEORY dates a wide range of users, including those with limited computational resources as well as those working at the In this section, we describe the theory behind our real- massively parallel HPC limit). space MLWF-based framework for performing large-scale This is followed by a critical assessment of the accuracy AIMD simulations of finite-gap condensed-phase systems 5 at the hybrid DFT level of theory. We will focus the 2. EXX Contribution to the Wavefunction Forces discussion below on the equations of motion underly- NVE NVT ing fixed-cell CPMD simulations in the and In KS-DFT, Exc is a functional of the electron density, ensembles. Extension to constant-pressure CPMD simu- P ∗ which is given by ρ(r) = 2 i φi (r)φi(r). As such, one lations will be discussed in a forthcoming paper in this can write the Exc contribution to the (negative of the) series. Although we limit our scope here to CPMD, which wavefunction force for the i-th KS orbital as the action provides a computationally efficient localized orbital prop- of the so-called xc potential, vxc(r) (δExc/δρ(r)), on 109–111 ≡ agation scheme, a cost-effective and competitive the orbital itself, i.e., extension to BOMD has been achieved by our group and      will also be addressed in another paper in this series. δExc δExc δρ(r) ∗ = ∗ = 2vxc(r)φi(r). (11) δφi (r) δρ(r) δφi (r)

Since the explicit functional dependence of Exx (in A. Index Conventions hybrid Exc ) on ρ(r) is unknown, one needs special proce- dures such as the optimized effective potential (OEP) We will utilize the following conventions for the various method139 to derive the EXX contribution to the wave- indices encountered in this work: function forces within a strict KS-DFT scheme. In this work, we adopt a generalized KS-DFT scheme (i.e., by • i, j, k: indices for the No occupied orbitals (or allowing for an orbital-dependent vxc(r)), which requires MLWFs) significantly less computational effort and yields the same ground-state energies as the OEP formalism. In this ap- • a, b, c: indices corresponding to the Cartesian di- proach (which is currently the standard practice in the rections x, y, and z field), we compute the corresponding orbital-dependent i ∗ EXX wavefunction forces, D (r) = (∂Exx/∂φ (r)), by xx − i taking the functional derivative of Exx in Eq. (6) with • I, J, K: indices for the NA ions ∗ respect to φi (r), yielding:

• q: index for points on the real-space grid i X X ij D (r) = vij(r)φj(r) D (r). (12) xx ≡ xx j j • l, m: indices for spherical harmonics To derive this expression, we have used Eqs. (4) and (5) for the orbital-product density and potential, ρij(r) and v (r) Dij (r) v (r) φ (r) B. EXX-Based CPMD in the NVE Ensemble ij , and defined xx as the action of ij on j . From Eq. (12), it is again clear that the evaluation of the orbital-product potential, vij(r), is of central importance 1. Equations of Motion i to the calculation of Dxx(r).

In CPMD simulations, fictitious dynamics are intro- C. Real-Space EXX Calculations: Linear Scaling duced on the No occupied KS orbitals φi (r) via artifi- { } cial (fictitious) masses µ. Hence, CPMD simulations in via Orbital Localization the NVE ensemble are governed by the following equa- tions of motion for the electronic and ionic degrees of The efficient evaluation of vij(r)—which is a required 6 i freedom: ingredient for computing Exx and Dxx(r)—is key to en- abling large-scale condensed-phase AIMD simulations at   ¨ δE X the hybrid DFT level of theory. In this section, we will µφi(r) = + Λijφj(r) (9) − δφ∗(r) describe a linear-scaling EXX method that exploits the i j natural sparsity of the quantum mechanical exchange in- ¨ MI RI = ( RI E) , (10) teraction in real space via the use of a localized (MLWF) − ∇ representation of the occupied orbitals. Within this frame- in which Newton’s dot notation is used to indicate time work, veij(r) (which is the MLWF analog of vij(r) in derivatives, E is the total ground-state DFT energy in Eq. (5)) only needs to be computed for overlapping pairs Eq. (1), (δE/δφ∗(r)) is the force acting on the i-th of MLWFs on a real-space domain that is independent of − i occupied KS wavefunction, Λij is a Lagrange multiplier the system size, thereby paving the way to a linear-scaling enforcing orthonormality in φi(r) , and R E is the EXX method in the condensed phase (see Sec. III for algo- { } −∇ I force acting on the I-th ion (which is located at RI with rithmic details). As such, the cornerstone of our method mass MI ). At the hybrid DFT level, the equations of is the efficient real-space evaluation of veij(r), which is motion in Eqs. (9) and (10) will only depend on Exx accomplished herein via the solution of Poisson’s equation via the wavefunction forces, (δE/δφ∗(r)), which are on a system-size independent real-space domain for each − i discussed in detail below. overlapping MLWF pair. 6

For a finite-gap condensed-phase system, the occupied employed representation. It is this freedom in the choice KS orbitals (or bands) can be mapped via an orthogonal of the orthogonal transformation that allows one to se- transformation onto a unique set of MLWFs (see Eq. (8)), lect an appropriate localized orbital representation (e.g., that are exponentially localized in real space75,115–119 MLWF) for exploiting the underlying sparsity in the EXX and have a significantly smaller support than the entire interaction. Throughout this work, we will dress each of simulation cell. As such, the MLWF representation of the MLWF-specific quantities with a tilde to distinguish the occupied space allows one to exploit the underlying them from their analogous expressions in the canonical sparsity in the quantum mechanical exchange interaction, KS representation. and provides a theoretical and computational framework Given the expression for Exx in the MLWF representa- for substantially reducing the computational scaling and tion (cf. Eqs. (14)–(15)), the corresponding EXX contri- cost associated with EXX-based approaches. butions to the wavefunction forces that are required to To see how MLWFs can be leveraged to attain a linear- propagate the CPMD equations of motion (Eqs. (9) and scaling EXX algorithm, we first transform the canonical (10)) can be derived following the same procedure given Exx expression in Eq. (3) into the MLWF representation. above in Sec. II B 2. In this regard, the wavefunction force P −1 i ∗ Since φj(r) = (U )jiφei(r) (cf. Eq. (8)), Exx can be on the i-th MLWF, Dexx(r) = (δExx/δφe (r)), can be i − i written as follows: obtained from Eqs. (14)–(17), yielding: Z Z 0 0 i X X ij X X 0 φek(r)φek0 (r )φek00 (r)φek000 (r ) Dexx(r) = vij(r)φej(r) Dexx(r), (18) Exx = dr dr e ≡ − r r0 j j ij kk0 00 000 | − | k k ij −1 −1 −1 −1 where Dexx(r) has been defined as the action of veij(r) (U )ik(U )jk0 (U )jk00 (U )ik000 . (13) i ij × on φej(r). Here, Dexx(r) and Dexx(r) also depend on the Utilizing the fact that UU T = UU −1 = I for an or- MLWF representation and therefore take on different val- thogonal matrix, summation over i and j in this expres- ues when compared to their KS analogs in Eq. (12). From P −1 −1 −1 −1 Eqs. (15) and (18), it is again clear that the evaluation of sion leads to ij(U )ik(U )jk0 (U )jk00 (U )ik000 = P −1 P −1 the MLWF-product potential, vij(r), is the cornerstone U (U ) 000 U 0 (U ) 00 = δ 000 δ 0 00 e i ki ik j k j jk kk k k , from of our MLWF-based EXX approach. which we see that With the expressions required for the evaluation of Exx Z Z 0 0 and De i (r) in hand, we will now discuss in detail how X 0 φei(r)φej(r )φej(r)φei(r ) xx Exx = dr dr (14) − r r0 MLWFs lead to a linear-scaling EXX algorithm by exploit- ij | − | ing the underlying sparsity in the exchange interaction. Since the set of MLWFs are exponentially localized in real upon dummy variable substitutions of k i and k0 j. → → space and therefore have a significantly smaller support This proof demonstrates that the expression for evalu- than the entire simulation cell, this allows us to exploit ating Exx is invariant to the orthogonal transformation two levels of sparsity during the computational evaluation between the KS and MLWF representations. In fact, this of all required EXX-related quantities. The first level invariance property of Exx also holds for any arbitrary of computational savings originates from the fact that a orbital representation ψi(r) that is derived from an 0 { } given MLWF, φei(r), will only appreciably overlap with a orthogonal rotation U within the occupied KS subspace P 0 number, nei, of neighboring MLWFs. For all other MLWFs, (i.e., ψi(r) = j Uijφj(r)). In analogy to Eq. (6), the the product density, ρeij(r) = φei(r)φej(r) (and hence the MLWF expression for Exx in Eq. (14) can also be written corresponding product potential, vij(r)), will be vanish- in the following compact form: e ingly small. In these cases, the contributions to Exx and i X Z Dexx(r) are numerically zero, and this directly reduces Exx = dr ρij(r)vij(r), (15) − e e the number of terms that are required in the summation ij over j in Eqs. (15) and (18). As such, the number of EXX pair interactions per orbital becomes independent in terms of the MLWF-product density, of system size (assuming a fixed system density), which reduces the total number of orbital pairs, Npair, from ρij(r) φei(r)φej(r) = ρji(r), (16) 2 e ≡ e (No ) to (No), i.e., Npair = No(No + 1)/2 nNo. In O O → e this last expression, n = maxi ni < No is independent and the corresponding MLWF-product potential, e {e } of the system size, hence nNe o represents an upper bound Z 0 to the number of EXX pair interactions in our approach. 0 ρeij(r ) vij(r) dr = vji(r). (17) e ≡ r r0 e Since the contributions from the omitted MLWF pairs to i | − | Exx and Dexx(r) are vanishingly small, this reduction in We note in passing that while Exx is invariant to any or- Npair still allows for a numerically exact evaluation of all thogonal transformation, the values of ρeij(r) and veij(r)— EXX-related quantities. Although this leads to significant despite the fact that they have the same expression as computational savings, the overall scaling associated with that given in Eqs. (4) and (5)—do in fact depend on the evaluating these quantities is still formally quadratic as 7 the real-space domain associated with the simulation cell, hence, the center of this charge distribution cannot be R R Ω, grows linearly with the size of the system. analogously defined as dr rρeij(r) / dr ρeij(r). As such, we utilize an analog of the standard gauge in molecular quantum mechanics for an electrically neutral system, wherein the “center of charge” is taken as the position at which the nuclear (ionic) dipole moment vanishes. This R R allows us to define Ceij = dr r ρij(r) / dr ρij(r) as |e | |e | the corresponding center for ρeij(r). By making all sectors of this charge distribution positive, ρij(r) now has a siz- |e | able monopole and a well-defined center of charge given by Ceij. By construction, this choice of gauge recovers the correct center of charge when i = j, i.e., Ceii = Cei, and is therefore consistent with the expression used above for ρeii(r). Within this framework, both Ωi and Ωij are system- size independent and substantially smaller than Ω. Fur- FIG. 1. Graphical depiction of an overlapping pair of expo- thermore, since Ωij is defined as the overlapping region nentially localized MLWFs, φei(r) (solid) and φej (r) (dashed). between two exponentially decaying MLWFs, φei(r) and As such, the supports for φei(r) and φej (r) are compact and φej(r), the extent of this domain is smaller than both Ωi labelled by Ωi and Ωj , respectively. The support of the corre- and Ωj, which holds true even when i = j. From Eqs. (15) sponding MLWF-product density, ρeij (r) = φei(r)φej (r) (green), and (16), one sees that a numerically exact evaluation of is also compact and labelled by Ω = Ω ∩ Ω . As described ij i j Exx (neglecting self-consistency effects, vide infra) only in the text, a numerically exact evaluation of the EXX contri- requires summation over overlapping ij pairs (denoted by bution to the energy (Exx) requires the solution to Poisson’s ij ) and spatial integration over Ωij, i.e., equation for the near-field potential (veij (r)) on the Ωij domain h i (see Eqs. (19) and (21)), while a numerically exact evaluation X Z ij Exx = dr ρij(r)vij(r). (19) of the EXX contribution to the wave function forces (Dexx(r) − e e ji hiji Ωij and Dexx(r)) requires a converged multipole expansion for the far-field potential (veij (r)) on the Ωj \Ωij and Ωi \Ωij domains (see Eqs. (20) and (22)). In the same breath, Eq. (18) shows that a numerically ij exact evaluation of Dexx(r) only requires the action of To achieve linear scaling with system size, one can fur- veij(r) over Ωj, i.e., ther exploit the fact that the set of exponentially localized ij De (r) = vij(r)φej(r) r Ωj. (20) MLWFs have a substantially smaller support than Ω. This xx e ∈ allows us to employ real-space domains that are indepen- This implies that one only needs veij(r) on Ωij for a dent of system size and still maintain a numerically exact E v ( ) Ω i numerically exact evaluation of xx, and eij r on j for evaluation of Exx and De (r). To harness this second xx a numerically exact evaluation of De ij (r). As such, the level of computational savings, we follow the work of Gygi xx evaluation of vij(r) can also be restricted to system-size and co-workers79,80,91 by defining an MLWF-orbital do- e independent real-space domains, despite the fact that this main as Ωi = r Ω | φei(r) >  , in which  is a small { ∈ | | } quantity is formally non-zero across Ω, and asymptotically (but finite) positive threshold. For small enough values goes as 1/r for i = j (due to the non-vanishing monopole of , Ωi will encompass the support of φei(r) in real space 2 associated with ρeii(r)) and 1/r (or higher order) for (see Fig. 1). This domain is focused around the so-called i = j (due to the vanishing monopole associated with 6 MLWF center, Cei, which is given by the expectation value ρij(r)). This leads to even further computational savings R e (or first moment) of r, i.e., Cei = φei r φei = dr rρii(r). as vij(r) can be obtained exactly by solving Poisson’s h | | i e e In analogy, we also define an MLWF-product domain equation (PE) over Ωij in the near field, Ωij Ωi Ωj  as , which (for sufficiently small values) 2 ≡ ∩ ρ ( ) Ω vij(r) = 4πρij(r) r Ωij, (21) encompasses the support of eij r (see Fig. 1). Since ij ∇ e − e ∈ φ ( ) corresponds to the points in real space where ei r and subject to Dirichlet boundary conditions given by an φej(r) are both non-negligible, this domain is even more appropriately converged multipole expansion (ME) of Ω Ω i = j sparse than i or j. When , one can straight- ρeij(r) in the far field, i.e., forwardly compute the corresponding center of charge R R for ρii(r) as Ceii = dr rρii(r) / dr ρii(r). Since ρii(r) X Qlm Ylm(θ, ϕ) e e e e vij(r) = 4π r / Ωij. (22) e (2l + 1) rl+1 ∈ integrates to unity, Ceii = Cei, which is simply the center lm of the i-th MLWF given above. When i = j, ρij(r) now 6 e corresponds to a localized charge distribution with a van- In this expression, Ceij is taken as the origin, r = (r, θ, ϕ) ishing monopole due to the orthogonality of the MLWFs; is given in spherical polar coordinates, Ylm(θ, ϕ) are the 8 spherical harmonics, and is accomplished using either conjugate gradient (CG) or Z second-order damped dynamics (SODD) to minimize the Q = d Y ∗ (θ, ϕ)rlρ ( ), fictitious kinetic energy associated with the electronic lm r lm eij r (23) 140 Ωij degrees of freedom (while keeping the ions fixed). Dur- ing the SODD minimization, the proto-KS orbitals are are the multipole moments corresponding to ρeij(r). For evolved according to the following equations of motion typical systems, a ME with a maximum value of l = 6 is (which are equivalent to Eq. (9) with an additional damp- 76,78 sufficiently converged. We note in passing that the ing term): ME in Eq. (22) serves a dual purpose and will also be ij µϕ¨ ( ) = D ( ) 2µγϕ˙ ( ), employed during the evaluation of Dexx(r), which requires i r i r i r (24) ij − vij(r) on Ωj. In other words, Dexx(r) is computed with e in which ϕi(r) are the proto-KS orbitals during the veij(r) on Ωij via the solution to the PE in Eq. (21), and { } ∗ P SCF calculation, Di(r) (δE/δϕi (r)) + j Λijϕj(r) vij(r) on the Ωj Ωij domain, i.e., for all points in Ωj ≡ − e \ is the force acting on the i-th orbital, and γ is a damping that are not contained in Ωij, via the ME in Eq. (22). parameter. To evolve the proto-KS orbitals, Eq. (24) can This discussion again clearly highlights that an effi- be integrated to yield:140 cient real-space evaluation of veij(r)—on compact and system-size independent domains—is the cornerstone of 2 1 Γ ϕi(r, τ + ∆τ) = ϕi(r, τ) − ϕi(r, τ ∆τ) our linear-scaling MLWF-based EXX approach. In the 1 + Γ − 1 + Γ − next section, we will focus our discussion on the algorith- 2 ∆τ Di(r, τ) mic implementation of this approach, which can be used + , (25) 1 + Γ µ to perform large-scale condensed-phase AIMD simulations with hybrid DFT. in which ∆τ is the time step for the fictitious proto-KS dy- namics and Γ γ∆τ.141 Upon convergence of the SODD ≡ procedure, ϕi(r, τ) becomes a set of ground-state KS III. IMPLEMENTATION AND ALGORITHMIC { } DETAILS orbitals, which is chosen as the initial condition for the AIMD simulation (i.e., φi(r, t = 0) ). In doing so, cubic- { } scaling matrix operations such as diagonalization of the In this section, we describe the implementation of Fock (or effective Hamiltonian) matrix are completely our linear-scaling MLWF-based EXX algorithm in the 101,102 sidestepped during the SCF procedure; as such, this ap- CP module of QE. This algorithm has been im- proach does not require (nor produce) unoccupied/virtual plemented as a standalone module named exx, which states, and provides a solid foundation upon which one has been integrated with the MLWF-enabled semi-local can build a fully linear-scaling DFT (or HF) code base. DFT routines in QE via a portable input/output inter- In fact, this CP-like approach to the SCF solution face (see flowchart in Fig. 2). During each CPMD step, of the KS equations can be combined with the MLWF the main input required for exx includes the current set localization procedure by performing a nested SODD of MLWFs, φei(r) , while the output produced by this optimization of the Marzari-Vanderbilt75,77 functional { } i module includes Exx and De (r) . As such, adaptation to incrementally localize the proto-KS orbitals between { xx } of the exx module to other periodic DFT codes should each SCF step.109 In CP, this is accomplished by splitting be straightforward, as long as the capability to produce Eq. (25) into an extrapolation step, MLWFs “on-the-fly” during CPMD simulations is avail- able (vide infra). In fact, the current exx module only 2 1 Γ χj(r, τ + ∆τ) = ϕj(r, τ) − ϕj(r, τ ∆τ) requires that the input orbitals are sufficiently local and 1 + Γ e − 1 + Γ e − 2 form an orthonormal set, and can therefore accommodate ∆τ Dej(r, τ) + , (26) (with appropriate modifications) other orbital localiza- 1 + Γ µ tion schemes such as RSB79,80 and SCDM.81–83 To enable large-scale EXX-based AIMD using this approach, we followed by a localization step, employ a hybrid MPI and OpenMP parallelization scheme X that allows us to differentially exploit both internode and ϕei(r, τ + ∆τ) = Uij(τ + ∆τ)χj(r, τ + ∆τ). (27) intranode computational resources provided by massively j parallel supercomputer architectures. In the extrapolation step, an intermediary set of or- bitals, χj(r, τ + ∆τ) , is formed via SODD evolu- { } A. MLWF-based EXX-CPMD: Prerequisites tion of the proto-MLWF orbitals, ϕej(r) , according ∗ P { } to Dej(r, τ) (δE/δϕ (r, τ)) + Λejk(τ)ϕk(r, τ), the ≡ − ej k e To start a CPMD simulation, one needs to reach the force acting on the j-th proto-MLWF orbital (which in- electronic ground state for a given initial configuration of cludes Λejk(τ) , the set of Lagrange multipliers needed { } the system via a SCF calculation. In the CP module of to preserve orthonormality). In the localization step, QE, the iterative solution of the non-linear KS equations the proto-MLWFs, ϕj(r, τ + ∆τ) , are incrementally { e } 9 localized via a unitary (orthogonal) transformation over Input MLWFs χi(r, τ + ∆τ) , the intermediary set of orbitals obtained { } during the extrapolation step in Eq. (26). This unitary transformation is accomplished via U(τ + ∆τ), a matrix which is generated from a nested SODD optimization109 75,77 of the Marzari-Vanderbilt functional. During the SCF φei(r) procedure, it is computationally unfavorable (and not nec- I Redistribution essary) to converge the nested SODD minimization for of MLWFs U at each step, as the current (unconverged) orbitals φei(r) do not yet represent the ground electronic state. In this Construction of regard, only a few nested SODD steps (e.g., a maximum II Pair List and of 20 during the SCF procedure for liquid water) were Proto-Subdomains used to incrementally localize the proto-MLWF orbitals. Upon convergence of this combined SCF and localization φei(r) Communication procedure, the set of proto-MLWF orbitals, ϕi(r, τ) , { e } III of MLWFs becomes the set of ground-state MLWF orbitals, which can now be chosen as the initial condition for an MLWF- ρeij(r) based AIMD simulation (i.e., φei(r, t = 0) ). As such, Solution of { } this approach provides a cost-effective alternative to the IV Poisson’s standard a posteriori procedure of localizing the canoni- Equation cal (Bloch) occupied orbitals from a fully converged SCF vij(r) calculation. e At the hybrid DFT level, we adopt this CP-like ap- Computation of proach to incrementally localize the occupied orbitals V Energy and Forces during the EXX-based SCF procedure, thereby avoiding a preliminary EXX calculation in reciprocal space. Since i De (r) Exx the incrementally localized proto-MLWF orbitals are not Redistribution xx equivalent to the final set of MLWFs at a given SCF step, VI of Wavefunction the orbital-dependent EXX contributions to vxc(r) are ap- Forces proximately evaluated via Eq. (20); however, the resulting i errors are inconsequential as the incremental refinement of Dexx(r) i the localized orbitals (and therefore Dexx(r)) at each step leads to the desired set of MLWFs upon SCF convergence. Output EXX Since our approach is based on an incremental “on-the-fly” Energy and Forces refinement of the proto-MLWF orbitals during the SCF procedure, it is therefore unsuitable for standard Fock matrix diagonalization routines, in which global rotations FIG. 2. Flowchart of the exx module (dashed green box) in between the occupied and virtual orbitals during each CP. As described in the main text, the input required by this diagonalization step would lead to marked delocalization module includes the current set of MLWFs, {φei(r)}, at each of the occupied states. Such delocalization would require CPMD step. The output produced by exx includes the EXX energy (E ) and the EXX contribution to the wavefunction substantial effort (essentially from scratch) to relocalize xx forces ({Di (r)}). Purple (brown) circles indicate that a given the orbitals after each diagonalization step, and would exx quantity is represented according to the GRID (ORBITAL) data thereby nullify the computational savings obtained from distribution scheme (see Fig. 3 and Sec. III B), while the pale a sparse evaluation of all EXX-related quantities. yellow circles represent data that are globally broadcasted. In practice, EXX-based SCF calculations in CP take The r notation indicates local (relative) Cartesian coordinates advantage of the incremental nature of the aforemen- in a given subdomain. For a detailed description of each step tioned MLWF refinement process by starting with a rela- in the exx module, see Secs. III C 1–III C 6. tively inexpensive semi-local xc functional (e.g., PBE12 for PBE058,142), which stabilizes ρ(r) and initiates the orbital localization procedure. Once the semi-local DFT itera- this approach has a significantly reduced computational tions reach 10 the target SCF convergence threshold, cost when compared to the alternative procedure of: (i) ≈ × the orbitals are typically quite localized and closely resem- performing a standard (canonical) PBE calculation, (ii) ble the final set of MLWFs corresponding to the chosen localizing the converged PBE orbitals from scratch, and semi-local xc functional. At this point, the exx module (iii) using these localized PBE orbitals as input into the is activated to perform the remaining steps required to incremental localization procedure described above to reach SCF convergence at the hybrid DFT level, upon perform a PBE0 SCF calculation to convergence. For which one obtains the final set of MLWFs corresponding insulating systems with small band gaps (e.g., InN), one to the chosen hybrid xc functional. For all systems tested, should exercise caution when using GGA orbitals for 10 the initial guess in this procedure, as the GGA xc func- B. MLWF-based EXX-CPMD: Data Distribution tional might incorrectly predict a metallic system with Schemes the wrong band-index ordering.120 With that caveat in mind, we also note that the computational cost associated As mentioned above, we employ a hybrid MPI/OpenMP with this initial SCF procedure is completely negligible parallelization scheme to enable large-scale EXX-based when compared to the overall CPMD simulation, which AIMD on massively parallel supercomputer architectures is the focus of this work. In future work, we hope to fur- containing 1000s of nodes. Our algorithm, which is de- ther optimize this incremental SCF procedure to enable scribed in Sec. III C below, is primarily based upon the high-throughput hybrid DFT-based single-point energy MPI distributed-memory paradigm, which requires spe- evaluations on large-scale condensed-phase systems (as cific data distribution schemes to minimize communication well as linear-scaling EXX-based BOMD). overhead and maximize computational efficiency. Dur- In analogy to the incremental localization procedure ing a GGA-based CPMD simulation in QE, the orbitals, described above for EXX-based SCF calculations (cf. charge density, and potential are constantly transformed Eqs. (26)–(27)), we have also introduced this nested between real- and reciprocal-space via the fwdFFT and SODD determination of U into the propagation of the invFFT operations. With all real-space quantities nu- CPMD equations of motion. This is accomplished by a merically represented on a grid (mesh) that is discretized CPMD propagation step, along the corresponding lattice vectors, QE employs the GRID data distribution scheme to scatter these quantities across Nproc MPI processes (ranks). In the GRID data χj(r, t + ∆t) = 2φej(r, t) φej(r, t ∆t) − − distribution scheme (see Fig. 3), the real-space grid is 2 ∆t partitioned into Nslab slabs along the z axis. Assuming + Dej(r, t), (28) µ Nproc = Nslab for simplicity, each MPI process will hold the data corresponding to all distributed real-space quan- single in which an intermediary set of orbitals, χj(r, t + ∆t) , tities on a slab of the real-space grid. In doing so, { } is formed via CPMD evolution (with time step ∆t) of the this data distribution scheme facilitates efficient parallel FFT by dividing the 3D FFT into a set of 2D FFTs (each MLWF orbitals, φej(r) . During the CPMD propagation { } of which can be executed by a given MPI process within step, these intermediary orbitals become slightly more a given slab) followed by a 1D FFT along the direction delocalized than the set of MLWFs (yet remain on the of the slab partition. ground state potential energy surface), and are therefore exx refined by a subsequent localization step, As depicted in Fig. 2, the input to the module in QE includes the current set of MLWFs, φei(r) , at each { } X CPMD step. These MLWFs are distributed across MPI φei(r, t + ∆t) = Uij(t + ∆t)χj(r, t + ∆t), (29) processes according to the GRID data distribution scheme, j in which a given process holds the data corresponding to all MLWFs on a given slab of the real-space grid. Al- in which the unitary transformation U is generated by though the GRID scheme is convenient for efficient parallel tightly converging the nested SODD optimization of the FFT, this data distribution model is far from ideal for an Marzari-Vanderbilt functional. By doing so after every efficient massively parallel implementation of our MLWF- CPMD propagation step, we ensure that the resulting based EXX approach. As such, we have introduced an φei(r, t) and Dei(r, t) are indeed the MLWFs and the alternative ORBITAL data distribution scheme in QE (see { } { } forces acting on them. We note in passing that the need Fig. 3), in which a given MPI process now holds quantities i to perform the additional localization step in Eq. (29) like φei(r) and Dexx(r) for a single MLWF across the en- reflects the lack of gauge invariance in the electronic tire real-space grid (for the case in which Nproc = No; for CPMD equations of motion within the MLWF repre- other cases, see the discussion below in Sec. III C 1). The sentation.110,111 Nevertheless, the intermediary orbitals details behind the transformation between the GRID and generated by Eq. (28) are typically good approximations ORBITAL data distribution schemes are provided below to the MLWFs, and thereby provide a rather good initial in Secs. III C 1 and III C 6. guess to the SODD localization procedure.109 As a result, The ORBITAL data distribution scheme is particularly the localization procedure typically converges with a small suited for our real-space MLWF-based EXX algorithm, number of nested SODD iterations (e.g., 3–4 iterations since this approach is centered around orbital sparsity and for the liquid water systems in Sec. IV B 1), which re- the efficient evaluation of veij(r). For one, the ORBITAL sults in minimal computational overhead when compared scheme allows us to utilize a significantly larger number of to the cost of the EXX calculation. Moving forward, MPI processes (Nproc Nslab), as the number of MLWFs  this incremental localization scheme could be avoided us- or overlapping MLWF pairs (both of which grow linearly ing the field-theoretic approach proposed by Tuckerman with system size) quickly exceeds Nslab (which grows with and coworkers,111 which introduces additional fictitious the cubic root of the system size). The ORBITAL scheme dynamics on a set of gauge fields to enable “on-the-fly” also allows us to exploit intranode parallelization with propagation of the MLWF transformation (U) matrix. Nthread OpenMP threads during the most computationally 11

Step I Step VI

Input Steps II-V Output

FIG. 3. Schematic illustration of the GRID and ORBITAL data distribution schemes in QE. For simplicity, we consider a system consisting of a single water with No = 4 MLWFs (φei(r)), a simulation cell consisting of a real-space simple-cubic grid that has been partitioned into Nslab = 4 slabs along the z-direction (Zi), and a pool of Nproc = 4 MPI processes (Pi). As depicted at the top of the figure, each of these MPI processes (and the corresponding data it holds in local memory) is assigned a color: P1 (red), P2 (green), P3 (blue), and P4 (yellow). As input into the exx module, the φei(r) are provided in the GRID scheme, in which a given MPI process, Pi, holds the data corresponding to all No MLWFs on one slab, Zi, of the real-space grid. During Step I of the exx module (Sec. III C 1), the φei(r) are redistributed according to the ORBITAL scheme, in which a given MPI process, Pi, holds the data corresponding to only one MLWF, φei(r), across all Nslab slabs of the real-space grid. As described in Secs. III C 2–III C 5, Steps II–V involve selective communication of the φei(r) between MPI processes and i i computation of all EXX-related quantities (Exx and {Dexx(r)}). At the end of Step V, the {Dexx(r)} are stored according to the ORBITAL scheme, and are redistributed back to the GRID scheme during Step VI (Sec. III C 6), the final step of the exx module. intensive steps in our algorithm, e.g., solving the PE to 1. Step I: Redistribution of MLWFs obtain veij(r) (see Sec. III C 4). As a result, this hybrid MPI OpenMP / parallelization scheme not only provides us In the exx algorithm, the assignment of MLWFs to with access to even more computational resources during a given MPI process is based on ζ Nproc/No, i.e., the ≡ EXX-based simulations, but also allows us to sidestep the ratio of available MPI processes to the number of MLWFs. prohibitively large data communication overhead associ- When ζ = 1, there is one MPI process per MLWF, and ated with an MPI-based solution to the PE. each process, Pi, is assigned a unique MLWF, φei. With limited computational resources (Nproc < No), ζ < 1 and multiple MLWFs are assigned to each process; as such, a balanced distribution of MLWFs across MPI processes is only possible when Nproc is a divisor of No. In the strong-scaling limit, our exx algorithm allows for ζ to take on integer values greater than one, in which a given C. MLWF-based EXX-CPMD: Algorithm MLWF is assigned to multiple MPI processes. Unless otherwise specified, we will assume that ζ = 1 throughout the remainder of Sec. III C. In this section, we provide a detailed description for Given the current set of MLWFs in real space, φei(r) , { } each of the steps inside the exx module in QE. Our dis- which are distributed among the available Nproc MPI pro- cussion will follow the flowchart depicted in Fig. 2, in cesses according to the GRID scheme, the first step in the which the current set of MLWFs in real space ( φei(r) , exx module is the forward redistribution of these quan- { } distributed according to the GRID scheme) are provided tities into the ORBITAL data distribution scheme. For as input into the exx module. Subsequent output of exx this purpose, each MPI process collects an assigned φei(r) includes the EXX energy (Exx) as well as the EXX con- across the entire real-space grid via an ALL-TO-ALL in- tribution to the wavefunction forces ( De i (r) , which are ternode communication step, as shown in Fig. 3. This { xx } again distributed according to the GRID scheme). This ALL-TO-ALL communication is performed twice per preserves compatibility with the rest of CP, and allows CPMD step: once here in the forward redistribution for a modular exx codebase. of φei(r) from the GRID to the ORBITAL scheme, and { } 12 once in Step VI in the inverse redistribution of De i (r) ing all parallel computational resources, one needs to (i) { xx } from the ORBITAL to the GRID scheme (see Sec. III C 6). minimize the total computational workload, (ii) minimize As discussed in Sec. IV, this communication overhead the number of interprocess communication events, and represents a substantial fraction of the total cost asso- (iii) maintain a balanced workload among the pool of ciated with the exx module, and can be significantly available MPI processes. To accomplish this goal, we now reduced by a more sophisticated communication scheme describe the procedure employed in the exx module to over select subsets of the MPI process pool, i.e., those construct the so-called unique MLWF-pair list, , which L containing the regions of real space containing the rel- defines the computation and communication protocol in evant MLWF-orbital domain, Ωi (see Sec. II C). This our algorithm. To construct , the indices corresponding L algorithmic improvement is currently underway and will to all overlapping MLWF pairs (as determined by the be described in future work. aforementioned criteria based on Rpair) are first assembled into the non-unique MLWF-pair list, 0, which contains L all possible permutations ij and ji of these overlapping 2. Step II: Construction of Pair List and Proto-Subdomains MLWF pairs. Since the ij - and ji -pair contributions h i h i to Exx are equivalent (cf. Eqs. (15)–(17)), it is clear that With the MLWFs distributed among MPI processes 0 is redundant and contains twice as many pairs as L according to the ORBITAL scheme, we now explain how needed. the exx module exploits the sparsity of the MLWFs, and Before discussing the procedure used to determine , we L utilizes system-size independent subdomains of Ω during first demonstrate that exploiting such redundancy within the computation of all EXX-related quantities. To accom- the more parallelizable ORBITAL data distribution scheme plish this goal, we will first describe the construction of leads to the requirement for two interprocess communi- the so-called unique MLWF-pair list, , which not only L cation events per unique MLWF pair. To see this more contains the relevant set of overlapping MLWF pairs, but clearly, one only needs to consider a minimalistic system also determines how the computational workload associ- which contains a single ij -pair of overlapping MLWFs. h i ated with these pairs is distributed among the pool of Throughout this example, spatial arguments will be sup- available MPI processes. This is followed by a detailed pressed (e.g., φei will be used instead of φei(r)), since all description of the set of “proto-subdomains” employed in computation and communication events will be performed the exx module, which represent computationally effi- using system-size independent subdomains (vide infra). cient alternatives to the formal Ωi and Ωij subdomains With φei located on Pi and φej located on Pj, we will first introduced above in Sec. II C. consider what happens if the inherent pair redundancy is Construction of the MLWF-Pair List not exploited. In this case, φei (φej) is first communicated to Pj (Pi) for a total of two interprocess communication To exploit the first level of computational savings, which events. At this point, each MPI process constructs the originates from the fact that MLWFs are exponentially corresponding MLWF-product density, ρeij = ρeji, and localized and only overlap with a limited number of neigh- proceeds to compute veij = veji by solving two equivalent bors, two MLWFs, φei(r) and φej(r), are considered an PEs. Since the solution of the PE is the dominant compu- overlapping pair if Cei Cej < Rpair. A judicious choice tational step in our EXX algorithm, this will count for a | − | v = v for Rpair is required for accurately calculating all EXX- total of two computation events. With eij eji available related quantities, and an analysis of the convergence of on both Pi and Pj, each process is now in a position to ij ji E via Exx with respect to Rpair will be provided in Sec. IV A 1. compute the - and -pair contributions to xx h i hij i ji Eq. (15), as well as D = vijφj and D = vjiφi via At the current point in the algorithm, each φei(r) exx e e exx e e i ORBITAL Eq. (18). As depicted in Fig. 2, the Dexx are needed in is stored according to the data distribution { } scheme on one (or more) MPI processes (depending on the ORBITAL data distribution scheme before these quan- the value of ζ employed during runtime, see Sec. III C 1). tities are finally redistributed back to the GRID scheme For simplicity, we will discuss the ζ = 1 case first, in to ensure compatibility with the other modules in QE ij which there is only one MPI process per MLWF, and (see Fig. 3). As such, the local evaluation of Dexx on Pi ji each process, Pi, is assigned a unique MLWF, φei(r). As and Dexx on Pj directly provides these quantities in the such, Pi lacks direct access to φej(r) for j = i, which is requisite ORBITAL data distribution scheme without the 6 required for the construction of ρeij(r) and the subsequent need for any additional communication. Hence, the total computation of veij(r) for evaluating the ij -pair contri- cost per unique MLWF pair amounts to two units of com- ij ji h i munication followed by two units of computation, when bution to Exx, Dexx(r), and Dexx(r). Although ρii(r) can e pair redundancy is not exploited. be constructed locally on Pi to evaluate the ii -pair (self h i pair) contribution to each of these quantities, all of the Since the removal of all MLWF-pair redundancy is cru- other ij -pair contributions will require communication cial for minimizing the total number of computational h i between MPI processes. events (and hence the overall time to solution), we now To design an MLWF-based EXX algorithm that consider the case where this inherent pair redundancy achieves a minimal time to solution while efficiently utiliz- is exploited. In this case, only φej would be sent to Pi 13 with an associated cost of one unit of communication, tains i (self pair) and will retain this non-redundant index and Pi will therefore be solely responsible for computing throughout the refinement of 0 to in Algorithm 1. L L all EXX-related quantities. Since the ij - and ji -pair While there are still redundant pairs in 0, this algorithm h i h i L contributions to Exx are equivalent, these quantities can will consecutively sweep over MLWFs to locate redundant be computed on Pi via a single computation event (i.e., pairs such as ij and ji ; in our approach, this is tan- h i h i the solution to the corresponding PE), and then sent to tamount to finding both j 0[i] and i 0[j]. Once ∈ L ∈ L any other process with minimal communication (i.e., one located, the algorithm eliminates this redundancy from double-precision number for each pair contribution to 0 by removing the index i from 0[j]. At the end of ij ji L L Exx). Although D = D , this also poses no problem as these sweeps, all of the redundancies in 0 are removed, exx exx L 6 ij and we are left with , the unique MLWF-pair list. This Pi has direct access to φei and φej, and hence both Dexx and L ji list contains the minimum number of computational tasks De can be computed locally. With the requirement that xx required to evaluate all EXX-related quantities, and dic- the De i are stored in the ORBITAL data distribution { xx} tates how this computational workload will be distributed scheme, this will incur an additional communication event ji among the pool of available MPI processes. By virtue of as Dexx is shipped back to Pj. Hence, exploiting the inher- the ORBITAL data distribution scheme, also encodes the L ent MLWF-pair redundancy reduces the computational communication protocol that will be followed throughout workload by half (as expected), but it does not change the remainder of the exx module (see Sec. III C 3). With the requirement for two communication events per unique ji ζ = 1, this amounts to sending φej Pi and De Pj MLWF pair. → xx → for each unique ij pair, as depicted in the right panel During this non-redundant evaluation of the ij - and h i h i of Fig. 4. ji -pair contributions, the fact that Pj was idle while h i Pi performed all of the required computations creates Algorithm 1 Refinement of 0 to L L an imbalance in the computational workload assigned to any_removal ← TRUE each MPI process. With the freedom to assign the compu- while any_removal do tational workload associated with the ij -pair to either h i any_removal ← FALSE Pi or Pj, the exx module is now tasked with determining for i = 1,No do how the total computational workload will be distributed for j 6= i ∈ L0[i] do among the pool of available MPI processes. Armed with if i ∈ L0[j] then knowledge of the total number of non-unique MLWF pairs L0[j] ← L0[j] \{i} in the system (via 0) as well as the use of system-size any_removal ← TRUE L independent subdomains to regularize the computational break cost associated with the solution to each PE (vide infra), end if end for the process for doing so involves a static load-balancing end for algorithm which seeks to minimize the overall time to end while solution by reducing the imbalances present in the compu- L ← L0 tational workload, and hence the number of idle processes. Although it is certainly possible in the current version of the algorithm, we chose not to involve a third pro- By construction, static load-balancing algorithms (such cess, Pk, in the evaluation of the ij -pair contribution, as Algorithm 1) yield fairly well-balanced workload dis- h i as this would introduce two additional communication tributions by mitigating potential imbalances during the ij refinement of 0 to . Here, we note that the distance- events, i.e., φei to Pk and Dexx back to Pi (in addition L L ji based sorting of the indices in each row of 0 is crucial to φej to Pk and Dexx back to Pj). In this regard, the L for avoiding severe workload imbalances due to sequen- Dij P local computation of exx on i not only avoids additional tial index ordering. In this regard, an equally effective unnecessary communication events, but also allows for load-balancing algorithm would be possible by performing reduced storage requirements as this quantity can be cu- random sweeps over row indices (and completely avoid- j mulatively incremented (over multiple ) within a single ing the initial distance-based sorting procedure). Since Ω array corresponding to i. the number of overlapping pairs per MLWF will often This static load-balancing algorithm can be represented change throughout an AIMD simulation, this static load- by the so-called unique MLWF-pair list, , the construc- L balancing algorithm is performed during each MD step tion of which is described in the left panel of Fig. 4 (for in an attempt to determine an optimal workload balance. the illustrative case of a single water molecule) as well For a detailed discussion regarding the performance of this as Algorithm 1 (for the general case). We start with static load-balancing algorithm during CPMD simulations the 0 array, which contains all possible permutations of L of liquid water, as well as future possible improvements overlapping MLWF pairs, i.e., 0[i] (the i-th row of 0) is L L of this approach, see Sec. IV B 1. populated with a list of indices, j , corresponding to all { } When computational resources are limited, the exx φej that overlap with φei. For each i, the indices j 0[i] module can utilize less MPI processes during runtime (i.e., ∈ L are sorted into ascending order based on their vicinity to ζ < 1). In this case, multiple MLWFs are contiguously as- φei via Cei Cej . By construction, each 0[i] also con- signed to each process, and a balanced distribution of the | − | L 14

SWEEP01 SWEEP02 [1]: 1243 1243 1243 12 4

[2]: 2431 2431 24-- 24-

[3]: 3241 3241 32-1 321

[4]: 4231 4231 4-31 432-

FIG. 4. Graphical depiction of the unique MLWF-pair list construction process and corresponding MLWF communication scheme in the exx module. For simplicity, we will again consider a single water molecule with No = 4 MLWFs (φei) and a pool of Nproc = 4 MPI processes (Pi), i.e., ζ = 1. Starting with the non-unique MLWF-pair list, L0, which contains all possible permutations of overlapping MLWF pairs, the step-by-step procedure employed to transform L0 into the unique MLWF-pair list, L, is depicted in the left panel. Since all of the MLWFs are mutually overlapping in a single water molecule, L0[i] contains {j}, the indices corresponding to all MLWFs (including j = i), which have been sorted according to |Cei − Cej |. During the process of reducing L0 to L, i is always selected (bold black font) to remain in L0[i], while the remaining non-unique indices are shown in gray. In the first sweep (SWEEP01) of Algorithm 1, the next element j ∈ L0[i] is selected to remain in L0[i], while the now redundant index i is removed (red slash) from L0[j]. During the first sweep in this example, the indices 2, 4, 2, 3 were selected to remain in L0[1], L0[2], L0[3], L0[4], while the corresponding redundant indices, 1, 2, 3, 4, were removed from L0[2], L0[4], L0[2], L0[3]. This process is repeated until all of the MLWF-pair redundancy is removed from L0, upon which one is left with the final L. For a single water molecule, only two sweeps are required to reach this stage; at that point, all of the unique MLWF pairs have been assigned to a given L[i], and no redundant indices remain. With each MLWF stored according to the ORBITAL data distribution scheme (in which φei is assigned to Pi), the final L determines how the computational workload will be distributed among the pool of available MPI processes. Even in this simple example, there exists a mismatch in the number of pairs assigned to each process, with three MLWF pairs assigned to P1 and P3, and only two MLWF pairs assigned to P2 and P4. Such discrepancies are expected (even for the most homogeneous systems) and lead to an imbalance in the computational workload. By virtue of the ORBITAL data distribution scheme, L also determines the corresponding MLWF communication protocol, which is depicted in the right panel for the single water molecule. After a given φej is directly communicated to Pi, this MPI process forms ρeij , solves the corresponding PE for veij (depicted by the blue dashed line), and computes the hiji-pair ij ji contribution to Exx, Dexx, and Dexx (which is sent back to Pj ). As an example, consider L[2], which contains two indices (2, 4). Since P2 already holds φe2, ρe22 (required for computing ve22) can be constructed locally without the need for any interprocess 42 communication. To construct ρe24 (and hence ve24), φe4 is sent from P4 to P2 (solid arrow). After the corresponding Dexx is formed, it is shipped back to P4 (dashed arrow). Besides this straightforward dependency between receiving an MLWF and computing the corresponding EXX contributions to the wavefunction forces, the communication in the exx module has no discernible time axis in the figure above. workload (within the framework defined by Algorithm 1) This allows for a more computationally efficient means to is more likely when Nproc is a divisor of No; as such, this performing an EXX calculation, albeit with a longer time is the current recommended setting whenever applicable to solution. (see Sec. III C 1). In the isolated water molecule example With access to massively parallel resources (ζ > 1 N), ζ = 1/2 P ∈ in Fig. 4, the use of would start with 1 holding each φei is now replicated and stored in memory on the φ φ P φ φ e1 and e2, and 2 holding e3 and e4. After running Pi,Pi+No ,...,Pi+(ζ−1)No processes. After running Algo- Algorithm 1 to generate , the workload associated with rithm 1 to generate , the workload associated with a L L a given MLWF is mapped onto the process holding this given MLWF is split into ζ parts, each of which is as- orbital. This results in five units of computation assigned signed to one of the processes holding this orbital. For the to each MPI process: two self pairs, one local pair (in isolated water molecule, the use of ζ = 2 (ζ = 1) results in which both MLWFs are held on the same process), and processes assigned with 1 2 (2 3) computational tasks. − − two non-local pairs (in which one of the MLWFs is held This reduces the maximum number of computation events on a different process), e.g., P1 would be responsible for per process from 3 to 2, and hence lowers the overall time ii = 11 and ii = 22 (two self pairs), ij = 12 to solution. However, this gain comes at the expense of in- h i h i h i h i h i h i (one local pair), and ij = 14 and ij = 24 (two creasing the workload imbalance from 1/3 (i.e., processes h i h i h i h i non-local pairs). In this case, the workload is optimally with the lightest workload idling for 1/3 of the time) ≈ balanced and the maximum number of computation events to 1/2, and is therefore a less efficient use of the available per process is only 5/3 (instead of 2 ) larger than ζ = 1. computational resources. × × 15

Construction of Proto-Subdomains As discussed throughout this work, the efficient evalua- tion of veij(r) is the cornerstone of our MLWF-based EXX approach. To exploit the sparsity of the MLWFs and still retain a numerically exact evaluation of veij(r), this quan- tity is computed via the solution to the corresponding PE for all points in Ω that are contained in Ωij (see Eq. (21)). Since the PE is a boundary-value problem, the required boundary conditions are provided by the ME of ρeij(r) about Ceij on the thin shell of the real-space grid surround- ing Ωij (see Eqs. (22)–(23)). By computing veij(r) for all r Ωij, Exx can be computed in a numerically exact ∈ fashion (cf. Eq. (19)). However, the numerically exact evaluation of the EXX contribution to the wavefunction ij ji forces, De (r) and De (r), requires vij(r) for all r Ωj xx xx e ∈ and r Ωi, respectively (see Eq. (20)). Since Ωij Ωj ∈ ij ji ⊂ and Ωij Ωi, De (r) and De (r) are evaluated with FIG. 5. Graphical depiction of the proto-subdomains used in ⊂ xx xx vij(r) from the solution to the PE for all r Ωij. For the exx module. Dots are used to denote the MLWF centers, e ∈ r Ωj Ωij and r Ωi Ωij, vij(r) can be conveniently Cei, which are approximated by the closest points, Ci, on the ∈ \ ∈ \ e and accurately supplied by a sufficiently converged ME real-space grid, Ω. The dashed blue and red circles bound the two concentric spherical proto-subdomains, Θ(C ,R ) of ρeij(r). As such, the ME serves the dual purpose of 0 PE providing the necessary boundary conditions for the PE and Θ(C0,RME), which are assembled around C0 (purple star) with radii RPE and RME, respectively. Pair-exchange as well as the far-field vij(r) required for a numerically e interactions involving φ (r) include all overlapping MLWFs Dij ( ) Dji ( ) ei exact computation of both exx r and exx r . (yellow dots), whose centers are located within a distance, To exploit this second level of computational savings, Rpair, of Cei ≈ Ci (black dot) that is large enough to account which originates from the fact that a numerically exact for all φek(r) with Ωik 6= ∅. For the overlapping hiji pair, the evaluation of all EXX-related quantities can be restricted Θ(C0,RPE) and Θ(C0,RME) proto-subdomains are translated to real-space domains that are system-size independent across Ω via a rigid grid offset, τij , to form Θ(Cij ,RPE) and and significantly smaller than Ω, the exx module in- Θ(Cij ,RME), which are centered at Cij ≈ Ceij (purple square). troduces an alternative formulation of the Ωi and Ωij To evaluate the hiji contribution to all EXX-related quantities, subdomains described above in Sec. II C. To begin, we 2 the corresponding PE, ∇ veij (r) = −4πρeij (r), is solved for first note that subdomains like Ωi and Ωij = Ωi Ωj v (r) on the Θ(C ,R ) subdomain, which encompasses ∩ eij ij PE are formally defined as the points in Ω for which φei(r) Ωij , the support of ρeij (r) (shaded in dark green). Boundary conditions for the PE (as well as the far-field v (r)) are and ρeij(r) are non-negligible (i.e., larger than some pre- eij determined numerical cutoff). As such, both of these computed via a ME of ρeij (r) on the Θ(Cij ,RME)\Θ(Cij ,RPE) shell surrounding Θ(C ,R ). subdomains can have irregular and even disjoint shapes. ij PE However, this is a cumbersome and computationally de- manding definition that would require screening substan- tial sectors of Ω for each pair of MLWFs during every To efficiently utilize this concept of concentric spherical CPMD step. To combat this issue and still maintain a subdomains in the exx module, we assemble two fixed- numerically exact evaluation of all required quantities, size proto-subdomains, Θ(C0,RPE) and Θ(C0,RME), cen- one could simply utilize two concentric spherical subdo- tered around a predetermined origin, C0, which is cho- ij ij sen to be one of the grid points in Ω. When dealing mains per ij pair, i.e., Θ(Ceij,RPE) and Θ(Ceij,RME), h i ij with all computations involving a given ij pair, these which are spheres centered at Ceij with radii RPE and h i ij proto-subdomains are simply translated to Ceij, which RME chosen to be large enough to encompass Ωij and will be approximated (with no discernible error) by Cij, Ωi Ωj, respectively. In doing so, the corresponding PE, 2∪ the closest grid point in Ω (see Fig. 5 and Sec. III C 3). vij(r) = 4πρij(r), could then be solved without any ∇ e − e Since these fixed-size proto-subdomains will be used for Θ(C ,Rij ) domain truncation error on eij PE , which is signifi- all ij pairs, their radii should be chosen such that cantly smaller than Ω. Computing the ME of ρeij(r) on h i ij ij ij ij RPE = maxij RPE and RME = maxij RME . With the Θ(Ceij,R ) Θ(Ceij,R ) shell would again provide { } { } ME \ PE judicious choices for RPE and RME (see Secs. IV A 1– the necessary boundary conditions for the PE as well as ij IV A 2), these proto-subdomains allow for an accurate the far-field veij(r) needed for evaluating both Dexx(r) on evaluation of all EXX-related quantities, and have several ji Ωj and Dexx(r) on Ωi. Since both of these subdomains are algorithmic advantages that will be described below. ij ij ji contained in Θ(Ceij,RME), Dexx(r) and Dexx(r) can also With RPE and RME in hand, we now describe the be computed in a numerically exact fashion on a subset construction of these proto-subdomains around C0, an of points contained in Ω. arbitrary center that is coincident with a grid point in 16

Ω. In this work, the grid point closest to the center of stored in a double-precision array, Ω was chosen as the reference C0, since this allows us ( ) to avoid the use of both the minimum image convention rPE[q] q = 1,...,NPE r[q] = , (30) and wrap-around (periodic) boundary conditions during rME[q NPE] q = NPE + 1,...,NME grid point screening. Assembly of the proto-subdomains − begins by looping over grid points, r Ω, and determin- while the global grid indices are stored in an integer array, ∈ ing whether or not a given grid point is contained within ( 0 ) Θ(C0,RPE) or Θ(C0,RME) Θ(C0,RPE). For each grid 0 gPE[q] q = 1,...,NPE \ g [q] = 0 . (31) point contained in either proto-subdomain, we increment gME[q NPE] q = NPE + 1,...,NME the corresponding counter (q0 or q00) and store its rela- − tive (local) Cartesian coordinates (in rPE or rME) and By storing all of the subdomain data in this scheme, only 0 0 q (global) grid point indices (in gPE or gME), as depicted in a single local index, , is required for labeling the ele- Algorithm 2. ments in these arrays. This still maintains access to the Θ(C0,RPE) and Θ(C0,RME) proto-subdomains (as well as the Θ(C0,RME) Θ(C0,RPE) shell) through knowl- \ Algorithm 2 Proto-Subdomain Construction edge of NPE and NME, the number of elements in each q0 ← 0; q00 ← 0 proto-subdomain. As such, this scheme provides us with a foreach r ∈ Ω do compact representation for the sparse quantities required if |r − C0| ≤ RPE then in our EXX algorithm, as well as a convenient mapping q0 ← q0 + 1 between data stored in the proto-subdomain representa- 0 rPE[q ] ← r − C0 tion and the real-space grid (Ω) representation. This is 0 0 gPE[q ] ← NINT [Ngrid,ara/|La|] , a = 1, 2, 3 crucial for loading and off-loading data to and from Ω, as else if RPE < |r − C0| ≤ RME then it only requires communication of the relevant sectors of 00 00 q ← q + 1 Ω ρ (r) v (r) 00 for sparse quantities like eij and eij . rME[q ] ← r − C0 0 00 gME[q ] ← NINT [Ngrid,ara/|La|] , a = 1, 2, 3 end if 3. Step III: Communication of MLWFs end for 0 NPE ← q 0 00 NME ← q + q By virtue of the ORBITAL data distribution scheme, the unique MLWF-pair list, , not only determines the L computational workload associated with each MPI pro- cess, but also encodes the communication protocol that Incrementing these counters throughout the loop over will be followed throughout the remainder of the exx r Ω yields NPE and NME, the (fixed) number of points module (see Fig. 4). With a support that is significantly ∈ in Θ(C0,RPE) and Θ(C0,RME). By storing the relative smaller than Ω, the communication of any given MLWF Cartesian coordinates, r C0, we now have a set of local on the entire real-space grid is clearly neither efficient − coordinates that are invariant to rigid translations of nor necessary in our EXX algorithm. To reduce the com- Θ(C0,RPE) and Θ(C0,RME), thereby avoiding the need munication overhead associated with each overlapping to recompute these coordinates for every ij pair. This ij pair, the exx module employs the proto-subdomains h i h i also provides a convenient platform for precomputing a (Θ(C0,RPE) and Θ(C0,RME)) introduced in Sec. III C 2. number of quantities (e.g., r in spherical polar coordinates, As discussed above, these system-size-independent proto- the set of spherical harmonics, etc.) that are required subdomains provide a compact data representation for the during the ME of ρij(r) (cf. Eqs. (22)–(23)). For each storage and communication of sparse quantities such as e ij ji point in the proto-subdomains, we also store its global φei, ρeij, veij, and Dexx (or Dexx). To utilize Θ(C0,RPE) and grid point indices, which are given by three integer values, Θ(C0,RME) in practice, these proto-subdomins must be 0 0 0 (g1, g2, g3), representing the position of a given grid point translated across Ω to form the subdomains, Θ(Cij,RPE) along the cell (lattice) vectors, L1, L2, and L3 (which are and Θ(Cij,RME), required for evaluating all quantities aligned with the Cartesian directions for the orthorhombic associated with a given ij pair, as shown in Fig. 5. h i cells considered in this work). For an orthogonal grid, Before describing the translation of these proto- which has Ngrid,a equispaced grid points along each of the subdomains, we now discuss the employed convention La lattice vectors (with grid spacing δξa = La /Ngrid,a), used for Ceij, and remind the reader of the flexibility one |0 | the global grid index along La is given by ga = ra/δξa = has in defining this quantity for neutral charge distri- Ngrid,ara/ La . Since r is always coincident with a grid butions like ρij(r) (see Sec. II C). In the exx module, | | 0 e point in Ω, g is formally an array of integers; this is { a} Ceij is defined as the midpoint between the i-th and j-th enforced in a floating-point environment using the nearest MLWF centers, i.e., Ceij = (Cei + Cej)/2, which repre- integer function, NINT. sents an excellent approximation to the aforementioned This accumulated data is then concatenated to form gauge used in molecular quantum mechanics and an al- two 3 NME arrays as follows: the local coordinates are gorithmically convenient choice. By utilizing the MLWF × 17

ij centers, this definition for Ceij accounts for the spatial g , and φei(r) φei(r + Cij), with associated sizes (types) ≡ distribution of each MLWF through its first moment, of 3 NME (double-precision), 3 NME (integer), and × × and becomes equivalent to the conventional definition, 1 NME (double-precision), respectively. Here, we remind R R × Ceij = dr r ρij(r) / dr ρij(r) , for a number of dif- the reader that all of the data on the Θ(Cij,RPE) subdo- |e | |e | main and Θ(Cij,RME) Θ(Cij,RPE) shell are contained ferent symmetric cases (e.g., when both φei(r) and φej(r) \ have the same spread and are spherically symmetric with within Θ(Cij,RME), and can easily be accessed using the local grid indexing scheme outlined in Eqs. (30)–(31). As respect to Cei and Cej, when ρeij(r) is centrosymmetric with respect to the midpoint, etc.). Furthermore, this mentioned above, the local Cartesian coordinates stored in the r array are invariant to translations of the proto- C choice for eij recovers the correct center of charge for subdomains; as such, this information does not need to ρii(r), i.e., Ceij Ceii = Cei. Algorithmically speaking, e → be recomputed for each translated subdomain and can this convention for Ceij is also quite useful, as it only be broadcast across all processes. Communication of the requires knowledge of the MLWF centers, which are avail- MLWFs on these compact subdomains then proceeds ac- able throughout a CPMD simulation. cording to among the pool of available MPI processes. exx L As mentioned above, the module employs one Once φei(r) is received by a given process, ρeij(r) is assem- additional simplification when dealing with MLWF and bled on the Θ(Cij,RPE) subdomain and the exx module MLWF-pair centers: these quantities are approximated begins the process of solving the corresponding PE. by the closest grid points in Ω and denoted throughout by either Ci or Cij. This algorithmic simplification leads to i no appreciable error during evaluation of Exx and De , 4. Step IV: Solution of Poisson’s Equation { xx} and allows us to rigidly translate the proto-subdomains to the appropriate center, Cij, as needed. For an orthogonal Based on , each MPI process, Pi, now holds an L grid, the component of the required grid translation vector, assigned MLWF-product density ρij(r) as well as the ij e τ , along a given lattice vector, La, is given by: relevant quantities that map the Θ(Cij,RPE) and Θ(Cij,RME) subdomains onto Ω (i.e., NPE, NME, r , ij   ij { } τa = NINT Ngrid,a (Cij C0)a / La and g ). As such, Pi has all of the required in-  − | | { } (Cij C0) formation to compute vij(r) on the Θ(Cij,RPE) and = NINT − a . e (32) Θ(Cij,RME) Θ(Cij,RPE) subdomains. On Θ(Cij,RPE), δξa \ veij(r) is obtained via the solution of the PE in Eq. (21). ij On Θ(Cij,RME) Θ(Cij,RPE), vij(r) is obtained via the Application of τ to a given proto-subdomain leaves the \ e radius and local Cartesian coordinates untouched, and ME in Eqs. (22)–(23), which provides the appropriate simply offsets the global grid indices as follows: boundary conditions for the PE as well as the far-field potential. ij  0 ij  ρ ( ) ga [q] = MOD ga[q] + τa ,Ngrid,a , (33) While the ME of eij r (about Cij) can be straightfor- wardly computed using the local coordinates, r , the PE { } thereby resulting in a subdomain centered at Cij. The use requires a discrete representation of the Laplacian opera- of the remainder function, MOD, in Eq. (33) enforces the tor for computing numerical second derivatives on these appropriate wrap-around boundary conditions; as such, subdomains. Since the subdomains employed in the exx this equation is specific to the grid convention used in QE, module are coincident with the underlying real-space grid in which the grid points (along La) are numbered from (taken here to be orthogonal), the Laplacian operator can 0, 1,...,Ngrid,a 1. For codes that number these grid be expressed as a sum of second partial derivatives along − ∂2 1, 2,...,Ngrid,a 2 P points from , Eq. (33) should be modified each of the lattice directions, = a ∂ξ2 , in which ξa ij 0 ij ∇ a as follows: ga [q] = MOD[ ga[q] + τa 1,Ngrid,a ] + 1. − La La/ La With each MLWF stored according to the ORBITAL is a coordinate of b . At a given grid point, ξ ≡ | | f(ξ) data distribution scheme, the MPI process (ζ = 1) or 0, the second partial derivative of a function, , along La was discretized via the standard central-difference ζ > 1 φ (r) Ω 143 processes ( ) that are currently storing ei on approach: are responsible for sending this MLWF to another MPI 2 n process (or processes) according to the computation and ∂ f(ξ) X f(ξ0 + q δξaLba) = wq . (34) communication protocol outlined by . In order to do so, ∂ξ2 δξ2 L a ξ=ξ0 q=−n a φei(r) is off-loaded onto the appropriately translated sub- domains, Θ(Cij,RME), corresponding to the overlapping In this expression, the associated discretization error de- ij pairs that will be handled remotely (i.e., on other pends on the number, n, of (equispaced) neighboring grid h i processes); all of the information required to do so is points on each side of ξ0, and wq is the central-difference provided by local access to and Cij , as both of these coefficient for the q-th grid point. We note in passing L { } arrays have been broadcast to all processes. For each of that only w|q| is required due to the central symmetry these overlapping ij pairs, a sparse quantity like φei(r) (wq = w−q) of the equispaced finite-difference stencil. h i is now stored in the more compact Θ(Cij,RME) repre- Discretization of this derivative results in a (2n + 1)- sentation via the use of three relatively small arrays: r, point stencil along each grid direction, La; as such, the 18 discrete 3D Laplacian operator corresponds to a finite- 5. Step V: Computation of Energy and Forces difference stencil covering 3 2n + 1 = 6n + 1 grid × points. We note in passing that the choice of n = 3 (with P 143 After a process, i, arrives at the solution to the PE for corresponding central-difference coefficients given by one of its assigned pairs (i.e., for a given j [i]), this w0 = 49/18, w1 = +3/2 = w−1, w2 = 3/20 = w−2, ∈ L − − process now holds the corresponding MLWF-product po- and w3 = +1/90 = w−3) yields a second derivative with v (r) Θ(C ,R ) 6 tential eij on the entire ij ME subdomain. With an associated discretization error of δξ ; this choice O veij(r) on the Θ(Cij,RPE) subdomain (via the solution is the default option in the exx module as it yields a v ( ) Θ( ,R ) Θ( ,R ) 76,78 of the PE) and eij r on the Cij ME Cij PE well-converged value for Exx. \ shell (via the ME of ρeij(r)), Pi is now ready to evalu- ate the ij contribution to the EXX energy (Exx) and With this discrete representation of the Laplacian, we h i ij ji 2 wavefunction forces (D (r) and D (r)). can express the PE in Eq. (21), vij(r) = 4πρij(r), exx exx ∇ e − e as the following set of sparse linear equations on the The evaluation of Exx is quite straightforward via Θ(Cij,RPE) subdomain: Eq. (19). Here, we remind the reader that a numerically exact evaluation of this quantity only requires integration on the Θ(Cij,RPE) subdomain; this integration over the ρ (r) OpenMP 2 b  support of eij is also parallelized over threads ∇ vij = 4π ρij ρ . (35) PE e − e − eij and is therefore quite computationally efficient. Partial summations over the assigned ij pairs on each process h i are accumulated to form Exx with minimal associated 2 i.e. Exx In this expression, ∇ is a sparse NPE NPE matrix con- communication ( , one double-precision number for PE × taining the discretized Laplacian (whose stencil coverage per MPI process). v (r) P has been restricted to Θ(Cij,RPE)), vij is a NPE 1 vec- With eij in hand, i is also in position to compute e × ij ji tor representing the (currently unknown) MLWF-product both Dexx(r) = veij(r)φej(r) and Dexx(r) = veij(r)φei(r) on ji potential, and ρij is a NPE 1 vector containing the the entire Θ(Cij,RME) subdomain. For each D (r) com- e × exx MLWF-product density. The final term on the right- puted on Pi, this quantity is shipped back to Pj (assuming b 2 2 hand side, ρij (1/4π)(∇ ∇PE)vij, is the so-called ζ = 1 here for simplicity), which requires communica- e ≡ − − e boundary charge, which accounts for the part(s) of the ji tion of NME double-precision numbers for each Dexx(r); Laplacian stencil that extend outside of Θ(Cij,RPE) (and this array is equivalent in size to φej(r) and represents into the Θ(Cij,RME) Θ(Cij,RPE) shell) and have been 2 \ the necessary second communication event described in truncated in the ∇PE representation of this operator. In Sec. III C 2. Since Pi has access to vij(r) j [i], this doing so, ρb accounts for the Dirichlet boundary condi- e ∀ ∈ L eij Di (r) = P Dij (r) tions provided by the ME form of the potential on the process accumulates exx j exx into a local temporary array that is the size of the global real-space Θ(Cij,RME) Θ(Cij,RPE) shell surrounding Θ(Cij,RPE) \ ij (see Eq. (22)), and therefore allows for a numerically exact grid; as Pi receives a given Dexx(r) array from Pj, this solution of the PE using ∇2 , a Laplacian whose stencil quantity is also accumulated into this local temporary ar- PE ij coverage has been restricted to the Θ(Cij,RPE) subdo- ray. When all Dexx(r) contributions are accounted for, this i main. This restricts the PE to the support of ρeij and temporary array on Pi holds the final Dexx(r) according substantially reduces the dimensionality and associated to the ORBITAL data distribution scheme. computational cost of solving the PE for each overlapping MLWF pair. 6. Step VI: Redistribution of Wavefunction Forces The system of sparse linear equations in Eq. (35) is then solved (for veij) using an iterative conjugate-gradient (CG) approach. Since the solution of the PE is notoriously At this stage, all of the EXX-related quantities have E difficult to parallelize efficiently over MPI tasks, the CG- been evaluated; xx has been accumulated and broadcast to all processes, while De i (r) is stored in the ORBITAL based PE solver in the exx module is largely parallelized { xx } over OpenMP threads to allow for an efficient real-space data distribution scheme. As such, the remaining task for i v the exx module is the transformation of Dexx(r) to the evaluation of eij. The efficient solution of the PE for { } each overlapping MLWF pair is the cornerstone of our GRID data distribution scheme for compliance with the MLWF-based EXX algorithm, and the performance of core functions in QE (see Sec. III B). This redistribution is essentially the reverse operation of the GRID ORBITAL the CG-based PE solver will be discussed in Sec. IV B 2. → During CPMD simulations, the number of CG iterations redistribution of the MLWFs described in Fig. 3 and required to solve a given PE can be substantially reduced Sec. III C 1. by using a polynomial extrapolation144 of the potential At this stage, the QE executable exits the exx module i from the previous CPMD steps as the initial guess. More with Exx and De (r) (in the GRID data distribution { xx } detailed considerations of this extrapolation scheme as scheme) as output. These EXX-related quantities are well as extensions to BOMD will be discussed in future then added to their semi-local exchange analogs with the work. appropriate EXX fraction, ax, given in Eq. (2). With the 19

EXX contribution to the wavefunction (MLWF) forces, ically, the computational cost associated with the CG 4/3 the CPMD equations of motion are now propagated for- solution of the PE in exx scales as (NPE ), with NPE 3 O ward via Eqs. (9)–(10). asymptotically growing as (R ). In the same breath, O PE the associated communication and memory footprint in exx scale as (NME), with NME asymptotically growing 3 O IV. ACCURACY AND PERFORMANCE as (R ). In addition, both computation and communi- O ME cation in exx scale as (Npair), with Npair asymptotically 3 O exx growing as (R ) (for homogeneous systems with con- During the implementation of the module, we O pair have introduced three parameters: Rpair, RPE, and RME stant densities). As such, judicious choices for Rpair, RPE, (see Sec. III C 2). Rpair is used to determine whether or and RME are required to balance the performance and not two MLWFs, φei(r) and φej(r), are considered to be accuracy of our algorithm (see Secs. IV A 1–IV A 2). Since the number of self pairs (with i = j) is No and an overlapping ij pair via Cei Cej < Rpair. For all h i | − | the number of non-self pairs (with i = j) is (n 1)No overlapping ij pairs, RPE and RME are the radii of 6 e − h i (see Sec. II C), the computational cost associated with the spherical Θ(Cij,RPE) and Θ(Cij,RME) subdomains, evaluating the contribution to Exx from non-self pairs will which are centered at Cij and chosen to cover the product dominate that from self pairs when performing MLWF- ρ (r) φ (r) φ (r) density, eij , and individual orbitals, ei and ej , based EXX calculations. However, the energetic contri- respectively. In order to efficiently perform large-scale bution to Exx from the self pairs tends to dominate the hybrid DFT based AIMD simulations, judicious choices contribution from non-self pairs, which is primarily due to R R R for pair, PE, and ME are required to balance the the fact that the overlap between an MLWF and itself is performance and accuracy, and are therefore the focus substantially larger than the overlap between an MLWF of this section. In Sec. IV A, we introduce a systematic and any one of its neighbors. In the reference calculation selection of these parameters based on user-defined er- i described above, for example, the energetic contribution ror thresholds for Exx and De (r) . In Sec. IV B, we ref xx to Exx is dominated ( 85%) by the self pairs while { } ≈ ref discuss the intranode (OpenMP) and internode (MPI) par- the computational cost to evaluate E ( 87%) mainly xx ≈ allel scaling performance when simulating liquid water originates from the non-self pairs. Taken together, these using our MLWF-based EXX approach on several different observations suggest that we can further balance the per- supercomputer architectures. formance and accuracy of our algorithm by employing s a larger value of RPE for the self pairs (RPE, which de- s fines the Θ(C0,RPE) proto-subdomain) and a smaller A. Parameters and Convergence Criteria ns value of RPE for the non-self pairs (RPE, which defines ns s the Θ(C0,RPE) proto-subdomain). Since RPE will in ns In this section, a systematic determination of all re- general be larger than RPE, we will employ the same quired parameters will be demonstrated using a snapshot strategy by using a larger value of RME for the self pairs s from a liquid water simulation containing (H2O)64 in a (RME) and a smaller value of RME for the non-self pairs ns s ns cubic cell with L = 23.52 Bohr. We begin by performing a (RME), which define the Θ(C0,RME) and Θ(C0,RME) reference single-point energy evaluation in QE at the PBE0 proto-subdomains, respectively. Physically speaking, the level with a planewave kinetic energy cutoff of 85 Ry. In choice to use larger (smaller) RPE and RME values for the ref this reference calculation (which yields Exx ), we use the self (non-self) pairs is also justified by the fact that: (i) largest possible values for all three parameters such that: self MLWF-product densities (ρeii(r)) have a larger sup- (i) all radially overlapping MLWF pairs are included with port than non-self MLWF-product densities (ρeij(r)) due Cei Cej < Rpair = L/2 = 11.76 Bohr and (ii) both proto- to the increased overlap between an MLWF and itself, and | − | subdomains (Θ(C0,RPE) and Θ(C0,RME)) are contained (ii) self MLWF-product potentials (veii(r)) are generally within the simulation cell (i.e., RME = L/2 = 11.76 Bohr longer-ranged than non-self MLWF-product potentials and RPE = L/2 n maxa δξa = 11.11 Bohr). Subtrac- (veij(r)) due to the absence of a monopolar contribution − { } tion of n maxa δξa (with n = 3) from RPE allows us in the non-self cases (see Sec. II C). { } to retain a thin shell of the real-space grid surrounding Θ(C0,RPE); this shell provides the boundary conditions for the PE and accommodates the part(s) of the dis- 1. Convergence of the EXX Contribution to the Energy cretized Laplacian stencil that extend beyond Θ(C0,RPE). We note in passing that the energetic contributions As an initial test, Exx will be computed using the con- ref ref to E from MLWF pairs with Cei Cej > Rpair = verged MLWFs obtained during the calculation of E . xx | − | xx L/2 = 11.76 Bohr (within the minimum image conven- Since the effects of self-consistency are neglected in this tion) are completely negligible (i.e., 2.0 10−5 kcal/mol test (and will be investigated below), we can quickly assess −6 × ref or 4.8 10 %). the convergence of Exx with respect to Exx as a function ≈ × s ns As such, this reference calculation provides an approx- of Rpair, RPE, and RPE (without the need for modifying s ns imately exact evaluation of Exx using the exx module, RME and RME). We do so by independently varying each s ns albeit at an excessive computational cost. More specif- of the Rpair, RPE, and RPE parameters, while keeping 20 all other parameters fixed at their largest possible values 0.10 (see Fig. 6). From this figure, one can immediately see Rpair that the relative (percent) error in Exx rapidly decays as 0.08 Rs each of these parameter values is increased. This obser- (%) PE Rns ref vation can be justified by considering the fact that each xx PE

MLWF (in finite-gap systems) is exponentially localized, E 0.06 / and hence products of MLWFs (i.e., ρii(r) or ρij(r)) are ) e e s also exponentially localized. As such, increasing RPE (or xx 0.04 ns RPE) leads to spherical PE domains that increasingly E

ρ (r) ρ (r) − cover the exponentially decaying tails of eii (or eij ); 0.02 ref as seen in Eq. (19) (and the surrounding discussion), this xx

results in rapid convergence to the reference value for E ns ( 0.00 Exx. Since Exx converges more quickly with RPE (than Rs PE), this finding confirms our physical intuition that 2 4 6 8 10 12 Rs > Rns , and further justifies our use of separate self PE PE R (bohr) and non-self proto-subdomains as a means to improving the balance between performance and accuracy in this s algorithm. By increasing Rpair, the incremental contri- FIG. 6. Convergence of Exx as a function of Rpair, RPE, and ns bution to Exx from (more) distant MLWF pairs becomes RPE on a snapshot of liquid water containing (H2O)64 in a ρ (r) 0 r cubic cell with L = 23.52 Bohr. Relative errors (in %) with negligible as eij , which also results in rapid ref → ∀ respect to Exx are evaluated individually by varying Rpair convergence to the reference value for Exx. s ns (solid black line), RPE (solid blue line), and RPE (dashed blue Based on this initial convergence test, we have chosen line), while keeping all other parameters set to their maximum s ns Rpair = 8.0 Bohr, RPE = 6.0 Bohr, and RPE = 5.0 Bohr allowed values (see text for more details). An overall relative as the default parameter set in QE; for EXX-based simula- error of ≈ 0.02% corresponds to the default parameter values in s ns tions of liquid water, the overall relative error is 0.02%, QE (i.e., Rpair = 8.0 Bohr, R = 6.0 Bohr, R = 5.0 Bohr), ≈ PE PE which is essentially additive (in each of these parameters) and is depicted by the solid green line. and typically rather stringent when obtaining ground- state energies, binding/cohesive energetics, and ionic i vide infra Exx via the sparse evaluation of De (r) (see Eq. (18), forces in the condensed phase ( ). To increase { xx } the convergence of Exx, a general rule of thumb is to Eq. (20), and the surrounding discussion); in other words, s Rs Rns first increase RPE, since the self-pair contribution is the larger values for ME and ME lead to more accurate dominant contribution to Exx, yet requires evaluation Exx values (via the convergence of the orbitals which is of less terms (and is therefore significantly cheaper to self-consistently driven by De i (r) ), although this is { xx } compute) than the contribution from non-self pairs. In accompanied by a higher computational cost (as well as this example, one can further reduce the relative error in communication overhead and memory footprint) during Exx by an additional factor of two (i.e., to 0.01%) by the EXX calculation. To quantify this effect (and deter- s ≈ s ns simply increasing RPE from 6.0 Bohr to 7.0 Bohr, with mine the appropriate default values for RME and RME), negligible (< 1%) additional computational cost. we now study the convergence of De i (r) with respect { xx } In this convergence test, we have neglected the effects of to these parameters. orbital self-consistency when determining Exx. To quan- tify this effect, we first performed a fully self-consistent EXX calculation using the default parameter values for 2. Convergence of the EXX Contribution to the s ns Wavefunction Forces Rpair, RPE, and RPE determined above (while keeping s ns RME and RME set to the maximum reference value of i L/2 = 11.76 Bohr). In doing so, we found that the To study the convergence of Dexx(r) and determine s { ns } inclusion of orbital self-consistency leads to a negligible the default values for RME and RME, use of the reference (< 0.01%) variation in Exx, which indicates that: (i) there calculation performed above (in which all parameters were is excellent agreement between the PE and ME evalua- set to the largest possible values) is inconvenient as it s ns s ns tions of the far-field (beyond RPE and RPE) contribution lacks the flexibility to vary RME and RME (as these pa- i Rs Rns to the wavefunction forces (Dexx(r)), and (ii) the default rameters must be larger than PE and PE to provide value of Rpair is sufficient to capture all relevant overlap- the boundary conditions for the PE). Since the use of the s ns ping MLWF pairs in this system. In a non-self-consistent default parameter values for Rpair, RPE, and RPE (with s ns calculation, the ME only provides the boundary condi- RME and RME each set to the maximum reference value ref tions for the PE, and therefore has no direct effect on of L/2 = 11.76 Bohr) reproduces Exx with negligible s ns s ns i.e. 0.02% Exx, provided that RME (RME) is larger than RPE (RPE) error ( , to within ), we will base our convergence by the extent of the Laplacian stencil (i.e., n maxa δξa , tests on this as our new reference calculation. As an ini- { } tial test, De i (r) will be computed using the converged see Sec. IV A). In a self-consistent calculation, however, { xx } Rs and Rns govern the accuracy and cost of obtaining MLWFs and De i,ref (r) obtained during this new refer- ME ME { xx } 21 ence calculation. Since the effects of self-consistency are 0.8 neglected in this test (and will be investigated below), Rs i ME we can quickly assess the convergence of Dexx(r) with Rns i,ref {s } ns 0.6 ME respect to Dexx (r) as a function of RME and RME. { } s ns We do so by independently varying the RME and RME parameters, while keeping all other parameters (Rpair, s ns 0.4 RPE, and RPE) fixed at their default values (see Fig. 7). (%) Rs By employing this new reference calculation, ME can xx s now be varied from 7.65 Bohr (i.e., RPE + n maxa δξa) γ ns 0.2 to a maximum value of L/2 = 11.76 Bohr, and RME can ns be varied from 5.65 Bohr (i.e., RPE + n maxa δξa) to a maximum value of L/2 = 11.76 Bohr. To quantify the 0.0 i 1 error in Dexx(r) , we will utilize the relative L -norm for i { } each Dexx(r): 6 7 8 9 10 11 12 R (bohr) R i,ref i dr Dexx (r) Dexx(r) i − γxx , (36) i ≡ R i,ref FIG. 7. Convergence of {Dexx(r)} (via the γxx metric defined dr Dexx (r) s ns in Eqs. (36) and (37)) as a function of RME and RME on a snapshot of liquid water containing (H2O)64 in a cubic cell which can then be averaged over MLWFs to furnish the with L = 23.52 Bohr. Relative errors (in %) with respect to i,ref s following convergence metric: {Dexx (r)} are evaluated by varying RME (solid red line) and ns RME (dashed red line), while keeping all other parameters 1 X i (R , Rs , Rns ) set to their default values in (see text γxx = γ . (37) pair PE PE QE N xx o i for more details). An overall relative error of ≈ 0.2% (with a standard deviation of ±0.03%) corresponds to the default s ns From Fig. 7, one can again see that the relative (per- parameter values in QE (RME = 10.0 Bohr, RME = 7.0 Bohr), i and is depicted by the solid (dashed) green line. cent) error in De (r) rapidly decays as each of these { xx } parameter values is increased. Since γxx converges more ns s quickly with RME (than RME), this finding also confirms s ns our physical intuition that RME > RME, and further justi- fies our use of separate self and non-self proto-subdomains as a means to improving the balance between performance s ns RME = 10.0 Bohr and RME = 7.0 Bohr (in addition to and accuracy in this algorithm. As discussed above in s ns Rpair = 8.0 Bohr, R = 6.0 Bohr, and R = 5.0 Bohr) Secs. II C and IV A, this results from the fact that self PE PE as the default parameters used in QE. MLWF-product potentials (veii(r)) are generally longer- ranged than non-self MLWF-product potentials (vij(r)) e s due to the absence of a monopolar contribution in the As seen above for RPE, one can further reduce γxx by non-self cases. Since the cost of our algorithm (i.e., compu- an additional factor of two (i.e., to 0.1%) by increas- s ≈ tation, communication, and memory) is dominated by the ing RME from 10.0 Bohr to 11.0 Bohr, with negligible ns non-self contributions and scales cubically with RME, it is (< 1%) additional computational cost; for sufficiently s preferable to choose the smallest possible value for this pa- large simulation cells, increasing RME is therefore another rameter. The plots depicted in Fig. 7 clearly suggest that efficient way to improve the accuracy of the EXX calcu- ns s RME is converged for R & 7.0 Bohr and (the noticeably lation. Even with RME = 11.0 Bohr, however, one can s s slower) RME only begins to plateau for R & 10.0 Bohr. still observe a finite slope in the tail of the RME curve in When used in conjunction with the default parameter val- Fig. 7. This is an artifact of performing this convergence s ns ues for Rpair, RPE, and RPE determined above, parameter test on (H2O)64, and a stricter convergence of γxx with s ns s values of RME = 10.0 Bohr and RME = 7.0 Bohr lead to RME would be observed when using a larger simulation an overall relative error of γxx 0.2% for this snapshot of cell. To estimate the effect of this artifact on γxx, we ≈ 2 liquid water. To see how this γxx 0.2% error translates performed an exponential fit (with R > 0.9999) to the ≈ s into the final value of Exx, we again performed a fully self- RME curve, and found that the residual error in γxx (i.e., s consistent EXX calculation on this (H2O)64 snapshot; in that was caused by truncating RME to L/2 = 11.76 Bohr) doing so, we found that the use of these parameter values is 0.1%. Based on the self-consistency test described −7 ≈ leads to a completely negligible error ( 10 %) in Exx. above, we expect that this residual error in γxx would ≈ −7 We further note that a number of EXX-based CPMD only lead to a negligible ( 10 %) error in Exx for our ≈ simulations of solid and liquid aqueous systems have been (H2O)64 test system; when treating similarly sized (or performed by our group using these parameter values; in slightly smaller) systems, we recommend that users quan- s all cases, we have found that the appropriate constant of tify this truncation error and potentially set RME to the s motion was reasonably maintained. As such, we have set largest possible value (i.e., RME = L/2). 22

3. Transferability of the Default EXX Parameters certain hacks can be used to ameliorate the efficiency degradation in these cases, we have chosen to focus on To provide an alternative gauge for the 0.02% and 0.2% a comprehensive revision of our algorithm that will ex- i plicitly account for the MLWF-orbital (Ωi) and MLWF- error thresholds in Exx and De (r) , we also considered { xx } Ω the errors in the binding energy and ionic forces in this product ( ij) domains introduced in Sec. II C. Inspired by the work of Gygi and coworkers,79,80,91 we are cur- (H2O)64 snapshot by comparing the results obtained with s ns rently working on a significantly more efficient (vide infra) the default and reference (i.e., Rpair = RME = RME = s ns β-version of exx in which each MLWF will have a spread- L/2 = 11.76 Bohr and RPE = RPE = 11.11 Bohr, see dependent Ωi and each overlapping ij pair will have Sec. IV A) parameter sets. For the binding energy (per h i a overlap-dependent Ωij. As such, MLWFs with larger H2O molecule), we found an error of 0.04 kcal/mol us- ing the default parameters in exx, which is comparable spreads will automatically be treated more accurately to the typical pseudopotential error. For the 3N ionic without sacrificing computational efficiency for MLWFs force components (with N being the number of atoms), with smaller spreads. In addition, more distant MLWF the default parameter set leads to a mean absolute er- pairs will have smaller MLWF-product domains by con- −5 struction due to the overlap dependence in Ωij = Ωi Ωj. ror of 6.2 10 Ha/Bohr and a maximum absolute ∩ × −4 β exx error of 2.5 10 Ha/Bohr, which is approximately In doing so, we expect that the -version of will × half of the default convergence criteria used during struc- be a single EXX algorithm that is general enough to ac- tural/geometry optimizations in QE. Taken together, all curately and efficiently handle condensed-phase systems of these tests strongly indicate that the default parameter ranging from large-gap homogeneous systems like liquid values in exx are more than adequate when performing water (with a narrow distribution of MLWF spreads) to EXX calculations on systems like liquid water. small-gap heterogeneous systems like solvated semicon- We note in passing that our choice to treat self ( ii ) ducting nanoparticles (with a wide distribution of MLWF h i and overlapping non-self ( ij ) MLWF pairs differently spreads). When paired with an orbital localization scheme s h i ns that can appropriately treat small-to-vanishing band gap using the Θ(C0,R ) and Θ(C0,R ) proto-subdomains PE PE 113,114 β is only the first step towards exploiting the concept of systems, we expect that the -version of this al- variable-size subdomains during MLWF-based EXX cal- gorithm will also be an important step towards treating s ns large-scale metallic systems with screened and/or range- culations. By using a single set of RPE and RPE values, 49,62,132–136 this choice is particularly well-suited for condensed-phase separated exchange. systems characterized by a narrow distribution of MLWF spreads (e.g., liquid water, wherein each H2O molecule 4. Tight Convergence to the Electronic Ground State has a set of four similarly localized MLWFs). As such, we expect that the chosen default parameter values deter- mined above for bulk liquid water will yield similar errors While exx is designed to perform efficient AIMD sim- i ulations, this module can also be adapted to evaluate in Exx and Dexx(r) for systems with similarly large band { } precise ground state energetics (e.g., to within an uncer- gaps. For systems with smaller gaps (and hence more −8 diffuse MLWFs), one would need to use more stringent tainty of ∆E < 10 Ha), which are needed for property parameter values to obtain a similar level of accuracy; evaluations, numerical phonon calculations, etc. In or- as such, a series of test calculations (in analogy to those der to achieve tight convergence to the electronic ground s state using exx, the default convergence criteria used above) should be run to determine the optimal RPE and ns during the CG solution to the PE (exx_poisson_eps RPE values prior to performing large-scale production = 10−6 au) and the nested SODD optimization of the AIMD simulations. For systems with a wider distribution −8 of MLWF spreads (due to a smaller band gap and/or a Marzari-Vanderbilt functional (tolw = 10 au) must more heterogeneous environment), our current algorithm be tightened accordingly. Doing so minimizes the noise would be forced to sacrifice computational efficiency for in the force acting on φei(r) ( Dei(r) in Eq. (26)), and s ns { } { } accuracy (vide infra), since RPE and RPE (as well as reduces oscillatory behavior in the energy profile during s ns RME and RME) would need to be large enough to provide the SODD-based SCF procedure. sufficient cover for the most diffuse MLWF in the system (and would therefore be overkill for MLWFs with substan- tially smaller spreads). As pointed out by Dawson and B. Parallel Scaling and Performance Gygi,91 this issue is particularly important in small-gap heterogeneous condensed-phase systems, such as solvated As demonstrated above in Sec. IV A, judicious choices semiconducting nanoparticles and water-semiconductor for the five parameter values in exx allow one to evaluate interfaces. all EXX-related quantities with a high level of accuracy. Although the current implementation of exx would To enable large-scale EXX-based AIMD simulations using sacrifice efficiency when applied to such challenging cases, this approach, we employ a hybrid MPI/OpenMP paral- we would still argue that the algorithmic framework of lelization scheme that allows us to minimize the walltime exx is general enough to perform accurate hybrid DFT cost (i.e., time to solution) by exploiting both internode based CPMD simulations of finite-gap systems. While and intranode computational resources provided by mas- 23 sively parallel supercomputer architectures (see Sec. III). in step (ii) were used to equilibrate the intermolecular Here, we remind the reader that our massively parallel degrees of freedom in these systems, and the additional implementation of exx seamlessly distributes the major CPMD simulations performed in step (iii) were used to computational workload across thousands of MPI pro- ensure that the intramolecular degrees of freedom (i.e., cesses. Within each MPI process, the CG-based PE solver the OH bonds and HOH angles) were equilibrated at the is further parallelized over OpenMP threads. The scal- PBE0 level. Since the temperature of these systems will ing and performance of the internode (MPI, first level) rapidly increase once the rigid-molecule TIP4P2005 con- and intranode (OpenMP, second level) levels in our hybrid straint is lifted, this additional CPMD equilibration step is parallelization scheme were evaluated by performing large- important when determining a representative average for scale simulations of liquid water with exx on the IBM the walltime cost during EXX-based CPMD simulations Blue Gene/Q platform (Mira), and will be discussed below in the NVT ensemble. During the CPMD simulations in Secs. IV B 1 and IV B 2, respectively. In Sec. IV B 3, the (i.e., the 500 equilibration steps and subsequent 50 produc- computational performance of exx will also be considered tion steps), the temperature (of the ions) was controlled on the Cori Haswell and KNL architectures. using massive Nosé-Hoover thermostats, each with a chain length of 4.148,149 The nuclear and electronic degrees of freedom were integrated using the standard Verlet algo- MPI rithm and a time step of 2.0 au ( 0.05 fs); to ensure a 1. Internode Parallelization via ≈ clear adiabatic separation between the electronic and nu- clear degrees of freedom during the CP dynamics, we used For the first level of parallelization, the exx module a fictitious electronic mass of 100 au and the nuclear mass employs internode MPI communication to distribute the of deuterium for each hydrogen atom. Interactions be- computational workload associated with a given EXX tween the valence electrons and the ions (consisting of the calculation across (many) thousands of compute nodes. nuclei and their corresponding frozen-core electrons) were To critically assess the computational performance of treated using the Hamann-Schlüter-Chiang-Vanderbilt this parallelization level, which is at the very heart of (HSCV) type norm-conserving pseudopotentials150,151 dis- our massively parallel algorithm, we performed a strong- tributed with the Qbox package.152 The valence electronic scaling analysis (i.e., by varying the number of processing pseudo-wavefunctions were expanded in a planewave basis elements for a fixed problem size) and a weak-scaling set which includes planewaves with a kinetic energy up analysis (i.e., by varying the problem size for a fixed to 85 Ry. Mass preconditioning was applied to all Fourier ratio of problem size to number of processing elements). components of the electronic pseudo-wavefunctions with To investigate the strong and weak scaling of exx, we a kinetic energy above 25 Ry.140 To enable distributed performed a series of 12 different EXX-based CPMD storage of all real-space quantities according to the GRID simulations of liquid water, in which (i) the problem data distribution scheme in QE (see Fig. 3), the real-space (system) size was varied to include Nwater = 64, 128, 256 (and simple-cubic) grids utilized in these calculations water molecules (each with No = 4 Nwater MLWFs), × were partitioned into Nslab = 140, 176, 220 slabs along and (ii) the number of processing elements (Nproc MPI 1 the z-direction for (H2O)64, (H2O)128, and (H2O)256, re- processes) was varied via ζ = Nproc/No = /2, 1, 2, 4. In spectively. All computational timings were generated these calculations, we used one MPI process per node on using an in-house development version of QE (based on the Mira IBM Blue Gene/Q platform, and used all of v5.0.2).153 the 64 hyperthreads available per node for the intranode OpenMP parallelization. Computational timings for each of these 12 CPMD sim- Initial structures for (H2O)64, (H2O)128, and (H2O)256 ulations of liquid water at the hybrid PBE0 level on Mira were systematically generated using the following proce- are presented in Table I. In this table, all timings have dure: (i) randomly packing Nwater = 64, 128, 256 water been averaged over 50 CPMD steps and are reported (in molecules into simple cubic unit cells with lattice parame- s/step) for the following four QE modules: (i) the walltime 3 ters chosen to match the targeted density of 0.993 g/cm ; associated with the underlying GGA calculation ( tGGA ); h i (ii) equilibrating each of these randomly packed struc- (ii) the walltime associated with MLWF localization be- tures via MD simulations in the NVT ensemble us- tween each CPMD step via nested SODD optimization of 146 ing the TIP4P2005 force field at 300 K for 1.0 ns the Marzari-Vanderbilt functional ( tMLWF , see Eq. (29)); 147 h i in GROMACS; (iii) further equilibrating each of the (iii) the walltime spent in the exx module ( texx , see h i TIP4P2005 structures via CPMD simulations in the NVT Fig. 2); and (iv) the total walltime associated with a 58,142 145 ensemble using the PBE0 hybrid xc functional at given CPMD step ( tTotal ). To ensure a fair compar- h i 330 K for 250 fs (i.e., 500 CPMD steps). The com- ison between tGGA and texx , the underlying GGA ≈ h i h i putational timings reported herein are meant to reflect calculation utilized all MPI processes available to the exx the walltime spent during EXX-based CPMD simulations module. This was accomplished using an existing two-tier in the NVT ensemble, and were therefore averaged over parallelization scheme in QE, which allows for a com- 50 additional CPMD steps starting from the equilibrated putationally efficient execution of the fwdFFT/invFFT structures obtained using this three-step procedure. Here operation (i.e., a typical bottleneck during GGA-based we note that the TIP4P2005 MD simulations performed CPMD simulations). The first parallelization tier takes 24

TABLE I. Computational timings profile for CPMD simulations of liquid water at the hybrid PBE0 level on the Mira IBM Blue Gene/Q platform using the exx module in QE. Parameters include: (i) the system size, which was varied to include Nwater = 64, 128, 256 water molecules (each with No = 4 × Nwater MLWFs); (ii) the number of MPI processes (Nproc), which 1 was varied to cover ζ = Nproc/No = /2, 1, 2, 4; and (iii) the level of task-group parallelization in QE, which was varied to include Ntg = 1, 2, 4, 8, 16 task groups (see below and text for more details). All timings have been averaged over 50 CPMD steps, and are reported (in s/step) for the following QE modules: htGGAi, the walltime associated with the underlying GGA calculation; htMLWFi, the walltime associated with optimizing the Marzari-Vanderbilt functional (i.e., to localize the MLWFs between CPMD steps as shown in Eq. (29)); htexxi, the walltime spent in the exx module; and htTotali, the total walltime associated with a given CPMD step.145 To gauge the reproducibility of these timings, each CPMD simulation was run in triplicate; the observed fluctuations were always smaller than the precision reported (i.e., < 10−2 s) and were not included in the table. Also shown is the ratio, htexxi / htGGAi, demonstrating that the walltime required to perform EXX-based CPMD simulations with the exx module are approximately 1−3× that of the underlying GGA. All htexxi timings were further broken down into the walltime comp comm idle dedicated to computation events (htexx i), communication overhead (htexx i), and processor idling (htexx i); the corresponding comp comm idle fractions of the htexxi walltime (fexx , fexx , and fexx ) were reported as percentages (in %). All timings reflect the fact that one MPI process was executed per node on Mira, and all 16 physical cores (up to 64 hyperthreads) per node were used for the n intranode OpenMP parallelization. To utilize all available MPI processes during the underlying GGA calculation, Ntg = 2 ≥ 1 was set to the maximum possible value such that Ntg × Nslab ≤ Nproc (see text for more details).

Parameters QE Module Timings Breakdown of texx htexxi comp comp comm comm idle idle Nwater No Nproc ζ Ntg htGGAi htMLWFi htexxi htTotali ht i (f ) ht i (f ) ht i (f ) htGGAi exx exx exx exx exx exx 64 256 128 1/2 1 2.81 0.16 7.34 10.31 2.6 4.07 (55.4) 0.96 (13.1) 2.31 (31.5) 64 256 256 1 1 1.97 0.17 3.83 5.97 1.9 2.05 (53.4) 0.52 (13.5) 1.27 (33.1) 64 256 512 2 2 1.02 0.16 2.74 3.92 2.7 1.06 (38.9) 0.39 (14.3) 1.28 (46.8) 64 256 1024 4 4 0.63 0.16 1.70 2.49 2.7 0.54 (32.0) 0.37 (21.6) 0.79 (46.5) 128 512 256 1/2 1 5.19 1.43 8.27 14.89 1.6 4.60 (55.6) 1.21 (14.7) 2.46 (29.8) 128 512 512 1 2 2.64 0.43 4.35 7.42 1.7 2.35 (54.1) 0.64 (14.8) 1.36 (31.2) 128 512 1024 2 4 1.57 0.41 3.04 5.02 1.9 1.25 (41.0) 0.51 (16.9) 1.28 (42.1) 128 512 2048 4 8 0.96 0.41 1.96 3.33 2.0 0.67 (34.4) 0.48 (24.8) 0.80 (40.9) 256 1024 512 1/2 2 6.39 2.77 8.34 17.50 1.3 4.19 (50.2) 1.58 (18.9) 2.57 (30.8) 256 1024 1024 1 4 3.59 1.20 4.80 9.59 1.3 2.23 (46.5) 1.11 (23.2) 1.46 (30.4) 256 1024 2048 2 8 2.23 1.13 3.33 6.69 1.5 1.26 (38.0) 0.76 (22.9) 1.30 (39.1) 256 1024 4096 4 16 1.59 1.08 2.41 5.08 1.5 0.77 (31.9) 0.82 (34.2) 0.82 (33.9) advantage of the fact that the real-space grid has been the exx module is still the overall bottleneck during partitioned into Nslab slabs along the z-direction (with hybrid DFT based CPMD simulations. By exploiting each slab distributed to a particular MPI process); this al- both the natural sparsity of the exchange interaction lows one to split the 3D fwdFFT and invFFT operations and our massively parallel implementation, the walltime into 2D intra-slab FFT operations (which are executed cost to evaluate all EXX-related quantities is compara- in parallel without the need for additional communica- ble to that of the underlying GGA (i.e., texx / tGGA h i h i tion) and 1D inter-slab FFT operations (which are also is now within the range of 1–3). We further stress ≈ executed in parallel, but require communication among that this ratio steadily decreases with increasing system the pool of MPI processes). At the first tier, the under- size due to the more favorable scaling of the exx mod- lying GGA calculation can utilize up to Nproc = Nslab ule (see below). Since the exx module requires ML- MPI processes. To enable the use of Nproc > Nslab MPI WFs (and the underlying GGA does not), we now dis- processes (when available), the second parallelization tier cuss the additional cost needed to perform the nested (i.e., task-group parallelization) is employed to further SODD optimization of the Marzari-Vanderbilt functional distribute the independent 3D FFT operations associated ( tMLWF ) between each CPMD step. In all CPMD simula- n h i with the No orbitals. At the second tier, Ntg = 2 1 tions performed herein, this MLWF refinement procedure ≥ can be set to the maximum possible value such that (see Sec. III A) only required 3–4 SODD steps (on aver- Ntg Nslab Nproc, thereby enabling the underlying age) per CPMD step. As such, tMLWF only represents × ≤ h i GGA calculation to utilize up to Nproc = Ntg Nslab a minor contribution to tTotal for systems containing × h i MPI processes. We note in passing that the scalability of < 500 atoms (e.g., (H2O)64 and (H2O)128). For larger task-group parallelization depends on the communication systems (e.g., (H2O)256), tMLWF can become quite sub- h i bandwidth, and will often deteriorate when Ntg 4. By stantial ( 10 20% of tTotal ) as the MLWF procedure  ≈ − h i using this approach, the underlying GGA calculation is requires cubic-scaling matrix operations. As such, a more able to utilize the pool of available MPI processes, thereby efficient MLWF localization procedure (which takes ad- ensuring a reasonably fair comparison between the tGGA vantage of the sparsity of the MLWFs) will be required h i and texx timings. to efficiently utilize the exx module for system sizes that h i are significantly larger than (H2O)256. In the QE module timings in Table I, we observed that 25

16.0 ζ = 1/2 ζ = 2 ζ = 1 ζ = 4

8.0 (s / step)

i 4.0 exx t h 2.0

64 128 256 Nwater

FIG. 8. Strong-scaling analysis of the MPI internode par- FIG. 9. Weak-scaling analysis of the MPI internode par- allelization level in exx during CPMD simulations of liq- allelization level in exx during CPMD simulations of liquid uid water at the hybrid PBE0 level on the Mira IBM Blue water at the hybrid PBE0 level on the Mira IBM Blue Gene/Q Gene/Q platform. For a fixed system size (i.e., (H2O)64 platform. For a fixed ratio of system size to number of process- (red line), (H2O)128 (green line), and (H2O)256 (blue line)), ing elements (i.e., ζ = 1/2 (red line), ζ = 1 (green line), ζ = 2 the walltime spent in the exx module (htexxi in s/step, av- (blue line), and ζ = 4 (magenta line)), the walltime spent in eraged over 50 CPMD steps) is plotted versus the num- the exx module (htexxi in s/step, averaged over 50 CPMD ber of MPI processes (Nproc), which were varied to include steps) is plotted versus the system size, which was varied to 1 ζ = Nproc/No = /2, 1, 2, 4. For reference, ideal strong-scaling include Nwater = 64, 128, 256 water molecules. For reference, timings were plotted as dashed lines for each system size, and ideal weak-scaling timings were plotted as dashed lines for 1 were computed with respect to the corresponding ζref = /2 each ζ, and were computed with respect to the corresponding timings. Pie plots were used to illustrate the fraction of htexxi comp (H2O)64 timings. Pie plots were again used to illustrate the dedicated to computation events (fexx , colored), communi- comp comm idle fraction of htexxi dedicated to computation events (fexx , col- cation overhead (fexx , black), and processor idling (fexx , comm ored), communication overhead (fexx , black), and processor white). idle idling (fexx , white).

Based on the timings in Table I, we assessed the strong- processing elements. This was accomplished by consid- exx scaling behavior of the module by analyzing how ering (H2O)64, (H2O)128, and (H2O)256 for fixed values t 1 exx changes as the number of processing elements was of ζ = Nproc/No /2, 1, 2, 4 (see Fig. 9). For each ζ h i ∈ { } varied for a fixed problem size. This was accomplished value, we computed the weak-scaling efficiency via 1 by changing ζ = Nproc/No = /2, 1, 2, 4 when simulating (H2O)64, (H2O)128, and (H2O)256 (see Fig. 8). For each texx N ref texx weak h i water h iNwater=64 system size, we computed the strong-scaling efficiency via ηMPI (Nwater) = , (39) ≡ texx texx h iNwater h iNwater 1 ζref texx ζ 2 texx ζ=1/2 ref strong ref in which Nwater = 64 was chosen as the reference (or ηMPI (ζ) · h i = · h i , (38) ≡ ζ texx ζ ζ texx ζ baseline) system size (as this represents a realistic compu- · h i · h i t tational setup) and exx Nwater is the the walltime spent 1 h i in which ζref = /2 was chosen as the reference (or baseline) in the exx module for a given Nwater. When averaged weak ζ value (as this represents a realistic computational setup) over all four ζ values, we find that ηMPI decreases to and texx is the walltime spent in the exx module for 89% (Nwater = 128) and 81% (Nwater = 256) as the h iζ ≈ ≈ a given ζ. When averaged over all three systems, we find system size is increased. As shown in Fig. 9, the exx strong that ηMPI decreases to 93% (ζ = 1), 66% (ζ = 2), module is quite scalable as the system size is increased, ≈ ≈ and 50% (ζ = 4) as the number of processing elements and the time-to-solution can be kept (relatively) constant ≈ is increased (see below for a more detailed discussion). for systems as large as (H2O)256 provided that a consis- For even higher ζ, the number of MPI processes becomes tent (i.e., fixed ζ) amount of computational resources are comparable to the number of overlapping ij pairs; as available (see below for a more detailed discussion). strong h i such, ηMPI is expected to deteriorate even further for Despite the fact that the strong- and weak-scaling ζ 4. efficiencies of the exx module are not perfect, our al-  Based on the timings in Table I, we also assessed the gorithm is still able to furnish all EXX-related quanti- weak-scaling behavior of the exx module by analyzing ties in 2.4 s for the largest system considered herein, ≈ how texx changes as the problem (system) size was i.e., (H2O)256 with ζ = 4. As such, the exx module h i varied for a fixed ratio of problem size to number of in QE enables relatively long (e.g., 10–100 ps) CPMD 26 simulations for large-scale condensed-phase systems con- offloading all computation events (not necessarily limited sisting of 500 1000 atoms at the hybrid DFT level of to the solution of the PE) to graphical processing units − theory. Quite interestingly, the overall cost of the exx (GPUs), which provide significantly higher computational module in this case (see Table I and Figs. 8–9) can throughput than CPUs. be decomposed into roughly equal contributions from Communication Overhead. In addition to the computa- comp comp 1 computation (fexx texx / texx /3), commu- tion events described above, the communication overhead comm ≡comm h i h i1 ≈ strong nication (fexx texx / texx /3), and idling in Fig. 2 also contributes to the degradation of η and idle idle ≡ h 1 i h i ≈ MPI (f t / texx /3). This breakdown of texx weak exx ≡ h exxi h i ≈ h i ηMPI observed in Figs. 8–9. Here, this non-ideal scaling demonstrates that the exx algorithm is not computa- behavior mainly originates from: (i) the sending/receiving tion bound (as one might expect for the relatively large of MLWFs ( φei(r) ) in Step III (Communication of number of computation events required for hybrid DFT). { } MLWFs) and the sending/receiving of wavefunction forces As discussed below, there still remains significant room Di (r) Computation for algorithmic improvements which would combat the ( exx ) at the conclusion of Step V ( of{ Energy} and Forces ii relatively high cost associated with the communication ), as well as ( ) the redis- overhead and processor idling, both of which are currently tribution of φei(r) in Step I (Redistribution of { } i under development by our group and will be the topic MLWFs) and the redistribution of Dexx(r) in Step VI { } of future work. When combined with state-of-the-art (Redistribution of Wavefunction Forces). In preconditioners during the CG solution of the PE (which this regard, the former is more important for relatively are also under intense development by our group), the smaller systems (e.g., (H2O)64 and (H2O)128) employing computational cost of the current exx algorithm can be fewer processing elements (e.g., ζ = 1/2 and ζ = 1) due significantly sped up, which would further enable hybrid to the fact that each MPI process needs to send/receive DFT-based AIMD simulations of large-scale condensed- significantly more MLWFs and wavefunction forces (see phase systems across sufficiently longer timescales. Secs. III C 3 and III C 5). In the same breath, the latter dominates for larger systems (e.g., (H2O)256) employing Computation Events. When further breaking down more processing elements (e.g., ζ = 2 and ζ = 4) due the computation events in the exx module (see Fig. 2) comp to the ALL-TO-ALL communication events in Steps I that contribute to t , we find that the computa- h exx i and VI (see Fig. 3, and Secs. III C 1 and III C 6). As tional costs associated with Step IV (Solution of a result, these communication events lead to a notice- Poisson’s Equation) and Step V (Computation able upward tilt in the strong- and weak-scaling curves of Energy and Forces) scale nearly ideally with strong weak in Figs. 8–9, which is particularly evident in the large Nproc (e.g., η > 90%) and Nwater (e.g., η > MPI MPI system and ζ limit. To attack the communication over- 99%). However, the computational effort in Step II head associated with the sending/receiving of φei(r) (Construction of Pair List) required for deter- { } and De i (r) , we plan to implement an asynchronous mining the unique pair list (see Fig. 4 and Sec. III C 2) { xx } was implemented in serial (in the current version of the (non-blocking) communication protocol that will overlap exx module) and does not scale with Nproc. In addition, with the computation events in Steps IV–V. By doing so, the computational cost associated with Step II grows the serial communication–computation–communication quadratically with system size ( (N 2)); although this process in Steps III–V (i.e., communication of φei(r) , O o { } step is quite cheap for smaller system sizes (e.g., (H2O)64 followed by the solution of the PE for all overlapping and (H2O)128), this cost can become more substantial pairs and computation of the ij contribution to Exx i h i ij for larger systems (e.g., (H2O)256). As a result, Step II and De (r) , followed by communication of De (r) ) { xx } { xx } (in its current form) leads to some of the deterioration can be overlapped to effectively mask the communication strong weak (particularly for (H2O)256) seen in ηMPI and ηMPI (see overhead (see the right panel of Fig. 4). To attack the com- Figs. 8–9). In future versions of exx, we plan to miti- munication overhead associated with the redistribution of i gate this unnecessary computational cost by parallelizing φei(r) and Dexx(r) , we plan to exploit the locality of { } { } Step II over MPI processes and using a Verlet list (which the MLWFs by only performing the redistribution based will be updated periodically throughout a given CPMD on the compact supports of each MLWF. By doing so, this simulation) to avoid unnecessary consideration of distant algorithmic improvement has the potential to completely MLWF pairs; as such, these improvements will increase eliminate the unnecessary ALL-TO-ALL communication both the strong- and weak-scaling efficiencies of exx. Al- events in Steps I and VI, which would significantly re- though Step IV does scale nearly ideally with Nproc and duce the communication overhead in the exx module. Nwater, the computational cost associated with solving As an added bonus, this approach would also allow for a ij the PE for each overlapping ij pair still remains the more accurate evaluation of Dexx(r) on Ωj via Eq. (20), comph i { } dominant contribution to t . In the near future, we thereby eliminating any residual error in the wavefunction h exx i plan to significantly reduce this primary source of com- forces (see Fig. 7). In addition to the above strategies, putational effort by using more sophisticated guesses (for we also plan to port the exx module to parallel GPU veij(r)) in conjunction with novel preconditioners during architectures (e.g., with NVLink technology), which will the CG solution of the PE. An additional future direction allow us to exploit faster peer-to-peer connections and to overcome this computational hurdle might also involve further reduce the communication overhead. 27

Processor Idling. The last and most critical issue that 32 limits the strong- and weak-scaling efficiency in the exx Physical Cores (85 Ry) algorithm is processor idling due to workload imbalance. This imbalance mainly originates from: (i) the imperfect Physical Cores (150 Ry) distribution of overlapping ij pairs across the pool of Hyper-Threading (85 Ry) h i MPI processes (see Fig. 4 and Sec. III C 2), and (ii) the Hyper-Threading (150 Ry) variability in the number of CG steps required during the solution of the PE for each overlapping ij pair. In the h i current exx module, these issues are primarily due to the 16 static load-balancing algorithm described in Sec. III C 2, which assumes that the computational workload (i.e., the Speedup number of CG steps) associated with the solution of the PE is equivalent for each ij pair, and limits this com- 8 h i putation to the Pi or Pj MPI processes only (and not Pk for example). To attack the processor idling, we plan to 4 2 remove these limitations by employing a task-based load- 1 balancing algorithm which will account for the workload 1 2 4 8 16 imbalance using a dynamic scheduler, and has the flexibil- Ncore ity to assign (and even reassign) a given ij task to any h i available MPI process. In addition to the load-balancing FIG. 10. Strong-scaling analysis of the intranode par- algorithm, we also plan to implement more intelligent OpenMP v ( ) allelization level in exx during CPMD simulations of (H2O)64 initial guesses for eij r (to aid in the convergence of the at the hybrid PBE0 level on the Mira IBM Blue Gene/Q plat- CG solution to the PE) as well as the aforementioned form. For a fixed system and basis set size (i.e., (H2O)64 with a asynchronous communication protocol (which will overlap planewave kinetic energy cutoff of 85 Ry (blue line) and 150 Ry with the computational events in Step IV), and expect (red line)), the speedup in the walltime spent in Step IV: Solu- StepIV that both of these algorithmic improvements will also tion of Poisson’s Equation of Fig. 2 (htexx i, averaged over mitigate the processor idling in the exx module. The 50 CPMD steps) is plotted versus the number of physical cores, current exx module also faces challenges associated with which were varied to include Nthread = Ncore = 1, 2, 4, 8, 16. processor idling when there is an inherent workload im- Beyond the maximum number of physical cores (Ncore = 16) balance (in the number of overlapping pairs per MLWF) per node on Mira, the OpenMP intranode parallelization level due to the physical nature of the problem, i.e., the (in- can further utilize hyperthreading technology to access up to Nthread = 64 (hyper)threads (depicted by the blue and red herent and transient) heterogeneity present in systems triangles). For reference, the ideal strong-scaling performance containing interfaces as well as disordered systems (like (using up to Ncore = 16) was plotted as a dashed line, and liquids). To balance the workload in such heterogeneous normalized to unity for Ncore = 1. systems, one could employ a different parallelization level for each MLWF (i.e., ζ = ζ(i)), which would allow the exx module to dynamically adopt the number of process- ode communication pattern, we used one MPI process ing elements dedicated to a given MLWF based on its per node on the Mira IBM Blue Gene/Q architecture; number of overlapping ij pairs. h i since each node contains 16 physical cores, five different levels of OpenMP parallelization were assessed by varying Nthread 1, 2, 4, 8, 16 threads across Ncore = Nthread ∈ { } 2. Intranode Parallelization via OpenMP physical cores (see Fig. 10). For each Nthread value, we computed the strong-scaling efficiency via Within each MPI process, the exx module uses OpenMP tStep IV threading to further parallelize the following operations strong exx Nthread=1 ηOpenMP(Nthread) h Stepi IV , (40) for each overlapping ij pair: (i) the CG solution of ≡ Nthread texx h i · h iNthread the PE for the near-field veij(r) (Step IV in Fig. 2), (ii) the multipole expansion for the far-field veij(r) (Step IV), in which Nthread = Ncore = 1 was chosen as the reference and (iii) other (less computationally intensive) operations (or baseline) OpenMP setting and tStep IV is the h exx iNthread (e.g., proto-subdomain construction in Step II, φei(r) walltime spent in Step IV of the exx module for a given { } loading/off-loading in Step III, and Exx integration in Step Nthread. Here, we find that the computational costs as- V). To critically assess the strong-scaling performance of sociated with Step IV scale very well with Nthread (e.g., strong the intranode OpenMP parallelization (in analogy to the ηOpenMP = 84% when using all 16 physical cores) when internode MPI parallelization in Sec. IV B 1), we ana- using a typical constant-volume (NVT ) planewave ba- lyzed how tStep IV (the typical computational bottleneck sis setting (i.e., 85 Ry kinetic energy cutoff). With a h i in the exx module) changes as the number of OpenMP heavier workload (i.e., a typical constant-pressure (NpT ) threads (Nthread) was varied during a CPMD simulation planewave basis setting with a 150 Ry kinetic energy of (H2O)64 with ζ = 1. To maintain a consistent intern- cutoff), we find that the strong-scaling efficiency of the 28 exx module is significantly better and maintains a nearly V. CONCLUSIONS AND FUTURE OUTLOOK strong ideal efficiency of ηOpenMP = 92% when using all 16 physi- cal cores. We note in passing that hyperthreading each In this work, we presented a detailed discussion of the Mira physical core on into four logical cores yields an addi- theoretical framework, algorithmic implementation, and i.e. 30 40% tional boost ( , speedup) in the computational computational performance of a linear scaling approach exx− performance of . that exploits sparsity in the real-space evaluation of the EXX interaction in finite-gap condensed-phase systems by utilizing a localized (MLWF) representation of the oc- cupied orbitals. Our theoretical discussion focused on the integration of this approach into CPMD, and highlighted the central role played by veij(r)—the MLWF-product po- 3. Performance on Other Supercomputer Architectures tential obtained via the CG solution of Poisson’s equation for the corresponding MLWF-product density ρeij(r)— in the evaluation of the EXX energy and wavefunction To provide an additional assessment of the performance forces. We then provided a comprehensive description of the exx module, we performed an analogous compu- of the exx algorithm, which has been implemented in tational analysis on the Cori Haswell and KNL super- the CP module of the open-source QE package, and em- computer architectures housed at the National Energy ploys a hybrid MPI/OpenMP parallelization scheme to Research Scientific Computing Center (NERSC). For sim- efficiently utilize the HPC resources available on current- plicity, we considered the (H2O)128 test case described and next-generation supercomputer architectures. This above, and limited our analysis to the most common was followed by a critical assessment of the accuracy and ζ = 1 case (in which Nproc = No = 512). For the MPI parallel performance (e.g., strong and weak scaling) of parallelization level, we employed one MPI process per this approach when performing large-scale AIMD simu- node on the Cori Haswell and KNL architectures. For the lations of liquid water in the canonical (NVT ) ensemble. OpenMP parallelization level, we used all physical cores on With access to HPC resources, we demonstrated that exx each node (i.e., 24 and 68 for the Haswell and KNL archi- enables us to compute the EXX contribution to the energy tectures, respectively). When specified, we fully activated and wavefunction forces for (H2O)256, a condensed-phase system containing 750 atoms, in just under 2.4 s. With hyperthreading on each physical core, which corresponds ≈ to a maximum total of 48 and 272 OpenMP threads for a walltime cost that is comparable to semi-local DFT, each Haswell and KNL node, respectively. As shown the exx module takes us one step closer to routinely in Table II, the exx module behaves quite consistently performing AIMD simulations of complex and large-scale across all three architectures considered (i.e., Mira IBM condensed-phase systems for sufficiently long timescales BlueGene/Q, Cori Haswell, and Cori KNL). Here, we first at the hybrid DFT level of theory. not that texx / tGGA is fairly constant and fluctuates In its current form, the exx module can also be used h i h i between 1.5 1.7; as such, the exx module enables hybrid for high-throughput applications such as the generation of − DFT calculations with a walltime cost that is comparable high quality AIMD data for training, developing, and test- to semi-local DFT on all three architectures. We further ing next-generation machine-learning and neural-network 123,154–156 note that the fractional breakdown of texx into computa- based force fields. Despite the favorable scala- h i tion, communication, and processor idling is also similar; bility and computational performance of exx, however, as such, our comprehensive three-pronged strategy (vide we found that this algorithm is not computation-bound supra) to attack each of these contributions is likely to for larger systems (such as (H2O)256)) with an overall lead to a robust exx module with significantly improved walltime that can be roughly split into three equivalent performance across a wide array of HPC architectures. contributions: computation events, communication over- When comparing texx across these architectures, we head, and processor idling (due to workload imbalance). h i find that Cori Haswell (with 512 24 = 12, 288 phys- As such, there still exits significant room for improving × ical cores) has a faster turnaround (by 2.6 ) than the performance of the exx module, and we are cur- ≈ × Mira IBM BlueGene/Q (with 512 16 = 8, 192 phys- rently in the process of implementing a comprehensive × ical cores), while Cori KNL (with 512 68 = 34, 816 three-pronged strategy that will attack each of these con- × physical cores) is noticeably slower (by 0.67 ). We tributions and significantly reduce the overall walltime ≈ × 79,80,91 note in passing that hyperthreading introduces noticeable cost. Inspired by the work of Gygi and coworkers, performance improvements on the IBM BlueGene/Q and we are also implementing an MLWF-specific domain strat- Haswell architectures, but leads to a significant decrease egy (as outlined in Secs. II C and IV A 3) that will allow in the performance of exx on the KNL architecture. For the β-version of exx to perform accurate and efficient instance, deactivating the hyperthreading option on KNL hybrid DFT simulations of condensed-phase systems rang- leads to an 80% speedup in texx and an 20% slow- ing from large-gap homogeneous systems like liquid wa- ≈ h i ≈ down in tGGA ; in doing so, texx / tGGA = 0.7 and ter (with a narrow distribution of MLWF spreads) to h i h i h i the EXX calculation is now faster than the corresponding small-gap heterogeneous systems like solvated semicon- GGA calculation. ducting nanoparticles (with a wide distribution of MLWF 29

TABLE II. Computational timings profile for CPMD simulations of liquid water at the hybrid PBE0 level on the Mira IBM Blue Gene/Q, Cori Haswell, and Cori KNL platforms using the exx module in QE. All timings (in s/step) correspond to an average over 50 CPMD steps for (H2O)128 with ζ = 1 and Ntg = 2 (see Table I and the text for more details). All timings reflect the fact that one MPI process was executed per node on the given architecture, and all available physical cores per node were used for the intranode OpenMP parallelization; unless otherwise specified, hyperthreading was fully activated on each physical core.

Architecture QE Module Timings Breakdown of texx htexxi comp comp comm comm idle idle Machine CPU htGGAi htMLWFi htexxi htTotali ht i (f ) ht i (f ) ht i (f ) htGGAi exx exx exx exx exx exx Mira IBM BlueGene/Q 2.64 0.43 4.35 7.42 1.7 2.35 (54.1) 0.64 (14.8) 1.36 (31.2) Cori Haswell 1.07 0.72 1.66 3.45 1.6 0.84 (50.8) 0.37 (22.2) 0.45 (27.0) Cori KNL 4.45 1.76 6.53 12.74 1.5 3.51 (53.6) 1.50 (23.0) 1.53 (23.4) Cori KNL (no hyperthreading) 5.38 1.15 3.57 10.10 0.7 1.84 (51.7) 1.03 (28.8) 0.70 (19.5) spreads). at finite T and p) at the hybrid DFT level of theory. In addition to improving the performance and general- ity of exx, our group is also in the process of extending this approach to perform BOMD simulations, as well ACKNOWLEDGMENTS as generalizing this MLWF-based framework to enable linear-scaling and highly accurate evaluations of screened and range-separated exchange.49,62,132–136 Other improve- H-YK, JJ, and RD acknowledge partial support from ments can also be straightforwardly incorporated into Cornell University through start-up funding and the Cen- exx such as alternative localization schemes that are ter for Alkaline Based Energy Solutions (CABES), an better suited to treat heterogeneous systems80,91 and met- Energy Frontier Research Center funded by the U.S. De- als,113,114 or furnish localized orbitals in a non-iterative partment of Energy, Office of Science, Office of Basic fashion to avoid convergence issues.81–83,92 From a wave- Energy Sciences, under Award No. DE-SC0019445. H- function theory point of view, exx is also a viable ap- YK and RC gratefully acknowledge support from the U.S. proach for the mean-field HF approximation, and is a Department of Energy under Grant No. DE-SC0019394. logical starting point for enabling condensed-phase AIMD XW acknowledges support from National Science Foun- with local electron correlation methods. dation through Award No. DMR-1552287. This research Interested users can find exx implemented in the most used resources of the National Energy Research Scientific recent version of QE,102 with additional information (in- Computing (NERSC) Center, which is supported by the cluding a detailed description of the exx keywords) avail- Office of Science of the U.S. Department of Energy under able in the QE user’s manual online.157 In the next paper Contract No. DE-AC02-05CH11231. This research used in this series, we will generalize our MLWF-based EXX ap- resources of the Argonne Leadership Computing Facility proach to treat arbitrary Bravais lattice based simulation at Argonne National Laboratory, which is supported by cells, as well as derive (and implement) the EXX contri- the Office of Science of the U.S. Department of Energy butions to the cell forces (i.e., the stress tensor). These under Contract No. DE-AC02-06CH11357. Additional extensions to exx are needed for performing constant- resources were provided by the Terascale Infrastructure pressure AIMD simulations in the NpH and NpT ensem- for Groundbreaking Research in Science and Engineering bles, and will enable us to model large condensed-phase (TIGRESS) High Performance Computing Center and systems under realistic thermodynamic conditions (i.e., Laboratory at Princeton University.

[email protected] 6 Marx, D.; Hutter, J. Ab Initio Molecular Dynamics: Ba- 1 Hohenberg, P.; Kohn, W. Inhomogeneous Electron Gas. sic Theory and Advanced Methods; Cambridge University Phys. Rev. 1964, 136, B864–B871. Press: Cambridge, 2009. 2 Kohn, W.; Sham, L. J. Self-Consistent Equations Including 7 Iftimie, R.; Minary, P.; Tuckerman, M. E. Ab Initio Molec- Exchange and Correlation Effects. Phys. Rev. 1965, 140, ular Dynamics: Concepts, Recent Developments, and Fu- A1133–A1138. ture Trends. Proc. Natl. Acad. Sci. USA 2005, 102, 6654– 3 Jones, R. O.; Gunnarsson, O. The Density Functional 6659. Formalism, Its Applications and Prospects. Rev. Mod. 8 Perdew, J. P.; Schmidt, K. Jacob’s Ladder of Density Phys. 1989, 61, 689–746. Functional Approximations for the Exchange-Correlation 4 Parr, R. G.; Yang, W. Density-Functional Theory of Atoms Energy. in Density Functional Theory and Its Application and Molecules; Oxford University Press: New York, 1989. to Materials: Antwerp, Belgium, Jun. 8–10, 2000, edited 5 Burke, K. Perspective on Density Functional Theory. J. by V. E. Van Doren, C. Van Alsenoy, and P. Geerlings Chem. Phys. 2012, 136, 150901. (AIP Publishing, Melville, New York, 2001), pp. 1. 30

9 Ceperley, D. M.; Alder, B. J. Ground State of the Electron 27 Dion, M.; Rydberg, H.; Schröder, E.; Langreth, D. C.; Gas by a Stochastic Method. Phys. Rev. Lett. 1980, 45, Lundqvist, B. I. Van der Waals Density Functional for 566–569. General Geometries. Phys. Rev. Lett. 2004, 92, 246401. 10 Becke, A. D. Density-Functional Exchange-Energy Approx- 28 Vydrov, O. A.; Van Voorhis, T. Nonlocal van der Waals imation with Correct Asymptotic Behavior. Phys. Rev. A Density Functional Made Simple. Phys. Rev. Lett. 2009, 1988, 38, 3098–3100. 103, 063004. 11 Lee, C.; Yang, W.; Parr, R. G. Development of the Colle- 29 Lee, K.; Murray, É. D.; Kong, L.; Lundqvist, B. I.; Lan- Salvetti Correlation-Energy Formula into a Functional of greth, D. C. Higher-Accuracy van der Waals Density Func- the Electron Density. Phys. Rev. B 1988, 37, 785–789. tional. Phys. Rev. B 2010, 82, 081101. 12 Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gra- 30 Ghosh, S. K.; Parr, R. G. Phase-Space Approach to the dient Approximation Made Simple. Phys. Rev. Lett. 1996, Exchange-Energy Functional of Density-Functional Theory. 77, 3865–3868. Phys. Rev. A 1986, 34, 785–791. 13 Klimeš, J.; Michaelides, A. Perspective: Advances and 31 Becke, A. D.; Roussel, M. R. Exchange Holes in Inhomo- Challenges in Treating van der Waals Dispersion Forces geneous Systems: A Coordinate-Space Model. Phys. Rev. in Density Functional Theory. J. Chem. Phys. 2012, 137, A 1989, 39, 3761–3767. 120901. 32 Tao, J.; Perdew, J. P.; Staroverov, V. N.; Scuseria, G. E. 14 Grimme, S.; Hansen, A.; Brandenburg, J. G.; Ban- Climbing the Density Functional Ladder: Nonempirical nwarth, C. Dispersion-Corrected Mean-Field Electronic Meta–Generalized Gradient Approximation Designed for Structure Methods. Chem. Rev. 2016, 116, 5105–5154. Molecules and Solids. Phys. Rev. Lett. 2003, 91, 146401. 15 Hermann, J.; DiStasio Jr., R. A.; Tkatchenko, A. First- 33 Zhao, Y.; Truhlar, D. G. A New Local Density Func- Principles Models for van der Waals Interactions in tional for Main-Group Thermochemistry, Transition Metal Molecules and Materials: Concepts, Theory, and Applica- Bonding, Thermochemical Kinetics, and Noncovalent In- tions. Chem. Rev. 2017, 117, 4714–4758. teractions. J. Chem. Phys. 2006, 125, 194101. 16 Berland, K.; Cooper, V. R.; Lee, K.; Schröder, E.; Thon- 34 Sun, J.; Haunschild, R.; Xiao, B.; Bulik, I. W.; Scuse- hauser, T.; Hyldgaard, P.; Lundqvist, B. I. Van der Waals ria, G. E.; Perdew, J. P. Semilocal and Hybrid Meta- Forces in Density Functional Theory: A Review of the Generalized Gradient Approximations Based on the Un- vdW-DF Method. Rep. Prog. Phys. 2015, 78, 066501. derstanding of the Kinetic-Energy-Density Dependence. J. 17 Becke, A. D.; Johnson, E. R. Exchange-Hole Dipole Mo- Chem. Phys. 2013, 138, 044113. ment and the Dispersion Interaction Revisited. J. Chem. 35 Sun, J.; Ruzsinszky, A.; Perdew, J. P. Strongly Constrained Phys. 2007, 127, 154108. and Appropriately Normed Semilocal Density Functional. 18 Tkatchenko, A.; Scheffler, M. Accurate Molecular van der Phys. Rev. Lett. 2015, 115, 036402. Waals Interactions from Ground-State Electron Density 36 Yu, H. S.; Li, S. L.; Truhlar, D. G. Perspective: Kohn- and Free-Atom Reference Data. Phys. Rev. Lett. 2009, Sham Density Functional Theory Descending a Staircase. 102, 073005. J. Chem. Phys. 2016, 145, 130901. 19 Grimme, S.; Antony, J.; Ehrlich, S.; Krieg, H. A Consistent 37 Sun, J.; Remsing, R. C.; Zhang, Y.; Sun, Z.; Ruzsinszky, A.; and Accurate Ab Initio Parametrization of Density Func- Peng, H.; Yang, Z.; Paul, A.; Waghmare, U.; Wu, X.; tional Dispersion Correction (DFT-D) for the 94 Elements Klein, M. L.; Perdew, J. P. Accurate First-Principles Struc- H-Pu. J. Chem. Phys. 2010, 132, 154104. tures and Energies of Diversely Bonded Systems from an 20 Ferri, N.; DiStasio Jr., R. A.; Ambrosetti, A.; Car, R.; Efficient Density Functional. Nat. Chem. 2016, 8, 831–836. Tkatchenko, A. Electronic Properties of Molecules and 38 Chen, M.; Ko, H.-Y.; Remsing, R. C.; Andrade, M. F. C.; Surfaces with a Self-Consistent Interatomic van der Waals Santra, B.; Sun, Z.; Selloni, A.; Car, R.; Klein, M. L.; Density Functional. Phys. Rev. Lett. 2015, 114, 176802. Perdew, J. P.; Wu, X. Ab Initio Theory and Modeling 21 Caldeweyher, E.; Bannwarth, C.; Grimme, S. Extension of Water. Proc. Natl. Acad. Sci. USA 2017, 114, 10846– of the D3 Dispersion Coefficient Model. J. Chem. Phys. 10851. 2017, 147, 034112. 39 Zheng, L.; Chen, M.; Sun, Z.; Ko, H.-Y.; Santra, B.; Dhu- 22 Tkatchenko, A.; DiStasio Jr., R. A.; Car, R.; Scheffler, M. vad, P.; Wu, X. Structural, Electronic, and Dynamical Accurate and Efficient Method for Many-Body van der Properties of Liquid Water by Ab Initio Molecular Dy- Waals Interactions. Phys. Rev. Lett. 2012, 108, 236402. namics Based on SCAN Functional within the Canonical 23 DiStasio Jr., R. A.; von Lilienfeld, O. A.; Tkatchenko, A. Ensemble. J. Chem. Phys. 2018, 148, 164505. Collective Many-Body van der Waals Interactions in Molec- 40 Calegari Andrade, M. F.; Ko, H.-Y.; Car, R.; Selloni, A. ular Systems. Proc. Natl. Acad. Sci. USA 2012, 109, 14791– Structure, Polarization, and Sum Frequency Generation 14795. Spectrum of Interfacial Water on Anatase TiO2. J. Phys. 24 DiStasio Jr., R. A.; Gobre, V. V.; Tkatchenko, A. Many- Chem. Lett. 2018, 9, 6716–6721. Body van der Waals Interactions in Molecules and Con- 41 LaCount, M.; Gygi, F. Ensemble First-Principles Molec- densed Matter. J. Phys.: Condens. Matter 2014, 26, ular Dynamics Simulations of Water Using the SCAN 213202. Meta-GGA Density Functional J. Chem. Phys. 2019, 151, 25 Ambrosetti, A.; Reilly, A. M.; DiStasio Jr., R. A.; 164101. Tkatchenko, A. Long-Range Correlation Energy Calcu- 42 Xu, J.; Chen, M.; Zhang, C.; Wu, X. First-Principles lated from Coupled Atomic Response Functions. J. Chem. Study of the Infrared Spectrum in Liquid Water from a Phys. 2014, 140, 18A508. Systematically Improved Description of H-Bond Network 26 Blood-Forsythe, M. A.; Markovich, T.; DiStasio Jr., R. A.; Phys. Rev. B 2019, 90, 205123. Car, R.; Aspuru-Guzik, A. Analytical Nuclear Gradients 43 Calegari Andrade, M. F.; Ko, H.-Y.; Zhang, L.; Car, R.; for the Range-Separated Many-Body Dispersion Model of Selloni, A. Free Energy of Proton Transfer at the Water– Noncovalent Interactions. Chem. Sci. 2016, 7, 1712–1728. TiO2 Interface from Ab Initio Deep Potential Molecular 31

Dynamics Chem. Sci. 2020, DOI: 10.1039/C9SC05116C. 89, 195112. 44 Perdew, J. P.; Zunger, A. Self-Interaction Correction to 61 Garza, A. J.; Scuseria, G. E. Predicting Band Gaps with Density-Functional Approximations for Many-Electron Hybrid Density Functionals. J. Phys. Chem. Lett. 2016, Systems. Phys. Rev. B 1981, 23, 5048–5079. 7, 4165–4170. 45 Cohen, A. J.; Mori-Sánchez, P.; Yang, W. Insights into 62 Heyd, J.; Scuseria, G. E.; Ernzerhof, M. Hybrid Func- Current Limitations of Density Functional Theory. Science tionals Based on a Screened Coulomb Potential. J. Chem. 2008, 321, 792–794. Phys. 2003, 118, 8207–8215. 46 Gräfenstein, J.; Kraka, E.; Cremer, D. The Impact of the 63 Guidon, M.; Hutter, J.; VandeVondele, J. Robust Periodic Self-Interaction Error on the Density Functional Theory Hartree-Fock Exchange for Large-Scale Simulations Using Description of Dissociating Radical Cations: Ionic and Basis Sets. J. Chem. Theory Comput. 2009, 5, Covalent Dissociation Limits. J. Chem. Phys. 2003, 120, 3010–3021. 524–539. 64 Duchemin, I.; Gygi, F. A Scalable and Accurate Algorithm 47 Lundberg, M.; Siegbahn, P. E. M. Quantifying the Ef- for the Computation of Hartree-Fock Exchange. Comput. fects of the Self-Interaction Error in DFT: When Do the Phys. Commun. 2010, 181, 855–860. Delocalized States Appear? J. Chem. Phys. 2005, 122, 65 Bylaska, E. J.; Tsemekhman, K.; Baden, S. B.; 224103. Weare, J. H.; Jonsson, H. Parallel Implementation of 48 LeBlanc, L. M.; Dale, S. G.; Taylor, C. R.; Becke, A. .D; Γ-Point Pseudopotential Plane-Wave DFT with Exact Day, G. M.; Johnson, E. R. Pervasive Delocalisation Error Exchange. J. Comput. Chem. 2011, 32, 54–69. Causes Spurious Proton Transfer in Organic Acid–Base 66 Barnes, T. A.; Kurth, T.; Carrier, P.; Wichmann, N.; Co-Crystals Angew. Chem. Int. Ed. 2018, 130, 15122– Prendergast, D.; Kent, P. R. C.; Deslippe, J. Improved 15126. Treatment of Exact Exchange in Quantum ESPRESSO. 49 Janesko, B. G.; Henderson, T. M.; Scuseria, G. E. Screened Comput. Phys. Commun. 2017, 214, 52–58. Hybrid Density Functionals for Solid-State Chemistry and 67 Varini, N.; Ceresoli, D.; Martin-Samos, L.; Girotto, I.; Physics. Phys. Chem. Chem. Phys. 2009, 11, 443–454. Cavazzoni, C. Enhancement of DFT-Calculations at Petas- 50 Marsman, M.; Paier, J.; Stroppa, A.; Kresse, G. Hybrid cale: Nuclear Magnetic Resonance, Hybrid Density Func- Functionals Applied to Extended Systems. J. Phys.: Con- tional Theory and Car-Parrinello Calculations. Comput. dens. Matter 2008, 20, 064201. Phys. Commun. 2013, 184, 1827–1833. 51 Zhang, C.; Donadio, D.; Gygi, F.; Galli, G. First Principles 68 Guidon, M.; Hutter, J.; VandeVondele, J. Auxiliary Den- Simulations of the Infrared Spectrum of Liquid Water Us- sity Matrix Methods for Hartree-Fock Exchange Calcula- ing Hybrid Density Functionals. J. Chem. Theory Comput. tions. J. Chem. Theory Comput. 2010, 6, 2348–2364. 2011, 7, 1443–1449. 69 Hu, W.; Lin, L.; Yang, C. Interpolative Separable Density 52 Zhang, C.; Wu, J.; Galli, G.; Gygi, F. Structural and Fitting Decomposition for Accelerating Hybrid Density Vibrational Properties of Liquid Water from van der Waals Functional Calculations with Applications to Defects in Density Functionals. J. Chem. Theory Comput. 2011, 7, Silicon. J. Chem. Theory Comput. 2017, 13, 5420–5431. 3054–3061. 70 Dong, K.; Hu, W.; Lin, L. Interpolative Separable Den- 53 Gaiduk, A. P.; Gustafson, J.; Gygi, F.; Galli, G. First- sity Fitting through Centroidal Voronoi Tessellation with Principles Simulations of Liquid Water Using a Dielectric- Applications to Hybrid Functional Electronic Structure Dependent Hybrid Functional. J. Phys. Chem. Lett. 2018, Calculations. J. Chem. Theory Comput. 2018, 14, 1311– 9, 3068–3073. 1320. 54 Perdew, J. P. Size-Consistency, Self-Interaction Correc- 71 Lin, L. Adaptively Compressed Exchange Operator. J. tion, and Derivative Discontinuity. in Density Functional Chem. Theory Comput. 2016, 12, 2242–2249. Theory of Many-Electron Systems, Advances in Quantum 72 Lin, L.; Lindsey, M. Convergence of Adaptive Compression Chemistry Vol. 21, edited by S. B. Trickey (Academic Methods for Hartree-Fock-Like Equations. Commun. Pure Press, 1990), p. 113. Appl. Math 2019, 72, 451–499. 55 Pederson, M. R.; Perdew, J. P. Self-Interaction Correction 73 Jia, W.; Lin, L. Fast Real-Time Time-Dependent Hybrid in Density Functional Theory: The Road Less Traveled. Functional Calculations with the Parallel Transport Gauge ΨK Newsletter Scientific Highlight of the Month, February and the Adaptively Compressed Exchange Formulation. (2012), see http://www.psi-k.org/newsletters/ Comput. Phys. Commun. 2019, 240, 21–29. News_109/Highlight_109.pdf. 74 Hu, W.; Lin, L.; Yang, C. Projected Commutator DIIS 56 Pederson, M. R.; Ruzsinszky, A.; Perdew, J. P. Communi- Method for Accelerating Hybrid Functional Electronic cation: Self-Interaction Correction with Unitary Invariance Structure Calculations. J. Chem. Theory Comput. 2017, in Density Functional Theory. J. Chem. Phys. 2014, 140, 13, 5458–5467. 121103. 75 Marzari, N.; Vanderbilt, D. Maximally Localized Gener- 57 Becke, A. D. Density-Functional Thermochemistry. III. alized Wannier Functions for Composite Energy Bands. The Role of Exact Exchange. J. Chem. Phys. 1993, 98, Phys. Rev. B 1997, 56, 12847–12865. 5648–5652. 76 Wu, X.; Selloni, A.; Car, R. Order-N Implementation of 58 Perdew, J. P.; Ernzerhof, M.; Burke, K. Rationale for Exact Exchange in Extended Insulating Systems. Phys. Mixing Exact Exchange with Density Functional Approxi- Rev. B 2009, 79, 085102. mations. J. Chem. Phys. 1996, 105, 9982–9985. 77 Marzari, N.; Mostofi, A. A.; Yates, J. R.; Souza, I.; Vander- 59 Becke, A. D. A New Mixing of Hartree-Fock and Local bilt, D. Maximally Localized Wannier Functions: Theory Density-Functional Theories. J. Chem. Phys. 1993, 98, and Applications. Rev. Mod. Phys. 2012, 84, 1419–1475. 1372–1377. 78 DiStasio Jr., R. A.; Santra, B.; Li, Z.; Wu, X.; Car, R. 60 Skone, J. H.; Govoni, M.; Galli, G. Self-Consistent Hybrid The Individual and Collective Effects of Exact Exchange Functional for Condensed Systems. Phys. Rev. B 2014, and Dispersion Interactions on the Ab Initio Structure of 32

Liquid Water. J. Chem. Phys. 2014, 141, 084502. Functions for Large-Scale Biomolecular Simulations. J. 79 Gygi, F. Compact Representations of Kohn-Sham Invariant Chem. Phys. 2004, 120, 4530–4544. Subspaces. Phys. Rev. Lett. 2009, 102, 166406. 99 Silvestrelli, P. L. Van der Waals Interactions in DFT Made 80 Gygi, F.; Duchemin, I. Efficient Computation of Easy by Wannier Functions. Phys. Rev. Lett. 2008, 100, Hartree–Fock Exchange Using Recursive Subspace Bisec- 053002. tion. J. Chem. Theory Comput. 2013, 9, 582–587. 100 Mostofi, A. A.; Yates, J. R.; Lee, Y.-S.; Souza, I.; Van- 81 Damle, A.; Lin, L.; Ying, L. Compressed Representation of derbilt, D.; Marzari, N. wannier90: A Tool for Obtaining Kohn–Sham Orbitals via Selected Columns of the Density Maximally-Localised Wannier Functions. 2008, 178, 685– Matrix. J. Chem. Theory Comput. 2015, 11, 1463–1469. 699. 82 Damle, A.; Lin, L.; Ying, L. Computing Localized Repre- 101 QUANTUM ESPRESSO: A Modular and Open-Source sentations of the Kohn–Sham Subspace via Randomization Project for Quantum Simulations of Materials. and Refinement. SIAM J. Sci. Comput. 2017, 39, B1178– J. Phys.: Condens. Matter 2009, 21, 395502. B1198. 102 Giannozzi, P.; Andreussi, O.; Brumme, T.; Bunau, O.; 83 Damle, A.; Lin, L.; Ying, L. SCDM-k: Localized Orbitals Nardelli, M. B.; Calandra, M.; Car, R.; Cavazzoni, C.; for Solids via Selected Columns of the Density Matrix. J. Ceresoli, D.; Cococcioni, M.; Colonna, N.; Carnimeo, I.; Comput. Phys. 2017, 334, 1–15. Corso, A. D.; Gironcoli, S. d.; Delugas, P.; DiSta- 84 Mountjoy, J.; Todd, M.; Mosey, N. J. Exact Exchange sio Jr., R. A.; Ferretti, A.; Floris, A.; Fratesi, G.; Fu- with Non-Orthogonal Generalized Wannier Functions. J. gallo, G.; Gebauer, R.; Gerstmann, U.; Giustino, F.; Chem. Phys. 2017, 146, 104108. Gorni, T.; Jia, J.; Kawamura, M.; Ko, H.-Y.; Kokalj, A.; 85 Izmaylov, A. F.; Scuseria, G. E.; Frisch, M. J. Efficient Küçükbenli, E.; Lazzeri, M.; Marsili, M.; Marzari, N.; Evaluation of Short-Range Hartree-Fock Exchange in Mauri, F.; Nguyen, N. L.; Nguyen, H.-V.; Otero-de-la Large Molecules and Periodic Systems. J. Chem. Phys. Roza, A.; Paulatto, L.; Poncé, S.; Rocca, D.; Sabatini, R.; 2006, 125, 104103. Santra, B.; Schlipf, M.; Seitsonen, A. P.; Smogunov, A.; 86 Guidon, M.; Schiffmann, F.; Hutter, J.; VandeVondele, J. Timrov, I.; Thonhauser, T.; Umari, P.; Vast, N.; Wu, X.; Ab Initio Molecular Dynamics Using Hybrid Density Func- Baroni, S. Advanced Capabilities for Materials Modelling tionals. J. Chem. Phys. 2008, 128, 214104. with Quantum ESPRESSO. J. Phys.: Condens. Matter 87 Carnimeo, I.; Baroni, S.; Giannozzi, P. Fast Hybrid 2017, 29, 465901. Density-Functional Computations Using Plane-Wave Basis 103 Soler, J. M.; Artacho, E.; Gale, J. D.; García, A.; Jun- Sets. Electronic Structure 2019, 1, 015009. quera, J.; Ordejón, P.; Sánchez-Portal, D. The SIESTA 88 Gaiduk, A. P.; Gygi, F.; Galli, G. Density and Compress- Method for Ab Initio Order-N Materials Simulation. J. ibility of Liquid Water and Ice from First-Principles Sim- Phys.: Condens. Matter 2002, 14, 2745–2779. ulations with Hybrid Functionals. J. Phys. Chem. Lett. 104 Gonze, X.; Amadon, B.; Anglade, P. M.; Beuken, J. M.; 2015, 6, 2902–2908. Bottin, F.; Boulanger, P.; Bruneval, F.; Caliste, D.; Cara- 89 Zhang, C.; Pham, T. A.; Gygi, F.; Galli, G. Communica- cas, R.; Côté, M.; Deutsch, T.; Genovese, L.; Ghosez, P.; tion: Electronic Structure of the Solvated Chloride Anion Giantomassi, M.; Goedecker, S.; Hamann, D. R.; Her- from First Principles Molecular Dynamics. J. Chem. Phys. met, P.; Jollet, F.; Jomard, G.; Leroux, S.; Mancini, M.; 2013, 138, 181102. Mazevet, S.; Oliveira, M. J. T.; Onida, G.; Pouil- 90 Gaiduk, A. P.; Zhang, C.; Gygi, F.; Galli, G. Structural lon, Y.; Rangel, T.; Rignanese, G. M.; Sangalli, D.; and Electronic Properties of Aqueous NaCl Solutions from Shaltaf, R.; Torrent, M.; Verstraete, M. J.; Zerah, G.; Ab Initio Molecular Dynamics Simulations with Hybrid Zwanziger, J. W. ABINIT: First-Principles Approach to Density Functionals. Chem. Phys. Lett. 2014, 604, 89–96. Material and Nanosystem Properties. Comput. Phys. Com- 91 Dawson, W.; Gygi, F. Performance and Accuracy of Recur- mun. 2009, 180, 2582–2615. sive Subspace Bisection for Hybrid DFT Calculations in 105 Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Inhomogeneous Systems. J. Chem. Theory Comput. 2015, Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; 11, 4655–4663. Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W. A. 92 Damle, A.; Lin, L. Disentanglement via Entanglement: NWChem: A Comprehensive and Scalable Open-Source A Unified Method for Wannier Localization. Multiscale Solution for Large Scale Molecular Simulations. Comput. Model. Simul. 2018, 16, 1392–1410. Phys. Commun. 2010, 181, 1477–1489. 93 Boys, S. F. Construction of Some Molecular Orbitals to Be 106 Enkovaara, J.; Rostgaard, C.; Mortensen, J. J.; Chen, J.; Approximately Invariant for Changes from One Molecule Dułak, M.; Ferrighi, L.; Gavnholt, J.; Glinsvad, C.; to Another. Rev. Mod. Phys. 1960, 32, 296–299. Haikola, V.; Hansen, H. A.; Kristoffersen, H. H.; 94 Resta, R. Quantum-Mechanical Position Operator in Ex- Kuisma, M.; Larsen, A. H.; Lehtovaara, L.; Ljungberg, M.; tended Systems. Phys. Rev. Lett. 1998, 80, 1800–1803. Lopez-Acevedo, O.; Moses, P. G.; J Ojanen,; Olsen, T.; Pet- 95 Silvestrelli, P. L.; Parrinello, M. Water Molecule Dipole in zold, V.; Romero, N. A.; Stausholm-Møller, J.; Strange, M.; the Gas and in the Liquid Phase. Phys. Rev. Lett. 1999, Tritsaris, G. A.; Vanin, M.; Walter, M.; B Hammer,; 82, 3308–3311. Häkkinen, H.; Madsen, G. K. H.; Nieminen, R. M.; 96 Souza, I.; Wilkens, T.; Martin, R. M. Polarization and Nørskov, J. K.; Puska, M.; Rantala, T. T.; Schiøtz, J.; Localization in Insulators: Generating Function Approach. Thygesen, K. S.; Jacobsen, K. W. Electronic Structure Phys. Rev. B 2000, 62, 1666–1683. Calculations with GPAW: A Real-Space Implementation 97 Kirchner, B.; Hutter, J. Solvent Effects on Electronic of the Projector Augmented-Wave Method. J. Phys.: Con- Properties from Wannier Functions in a Dimethyl Sulfox- dens. Matter 2010, 22, 253202. ide/Water Mixture. J. Chem. Phys. 2004, 121, 5133–5142. 107 Hutter, J.; Iannuzzi, M.; Schiffmann, F.; VandeVondele, J. 98 Sagui, C.; Pomorski, P.; Darden, T. A.; Roland, C. Ab : Atomistic Simulations of Condensed Matter Systems. Initio Calculation of Electrostatic Multipoles with Wannier Wiley Interdiscip. Rev. Comput. Mol. Sci. 2013, 4, 15–25. 33

108 Kresse, G.; Joubert, D. From Ultrasoft Structure Calculations: Recent Advances and Novel Appli- to the Projector Augmented-Wave Method. Phys. Rev. B cations to Nano-Structures. Phys. Status Solidi B 2006, 1999, 59, 1758–1775. 243, 1063–1079. 109 Sharma, M.; Wu, Y.; Car, R. Ab Initio Molecular Dynam- 128 Saad, Y.; Chelikowsky, J. R.; Shontz, S. M. Numerical ics with Maximally Localized Wannier Functions. Int. J. Methods for Electronic Structure Calculations of Materials. Quantum Chem. 2003, 95, 821–829. SIAM Rev. 2010, 52, 3–54. 110 Iftimie, R.; Thomas, J. W.; Tuckerman, M. E. On-the- 129 Mohr, S.; E. Ratcliff, L.; Genovese, L.; Caliste, D.; Fly Localization of Electronic Orbitals in Car-Parrinello Boulanger, P.; Goedecker, S.; Deutsch, T. Accurate and Molecular Dynamics. J. Chem. Phys. 2004, 120, 2169– Efficient Linear Scaling DFT Calculations with Universal 2181. Applicability. Phys. Chem. Chem. Phys. 2015, 17, 31360– 111 Thomas, J. W.; Iftimie, R.; Tuckerman, M. E. Field The- 31370. oretic Approach to Dynamical Orbital Localization in 130 Skylaris, C.-K.; Haynes, P. D.; Mostofi, A. A.; Payne, M. C. Ab Initio Molecular Dynamics. Phys. Rev. B 2004, 69, Introducing ONETEP: Linear-Scaling Density Functional 125105. Simulations on Parallel Computers. J. Chem. Phys. 2005, 112 Car, R.; Parrinello, M. Unified Approach for Molecular 122, 084119. Dynamics and Density-Functional Theory. Phys. Rev. Lett. 131 Bowler, D. R.; Choudhury, R.; Gillan, M. J.; Miyazaki, T. 1985, 55, 2471. Recent Progress with Large-Scale Ab Initio Calculations: 113 Cornean, H. D.; Gontier, D.; Levitt, A.; Monaco, D. Lo- The CONQUEST Code. Phys. Status Solidi B 2006, 243, calised Wannier Functions in Metallic Systems. Ann. Henri 989–1000. Poincaré 2019, 20, 1367–1391. 132 Gerber, I. C.; Ángyán, J. G. Hybrid Functional with Sep- 114 Damle, A.; Levitt, A.; Lin, L. Variational Formulation arated Range. Chem. Phys. Lett. 2005, 415, 100–105. for Wannier Functions with Entangled Band Structure. 133 Vydrov, O. A.; Scuseria, G. E. Assessment of a Long-Range arXiv:1801.08572 [physics] 2018, Corrected Hybrid Functional. J. Chem. Phys. 2006, 125, 115 Kohn, W. Analytic Properties of Bloch Waves and Wannier 234109. Functions. Phys. Rev. 1959, 115, 809–821. 134 Baer, R.; Livshits, E.; Salzner, U. Tuned Range-Separated 116 des Cloizeaux, J. Analytical Properties of n-Dimensional Hybrids in Density Functional Theory. Annu. Rev. Phys. Energy Bands and Wannier Functions. Phys. Rev. 1964, Chem. 2010, 61, 85–109. 135, A698–A707. 135 Kronik, L.; Stein, T.; Refaely-Abramson, S.; Baer, R. Exci- 117 Nenciu, G. Existence of the Exponentially Localised Wan- tation Gaps of Finite-Sized Systems from Optimally Tuned nier Functions. Commun. Math. Phys. 1983, 91, 81–85. Range-Separated Hybrid Functionals. J. Chem. Theory 118 Niu, Q. Theory of the Quantized Adiabatic Particle Trans- Comput. 2012, 8, 1515–1531. port. Mod. Phys. Lett. B 1991, 05, 923–931. 136 Karolewski, A.; Kronik, L.; Kümmel, S. Using Optimally 119 Panati, G.; Pisante, A. Bloch Bundles, Marzari-Vanderbilt Tuned Range Separated Hybrid Functionals in Ground- Functional and Maximally Localized Wannier Functions. State Calculations: Consequences and Caveats. J. Chem. Commun. Math. Phys. 2013, 322, 835–875. Phys. 2013, 138, 204115. 120 Wu, X.; Walter, E. J.; Rappe, A. M.; Car, R.; Selloni, A. 137 Chen, W.; Wu, X.; Car, R. X-Ray Absorption Signatures Hybrid Density Functional Calculations of the Band Gap of the Molecular Environment in Water and Ice. Phys. Rev. of GaxIn1−xN. Phys. Rev. B 2009, 80, 115201. Lett. 2010, 105, 017802. 121 Chen, J.; Wu, X.; Selloni, A. Electronic Structure and 138 Swartz, C. W.; Wu, X. Ab Initio Studies of Ionization Bonding Properties of Cobalt Oxide in the Spinel Structure. Potentials of Hydrated Hydroxide and Hydronium. Phys. Phys. Rev. B 2011, 83, 245204. Rev. Lett. 2013, 111, 087801. 122 Santra, B.; DiStasio Jr., R. A.; Martelli, F.; Car, R. Local 139 Kümmel, S.; Kronik, L. Orbital-Dependent Density Func- Structure Analysis in Ab Initio Liquid Water. Mol. Phys. tionals: Theory and Applications. Rev. Mod. Phys. 2008, 2015, 113, 2829–2841. 80, 3–60. 123 Ko, H.-Y.; Zhang, L.; Santra, B.; Wang, H.; E, W.; Jr, R. 140 Tassone, F.; Mauri, F.; Car, R. Acceleration Schemes for A. D.; Car, R. Isotope Effects in Liquid Water via Deep Po- Ab Initio Molecular-Dynamics Simulations and Electronic- tential Molecular Dynamics. Mol. Phys. 2019, 117, 3269– Structure Calculations. Phys. Rev. B 1994, 50, 10561– 3281. 10573. 124 Bankura, A.; Santra, B.; DiStasio Jr., R. A.; Swartz, C. W.; 141 The typical choice for Γ is ≈ 0.1, e.g., see Klein, M. L.; Wu, X. A Systematic Study of Chloride Ion the “electron_damping” parameter in http://www. Solvation in Water Using van der Waals Inclusive Hybrid quantum-espresso.org/Doc/INPUT_CP.htm. Density Functional Theory. Mol. Phys. 2015, 113, 2842– 142 Adamo, C.; Barone, V. Toward Reliable Density Func- 2854. tional Methods without Adjustable Parameters: The PBE0 125 Chen, M.; Zheng, L.; Santra, B.; Ko, H.-Y.; DiStasio Model. J. Chem. Phys. 1999, 110, 6158–6170. Jr., R. A.; Klein, M. L.; Car, R.; Wu, X. Hydroxide Diffuses 143 Fornberg, B. Generation of Finite Difference Formulas on Slower than Hydronium in Water Because Its Solvated Arbitrarily Spaced Grids. Math. Comp. 1988, 51, 699–706. Structure Inhibits Correlated Proton Transfer. Nat. Chem. 144 Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetter- 2018, 10, 413–419. ling, W. T. Numerical Recipes in Fortran 77 ; Cambridge 126 Ko, H.-Y.; DiStasio Jr., R. A.; Santra, B.; Car, R. Thermal University Press: Cambridge, 1992; p 102. Expansion in Dispersion-Bound Molecular Crystals. Phys. 145 Despite the fact that the real-space MLWFs had already Rev. Materials 2018, 2, 055603. been generated (with no additional cost) during the forma- 127 Kronik, L.; Makmal, A.; Tiago, M. L.; Alemany, M. M. G.; tion of the real-space electron density (needed to evaluate Jain, M.; Huang, X.; Saad, Y.; Chelikowsky, J. R. PARSEC the semi-local xc energy), we found an additional (and –the Pseudopotential Algorithm for Real-Space Electronic completely unnecessary) invFFT call that specifically re- 34

generates the real-space MLWFs in the current version of 152 Gygi, F. Architecture of Qbox: A Scalable First-principles QE. As such, we have not included this additional cost in Molecular Dynamics Code. IBM J. RES. & DEV. 2008, htTotali, and plan to remove this unnecessary routine from 52, 137–144. future versions of QE. 153 This development version is accessible at https: 146 Abascal, J. L. F.; Vega, C. A General Purpose Model for //gitlab.com/kosinyj/exx_module_version_ the Condensed Phases of Water: TIP4P/2005. J. Chem. one_demo. Phys. 2005, 123, 234505. 154 Han, J.; Zhang, L.; Car, R.; E, W. Deep Potential: A 147 Abraham, M. J.; Murtola, T.; Schulz, R.; Páll, S.; General Representation of a Many-Body Potential Energy Smith, J. C.; Hess, B.; Lindahl, E. GROMACS: High Surface. Commun. Comput. Phys. 2018, 23, 629–639. Performance Molecular Simulations through Multi-Level 155 Zhang, L.; Han, J.; Wang, H.; Car, R.; E, W. Deep Po- Parallelism from Laptops to Supercomputers. SoftwareX tential Molecular Dynamics: A Scalable Model with the 2015, 1-2, 19–25. Accuracy of Quantum Mechanics. Phys. Rev. Lett. 2018, 148 Martyna, G. J.; Klein, M. L.; Tuckerman, M. Nosé-Hoover 120, 143001. Chains: The Canonical Ensemble via Continuous Dynam- 156 Zhang, L.; Han, J.; Wang, H.; Saidi, W.; Car, R.; E, W. ics. 1992, 97, 2635–2643. In Advances in Neural Information Processing Systems 149 Tobias, D. J.; Martyna, G. J.; Klein, M. L. Molecular Dy- 31 ; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., namics Simulations of a Protein in the Canonical Ensemble. Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates: J. Phys. Chem. 1993, 97, 12959–12966. Red Hook, 2018; pp 4436–4446. 150 Hamann, D. R.; Schlüter, M.; Chiang, C. Norm-Conserving 157 Quantum ESPRESSO: cp.x input description. http:// Pseudopotentials. Phys. Rev. Lett. 1979, 43, 1494–1497. www.quantum-espresso.org/Doc/INPUT_CP.html 151 Vanderbilt, D. Optimally Smooth Norm-Conserving Pseu- (accessed February 6, 2020). dopotentials. Phys. Rev. B 1985, 32, 8412–8415.