Improved speed and scaling in orbital space variational Monte Carlo
Iliya Sabzevari¹ and Sandeep Sharma¹,*
¹Department of Chemistry and Biochemistry, University of Colorado Boulder, Boulder, CO 80302, USA

In this work, we introduce three algorithmic improvements to reduce the cost and improve the scaling of orbital space variational Monte Carlo (VMC). First, we show that by appropriately screening the one- and two-electron integrals of the Hamiltonian one can improve the efficiency of the algorithm by several orders of magnitude. This improved efficiency comes with the added benefit that the resulting algorithm scales as the second power of the system size, $O(N^2)$, down from the fourth power, $O(N^4)$. Using numerical results, we demonstrate that the practical scaling obtained is in fact $O(N^{1.5})$ for a chain of hydrogen atoms, and $O(N^{1.2})$ for the Hubbard model. Second, we introduce the use of the rejection-free continuous time Monte Carlo (CTMC) to sample the determinants. CTMC is usually prohibitively expensive because of the need to calculate a large number of intermediates. Here, we take advantage of the fact that these intermediates are already calculated during the evaluation of the local energy and consequently, just by storing them, one can use the CTMC algorithm with virtually no overhead. Third, we show that by using the adaptive stochastic gradient descent algorithm called AMSGrad one can optimize the wavefunction energies robustly and efficiently. The combination of these three improvements allows us to calculate the ground state energy of a chain of 160 hydrogen atoms using a wavefunction containing $\sim 2 \times 10^5$ variational parameters with an accuracy of 1 m$E_h$/particle at a cost of just 25 CPU hours, which when split over 2 nodes of 24 processors each amounts to only about half an hour of wall time. This low cost, coupled with the embarrassing parallelizability of the VMC algorithm and the great freedom in the forms of usable wavefunctions, represents a highly effective method for calculating the electronic structure of model and ab initio systems.
arXiv:1807.10633v1 [physics.chem-ph] 27 Jul 2018

Quantum Monte Carlo (QMC) is one of the most powerful and versatile tools for solving the electronic structure problem and has been successfully used in a wide range of problems [1–5]. QMC can be broadly classified into two categories of algorithms, variational Monte Carlo (VMC) [6,7] and projector Monte Carlo (PMC). In VMC one is interested in minimizing the energy of a suitably chosen wavefunction ansatz. The accuracy of VMC is limited by the functional form and the flexibility of the wavefunction employed. PMC on the other hand is potentially exact and several variants exist, such as diffusion Monte Carlo (DMC) [8–10], auxiliary field quantum Monte Carlo (AFQMC) [11,12], Green's function Monte Carlo (GFMC) [13–15] and full configuration interaction quantum Monte Carlo (FCIQMC) [16–19]. Although exact in principle, in practice all versions of PMC suffer from the fermionic sign problem when used for electronic problems, except in a few special cases. A commonly used technique for overcoming the fermionic sign problem is to employ the fixed-node or fixed-phase approximation that stabilizes the Monte Carlo simulation by eliminating the sign problem at a cost of introducing a bias in the calculated energies. The fixed-node or the fixed-phase is often obtained from a VMC calculation and thus the accuracy of the PMC calculation depends to a large extent on the quality of the VMC wavefunction. Thus the VMC wavefunction plays a pivotal role and to a large extent determines the final accuracy obtained from a QMC calculation. (FCIQMC is unique in its ability to control the sign problem self-consistently by making use of the initiator approximation.) Traditionally the most commonly used variant of QMC has been the real space VMC followed by the fixed-node DMC calculation, which is able to deliver results at the complete basis set limit. Although extremely powerful, one of the shortcomings of real space QMC methods is that there is less error cancellation than is typically observed in basis set quantum chemistry. This fact, coupled with the recent success of both AFQMC and FCIQMC (both of which work in a finite basis set), has led to renewed interest in the development of new QMC algorithms that work in a finite basis set [20–27]. The present work is an attempt in this direction and presents algorithmic improvements for performing orbital space VMC calculations.

To put the orbital space VMC in the broader context of wavefunction methods, it is useful to classify the various wavefunctions used in electronic structure theory into three classes. The wavefunctions in the first class allow one to calculate the expectation value of the energy, $\langle\psi|H|\psi\rangle$, deterministically with a polynomial cost in the number of parameters; examples of these include the configuration interaction wavefunction [28] and matrix product states [29–31]. The second class of wavefunctions only allows one to calculate the overlap of the wavefunction with a determinant (or a real space configuration), $\langle n|\psi\rangle$, deterministically with a polynomial cost in the number of parameters; examples include the Slater-Jastrow (SJ) [32], correlator product states [33], and Jastrow-antisymmetric geminal power (JAGP) [34]. Finally, the third class of wavefunctions is the most general and does not allow polynomial cost evaluation of either the expectation value of the energy or an overlap with a determinant; examples of these include the coupled cluster wavefunction [35] and the projected entangled pair states [36]. The first class of wavefunctions is the most restrictive, but these wavefunctions are the easiest to work with and often efficient deterministic algorithms are available to variationally minimize their energy. For the second class of wavefunctions one has to resort to VMC algorithms to both evaluate the energy and optimize the wavefunctions. Finally, the third type of wavefunctions is the most general and is often quite accurate, but one needs to resort to approximations to evaluate their energies. In this work we will focus on the second class of wavefunctions, in particular the wavefunction comprising a product of the correlator product states and a Slater determinant (CPS-Slater).

The rest of the paper is organized as follows. In Section I, we will briefly outline the main steps of the VMC algorithm, followed by the new algorithmic innovations introduced to improve the efficiency and reduce the scaling of the algorithm. Next, in Section II, we will describe how the algorithm is used to calculate the energy and optimize the CPS-Slater wavefunction. We will also discuss in detail the cost and computational scaling of the algorithm. Finally, in Section III we will present the results obtained by this algorithm for model systems including the 1-D hydrogen chain and the 2-D Hubbard model.

I. ALGORITHMIC IMPROVEMENTS

In VMC, the expectation value of the Hamiltonian for a wavefunction $\Psi(\mathbf{p})$, where $\mathbf{p}$ is the set of wavefunction parameters, is calculated by performing a Monte Carlo integration.

$$
\begin{aligned}
E &= \frac{\langle\Psi(\mathbf{p})|H|\Psi(\mathbf{p})\rangle}{\langle\Psi(\mathbf{p})|\Psi(\mathbf{p})\rangle}
= \frac{\sum_n |\langle\Psi(\mathbf{p})|n\rangle|^2 \, \frac{\langle n|H|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle}}{\sum_n |\langle\Psi(\mathbf{p})|n\rangle|^2} \\
&= \sum_n \rho_n E_L[n] \\
&= \langle E_L[n]\rangle_{\rho_n} \qquad (1)
\end{aligned}
$$

where $E_L[n] = \frac{\langle n|H|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle}$ is the local energy of a determinant $n$, and the last expression denotes that the energy is calculated by averaging the local energy over a set of determinants $n$ that are sampled from the probability distribution $\rho_n = \frac{|\langle\Psi(\mathbf{p})|n\rangle|^2}{\sum_k |\langle\Psi(\mathbf{p})|k\rangle|^2}$. With Equation 1 as the background, we can describe the VMC algorithm by splitting it into three main tasks

1. For a given determinant $|n\rangle$ we have to calculate the local energy $E_L[n]$.

2. We have to generate a set of determinants $|n\rangle$ with a probability distribution $\rho_n$ for a given wavefunction $\Psi$.

3. And finally, we need an optimization algorithm that can minimize the energy of the wavefunction by varying the parameters $\mathbf{p}$.

We introduce algorithmic improvements in each of these tasks, which are described in detail in the next three sections.

A. Reduced scaling evaluation of the local energy

The local energy $E_L[n]$ of a determinant $|n\rangle$ is calculated as follows

$$
\begin{aligned}
E_L[n] &= \frac{\langle n|H|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle} && (2)\\
&= \frac{\sum_m \langle n|H|m\rangle\langle m|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle} && (3)\\
&= \sum_m H_{n,m}\frac{\langle m|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle}, && (4)
\end{aligned}
$$

where the sum is over all determinants $m$ that have a nonzero transition matrix element ($H_{n,m} = \langle n|H|m\rangle$) with $n$. The number of such non-zero matrix elements $H_{n,m}$ is on the order of $n_e^2 n_o^2$, where $n_e$ is the number of electrons and $n_o$ is the number of open orbitals. This number increases as the fourth power of the system size, and a naive use of this formula results in an algorithm that scales poorly with the size of the problem.

To reduce the cost of calculating the local energy we truncate the summation over all $m$ to just a summation over those $m$ that have a Hamiltonian transition matrix element $|H_{n,m}| > \epsilon$,

$$
E_L[n, \epsilon] = \sum_m^{\epsilon} H_{n,m}\frac{\langle m|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle}, \qquad (5)
$$

where $\epsilon$ is a user defined parameter. Note that in the limit $\epsilon \to 0$ we recover the exact local energy, $E_L[n, \epsilon] \to E_L[n]$. It is useful to note that when a local basis set is used the number of elements $H_{n,m}$ that have a magnitude larger than a fixed non-zero $\epsilon$ scales quadratically with the size of the system. To see this, let us consider double excitations; the argument for the single excitations is similar. For a given determinant ($n$), we can obtain another determinant by a double excitation of electrons from, say, orbitals $i$ and $j$ to orbitals $a$ and $b$ to give another determinant ($m$) with a matrix element $|H_{n,m}| = |\langle ab|ij\rangle - \langle ab|ji\rangle|$. Because of the locality of orbitals, all orbitals decay exponentially fast and thus the integral $\langle ab|ij\rangle$ is negligible unless $a$ is close to $i$ and $b$ is close to $j$. In other words, this integral is non-zero for only $O(1)$ instances of $a$ and $b$, and because there are $O(N^2)$ possible $i$ and $j$, the total number of non-negligible integrals is also $O(N^2)$. Thus if we are able to efficiently screen the transition matrix elements for a given $\epsilon \neq 0$, no matter how small $\epsilon$ is, we are guaranteed to obtain a quadratically scaling evaluation of the local energy.

To see how the parameter $\epsilon$ can be effectively used to dramatically reduce the cost of the local energy evaluation, let us examine the one electron and the two electron excitations separately.

First let us see how the more numerous two electron excitations can be screened. The magnitude of the Hamiltonian transition matrix element between a determinant $m$ contributing to the summation and the determinant $n$, to which it is related by the double excitation of electrons from orbitals $i$ and $j$ to orbitals $a$ and $b$, is equal to $|H_{n,m}| = |\langle ab|ij\rangle - \langle ab|ji\rangle|$. The magnitude of the Hamiltonian matrix element only depends on the four orbitals that change their occupation and does not depend on the rest of the orbitals of the determinants $n$ or $m$. To use this fact efficiently, we store the integrals in the heat bath format whereby, for all pairs of orbitals $i$ and $j$, we store the tuples $\{a, b, \langle ab|ij\rangle - \langle ab|ji\rangle\}$ in descending order by the value of $|\langle ab|ij\rangle - \langle ab|ji\rangle|$. Now, for generating all determinants $m$ that are connected to $n$ by a double excitation with a matrix element larger than $\epsilon$, one simply loops over all pairs of occupied orbitals $i$ and $j$ in determinant $n$. For each of these pairs of orbitals one loops over the tuples $\{a, b, \langle ab|ij\rangle - \langle ab|ji\rangle\}$ for as long as $|\langle ab|ij\rangle - \langle ab|ji\rangle| > \epsilon$, and one exits this inner loop as soon as this inequality is violated. The CPU cost of storing the integrals in the heat bath format is $O(N^4 \ln(N^2))$, but this is essentially the same cost as reading the integrals, $O(N^4)$, and only needs to be done once for the entire calculation. The algorithm described here is essentially the same as the one used to perform efficient screening of two electron integrals in the Heat Bath Configuration interaction algorithm [37], which is a variant of the selected configuration interaction algorithm [38] and is much more efficient than other variants.

It is important to recall that although it might seem that there are only $O(N^2)$ single excitations, and thus one does not need to screen them, this is in fact not true. The reason is that the Hamiltonian transition matrix element between two determinants, $n$ and $m$, that are connected by the single excitation of an electron from orbital $i$ to orbital $a$ is given by $H_{n,m} = \langle i|a\rangle + \sum_{j\in \mathrm{occ}} \left(\langle ij|aj\rangle - \langle ij|ja\rangle\right)$. There are $O(N^2)$ such connections and the cost of calculating each of these matrix elements scales as $O(N)$, thus making the entire cost $O(N^3)$. We have observed that without appropriate screening of these one electron integrals their cost starts to dominate the cost of calculating the local energy. To screen the singly excited determinants we first calculate the quantity $S_{i,a} = |\langle i|a\rangle| + \sum_j |\langle ij|aj\rangle - \langle ij|ja\rangle|$ for all pairs of orbitals $i$ and $a$. Notice that $S_{i,a} \geq |H_{n,m}|$. Thus, while generating singly excited determinants from a determinant $n$ by exciting an electron from an occupied orbital $i$ to an empty orbital $a$, we discard all excitations where $S_{i,a} < \epsilon$.

Thus for a given $\epsilon$, only determinants $m$ that make a non-zero contribution to the local energy are ever considered. As we will demonstrate in Table II, this algorithm can be used to discard a very large fraction of the determinants from the summation of Equation 5, providing orders of magnitude improvement in the efficiency of the algorithm while incurring only a small error in the overall energy. In Section II B we will show that the contribution to the local energy ($H_{n,m}\frac{\langle m|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle}$) for each of these $O(N^2)$ determinants $m$ can be evaluated at a cost $O(1)$, allowing one to evaluate the local energy with a cost that scales quadratically with the size of the system. Recently, two other algorithms have appeared in the literature that sample the matrix elements $H_{n,m}$ stochastically [24] and semistochastically [39] instead of screening the summation deterministically as we have proposed here. Both these algorithms should have the same asymptotic scaling as the current algorithm; however, in practice the benefits of reduced scaling only manifest themselves for large system sizes.

B. Continuous time Monte Carlo for sampling determinants

The usual procedure for generating determinants $n$ according to a probability distribution $\rho_n$ uses the Metropolis-Hastings algorithm, in which a Markov chain is realized by performing a random walk according to the following steps

1. Starting from a determinant $n$, a new candidate determinant $m$ is generated with a proposal probability $P(m \leftarrow n)$.

2. The candidate determinant is then accepted if a randomly generated number between 0 and 1 is less than $\min\left(1, \frac{\rho(m)P(n \leftarrow m)}{\rho(n)P(m \leftarrow n)}\right)$, otherwise we stay at determinant $n$ for another iteration.

The algorithm guarantees that with a sufficiently large number of steps one generates determinants according to the probability distribution $\rho$ as long as the principle of ergodicity is satisfied. This states that it should be possible to reach any determinant $k$ from a determinant $n$ in a finite number of steps. The drawback of the algorithm is that if one chooses the proposal probability distribution poorly, such that determinants $m$ that have a small $\rho(m)$ are proposed with a high probability, then several moves will be rejected, leading to long autocorrelation times. Although the algorithm places very few restrictions on the proposal probability distribution, and one can in principle devise a probability distribution that leads to very few rejections, in practice it is far from easy to do this.

In this work we bypass the need for devising complicated proposal probability distributions by using the continuous time Monte Carlo (CTMC), also known as the kinetic Monte Carlo (KMC), Bortz-Kalos-Lebowitz (BKL) algorithm [40] and Gillespie's algorithm [41]. This is an alternative to the Metropolis-Hastings algorithm and has the advantage of being a rejection-free algorithm: every proposed move is accepted. Just as in the case of the Metropolis-Hastings algorithm, there is considerable flexibility in how the algorithm is implemented, but in this work we use the following steps:

1. Starting from a determinant $n$, calculate the quantity

$$
r(m \leftarrow n) = \left(\frac{\rho(m)}{\rho(n)}\right)^{1/2} = \left|\frac{\langle m|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle}\right| \qquad (6)
$$

for all determinants $m$ that are connected to $n$ by a single excitation or a double excitation.

2. Calculate the residence time $t_n = \frac{1}{\sum_m r(m \leftarrow n)}$ and stay on the walker $n$ for the time $t_n$ (the residence time can also be viewed as a weight).

3. Next, a new determinant $m$ is selected, without rejection, out of all the determinants with a probability proportional to $r(m \leftarrow n)$.

To get a better understanding of the algorithm, it is useful to think of each determinant ($n$) as a reactive species in a unimolecular reaction network that can be transformed into another determinant ($m$) through a unimolecular reaction, with the ratio of the forward and reverse rates determined by the equilibrium constant (in our case $\frac{k(m \leftarrow n)}{k(n \leftarrow m)} = K_{eq} = \left|\frac{\langle m|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle}\right|^2$). As long as all determinants are participating in the reaction network, it will reach an equilibrium state in which a determinant ($n$) will have a concentration

$$
c(n) \propto |\langle n|\Psi(\mathbf{p})\rangle|^2. \qquad (7)
$$

The CTMC algorithm can be viewed as a way of driving this unimolecular reaction network stochastically to equilibrium.

Notice that the same equilibrium state will be reached irrespective of the relative magnitude of the rate constants $k(m \leftarrow n)$ versus $k(k \leftarrow n)$. Within the reaction network picture, this is akin to catalyzing a reaction, which can lead to faster equilibration but does not change the equilibrium state itself. This allows considerable freedom in how the algorithm is implemented. The rates chosen by us in Equation 6 are only one example of a choice that one can make. In fact, as long as the algorithm satisfies the condition of detailed balance, $\frac{r(m \leftarrow n)}{r(n \leftarrow m)} = \frac{\rho(m)}{\rho(n)}$, and the principle of ergodicity, it is guaranteed to sample the determinants with the correct probability.

Although the CTMC can be used in virtually all Monte Carlo simulations, to the best of our knowledge it has never been used in VMC before. This is partly because one has to calculate and store the quantities $r(m \leftarrow n)$ for a potentially large set of connected configurations, and this is usually quite expensive. However, it is interesting to note that in the VMC algorithm the quantities $\frac{\langle m|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle}$ are already used in the calculation of the local energy (see Equation 5), and just by storing those quantities the CTMC algorithm can be used with almost no overhead. We notice that using the CTMC algorithm leads to a shorter autocorrelation time and results in a more efficient VMC calculation.

A potential difficulty that might arise due to the use of CTMC is that one could encounter a situation in which, although a set of determinants have a large overlap with the current wavefunction, they are never reached because we start our Monte Carlo simulation from a determinant that is not connected to these determinants through a set of significant Hamiltonian transition matrix elements (ergodicity of the simulation is broken). For example, let us imagine a situation in which two states $|\Psi_1\rangle$ and $|\Psi_2\rangle$ are nearly degenerate but have different irreducible representations. If we introduce a small perturbation in the Hamiltonian that breaks the symmetry, then the resulting ground state will be a linear combination of the two states $|\Psi_1\rangle$ and $|\Psi_2\rangle$. In such a situation, the CTMC algorithm, when performed with a screening parameter $\epsilon$ that is larger than the magnitude of the perturbation, will not be able to sample both states and will most likely be stuck in one or the other state depending on the initial determinant chosen to start the Monte Carlo calculation. This situation is difficult to diagnose because it is likely that the energy obtained will be reasonably accurate, but properties such as correlation functions will be quite inaccurate. Although it is difficult to be certain that such an eventuality is eliminated, it can be avoided to a large extent by fully localizing the orbital basis and trying to break all possible spatial symmetries. This will ensure that the approximate symmetries of the problem will not influence the Monte Carlo walk.

C. AMSGrad algorithm for optimizing the energy

The optimized wavefunction ($\Psi(\mathbf{p})$) is obtained by minimizing its energy with respect to its parameters $\mathbf{p}$. The gradient of the energy of the wavefunction with respect to $\mathbf{p}$ can be evaluated as follows

$$
\begin{aligned}
g_i &= \frac{\partial E}{\partial p_i} && (8)\\
&= \frac{\partial}{\partial p_i}\frac{\langle\Psi(\mathbf{p})|H|\Psi(\mathbf{p})\rangle}{\langle\Psi(\mathbf{p})|\Psi(\mathbf{p})\rangle} \\
&= 2\,\frac{\langle\Psi_i(\mathbf{p})|H - E|\Psi(\mathbf{p})\rangle}{\langle\Psi(\mathbf{p})|\Psi(\mathbf{p})\rangle} \\
&= 2\,\frac{\sum_n |\langle\Psi(\mathbf{p})|n\rangle|^2 \, \frac{\langle\Psi_i(\mathbf{p})|n\rangle}{\langle\Psi(\mathbf{p})|n\rangle}\,\frac{\langle n|H - E|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle}}{\sum_n |\langle\Psi(\mathbf{p})|n\rangle|^2} \\
&= 2\left\langle \frac{\langle\Psi_i(\mathbf{p})|n\rangle}{\langle\Psi(\mathbf{p})|n\rangle}\,(E_L[n] - E)\right\rangle_{\rho_n} && (9)
\end{aligned}
$$

where $|\Psi_i(\mathbf{p})\rangle = \left|\frac{\partial\Psi(\mathbf{p})}{\partial p_i}\right\rangle$, and in going from the first line to the second we have assumed that all parameters are real. Note that during the calculation of the energy the determinants $n$ are generated according to the probability distribution $\rho_n$ and for each of these determinants a local energy $E_L[n]$ is evaluated. Thus to obtain the gradient of the energy, the only additional quantity needed is a vector of gradient ratios $\frac{\langle\Psi_i(\mathbf{p})|n\rangle}{\langle\Psi(\mathbf{p})|n\rangle}$. We will show that these can be calculated at a cost that scales linearly with the number of parameters in the wavefunction. In the wavefunctions that we have used in this work, the number of parameters itself scales quadratically with the size of the system. Thus the gradient can be calculated at a cost that scales no worse than the local energy.
The two most common algorithms for minimizing the energy of the wavefunctions in VMC are the linear method (LM) [42–44] and the stochastic reconfiguration (SR) [45]. The former is a second order method and can be viewed as an instance of the augmented Hessian method in which one repeatedly solves the generalized eigenvalue equation

$$
\begin{pmatrix} E_0 & \mathbf{g}^T \\ \mathbf{g} & \mathbf{H} \end{pmatrix}
\begin{pmatrix} 1 \\ \Delta\mathbf{p} \end{pmatrix}
= E \begin{pmatrix} 1 & 0 \\ 0 & \mathbf{S} \end{pmatrix}
\begin{pmatrix} 1 \\ \Delta\mathbf{p} \end{pmatrix} \qquad (10)
$$

where $S_{ij} = \langle\bar{\Psi}_i|\bar{\Psi}_j\rangle$, $H_{ij} = \langle\bar{\Psi}_i|H|\bar{\Psi}_j\rangle$, $E$ is the corresponding eigenvalue, $\Delta\mathbf{p}$ is the update to the wavefunction parameters and $|\bar{\Psi}_i\rangle = \frac{1}{\sqrt{\langle\Psi|\Psi\rangle}}\left(|\Psi_i\rangle - \frac{\langle\Psi|\Psi_i\rangle}{\langle\Psi|\Psi\rangle}|\Psi\rangle\right)$. (Here we will not go into the details of how the matrices $\mathbf{S}$ and $\mathbf{H}$ are calculated.)

The SR algorithm can be thought of as performing projector Monte Carlo by repeated application of the propagator $(I - \Delta t H)$ to the wavefunction $|\Psi(t)\rangle$ and at each step projecting the resulting wavefunction onto the tangent space constructed from the wavefunction gradients $|\Psi_i\rangle$. It can be shown that this results in a linear equation

$$
\sum_j S_{ij}\,\Delta p_j = -(\Delta t)\, g_i \qquad (11)
$$

which has to be solved at each step to obtain $\Delta\mathbf{p}$, the update to the wavefunction parameters. Note that in both these methods one has to solve an equation at each iteration, which has a CPU cost of $O(N_p^3)$ and a memory cost of $O(N_p^2)$, where $N_p$ is the number of wavefunction parameters, which we will show scales quadratically with the size of the system. The CPU cost can be reduced to $O(n_s N_p)$ and $O(n_s N^4)$ respectively for the SR and LM methods by using a direct method which avoids building the matrices, where $n_s$ is the number of Monte Carlo samples used in each optimization iteration [46].

Both these methods are quite effective at optimizing the wavefunction. In particular, the LM requires fewer iterations and is often able to find the optimized wavefunction containing about 1000 parameters in less than 10 iterations. The effectiveness of the method is somewhat adversely affected when using the direct method because one often needs to include level shifts to remove the singularity in the overlap matrix $\mathbf{S}$, which can result in slower convergence.

In this work we instead use a flavor of the adaptive stochastic gradient descent (SGD) method called AMSGrad [47]. The use of SGD methods has become popular in machine learning, where one is often interested in optimizing a cost function that depends non-linearly on a set of parameters and which can only be evaluated with a stochastic error by using batches of finite samples. This problem is of course very similar to the one we are interested in solving in VMC. The use of these adaptive stochastic gradient descent methods in the context of VMC was first proposed by Booth and coworkers [24].

The stochastic gradient descent methods have the advantage that (1) their CPU and memory cost scales linearly with the number of wavefunction parameters and (2) the update in parameters depends strictly linearly on the gradient and thus does not introduce a systematic bias. The disadvantage of these methods is that traditionally they have been thought to be too slow, needing several thousand iterations to reach convergence, thus rendering them relatively ineffective. We show that by choosing appropriate parameters in the AMSGrad algorithm we obtain convergence in a few tens to hundreds of iterations, which, coupled with the fact that each iteration is much cheaper than in the LM and SR algorithms, makes it a powerful algorithm to solve challenging non-linear optimization problems in VMC.

A general algorithm for adaptive methods is as follows

$$
\begin{aligned}
\mathbf{m}^{(i)} &= f(\mathbf{g}^{(0)}, \cdots, \mathbf{g}^{(i)}) && (12)\\
\mathbf{n}^{(i)} &= g(\mathbf{g}^{(0)}, \cdots, \mathbf{g}^{(i)}) && (13)\\
\Delta p_j &= -\alpha^{(i)}\, m_j^{(i)} \Big/ \sqrt{n_j^{(i)}} && (14)
\end{aligned}
$$

where $f$ and $g$ are some functions that take in all the past gradients and generate vectors $\mathbf{m}^{(i)}$ and $\mathbf{n}^{(i)}$, which are then used to calculate the updates $\Delta\mathbf{p}$ to the parameters. The various flavors of adaptive methods, ADAGrad [48], RMSProp [49], ADAM [50] and AMSGrad [47], differ in the functions $f$ and $g$. In AMSGrad $f$ and $g$ respectively calculate the exponentially decaying moving average of the first and second moment

$$
\begin{aligned}
\mathbf{m}^{(i)} &= (1 - \beta_1)\,\mathbf{m}^{(i-1)} + \beta_1\,\mathbf{g}^{(i)} && (15)\\
\mathbf{n}^{(i)} &= \max\left(\mathbf{n}^{(i-1)},\; (1 - \beta_2)\,\mathbf{n}^{(i-1)} + \beta_2\,(\mathbf{g}^{(i)} \cdot \mathbf{g}^{(i)})\right) && (16)
\end{aligned}
$$

of the gradients ($\mathbf{g}$), with the caveat that the second moment at iteration $(i)$ is never smaller than at iteration $(i - 1)$, i.e. $\mathbf{n}^{(i)} \geq \mathbf{n}^{(i-1)}$. This ensures that the learning rate, $\alpha^{(i)}/\sqrt{\mathbf{n}^{(i)}}$, is a monotonically decreasing function of the number of iterations. This was shown to improve the convergence of AMSGrad for a synthetic problem over ADAM, which is essentially identical to AMSGrad with the only difference being that $g$ calculates the exponentially decaying moving average of the second moment of the gradient without taking the maximum. In our experiments with various VMC calculations we have found AMSGrad to always be superior to ADAM and RMSProp. For most of the systems studied in this article, unless otherwise specified, we have used the parameters $\alpha^{(i)} = 0.01$, $\beta_1 = 0.1$, $\beta_2 = 0.01$, which are significantly more aggressive than the recommended values [47] of $\alpha^{(i)} = 0.001$, $\beta_1 = 0.1$, $\beta_2 = 0.001$.

II. COMPUTATIONAL CONSIDERATIONS

In this section we will first briefly describe the wavefunction consisting of the product of a correlator product state [34,46] and a Slater determinant and then show that the cost of each stochastic sample is dominated by operations that cost $O(N^2)$ for ab-initio Hamiltonians and $O(N)$ for the Hubbard model.

A. Correlator product states and Slater determinant

The wavefunction ($|\Psi\rangle$) used in this work consists of a product of the correlator product states (CPS) ($\hat{C}$) and a Slater determinant ($|\Phi\rangle$)

$$
\begin{aligned}
|\Psi\rangle &= \hat{C}|\Phi\rangle && (17)\\
|\Phi\rangle &= \prod_i \left(\sum_x \theta_{ix}\, a_x^\dagger\right)|0\rangle && (18)
\end{aligned}
$$

… determinant

$$
\frac{\langle m|\Psi(\mathbf{p})\rangle}{\langle n|\Psi(\mathbf{p})\rangle} = \frac{\hat{C}[m]\,\det(\theta[m])}{\hat{C}[n]\,\det(\theta[n])} \qquad (20)
$$

where $\theta[m]$ is the square matrix constructed by only retaining those rows and columns from $\theta$ that are occupied in $m$ and $\Phi$.

The CPS overlap with a determinant, $\hat{C}[m]$, is just the product of the coefficients ($c_{n_\lambda}$) of all correlators ($\lambda$) present in $m$. Therefore, the ratio of the CPS overlaps ($\hat{C}[m]/\hat{C}[n]$) is equal to the ratio of the coefficients for only those correlators that contain at least one of the four orbitals that differ between $n$ and $m$, as all correlators in common cancel. This ratio can be evaluated with an $O(1)$ cost as follows. For each orbital in the system we maintain a list of all the correlators that contain that or-