
Approximating matrix eigenvalues by randomized subspace iteration

Samuel M. Greene,1 Robert J. Webber,2 Timothy C. Berkelbach,1,3,a) and Jonathan Weare2,b)

1) Department of Chemistry, Columbia University, New York, New York 10027, United States
2) Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, United States
3) Center for Computational Quantum Physics, Flatiron Institute, New York, New York 10010, United States

a) Electronic mail: [email protected]
b) Electronic mail: [email protected]

arXiv:2103.12109v1 [math.NA] 22 Mar 2021

Traditional numerical methods for calculating matrix eigenvalues are prohibitively expensive for high-dimensional problems. Randomized iterative methods allow for the estimation of a single dominant eigenvalue at reduced cost by leveraging repeated random sampling and averaging. We present a general approach to extending such methods for the estimation of multiple eigenvalues and demonstrate its performance for problems in quantum chemistry with matrices as large as 28 million by 28 million.

I. INTRODUCTION

Many scientific problems require matrix eigenvectors and eigenvalues, but methods for calculating them based on dense, in-place factorizations are intractably expensive for large matrices.1,2 Iterative methods involving repeated matrix multiplications3–5 offer reduced computational and memory costs, particularly for sparse matrices. However, even these methods are too expensive for the extremely large matrices increasingly encountered in modern applications.

Randomized iterative numerical linear algebra methods6–11 can enable significant further reductions in computational and memory costs by stochastically imposing sparsity in vectors and matrices at each iteration, thus facilitating the use of efficient computational techniques that leverage sparsity. While randomized iterative methods offer fewer established theoretical guarantees compared to the better studied matrix sketching approaches,12–20 their memory and computational costs can be significantly lower.11,21–24 When used to calculate the ground-state energy, or smallest eigenvalue, of the quantum mechanical Hamiltonian matrix, randomized iterative methods are known as projector quantum Monte Carlo approaches.7,22,25–28 Applying these methods to calculate multiple eigenvalues poses additional challenges due to the need to maintain orthogonality among eigenvectors as iteration proceeds.10,29–31

This paper presents a randomized subspace iteration approach to addressing these challenges. The method is general and can be used to extend all of the above referenced randomized iterative approaches for dominant eigenvalues (including continuous-space methods such as diffusion Monte Carlo32) to the multiple dominant eigenvalue problem. For concreteness, we focus on a particular randomization technique, namely one from the fast randomized iteration framework.11 We test our method on quantum mechanical problems; in this context, it can be understood as a natural generalization of projector Monte Carlo methods to excited states.

Among previous randomized iterative approaches to the multiple eigenvalue problem, ours is perhaps most closely related to several "replica" schemes that use multiple independent iterations to build subspaces within which the target matrix is subsequently diagonalized.30,33–35 In comparison, our method avoids the high-variance inner products of sparse, random vectors that can hinder replica approaches34 and results in a stable stochastic iteration that can be averaged to further reduce statistical error. As a notable consequence, and unlike replica methods, our approach is applicable to iterative techniques in continuous space.32

II. A NON-STANDARD DETERMINISTIC SUBSPACE ITERATION

Our goal is to find the k dominant eigenvalues (counting multiplicity) of an n × n matrix A. Starting from an initial n × k matrix X^(0), classical subspace iteration techniques construct a sequence of "matrix iterates" according to the iteration X^(i+1) = A X^(i) [G^(i)]^(-1). For k = 1, this corresponds to power iteration, on which many single-eigenvalue randomized iterative methods are based. Multiplying by [G^(i)]^(-1) enforces orthonormality among the columns of matrix iterates. The column span of the matrix iterates converges to the span of the k dominant eigenvectors if the overlap of the initial iterate with this subspace is nonzero. Eigenvalues and eigenvectors can be estimated after each iteration using the Rayleigh-Ritz method36–38 (a minimal deterministic version is sketched below). In standard implementations of subspace iteration, both the orthogonalization and eigenvalue estimation steps involve nonlinear operations on X^(i), which lead to statistical biases once randomness is introduced into the iterates by stochastic sparsification.
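As a point of reference for the randomized variants developed below, the following is a minimal NumPy sketch of this classical procedure, assuming a symmetric matrix A; the QR factorization plays the role of multiplication by [G^(i)]^(-1), and the function name and arguments are illustrative rather than taken from the authors' FRIES code.

    import numpy as np
    from scipy.linalg import qr, eigh

    def classical_subspace_iteration(A, X, n_iter):
        # X: n x k starting matrix with nonzero overlap with the
        # dominant eigenspace of the symmetric matrix A.
        for _ in range(n_iter):
            X, _ = qr(A @ X, mode="economic")   # X <- A X [G]^(-1)
        # Rayleigh-Ritz: eigenvalues of the k x k projected matrix.
        ritz = eigh(X.conj().T @ (A @ X), eigvals_only=True)
        return np.sort(ritz)[::-1], X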
In order to reduce these errors in our randomized algorithm, we make two non-standard choices. First, we estimate eigenvalues by solving the generalized eigenvalue problem

    U* A X^(i) W^(i) = U* X^(i) W^(i) Λ^(i)    (1)

for the unknown diagonal matrix Λ^(i) of Ritz values, where U is a constant deterministic matrix with columns chosen to approximate the dominant eigenvectors of A. This approach represents a multi-eigenvalue generalization of the "projected estimator" commonly used in single-eigenvalue randomized methods.39 Eigenvalue estimates are exact for any eigenvector exactly contained within the column span of U regardless of the quality of the iterate X^(i), a feature that will provide an additional means of reducing statistical error in our randomized algorithm.

Second, we construct the matrices G^(i) by a non-standard approach. We only enforce orthogonality of the columns of X^(i) within the column span of U instead of in the entire vector space. Multiplication by [G^(i)]^(-1) typically also enforces a normalization constraint, which we relax by normalizing by a modified running geometric average of iterate norms. Specific procedures for constructing the matrices G^(i) are described below.

The principle motivating these choices is that nonlinear operations in our iteration should only be applied to products between the iterates X^(i) and the constant matrices U and A*U. This leads to a suboptimal deterministic algorithm in the sense that eigenvalue estimates converge at only half the rate obtained if (1) is replaced by the standard quadratic Rayleigh quotient.36 However, when iterates are randomized as described below, their products with deterministic matrices typically have very low variance, even when the variance in the iterates is significant.11
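Because the pencil in (1) is only k × k once the products U*AX^(i) and U*X^(i) are formed, the projected estimator is cheap to evaluate. A hedged sketch, with our own function name and SciPy's dense generalized eigensolver standing in for whatever solver an implementation might use:

    import numpy as np
    from scipy.linalg import eig

    def projected_ritz_values(A, U, X):
        # Solve U* A X w = lambda U* X w, Eq. (1). Only linear
        # functions of X enter; the nonlinear eigensolve acts on
        # small k x k matrices.
        lhs = U.conj().T @ (A @ X)
        rhs = U.conj().T @ X
        lam = eig(lhs, rhs, right=False)
        return np.sort(lam.real)[::-1]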
III. STOCHASTIC COMPRESSION

If sparsity is leveraged, the cost of forming the matrix products A X^(i) in the above algorithm scales as O(m_a m_x k), where m_a and m_x are the number of nonzero elements in each column of A and X^(i), respectively. (X^(i) has k columns.) Stochastic compression allows one to reduce this cost by zeroing nonzero elements at randomly selected positions. We define a stochastic compression operator Φ which, when applied to a generic vector x, returns a random compressed vector Φ(x) with (1) at most a user-specified number m of nonzero elements, and (2) elements equal to those of the input vector x in expectation, i.e., E[Φ(x)_i] = x_i. Applying Φ to a matrix X = [x_1 x_2 ...] involves compressing each of its columns independently to m nonzero elements, i.e., Φ(X) = [Φ(x_1) Φ(x_2) ...]. Many possible compression algorithms correspond to this generic definition. Here we use a new pivotal algorithm described in the Materials and Methods section, below. As for many previous algorithms in the fast randomized iteration framework, pivotal compression exactly preserves the largest-magnitude elements of the input vector and randomly samples the remainder. Moreover, it yields the exact input vector when m meets or exceeds the number of nonzero elements in the input vector, in which case the statistical error is zero.

The variance in a compressed vector x′ = Φ(x) can be systematically reduced by increasing m, albeit at increased computational cost. In the context of randomized iterative methods, m can often be chosen to be significantly less than the dimension of x. The statistical variance in the dot product u*x′ is often low even in high dimensions.11 In contrast, dot products between pairs of uncorrelated, sparse, random vectors, as are used in replica methods, can have high variance, particularly when the vectors are high-dimensional.34 And taking the dot product of a compressed vector with itself (e.g., x′*x′) introduces a significant bias in high dimensions.27
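Any operator satisfying these two properties can serve as Φ. The sketch below uses multinomial sampling with replacement, which is simpler than, and statistically inferior to, the pivotal scheme of Appendix A, but it makes the interface and the unbiasedness property concrete; the function name is ours.

    import numpy as np

    def compress_multinomial(x, m, rng=np.random.default_rng()):
        # Keep at most m nonzero entries, unbiased: E[Phi(x)] = x.
        x = np.asarray(x, dtype=float)
        p = np.abs(x) / np.abs(x).sum()     # sampling probabilities
        counts = rng.multinomial(m, p)      # m draws with replacement
        out = np.zeros_like(x)
        nz = counts > 0
        out[nz] = x[nz] * counts[nz] / (m * p[nz])
        return out

Applied column by column, Φ(X) = np.column_stack([compress_multinomial(x, m) for x in X.T]), matching the matrix definition above.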
IV. RANDOMIZED SUBSPACE ITERATION

Applying stochastic compression to our subspace iteration yields the iteration X^(i+1) = A Φ(X^(i)) [G^(i)]^(-1), where the compression operation Φ is performed independently at each iteration. We refer to the sequence of iterates X^(i) so generated as a "trajectory." As emphasized above, the inner products of X^(i) with deterministic matrices exhibit low variance. Nonetheless, even this low variance could result in significant biases if the "instantaneous" Ritz values corresponding to U* A Φ(X^(i)) and U* X^(i) in (1) from each iteration were averaged, due to the nonlinear eigensolve operation. This will be demonstrated in our numerical experiments below. For this reason, we average these matrices in order to further reduce their variance before solving the eigenvalue problem

    ⟨U* A Φ(X^(i))⟩_i W = ⟨U* X^(i)⟩_i W Λ    (2)

where Λ is a diagonal matrix of Ritz values and ⟨·⟩_i represents an average over multiple iterations i from a single long trajectory. This formulation allows for the estimation of eigenvalues with low bias and low variance while also avoiding intractable memory costs.

In principle, we could first average the iterates to obtain ⟨X^(i)⟩_i, an accurate representation of the dominant eigenspace, and then calculate its associated Ritz values. However, this would require O(kn) memory, rendering it impractical for large matrices. Instead, the k × k matrices U* A Φ(X^(i)) and U* X^(i) in (2) can be stored and averaged at little memory cost and, due to their linear dependence on the iterates, without statistical bias. In fact, estimates from (2) are equal (in expectation value) to those obtained by applying the Rayleigh-Ritz estimator in (1) to ⟨X^(i)⟩_i.

We construct the matrices G^(i) from diagonal matrices N^(i) of normalization factors. Starting from the identity, N^(i) is updated after each iteration as

    N^(i)_kk = ( ‖x^(i)_k‖_1 / ‖x^(i-1)_k‖_1 )^α ( N^(i-1)_kk )^(1-α)    (3)

where x^(i)_k denotes the kth column of X^(i), and α is a user-defined parameter. In most iterations, we set G^(i) equal to N^(i). In this case, if α = 1, multiplication by [G^(i)]^(-1) fixes the iterate column norms at those from the previous iteration. However, due to the nonlinear dependence of N^(i) on the matrix iterates, this introduces a bias, termed "population control bias" in other randomized methods.40 It can therefore be advantageous to choose a value of α < 1 to ensure that N^(i) depends less strongly on the current matrix iterate, in analogy to normalization strategies used in other randomized methods.32,40 In our numerical experiments, choosing α = 0.5 reduced this bias without significantly increasing the variance.

Periodically (every 1000 iterations in this work), we orthogonalize the iterate columns by instead constructing G^(i) as N^(i) D^(i) R^(i). Here R^(i) is the upper triangular factor from a QR factorization of U* A Φ(X^(i)). QR factorization is performed on the comparatively low-variance product U* A Φ(X^(i)) rather than on Φ(X^(i)) itself. In order to remove the normalization constraint imposed by multiplying by [R^(i)]^(-1), we choose the diagonal matrix D^(i) such that the ℓ1-norms of the columns of Φ(X^(i)) [D^(i) R^(i)]^(-1) equal those of Φ(X^(i)). This ensures that normalization is only enforced via multiplication by [N^(i)]^(-1).
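Combining the pieces, the following sketch implements one plausible reading of the iteration with G^(i) = N^(i) for real matrices, omitting the periodic QR-based orthogonalization for brevity: the per-column growth factors of the unnormalized product A Φ(X^(i)) enter the running geometric average of (3), and the k × k products of (2) are accumulated after a burn-in period. All names are illustrative, and compress_multinomial is the stand-in compression operator sketched above.

    import numpy as np
    from scipy.linalg import eig

    def randomized_subspace_iteration(A, U, X, m, n_iter, burn_in,
                                      alpha=0.5,
                                      rng=np.random.default_rng()):
        k = X.shape[1]
        N = np.ones(k)                      # diagonal of N^(i)
        S1 = np.zeros((k, k))               # running sum of U^T A Phi(X)
        S2 = np.zeros((k, k))               # running sum of U^T Phi(X)
        n_avg = 0
        for i in range(n_iter):
            PhiX = np.column_stack(
                [compress_multinomial(col, m, rng) for col in X.T])
            Y = A @ PhiX                    # unnormalized next iterate
            if i >= burn_in:                # accumulate Eq. (2) averages
                S1 += U.T @ Y
                S2 += U.T @ PhiX
                n_avg += 1
            growth = np.abs(Y).sum(axis=0) / np.abs(X).sum(axis=0)
            N = growth**alpha * N**(1.0 - alpha)   # Eq. (3)
            X = Y / N                       # X <- A Phi(X) [N]^(-1)
        lam = eig(S1 / n_avg, S2 / n_avg, right=False)   # Eq. (2)
        return np.sort(lam.real)[::-1]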
V. NUMERICAL EXPERIMENTS

In order to assess the performance of our method, we apply it to the full configuration interaction (FCI) problem from quantum chemistry. The FCI Hamiltonian matrix H encodes the physics of interacting electrons in a field of fixed nuclei. It is expressed in a basis of Slater determinants, each representing a configuration of N electrons in M single-electron orbitals, here taken to be canonical Hartree-Fock orbitals. The number of such Slater determinant basis elements depends combinatorially on M and N. For many molecules, symmetry can be used to construct a basis in which the Hamiltonian matrix is block diagonal. Here, we consider only the spatial (i.e., point-group) symmetry of the nuclei, although using other symmetries (e.g., electron spin or angular momentum) could yield a matrix with smaller blocks. Maintaining orthogonality between states in different symmetry blocks in randomized iterative numerical linear algebra methods is often straightforward.31,41 We focus on the more challenging task of estimating multiple eigenvalues within a single block, namely the one containing the lowest-energy eigenvalue.

Elements of H are given by the Slater-Condon rules,42,43 which also determine its sparsity structure; only O(N^2 M^2) elements per column are nonzero. We subtracted the Hartree-Fock energy from all diagonal elements. Table I lists the parameters defining the FCI matrix for each molecular system considered here. We calculated matrix elements, point-group symmetry labels, and reference eigenvalues to which results are compared using the PySCF software.44 The lowest-energy eigenvalues of H are typically of greatest physical interest, so we apply randomized subspace iteration to A = I - εH instead of H itself. The lowest-energy eigenvectors of H are the dominant eigenvectors of A for ε > 0 sufficiently small in magnitude. We used ε = 10^(-3) E_h^(-1) for all systems. Eigenvalues E_i of H are related to those λ_i of A as E_i = ε^(-1)(1 - λ_i). Errors in Hamiltonian eigenvalues of < 1 mE_h are typically necessary for chemical accuracy.45

TABLE I. Parameters used in numerical calculations of FCI eigenvalues. N and M denote the number of active electrons and orbitals, respectively, for each system, as determined by the choice of single-electron basis. N_FCI is the dimension of the block of the FCI matrix containing the lowest-energy eigenvalue.

    System          Nuclear separation   Single-electron basis   (N, M)    N_FCI/10^6
    Ne              -                    aug-cc-pVDZ             (8, 22)   6.69
    equilibrium C2  1.27273 Å            cc-pVDZ                 (8, 26)   27.9
    stretched C2    2.22254 Å            cc-pVDZ                 (8, 26)   27.9

Two different kinds of statistical error contribute to differences between eigenvalue estimates and exact eigenvalues. The standard error reflects the stochastic variability of estimates after a certain number of iterations. The asymptotic bias represents the absolute difference between the estimate after infinitely many iterations and the exact eigenvalue. Increasing m in compression operations decreases both the standard error and the asymptotic bias. Averaging over more iterations decreases only the standard error, which scales asymptotically as the inverse square root of the number of iterations included.46,47 We estimate the standard error in each eigenvalue estimate Λ_k by applying error estimation techniques to the scalar-valued trajectory quantity

    f^(i)_k = z_k* [ U* A Φ(X^(i)) - Λ_k U* X^(i) ] w_k    (4)

where w_k represents the corresponding right generalized eigenvector from (2) and z_k the corresponding left eigenvector. Because subsequent iterate matrices are correlated, i.e., X^(i+1) depends on X^(i), a suitable error estimation approach is needed to account for the corresponding correlations in f^(i)_k. Here we use the emcee software,48 as described previously.27 In our numerical experiments, we exclude a number of iterations at the beginning of each trajectory to ensure iterates are sufficiently equilibrated before we begin averaging.49
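A hedged sketch of this error estimate, assuming the k × k products U*AΦ(X^(i)) and U*X^(i) have been stored at every averaged iteration; the variable names are ours, and emcee's integrated autocorrelation time stands in for the analysis of Refs. 27 and 48:

    import numpy as np
    import emcee

    def eigenvalue_standard_error(S1_traj, S2_traj, lam_k, w_k, z_k):
        # f_k^(i) of Eq. (4) along the trajectory; its fluctuations,
        # corrected for autocorrelation, give the standard error.
        f = np.array([(z_k.conj() @ (S1 - lam_k * S2) @ w_k).real
                      for S1, S2 in zip(S1_traj, S2_traj)])
        tau = emcee.autocorr.integrated_time(f, tol=0)[0]
        return f.std() * np.sqrt(tau / len(f))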

Figure 1 shows estimates of the first ten (lowest-energy) eigenvalues of the FCI Hamiltonian matrix for the Ne atom in the aug-cc-pVDZ basis set, with dimensions of approximately 7 million by 7 million. Results in the upper panel are obtained by compressing iterate matrices to m = 10,000 nonzero elements per column in each iteration. We show the instantaneous Ritz values (computed from (2) without matrix averaging) for the first 20,000 iterations, after which we begin averaging the k × k matrices according to (2). Eigenvalues of the averaged matrices show significantly decreased fluctuations.

[Figure 1: two panels of eigenvalue estimates (mE_h) versus number of iterations; plot data not reproduced in this text extraction.]

FIG. 1. (top) After an initial equilibration period (20,000 iterations), estimates of the first ten FCI eigenvalues for Ne (obtained by (2)) converge to exact eigenvalues to within small asymptotic biases when iterate matrices are compressed to m = 10,000 nonzero elements per column. (bottom) After 50,000 iterations, eigenvalue estimates obtained with m = 50,000 exhibit less error than those with m = 10,000. All estimates agree with exact eigenvalues to within 0.35 mE_h.

Eigenvalue estimates after 50,000 iterations are shown in the lower panel of Figure 1. With m = 10,000, estimates differ from exact eigenvalues by less than 0.32 mE_h. These differences can be attributed to the remaining sources of statistical bias in our method. All standard errors are less than 0.004 mE_h. Retaining more nonzero elements in compression operations (m = 50,000) reduces the asymptotic biases to < 0.02 mE_h and the standard errors to < 0.0003 mE_h.

For comparison, we replaced the averaging of matrices in (2) by averaging of the instantaneous Ritz values from each iteration of the same m = 10,000 calculation. The resulting eigenvalue estimates have biases as large as 1.48 mE_h, almost five times larger. We also tested a randomization of a more standard subspace iteration that uses quadratic operations in the orthogonalization and eigenvalue estimation steps. Details are presented in the Materials and Methods section below.
The best estimates from this method with m = 10,000 are 43 to 161 mE_h greater than exact eigenvalues. This bias could potentially be reduced by replacing one instance of the iterate matrix in (C1) by one from an independently generated replica trajectory and omitting the orthogonalization step. However, as we have already mentioned, the variance in the inner products between uncorrelated vectors from independent trajectories can scale unfavorably with n. This strategy has been tested previously in the context of the FCI problem.35 An alternative scheme introduced in Ref. 29, in the context of the continuous-space diffusion Monte Carlo algorithm, avoids this variance scaling by using the output of that single-vector ground-state estimation scheme to approximate the same inner products. That scheme is an early instance of an algorithm often referred to as the Variational Approach to Conformational dynamics (VAC).50–52 In both the independent replica approach and in VAC, one needs to select a "time-lag" parameter sufficiently long to avoid bias due to the imperfect choice of trial vectors, but sufficiently short to avoid collapse to the dominant eigenvector.

In order to evaluate the performance of our randomized subspace iteration for larger matrices, we apply it to the carbon dimer molecule C2, with computational details given in Table I and results given in Table II. For C2 at its equilibrium bond length, with iterates compressed to m = 150,000 nonzero elements per column, estimates of the ten lowest-energy eigenvalues differ from exact eigenvalues by less than 0.09 mE_h. Standard errors after 50,000 iterations (30,000 of which were included in matrix averages) are less than 4 × 10^(-5) mE_h.

TABLE II. Exact eigenvalues (in mE_h) for the equilibrium and stretched C2 systems, and those calculated by randomized subspace iteration with m = 150,000 or m = 200,000, respectively.

    equilibrium C2 (m = 150,000)    stretched C2 (m = 200,000)
    Exact       Estimated           Exact       Estimated
    -343.40     -343.39             -346.33     -346.32
    -264.67     -264.67             -343.50     -343.50
    -254.34     -254.34             -322.30     -322.33
    -155.40     -155.37             -298.45     -298.45
    -104.86     -104.84             -251.24     -251.22
    -85.17      -85.15              -250.07     -250.05
    -48.91      -48.92              -241.71     -241.72
    -19.33      -19.35              -208.54     -208.15
    13.64       13.63               -201.02     -199.05
    26.44       26.53               -190.49     -190.01

The stretched C2 molecule exhibits stronger electron correlation than the equilibrium C2 molecule and therefore serves as a more rigorous test for numerical eigenvalue methods in general. Deterministic subspace iteration converges more slowly for stretched C2 because differences between eigenvalues are smaller (Table II).36 Randomized subspace iteration performs worse for stretched C2 than for equilibrium C2 when applied with the same parameters (m = 150,000). As iteration proceeds, random fluctuations in elements of the matrices U* A Φ(X^(i)) and U* X^(i) increase, and ultimately the matrices U* X^(i) become singular. This makes it impossible to obtain accurate eigenvalue estimates. In contrast, these fluctuations are significantly reduced when iterates are instead compressed to m = 200,000 nonzero elements per column. In this case, eigenvalue estimates differ from exact eigenvalues by less than 2 mE_h, with standard errors less than 9 × 10^(-5) mE_h. A longer equilibration period (35,000 iterations) was needed to obtain accurate estimates for this problem. These results suggest that each problem may require a minimum value of m to achieve reliable convergence. Others53,54 have observed similarly abrupt changes in variance with the amount of sampling in randomized methods for single eigenvalues, but in those cases the variance can in principle be reduced just by averaging over more iterations.28 In contrast, insufficient sampling in randomized subspace iteration precludes the estimation of k eigenvalues due to numerical issues encountered when solving (2), an issue that cannot be addressed by including more iterations.

VI. CONCLUSIONS

Incorporating randomization techniques into iterative linear algebra methods can enable substantial gains in their computational efficiency. Here we describe a general and effective technique for estimating multiple eigenvalues using existing randomized iterative approaches. We demonstrate its effectiveness in numerical experiments on quantum chemistry problems, where our method can be understood as a generalization of projector Monte Carlo methods to excited states. Even when the number of elements retained in each iteration is less than 1% of the dimension of the matrix, we obtain eigenvalue estimates with sufficient accuracy for three different problems in quantum chemistry.

Many techniques for reducing error or improving efficiency originally developed for other randomized iterative linear algebra methods can be adapted for this method.25,27,28,55 In particular, factorization and compression of the Hamiltonian matrix at each iteration, as described in Ref. 27, is likely needed for higher-dimensional quantum chemistry applications. Alternatively, our approach could be combined with stochastic continuous-space techniques,29,32 which in the quantum context would provide a new diffusion Monte Carlo approach for excited states. Given that deterministic implementations of other iterative linear algebra methods (e.g., Jacobi-Davidson4,5) generally converge faster than the subspace iteration considered here, investigating randomized implementations of these other methods may also prove useful.56 Finally, despite their success in high-dimensional applications, theoretical guarantees for randomized iterative algorithms are lacking. Tackling the complicated correlations between iterates to precisely characterize the dependence of their error on dimension is a pressing and ambitious goal.11

Appendix A: Pivotal compression

We use an approach to stochastic vector compression, based on pivotal sampling,57 that differs from the best-performing approach from previous work on fast randomized iterative methods,11,27,28 which was based on systematic sampling. The pivotal approach was found to perform at least as well as, and in some cases better than, the systematic approach, as measured by statistical error in test calculations. The remainder of this section describes the steps involved in constructing the (random) compressed vector Φ(x) from an input vector x with this pivotal approach. The user-specified parameter m determines the number of nonzero elements in Φ(x).

As in the systematic approach, a set D containing the indices of the d largest-magnitude elements in x is first constructed. The number d is chosen such that each of the remaining elements in x satisfies

    |x_i| (m - d) < Σ_{k∉D} |x_k|,    i ∉ D    (A1)

Elements in D are preserved exactly in the compression operation. A number g = m - d of elements are then sampled from among the remaining elements in x using pivotal sampling. This randomly selected set of indices is denoted as S. The probability that the index i is included in S is

    p_i = 0                            if i ∈ D
    p_i = g |x_i| / Σ_{k∉D} |x_k|      if i ∉ D    (A2)

By constructing the compressed vector as

    Φ(x)_i = x_i          if i ∈ D
    Φ(x)_i = x_i / p_i    if i ∈ S
    Φ(x)_i = 0            if i ∉ D and i ∉ S    (A3)

we ensure that the expected value of each element of Φ(x) equals the corresponding element in x. Note that Σ_i p_i = g, and the condition in (A1) ensures that p_i < 1 for all i. Both the systematic and pivotal approaches ensure that each index can be selected at most once in the sampling procedure.
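In code, constructing D and the probabilities p_i reduces to a greedy scan in order of decreasing magnitude. A hedged sketch with our own function names; pivotal_sample, which selects S, is sketched after the description of the sampling algorithm below:

    import numpy as np

    def compress_pivotal(x, m, rng=np.random.default_rng()):
        # Eqs. (A1)-(A3): keep the d largest-magnitude entries
        # exactly, then sample g = m - d of the remainder and
        # reweight so that E[Phi(x)] = x.
        x = np.asarray(x, dtype=float)
        order = np.argsort(-np.abs(x))    # indices by decreasing |x_i|
        tail = np.abs(x).sum()            # mass of entries not yet in D
        d = 0
        while d < min(m, x.size) and abs(x[order[d]]) * (m - d) >= tail:
            tail -= abs(x[order[d]])      # move entry into D, per (A1)
            d += 1
        out = np.zeros_like(x)
        out[order[:d]] = x[order[:d]]     # D is preserved exactly
        if d < m and tail > 0:
            rest = order[d:]
            p = (m - d) * np.abs(x[rest]) / tail   # Eq. (A2); p_i < 1
            for j in pivotal_sample(p, rng):
                out[rest[j]] = x[rest[j]] / p[j]   # Eq. (A3)
        return out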
proach could be combined with stochastic continuous- The following paragraphs describe a particular pivotal space techniques,29,32 which in the quantum context approach to randomly selecting the indices in , although S would provide a new diffusion Monte Carlo approach for it can be applied to sampling problems more generally. excited states. Given that deterministic implementations Pivotal sampling proceeds via a recursive algorithm. In of other iterative linear algebra methods (e.g. Jacobi- each iteration, the algorithm is applied to a subset of vec- Davidson4,5) generally converge faster than the subspace tor indices, here denoted as (i1, i2, ..., in). Each index ik iteration considered here, investigating randomized im- in this subset has an associated inclusion probability p˜k. plementations of these other methods may also prove use- In the first iteration, this subset contains all indices, and ful.56 Finally, despite their success in high-dimensional all inclusion probabilities equal the probabilities defined applications, theoretical guarantees for randomized iter- in (A2), i.e.p ˜k = pik . The first step in each iteration ative algorithms are lacking. Tackling the complicated involves randomly selecting an index h from among the correlations between iterates to precisely characterize the indices (i1, i2, ..., is), where s is the maximum value of Pj dependence of their error on dimension is a pressing and j for which k=1 p˜k < 1. The probability of selecting 11 ambitious goal. each index ik in this step is proportional to its inclusion 6 probability,p ˜k. It is then decided whether to include pling algorithm in order to allow its straightforward par- the selected index h in according to two parameters, allelization. The vector containing the elements of the defined as S vector p of probabilities on a particular process k is de- (k) s noted as q . The first step of this parallel approach X involves apportioning the sampling budget g among the a = 1 p˜k (A4) − processes. The number gk of elements assigned to each k=1 process is a random number satisfying −1 and b =p ˜s+1 a. With probability (1 (1 b) a), (k) − − − q 1 h is included in . In this case, the next iteration E[gk] = g || || (B1) S p 1 of the recursive algorithm proceeds with application to || || the indices (i , i , ..., i ), with inclusion probabilities s+1 s+2 n where 1 denotes the one-norm of a vector, i.e. the sum (b, p˜s+2, p˜s+3, ..., p˜n). Otherwise, the index is+1 is instead of the||·|| magnitudes of its elements. The sum of the bud- included in , in which case, the algorithm is next applied gets for all processes is constrained to equal the total bud- to the indicesS (h, i , i , ..., i ) with inclusion proba- P s+2 s+3 n get, i.e. gk = g. There are many possible approaches bilities (b, p˜ , p˜ , ..., p˜ ). k s+2 s+3 n to generating these random integers gk subject to these constraints. Here, we choose gk = E[gk] + rk, where rk is a random integer, either 0 or 1.b Thec values of k for Appendix B: Parallelizing vector compression which rk = 1 are chosen by sampling c elements from the vector t using the pivotal scheme described above, where This section describes possible strategies for paralleliz- X ing the two steps involved in compressing a given input c = g E[gk] (B2) − b c vector x. We assume that elements of x are distributed k among n parallel processes, not necessarily uniformly, in and elements of t are given as arbitrary order. 
Appendix B: Parallelizing vector compression

This section describes possible strategies for parallelizing the two steps involved in compressing a given input vector x. We assume that elements of x are distributed among n parallel processes, not necessarily uniformly, in arbitrary order.

The greatest-magnitude elements in x can be identified by independently heaping the elements on each process by magnitude and maintaining the heaps as elements are removed from consideration. The most straightforward approach to calculating the number d of elements to be preserved exactly involves adding indices to D sequentially, in order of decreasing magnitude of the corresponding elements |x_i|, stopping when the criterion in (A1) is satisfied for all remaining elements.27 An alternative, more parallelizable approach is based on the observation that elements need not be considered in strict order of decreasing magnitude. If, at an intermediate stage of constructing D, there exists an i not yet in D for which |x_i| (m - d) ≥ Σ_{k∉D} |x_k|, then i is guaranteed to later be included in D, even if there exist greater-magnitude elements that have not yet been included. The converse is not necessarily true. Thus, the criterion in (A1) can be evaluated for the greatest-magnitude elements on each process in parallel, initially using d = 0 and the sum of the magnitudes of elements on all processes. Periodically, but not necessarily after the addition of each element to D, this sum of magnitudes can be re-evaluated and communicated across all processes, along with an updated value of d based on the total number of elements added. This iterative procedure terminates only when the criterion in (A1) is not satisfied for the greatest-magnitude element on every process. A parallel implementation of this algorithm is included in the open-source FRIES software on GitHub.58

Although the second step, i.e., constructing S by pivotal sampling, can be parallelized in the form described above, its implementation is complicated, particularly in comparison to the systematic sampling approach we used previously. In order to simplify the implementation, we developed a general approach to reformulating any sampling algorithm in order to allow its straightforward parallelization. The vector containing the elements of the vector p of probabilities on a particular process k is denoted as q^(k). The first step of this parallel approach involves apportioning the sampling budget g among the processes. The number g_k of elements assigned to each process is a random number satisfying

    E[g_k] = g ‖q^(k)‖_1 / ‖p‖_1    (B1)

where ‖·‖_1 denotes the one-norm of a vector, i.e., the sum of the magnitudes of its elements. The sum of the budgets for all processes is constrained to equal the total budget, i.e., Σ_k g_k = g. There are many possible approaches to generating these random integers g_k subject to these constraints. Here, we choose g_k = ⌊E[g_k]⌋ + r_k, where r_k is a random integer, either 0 or 1. The values of k for which r_k = 1 are chosen by sampling c elements from the vector t using the pivotal scheme described above, where

    c = g - Σ_k ⌊E[g_k]⌋    (B2)

and elements of t are given as

    t_k = E[g_k] - ⌊E[g_k]⌋    (B3)

This sampling step can be performed efficiently without parallelization, since the size n of the vector t is assumed to be small.
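A hedged sketch of this apportionment step, with our own names; pivotal_sample is the routine sketched in Appendix A:

    import numpy as np

    def apportion_budget(q_norms, g, rng=np.random.default_rng()):
        # q_norms[k] = ||q^(k)||_1 for each of the n processes.
        expected = g * q_norms / q_norms.sum()   # E[g_k], Eq. (B1)
        base = np.floor(expected).astype(int)
        t = expected - base                      # fractional parts, Eq. (B3)
        c = g - base.sum()                       # leftover units, Eq. (B2)
        r = np.zeros_like(base)
        if c > 0:
            r[pivotal_sample(t, rng)] = 1        # sum(t) = c by construction
        return base + r                          # the g_k, summing to g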

A number g_k of elements are then selected from each process independently and in parallel. The probabilities on each process k must be adjusted to ensure that their sum is g_k. This adjustment is performed differently depending on whether g_k > E[g_k] or g_k < E[g_k]. No adjustment is needed if g_k = E[g_k], i.e., if E[g_k] is integer-valued. We define the vector y^(k), with elements

    y^(k)_i = min{1, q^(k)_i / t_k}                  if g_k > E[g_k]
    y^(k)_i = max{0, (q^(k)_i - t_k) / (1 - t_k)}    if g_k < E[g_k]    (B4)

and the vector z^(k), with elements

    z^(k)_j = Σ_{i=1}^{j} y^(k)_i + Σ_{i=j+1}^{e_k} q^(k)_i    (B5)

where e_k is the total number of elements in the vector q^(k). The index h is calculated as the minimum value of j satisfying z^(k)_j ≥ g_k if g_k > E[g_k], and z^(k)_j ≤ g_k if g_k < E[g_k]. The adjusted probabilities are then calculated as

    q′^(k)_i = y^(k)_i                                                  if i < h
    q′^(k)_i = g_k - Σ_{j=1}^{h-1} y^(k)_j - Σ_{j=h+1}^{e_k} q^(k)_j    if i = h
    q′^(k)_i = q^(k)_i                                                  if i > h    (B6)

This particular approach was chosen to minimize the number of probabilities to be recalculated: only the first h elements of y^(k) and z^(k) are needed. After calculation of the adjusted probabilities, g_k elements can be selected using standard sampling techniques, e.g., according to the pivotal scheme described above.
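A hedged sketch of this adjustment for a single process, using 0-based indexing and our own names; the middle case of (B6) is written so that the adjusted probabilities sum exactly to g_k:

    import numpy as np

    def adjust_probabilities(q, g_k, t_k):
        # q: this process's probabilities, summing to E[g_k];
        # g_k: its integer budget; t_k: fractional part from Eq. (B3).
        up = g_k > q.sum()
        if up:
            y = np.minimum(1.0, q / t_k)                    # Eq. (B4)
        else:
            y = np.maximum(0.0, (q - t_k) / (1.0 - t_k))    # Eq. (B4)
        z = np.cumsum(y) + q.sum() - np.cumsum(q)           # Eq. (B5)
        h = int(np.argmax(z >= g_k)) if up else int(np.argmax(z <= g_k))
        return np.concatenate([y[:h],                       # Eq. (B6)
                               [g_k - y[:h].sum() - q[h + 1:].sum()],
                               q[h + 1:]])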

Appendix C: Randomizing a standard subspace iteration

This section describes the application of stochastic compression techniques to a more standard subspace iteration, namely one that relies more upon nonlinear operations on the iterates X^(i). By comparing this "nonlinear" approach to the method in the main text (here referred to as "linear randomized subspace iteration"), we further elucidate the advantages of the non-standard features of linear randomized subspace iteration. Unlike in the linear approach, multiplication by G^(i) enforces orthonormality in the full n-dimensional vector space. At intervals of 1000 iterations, G^(i) is constructed from a Gram-Schmidt orthogonalization of the iterate columns, which requires inner products of pairs of random vectors. In other iterations, G^(i) is a diagonal matrix containing the ℓ1-norms of the iterate columns.

In each iteration, eigenvalues can be estimated in the process of forming the next iterate, after performing compression. Applying the Rayleigh-Ritz method to the compressed iterate Φ(X^(i)) yields the generalized eigenvalue equation

    Φ(X^(i))* A Φ(X^(i)) W^(i) = Φ(X^(i))* Φ(X^(i)) W^(i) Λ^(i)    (C1)

to be solved for the matrix Λ^(i) of Ritz values. Because this equation involves quadratic inner products of the compressed iterates, the resulting eigenvalue estimates are variational. Applying this approach with iterates compressed to m = 10,000 to the Ne system defined in the main text yields the eigenvalue estimates presented in Figure 2. The best (i.e., minimum) estimates differ from the exact eigenvalues by as much as 201 mE_h, indicating that the spans of the individual iterate matrices differ greatly from the dominant eigenspace of A.

[Figure 2: panels of eigenvalue estimates (mE_h) versus iterations, one per eigenvalue; plot data not reproduced in this text extraction.]

FIG. 2. Estimates of the ten lowest-energy eigenvalues for Ne from symmetric randomized subspace iteration with iterate matrices compressed to m = 10,000 nonzero elements per column. Note the significantly greater errors relative to linear randomized subspace iteration (Figure 1). Dashed lines indicate exact eigenvalues.

One might expect that averaging can be used to improve the accuracy of these eigenvalue estimates. However, because this approach is variational, averaging the Ritz values themselves yields estimates with at least as much error as the minimum eigenvalue estimates considered above. Instead averaging the k × k matrices Φ(X^(i))* A Φ(X^(i)) and Φ(X^(i))* Φ(X^(i)), in analogy to linear randomized subspace iteration, yields poorer estimates. These differ from the exact eigenvalues by 299 to 10,590 mE_h. Although the computational costs and memory requirements for this nonlinear approach are approximately the same as for the method in the main text, it is impossible to extract eigenvalue estimates of similar accuracy. This underscores the importance of reducing the variance in quantities to which nonlinear operations are applied in randomized subspace iteration.
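For contrast with the projected estimator of the main text, a hedged sketch of the quadratic estimator (C1), with our own names; the product Φ(X^(i))*Φ(X^(i)) on the right-hand side is exactly the kind of high-variance quantity that linear randomized subspace iteration avoids:

    import numpy as np
    from scipy.linalg import eig

    def quadratic_ritz_values(A, PhiX):
        # Rayleigh-Ritz on the compressed iterate itself, Eq. (C1).
        lhs = PhiX.conj().T @ (A @ PhiX)
        rhs = PhiX.conj().T @ PhiX
        return np.sort(eig(lhs, rhs, right=False).real)[::-1]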
Appendix D: Data Availability

All data from the numerical experiments presented here is available at https://doi.org/10.5281/zenodo.4624477. The code used to perform our numerical experiments can be accessed at https://github.com/sgreene8/FRIES.

ACKNOWLEDGMENTS

We gratefully acknowledge productive discussions with Aaron Dinner, Michael Lindsey, Verena Neufeld, and James Smith throughout the development and execution of this project. Benjamin Pritchard provided invaluable suggestions for improving the readability and efficiency of our source code. S.M.G. is supported by an investment fellowship from the Molecular Sciences Software Institute, which is funded by U.S. National Science Foundation grant OAC-1547580. R.J.W. is supported by New York University's Dean's Dissertation Fellowship and by the National Science Foundation through award DMS-1646339. J.W. acknowledges support from the Advanced Scientific Computing Research Program within the DOE Office of Science through award DE-SC0020427. The Flatiron Institute is a division of the Simons Foundation. Computational resources were provided by the Research Computing Center at the University of Chicago and the High Performance Computing Center at New York University.

REFERENCES

1. G. W. Stewart, Matrix Algorithms, Volume I: Basic Decompositions (Society for Industrial and Applied Mathematics, Philadelphia, PA, 1998).
2. G. W. Stewart, Matrix Algorithms, Volume II: Eigensystems (Society for Industrial and Applied Mathematics, Philadelphia, PA, 2001).
3. C. Lanczos, "An iteration method for the solution of the eigenvalue problem of linear differential and integral operators," J. Res. Natl. Bur. Stand. 45, 255–282 (1950).
4. E. R. Davidson, "The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices," J. Comput. Phys. 17, 87–94 (1975).
5. G. L. G. Sleijpen and H. A. Van der Vorst, "A Jacobi-Davidson iteration method for linear eigenvalue problems," SIAM J. Matrix Anal. Appl. 17, 401–425 (1996).
6. T. Koma, "A new Monte Carlo power method for the eigenvalue problem of transfer matrices," J. Stat. Phys. 71, 269–297 (1993).
7. M. P. Nightingale and H. W. J. Blöte, "Transfer-matrix Monte Carlo estimates of critical points in the simple-cubic Ising, planar, and Heisenberg models," Phys. Rev. B 54, 1001–1008 (1996).
8. I. T. Dimov, A. N. Karaivanova, and P. I. Yordanova, "Monte Carlo algorithms for calculating eigenvalues," in Monte Carlo and Quasi-Monte Carlo Methods 1996, edited by H. Niederreiter, P. Hellekalek, G. Larcher, and P. Zinterhof (Springer, New York, NY, 1998) pp. 205–220.
9. G. H. Booth, A. J. W. Thom, and A. Alavi, "Fermion Monte Carlo without fixed nodes: A game of life, death, and annihilation in Slater determinant space," J. Chem. Phys. 131, 054106 (2009).
10. Y. Ohtsuka and S. Nagase, "Projector Monte Carlo method based on Slater determinants: Test application to singlet excited states of H2O and LiF," Chem. Phys. Lett. 485, 367–370 (2010).
11. L.-H. Lim and J. Weare, "Fast randomized iteration: Diffusion Monte Carlo through the lens of numerical linear algebra," SIAM Rev. 59, 547–587 (2017).
12. P. Drineas, R. Kannan, and M. W. Mahoney, "Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication," SIAM J. Comput. 36, 132–157 (2006).
13. P. Drineas, R. Kannan, and M. W. Mahoney, "Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix," SIAM J. Comput. 36, 158–183 (2006).
14. P. Drineas, R. Kannan, and M. W. Mahoney, "Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition," SIAM J. Comput. 36, 184–206 (2006).
15. T. Sarlos, "Improved approximation algorithms for large matrices via random projections," in 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06) (2006) pp. 143–152.
16. E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert, "Randomized algorithms for the low-rank approximation of matrices," Proc. Natl. Acad. Sci. 104, 20167–20172 (2007).
17. V. Rokhlin and M. Tygert, "A fast randomized algorithm for overdetermined linear least-squares regression," Proc. Natl. Acad. Sci. 105, 13212–13217 (2008).
18. F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, "A fast randomized algorithm for the approximation of matrices," Appl. Comput. Harmon. Anal. 25, 335–366 (2008).
19. M. W. Mahoney and P. Drineas, "CUR matrix decompositions for improved data analysis," Proc. Natl. Acad. Sci. 106, 697–702 (2009).
20. P.-G. Martinsson and J. A. Tropp, "Randomized numerical linear algebra: Foundations and algorithms," Acta Numer. 29, 403–572 (2020).
21. H. Ji, M. Mascagni, and Y. Li, "Convergence analysis of Markov chain Monte Carlo linear solvers using Ulam–von Neumann algorithm," SIAM J. Numer. Anal. 51, 2107–2122 (2013).
22. J. Lu and Z. Wang, "The full configuration interaction quantum Monte Carlo method in the lens of inexact power iteration," arXiv:1711.09153v3 (2017).
23. A. Andoni, R. Krauthgamer, and Y. Pogrow, "On solving linear systems in sublinear time," in 10th Innovations in Theoretical Computer Science Conference (ITCS 2019), Leibniz International Proceedings in Informatics (LIPIcs), Vol. 124, edited by A. Blum (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2018) pp. 3:1–3:19.
24. A. Ozdaglar, D. Shah, and C. L. Yu, "Asynchronous approximation of a single component of the solution to a linear system," IEEE Trans. Netw. Sci. Eng. 7, 975–986 (2019).
25. D. Cleland, G. H. Booth, and A. Alavi, "Communications: Survival of the fittest: Accelerating convergence in full configuration-interaction quantum Monte Carlo," J. Chem. Phys. 132 (2010).
26. M. Motta and S. Zhang, "Ab initio computations of molecular systems by the auxiliary-field quantum Monte Carlo method," Wiley Interdiscip. Rev.: Comput. Mol. Sci. 8, e1364 (2018).
27. S. M. Greene, R. J. Webber, J. Weare, and T. C. Berkelbach, "Beyond walkers in stochastic quantum chemistry: Reducing error using fast randomized iteration," J. Chem. Theory Comput. 15, 4834–4850 (2019).
28. S. M. Greene, R. J. Webber, J. Weare, and T. C. Berkelbach, "Improved fast randomized iteration approach to full configuration interaction," J. Chem. Theory Comput. 16, 5572–5585 (2020).
29. D. M. Ceperley and B. Bernu, "The calculation of excited state properties with quantum Monte Carlo," J. Chem. Phys. 89, 6316–6328 (1988).
30. N. S. Blunt, S. D. Smart, G. H. Booth, and A. Alavi, "An excited-state approach within full configuration interaction quantum Monte Carlo," J. Chem. Phys. 143, 134117 (2015).
31. J. Feldt and C. Filippi, "Excited-state calculations with quantum Monte Carlo," arXiv:2002.03622 (2020).
32. W. M. C. Foulkes, L. Mitas, R. J. Needs, and G. Rajagopal, "Quantum Monte Carlo simulations of solids," Rev. Mod. Phys. 73, 33–83 (2001).
33. C. Overy, G. H. Booth, N. S. Blunt, J. J. Shepherd, D. Cleland, and A. Alavi, "Unbiased reduced density matrices and electronic properties from full configuration interaction quantum Monte Carlo," J. Chem. Phys. 141, 244117 (2014).
34. N. S. Blunt, A. Alavi, and G. H. Booth, "Krylov-projected quantum Monte Carlo method," Phys. Rev. Lett. 115, 050603 (2015).
35. N. S. Blunt, A. Alavi, and G. H. Booth, "Nonlinear biases, stochastically sampled effective Hamiltonians, and spectral functions in quantum Monte Carlo methods," Phys. Rev. B 98, 085118 (2018).
36. G. W. Stewart, "Accelerating the orthogonal iteration for the eigenvectors of a Hermitian matrix," Numer. Math. 13, 362–376 (1969).
37. G. Stewart, "Methods of simultaneous iteration for calculating eigenvectors of matrices," in Topics in Numerical Analysis II, edited by J. J. Miller (Academic Press, 1975) pp. 185–196.
38. Y. Saad, Numerical Methods for Large Eigenvalue Problems, 2nd ed. (Society for Industrial and Applied Mathematics, 2011).
39. G. H. Booth, S. D. Smart, and A. Alavi, "Linear-scaling and parallelisable algorithms for stochastic quantum chemistry," Mol. Phys. 112, 1855–1869 (2014).
40. W. A. Vigor, J. S. Spencer, M. J. Bearpark, and A. J. W. Thom, "Minimising biases in full configuration interaction quantum Monte Carlo," J. Chem. Phys. 142, 104101 (2015).
41. R. M. Grimes, B. L. Hammond, P. J. Reynolds, and W. A. Lester, "Quantum Monte Carlo approach to electronically excited molecules," J. Chem. Phys. 85, 4749–4750 (1986).
42. J. C. Slater, "The theory of complex spectra," Phys. Rev. 34, 1293–1322 (1929).
43. E. U. Condon, "The theory of complex spectra," Phys. Rev. 36, 1121–1133 (1930).
44. Q. Sun, T. C. Berkelbach, N. S. Blunt, G. H. Booth, S. Guo, Z. Li, J. Liu, J. D. McClain, E. R. Sayfutyarova, S. Sharma, S. Wouters, and G. K.-L. Chan, "PySCF: the Python-based simulations of chemistry framework," Wiley Interdiscip. Rev.: Comput. Mol. Sci. 8, e1340 (2018).
45. M. Bogojeski, L. Vogt-Maranto, M. E. Tuckerman, K.-R. Müller, and K. Burke, "Quantum chemical accuracy from density functional approximations via machine learning," Nat. Commun. 11, 5223 (2020).
46. K. L. Chung, "Further limit theorems," in Markov Chains with Stationary Transition Probabilities (Springer, Berlin, 1960) pp. 93–106.
47. A. Sokal, "Monte Carlo methods in statistical mechanics: Foundations and new algorithms," in Functional Integration, NATO ASI Series B: Physics, Vol. 351, edited by C. DeWitt-Morette, P. Cartier, and A. Folacci (Springer, Boston, MA, 1997) pp. 131–192.
48. D. Foreman-Mackey, D. W. Hogg, D. Lang, and J. Goodman, "emcee: The MCMC Hammer," arXiv:1202.3665 (2013).
49. J. D. Chodera, "A simple method for automated equilibration detection in molecular simulations," J. Chem. Theory Comput. 12, 1799–1805 (2016).
50. F. Noé and F. Nüske, "A variational approach to modeling slow processes in stochastic dynamical systems," Multiscale Model. Simul. 11, 635–655 (2013).
51. R. J. Webber, E. H. Thiede, D. Dow, A. R. Dinner, and J. Weare, "Error bounds for dynamical spectral estimation," SIAM J. Math. Data Sci. 3, 225–252 (2021).
52. C. Lorpaiboon, E. H. Thiede, R. J. Webber, J. Weare, and A. R. Dinner, "Integrated variational approach to conformational dynamics: A robust strategy for identifying eigenfunctions of dynamical operators," J. Phys. Chem. B 124, 9354–9364 (2020).
53. J. S. Spencer, N. S. Blunt, and W. M. Foulkes, "The sign problem and population dynamics in the full configuration interaction quantum Monte Carlo method," J. Chem. Phys. 136, 054110 (2012).
54. W. A. Vigor, J. S. Spencer, M. J. Bearpark, and A. J. W. Thom, "Understanding and improving the efficiency of full configuration interaction quantum Monte Carlo," J. Chem. Phys. 144, 094110 (2016).
55. F. R. Petruzielo, A. A. Holmes, H. J. Changlani, M. P. Nightingale, and C. J. Umrigar, "Semistochastic projector Monte Carlo method," Phys. Rev. Lett. 109, 230201 (2012).
56. N. S. Blunt, A. J. Thom, and C. J. Scott, "Preconditioning and perturbative estimators in full configuration interaction quantum Monte Carlo," J. Chem. Theory Comput. 15, 3537–3551 (2019).
57. G. Chauvet, "A comparison of pivotal sampling and unequal probability sampling with replacement," Stat. Probab. Lett. 121, 1–5 (2017).
58. "Fast randomized iteration for electronic structure (FRIES)," https://github.com/sgreene8/FRIES (2021), accessed March 21, 2021.