
Linear scaling computation of the Fock matrix. VI. Data parallel computation of the exchange-correlation matrix

Chee Kwan Gan and Matt Challacombe a)
Theoretical Division, Group T-12, MS B268, Los Alamos National Laboratory, Los Alamos, New Mexico 87545

(Received 4 November 2002; accepted 27 February 2003)

Recently, early onset linear scaling computation of the exchange-correlation matrix has been achieved using hierarchical cubature [J. Chem. Phys. 113, 10037 (2000)]. Hierarchical cubature differs from other methods in that the integration grid is adaptive and purely Cartesian, which allows for a straightforward domain decomposition in parallel computations; the volume enclosing the entire grid may be simply divided into a number of nonoverlapping boxes. In our data parallel approach, each box requires only a fraction of the total density to perform the necessary numerical integrations due to the finite extent of Gaussian-orbital basis sets. This inherent data locality may be exploited to reduce communications between processors as well as to avoid memory and copy overheads associated with data replication. Although the hierarchical cubature grid is Cartesian, naive boxing leads to irregular work loads due to strong spatial variations of the grid and the electron density. In this paper we describe equal time partitioning, which employs time measurement of the smallest sub-volumes (corresponding to the primitive cubature rule) to load balance grid-work for the next self-consistent-field iteration. After start-up from a heuristic center of mass partitioning, equal time partitioning exploits smooth variation of the density and grid between iterations to achieve load balance. With the 3-21G basis set and a medium quality grid, equal time partitioning applied to taxol (62 heavy atoms) attained a speedup of 61 out of 64 processors, while for a 110 molecule water cluster at standard density it achieved a speedup of 113 out of 128. The efficiency of equal time partitioning applied to hierarchical cubature improves as the grid work per processor increases. With a fine grid and the 6-311G(df,p) basis set, calculations on the 26 atom molecule α-pinene achieved a parallel efficiency better than 99% with 64 processors. For more coarse grained calculations, superlinear speedups are found to result from reduced computational complexity associated with data parallelism. © 2003 American Institute of Physics. [DOI: 10.1063/1.1568734]

a) Electronic mail: [email protected]

I. INTRODUCTION

Density functional theory (DFT) and its variant, the hybrid Hartree-Fock/density functional theory (HF/DFT), have proven to be accurate and computationally efficient. These model chemistries are routinely used by conventional Gaussian-orbital quantum chemistry codes to excellent advantage. Recently, significant progress has been achieved in the algorithmic development of O(N) alternatives to these conventional methods, where compute times scale linearly with system size N.1,2 These linear scaling methods overcome a number of bottlenecks of order O(N^2-3) associated with conventional methods, including computation of the exact Hartree-Fock exchange matrix,3-8 the Coulomb matrix,9-14 the exchange-correlation matrix,15-18 and density matrix alternatives to an eigensolution of the self-consistent-field (SCF) equations.19-30

The existence of linear scaling methods has been argued from the concept of "nearsightedness" or quantum locality,31,32 which implies that the density matrix n(r,r') goes to zero as |r - r'| → ∞. The fundamental reason for this is the loss of quantum phase coherence between points that are far apart. It is important to note that quantum locality also implies data locality, which is an essential property for scalable parallel algorithms. With parallel linear scaling methods, an n-fold increase in processors should lead approximately to an n-fold increase in simulation capability. In practice, however, this is true only for scalable algorithms. While there have been continued efforts to parallelize conventional quantum chemistry codes,33-41 taking advantage of quantum locality to achieve parallel efficiency is a largely unexplored but important area of research in linear scaling SCF theory.

One of the most computationally demanding parts of a modern Gaussian-orbital SCF code is numerical integration of the exchange-correlation potential for DFT or HF/DFT calculations. Linear scaling approaches have been proposed to solve this problem.15-18 Most works use Becke42 nuclear weight functions that partition the integrand into a sum of single-center integrations, each with a nuclear origin. Standard numerical techniques may then be used to perform one-center numerical integration in spherical polar coordinates. It should be observed that this transformation overcomes the primary nuclear cusp in the electron density. However, there are additional complexities in the Becke approach.


First, the radial Jacobian cannot suppress cusps at other nuclear centers. These secondary cusps can be damped to zero with a sharper partition, but this has the risk of introducing additional discontinuities in the integrand. Second, increasing the density of the radial grid does not guarantee convergence to a more accurate result. Third, the overlapping nature of the Becke weight functions impedes clean spatial partitioning, which is important for achieving true linear scaling, simple domain decomposition, and data parallelism.

A different approach, hierarchical cubature (HiCu), employs an adaptive telescoping Cartesian grid with primitive cubature rules at the finest level of resolution.18 In HiCu, the hierarchical adaptive grid resolves strong variations and multiple length scales encountered in numerical integration of the exchange-correlation matrix K_xc. The k-d tree data structure43-45 is used to represent the adaptive grid, or CubeTree. The electron density of the system is also represented by a k-d tree data structure, called the RhoTree. Each node in a k-d tree contains a bounding box (BBox) that encloses the spatial data contained by it and its children (if the node is not a leaf node). Because of the hierarchical organization of data in a k-d tree, a search for all data within a given Euclidean distance (i.e., a range query) is an efficient O(log2 N) operation.

The CubeTree is constructed through recursive bisection of a Cartesian domain (i.e., a box) enclosing the electron density, applying a fixed, fully symmetric (C3) cubature rule46 within each BBox. Given a local error threshold τ, the recursive bisection process stops when δ < τ, where

\delta = \Delta\rho + \frac{1}{3}\sqrt{(\Delta\rho_x)^2 + (\Delta\rho_y)^2 + (\Delta\rho_z)^2}. \qquad (1)

Here Δρ is the magnitude of the difference between the exact electron charge in the BBox and the numerically integrated electron charge,

\Delta\rho = \left| \int_{\mathrm{BBox}} \rho(\mathbf{r})\,d\mathbf{r} \;-\; \sum_{i=1}^{N_g} w_i\,\rho(\mathbf{r}_i) \right|, \qquad (2)

where w_i are the grid weights and N_g is the number of grid points in the cubature rule. The magnitude of the difference between the analytic and numerical integrations of the ith component (i.e., i = x, y, or z) of the density gradient ∇ρ, denoted by Δρ_i, is calculated in a fashion similar to Δρ.
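To make the stopping test of Eqs. (1) and (2) concrete, the following is a minimal Python sketch of CubeTree-style recursive bisection, written under stated assumptions: the primitive cubature rule, the density ρ and its gradient, and the reference ("exact") box integrals are placeholders supplied by the caller and are not the HiCu internals.

```python
# Sketch of CubeTree-style adaptive refinement driven by the delta < tau test of
# Eqs. (1)-(2). "rule", "rho", "grad_rho" and "exact" are placeholders supplied
# by the caller; this is not the HiCu implementation.
from dataclasses import dataclass
from math import sqrt

@dataclass
class Box:
    lo: tuple   # (x, y, z) lower corner
    hi: tuple   # (x, y, z) upper corner

def split(box):
    """Bisect a box along its longest edge, returning two child boxes."""
    d = max(range(3), key=lambda k: box.hi[k] - box.lo[k])
    mid = 0.5 * (box.lo[d] + box.hi[d])
    hi1 = list(box.hi); lo2 = list(box.lo)
    hi1[d] = mid; lo2[d] = mid
    return Box(box.lo, tuple(hi1)), Box(tuple(lo2), box.hi)

def refine(box, rule, rho, grad_rho, exact, tau, leaves):
    """Recursively refine until delta = dQ + (1/3)*sqrt(sum_i dG_i^2) < tau."""
    pts, wts = rule(box)                                  # primitive cubature in this box
    num_q = sum(w * rho(p) for p, w in zip(pts, wts))
    d_rho = abs(exact.charge(box) - num_q)                # Eq. (2)
    d_grad = [abs(exact.gradient(box, i) -
                  sum(w * grad_rho(p)[i] for p, w in zip(pts, wts)))
              for i in range(3)]
    delta = d_rho + sqrt(sum(g * g for g in d_grad)) / 3.0    # Eq. (1)
    if delta < tau:
        leaves.append((box, pts, wts))                    # converged leaf node
    else:
        for child in split(box):
            refine(child, rule, rho, grad_rho, exact, tau, leaves)
```

The essential point is that refinement is purely local: a box either satisfies δ < τ with the primitive rule or is bisected along its longest edge, so the grid automatically concentrates work where the density varies strongly.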
Unlike the first implementation of HiCu,18 the current implementation incorporates errors due to the density gradient, which leads to improved convergence properties for functionals that employ the generalized gradient approximation (GGA). Following the spirit of direct SCF methods,47,48 the accuracy of HiCu can be systematically improved by decreasing the threshold τ. We also note that no fitting functions are employed. The nonoverlapping Cartesian approach combined with advanced data structures allows maximal exploitation of quantum locality, leading to an early onset of true linear scaling even for large basis sets and three-dimensional (3-D) systems. This nonoverlapping approach to quantum locality may be leveraged to achieve both data locality and straightforward domain decompositions.

Our purpose in this work is to describe our development of data parallel HiCu, with an emphasis on new, efficient, and generally applicable load balancing algorithms. For the first time, we propose measurement based load balancing (equal time partitioning) for domain decomposition in quantum chemistry. Equal time partitioning exploits the smooth variation of the electron density between SCF cycles (so called temporal locality49) to predict an optimal domain decomposition for the next iteration.

The remainder of this paper is organized as follows: In Sec. II we discuss our strategies to efficiently parallelize the computation of K_xc using HiCu. In Sec. III we discuss the issue of data locality to reduce communications between processors. In Sec. IV we describe a computational implementation of data parallel HiCu. In Sec. V we present the results and discussions of the speedup tests performed on three systems. In Sec. VI we summarize the main conclusions of the paper.

II. PARALLELIZATION OF HICU

In this section we describe the parallelization of HiCu with the goal of obtaining good speedups even for fine-grained parallelism, which we consider to be one heavy atom per processor. This is difficult, as the spatial distribution of the grid-work, which depends on the electron density, is highly irregular for an inhomogeneous system. We note that grid points can experience very different work loads depending on their environment, as the HiCu method will always try to minimize the work associated with gridding the density. To obtain a good speedup, we need to reduce both load imbalance and communications between processors. First we discuss the issue of load balancing. The issue of communication and data parallelism is discussed in Sec. III.

Since HiCu is purely Cartesian, spatial decomposition can be performed by simply sectioning the root BBox of the RhoTree into N_b nonoverlapping subboxes. The root BBox encloses the entire electron density, and is constructed by an upward merge of sub-BBoxes in the RhoTree. At the lowest level, each BBox of the RhoTree is determined by the Gaussian extent.

One approach to load balance is the master-slave approach, which has been used to calculate K_xc in conventional quantum chemistry programs.37,39,40 In the context of HiCu, a naive approach to master-slave load balancing involves making N_b larger than N_p (say, N_b = 4 N_p), where N_p is the number of processors. The master program collects the results from a slave program which has finished its work on a particular subbox. The master program then instructs the slave program to work on a new subbox which has not been assigned to any of the slave programs before.

We have tested this simple dynamic master-slave load balancing scheme. Unfortunately, poor speedups are obtained, especially when both N_b and N_p are large. This is because as N_b increases, the time to deal with a subbox is generally small (since the volumes are smaller), so most slave programs will finish their work very quickly and will spend significant time in contention for the master program, a well-known problem with this approach.50,51 In HiCu, this problem is exacerbated by the fact that the maximum load does not necessarily decrease linearly with N_b, due to the irregular nature of the problem.
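For reference, the kind of dynamic master-slave distribution described above can be written compactly with MPI point-to-point messages. The mpi4py sketch below is illustrative only: work_on_subbox and the choice of N_b are assumptions of this sketch, and the serialization of all requests through rank 0 is exactly the contention that limits the scheme at large N_b and N_p.

```python
# Illustrative master-slave work queue (not the MondoSCF code): rank 0 hands out
# subbox indices on demand; a negative index tells a slave to stop.
from mpi4py import MPI

def master(comm, n_subboxes):
    status = MPI.Status()
    results, next_box, stopped = [], 0, 0
    nslaves = comm.Get_size() - 1
    while stopped < nslaves:
        data = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        if data is not None:                      # the first message from a slave is None
            results.append(data)
        if next_box < n_subboxes:
            comm.send(next_box, dest=status.Get_source())   # assign a new subbox
            next_box += 1
        else:
            comm.send(-1, dest=status.Get_source())         # no work left
            stopped += 1
    return results

def slave(comm, work_on_subbox):
    comm.send(None, dest=0)                       # ask for the first assignment
    while True:
        box_id = comm.recv(source=0)
        if box_id < 0:
            break
        comm.send(work_on_subbox(box_id), dest=0)  # return result, ask for more

comm = MPI.COMM_WORLD
# if comm.Get_rank() == 0: results = master(comm, n_subboxes=4 * (comm.Get_size() - 1))
# else:                    slave(comm, work_on_subbox=lambda i: ...)
```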

It is interesting to note that Guerra et al.34 preferred a static load balancing scheme over dynamic master-slave load balancing because they found that the repeated distribution of data requires much more communication.

An alternative is to employ a heuristic work estimate to guide the domain decomposition. We experimented with a number of heuristics, including splitting the density into equal-charge volumes and partitioning by center of mass. Of these heuristic approaches, the center of mass (CM) method, explained in the following, was found to have the greatest, but still unsatisfactory, efficiency.

Another approach to load balance is measurement. Measurement approaches have several advantages: First, they are easy to program. Second, they work well with smooth variation of the work load in the time or iteration space (temporal locality), an exploitable feature in the SCF procedure. Third, they have been employed with great success in tree-code methods for solving the N-body problem49,52-57; HiCu has significant homology with the current implementation of the quantum chemical tree-code (QCTC)11,12,14 for solving the quantum Coulomb problem.

In the remainder of this work we describe our development of a measurement based method for load balancing data parallel HiCu. Briefly, the overall scheme is as follows: First, the heuristic CM partitioning is used for the parallel calculation of K_xc in the first SCF cycle. This provides a first measurement for repartitioning the root BBox in the next SCF cycle. As long as the starting guess is reasonable and pathological charge sloshing does not occur, each repartition should yield a good load balance for the next iteration. In this way, temporal locality49 of the problem is exploited to achieve a continuously improved load balance. Equal time repartitioning will be further explained in Sec. II B.

A. Center of mass partitioning scheme

In the center of mass (CM) partitioning scheme, we make the plausible assumption that the computing time for a subbox is proportional to the total electron charge in the box. To partition a BBox (see Fig. 1), we first calculate the center of the electron charge, which we have loosely called the "center of mass." We assume that the electron charge due to an atom i is smeared out evenly in a sphere of radius R_i, which we have taken to be the Bragg-Slater radius58 for the atom. The Bragg-Slater radius of an atom is its empirical atomic radius, derived from the observation that interatomic bond lengths in solids and molecules, whether ionic or covalent, are approximately equal to pairwise sums of unique atomic radii.58 The center of mass (charge) X is given by

X = \frac{\sum_i Z_{\mathrm{eff},i}\,X_{\mathrm{eff},i}}{\sum_i Z_{\mathrm{eff},i}}, \qquad (3)

where Z_eff,i = Z_i (L_x L_y L_z / V_i), with V_i = 4πR_i^3/3. The symbols L_x, L_y, L_z, and X_eff,i are explained in Fig. 1. The index i in Eq. (3) runs through all atoms which overlap (totally or partially) with the box to be partitioned. In practice, we only calculate the component of X along the largest dimension of the box, since only one cutting plane passes through X. Each of the two subboxes after a CM partitioning may be subjected to another CM partitioning. For simplicity, we always partition the root BBox in such a way that the number of subboxes is a power of two, even though this restriction can be dropped easily (see the Appendix for a general equal time partitioning). In a parallel calculation, only one processor performs the serial CM partition and broadcasts the dimensions of the subboxes to all other processors. It is important to emphasize that although the CM partitioning scheme might not give a good load balance, it does serve as a cheap and good starting partition for parallel HiCu. This is not a problem even for a large system, because we can use a reasonably large local error threshold τ and a minimal basis set to start the calculation, while the equal time partitioning scheme, to be explained in Sec. II B, improves the efficiency in subsequent SCF cycles. One can then switch to a better basis set or lower the threshold τ during the iterative calculations.

FIG. 1. A schematic diagram to explain the center of mass (CM) partitioning scheme. A sphere of Bragg-Slater (Ref. 58) radius R_i centered on atom i is drawn for each atom (for simplicity, only one atom is considered in this case). The volume of the box-sphere intersection is approximated by L_x L_y L_z, where L_x is the range of intersection between the sphere and the box in the x direction; L_y and L_z are calculated in a similar way. For simplicity we have assumed that the effective charge Z_eff,i is centered at X_eff. Also shown is a cutting plane C passing through the center of mass X.
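The following Python sketch illustrates the estimate of Eq. (3) and the CM cut along the largest box dimension. The atom list, Bragg-Slater radii, and nuclear charges are assumed inputs, and placing the effective charge at the midpoint of the per-axis overlap range is a simplifying assumption of this sketch rather than the MondoSCF convention.

```python
# Center of mass (charge) partitioning sketch after Eq. (3). Each atom is taken as
# a uniformly charged Bragg-Slater sphere; only the component of X along the
# largest box dimension is needed, since a single cutting plane is used.
from math import pi

def axis_overlap(lo, hi, center, radius):
    """Length of the overlap between [lo, hi] and [center - r, center + r]."""
    return max(0.0, min(hi, center + radius) - max(lo, center - radius))

def cm_cut(box_lo, box_hi, atoms):
    """atoms: list of (position (x, y, z), nuclear charge Z, Bragg-Slater radius R).
    Returns (cut_axis, cut_coordinate) for one CM bisection of the box."""
    d = max(range(3), key=lambda k: box_hi[k] - box_lo[k])     # largest dimension
    num = den = 0.0
    for pos, z, r in atoms:
        ranges = [axis_overlap(box_lo[k], box_hi[k], pos[k], r) for k in range(3)]
        if min(ranges) == 0.0:
            continue                                           # sphere misses the box
        v_i = 4.0 * pi * r**3 / 3.0
        z_eff = z * (ranges[0] * ranges[1] * ranges[2]) / v_i  # Z_eff,i of Eq. (3)
        # effective charge placed at the midpoint of the overlap range along d
        x_eff = 0.5 * (max(box_lo[d], pos[d] - r) + min(box_hi[d], pos[d] + r))
        num += z_eff * x_eff
        den += z_eff
    cut = num / den if den > 0.0 else 0.5 * (box_lo[d] + box_hi[d])
    return d, cut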
B. Equal time partitioning scheme

Next, we discuss measurement approaches to domain decomposition, with the goal of repartitioning the root BBox based on workload information collected during CubeTree construction in the previous SCF cycle. By analogy with other parallel methods for computation of K_xc, we first considered distributing an equal share of grid points, naively reasoning that the work load would be proportional to their number. If we evaluated the density using conventional methods, this would certainly be so. However, this is not a reliable tack, because each leaf node interacts with the RhoTree in a different way, always attempting to minimize the total work load. Also, because each interaction of a leaf node with the RhoTree is so nonuniform, the simple interaction counting mechanisms employed by parallel N-body methods49,52-57 are not a viable alternative.

This line of reasoning led us to the development of equal time (ET) partitioning, which uses the elapsed wall time associated with Cartesian sub-volumes (here the leaf nodes) to predict a good load balance in the next iteration.

In the ET partitioning of HiCu, the leaf-node time is stored along with other attributes in a variable of type CubeNode.18 With all processors holding leaf-node times in their respective local CubeTrees, ET creates a new domain decomposition by recursively partitioning a box into two subboxes such that each subbox carries approximately the same total leaf-node time (hence "equal time"). At the end of the decomposition we are left with N_p ET BBoxes in which to construct a HiCu grid. The astute reader will by now realize that this construction requires N_p to be a power of two, and that errors made early in the decomposition will propagate to the lowest level. The inconvenience associated with the restriction to a power of two for the number of processors is addressed in the Appendix. We have used a robust bisection method59 to find the plane which equally (to some threshold) divides the workload in half (see Fig. 2). Notice that the cutting direction may be varied at each level of the recursion to avoid elongated ET BBoxes, which could lead to poorly balanced workloads. Our decomposition always chooses a direction which cuts along the largest box dimension, since the cubature rules work best for leaf nodes close to cubic.46

FIG. 2. A schematic diagram to illustrate the equal time (ET) partitioning scheme. For simplicity a two-dimensional analog is used. The rectangles represent the bounding "boxes" of the leaf nodes in a CubeTree. Different shades are used for different rectangles to indicate that the leaf-node times are different in general. The distance of the cutting line C from the left edge, x, is determined by the condition that the sum of the leaf-node times on both sides of the partition line C is equal (i.e., T_L = T_R). Notice that we have chosen the direction for partitioning as the direction which gives the largest dimension.

The application of ET to HiCu centers around the interaction of the CubeTree leaf nodes with the RhoTree. This implementation does not account for the time taken to build each matrix element by traversing the CubeTree. This is because of the close homology between basis function products and the electron density; they visit nearly identical portions of the CubeTree. However, if this homology did not exist, ET could associate primitive basis function products with Cartesian sub-volumes and wall times, using the combined information to load balance.

A working hypothesis of ET partitioning applied to HiCu is that the sum of leaf-node times for the root BBox is constant irrespective of sectioning. In practice this assumption is only approximately true, as the total number of leaf nodes and their times may fluctuate if the cutting planes are shifted. This is because with each new partitioning the CubeTree will readjust, potentially using a different number of leaf nodes to satisfy δ < τ. Also, as the time to create each leaf node depends on its size and position, readjustments will lead to leaf-node times that are somewhat different. Finally, variation of the density and hence the grid will occur in all but the last few SCF cycles. All of these factors work against ET partitioning. The magnitude of these factors depends on the granularity of the partition, as will be seen in Sec. V.
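A minimal sketch of one level of ET bisection follows. It assumes each leaf node carries a measured wall time (in practice an MPI_WTIME measurement taken during CubeTree construction) and a bounding box, uses a plain bisection search on the cutting coordinate in place of the library routine of Ref. 59, and assigns a leaf to the left side when its box center lies left of the cut; these are assumptions of the sketch, not the MondoSCF routine.

```python
# Equal time (ET) bisection sketch: find the plane along the largest box dimension
# that splits the measured leaf-node times into two (nearly) equal halves, then
# recurse until 2**levels subboxes (one per processor) are produced.

def left_time(leaves, axis, cut):
    """Total measured time of leaves whose box center lies left of the cut."""
    return sum(t for (lo, hi, t) in leaves if 0.5 * (lo[axis] + hi[axis]) < cut)

def et_bisect(box_lo, box_hi, leaves, tol=1e-3, max_iter=60):
    axis = max(range(3), key=lambda k: box_hi[k] - box_lo[k])
    total = sum(t for (_, _, t) in leaves)
    a, b = box_lo[axis], box_hi[axis]
    cut = 0.5 * (a + b)
    for _ in range(max_iter):                    # bisection search for T_L = T_R
        cut = 0.5 * (a + b)
        t_left = left_time(leaves, axis, cut)
        if abs(t_left - 0.5 * total) <= tol * total:
            break
        if t_left < 0.5 * total:
            a = cut
        else:
            b = cut
    left = [l for l in leaves if 0.5 * (l[0][axis] + l[1][axis]) < cut]
    right = [l for l in leaves if 0.5 * (l[0][axis] + l[1][axis]) >= cut]
    hi_l = list(box_hi); lo_r = list(box_lo)
    hi_l[axis] = cut; lo_r[axis] = cut
    return (box_lo, tuple(hi_l), left), (tuple(lo_r), box_hi, right)

def et_partition(box_lo, box_hi, leaves, levels):
    """Recursively produce 2**levels ET subboxes from measured leaf-node times."""
    if levels == 0:
        return [(box_lo, box_hi, leaves)]
    (llo, lhi, ll), (rlo, rhi, rl) = et_bisect(box_lo, box_hi, leaves)
    return et_partition(llo, lhi, ll, levels - 1) + et_partition(rlo, rhi, rl, levels - 1)
```

Because the leaf times come from the previous SCF cycle, the quality of the resulting partition rests on the temporal locality of the density, which is the central assumption of the method.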
III. DATA PARALLEL APPROACH

Data parallelism is essential for scalable algorithms, as data replication stands to exhaust local memory and throttle communications with increasing N and N_p. For parallel HiCu, each of the ET BBoxes requires only local information about the electron density. By building a locally essential RhoTree, which is just sufficient to create the grid in an ET BBox, redundant replication of data is avoided. Another advantage is that the smaller locally essential RhoTree is more efficient to traverse than the entire RhoTree that would be encountered in a replicated data scheme. This reduced complexity due to data parallelism can lead to superlinear speedups, as shown in Sec. V.

Data parallelism begins with a distributed atom-blocked compressed sparse row (DBCSR)60 density matrix P^p, which is used by the program MAKERHO to create an intermediate distributed Hermite-Gaussian (HG)14,18,61 density ρ^p. The intermediate density consists of primitive distributions associated with only the rows spanned by P^p. MAKERHO writes ρ^p to the local disk in parallel, avoiding IO and network bottlenecks. As an aside, this is the same density used by the program QCTC.11,12,14 The intermediate density ρ^p is generally insufficient to build a CubeTree in an ET BBox. Communication involving all processors now occurs to construct the locally essential density ρ^p_LE. Ultimately, sophisticated ordering and scheduling techniques could be used to reduce the complexity of this step. So far, however, we have found no degradation of performance due to this redistribution. The construction of ρ^p_LE involves all processes going through their local primitive density distributions, performing box-box overlap tests as described in Ref. 18 to determine the exchange of data. After this redistribution of densities, the construction of the locally essential RhoTree, local CubeTree, and local partially summed exchange-correlation matrix (denoted by L^p_xc) proceeds independently. The local, partially summed exchange-correlation matrix results from numerical integration of all primitive basis function products in the ET BBox.
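The redistribution step can be pictured as a simple all-to-all routing driven by box-box overlap tests, as in the sketch below; the distribution format, the per-distribution bounding boxes, and the use of mpi4py's alltoall are illustrative assumptions rather than the MondoSCF communication pattern.

```python
# Sketch of building the locally essential density: every process scans its own
# primitive distributions, tests their bounding boxes against each ET BBox, and
# ships overlapping distributions to the owning process with a single alltoall.
from mpi4py import MPI

def boxes_overlap(lo1, hi1, lo2, hi2):
    """Axis-aligned box-box overlap test in three dimensions."""
    return all(lo1[k] <= hi2[k] and lo2[k] <= hi1[k] for k in range(3))

def locally_essential(comm, local_dists, et_boxes):
    """local_dists: list of (bbox_lo, bbox_hi, payload) held by this rank.
    et_boxes[p]: (lo, hi) of the ET BBox owned by rank p."""
    nproc = comm.Get_size()
    outgoing = [[] for _ in range(nproc)]
    for lo, hi, payload in local_dists:
        for p, (et_lo, et_hi) in enumerate(et_boxes):
            if boxes_overlap(lo, hi, et_lo, et_hi):
                outgoing[p].append((lo, hi, payload))
    incoming = comm.alltoall(outgoing)        # one list received from every rank
    rho_le = [d for batch in incoming for d in batch]
    return rho_le                             # input for the locally essential RhoTree
```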

FIG. 3. The general flow of MONDOSCF, with the parallel HiCu subroutine expanded. Refer to the text for more details.

This local matrix is related to the full exchange-correlation matrix K_xc by

K_{xc} = \sum_{p=1}^{N_p} L_{xc}^{p}, \qquad (4)

although the summation in Eq. (4) is not explicitly carried out. Rather, L^p_xc is redistributed and resummed to yield a nonoverlapping, row-distributed K^p_xc.

As N_p becomes large, the ET BBox becomes small and the corresponding local exchange-correlation matrix becomes very sparse. During matrix construction there can be significant overheads associated with simple box-box testing of primitive basis function product BBox overlaps with the ET BBox. This problem is also encountered for periodic HiCu,62 where a double sum over lattice vectors leads to a substantial fraction of basis function products that are outside the root BBox. Our solution, described in detail elsewhere,63 is to perform a high level test between the basis function products in an atom-atom pair and the ET BBox using a cylinder-box overlap test.

Another potential complexity encountered with fine grained parallelism involves the redistribution and resummation of L^p_xc. If one uses sparse-matrix data structures related to the standard CSR64,65 data structure, such as the DBCSR or BCSR constructs,60 the resummation of very many small sparse matrices can lead to large copy overheads. We have developed the Fast Matrix66 (FastMat) data structure to address this problem, in which all rows are connected by a linear linked list, while each row is represented by a binary tree. This new construct has a low overhead for random insertions and updates of atom-atom blocks. It also allows many operations, such as matrix-matrix addition and matrix truncation, to be done in place, reducing both memory and copy overheads.
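To convey the idea behind FastMat, here is a toy Python stand-in: rows live on a linked list and each row keeps its column blocks in an ordered lookup structure, so random block insertion accumulates in place. The real Fortran data structure of Ref. 66 differs in detail; in particular, a dictionary stands in here for the per-row binary tree.

```python
# Toy FastMat-like container: a linked list of rows, each row holding atom-atom
# blocks keyed by column index. add_block accumulates in place, avoiding the
# copy overheads of repeatedly merging many small CSR-style matrices.
import numpy as np

class _Row:
    def __init__(self, index):
        self.index = index
        self.blocks = {}   # column index -> dense block (stand-in for a binary tree)
        self.next = None

class FastMatSketch:
    def __init__(self):
        self.head = None   # rows kept sorted by row index on a linked list

    def _row(self, i):
        prev, node = None, self.head
        while node is not None and node.index < i:
            prev, node = node, node.next
        if node is not None and node.index == i:
            return node
        new = _Row(i)
        new.next = node
        if prev is None:
            self.head = new
        else:
            prev.next = new
        return new

    def add_block(self, i, j, block):
        """Accumulate a dense atom-atom block into row i, column j, in place."""
        row = self._row(i)
        if j in row.blocks:
            row.blocks[j] += block
        else:
            row.blocks[j] = np.array(block, dtype=float)

    def add_matrix(self, other):
        """In-place matrix-matrix addition, block by block."""
        node = other.head
        while node is not None:
            for j, blk in node.blocks.items():
                self.add_block(node.index, j, blk)
            node = node.next
```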
IV. IMPLEMENTATION

We have implemented the parallel HiCu algorithm in MONDOSCF,67 a suite of programs for linear scaling electronic structure theory and ab initio molecular dynamics. Figure 3 shows the general flow of an SCF cycle, with expanded detail of parallel HiCu. MONDOSCF has been written in Fortran 90/95 with the message-passing library MPI.68 Leaf-node times are calculated with the MPI_WTIME function.

V. RESULTS

We have performed scaling tests on taxol, a cluster of 110 water molecules, and α-pinene with 2^n processors, where n typically ranges from 1 to 7. These systems are chosen because they are highly inhomogeneous, three-dimensional systems posing a challenge to parallel HiCu. All runs on the first two systems were performed on a single SGI Origin 2000 with 128 250-MHz MIPS R10000 processors. For the calculations on α-pinene, we used a cluster of 256 four-CPU HP/Compaq AlphaServer ES45s with the Quadrics QsNet high speed interconnect.

We start the calculations with the STO-3G basis set and a large threshold τ, switch to the 3-21G basis set using our mixed integral approach, and run for three SCF cycles. The density matrix P is saved to disk and scaling tests of parallel HiCu are performed. Timings for parallel HiCu are taken at the fourth cycle, where the center of mass partition is used, and at the fifth cycle, where the equal time partition is used. The procedure of switching the basis set and iterating for a few SCF cycles places the electron density in the middle of convergence. This is not really necessary, as performance is quite insensitive to the SCF cycle.

The result of the taxol scaling tests is shown in Fig. 4, where the center of mass partitions perform rather poorly past 16 processors, with a speedup of only 22.4 obtained at 64 processors.

FIG. 4. Scaling of parallel HiCu on taxol (C47H51NO14), RPBE/3-21G, with τ = 1.0 × 10^-6. Speedups are relative to a 2-processor calculation.

FIG. 6. Scaling of parallel HiCu on α-pinene, RPBE/6-311G(df,p), with τ = 1.0 × 10^-6. Speedups are relative to a 2-processor calculation.

However, the equal time partitions give good scaling results compared to the ideal speedup, obtaining a total speedup of 60.8 with 64 processors. We note that our result of a speedup of 7.03 with 8 processors compares favorably with the speedup of about 6.0 of Sosa et al.,38 which is for an entire single-point energy calculation.

Similar scaling tests have been performed on a cluster of 110 water molecules. Figure 5 shows that the center of mass partition gives a better speedup than in the case of taxol. Equal time repartitioning is highly effective: the speedup is almost ideal up to 64 processors. We observe a superlinear speedup at 32 processors, which is a medium-grained calculation. We attribute the superlinear speedup to a good load balance and to the reduced complexity associated with data parallelism, where the memory access time has been greatly reduced with a smaller RhoTree. The speedup with 128 processors degrades slightly to 112.64 (an efficiency of 88%). Still, this is an encouraging result, as the 128-processor run corresponds to very fine-grained parallelism, with less than one heavy atom per processor. The reason for this degraded efficiency is not communication overhead but intrinsic limitations of the ET scheme associated with fluctuations in the leaf-node count and small nonconservation of times, as explained in Sec. II B. While there are almost certainly ways to overcome or compensate for these effects, we feel that the overall performance of ET for parallel HiCu is outstanding. To explore the behavior of the current implementation with N_p ~ 1024 will require a fully parallel MONDOSCF, which we are actively pursuing.

A well studied system in parallel quantum chemistry is 6-311G(df,p) α-pinene (C10H16).37,38 To date, the most competitive speedup for this system is 37.3 out of 64 processors, achieved by Furlani et al.37 over a complete SCF cycle. Scaling results for a medium accuracy grid corresponding to τ = 1.0 × 10^-6 are shown in Fig. 6, with HiCu achieving a competitive speedup of 45.11 (70% efficiency) out of 64 processors. More impressively, with a high accuracy grid (τ = 1.0 × 10^-8) equal time partitioning delivers an unrivaled 99.5% efficiency with N_p = 64, as shown in Fig. 7. These results demonstrate that the efficiency of ET improves when the grid work per processor is increased. This is not surprising, as increasing the grid work decreases the effective granularity, reducing sensitivity to fluctuations in leaf-node times and leading to a more predictive equal time partition.

FIG. 5. Scaling of parallel HiCu on (H2O)110, RPBE/3-21G, with τ = 1.0 × 10^-6. Speedups are relative to a 2-processor calculation.

FIG. 7. Scaling of parallel HiCu on α-pinene, RPBE/6-311G(df,p), with τ = 1.0 × 10^-8. Speedups are relative to a 2-processor calculation.

VI. CONCLUSIONS

We have developed an efficient and general load balancing technique, equal time (ET) partitioning, and used it in a data parallel implementation of the linear scaling hierarchical cubature (HiCu) algorithm for numerical integration of the exchange-correlation matrix. In the ET approach, strong spatial irregularities in the work load are overcome by exploiting temporal locality between iterations. We expect ET to also exploit this effect between geometry steps in an optimization or molecular dynamics run. In this data parallel implementation, quantum locality has been used to reduce communications between processors, to lower memory usage, and to reduce the computational complexity associated with traversal of the RhoTree. In some cases, this last effect can lead to superlinear speedups for medium grained calculations.

To the best of our knowledge, equal time partitioning is the first measurement based load balancing strategy applied to quantum chemistry. While ET partitioning shares some commonalities with partitioning schemes for parallel N-body tree-code solvers such as orthogonal recursive bisection (ORB)52 and Costzones,55 ET differs by (i) using exact load timing information rather than counting the number of interactions as in the ORB and Costzones methods, and (ii) associating this load with rectilinear partitioning of Cartesian space rather than particle groupings (ORB) or nodes in a tree (Costzones). These are important distinctions for quantum chemistry, where methods are often much more complicated than the simple N-body problem. Measured loads encompass irregularities associated with different Gaussian extents, nonuniform access patterns, etc., that cannot be accounted for by counting interactions. Also, because we are often doing several complicated things at once, attaching work loads to a portion of 3-D space is a convenient catch-all that can be associated with most objects and processes: basis function products and matrix construction, electron densities and grid construction, etc.

Our parallel approach to the construction of K_xc differs fundamentally from other all-electron approaches in that (i) we do not employ Becke weights and (ii) we employ measurement based load balancing. It is also different from many implementations in that (iii) it is data parallel. In addition to the numerical issues discussed in the Introduction, we believe the nonoverlapping Cartesian subdivision used by HiCu enables a more straightforward approach to data locality and domain decomposition. Both the replicated data and the master-slave approach have well known shortcomings for fine grained massive parallelism, which have been avoided here.

With N_p ≤ 128, the predominant cause of deviation from ideal scaling is basic nonlinearities in ET partitioning of the HiCu grid, rather than overheads associated with inefficient data structures or communication. These deviations increase as the processor to grid work ratio increases, i.e., with finer granularity. The main factors that determine grid work are the number of atoms and τ, the threshold controlling grid accuracy. For example, calculations on (H2O)110 with a medium accuracy grid and ~1 heavy atom per processor are fine grained, achieving a parallel efficiency of 88% with N_p = 128. Even with N_p/N_Atoms > 2 it is possible to achieve excellent scaling by lowering τ. This was demonstrated by 6-311G(df,p) calculations on α-pinene (C10H16) with N_p = 64, achieving a parallel efficiency of 99.5%. Apparently, this is the first time this effect has been demonstrated in the context of parallel quantum chemistry. More importantly, because HiCu continues to achieve early onset linear scaling even with tight numerical thresholds,18 these results suggest the potential for very large, high quality calculations through a combination of massive parallelism and scalable O(N) algorithms.

ACKNOWLEDGMENTS

This work has been carried out under the auspices of the U.S. Department of Energy under Contract No. W-7405-ENG-36 and the ASCI project. The Advanced Computing Laboratory of Los Alamos National Laboratory, Los Alamos, NM 87545, is acknowledged. This work was performed on computing resources located at this facility. We thank the reviewer for valuable comments. A fruitful discussion on the general equal time partitioning scheme with C. J. Tymczak is appreciated.

APPENDIX: A GENERAL EQUAL TIME PARTITIONING SCHEME

In this Appendix a general equal time partitioning scheme is described where the number of subboxes, denoted by m, is not restricted to a power of two. To facilitate this general partitioning, first create a binary tree with m leaf nodes, with the condition that the depth of any two leaf nodes differs by no more than one. Next, associate the edges connecting a parent node to its left and right child nodes with edge weights equal to n_L/(n_L + n_R) and n_R/(n_L + n_R), respectively. Here n_L (n_R) is the number of left (right) leaf nodes of the parent node. Figure 8(a) shows a binary tree and all the edge weights for the case of m = 5. With these edge weights, it is a fairly simple matter to partition a total workload W into m equal workloads. Starting from the root node, recursively divide the workload in a parent node according to the ratio of the edge weights, i.e., n_L : n_R. At the end of this procedure, all leaf nodes will have a workload of W/m.

FIG. 8. An illustrative example to show the general equal time partitioning scheme for the case of m = 5.

Figure 8(b) shows the detailed workload information in all nodes for the case of m = 5. In a domain decomposition of a root BBox, a bisection search may be used to partition a box into two subboxes with a workload ratio equal to n_L : n_R, which can be easily read off from the edge weights of the binary tree. It should be noticed that the "power-of-two" equal time partitioning scheme explained in Sec. II B is a special case of this general partitioning scheme.

REFERENCES

1. S. Goedecker, Rev. Mod. Phys. 71, 1085 (1999).
2. S. Y. Wu and C. S. Jayanthi, Phys. Rep. 358, 1 (2002).
3. E. Schwegler and M. Challacombe, J. Chem. Phys. 105, 2726 (1996).
4. E. Schwegler, M. Challacombe, and M. Head-Gordon, J. Chem. Phys. 106, 9708 (1997).
5. E. Schwegler, M. Challacombe, and M. Head-Gordon, J. Chem. Phys. 109, 8764 (1998).
6. E. Schwegler, "Linear scaling computation of the Hartree-Fock exchange matrix," Ph.D. thesis, University of Minnesota, 1998, URL http://www.t12.lanl.gov/~mchalla/
7. E. Schwegler and M. Challacombe, J. Chem. Phys. 111, 6223 (1999).
8. E. Schwegler and M. Challacombe, Theor. Chem. Acc. 104, 344 (2000).
9. C. A. White, B. Johnson, P. Gill, and M. Head-Gordon, Chem. Phys. Lett. 230, 8 (1994).
10. C. A. White, B. G. Johnson, P. M. W. Gill, and M. Head-Gordon, Chem. Phys. Lett. 253, 268 (1996).
11. M. Challacombe, E. Schwegler, and J. Almlöf, in Computational Chemistry: Review of Current Trends (World Scientific, Singapore, 1996), pp. 53-107.
12. M. Challacombe, E. Schwegler, and J. Almlöf, J. Chem. Phys. 104, 4685 (1996).
13. M. C. Strain, G. E. Scuseria, and M. J. Frisch, Science 271, 51 (1996).
14. M. Challacombe and E. Schwegler, J. Chem. Phys. 106, 5526 (1997).
15. J. M. Pérez-Jordá and W. Yang, Chem. Phys. Lett. 241, 469 (1995).
16. R. E. Stratmann, G. E. Scuseria, and M. J. Frisch, Chem. Phys. Lett. 257, 213 (1996).
17. C. F. Guerra, J. G. Snijders, G. te Velde, and E. J. Baerends, Theor. Chem. Acc. 99, 391 (1998).
18. M. Challacombe, J. Chem. Phys. 113, 10037 (2000).
19. X. P. Li, R. W. Nunes, and D. Vanderbilt, Phys. Rev. B 47, 10891 (1993).
20. M. S. Daw, Phys. Rev. B 47, 10895 (1993).
21. S. Y. Qiu, C. Z. Wang, K. M. Ho, and C. T. Chan, J. Phys.: Condens. Matter 6, 9153 (1994).
22. E. Hernández and M. J. Gillan, Phys. Rev. B 51, 10157 (1995).
23. E. Hernández, M. J. Gillan, and C. Goringe, Phys. Rev. B 53, 7147 (1996).
24. C. M. Goringe, E. Hernández, M. J. Gillan, and I. J. Bush, Comput. Phys. Commun. 94, 89 (1996).
25. A. D. Daniels, J. M. Millam, and G. E. Scuseria, J. Chem. Phys. 107, 425 (1997).
26. D. R. Bowler and M. J. Gillan, Comput. Phys. Commun. 120, 95 (1999).
27. A. H. R. Palser and D. E. Manolopoulos, Phys. Rev. B 58, 12704 (1998).
28. M. Challacombe, J. Chem. Phys. 110, 2332 (1999).
29. A. M. N. Niklasson, Phys. Rev. B 66, 155115 (2002).
30. A. M. N. Niklasson, C. J. Tymczak, and M. Challacombe, J. Chem. Phys. (to be published).
31. W. Kohn, Int. J. Quantum Chem. 56, 229 (1995).
32. W. Kohn, Phys. Rev. Lett. 76, 3168 (1996).
33. R. J. Harrison and R. Shepard, Annu. Rev. Phys. Chem. 45, 623 (1994).
34. C. F. Guerra, O. Visser, J. G. Snijders, G. te Velde, and E. J. Baerends, in Methods and Techniques for Computational Chemistry, edited by E. Clementi and G. Corongiu (STEF, Cagliari, 1995), p. 305.
35. C. P. Sosa, J. Ochterski, J. Carpenter, and M. J. Frisch, J. Comput. Chem. 19, 1053 (1998).
36. F. Stephan and W. Wenzel, J. Chem. Phys. 108, 1015 (1998).
37. T. R. Furlani, J. Kong, and P. M. W. Gill, Comput. Phys. Commun. 128, 170 (2000).
38. C. P. Sosa, G. Scalmani, R. Gomperts, and M. J. Frisch, Parallel Comput. 26, 843 (2000).
39. T. Yoshihiro, F. Sato, and H. Kashiwagi, Chem. Phys. Lett. 346, 313 (2001).
40. J. Baker and P. Pulay, J. Comput. Chem. 23, 1150 (2002).
41. H. Takashima, S. Yamada, S. Obara, K. Kitamura, S. Inabata, N. Miyakawa, K. Tanabe, and U. Nagashima, J. Comput. Chem. 23, 1337 (2002).
42. A. D. Becke, J. Chem. Phys. 88, 2547 (1988).
43. J. L. Bentley and J. H. Friedman, ACM Comput. Surv. 11, 397 (1979).
44. J. L. Bentley, Commun. ACM 23, 214 (1980).
45. V. Gaede and O. Günther, ACM Comput. Surv. 30, 170 (1998).
46. A. H. Stroud, Approximate Calculation of Multiple Integrals, Prentice-Hall Series in Automatic Computation (Prentice-Hall, Englewood Cliffs, NJ, 1971).
47. J. Almlöf, K. Faegri, and K. Korsell, J. Comput. Chem. 3, 385 (1982).
48. M. Häser and R. Ahlrichs, J. Comput. Chem. 10, 104 (1989).
49. J. R. Pilkington and S. B. Baden, IEEE Trans. Parallel Distrib. Syst. 7, 288 (1996).
50. B. Wilkinson and M. Allen, Parallel Programming (Prentice-Hall, Upper Saddle River, NJ, 1999).
51. G. V. Wilson, Practical Parallel Programming (The MIT Press, Cambridge, MA, 1995).
52. M. S. Warren and J. K. Salmon, Proceedings of Supercomputing '92, 1992, p. 570.
53. A. Y. Grama, V. Kumar, and A. Sameh, Proceedings of Supercomputing '94, 1994, p. 439.
54. M. S. Warren and J. K. Salmon, Comput. Phys. Commun. 87, 266 (1995).
55. J. P. Singh, C. Holt, J. L. Hennessy, and A. Gupta, Proceedings of Supercomputing '93, 1993, p. 54.
56. J. P. Singh, C. Holt, T. Totsuka, A. Gupta, and J. Hennessy, J. Parallel Distrib. Comput. 27, 118 (1995).
57. A. Grama, V. Kumar, and A. Sameh, Parallel Comput. 24, 797 (1998).
58. J. C. Slater, J. Chem. Phys. 41, 3199 (1964).
59. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in FORTRAN (Cambridge University Press, Port Chester, NY, 1992).
60. M. Challacombe, Comput. Phys. Commun. 128, 93 (2000).
61. G. R. Ahmadi and J. Almlöf, Chem. Phys. Lett. 246, 364 (1995).
62. C. J. Tymczak and M. Challacombe, "Linear scaling computation of the Fock matrix. VII. Periodic boundary conditions," in preparation.
63. M. Challacombe and C. J. Tymczak, "Data structures and error estimates for linear scaling SCF theory. I. Trees, potentials and the Gaussian extent," in preparation.
64. F. G. Gustavson, ACM Trans. Math. Softw. 4, 250 (1978).
65. S. Pissanetzky, Sparse Matrix Technology (Academic, London, 1984).
66. M. Challacombe, C. K. Gan, and A. M. N. Niklasson, "Data structures and error estimates for linear scaling SCF theory. II. Sparse matrices, atom-blocking, truncation and inverse Cholesky factors," in preparation.
67. M. Challacombe, E. Schwegler, C. Tymczak, C. K. Gan, K. Nemeth, A. M. N. Niklasson, H. Nymeyer, and G. Henkelman, MONDOSCF v1.0α7, "A program suite for massively parallel, linear scaling SCF theory and ab initio molecular dynamics," 2001, URL http://www.t12.lanl.gov/~mchalla/, Los Alamos National Laboratory (LA-CC 01-2), Copyright University of California.
68. Message Passing Interface Forum, "MPI: A message-passing interface standard (version 2.0)," 1998, http://www.mpi-forum.org
