Scaling up Hartree–Fock Calculations on Tianhe-2
Original Article. The International Journal of High Performance Computing Applications, 1–18. © The Author(s) 2015. DOI: 10.1177/1094342015592960. hpc.sagepub.com

Edmond Chow [1], Xing Liu [1], Sanchit Misra [2], Marat Dukhan [1], Mikhail Smelyanskiy [2], Jeff R. Hammond [2], Yunfei Du [3], Xiang-Ke Liao [3] and Pradeep Dubey [2]

[1] School of Computational Science and Engineering, Georgia Institute of Technology, USA
[2] Parallel Computing Lab, Intel Corporation, USA
[3] National University of Defense Technology, China

Corresponding author: Edmond Chow, School of Computational Science and Engineering, Georgia Institute of Technology, 266 Ferst Drive, Atlanta, GA 30332, USA. Email: [email protected]

Abstract
This paper presents a new optimized and scalable code for Hartree–Fock self-consistent field iterations. Goals of the code design include scalability to large numbers of nodes and the capability to simultaneously use CPUs and Intel Xeon Phi coprocessors. Issues we encountered as we optimized and scaled up the code on Tianhe-2 are described and addressed. A major issue is load balance, which is made challenging by integral screening. We describe a general framework for finding a well-balanced static partitioning of the load in the presence of screening. Work stealing is used to polish the load balance. Performance results are shown on the Stampede and Tianhe-2 supercomputers. Scalability is demonstrated on large simulations involving 2938 atoms and 27,394 basis functions, utilizing 8100 nodes of Tianhe-2.

Keywords
Hartree–Fock, quantum chemistry, Tianhe-2, Intel Xeon Phi, load balancing, work stealing

1 Introduction

Given a set of atomic positions for a molecule, the Hartree–Fock (HF) method is the most fundamental method in quantum chemistry for approximately solving the electronic Schrödinger equation (Szabo and Ostlund, 1989). The solution of the equation, called the wavefunction, can be used to determine properties of the molecule. The solution is represented indirectly in terms of a set of n basis functions, with more basis functions per atom leading to more accurate, but more expensive, approximate solutions. The HF numerical procedure is to solve an n × n generalized nonlinear eigenvalue problem via self-consistent field (SCF) iterations.

Practically all quantum chemistry codes implement HF. Those that implement HF with distributed computation include NWChem (Valiev et al., 2010), GAMESS (Schmidt et al., 1993), ACESIII (Lotrich et al., 2008) and MPQC (Janssen and Nielsen, 2008). NWChem is believed to have the best parallel scalability (Foster et al., 1996; Harrison et al., 1996; Tilson et al., 1999). The HF portion of this code, however, was developed many years ago with the goal of scaling to one atom per core, and HF in general has never been shown to scale well beyond hundreds of nodes. Given current parallel machines with very large numbers of cores, much better scaling is essential. Lack of scalability inhibits the ability to simulate large molecules in a reasonable amount of time, or to use more cores to reduce time-to-solution for small molecules.

The scalability challenge of HF can be seen in Figure 1, which shows timings for one iteration of SCF for a protein-ligand test problem. Each iteration is composed of two main computational components: constructing a Fock matrix, F, and calculating a density matrix, D. In the figure, the dotted lines show the performance of NWChem for these two components. Fock matrix construction stops scaling somewhere between 64 and 144 nodes. The density matrix calculation, which involves an eigendecomposition and is often believed to limit scalability, also does not scale, but it requires only a small portion of the time relative to Fock matrix construction. Its execution time never exceeds that of Fock matrix construction due to the poor scalability of the latter. The figure also shows a preview of the results of the code, GTFock, presented in this paper. Here, both Fock matrix construction and density matrix calculation (using purification rather than eigendecomposition) scale better than in NWChem.

Figure 1. Comparison of NWChem to GTFock for 1hsg_28, a protein-ligand system with 122 atoms and 1159 basis functions. Timings obtained on the Stampede supercomputer (CPU cores only). The plot shows time in seconds (log scale) versus number of nodes (1 to 529) for NWChem Fock, NWChem Eig, GTFock Fock, and GTFock Purif.

An outline of the SCF algorithm, using traditional eigendecomposition to calculate D, is shown in Algorithm 1. For simplicity, we focus on the restricted HF version for closed shell molecules. The most computationally intense portions just mentioned are the Fock matrix construction (line 6) and the diagonalization (line 9). In the algorithm, C_occ is the matrix formed by the n_occ lowest energy eigenvectors of F, where n_occ is the number of occupied orbitals. The overlap matrix, S, the basis transformation, X, and the core Hamiltonian, H^core, do not change from one iteration to the next and are usually precomputed and stored. The algorithm is terminated when the change in electronic energy E is less than a given threshold.

Algorithm 1. SCF algorithm.
1  Guess D;
2  Compute H^core;
3  Diagonalize S = U s U^T;
4  Form X = U s^{-1/2};
5  repeat
6    Construct F, which is a function of D;
7    Form F' = X^T F X;
8    Compute energy E = \sum_{k,l} D_kl (H^core_kl + F_kl);
9    Diagonalize F' = C' ε C'^T;
10   C = X C';
11   Form D = C_occ C_occ^T;
12 until converged;
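For readers who want to see the control flow of Algorithm 1 in code, the following is a minimal NumPy sketch. It is illustrative only, not the distributed implementation described in this paper: the Fock construction is abstracted as a hypothetical callback, build_fock, and a zero initial density is used as the guess for simplicity.

```python
import numpy as np

def scf(S, Hcore, build_fock, n_occ, max_iter=50, tol=1e-8):
    """Restricted SCF loop following the structure of Algorithm 1 (sketch).

    S          : (n, n) overlap matrix
    Hcore      : (n, n) core Hamiltonian
    build_fock : assumed callback, D -> F (the expensive Fock construction)
    n_occ      : number of occupied orbitals
    """
    # Lines 3-4: diagonalize S = U s U^T and form X = U s^{-1/2}
    s, U = np.linalg.eigh(S)
    X = U @ np.diag(s ** -0.5)

    n = S.shape[0]
    D = np.zeros((n, n))                      # Line 1: trivial initial guess
    E_old = np.inf
    for _ in range(max_iter):
        F = build_fock(D)                     # Line 6: Fock matrix from D
        Fp = X.T @ F @ X                      # Line 7: F' = X^T F X
        E = np.sum(D * (Hcore + F))           # Line 8: electronic energy
        eps, Cp = np.linalg.eigh(Fp)          # Line 9: diagonalize F'
        C = X @ Cp                            # Line 10: back-transform
        C_occ = C[:, :n_occ]                  # n_occ lowest-energy eigenvectors
        D = C_occ @ C_occ.T                   # Line 11: new density matrix
        if abs(E - E_old) < tol:              # Line 12: energy-change test
            break
        E_old = E
    return E, D
```

In a distributed setting, essentially all of the cost and all of the scalability issues lie inside build_fock and in the density matrix update, which is why those two steps are the focus of the rest of this paper.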
In this paper, we present the design of a scalable code for HF calculations. The code attempts to better exploit features of today's hardware, in particular, heterogeneous nodes and wide single instruction multiple data (SIMD) CPUs and accelerators. The main contributions of this paper are in four areas, each described in its own section.

- Scalable Fock matrix construction. In previous work, we presented an algorithm for load balancing and reducing the communication in Fock matrix construction and showed good performance up to 324 nodes (3888 cores) on the Lonestar supercomputer (Liu et al., 2014). However, in scaling up this code on the Tianhe-2 supercomputer, scalability bottlenecks were found. We describe the improvements to the algorithms and implementation that were necessary for scaling Fock matrix construction on Tianhe-2. We also improved the scalability of nonblocking Global Arrays operations used by GTFock on Tianhe-2 (see Section 3).

- Optimization of integral calculations. The main computation in Fock matrix construction is the calculation of a very large number of electron repulsion integrals (ERIs). To reduce time-to-solution, we have optimized this calculation for CPUs and Intel Xeon Phi coprocessors, paying particular attention to wide SIMD (see Section 4).

- Heterogeneous task scheduling. For heterogeneous HF computation, we developed an efficient heterogeneous scheduler that dynamically assigns tasks simultaneously to CPUs and coprocessors (see Section 5).

- Density matrix calculation. To improve scalability of the density matrix calculation, we employ a purification algorithm that scales better than diagonalization approaches (a generic sketch of the purification idea is given below). We also implemented efficient, offloaded 3D matrix multiply kernels for purification (see Section 6).

In Section 7, we demonstrate results of large calculations on the Stampede and Tianhe-2 supercomputers. Scalability is demonstrated on large molecular systems involving 2938 atoms and 27,394 basis functions, utilizing 8100 nodes of Tianhe-2.
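As context for the purification approach referenced above and in Figure 1 (GTFock Purif), the following is a generic sketch of density matrix purification. It is not the specific algorithm used in this paper (that is described in Section 6), but it shows why the kernel reduces to dense matrix multiplication rather than eigendecomposition: the McWeeny iteration D <- 3D^2 - 2D^3 drives a suitably prepared matrix toward idempotency.

```python
import numpy as np

def mcweeny_purify(D0, max_iter=100, tol=1e-10):
    """Drive D toward idempotency (D^2 = D) via D <- 3D^2 - 2D^3.

    D0 is assumed to be symmetric with eigenvalues in [0, 1] and the
    desired trace (number of occupied orbitals); constructing such a
    starting matrix from the Fock matrix, and preserving the trace, is
    what full canonical purification schemes add to this basic step.
    """
    D = D0.copy()
    for _ in range(max_iter):
        D2 = D @ D                        # first dense matrix multiply
        if np.linalg.norm(D2 - D) < tol:  # idempotency reached
            break
        D = 3.0 * D2 - 2.0 * (D2 @ D)     # second multiply + polynomial update
    return D
```

Because each step is a pair of dense matrix multiplications, the computation maps naturally onto distributed and offloaded GEMM kernels, which is the property exploited by the 3D matrix multiply kernels mentioned in the contributions.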
2 Background

In this section, we give the necessary background on the generic distributed Fock matrix construction algorithm needed for understanding the scalable implementation presented in this paper.

In each SCF iteration, the most computationally intensive part is the calculation of the Fock matrix, F. Each element is given by

    F_{ij} = H^{core}_{ij} + \sum_{kl} D_{kl} [ 2(ij|kl) - (ik|jl) ]        (1)

where H^core is a fixed matrix, D is the density matrix, which is fixed per iteration, and the quantity (ij|kl) is standard notation that denotes an entry in a four-dimensional tensor of size n × n × n × n, indexed by i, j, k, l, where n is the number of basis functions. To implement efficient HF procedures, it is essential to exploit the structure of this tensor. Each (ij|kl) is an integral, called an ERI, given by

    (ij|kl) = \int \phi_i(x_1) \phi_j(x_1) r_{12}^{-1} \phi_k(x_2) \phi_l(x_2) \, dx_1 dx_2        (2)

where the \phi functions are basis functions, x_1 and x_2 are coordinates in R^3, and r_12 = ||x_1 - x_2||. From this formula, it is evident that the ERI tensor has eight-way symmetry, since (ij|kl) = (ij|lk) = (ji|kl) = (ji|lk) = (kl|ij) = (kl|ji) = (lk|ij) = (lk|ji).

The basis functions are Gaussians (or combinations of Gaussians) whose centers are at the coordinates of one of the atoms in the molecular system.

Algorithm 2. Generic distributed Fock matrix construction.
1  for unique shell quartets (MN|PQ) do
2    if (MN|PQ) is not screened out then
3      Compute shell quartet (MN|PQ);
4      Receive submatrices D_MN, D_PQ, D_NP, D_MQ, D_NQ, D_MP;
5      Compute contributions to submatrices F_MN, F_PQ, F_NP, F_MQ, F_NQ, F_MP;
6      Send submatrices of F to their owners;
7    end
8  end

The generic distributed Fock matrix construction algorithm is shown as Algorithm 2. The algorithm loops over shell quartets rather than Fock matrix entries in order to apply screening to shell quartets, and to not repeat the computation of symmetric but otherwise identical shell quartets. Because of symmetry, each uniquely computed shell quartet contributes to the computation of six submatrices of F and requires six submatrices of D. The data access pattern arises from
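To make the contraction in equation (1) concrete, the sketch below builds F directly from a full ERI tensor with NumPy. This is a dense reference for illustration only: a production code never forms the n^4 tensor, but instead loops over screened, symmetry-unique shell quartets as in Algorithm 2, reading the six D submatrices and scattering contributions into the six F submatrices listed there.

```python
import numpy as np

def fock_matrix_dense(Hcore, D, eri):
    """Reference (dense) Fock build implementing equation (1):

        F_ij = Hcore_ij + sum_kl D_kl * (2*(ij|kl) - (ik|jl))

    Hcore : (n, n) core Hamiltonian
    D     : (n, n) density matrix
    eri   : (n, n, n, n) ERI tensor with eri[i, j, k, l] = (ij|kl)
    """
    coulomb  = np.einsum('kl,ijkl->ij', D, eri)   # sum_kl D_kl (ij|kl)
    exchange = np.einsum('kl,ikjl->ij', D, eri)   # sum_kl D_kl (ik|jl)
    return Hcore + 2.0 * coulomb - exchange
```

The Coulomb-like term is the source of the F_MN and F_PQ updates in Algorithm 2, and the exchange-like term is the source of the F_NP, F_MQ, F_NQ, and F_MP updates, which is why a single unique shell quartet touches exactly six blocks of F.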