A Novel Approach to Parallel Coupled Cluster Calculations: Combining Distributed and Shared Memory Techniques for Modern Cluster Based Systems Ryan M

Chemistry Publications Chemistry 5-2007 A Novel Approach to Parallel Coupled Cluster Calculations: Combining Distributed and Shared Memory Techniques for Modern Cluster Based Systems Ryan M. Olson Iowa State University Jonathan Lee Bentz Iowa State University Ricky A. Kendall Iowa State University Michael Schmidt Iowa State University, [email protected] MFoallorkw S .thi Gors adonnd additional works at: http://lib.dr.iastate.edu/chem_pubs IowaP Satrate of U ntheiversitCyhe, mmigorsdon@itry Commonastate.edus, and the Computer Sciences Commons The ompc lete bibliographic information for this item can be found at http://lib.dr.iastate.edu/ chem_pubs/490. For information on how to cite this item, please visit http://lib.dr.iastate.edu/ howtocite.html. This Article is brought to you for free and open access by the Chemistry at Iowa State University Digital Repository. It has been accepted for inclusion in Chemistry Publications by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. A Novel Approach to Parallel Coupled Cluster Calculations: Combining Distributed and Shared Memory Techniques for Modern Cluster Based Systems Abstract A parallel coupled cluster algorithm that combines distributed and shared memory techniques for the CCSD(T) method (singles + doubles with perturbative triples) is described. The implementation of the massively parallel CCSD(T) algorithm uses a hybrid molecular and “direct” atomic integral driven approach. Shared memory is used to minimize redundant replicated storage per compute process. The algorithm is targeted at modern cluster based architectures that are comprised of multiprocessor nodes connected by a dedicated communication network. Parallelism is achieved on two levels: parallelism within a compute node via shared memory parallel techniques and parallelism between nodes using distributed memory techniques. The new parallel implementation is designed to allow for the routine evaluation of mid- (500−750 basis function) to large-scale (750−1000 basis function) CCSD(T) energies. Sample calculations are performed on five low-lying isomers of water hexamer using the aug-cc-pVTZ basis set. Disciplines Chemistry | Computer Sciences Comments The following article appeared in Journal of Chemical Theory and Computation 3 (2007): and may be found at doi:10.1021/ct600366k. Rights Copyright 2007 American Institute of Physics. This article may be downloaded for personal use only. Any other use requires prior permission of the author and the American Institute of Physics. This article is available at Iowa State University Digital Repository: http://lib.dr.iastate.edu/chem_pubs/490 1312 J. Chem. Theory Comput. 2007, 3, 1312-1328 A Novel Approach to Parallel Coupled Cluster Calculations: Combining Distributed and Shared Memory Techniques for Modern Cluster Based Systems Ryan M. Olson,† Jonathan L. Bentz,‡ Ricky A. Kendall,‡,§ Michael W. Schmidt,† and Mark S. Gordon*,† Department of Chemistry, Iowa State UniVersity, Ames, Iowa, Department of Computer Science, Iowa State UniVersity, Ames, Iowa 50011, and Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831 Received January 17, 2007 Abstract: A parallel coupled cluster algorithm that combines distributed and shared memory techniques for the CCSD(T) method (singles + doubles with perturbative triples) is described. The implementation of the massively parallel CCSD(T) algorithm uses a hybrid molecular and “direct” atomic integral driven approach. Shared memory is used to minimize redundant replicated storage per compute process. The algorithm is targeted at modern cluster based architectures that are comprised of multiprocessor nodes connected by a dedicated communication network. Parallelism is achieved on two levels: parallelism within a compute node via shared memory parallel techniques and parallelism between nodes using distributed memory techniques. The new parallel implementation is designed to allow for the routine evaluation of mid- (500-750 basis function) to large-scale (750-1000 basis function) CCSD(T) energies. Sample calculations are performed on five low-lying isomers of water hexamer using the aug-cc-pVTZ basis set. I. Introduction extended formally single-reference CC methods into the Coupled-cluster (CC) methods1-3 are now widely accepted regime of bond making and bond breaking, an area where as the premier single-reference electronic structure methods traditional CC methods break down. for small chemical systems at or near equilibrium geometries. The biggest drawback of CC methods is the large One of the most popular CC methods is CCSD(T), which is computational demands required to perform such calcula- based on an iterative solution of the single and double (SD) tions. However, due to the popularity of methods like CCSD- cluster amplitude equations4 with a noniterative perturbative (T), considerable research has been carried out to generate correction for the triples (T).5 The CCSD(T) approach has highly efficient algorithms4,17-21 and their implementations. been shown6 to be a good compromise between the chemical A variety of CC methods can be found in all of the major accuracy of the higher-order CCSDT (full triples) method7 electronic structure programs available today, including and the computational efficiency of low order many-body GAMESS,22 MOLPRO,23 ACES II,24 Q-CHEM,25 PSI3,26 perturbation theory (MBPT). Equation of motion (EOM) CC NWCHEM,27,28 DALTON,29 and GAUSSIAN03.30 Most CC methods8-12 have been developed for excited-state calcula- programs are highly optimized to run sequentially. This tions. Spin flip13,14 and method of moments CC methods,15 usually means the calculation is performed on a single including the popular renormalized (R),15 completely renor- processor. The speed of the processor and the size of the 15 16 malized (CR), and CR-CCSD(T)L (CCL) methods, have associated memory and disk are limiting factors for sequential algorithms. CCSD(T) calculations, especially those run in * Corresponding author e-mail: [email protected]. C1 symmetry, reach the limit of most single processor † Department of Chemistry, Iowa State University. workstations at around 400-500 basis functions (BF); even ‡ Department of Computer Science, Iowa State University. then, calculations of these sizes may require weeks of time § Oak Ridge National Laboratory. on a dedicated workstation.31 10.1021/ct600366k CCC: $37.00 © 2007 American Chemical Society Published on Web 05/30/2007 Modern Cluster Based Systems J. Chem. Theory Comput., Vol. 3, No. 4, 2007 1313 One means of evaluating computationally demanding the roadmap for future generations of computers. The next problems such as large basis set (>500 BF) CCSD(T) generation(s) of processors is(are) not expected to dramati- calculations is to make use of parallel computing. Parallel cally increase in frequency, which traditionally has accounted computing involves simultaneously evaluating multiple por- for 80% of the performance improvements. Rather, the tions of a larger computational problem on multiple proces- current trend is to add multiple processing “cores” on each sors, in order to achieve an overall reduction in the real- processor. This use of multicore processors in multiprocessor time evaluation of the problem. Equally important, parallel nodes further increases the computational density per node computing can extend computationally demanding methods and further emphasizes the need to address different parallel like CCSD(T) to larger problems because of increased strategies for intra- and internode computing and data computational resources and also storage (memory/disk) management within current and future cluster based systems. resources. There is a wide range of parallel computing The focus of this work is to describe an algorithm for the environments and methodologies, two examples of which CCSD(T) method that can utilize both intranode and inter- are addressed herein. These are as follows: (1) parallelism node forms of parallelism. Algorithms for parallel CC that is achieved by using multiple computers or nodes which methods39-43 have been developed by other groups. These are connected by a dedicated communication network and methodologies for the parallelization of CC methods and (2) parallelism that is achieved by multiple processors within other correlation methods were divided into two categories: a single node that share “local” system resources including those aimed at shared memory machines (SMPs) and those memory and I/O channels. aimed at distributed memory machines. Early work by The tools and methodologies for these two traditional types Komornicki, Lee, and Rendell39 described a highly vectorized of parallel computing environments are very different. shared memory algorithm for evaluating the connected triples Multinode parallelism focuses on combining replicated and/ excitations (T) on the CRAY Y-MP. Vectorized shared- or distributed memory techniques using parallel communica- memory CCSD and CCSD(T) algorithms based on AO tion libraries such as TCGMSG,32 SHMEM,33 MPI,34 Global integrals stored on disk were later implemented by Koch and Arrays (GA),35 and the Distributed Data Interface (DDI).36,37 co-workers.44,45 These early shared-memory vectorized al- One advantage of multinode models is that the aggregate gorithms primarily used optimized library calls to gain system resources increase as the number of nodes increases, computational speedup (the libraries, not the programs

Load more