Performance Study of Popular Computational Chemistry Software Packages on Cray HPC Systems

Junjie Li ([email protected]), Shijie Sheng ([email protected]), Raymond Sheppard ([email protected])
Pervasive Technology Institute, Indiana University, 535 W Michigan St, Indianapolis, IN, USA

Abstract

Two Cray HPC systems, an XE6/XK7 and an XC30, are deployed at Indiana University serving scientific research across all eight campuses. Across the more than 130 scientific disciplines that use HPC, the major source of workload comes from chemistry. Here, the performance of the quantum chemistry package NWChem and the molecular dynamics software GROMACS and LAMMPS is investigated in terms of parallel scalability and different parallelization paradigms. Parallel performance up to 12,288 MPI ranks is studied, and the results from the two Cray HPC systems are compared.

I. Introduction

Indiana University currently focuses on the physical sciences, liberal arts, and medical sciences, with very limited engineering. Research is supported by a long-serving 1,020-node XE6/XK7 [1] and a recently installed 552-node XC30 [2]. Chemistry and physics are the two primary consumers of high-performance computing (HPC); the chemistry department alone consumes over 40% of the total workload. Chemistry has long been a field that relies heavily on simulation, and computational chemistry software is becoming increasingly complicated: more advanced algorithms are being implemented, and multiple parallel programming paradigms are often available for each model. Therefore, as an HPC service provider, it is crucial to understand the performance and limitations of these applications so that we can improve our service and help scientists generate the best throughput for their research.

In this work, we evaluate the performance of three popular computational chemistry packages, NWChem [3], GROMACS [4], and LAMMPS [5], and present the results of multiple configurations on each of our Cray machines. Together, these packages cover both quantum chemical methods and classical molecular dynamics (MD). Various levels of theory are benchmarked, from the popular fast methods to the highly expensive, accurate methods.

This paper is organized as follows: Section II introduces Indiana University's two Cray HPC systems, Section III outlines the functionality and features of the computational chemistry software packages, Section IV discusses our benchmark methods, Section V presents the results, and Section VI concludes the paper.

II. Cray HPCs at Indiana University

A. BigRed II (XE6/XK7)

Big Red II is Indiana University's primary system for high-performance computing. With a theoretical peak performance (Rpeak) of one thousand trillion floating-point operations per second (1 petaFLOPS) and a maximal achieved performance (Rmax) of 596.4 teraFLOPS, Big Red II is among the world's fastest research supercomputers. Owned and operated solely by IU, Big Red II is designed to accelerate discovery both through computational experimentation and through effective analysis of big data in a wide variety of fields, including medicine, physics, fine arts, and global climate research.

Big Red II features a hybrid architecture based on two Cray supercomputer platforms. As configured upon entering production in August 2013, Big Red II comprised 344 XE6 (CPU-only) compute nodes and 676 XK7 "GPU-accelerated" compute nodes, all connected through Cray's Gemini scalable interconnect, providing a total of 1,020 compute nodes, 21,824 processor cores, and 43,648 GB of RAM. Each XE6 node has two AMD Opteron 16-core Abu Dhabi x86_64 CPUs and 64 GB of RAM; each XK7 node has one AMD Opteron 16-core Interlagos x86_64 CPU, 32 GB of RAM, and one NVIDIA Tesla K20 GPU accelerator.

B. BigRed II+ (XC30)

Big Red II+ is a supercomputer that complements Indiana University's Big Red II by providing an environment dedicated to large-scale, compute-intensive research. Researchers, scholars, and artists with large-scale research needs have benefited from Big Red II; these users can now take advantage of the faster processing and networking provided by Big Red II+. The system will help support programs at the highest level of the university, such as the Grand Challenges Program.

Big Red II+ is a Cray XC30 supercomputer providing 552 compute nodes, each containing two Intel Xeon E5 12-core x86_64 CPUs and 64 GB of DDR3 RAM. Big Red II+ has a theoretical peak performance (Rpeak) of 286 trillion floating-point operations per second (286 teraFLOPS). All compute nodes are connected through the Cray Aries interconnect.

III. Computational Chemistry Software Packages

A. NWChem

NWChem is an ab initio computational chemistry software package that includes both quantum chemical and molecular dynamics functionality. It was designed to run on high-performance parallel supercomputers as well as conventional workstation clusters, and it aims to be scalable, both in its ability to treat large problems efficiently and in its use of available parallel computing resources. NWChem has been developed by the Molecular Sciences Software group of the Theory, Modeling & Simulation program in the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). It was initially funded by the EMSL Construction Project.

B. GROMACS

The GROningen MAchine for Chemical Simulations (GROMACS) was initially released in 1991 by the University of Groningen, in the Netherlands, and has been developed at the Royal Institute of Technology and Uppsala University in Sweden since 2001. It has gained popularity not only because it is easy to use and has a rich tool set for post-analysis, but also because its algorithms perform well and map well onto different parallel paradigms.

C. LAMMPS

The Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) was developed at Sandia National Laboratories. It uses MPI for parallel communication and provides extensive packages (GPU, USER-OMP) that enable other parallel models. Many of its models are optimized for CPUs and GPUs. It also offers some features that other MD packages do not provide, for example, reactive simulations.

IV. Benchmark Methods

NWChem represents the state of the art in quantum chemistry software; its parallel performance outperforms that of other programs such as GAMESS. Here we choose two representative types of calculation for benchmarking. The most popular in general use, owing to its relatively fast time to completion, is the Density Functional Theory (DFT) energy calculation, which scales as O(N^3)-O(N^4). The other is the so-called "gold standard" method of quantum chemistry, coupled cluster with single, double, and perturbative triple excitations (CCSD(T)), which has O(N^7) scaling. A 240-atom carbon nanotube was used for the DFT calculation at the PBE/6-31G* level of theory; in total, this creates 3,600 basis functions. A smaller system, pentacene, with 378 basis functions, was used for the CCSD(T) tests.

For the molecular dynamics simulations using LAMMPS and GROMACS, we prepared a box of SPC/E water molecules (~100k atoms) in a 10 nm cubic box at room temperature and 1 atmosphere of pressure. The long-range Coulomb forces are treated with the particle mesh Ewald (PME) method. The system is pre-equilibrated, and each performance test runs for 5 minutes of wall time with a 2 fs time step.

All programs were compiled using the GNU 4.9.3 compiler and the corresponding math library; math library performance has been shown to be the main factor for overall performance [6]. While the XE6/XK7 uses AMD processors, the XC30 uses Intel processors. For each machine, we chose the most recent Cray LibSci math library available (17.09.1 on the XC30 and 16.03.1 on the XE6/XK7). Cray MPI was used for all applications and all tests. The XE6/XK7 was run exclusively in Extreme Scalability Mode (ESM); this distinction is not applicable on the XC30.
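As a rough reading of the nominal scalings quoted above for the two NWChem benchmarks (ignoring prefactors, integral screening, and parallel efficiency), doubling the number of basis functions N multiplies the cost by

    T_DFT(2N) / T_DFT(N) ~ 2^3 to 2^4 = 8-16x
    T_CCSD(T)(2N) / T_CCSD(T)(N) ~ 2^7 = 128x

which is why DFT can be run routinely on the 3,600-basis-function nanotube while the CCSD(T) benchmark is restricted to the much smaller 378-basis-function pentacene system.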

[Figure 1. Performance of DFT calculations in NWChem on Cray XE6/XK7 and XC30. Series: XE6 (32ppn), XE6 (16ppn), XC30 (24ppn), XC30 (12ppn); y-axis: time (s), 0-12,000; x-axis: node count, 2-64.]

V. Benchmark Results

A. NWChem

1. DFT calculations

Using 3,600 localized basis functions is about the upper limit that can routinely be used for energy minimization or ab initio dynamics with DFT, so our test system is well representative of actual research usage. Using only half of the cores on each node for large MPI jobs may improve performance by distributing the MPI communication across a larger section of the network and reducing the communication pressure on each individual node; in the case of the XE6, it also allows tying each core to its own floating-point unit. Both full-node (all cores on each node) and half-node jobs were therefore tested. The results are summarized in Figure 1.

Comparing tests within the same machine, the timing continues to improve up to 64 nodes on both machines. The speedup for both machines in the 1-to-16-node range is about 80~90% each time the node count is doubled, and about 60% in the range from 16 to 32 nodes. At 64 nodes, the XE6 gained only ~4% extra speed relative to the 32-node test, while the XC30's improvement at this size was still measured at 25%. The MPI libraries on both machines errored at 128 nodes. Comparing the machines against each other at the same node counts, the XC30 was a clear winner over the XE6: for node counts below 32, the new XC30 outperforms the old XE6 by about 40%, and by 64 nodes this disparity has grown to 59%. The timing results indicate that both machines are well suited for DFT calculations. Lastly, using only half of each node does not improve performance in this case.

2. CCSD(T)

CCSD(T) is a computationally intensive calculation; the method is difficult to implement well and very expensive to run. The 378-basis-function calculation performed here is a large one and is a good representation of how this "gold standard" method is most commonly used in practice: to provide a theoretical best value against which less expensive methods are compared during method benchmarks. Due to the limitations of Big Red II's configuration, our largest test on that system used 128 nodes. As shown in Figure 2, the calculation again scales relatively well on both machines, even up to 512 nodes. On 32 nodes, the XC30 outperforms the XE6 by 32%; using 64 nodes, the XC30 ran 60% faster. This trend continued, though attenuated, at 128 nodes, where the XC30 was 65% faster than the XE6. These timings reflect not only the stronger processing power of the newer CPUs but also the faster interconnect. This was confirmed by rerunning the jobs using only half of the cores on each node (every two cores on the XE6 share a single floating-point unit); the relative performance did not change much at larger node counts.

[Figure 2. Performance of CCSD(T) calculations in NWChem on Cray XE6/XK7 and XC30. Series: XE6 (32ppn), XE6 (16ppn), XC30 (24ppn), XC30 (12ppn); y-axis: time (s), 0-25,000; x-axis: node count, 32-512.]
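The relative-performance figures quoted above (the "80~90% speedup per doubling" and the "X% faster" comparisons) can be reproduced from raw wall-clock times as sketched below. This is a minimal illustration of one plausible reading of those metrics; the helper names and the timings are hypothetical, not the measured values behind Figures 1 and 2.

    # Illustrative helpers only; the timings below are made up, not measurements.
    def percent_faster(t_slow, t_fast):
        """How much faster a run with wall time t_fast is than one with t_slow, in percent."""
        return (t_slow / t_fast - 1.0) * 100.0

    def gain_per_doubling(times_by_nodes):
        """Extra speed gained each time the node count doubles, in percent
        (100% would correspond to ideal scaling)."""
        nodes = sorted(times_by_nodes)
        return {
            n2: percent_faster(times_by_nodes[n1], times_by_nodes[n2])
            for n1, n2 in zip(nodes, nodes[1:])
            if n2 == 2 * n1
        }

    # Hypothetical DFT wall times (seconds) on one machine at several node counts.
    dft_times = {2: 11000.0, 4: 6100.0, 8: 3400.0, 16: 1900.0, 32: 1200.0, 64: 1150.0}
    print(gain_per_doubling(dft_times))   # ~ {4: 80, 8: 79, 16: 79, 32: 58, 64: 4}
    print(percent_faster(1200.0, 900.0))  # ~ 33, a "33% faster" style comparison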

B. GROMACS

[Figure 3. GROMACS performance on the Cray XC30; panels (A)-(C) show the pure MPI, pure OpenMP, and hybrid MPI+OpenMP models, respectively.]

1. BigRed2+ (XC30)

Figure 3(A) presents the pure MPI tests. In all cases, performance scales well with increasing processes per node (ppn). Hyper-threading helps boost performance below 16 nodes; when the number of nodes grows beyond 16, it starts to hamper performance. The same conclusion can be drawn for the pure OpenMP model in Figure 3(B). The scalability of pure OpenMP is worse than that of the pure MPI model; in fact, it hits a bottleneck beyond 32 nodes. For the hybrid MPI+OpenMP case in Figure 3(C), performance is much better than the pure OpenMP model and almost comparable to the pure MPI model. However, for node counts larger than 32, pure MPI is faster: for example, a 128-node job using pure MPI (ppn=24) is 82% faster than the best-performing pure OpenMP job and 97% faster than the hybrid MPI+OpenMP runs.

Part of the profiling results from the above tests is presented in Figure 4. As the number of nodes increases, MPI communication takes more time and ultimately occupies the majority of the run time; for 128-node jobs, MPI takes ~92% of the run time for all parallel models. In that regime, hyper-threading significantly hinders performance, with a 64% reduction relative to the ppn=24 runs. As shown in Figure 4(C), even though the hybrid MPI+OpenMP scheme reduces the total number of MPI ranks, it does not boost performance. The reason is two-fold: first, the OpenMP implementation introduces run-time overhead; second, atomic operations in the algorithm can pause other running threads. For the 128-node case, ETC (all contributions other than MPI communication and user code) takes 11% of the run time, much higher than the 1% cost in a pure MPI (ppn=24) job. On the other hand, reducing the number of MPI ranks does not actually reduce the communication cost: using ppn=6 and omp=4 only lowers the MPI share from 92% to 81%.

[Figure 4. GROMACS profiling on the Cray XC30: (A) PPN24:OMP1, (B) PPN48:OMP1, (C) PPN6:OMP4.]
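The ppn/OMP labels used in Figures 3-6 map onto the node hardware described in Section II (24 cores with hyper-threading per XC30 node; 32 cores sharing 16 floating-point units per XE6 node). The sketch below is a hypothetical helper written for illustration only, not part of the benchmark scripts, showing how each label translates into MPI ranks and threads:

    # Hypothetical helper illustrating the run configurations; not used in the benchmarks.
    NODE_INFO = {
        "XC30": {"cores": 24, "hw_threads": 48},  # 2 x 12-core Xeon E5, hyper-threading available
        "XE6":  {"cores": 32, "hw_threads": 32},  # 2 x 16-core Opteron; 2 cores share 1 FPU
    }

    def layout(machine, nodes, ppn, omp=1):
        """Total MPI ranks and threads per node for a given configuration label."""
        info = NODE_INFO[machine]
        threads_per_node = ppn * omp
        assert threads_per_node <= info["hw_threads"], "oversubscribed node"
        return {
            "mpi_ranks": nodes * ppn,
            "threads_per_node": threads_per_node,
            "hyper_threaded": threads_per_node > info["cores"],
        }

    print(layout("XC30", nodes=512, ppn=24))        # 12,288 ranks: the largest run in this study
    print(layout("XC30", nodes=128, ppn=48))        # "PPN48:OMP1": relies on hyper-threading
    print(layout("XC30", nodes=128, ppn=6, omp=4))  # "PPN6:OMP4": hybrid MPI+OpenMP
    print(layout("XE6",  nodes=64,  ppn=16))        # half-node run: one rank per floating-point unit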

2. BigRed2 (Cray XE6/XK7)

Similar to BigRed2+, pure OpenMP jobs perform much worse than pure MPI jobs (Figure 5); even on a single node, it is recommended to run GROMACS with pure MPI ranks. The internal network bandwidth of BigRed2 is lower than that of BigRed2+; as a result, when using a large number of nodes the hybrid MPI+OpenMP model is observed to be slightly faster than the pure MPI model (~3% at 128 nodes).

[Figure 5. GROMACS performance on the Cray XE6.]

On BigRed2's XK7 nodes, the compute mode of the Tesla K20 GPUs is set to exclusive process, so multiple MPI ranks are not allowed to share a single GPU context. As shown in Figure 6(A), when running on a small number of nodes (fewer than 16), jobs perform better with more OpenMP threads; when the node count is large, however, too many threads slow the runs down. On XK7 nodes, CUDA acceleration greatly boosts performance: Figure 6(B) shows a 4x speedup when CUDA is enabled on one node. This gain shrinks as more nodes are used, but on 128 nodes CUDA still performs 31% better than the same run with the GPU left unused.

[Figure 6. GROMACS performance on the Cray XK7.]

3. Comparison

Compared with the XC30 nodes of BigRed2+, BigRed2 is slower in this benchmark even though it has GPU nodes and more CPU cores. On 128 nodes, the best performance on the XC30 reaches 276 ns/day, while the best XE6 and XK7 results are 214 ns/day and 176 ns/day, respectively.

C. LAMMPS

1. BigRed2+ (XC30)

Generally speaking, LAMMPS runs slower than GROMACS for pure water simulations because of the fast SETTLE algorithm used in GROMACS, which is specifically targeted at 3-site water molecules. As shown in Figure 7(A), increasing ppn to 24 for 128-node jobs starts to lower the overall speed; MPI communication appears to have become the major issue. Figure 7(C) likewise shows that hybrid MPI+OpenMP runs faster than pure MPI jobs. As in GROMACS, hyper-threading slightly increases the speed (5%-10%) only when a small number of nodes is used (fewer than 16); when the number of nodes is large, it is unfavorable and should be avoided.
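The ns/day throughput numbers quoted for these MD benchmarks follow directly from the 2 fs time step stated in Section IV. Below is a minimal conversion sketch; the step rate is back-calculated purely for illustration, and the percent_faster helper simply restates comparisons such as the LAMMPS one in Section V.C (87 vs. 51 ns/day, about 70%).

    # Illustrative conversion only; the step rate below is back-calculated, not measured.
    FS_PER_NS = 1.0e6
    SECONDS_PER_DAY = 86400.0

    def ns_per_day(steps_per_second, timestep_fs=2.0):
        """Simulated nanoseconds per day of wall time at the given MD time step."""
        return steps_per_second * timestep_fs * SECONDS_PER_DAY / FS_PER_NS

    def percent_faster(slow_ns_day, fast_ns_day):
        return (fast_ns_day / slow_ns_day - 1.0) * 100.0

    print(ns_per_day(1597.0))          # ~276 ns/day, i.e. roughly 1,600 MD steps per second
    print(percent_faster(51.0, 87.0))  # ~70.6, cf. the LAMMPS comparison of 87 vs. 51 ns/day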

[Figure 7. LAMMPS performance on the Cray XC30 (panels A-C).]



2. BigRed2 (XE6/XK7)

Figure 8 shows the performance on the XE6 (BigRed2's CPU-only nodes) using the different parallel models. Pure OpenMP jobs are the slowest, and pure MPI jobs are fast only at small node counts. Hybrid MPI+OpenMP jobs run the fastest at large node counts, since the hybrid scheme reduces the MPI communication time; it gives a 43% performance boost on 128 nodes compared with a pure MPI job.

To use CUDA on the XK7 (BigRed2's GPU-enabled nodes), only a single MPI process is allowed to access each GPU. However, LAMMPS does not support running CUDA and OpenMP simultaneously through the GPU and USER-OMP packages, so only 1 MPI rank with 1 OpenMP thread is used on each XK7 node for this benchmark. The results are shown in Figure 9. Enabling CUDA results in a 2x-3x increase in speed. On 128 nodes, the performance reaches 42 ns/day, which is comparable with the peak value of 51 ns/day on the XE6 nodes.

[Figure 8. LAMMPS performance on the Cray XE6.]

[Figure 9. LAMMPS performance on the Cray XK7.]

3. Comparison

In the LAMMPS benchmark, the peak performance on BigRed2+ (87 ns/day) is about 70% faster than BigRed2's best (51 ns/day, on the XE6). When the MPI communication load is heavier, BigRed2+ has the better performance.

VI. Conclusion

Across these intensive tests of different computational chemistry methods, we conclude that these state-of-the-art applications can perform very well using massive parallelism on Cray HPC systems. The newer, CPU-only XC30 outperforms the older XE6 by a large margin and even beats the GPU-equipped XK7 at large node counts. For the molecular dynamics programs, where hybrid MPI+OpenMP is available, the hybrid scheme provided slightly better performance.

Acknowledgement

This research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute, and in part by the Indiana METACyt Initiative. The Indiana METACyt Initiative at IU was also supported in part by Lilly Endowment, Inc.

References

[1] https://kb.iu.edu/d/bcqt
[2] https://kb.iu.edu/d/aoku
[3] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus, and W. A. de Jong, "NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations," Comput. Phys. Commun. 181, 1477 (2010). https://www.nwchem-sw.org
[4] B. Hess, C. Kutzner, D. van der Spoel, and E. Lindahl, "GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation," J. Chem. Theory Comput. 4, 435-447 (2008). https://www.gromacs.org
[5] S. Plimpton, "Fast Parallel Algorithms for Short-Range Molecular Dynamics," J. Comp. Phys. 117, 1-19 (1995). https://lammps.sandia.gov
[6] J. Deslippe and Z. Zhao, "Comparing Compiler and Library Performance in Material Science Applications on Edison," Cray User Group 2013. https://cug.org/proceedings/cug2013_proceedings/includes/files/pap135.pdf
