Quantum Chemistry (QC) on GPUs (March 2018): Overview of Life & Material Accelerated Apps
Overview of Life & Material Accelerated Apps
- MD: all key codes are GPU-accelerated.
- QC: all key codes are ported or being optimized, with great multi-GPU performance.
- Focus on using GPU-accelerated math libraries and OpenACC directives.
- Focus on dense GPU nodes (up to 16 GPUs) and/or large numbers of GPU nodes.

GPU-accelerated and available today:
- MD: ACEMD*, AMBER (PMEMD)*, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HTMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more.
- QC: ABINIT, ACES III, ADF, BAND, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, Q-Chem, QMCPack, Quantum Espresso/PWscf, QUICK, TeraChem*.
- Active GPU acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more.

(* = application where >90% of the workload is on the GPU)

MD vs. QC on GPUs

| "Classical" Molecular Dynamics | Quantum Chemistry (MO, PW, DFT, Semi-Emp.) |
|---|---|
| Simulates positions of atoms over time; chemical-biological or chemical-material behaviors | Calculates electronic properties: ground state, excited states, spectral properties, making/breaking bonds, physical properties |
| Forces calculated from simple empirical formulas (bond rearrangement generally forbidden) | Forces derived from the electron wave function (bond rearrangement OK, e.g., bond energies) |
| Up to millions of atoms | Up to a few thousand atoms |
| Solvent included without difficulty | Generally in vacuum; if needed, solvent is treated classically (QM/MM) or with implicit methods |
| Single precision dominated (FP32) | Double precision is important (FP64) |
| Uses cuFFT, CUDA | Uses cuBLAS, cuFFT, tensor/eigen solvers, OpenACC |
| GeForce (workstations), Tesla (servers) | Tesla recommended |
| ECC off | ECC on |

Accelerating Discoveries
Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom
simulation of the HIV virus and discovered the chemical structure of its capsid, "the perfect target for fighting the infection." Without GPUs, the supercomputer would need to be 5x larger for similar performance.

GPU-Accelerated Quantum Chemistry Apps
(Green lettering on the original slide indicates performance slides included.)
Abinit, ACES III, ADF, BigDFT, CP2K, DIRAC, FHI-AIMS, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, Mopac2012, NWChem, Octopus, ONETEP, Petot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, RMG, TeraChem, UNM, VASP, WL-LSMS.
GPU performance compared against a dual multi-core x86 CPU socket.

ABINIT on GPUs
Speed in the parallel version:
- For ground-state calculations, GPUs can be used; this is based on CUDA+MAGMA.
- For ground-state calculations, the wavelet part of ABINIT (which is BigDFT) is also very well parallelized: MPI band parallelism, combined with GPUs.

BigDFT (courtesy of the BigDFT team @ CEA)

GAMESS (February 2018)
[Chart: GAMESS valinomycin RI-MP2 energy on V100 vs. P100 (PCIe, 16GB). Throughput (1/seconds) rises from 0.0002 on a CPU-only node to 0.0013-0.0016 with 1-4 P100s per node and 0.0014-0.0021 with 1-4 V100s per node, i.e. speedups of 6.5X-8.0X (P100) and 7.0X-10.5X (V100, untuned on Volta). The CPU-only node contains dual Intel Xeon E5-2690 v4 @ 2.6GHz (3.5GHz Turbo) Broadwell CPUs; the GPU nodes add 1, 2, or 4 Tesla P100 or V100 PCIe (16GB) GPUs.]

Gaussian 16 (March 2018)
Parallelization Strategy
Within Gaussian 16, GPUs are used for a small fraction of the code that consumes a large fraction of the execution time.
The implementation of GPU parallelism conforms to Gaussian's general parallelization strategy. Its main tenets are to avoid changing the underlying source code and to avoid modifications which negatively affect CPU performance. For these reasons, OpenACC was used for GPU parallelization.

The Gaussian approach to parallelization relies on environment-specific parallelization frameworks and tools: OpenMP for shared memory, Linda for cluster and network parallelization across discrete nodes, and OpenACC for GPUs.

PGI Accelerator Compilers with OpenACC
PGI compilers fully support the current OpenACC standard as well as important extensions to it. PGI is an important contributor to the ongoing development of OpenACC. OpenACC enables developers to implement GPU parallelism by adding compiler directives to their source code, often eliminating the need for rewriting or restructuring. For example, the following Fortran compiler directive identifies a loop which the compiler should parallelize:

!$acc parallel loop

Other directives allocate GPU memory, copy data to/from GPUs, specify data to remain on the GPU, combine or split loops and other code sections, and generally provide hints for optimal work distribution management, and more.

The process of implementing GPU support involved many different aspects:
- Identifying places where GPUs could be beneficial. These are a subset of areas which are parallelized for other execution contexts, because using GPUs requires fine-grained parallelism.
- Understanding and optimizing data movement/storage at a high level to maximize GPU efficiency.

The OpenACC project is very active, and the specifications and tools are changing fairly rapidly. This has been true throughout the lifetime of this project; indeed, one of its major challenges has been using OpenACC in the midst of its development. The talented people at PGI were instrumental in addressing issues that arose in one of the very first uses of OpenACC for a large commercial software package. PGI's sophisticated profiling and performance evaluation tools were vital to the success of the effort.

Specifying GPUs to Gaussian 16
The GPU implementation in Gaussian 16 is sophisticated and complex, but using it is simple and straightforward. GPUs are specified with one additional Link 0 command (or the equivalent Default.Route file entry or command-line option). For example, the following commands tell Gaussian to run the calculation using 24 compute cores plus 8 GPUs with 8 controlling cores (32 cores total):

%CPU=0-31        Request 32 CPUs for the calculation: 24 cores for computation and 8 cores to control GPUs (see below).
%GPUCPU=0-7=0-7  Use GPUs 0-7 with CPUs 0-7 as their controllers.

Detailed information is available on our website.

Project Contributors
Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian), Brent Leback (NVIDIA/PGI), Giovanni Scalmani (Gaussian, Inc.)

GAUSSIAN 16: A Leading Computational Chemistry Code
Valinomycin frequency calculation, APFD/6-311+G(2d,p): 7X speedup on 8x P100 GPUs.

"Using OpenACC allowed us to continue development of our fundamental algorithms and software capabilities simultaneously with the GPU-related work. In the end, we could use the same code base for SMP, cluster/network and GPU parallelism. PGI's compilers were essential to the success of our efforts."
(Mike Frisch, Ph.D., President and CEO, Gaussian, Inc.)
Hardware: HPE Apollo 6500 server with dual Intel Xeon E5-2680 v4 CPUs (2.40GHz; 14 cores/chip, 28 cores total), 256GB memory, and 8 Tesla P100 GPU boards (autoboost clocks). Gaussian source code compiled with PGI Accelerator Compilers (17.7) with OpenACC (2.5 standard).

Effects of Using K80 Boards
[Chart: speedup of up to ~2.5x over a run without GPUs on dual Haswell E5-2698 v3 @ 2.3GHz CPUs, for 0, 1, 2, and 4 K80 boards. Elapsed times:]

| Job      | 0 boards  | 1 board   | 2 boards  | 4 boards  |
|----------|-----------|-----------|-----------|-----------|
| rg-a25   | 7.4 hrs   | 6.4 hrs   | 5.8 hrs   | 5.1 hrs   |
| rg-a25td | 27.9 hrs  | 25.8 hrs  | 22.6 hrs  | 20.1 hrs  |
| rg-on    | 2.5 hrs   | 2.1 hrs   | 1.7 hrs   | 1.5 hrs   |
| rg-ontd  | 55.4 hrs  | 43.6 hrs  | 33.7 hrs  | 26.8 hrs  |
| rg-val   | 229.0 hrs | 168.4 hrs | 141.7 hrs | 101.5 hrs |

GAUSSIAN 16 P100 PERFORMANCE
Dual-socket 28-core Broadwell vs. 8x P100 GPUs. [Chart: elapsed times, with speedup over the run without GPUs; hours paired as read off the chart:]

| Job      | No GPUs   | 8x P100  | Speedup |
|----------|-----------|----------|---------|
| rg-a25   | 6.9 hrs   | 2.9 hrs  | 2.4x    |
| rg-a25td | 26.6 hrs  | 8.5 hrs  | 3.1x    |
| rg-on    | 2.2 hrs   | 0.7 hrs  | 3.1x    |
| rg-ontd  | 53.6 hrs  | 12.4 hrs | 4.3x    |
| rg-val   | 213.9 hrs | 30.5 hrs | 7.0x    |

GPU-ACCELERATED GAUSSIAN 16 AVAILABLE
100% PGI OpenACC port (no CUDA).
- Gaussian is a top-ten HPC (quantum chemistry) application.
- 80-85% of use cases are GPU-accelerated (Hartree-Fock and DFT: energies, 1st derivatives (gradients), and 2nd derivatives for ground and excited states). More functionality to come.
- K40, K80, and P100 support in the B.01 release.
- No pricing difference between Gaussian CPU and GPU versions.
- Existing Gaussian 09 customers under a maintenance contract get a free upgrade.
- Existing non-maintenance customers are required to pay an upgrade fee.