Quantum Chemistry (QC) on GPUs October 2017 Overview of Life & Material Accelerated Apps

MD: All key codes are GPU-accelerated QC: All key codes are ported or optimizing Great multi-GPU performance Focus on using GPU-accelerated math libraries, OpenACC directives Focus on dense (up to 16) GPU nodes &/or large # of GPU nodes GPU-accelerated and available today: ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, , ESPResso, ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS- Folding@Home, GPUgrid.net, GROMACS, HALMD, HTMD, HOOMD- UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, NWChem, *, PEtot, QUICK, Q-Chem, QMCPack, OpenMM, PolyFTS, SOP-GPU* & more Quantum Espresso/PWscf, QUICK, TeraChem* Active GPU acceleration projects: CASTEP, GAMESS, , ONETEP, Quantum Supercharger Library*, VASP & more

green* = application where >90% of the workload is on GPU 2 MD vs. QC on GPUs

“Classical” (MO, PW, DFT, Semi-Emp) Simulates positions of atoms over time; Calculates electronic properties; chemical-biological or ground state, excited states, spectral properties, chemical-material behaviors making/breaking bonds, physical properties

Forces calculated from simple empirical formulas Forces derived from wave function (bond rearrangement generally forbidden) (bond rearrangement OK, e.g., bond energies) Up to millions of atoms Up to a few thousand atoms Solvent included without difficulty Generally in a vacuum but if needed, solvent treated classically (QM/MM) or using implicit methods

Single precision dominated Double precision is important Uses cuBLAS, cuFFT, CUDA Uses cuBLAS, cuFFT, OpenACC Geforce (Accademics), Tesla (Servers) Tesla recommended ECC off ECC on

3 Accelerating Discoveries

Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid — “the perfect target for fighting the infection.”

Without gpu, the supercomputer would need to be 5x larger for similar performance.

4 GPU-Accelerated Quantum Chemistry Apps Green Lettering Indicates Performance Slides Included

Abinit Gaussian Octopus Quantum SuperCharger ACES III GPAW ONETEP Library ADF LATTE Petot RMG BigDFT LSDalton Q-Chem TeraChem CP2K MOLCAS QMCPACK UNM GAMESS-US Mopac2012 Quantum VASP Espresso NWChem WL-LSMS

GPU Perf compared against dual multi-core CPU socket. 5 ABINIT ABINIT on GPUS

Speed in the parallel version: For ground-state calculations, GPUs can be used. This is based on CUDA+MAGMA

For ground-state calculations, the part of ABINIT (which is BigDFT) is also very well parallelized : MPI band parallelism, combined with GPUs BigDFT Courtesy of BigDFT team @ CEA Courtesy of BigDFT team @ CEA Courtesy of BigDFT team @ CEA Courtesy of BigDFT team @ CEA Courtesy of BigDFT team @ CEA Courtesy of BigDFT team @ CEA Gaussian 16 April 2017 Parallelization Strategy Within Gaussi an 16, GPUs are used for a sm all fraction of code that consumes a large fraction of the execution time. T e implementation of GPU parallelism conforms to Gaussi an’s general parallelization strategy. Its main tenets are to avoid changing the underlying and to avoid modif cations which negatively af ect CPU per formance. For these reasons, OpenACC was used for GPU parallelization. PGI Accelerator Compilers with OpenACC PGI compilers fully support the current OpenACC standard as wel l as important extensions to it. PGI is an important contributor to the ongoing development of OpenACC. OpenACC enables developer s to implement GPU parallelism by adding compiler directives to their source code, of en eliminating the need for rewriting or restructuring. For example, the following compiler directive identif es a loop which the compiler should parallelize: !$acc paral l el l oop Other directives allocate GPU memory, copy data to/from GPUs, specify data to remain on the GPU, combine or split loops and other code sections, and gener ally provide hints for optimal work T e Gaussian approach to parallelization relies on environment-speci f parallelization distribution management, and more. frameworks and tools: OpenMP for shared-memory, Linda for cluster and net work parallelization across discret e nodes, and OpenACC for GPUs. T e OpenACC project is ver y active, and the speci f cations and tools are changing fairly rapidly. T is has been true throughout the T e process of implementing GPU support involved many dif er ent aspects: lifet ime of this project. Indeed, one of its major Identifying places wher e GPUs could be benef ci al. T ese are a subset of areas which challenges has been using OpenACC in the midst of its development. T e talented people at PGI are parallelized for other execution contexts because using GPUs requires f ne grained wer e instrumental in addressing issues that arose parallelism. in one of the ver y f rst uses of OpenACC for a Understanding and optimizing data movem ent/storage at a high level to maximize large commer cial sof ware package. GPU ef ciency. PGI’s sophisticated prof ling and per formance evaluation tools wer e vital to the success of the ef ort. Specifying GPUs to Gaussian 16 T e GPU implementation in Gaussi an 16 is sophisticated and complex but using it is simple and straightforward. GPUs are specif ed with 1 additional Link 0 command (or equivalent Def ault.Route f le entry/command line option). For example, the following commands tell Gaussi an to run the calculation using 24 compute cores plus 8 GPUs+8 controlling cores (32 cores total): %CPU= 0 - 3 1 Request 32 CPUs for the calculation: 24 cores for computation, and 8 cores to control GPUs (see below). %GPUCPU=0- 7=0- 7 Use GPUs 0-7 with CPUs 0-7 as thei r controller s. Det ailed information is available on our website. Project Contributors

GAUSSIAN 16 Mike Frisch, Ph.D. President and CEO A Leading Computation Chemistry Code Gaussian, Inc.

Roberto Gomperts Michael Frisch Brent Leback Giovanni Scalmani Gaussian NVIDIA/PGI Gaussian

Gaussi an, Inc. Using OpenACC allowed us to continue 340 Quinnipiac St. Bldg. 40developmentGaussian is ofa reg iourster ed trfundamentalademark of Gaussian, Inc. All other trademarks and register ed trademarks are Wallingford, CT 06492 USA the proper ties of thei r respective holder s. Speci f cations subject to change without notice. custser v@gaussi an.com algorithmsCopyr ighandt © 2017, software Gaussian, Inc. Acapabilitiesll rights reser ved. simultaneously with the GPU-related Valinomycin work. In the end, we could use the wB97xD/6-311+(2d,p) Freq 2.25X speedup same code base for SMP, cluster/ network and GPU parallelism. PGI's

Hardware: HPE server with dual Intel Xeon E5-2698 v3 CPUs (2.30GHz ; 16 cores/chip), 256GB compilers were essential to the memory and 4 Tesla K80 dual GPU boards (boost clocks: MEM 2505 and SM 875). Gaussian source code compiled with PGI Accelerator Compilers (16.5) with OpenACC (2.5 standard). success of our efforts. 16 GPU-ACCELERATED GAUSSIAN 16 AVAILABLE 100% PGI OpenACC Port (no CUDA)

• Gaussian is a Top Ten HPC (Quantum Chemistry) Application. • 80-85% of use cases are GPU-accelerated (Hartree-Fock and DFT: energies, 1st derivatives (gradients) and 2nd derivatives). More functionality to come. • K40, K80 support; P100 support coming as a minor release, performance “good”, faster wall clock times. Early P100 results promising. • No pricing difference between Gaussian CPU and GPU versions. • Existing Gaussian 09 customers under maintenance contract get (free) upgrade. • Existing non-maintenance customers required to pay upgrade fee. • To get the bits or to ask about the upgrade fee, please contact Gaussian, Inc.’s Jim Hess, Operations Manager; [email protected].

17 rg-a25 on K80s

1.6x rg-a25 Running Gaussian version 16 1.4x The blue node contains Dual Intel Xeon E5-2698 [email protected] (Haswell) CPUs 1.2x The green nodes contain Dual Intel Xeon E5-2698 [email protected] (Haswell) 1.0x CPUs + Tesla K80 (autoboost) GPUs

0.8x

Alanine 25. Two steps: Force and up up vs Dual Haswell - Frequency. APFD 6-31G*

0.6x nAtoms = 259, nBasis = 2195 Speed

0.4x

0.2x

7.4 hrs 6.4 hrs 5.8 hrs 5.1 hrs 0.0x 1 Haswell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

18 rg-a25td on K80s

1.6x rg-a25td Running Gaussian version 16 1.4x The blue node contains Dual Intel Xeon E5-2698 [email protected] (Haswell) CPUs 1.2x The green nodes contain Dual Intel Xeon E5-2698 [email protected] (Haswell) 1.0x CPUs + Tesla K80 (autoboost) GPUs

0.8x

Alanine 25. Two Time-Dependent (TD) up up vs Dual Haswell - steps: Force and Frequency. APFD 6-31G*

0.6x nAtoms = 259, nBasis = 2195 Speed

0.4x

0.2x

27.9 hrs 25.8 hrs 22.6 hrs 20.1 hrs 0.0x 1 Haswell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

19 rg-on on K80s

1.8x rg-on Running Gaussian version 16 1.6x The blue node contains Dual Intel Xeon E5-2698 [email protected] (Haswell) CPUs 1.4x The green nodes contain Dual Intel 1.2x Xeon E5-2698 [email protected] (Haswell) CPUs + Tesla K80 (autoboost) GPUs 1.0x GFP ONIOM. Two steps: Force and 0.8x

up up vs Dual Haswell Frequency. APFD/6- - 311+G(2d,p):amber=softfirst)=embed

0.6x nAtoms = 3715 (48/3667), nBasis = 813 Speed

0.4x

0.2x

2.5 hrs 2.1 hrs 1.7 hrs 1.5 hrs 0.0x 1 Haswell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

20 rg-ontd on K80s

2.5x Running Gaussian version 16

rg-ontd The blue node contains Dual Intel Xeon E5-2698 [email protected] (Haswell) CPUs 2.0x The green nodes contain Dual Intel Xeon E5-2698 [email protected] (Haswell) CPUs + Tesla K80 (autoboost) GPUs 1.5x

GFP ONIOM. Two Time-Dependent (TD)

up up vs Dual Haswell steps: Force and Frequency. APFD/6- - 1.0x 311+G(2d,p):amber=softfirst)=embed

nAtoms = 3715 (48/3667), nBasis = 813 Speed

0.5x

55.4 hrs 43.6 hrs 33.7 hrs 26.8 hrs 0.0x 1 Haswell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

21 rg-val on K80s

2.5x Running Gaussian version 16

rg-val The blue node contains Dual Intel Xeon 2.0x E5-2698 [email protected] (Haswell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 [email protected] (Haswell) 1.5x CPUs + Tesla K80 (autoboost) GPUs

Valinomycin. Two steps: Force and up up vs Dual Haswell - 1.0x Frequency. APFD 6-311+G(2d,p)

nAtoms = 168, nBasis = 2646 Speed

0.5x

229.0 hrs 168.4 hrs 141.7 hrs 101.5 hrs 0.0x 1 Haswell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

22 Effects of using K80s

2.5x Effects of using K80 boards 2.3x Haswell E5-2698 [email protected] 2.0x

1.8x

1.5x

1.3x

1.0x

up up over run withoutGPUs -

0.8x Speed

0.5x

hrs hrs hrs hrs

hrs hrs hrs hrs hrs hrs hrs

0.3x hrs

hrs hrs hrs hrs hrs hrs hrs hrs

27.9 25.8 22.6 20.1 55.4 43.6 33.7 26.8

7.4 6.4 5.8 5.1 2.5 2.1 1.7 1.5

229.0 168.4 141.7 101.5 0.0x rg-a25 rg-a25td rg-on rg-ontd rg-val Number of K80 boards 0 1 2 4 23 Gaussian 16 Supported Platforms

• 4-way collaboration; Gaussian, Inc., PGI, NVIDIA and HPE • HPE, NVIDIA and PGI is the development platform • All released/certified x86_64 versions of Gaussian 16 use the PGI compilers • Certified versions of Gaussian 16 use Intel only for Itanium, XLF for some IBM platforms, Fujitsu compilers for some SPARC-based machines and PGI for the rest (including some Apple products) • GINC is collaborating with IBM, PGI (and NVIDIA) to release an OpenPower version of Gaussian that also uses the PGI compiler • See Gaussian Supported Platforms for more details: http://gaussian.com/g16/g16_plat.pdf

24 CLOSING REMARKS

Significant Progress has been made in enabling Gaussian on GPUs with OpenACC OpenACC is increasingly becoming more versatile Significant work lies ahead to improve performance Expand feature set:

PBC, Solvation, MP2, ONIOM, triples-Corrections

25 ACKNOWLEDGEMENTS

Development is taking place with:

Hewlett-Packard (HP) Series SL2500 Servers (Intel® Xeon® E5-2680 v2 (2.8GHz/10- core/25MB/8.0GT-s QPI/115W, DDR3-1866)

NVIDIA® Tesla® GPUs (K40 and later)

PGI Accelerator Compilers (16.x) with OpenACC (2.5 standard)

10/12/2017 26 GPAW Increase Performance with Kepler

3.5 Running GPAW 10258

The blue nodes contain 1x E5-2687W CPU (8 3.0 3 Cores per CPU). 2.7 2.5 The green nodes contain 1x E5-2687W CPU (8 2.5 Cores per CPU) and 1x or 2x NVIDIA K20X for the GPU.

2

1.6 1.5 1.5 1.4

1 1 1

1 Speedup Compared toCPU Only

0.5

0 Silicon K=1 Silicon K=2 Silicon K=3 Increase Performance with Kepler

3 Running GPAW 10258

The blue nodes contain 1x E5-2687W CPU (8 2.5 Cores per CPU).

The green nodes contain 1x E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K20 or K20X for 2 the GPU. 2.4x

1.5 2.2x

1

1.7x Speedup Compared toCPU Only

0.5

0 Silicon K=1 Silicon K=2 Silicon K=3 Increase Performance with Kepler

1.6 Running GPAW 10258

The blue nodes contain 2x E5-2687W CPUs (8 1.4 Cores per CPU).

1.2 The green nodes contain 2x E5-2687W CPUs (8 1.4x Cores per CPU) and 2x NVIDIA K20 or K20X for the GPU. 1 1.4x 0.8 1.3x 0.6

0.4 Speedup Compared toCPU Only

0.2

0 Silicon K=1 Silicon K=2 Silicon K=3 Used with permission from Samuli Hakala

35 36 37 38 39 40 LSDALTON LSDALTON Minimal Effort

Large-scale application for calculating high- Lines of Code # of Weeks # of Codes to accuracy molecular energies Modified Required Maintain <100 Lines 1 Week 1 Source

Big Performance

LS- CCSD(T) Module Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)

11.7x

8.9x

“OpenACC makes GPU computing approachable for 7.9x domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, “

no modifications of our existing CPU implementation. CPU vs Speedup

Janus Juul Eriksen, PhD Fellow qLEAP Center for Theoretical Chemistry, Aarhus University ALANINE-1 ALANINE-2 ALANINE-3 13 ATOMS 23 ATOMS 33 ATOMS

https://developer.nvidia.com/openacc/success-stories 42 NWChem NWChem 6.3 Release with GPU Acceleration

Addresses large complex and challenging molecular-scale scientific problems in the areas of catalysis, materials, geochemistry and biochemistry on highly scalable, parallel computing platforms to obtain the fastest time-to-solution Researchers can for the first time be able to perform large scale with perturbative triples calculations utilizing the NVIDIA GPU technology. A highly scalable multi-reference coupled cluster capability will also be available in NWChem 6.3. The software, released under the Educational Community License 2.0, can be downloaded from the NWChem website at www.-sw.org NWChem - Speedup of the non-iterative calculation for various configurations/tile sizes

System: cluster consisting of dual-socket nodes constructed from:

• 8-core AMD Interlagos processors • 64 GB of memory • Tesla M2090 (Fermi) GPUs

The nodes are connected using a high-performance QDR Infiniband interconnect

Courtesy of Kowolski, K., Bhaskaran-Nair, at al @ PNNL, JCTC (submitted) Kepler, Faster Performance (NWChem)

180 165 Uracil 160

140

120

100 81 80

60 54 Time to Solution (seconds) Solution to Time 40

20 Uracil 0 CPU Only CPU + 1x K20X CPU + 2x K20X

Performance improves by 2x with one GPU and by 3.1x with 2 GPUs Quantum Espresso 5.4.0 December 2016 QUANTUM ESPRESSO Filippo Spiga Head of Research Software Engineering Quantum Chemistry Suite University of Cambridge

CUDA Fortran gives us the full performance potential of the CUDA programming model and NVIDIA GPUs. !$CUF KERNELS directives give us productivity and source code maintainability. It’s the best of both worlds.

www.quantum-espresso.org

48 AUSURF112 on K80s

620 606.00 AUSURF112 600 *Lower is better

580 Running Quantum Espresso version 5.4.0

560 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs seconds 540 528.20 The green node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 520 (Broadwell) CPUs + Tesla K80 (autoboost) 1.1X GPUs 500

480 1 Broadwell node 1 node + 4x K80 per node

49 AUSURF112 on P100s PCIe

700

606.00 AUSURF1120 600 *Lower is better 515.70 500 486.90 1.2X 1.2X Running Quantum Espresso version 5.4.0 400 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo]

seconds (Broadwell) CPUs 300 The green nodes contain Dual Intel Xeon 200 E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

100

0 1 Broadwell node 1 node + 1 node + 4x P100 PCIe per node 8x P100 PCIe per node

50 TeraChem 1.5K Speaker, Date 2.8x 2.8x 3.3x

2.3x 2.7x All timings for complete energy 3.7x and gradient Calculations Slide courtesy of TrpCage TrpCage Ru Complex BPTI Olestra Olestra PetaChem LLC / 6-31G 6-31G** LANL2DZ STO-3G 6-31G* 6-31G* K80 = 1 GPU on Todd Martinez Board, not both RHF RHF B3LYP RHF RHF BLYP 1604 bfn 2900 bfn 4512 bfn 2706 bfn 3181 bfn 3181 bfn 284 atoms 284 atoms 1013 atoms 882 atoms 453 atoms 453 atoms TERACHEM 1.5K; TRIPCAGE ON TESLA K40S

200 TeraChem 1.5K; TripCage on Tesla K40s & IVB CPUs (Total Processing Time in Seconds)

160

120

80

40

0 2 x Xeon E5-2697 [email protected] + 1 x 2 x Xeon E5-2697 [email protected] + 2 x 2 x Xeon E5-2697 [email protected] + 4 x 2 x Xeon E5-2697 [email protected] + 8 x Tesla K40@875Mhz (1 node) Tesla K40@875Mhz (1 node) Tesla K40@875Mhz (1 node) Tesla K40@875Mhz (1 node) 53 TERACHEM 1.5K; TRIPCAGE ON TESLA K40S & HASWELL CPUS

200 TeraChem 1.5K; TripCage on Tesla K40s & Haswell CPUs (Total Processing Time in Seconds)

160

120

80

40

0 2 x Xeon E5-2698 [email protected] + 1 x Tesla 2 x Xeon E5-2698 [email protected] + 2 x Tesla 2 x Xeon E5-2698 [email protected] + 4 x Tesla K40@875Mhz (1 node) K40@875Mhz (1 node) K40@875Mhz (1 node) 54 TERACHEM 1.5K; TRIPCAGE ON TESLA K80S & IVB CPUS

120

TeraChem 1.5K; TripCage on Tesla K80s & IVB CPUs (Total Processing Time in Seconds)

80

40

0 2 x Xeon E5-2697 [email protected] + 1 x Tesla K80 board 2 x Xeon E5-2697 [email protected] + 2 x Tesla K80 boards 2 x Xeon E5-2697 [email protected] + 4 x Tesla K80 boards (1 node) (1 node) (1 node) 55 TERACHEM 1.5K; TRIPCAGE ON TESLA K80S & HASWELL CPUS

120 TeraChem 1.5K; TripCage on Tesla K80s & Haswell CPUs (Total Processing Time in Seconds)

90

60

30

0 2 x Xeon E5-2698 [email protected] + 1 x Tesla K80 2 x Xeon E5-2698 [email protected] + 2 x Tesla K80 2 x Xeon E5-2698 [email protected] + 4 x Tesla K80 board (1 node) boards (1 node) boards (1 node) 56 TERACHEM 1.5K; BPTI ON TESLA K40S & IVB CPUS

12000 TeraChem 1.5K; BPTI on Tesla K40s & IVB CPUs (Total Processing Time in Seconds) 10000

8000

6000

4000

2000

0 2 x Xeon E5-2697 [email protected] + 1 x 2 x Xeon E5-2697 [email protected] + 2 x 2 x Xeon E5-2697 [email protected] + 4 x 2 x Xeon E5-2697 [email protected] + 8 x Tesla K40@875Mhz (1 node) Tesla K40@875Mhz (1 node) Tesla K40@875Mhz (1 node) Tesla K40@875Mhz (1 node)

57 TERACHEM 1.5K; BPTI ON TESLA K80S & IVB CPUS

8000 TeraChem 1.5K; BPTI on Tesla K80s & IVB CPUs (Total Processing Time in Seconds)

6000

4000

2000

0 2 x Xeon E5-2697 [email protected] + 1 x Tesla K80 2 x Xeon E5-2697 [email protected] + 2 x Tesla K80 2 x Xeon E5-2697 [email protected] + 4 x Tesla K80 board (1 node) boards (1 node) boards (1 node)

58 TERACHEM 1.5K; BPTI ON TESLA K40S & IVB CPUS

12000 TeraChem 1.5K; BPTI on Tesla K40s & Haswell CPUs (Total Processing Time in Seconds) 10000

8000

6000

4000

2000

0 2 x Xeon E5-2698 [email protected] + 1 x Tesla 2 x Xeon E5-2698 [email protected] + 2 x Tesla 2 x Xeon E5-2698 [email protected] + 4 x Tesla K40@875Mhz (1 node) K40@875Mhz (1 node) K40@875Mhz (1 node)

59 TERACHEM 1.5K; BPTI ON TESLA K80S & HASWELL CPUS

6000 TeraChem 1.5K; BPTI on Tesla K80s & Haswell CPUs (Total Processing Time in Seconds)

4000

2000

0 2 x Xeon E5-2698 [email protected] + 1 x Tesla K80 board 2 x Xeon E5-2698 [email protected] + 2 x Tesla K80 2 x Xeon E5-2698 [email protected] + 4 x Tesla K80 (1 node) boards (1 node) boards (1 node)

60 TeraChem Supercomputer Speeds on GPUs Time for SCF Step 100 TeraChem running on 8 C2050s on 1 node 90 NWChem running on 4096 Quad Core CPUs 80 In the Chinook Supercomputer

70 Giant Fullerene C240 Molecule 60

50

40 Time(Seconds) 30

20

10

0 4096 Quad Core CPUs ($19,000,000) 8 C2050 ($31,000)

Similar performance from just a handful of GPUs TeraChem Bang for the Buck Performance/Price TeraChem running on 8 C2050s on 1 node 600 NWChem running on 4096 Quad Core CPUs 493 In the Chinook Supercomputer 500 Giant Fullerene C240 Molecule

400 Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW 300 vendor for actual pricing.

200

100 Price/Performance Price/Performance relative Supercomputerto 1 0 4096 Quad Core CPUs ($19,000,000) 8 C2050 ($31,000)

Dollars spent on GPUs do 500x more science than those spent on CPUs Kepler’s Even Better

Olestra BLYP 453 Atoms B3LYP/6-31G(d) 800 2000 TeraChem running on C2050 and K20C

700 1800 First graph is of BLYP/G-31(d) Second is B3LYP/6-31G(d) 1600 600 1400 500 1200

400 1000

Seconds Seconds 300 800

600 200 400 100 200

0 0 C2050 K20C C2050 K20C

Kepler performs 2x faster than Tesla VASP 5.4.4 October 2017 Silica IFPEN on V100s PCIe

0.00700 0.00628 0.00600 (Untuned on Volta) 0.00537 3.0X Running VASP version 5.4.4 0.00500 The blue node contains Dual Intel Xeon 0.00418 2.6X E5-2690 [email protected] [3.5GHz Turbo] 0.00400 (Broadwell) CPUs

2.0X The green nodes contain Dual Intel 0.00300 Xeon E5-2690 [email protected] [3.5GHz 1/seconds Turbo] (Broadwell) CPUs + 0.00210 Tesla V100 PCIe (16GB) GPUs 0.00200

0.00100 240 ions, cristobalite (high) bulk 720 bands ? plane waves ALGO = Very Fast (RMM-DIIS) 0.00000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)

65 Silica IFPEN on V100s SXM2

0.00700

0.00600 0.00580 (Untuned on Volta) 0.00541 Running VASP version 5.4.4 0.00500 2.8X The blue node contains Dual Intel Xeon 0.00423 2.6X E5-2698 [email protected] [3.6GHz Turbo] 0.00400 (Broadwell) CPUs

2.0X The green nodes contain Dual Intel 0.00300 Xeon E5-2698 [email protected] [3.6GHz 1/seconds Turbo] (Broadwell) CPUs + 0.00210 Tesla V100 SXM2 (16GB) GPUs 0.00200

240 ions, cristobalite (high) bulk 0.00100 720 bands ? plane waves ALGO = Very Fast (RMM-DIIS) 0.00000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)

66 Si-Huge on V100s PCIe

0.00070 0.00065

0.00060 0.00057 (Untuned on Volta) 3.8X Running VASP version 5.4.4 0.00050 0.00045 The blue node contains Dual Intel Xeon 3.4X E5-2690 [email protected] [3.5GHz Turbo] 0.00040 (Broadwell) CPUs

2.6X The green nodes contain Dual Intel 0.00030 Xeon E5-2690 [email protected] [3.5GHz 1/seconds Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 0.00020 0.00017

512 Si atoms 0.00010 1282 bands 864000 Plane Waves Algo = Normal (blocked Davidson) 0.00000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)

67 Si-Huge on V100s SXM2

0.00080

0.00070 0.00067 (Untuned on Volta) Running VASP version 5.4.4 0.00060 0.00056 4.0X The blue node contains Dual Intel Xeon 0.00050 E5-2698 [email protected] [3.6GHz Turbo] 0.00044 3.3X (Broadwell) CPUs 0.00040 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz

1/seconds 2.6X 0.00030 Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs 0.00020 0.00017

512 Si atoms 0.00010 1282 bands 864000 Plane Waves Algo = Normal (blocked Davidson) 0.00000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)

68 SupportedSystems on V100s PCIe

0.0100

0.0090 0.0087 (Untuned on Volta) 0.0080 Running VASP version 5.4.4

0.0068 0.0070 2.4X The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 0.0060 (Broadwell) CPUs

0.0050 1.8X The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz

1/seconds 0.0040 0.0037 Turbo] (Broadwell) CPUs + 0.0030 Tesla V100 PCIe (16GB) GPUs

0.0020 267 ions 0.0010 788 bands 762048 plane waves ALGO = Fast (Davidson + RMM-DIIS) 0.0000 1 Broadwell node 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe per node (16GB) per node (16GB)

69 SupportedSystems on V100s SXM2

0.0120

0.0100 0.0100 (Untuned on Volta) Running VASP version 5.4.4 0.0087 2.7X 0.0080 The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] 0.0068 2.4X (Broadwell) CPUs 0.0060 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 1/seconds 1.8X Turbo] (Broadwell) CPUs + 0.0037 0.0040 Tesla V100 SXM2 (16GB) GPUs

0.0020 267 ions 788 bands 762048 plane waves ALGO = Fast (Davidson + RMM-DIIS) 0.0000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)

70 NiAl-MD on V100s PCIe

0.0080

0.0070 0.0068 (Untuned on Volta) 0.0063 Running VASP version 5.4.4 0.0060 The blue node contains Dual Intel Xeon 0.0050 2.2X E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs

0.0040 2.0X The green nodes contain Dual Intel 0.0031 Xeon E5-2690 [email protected] [3.5GHz 1/seconds 0.0030 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 0.0020

500 ions 0.0010 3200 bands 729000 plane waves ALGO = Fast (Davidson + RMM-DIIS) 0.0000 1 Broadwell node 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe per node (16GB) per node (16GB)

71 NiAl-MD on V100s SXM2

0.0080 0.0074 0.0070 0.0070 0.0064 (Untuned on Volta) 2.4X Running VASP version 5.4.4 0.0060 2.3X The blue node contains Dual Intel Xeon 0.0050 E5-2698 [email protected] [3.6GHz Turbo] 2.1X (Broadwell) CPUs 0.0040 The green nodes contain Dual Intel 0.0031 Xeon E5-2698 [email protected] [3.6GHz 1/seconds 0.0030 Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs 0.0020

500 ions 0.0010 3200 bands 729000 plane waves ALGO = Fast (Davidson + RMM-DIIS) 0.0000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)

72 B.hR105 on V100s PCIe

0.0140

0.0119 (Untuned on Volta) 0.0120 0.0112 Running VASP version 5.4.4

0.0100 14.9X The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 14.0X (Broadwell) CPUs 0.0080 0.0077 The green nodes contain Dual Intel 9.6X Xeon E5-2690 [email protected] [3.5GHz

0.0060 Turbo] (Broadwell) CPUs + 1/seconds Tesla V100 PCIe (16GB) GPUs 0.0040 105 Boron atoms (β-rhombohedral structure) 0.0020 216 bands 110592 plane waves 0.0008 Hybrid Functional with blocked Davicson (ALGO=Normal) 0.0000 LHFCALC=.True. (Exact Exchange) 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)

73 B.hR105 on V100s SXM2

0.0140 0.0128

0.0116 (Untuned on Volta) 0.0120 Running VASP version 5.4.4 16.0X 0.0100 The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] 0.0079 14.5X (Broadwell) CPUs 0.0080 The green nodes contain Dual Intel 9.9X Xeon E5-2698 [email protected] [3.6GHz

0.0060 Turbo] (Broadwell) CPUs + 1/seconds Tesla V100 SXM2 (16GB) GPUs 0.0040 105 Boron atoms (β-rhombohedral structure) 0.0020 216 bands 110592 plane waves 0.0008 Hybrid Functional with blocked Davicson (ALGO=Normal) 0.0000 LHFCALC=.True. (Exact Exchange) 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)

74 B.aP107 on V100s PCIe

0.000600 (Untuned on Volta) Running VASP version 5.4.4 0.000490 0.000500 0.000462 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 12.9X (Broadwell) CPUs 0.000400 The green nodes contain Dual Intel 0.000323 12.2X Xeon E5-2690 [email protected] [3.5GHz 0.000300 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

1/seconds 8.5X 0.000200 107 Boron atoms (symmetry broken 107-atom β′ variant) 216 bands 0.000100 110592 plane waves 0.000038 Hybrid functional calculation (exact exchange) with blocked Davidson. No KPoint parallelization. 0.000000 Hybrid Functional with blocked Davidson 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe (ALGO=Normal) per node (16GB) per node (16GB) per node (16GB) LHFCALC=.True. (Exact Exchange)

75 B.aP107 on V100s SXM2

0.000600 (Untuned on Volta) 0.000523 Running VASP version 5.4.4

0.000500 0.000465 The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] 13.8X (Broadwell) CPUs 0.000400 The green nodes contain Dual Intel 0.000324 12.2X Xeon E5-2698 [email protected] [3.6GHz 0.000300 Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

1/seconds 8.5X 0.000200 107 Boron atoms (symmetry broken 107-atom β′ variant) 216 bands 0.000100 110592 plane waves 0.000038 Hybrid functional calculation (exact exchange) with blocked Davidson. No KPoint parallelization. 0.000000 Hybrid Functional with blocked Davidson 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 (ALGO=Normal) per node (16GB) per node (16GB) per node (16GB) LHFCALC=.True. (Exact Exchange)

76 VASP 5.4.1 February 2017 Interface on P100s PCIe

0.00500 Running VASP version 5.4.1

Interface 0.00434 0.00450 The blue node contains Dual Intel Xeon 0.00400 E5-2699 [email protected] [3.6GHz Turbo] 0.00359 2.5X (Broadwell) CPUs 0.00350 0.00308 The green nodes contain Dual Intel 0.00300 2.1X Xeon E5-2699 [email protected] [3.6GHz 1.8X Turbo] (Broadwell) CPUs + Tesla P100

0.00250 0.00228 PCIe GPUs 1/seconds 0.00200 1x P100 PCIe is paired with Single 0.00171 ➢ 1.3X Intel Xeon E5-2699 [email protected] 0.00150 [3.6GHz Turbo] (Broadwell)

0.00100 Interface between a platinum slab Pt(111) 0.00050 (108 atoms) and liquid water (120 water ) (468 ions) 0.00000 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1256 bands 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe 762048 plane waves per node per node per node per node ALGO = Fast (Davidson + RMM-DIIS)

78 Interface on P100s SXM2

0.00500 Running VASP version 5.4.1 Interface 0.00462 0.00450 2.7X The blue node contains Dual Intel Xeon 0.00400 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 0.00350 0.00326 The green nodes contain Dual Intel 0.00300 Xeon E5-2698 [email protected] [3.6GHz 0.00270 1.9X Turbo] (Broadwell) CPUs + Tesla P100

0.00250 0.00228 SXM2 GPUs 1/seconds 0.00200 1.6X ➢ 1x P100 SXM2 is paired with Single 0.00171 1.3X Intel Xeon E5-2698 [email protected] 0.00150 [3.6GHz Turbo] (Broadwell)

0.00100 Interface between a platinum slab Pt(111) 0.00050 (108 atoms) and liquid water (120 water molecules) (468 ions) 0.00000 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1256 bands 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 762048 plane waves per node per node per node per node ALGO = Fast (Davidson + RMM-DIIS)

79 Silica IFPEN on P100s PCIe

0.00800 Silica IFPEN Running VASP version 5.4.1 0.00700 0.00674 0.00616 The blue node contains Dual Intel Xeon 0.00600 2.5X E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

0.00500 0.00474 2.3X The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 0.00400 0.00380 Turbo] (Broadwell) CPUs + Tesla P100

1/seconds 1.7X PCIe GPUs 0.00300 0.00273 ➢ 1x P100 PCIe is paired with Single 1.4X Intel Xeon E5-2699 [email protected] 0.00200 [3.6GHz Turbo] (Broadwell)

0.00100 240 ions, cristobalite (high) bulk 720 bands 0.00000 ? plane waves 1 Broadwell node 1 node + 1 node + 1 node + 1 node + ALGO = Very Fast (RMM-DIIS) 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe per node per node per node per node

80 Silica IFPEN on P100s SXM2

0.00800

Silica IFPEN 0.00692 Running VASP version 5.4.1 0.00700 0.00616 The blue node contains Dual Intel Xeon 0.00600 2.5X E5-2699 [email protected] [3.6GHz Turbo] 2.3X (Broadwell) CPUs 0.00500 0.00475 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 0.00400 0.00352 Turbo] (Broadwell) CPUs + Tesla P100 1/seconds 1.7X SXM2 GPUs 0.00300 0.00273 ➢ 1x P100 SXM2 is paired with Single 1.3X Intel Xeon E5-2698 [email protected] 0.00200 [3.6GHz Turbo] (Broadwell)

0.00100 240 ions, cristobalite (high) bulk 720 bands 0.00000 ? plane waves 1 Broadwell node 1 node + 1 node + 1 node + 1 node + ALGO = Very Fast (RMM-DIIS) 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

81 Si-Huge on P100s PCIe

0.00080 0.00074 Si-Huge Running VASP version 5.4.1 0.00070 3.9X The blue node contains Dual Intel Xeon 0.00060 0.00058 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

0.00050 0.00044 The green nodes contain Dual Intel 3.1X Xeon E5-2699 [email protected] [3.6GHz 0.00040 Turbo] (Broadwell) CPUs + Tesla P100 1/seconds 0.00034 2.3X PCIe GPUs 0.00030 ➢ 1x P100 PCIe is paired with Single 0.00019 1.8X Intel Xeon E5-2699 [email protected] 0.00020 [3.6GHz Turbo] (Broadwell)

0.00010 512 Si atoms 1282 bands 0.00000 864000 Plane Waves 1 Broadwell node 1 node + 1 node + 1 node + 1 node + Algo = Normal (blocked Davidson) 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe per node per node per node per node

82 Si-Huge on P100s SXM2

0.00070 0.00066 Si-Huge Running VASP version 5.4.1 0.00060 The blue node contains Dual Intel Xeon 3.5X E5-2699 [email protected] [3.6GHz Turbo] 0.00050 (Broadwell) CPUs 0.00045 0.00040 The green nodes contain Dual Intel 0.00040 Xeon E5-2698 [email protected] [3.6GHz 0.00033 2.4X Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 1/seconds 0.00030 2.1X 1.7X ➢ 1x P100 SXM2 is paired with Single 0.00019 0.00020 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell)

0.00010 512 Si atoms 1282 bands 0.00000 864000 Plane Waves 1 Broadwell node 1 node + 1 node + 1 node + 1 node + Algo = Normal (blocked Davidson) 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

83 SupportedSystems on P100s PCIe

0.01000 SupportedSystems Running VASP version 5.4.1 0.00794 0.00796 0.00800 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 0.00651 1.9X 1.9X 0.00600 The green nodes contain Dual Intel 0.00518 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100

0.00413 1.6X PCIe GPUs 1/seconds 0.00400 ➢ 1x P100 PCIe is paired with Single 1.3X Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) 0.00200

267 ions 788 bands 0.00000 762048 plane waves 1 Broadwell node 1 node + 1 node + 1 node + 1 node + ALGO = Fast (Davidson + RMM-DIIS) 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe per node per node per node per node

84 SupportedSystems on P100s SXM2

0.01000 0.00938 SupportedSystems Running VASP version 5.4.1 0.00900 2.3X 0.00800 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 0.00692 0.00700 (Broadwell) CPUs

0.00600 0.00570 The green nodes contain Dual Intel 0.00516 Xeon E5-2698 [email protected] [3.6GHz 0.00500 1.7X Turbo] (Broadwell) CPUs + Tesla P100

0.00413 SXM2 GPUs 1/seconds 0.00400 1.4X 1.2X ➢ 1x P100 SXM2 is paired with Single 0.00300 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) 0.00200

0.00100 267 ions 788 bands 0.00000 762048 plane waves 1 Broadwell node 1 node + 1 node + 1 node + 1 node + ALGO = Fast (Davidson + RMM-DIIS) 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

85 NiAl-MD on P100s PCIe

0.01000 0.00936 0.00902 0.00900 NiAl-MD Running VASP version 5.4.1

0.00800 2.7X The blue node contains Dual Intel Xeon 0.00731 2.6X E5-2699 [email protected] [3.6GHz Turbo] 0.00700 (Broadwell) CPUs

0.00600 0.00577 2.1X The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 0.00500 Turbo] (Broadwell) CPUs + Tesla P100

PCIe GPUs 1/seconds 0.00400 1.7X 0.00347 ➢ 1x P100 PCIe is paired with Single 0.00300 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) 0.00200

0.00100 500 ions 3200 bands 0.00000 729000 plane waves 1 Broadwell node 1 node + 1 node + 1 node + 1 node + ALGO = Fast (Davidson + RMM-DIIS) 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe per node per node per node per node

86 NiAl-MD on P100s SXM2

0.0100 0.0090 0.0090 NiAl-MD Running VASP version 5.4.1 0.0081 2.6X 0.0080 The blue node contains Dual Intel Xeon 0.0074 2.3X E5-2699 [email protected] [3.6GHz Turbo] 0.0070 (Broadwell) CPUs

0.0060 0.0057 2.1X The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 0.0050 Turbo] (Broadwell) CPUs + Tesla P100

SXM2 GPUs 1/seconds 0.0040 1.6X 0.0035 ➢ 1x P100 SXM2 is paired with Single 0.0030 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) 0.0020

0.0010 500 ions 3200 bands 0.0000 729000 plane waves 1 Broadwell node 1 node + 1 node + 1 node + 1 node + ALGO = Fast (Davidson + RMM-DIIS) 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

87 LiZnO on P100s PCIe

0.00180

0.00160 LiZnO 0.00153

0.00137 Running VASP version 5.4.1 0.00140 1.4X The blue node contains Dual Intel Xeon 0.00120 E5-2699 [email protected] [3.6GHz Turbo] 0.00106 (Broadwell) CPUs 0.00100 1.3X The green nodes contain Dual Intel

0.00080 Xeon E5-2699 [email protected] [3.6GHz 1/seconds Turbo] (Broadwell) CPUs + Tesla P100 0.00060 PCIe GPUs

0.00040 500 ions 3200 bands 0.00020 729000 plane waves ALGO = Fast (Davidson + RMM-DIIS) 0.00000 1 Broadwell node 1 node + 1 node + 2x P100 PCIe 4x P100 PCIe per node per node

88 LiZnO on P100s SXM2

0.0020

0.0018 LiZnO 0.0018 Running VASP version 5.4.1

0.0016 0.0015 The blue node contains Dual Intel Xeon 1.6X E5-2699 [email protected] [3.6GHz Turbo] 0.0014 0.0013 (Broadwell) CPUs 1.4X 0.0012 0.0011 The green nodes contain Dual Intel 0.0011 Xeon E5-2698 [email protected] [3.6GHz 0.0010 1.2X Turbo] (Broadwell) CPUs + Tesla P100

SXM2 GPUs 1/seconds 0.0008 1.0X ➢ 1x P100 SXM2 is paired with Single 0.0006 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) 0.0004

0.0002 500 ions 3200 bands 0.0000 729000 plane waves 1 Broadwell node 1 node + 1 node + 1 node + 1 node + ALGO = Fast (Davidson + RMM-DIIS) 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe per node per node per node per node

89 B.hR105 on P100s PCIe

0.00800 Running VASP version 5.4.1 B.hR105 0.00702 0.00700 The blue node contains Dual Intel Xeon 7.8X E5-2699 [email protected] [3.6GHz Turbo] 0.00600 0.00560 (Broadwell) CPUs The green nodes contain Dual Intel 0.00500 Xeon E5-2699 [email protected] [3.6GHz 6.2X Turbo] (Broadwell) CPUs + Tesla P100

0.00400 0.00371 PCIe GPUs 1/seconds 0.00300 ➢ 1x P100 PCIe is paired with Single 0.00223 Intel Xeon E5-2699 [email protected] 4.1X [3.6GHz Turbo] (Broadwell) 0.00200

0.00090 2.5X 0.00100 105 Boron atoms (β-rhombohedral structure) 216 bands 110592 plane waves 0.00000 Hybrid Functional with blocked Davicson 1 Broadwell node 1 node + 1 node + 1 node + 1 node + (ALGO=Normal) 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe LHFCALC=.True. (Exact Exchange) per node per node per node per node

90 B.hR105 on P100s SXM2

0.0090 Running VASP version 5.4.1 B.hR105 0.0078 0.0080 The blue node contains Dual Intel Xeon 8.7X E5-2699 [email protected] [3.6GHz Turbo] 0.0070 (Broadwell) CPUs 0.0059 0.0060 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 0.0050 6.6X Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

secpnds 0.0039

0.0040 1/ ➢ 1x P100 SXM2 is paired with Single 0.0030 4.3X Intel Xeon E5-2698 [email protected] 0.0024 [3.6GHz Turbo] (Broadwell) 0.0020

0.0009 2.7X 0.0010 105 Boron atoms (β-rhombohedral structure) 216 bands 110592 plane waves 0.0000 Hybrid Functional with blocked Davicson 1 Broadwell node 1 node + 1 node + 1 node + 1 node + (ALGO=Normal) 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 LHFCALC=.True. (Exact Exchange) per node per node per node per node

91 B.aP107 on P100s PCIe

0.00045 Running VASP version 5.4.1 B.aP107 0.00041 0.00040 The blue node contains Dual Intel Xeon 13.7X E5-2699 [email protected] [3.6GHz Turbo] 0.00035 (Broadwell) CPUs 0.00031 0.00030 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 0.00025 10.3X Turbo] (Broadwell) CPUs + Tesla P100 0.00021 PCIe GPUs 0.00020

1/seconds ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 [email protected] 0.00015 7.0X 0.00012 [3.6GHz Turbo] (Broadwell)

0.00010 107 Boron atoms (symmetry broken 107-atom β′ 4.0X variant) 0.00005 216 bands 0.00003 110592 plane waves Hybrid functional calculation (exact exchange) 0.00000 with blocked Davidson. No KPoint parallelization. 1 Broadwell node 1 node + 1 node + 1 node + 1 node + Hybrid Functional with blocked Davidson 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (ALGO=Normal) per node per node per node per node LHFCALC=.True. (Exact Exchange)

92 B.aP107 on P100s SXM2

0.00050 Running VASP version 5.4.1

0.00044 0.00045 B.aP107 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 0.00040 14.7X (Broadwell) CPUs 0.00035 The green nodes contain Dual Intel 0.00030 Xeon E5-2698 [email protected] [3.6GHz 0.00027 Turbo] (Broadwell) CPUs + Tesla P100 0.00025 SXM2 GPUs

1/seconds 0.00020 0.00020 9.0X ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 0.00015 6.7X [3.6GHz Turbo] (Broadwell) 0.00011 0.00010 107 Boron atoms (symmetry broken 107-atom β′ variant) 3.7X 216 bands 0.00005 0.00003 110592 plane waves Hybrid functional calculation (exact exchange) 0.00000 with blocked Davidson. No KPoint parallelization. 1 Broadwell node 1 node + 1 node + 1 node + 1 node + Hybrid Functional with blocked Davidson 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 (ALGO=Normal) per node per node per node per node LHFCALC=.True. (Exact Exchange)

93 Quantum Chemistry (QC) on GPUs Dec, 19, 2016 GPU-Accelerated Molecular Dynamics Apps Green Lettering Indicates Performance Slides Included

ACEMD GENESIS LAMMPS AMBER GPUGrid.net mdcore CHARMM GROMACS MELD DESMOND HALMD NAMD ESPResSO HOOMD-Blue OpenMM Folding@Home HTMD PolyFTS

GPU Perf compared against dual multi-core x86 CPU socket. 95