Molecular Dynamics (MD) on GPUs

October 2017 and RELION too Accelerating Discoveries

Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid — “the perfect target for fighting the infection.”

Without gpu, the supercomputer would need to be 5x larger for similar performance.

2 Overview of Life & Material Accelerated Apps

MD: All key codes are GPU-accelerated QC: All key codes are ported or optimizing Great multi-GPU performance Focus on using GPU-accelerated math libraries, OpenACC directives Focus on dense (up to 16) GPU nodes &/or large # of GPU nodes GPU-accelerated and available today: ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, , ESPResso, ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS- Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, NWChem, *, PEtot, QUICK, Q-Chem, QMCPack, OpenMM, PolyFTS, SOP-GPU* & more Quantum Espresso/PWscf, QUICK, TeraChem* Active GPU acceleration projects: CASTEP, GAMESS, , ONETEP, Quantum Supercharger Library*, VASP & more

green* = application where >90% of the workload is on GPU 3 MD vs. QC on GPUs

“Classical” (MO, PW, DFT, Semi-Emp) Simulates positions of atoms over time; Calculates electronic properties; chemical-biological or ground state, excited states, spectral properties, chemical-material behaviors making/breaking bonds, physical properties

Forces calculated from simple empirical formulas Forces derived from wave function (bond rearrangement generally forbidden) (bond rearrangement OK, e.g., bond energies) Up to millions of atoms Up to a few thousand atoms Solvent included without difficulty Generally in a vacuum but if needed, solvent treated classically (QM/MM) or using implicit methods

Single precision dominated Double precision is important Uses cuFFT, CUDA Uses cuBLAS, cuFFT, OpenACC, Eigen / Tensor Solvers GeForce (Workstations), Tesla (Servers) Tesla recommended ECC off ECC on

4 GPU-Accelerated Molecular Dynamics Apps Green Lettering Indicates Performance Slides Included

GENESIS LAMMPS ACEMD GPUGrid.net mdcore AMBER GROMACS MELD CHARMM HALMD NAMD DESMOND HOOMD-Blue OpenMM ESPResSO HTMD PolyFTS Folding@Home

GPU Perf compared against dual multi-core x86 CPU socket. 5 Benefits of MD GPU-Accelerated Computing Why wouldn’t you want to turbocharge your research?

• 3x-8x Faster than CPU only systems in all tests (on average) • Most major compute intensive aspects of classical MD ported • Large performance boost and save “Big Money” on CPUs, networks • Energy usage cut by more than half • GPUs scale well within a node and/or over multiple nodes • P100 GPU is our fastest and lowest power high performance GPU yet

Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive 6 RELION 2.0.3 August 2017 RELION: Plasmodium ribosome on P100s PCIe

0.0140

0.0120 0.0120 0.0112 Running RELION version 2.0.3 0.0101 40.0X 0.0100 The blue node contains Dual Intel Xeon 37.3X E5-2690 [email protected] (Broadwell) CPUs 0.0080 0.0070 33.7X The green nodes contain Dual Intel 0.0060 Xeon E5-2690 [email protected] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 1/Minutes 0.0046 23.3X 0.0040 0.0027 15.3X Data Citation: 0.0020 http://en.community.dell.com/techcenter/hi 0.0003 gh-performance- 9.0X computing/b/general_hpc/archive/2017/03/1 0.0000 4/application-performance-on-p100-pcie-gpus 1 Broadwell 1 node + 1 node + 1 node + 2 nodes + 3 nodes + 4 nodes + node 1x P100 2x P100 4x P100 8x P100 12x P100 16x P100 PCIe (16GB)PCIe (16GB)PCIe (16GB) PCIe PCIe PCIe per node per node per node (16GB) (16GB) (16GB)

8 ACEMD : Extremely efficient and robust MD software built on GPUs

610 ns/day on 1 GPU for DHFR (23K atoms)

M. Harvey, G. Giupponi and G. de Fabritiis, ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale, J. Chem. Theory and Comput. 5, 1632 (2009) • Standardised and easy to use: ACEMD reads CHARMM/NAMD and AMBER input files and uses similar syntax to other MD software.

• Fully featured: NVT, NPT, PME, TCL, PLUMED.1

• Robust: ACEMD is a proven computational engine and is used in one of the largest distributed projects Worldwide: GPUGRID.

• Compatible: ACEMD works with CUDA and OpenCL, the new standard framework for parallel and high-performance computing.

• Validated: ACEMD is used in reputable academic and industrial institutions. Results describing its applications have appeared in peer-reviewed journals of high impact such as Nature Chemistry, PNAS, Scientific Reports, PLoS and JACS.2

1. M. J. Harvey, and G. de Fabritiis, An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware, J. Chem. Theory Comput., 5, 2371-2377 (2009) 2. For a list of selected references see http://www.acellera.com/science AMBER 16 on V100s October 2017 PME-Cellulose_NPT on V100s PCIe

50 47.67

40

(Untuned on Volta) 24.6X Running AMBER version 16.8 30 The blue node contains Dual Intel Xeon

ns/day E5-2690 [email protected] [3.5GHz Turbo] 20 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 10 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 1.94

0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

13 PME-Cellulose_NPT on V100s SXM2

60 54.74 55.52

50 28.2X 28.6X 40 (Untuned on Volta) Running AMBER version 16.8

The blue node contains Dual Intel Xeon 30 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

20 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 10 Tesla V100 SXM2 (16GB) GPUs

1.94 0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)

14 PME-Cellulose_NVE on V100s PCIe

60 54.08

50 27.6X 40 (Untuned on Volta) Running AMBER version 16.8

The blue node contains Dual Intel Xeon 30 ns/day E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs

20 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 10 Tesla V100 PCIe (16GB) GPUs

1.96 0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

15 PME-Cellulose_NVE on V100s SXM2

70 65.02 63.04

60 33.2X 50 (Untuned on Volta) Running AMBER version 16.8 40 32.2X The blue node contains Dual Intel Xeon

ns/day E5-2698 [email protected] [3.6GHz Turbo] 30 (Broadwell) CPUs

The green nodes contain Dual Intel 20 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 10 Tesla V100 SXM2 (16GB) GPUs

1.96 0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)

16 PME-FactorIX_NPT on V100s PCIe

250

200 193.16

(Untuned on Volta) 20.7X Running AMBER version 16.8 150 The blue node contains Dual Intel Xeon

ns/day E5-2690 [email protected] [3.5GHz Turbo] 100 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 50 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 9.33

0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

17 PME-FactorIX_NPT on V100s SXM2

250 224.23 217.95

200

24.0X (Untuned on Volta) Running AMBER version 16.8 150 23.4X The blue node contains Dual Intel Xeon

ns/day E5-2698 [email protected] [3.6GHz Turbo] 100 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 50 Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs 9.33

0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)

18 PME-FactorIX_NVE on V100s PCIe

250

217.95

200

(Untuned on Volta) 22.7X Running AMBER version 16.8 150 The blue node contains Dual Intel Xeon

ns/day E5-2690 [email protected] [3.5GHz Turbo] 100 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 50 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 9.61

0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

19 PME-FactorIX_NVE on V100s SXM2

300

261.19 249.63 250 27.2X 200 (Untuned on Volta) Running AMBER version 16.8

26.0X The blue node contains Dual Intel Xeon 150 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

100 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 50 Tesla V100 SXM2 (16GB) GPUs

9.61 0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)

20 PME-JAC_NPT on V100s PCIe

500

439.87

400

(Untuned on Volta) 12.8X Running AMBER version 16.8 300 The blue node contains Dual Intel Xeon

ns/day E5-2690 [email protected] [3.5GHz Turbo] 200 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 100 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 34.35

0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

21 PME-JAC_NPT on V100s SXM2

600

515.36 500 481.75 15.0X 400 (Untuned on Volta) Running AMBER version 16.8

14.0X The blue node contains Dual Intel Xeon 300 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

200 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 100 Tesla V100 SXM2 (16GB) GPUs 34.35

0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)

22 PME-JAC_NVE on V100s PCIe

600

490.77 500 13.4X 400 (Untuned on Volta) Running AMBER version 16.8

The blue node contains Dual Intel Xeon 300 ns/day E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs

200 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 100 Tesla V100 PCIe (16GB) GPUs 36.53

0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

23 PME-JAC_NVE on V100s SXM2

700

600 583.33 539.78

500 (Untuned on Volta) 16.0X Running AMBER version 16.8 400 The blue node contains Dual Intel Xeon ns/day 14.8X E5-2698 [email protected] [3.6GHz Turbo] 300 (Broadwell) CPUs

The green nodes contain Dual Intel 200 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 100 Tesla V100 SXM2 (16GB) GPUs 36.53

0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)

24 PME-JAC_NPT_4fs on V100s PCIe

900 863.80

750 13.1X 600 (Untuned on Volta) Running AMBER version 16.8

The blue node contains Dual Intel Xeon 450 ns/day E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs

300 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 150 Tesla V100 PCIe (16GB) GPUs 65.74

0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

25 PME-JAC_NPT_4fs on V100s SXM2

1200

1006.32 1000 946.57 15.3X 800 (Untuned on Volta) Running AMBER version 16.8

14.4X The blue node contains Dual Intel Xeon 600 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

400 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 200 Tesla V100 SXM2 (16GB) GPUs 65.74

0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)

26 PME-JAC_NVE_4fs on V100s PCIe

1050 940.32

900

750 (Untuned on Volta) 26.0X Running AMBER version 16.8 600 The blue node contains Dual Intel Xeon

ns/day E5-2690 [email protected] [3.5GHz Turbo] 450 (Broadwell) CPUs

The green nodes contain Dual Intel 300 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 150 Tesla V100 PCIe (16GB) GPUs 67.10

0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

27 PME-JAC_NVE_4fs on V100s SXM2

1200 1123.40

1027.44 1000 16.7X 800 (Untuned on Volta) Running AMBER version 16.8

15.3X The blue node contains Dual Intel Xeon 600 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

400 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 200 Tesla V100 SXM2 (16GB) GPUs 67.10

0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)

28 PME-STMV_NPT_4fs on V100s PCIe

35 33.21

30 31.3X 25 (Untuned on Volta) Running AMBER version 16.8 20 The blue node contains Dual Intel Xeon

ns/day E5-2690 [email protected] [3.5GHz Turbo] 15 (Broadwell) CPUs

The green nodes contain Dual Intel 10 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 5 Tesla V100 PCIe (16GB) GPUs

1.06 0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

29 PME-STMV_NPT_4fs on V100s SXM2

40 37.24

35

30 35.1X (Untuned on Volta) 25 Running AMBER version 16.8

The blue node contains Dual Intel Xeon 20 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 15 The green nodes contain Dual Intel 10 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 5 Tesla V100 SXM2 (16GB) GPUs 1.06 0 1 Broadwell node 1 node + 2x V100 SXM2 per node (16GB)

30 GB-Myoglobin on V100s PCIe

750 699.21

600

(Untuned on Volta) 31.4X Running AMBER version 16.8 450 The blue node contains Dual Intel Xeon

ns/day E5-2690 [email protected] [3.5GHz Turbo] 300 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 150 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

22.30 0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

31 GB-Myoglobin on V100s SXM2

800 750.76

700 33.7X 600 (Untuned on Volta) 500 Running AMBER version 16.8

The blue node contains Dual Intel Xeon 400 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 300 The green nodes contain Dual Intel 200 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 100 Tesla V100 SXM2 (16GB) GPUs 22.30 0 1 Broadwell node 1 node + 2x V100 SXM2 per node (16GB)

32 GB-Nucleosome on V100s PCIe

85 78.39

68

(Untuned on Volta) 252.9X Running AMBER version 16.8 51 49.14 The blue node contains Dual Intel Xeon ns/day 158.5X E5-2690 [email protected] [3.5GHz Turbo] 34 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 17 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

0.31 0 1 Broadwell node 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe per node (16GB) per node (16GB)

33 GB-Nucleosome on V100s SXM2

100 92.46

75 (Untuned on Volta) 298.3X Running AMBER version 16.8 52.89 The blue node contains Dual Intel Xeon 50 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

170.6X The green nodes contain Dual Intel 25 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs

0.31 0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)

34 Rubisco on V100s PCIe

8

7 6.78

6 5.22 678.0X (Untuned on Volta) 5 Running AMBER version 16.8

The blue node contains Dual Intel Xeon 4 ns/day E5-2690 [email protected] [3.5GHz Turbo] 522.0X (Broadwell) CPUs 3 2.79 The green nodes contain Dual Intel 2 Xeon E5-2690 [email protected] [3.5GHz 279.0X Turbo] (Broadwell) CPUs + 1 Tesla V100 PCIe (16GB) GPUs

0.01 0 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)

35 Rubisco on V100s SXM2

8

7.00 7

5.96 6 700.0X (Untuned on Volta) 5 Running AMBER version 16.8

The blue node contains Dual Intel Xeon 4 ns/day E5-2698 [email protected] [3.6GHz Turbo] 3.00 596.0X (Broadwell) CPUs 3 The green nodes contain Dual Intel 2 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 1 300.0X Tesla V100 SXM2 (16GB) GPUs

0.01 0 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)

36 AMBER 16 on P100s February 2017 PME-Cellulose_NPT on P100s PCIe

40 PME-Cellulose_NPT 35

30.00 Running AMBER version 16.3 30 The blue node contains Dual Intel Xeon 25 E5-2699 [email protected] [3.6GHz Turbo] 21.85 (Broadwell) CPUs

ns/day 12.8X 20 The green nodes contain Dual Intel 9.3X Xeon E5-2699 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 10 ➢ 1x P100 PCIe is paired with Single 5 Intel Xeon E5-2699 [email protected] 2.35 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

38 PME-Cellulose_NPT on P100s SXM2

40 PME-Cellulose_NPT 36.65 35 32.22 Running AMBER version 16.3 30 15.6X 13.7X The blue node contains Dual Intel Xeon 25 23.37 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 20

ns/day The green nodes contain Dual Intel 9.9X Xeon E5-2698 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 10 ➢ 1x P100 SXM2 is paired with Single 5 Intel Xeon E5-2698 [email protected] 2.35 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

39 PME-Cellulose_NVE on P100s PCIe

40 PME-Cellulose_NVE 35 32.55 Running AMBER version 16.3 30 13.2X The blue node contains Dual Intel Xeon 25 23.34 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

ns/day 20 The green nodes contain Dual Intel 9.4X Xeon E5-2699 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 10 ➢ 1x P100 PCIe is paired with Single 5 Intel Xeon E5-2699 [email protected] 2.47 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

40 PME-Cellulose_NVE on P100s SXM2

45 PME-Cellulose_NVE 40.88 40

35.16 35 Running AMBER version 16.3

30 16.6X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 24.94 25 14.2X (Broadwell) CPUs

ns/day 20 The green nodes contain Dual Intel 10.1X Xeon E5-2698 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

10 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 5 2.47 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

41 PME-FactorIX_NPT on P100s PCIe

140 PME-FactorIX_NPT 132.86 120 Running AMBER version 16.3 98.77 11.6X 100 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 80 (Broadwell) CPUs

ns/day 8.6X The green nodes contain Dual Intel 60 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 40 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single 20 Intel Xeon E5-2699 [email protected] 11.43 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

42 PME-FactorIX_NPT on P100s SXM2

180 PME-FactorIX_NPT 159.80 160 144.11 140 14.0X Running AMBER version 16.3

120 12.6X The blue node contains Dual Intel Xeon 106.25 E5-2699 [email protected] [3.6GHz Turbo] 100 (Broadwell) CPUs 9.3X ns/day 80 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz

60 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

40 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 20 11.43 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

43 PME-FactorIX_NVE on P100s PCIe

160 PME-FactorIX_NVE 145.83 140

12.2X Running AMBER version 16.3 120 105.86 The blue node contains Dual Intel Xeon 100 E5-2699 [email protected] [3.6GHz Turbo] 8.8X (Broadwell) CPUs

ns/day 80 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 60 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 40 ➢ 1x P100 PCIe is paired with Single 20 Intel Xeon E5-2699 [email protected] 11.98 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

44 PME-FactorIX_NVE on P100s SXM2

200

178.02 180 PME-FactorIX_NVE 159.24 160 14.9X Running AMBER version 16.3 140 The blue node contains Dual Intel Xeon 120 114.88 13.3X E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 100

ns/day The green nodes contain Dual Intel 80 9.6X Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 60 SXM2 GPUs

40 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 20 11.98 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

45 PME-JAC_NPT on P100s PCIe

350 PME-JAC_NPT 327.69 300 283.60 7.1X Running AMBER version 16.3 250 The blue node contains Dual Intel Xeon 6.2X E5-2699 [email protected] [3.6GHz Turbo] 200 (Broadwell) CPUs

ns/day The green nodes contain Dual Intel 150 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single 45.89 50 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

46 PME-JAC_NPT on P100s SXM2

450 PME-JAC_NPT 423.09 400 360.64 350 9.2X Running AMBER version 16.3 310.52 300 7.9X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 250 6.8X (Broadwell) CPUs

ns/day 200 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz

150 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

100 ➢ 1x P100 SXM2 is paired with Single 45.89 Intel Xeon E5-2698 [email protected] 50 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe per node per node per node

47 PME-JAC_NVE on P100s PCIe

400 PME-JAC_NVE 363.79 350 308.46 7.6X Running AMBER version 16.3 300 The blue node contains Dual Intel Xeon 250 E5-2699 [email protected] [3.6GHz Turbo] 6.4X (Broadwell) CPUs 200

ns/day The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 150 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 100 ➢ 1x P100 PCIe is paired with Single 47.90 50 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

48 PME-JAC_NVE on P100s SXM2

500 473.10

450 PME-JAC_NVE 402.18 9.9X 400 Running AMBER version 16.3

350 339.81 8.4X The blue node contains Dual Intel Xeon 300 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 250 7.1X

ns/day The green nodes contain Dual Intel 200 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 150 SXM2 GPUs

100 ➢ 1x P100 SXM2 is paired with Single 47.90 Intel Xeon E5-2698 [email protected] 50 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe per node per node per node

49 GB-Myoglobin on P100s PCIe

600 GB-Myoglobin 561.94 500 483.37 19.5X Running AMBER version 16.3 400 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 16.7X (Broadwell) CPUs

ns/day 300 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz

200 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

➢ 1x P100 PCIe is paired with Single 100 Intel Xeon E5-2699 [email protected] 28.86 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 4x P100 PCIe (16GB) per node per node

50 GB-Myoglobin on P100s SXM2

700 GB-Myoglobin 639.37 600 534.28 22.2X Running AMBER version 16.3 500 The blue node contains Dual Intel Xeon 18.5X E5-2699 [email protected] [3.6GHz Turbo] 400 (Broadwell) CPUs

ns/day The green nodes contain Dual Intel 300 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 200 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single 100 Intel Xeon E5-2698 [email protected] 28.86 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 4x P100 PCIe per node per node

51 GB-Nucleosome on P100s PCIe

50 45.92 45 GB-Nucleosome 39.91 114.8X 40 Running AMBER version 16.3

35 99.8X The blue node contains Dual Intel Xeon 30 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

ns/day 25 22.77 The green nodes contain Dual Intel 20 Xeon E5-2699 [email protected] [3.6GHz 56.9X Turbo] (Broadwell) CPUs + Tesla P100 15 PCIe (16GB) GPUs 11.91 10 ➢ 1x P100 PCIe is paired with Single 29.8X Intel Xeon E5-2699 [email protected] 5 [3.6GHz Turbo] (Broadwell) 0.40 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

52 GB-Nucleosome on P100s SXM2

60 GB-Nucleosome 50 48.29 46.29 Running AMBER version 16.3 120.7X 40 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 115.7X (Broadwell) CPUs 30

ns/day 25.53 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz

20 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 13.36 63.8X ➢ 1x P100 SXM2 is paired with Single 10 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) 0.40 33.4X 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

53 Rubisco-75K on P100s PCIe

4.5 Rubisco-75K 4.20 4.0 420.0X 3.5 Running AMBER version 16.3

3.0 The blue node contains Dual Intel Xeon 2.69 E5-2699 [email protected] [3.6GHz Turbo] 2.5 (Broadwell) CPUs

ns/day 2.0 269.0X The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz

1.5 1.40 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

1.0 0.71 ➢ 1x P100 PCIe is paired with Single 140.0X Intel Xeon E5-2699 [email protected] 0.5 [3.6GHz Turbo] (Broadwell) 0.01 71.0X 0.0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

54 Rubisco-75K on P100s SXM2

5.0

4.46 4.5 Rubisco-75K

4.0 Running AMBER version 16.3

3.5 446.0X The blue node contains Dual Intel Xeon 3.06 3.0 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 2.5 306.0X

ns/day The green nodes contain Dual Intel 2.0 Xeon E5-2698 [email protected] [3.6GHz 1.57 Turbo] (Broadwell) CPUs + Tesla P100 1.5 SXM2 GPUs

1.0 0.80 157.0X ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 0.5 [3.6GHz Turbo] (Broadwell) 0.01 80.0X 0.0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

55 Recommended GPU Node Configuration for AMBER Workstation or Single Node Configuration # of CPU sockets 2

Cores per CPU socket 6+ (1 CPU core drives 1 GPU)

CPU speed (Ghz) 2.66+ System memory per node (GB) 16

GPUs P100, V100

1-4 # of GPUs per CPU socket

GPU memory preference (GB) 6 GPU to CPU connection PCIe 3.0 16x or higher Server storage 2 TB

Network configuration Infiniband QDR or better 56 Scale to multiple nodes with same single node configuration 56 CHARMM DOMDEC-GUI July 2016 CHARMM DOMDEC-GUI 465 K System Benchmark

4 465 K System

(Her1_HER1_membrane) Running CHARMM version c40a1 3 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs 2.15 The green nodes contain Dual Intel 2 ns/day Xeon E5-2698 [email protected] GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs 6.0X

1 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 0.36

0 1 Haswell node 1 node + 1x K80 per node

58 CHARMM DOMDEC-GUI 534 K System Benchmark

2.0 534 K System

(POPC_PSPC_CHL1mixture) Running CHARMM version c40a1 1.5 1.43 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs

The green nodes contain Dual Intel

1.0 Xeon E5-2698 [email protected] GHz (Haswell) ns/day 8.0X CPUs + Tesla K80 (autoboost) GPUs

0.5 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 0.18

0.0 1 Haswell node 1 node + 1x K80 per node

59 CHARMM DOMDEC-GUI 20 K System Benchmark

80 20 K System (Crambin)

59.68 Running CHARMM version c40a1 60 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs

The green nodes contain Dual Intel ns/day 40 Xeon E5-2698 [email protected] GHz (Haswell) CPUs + Tesla M40 GPUs 3.7X

20 16.00 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error.

0 1 Haswell node 1 node + 1x M40 per node

60 CHARMM DOMDEC-GUI 61 K System Benchmark

35 61 K System (GlnBP) 30 *Higher is better Running CHARMM version c40a1 25.08 25 The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs 20 The green nodes contain Dual Intel

6.4X Xeon E5-2698 [email protected] GHz (Haswell) ns/day 15 CPUs + Tesla M40 GPUs

10 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 5 3.90

0 1 Haswell node 1 node + 1x M40 per node

61 CHARMM DOMDEC-GUI 465 K System Benchmark

4 465 K System

(Her1_HER1_membrane) Running CHARMM version c40a1 3 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs 2.27 The green nodes contain Dual Intel 2 Xeon E5-2698 [email protected] GHz (Haswell) ns/day CPUs + Tesla M40 GPUs 6.3X

1 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 0.36

0 1 Haswell node 1 node + 1x M40 per node

62 GROMACS 2018 February 2018 GROMACS Water 1.5M on V100 vs P100 (PCIe 16GB)

14

11.86 12

10.37 10.41 4.9X 9.83 10 (Untuned on Volta) 4.3X 4.3X Running GROMACS version 2018 8.41 4.0X 8 3.5X The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day 6.06 6 (Broadwell) CPUs 2.5X The green nodes contain Dual Intel 4 Xeon E5-2690 [email protected] [3.5GHz 2.43 Turbo] (Broadwell) CPUs + Tesla P100 2 PCIe (16GB) or V100 PCIe (16GB) GPUs

0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 1x P100 2x P100 4x P100 1x V100 2x V100 4x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node

64 GROMACS Water 1.5M on V100 vs P100 (SXM2 16GB)

14 12.67 12 5.2X 11.02 10.92 10.19 10 4.5X 4.5X 9.16 (Untuned on Volta) 4.2X Running GROMACS version 2018

8 3.8X 6.88 The blue node contains Dual Intel Xeon

E5-2690 [email protected] [3.5GHz Turbo] ns/day 6 2.8X (Broadwell) CPUs The green nodes contain Dual Intel 4 Xeon E5-2698 [email protected] [3.6GHz 2.43 Turbo] (Broadwell) CPUs + Tesla P100 2 SXM2 (16GB) or V100 SXM2 (16GB) GPUs

0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 1x P100 2x P100 4x P100 1x V100 2x V100 4x V100 SXM2 per SXM2 per SXM2 per SXM2 per SXM2 per SXM2 per node node node node node node

65 GROMACS ADH Dodec on V100 vs P100 (PCIe 16GB)

180

160 154.62

137.66 4.9X 140 130.99 133.66 4.3X 4.2X (Untuned on Volta) 120 4.1X Running GROMACS version 2018 101.49 102.57 100 The blue node contains Dual Intel

3.2X 3.2X Xeon E5-2690 [email protected] [3.5GHz ns/day 80 Turbo] (Broadwell) CPUs

60 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 40 31.79 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PCIe 20 (16GB) GPUs

0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 2x P100 4x P100 8x P100 2x V100 4x V100 8x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node

66 GROMACS ADH Dodec on V100 vs P100 (SXM2 16GB)

180 167.33 156.86 160 156.23 147.93 5.3X 4.9X 4.9X 140 4.7X 131.92 123.17 (Untuned on Volta) 120 4.1X 3.9X Running GROMACS version 2018 100 The blue node contains Dual Intel

Xeon E5-2690 [email protected] [3.5GHz ns/day 80 Turbo] (Broadwell) CPUs

60 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 40 31.79 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 (16GB) or V100 SXM2 20 (16GB) GPUs

0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 2x P100 4x P100 8x P100 2x V100 4x V100 8x V100 SXM2 per SXM2 per SXM2 per SXM2 per SXM2 per SXM2 per node node node node node node

67 GROMACS 2016.4 October 2017 Water 1.5M on V100 vs P100 (PCIe)

8 7.30 6.97 7

6 3.2X 5.34 (Untuned on Volta) 3.1X Running GROMACS version 2016.4 5 4.34 The blue node contains Dual Intel Xeon 4 2.3X E5-2690 [email protected] [3.5GHz Turbo] ns/day 1.9X (Broadwell) CPUs 3 The green nodes contain Dual Intel 2.28 Xeon E5-2690 [email protected] [3.5GHz 2 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PCIe 1 (16GB) GPUs

0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 1x V100 PCIe 2x V100 PCIe per node (16GB) per node (16GB) per node (16GB) per node (16GB)

69 Water 3M on V100 vs P100 (PCIe)

5

4 3.85 3.53 (Untuned on Volta) 3.4X Running GROMACS version 2016.4 3 The blue node contains Dual Intel Xeon 2.53 3.2X E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs 1.96 2 The green nodes contain Dual Intel 2.3X Xeon E5-2690 [email protected] [3.5GHz 1.12 Turbo] (Broadwell) CPUs + Tesla P100 1 PCIe (16GB) or V100 PCIe 1.8X (16GB) GPUs 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 1x V100 PCIe 2x V100 PCIe per node (16GB) per node (16GB) per node (16GB) per node (16GB)

70 Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or Single Node Configuration # of CPU sockets 2 Cores per CPU socket 6+ CPU speed (Ghz) 2.66+

System memory per socket (GB) 32

GPUs Tesla P100, V100

2x # of GPUs per CPU socket Volta GPUs: need fast Skylake or Broadwell

GPU memory preference (GB) 6

GPU to CPU connection PCIe 3.0 or higher Server storage 500 GB or higher

71 Network configuration Gemini, InfiniBand 71 HOOMD-Blue 2.2.2 March 2018 HOOMD-Blue lj-liquid on V100 vs P100 (PCIe 16GB)

4500 4261.93 Running HOOMD-Blue version 2.2.2

4000 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 3500 19.8X (Broadwell) CPUs

3000 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 2379.87 (Broadwell) CPUs + Tesla P100 PCIe 2500 (16GB) or V100 PCIe (16GB) GPUs 2000 Average TPS 11.1X 64,000 particles 1500 Force field/method: Lennard-Jones MD This is a synthetic benchmark for historical reasons and for making direct comparisons 1000 with LAMMPS

500 214.95

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 1x V100 PCIe per node per node

73 HOOMD-Blue microsphere on V100 vs P100 (PCIe 16GB)

500 Running HOOMD-Blue version 2.2.2 455.52 450 The blue node contains Dual Intel Xeon 55.6X E5-2690 [email protected] [3.5GHz Turbo] 400 373.55 (Broadwell) CPUs 348.39 350 45.6X The green nodes contain Dual Intel Xeon 293.90 300 42.5X E5-2690 [email protected] [3.5GHz Turbo] 257.89 (Broadwell) CPUs + Tesla P100 PCIe 250 35.8X (16GB) or V100 PCIe 31.5X 220.63 (16GB) GPUs 200 Average TPS 173.24 26.9X 1,428,364 particles Force field/method: Bead spring polymbers 150 21.1X 112.48 via dissipative particle dynamics Results published in: 100 http://dx.doi.org/10.1002/adma.201501329 13.7X 50 8.20 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1x P100 2x P100 4x P100 8x P100 1x V100 2x V100 4x V100 8x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node node node

74 HOOMD-Blue quasicrystal on V100 vs P100 (PCIe 16GB)

3000 Running HOOMD-Blue version 2.2.2

The blue node contains Dual Intel Xeon 2473.63 2500 E5-2690 [email protected] [3.5GHz Turbo] 2285.10 50.5X (Broadwell) CPUs 46.6X The green nodes contain Dual Intel Xeon 2000 1871.90 1768.09 E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe 38.2X 36.1X (16GB) or V100 PCIe 1500 (16GB) GPUs 1165.62

100,000 particles Average TPS 1000 861.78 23.8X Force field/method: Oscilating pair potential MD Results published in: 17.6X http://dx.doi.org/10.1038/NMAT4152 500

49.03 0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 1x P100 2x P100 4x P100 1x V100 2x V100 4x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node

75 HOOMD-Blue triblock-copolymer on V100 vs P100 (PCIe 16GB)

3500 Running HOOMD-Blue version 2.2.2 3158.74

3000 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 15.9X (Broadwell) CPUs 2500 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 2000 (Broadwell) CPUs + Tesla P100 PCIe 1711.50 (16GB) or V100 PCIe (16GB) GPUs 1500 Average TPS 64,017 particles Force field/method: Bead-spring polymer MD 8.6X Results published in: 1000 http://dx.doi.org/10.1021/ma061120f

500 198.50

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 1x V100 PCIe per node per node

76 HOOMD-Blue dodecahedron on V100 vs P100 (PCIe 16GB)

300 Running HOOMD-Blue version 2.2.2

The blue node contains Dual Intel Xeon 250 234.20 239.26 E5-2690 [email protected] [3.5GHz Turbo] 10.6X (Broadwell) CPUs 200 186.73 10.4X The green nodes contain Dual Intel 180.04 Xeon E5-2690 [email protected] [3.5GHz 165.18 8.3X Turbo] (Broadwell) CPUs + Tesla P100 8.0X PCIe (16GB) or V100 PCIe 150 136.43 7.3X (16GB) GPUs

Average TPS 101.06 6.1X 131,072 particles 100 Force field/method: Hard dodecahedra 80.99 (via MC simulation) 4.5X This is a synethic benchmark to test scalability 3.6X and is much larger than anyone would 50 reasonably run 22.54

0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1x P100 2x P100 4x P100 8x P100 1x V100 2x V100 4x V100 8x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node node node

77 HOOMD-Blue hexagon on V100 vs P100 (PCIe 16GB)

120 Running HOOMD-Blue version 2.2.2 104.50 The blue node contains Dual Intel Xeon 100 E5-2690 [email protected] [3.5GHz Turbo] 89.48 19.0X (Broadwell) CPUs

80 The green nodes contain Dual Intel Xeon 16.3X E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe 59.15 60 (16GB) or V100 PCIe (16GB) GPUs 48.56 10.8X Average TPS 1,048,576 particles 40 8.8X 32.29 Force field/method: Hard hexagons 26.90 (via MC simulation) 5.9X Results published in: 16.99 http://dx.doi.org/10.1103/PhysRevX.7.021001 20 14.08 4.9X 5.50 2.6X 3.1X 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1x P100 2x P100 4x P100 8x P100 1x V100 2x V100 4x V100 8x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node node node

78 LAMMPS 2017 October 2017 Atomic-Fluid Lennard-Jones 2.5 Cutoff on V100s PCIe

1.00 2,048,000 atoms

0.80 0.73 (Untuned on Volta) Running LAMMPS version 2017 0.60 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 3.0X (Broadwell) CPUs

1/seconds 0.40 The green nodes contain Dual Intel 0.25 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 0.20 Tesla V100 PCIe (16GB) GPUs

0.00 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

80 Atomic-Fluid Lennard-Jones 2.5 Cutoff on V100s SXM2

1.00

2,048,000 atoms 0.82 0.80

(Untuned on Volta) 3.3X Running LAMMPS version 2017 0.60 The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

1/seconds 0.40 The green nodes contain Dual Intel 0.25 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 0.20 Tesla V100 SXM2 (16GB) GPUs

0.00 1 Broadwell node 1 node + 2x V100 SXM2 per node (16GB)

81 Atomic-Fluid Lennard-Jones 5.0 Cutoff on V100s PCIe

0.80 2,048,000 atoms

0.60 0.60 (Untuned on Volta) 0.47 Running LAMMPS version 2017 0.45 10.0X The blue node contains Dual Intel Xeon 0.40 E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs

1/seconds 7.8X The green nodes contain Dual Intel 7.5X Xeon E5-2690 [email protected] [3.5GHz 0.20 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 0.06

0.00 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)

82 Atomic-Fluid Lennard-Jones 5.0 Cutoff on V100s SXM2

0.80 2,048,000 atoms

0.60 0.55 0.56 (Untuned on Volta) 0.48 9.3X Running LAMMPS version 2017 9.2X The blue node contains Dual Intel Xeon 0.40 E5-2698 [email protected] [3.6GHz Turbo]

(Broadwell) CPUs 1/seconds 8.0X The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 0.20 Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs 0.06

0.00 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)

83 Course-grain Water on V100s PCIe

0.020

2,048,000 atoms 0.016

0.015 (Untuned on Volta) Running LAMMPS version 2017 0.011 5.3X The blue node contains Dual Intel Xeon 0.010 E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs

1/seconds 0.007 3.7X The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 0.005 Turbo] (Broadwell) CPUs + 0.003 2.3X Tesla V100 PCIe (16GB) GPUs

0.000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)

84 Course-grain Water on V100s SXM2

0.025 2,048,000 atoms

0.020 0.020

(Untuned on Volta) 6.7X Running LAMMPS version 2017 0.015 0.014 The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] 4.7X (Broadwell) CPUs

1/seconds 0.010 0.009 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 3.0X Turbo] (Broadwell) CPUs + 0.005 0.003 Tesla V100 SXM2 (16GB) GPUs

0.000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)

85 Gay-Berne on V100s PCIe

0.06 2,097,152 atoms 0.05 0.05

(Untuned on Volta) 0.04 Running LAMMPS version 2017

7.5X The blue node contains Dual Intel Xeon 0.03 E5-2690 [email protected] [3.5GHz Turbo]

(Broadwell) CPUs 1/seconds 0.02 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 0.01 Turbo] (Broadwell) CPUs + 0.01 Tesla V100 PCIe (16GB) GPUs

0.00 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)

86 Gay-Berne on V100s SXM2

0.06 2,097,152 atoms 0.05 0.05

5.0X (Untuned on Volta) 0.04 Running LAMMPS version 2017

The blue node contains Dual Intel Xeon 0.03 E5-2698 [email protected] [3.6GHz Turbo]

(Broadwell) CPUs 1/seconds 0.02 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 0.01 Turbo] (Broadwell) CPUs + 0.01 Tesla V100 SXM2 (16GB) GPUs

0.00 1 Broadwell node 1 node + 2x V100 SXM2 per node (16GB)

87 Rhodopsin on V100s PCIe

0.80 256,000 atoms

0.60 0.58 0.53 (Untuned on Volta) 3.4X Running LAMMPS version 2017 0.44 3.1X The blue node contains Dual Intel Xeon 0.40 E5-2690 [email protected] [3.5GHz Turbo]

(Broadwell) CPUs 1/seconds 2.6X The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 0.20 0.17 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs

0.00 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)

88 Rhodopsin on V100s SXM2

0.80 256,000 atoms 0.70 0.68

0.60 0.60 4.0X (Untuned on Volta) 0.50 Running LAMMPS version 2017 0.42 3.5X The blue node contains Dual Intel Xeon 0.40 E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 1/seconds 0.30 The green nodes contain Dual Intel 2.5X Xeon E5-2698 [email protected] [3.6GHz 0.20 0.17 Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs 0.10

0.00 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)

89 Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration

# of CPU sockets 2 Cores per CPU socket 6+ CPU speed (Ghz) 2.66+ System memory per socket (GB) 32

GTX Titan X, GPUs Tesla P100, V100

# of GPUs per CPU socket 1-2 GPU memory preference (GB) 6+ GPU to CPU connection PCIe 3.0 or higher

Server storage 500 GB or higher

Network configuration Gemini, InfiniBand Scale to thousands of nodes with same single node configuration 90 90 NAMD 2.12 July 2017 APOA1 on P100s PCIe

28 APOA1 24 22.58 22.85

20 Running NAMD version 2.12

16 The blue node contains Dual Intel Xeon

E5-2690 [email protected] [3.5GHz Turbo] ns/day 12 (Broadwell) CPUs The green nodes contain Dual Intel 8 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 4 3.45

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe (16GB) per node (16GB) per node

92 APOA1 on P100s SXM2

30 APOA1 25 23.87 22.98 23.44

20 Running NAMD version 2.12

The blue node contains Dual Intel Xeon 15 E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs

10 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 5 3.45

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

93 F1ATPASE on P100s PCIe

10 F1ATPASE 8 7.34 7.40 6.99 Running NAMD version 2.12 6 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs 4 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 2 PCIe (16GB) GPUs 1.15

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe (16GB) per node (16GB) per node (16GB) per node

94 F1ATPASE on P100s SXM2

10 F1ATPASE 8 7.11 7.11 6.85 Running NAMD version 2.12 6 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs 4 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 2 SXM2 GPUs 1.15

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

95 STMV on P100s PCIe

3.0 STMV 2.5 2.32 2.15

2.0 Running NAMD version 2.12

The blue node contains Dual Intel Xeon 1.5 E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs

1.0 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 0.5 PCIe (16GB) GPUs 0.29

0.0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe (16GB) per node (16GB) per node

96 STMV on P100s SXM2

3.0 STMV 2.5

2.077 2.0 Running NAMD version 2.12

The blue node contains Dual Intel Xeon 1.5 E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs

1.0 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 0.5 SXM2 GPUs 0.292

0.0 1 Broadwell node 1 node + 1x P100 SXM2 per node

97 Benefits of MD GPU-Accelerated Computing Why wouldn’t you want to turbocharge your research?

• 3x-8x Faster than CPU only systems in all tests (on average) • Most major compute intensive aspects of classical MD ported • Large performance boost and save “Big Money” on CPUs, networks • Energy usage cut by more than half • GPUs scale well within a node and/or over multiple nodes • K80 GPU is our fastest and lowest power high performance GPU yet

Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive 98 Molecular Dynamics (MD) on GPUs Dec. 19, 2016 GPU-Accelerated Quantum Chemistry Apps Green Lettering Indicates Performance Slides Included

Abinit Gaussian Octopus Quantum SuperCharger ACES III GPAW ONETEP Library ADF LATTE Petot RMG BigDFT LSDalton Q-Chem TeraChem CP2K MOLCAS QMCPACK UNM GAMESS-US Mopac2012 Quantum VASP Espresso NWChem WL-LSMS

GPU Perf compared against dual multi-core x86 CPU socket. 100