<<

Molecular (MD) on GPUs Feb. 2, 2017 Accelerating Discoveries

Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all- of the HIV and discovered the chemical structure of its — “the perfect target for fighting the infection.”

Without gpu, the supercomputer would need to be 5x larger for similar performance.

2 Overview of Life & Material Accelerated Apps

MD: All key codes are GPU-accelerated QC: All key codes are ported or optimizing Great multi-GPU performance Focus on using GPU-accelerated math libraries, OpenACC directives Focus on dense (up to 16) GPU nodes &/or large # of GPU nodes GPU-accelerated and available today: ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, , ESPResso, ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS- Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, NWChem, *, PEtot, QUICK, -Chem, QMCPack, OpenMM, PolyFTS, SOP-GPU* & more Quantum Espresso/PWscf, QUICK, TeraChem* Active GPU acceleration projects: CASTEP, GAMESS, , ONETEP, Quantum Supercharger Library*, VASP & more

green* = application where >90% of the workload is on GPU 3 MD vs. QC on GPUs

“Classical” Quantum (MO, PW, DFT, Semi-Emp) Simulates positions of over time; Calculates electronic properties; chemical-biological or ground state, excited states, spectral properties, chemical-material behaviors making/breaking bonds, physical properties

Forces calculated from simple empirical formulas derived from electron wave function (bond rearrangement generally forbidden) (bond rearrangement OK, e.g., bond energies) Up to millions of atoms Up to a few thousand atoms Solvent included without difficulty Generally in a but if needed, solvent treated classically (QM/MM) or using implicit methods

Single precision dominated Double precision is important Uses cuBLAS, cuFFT, CUDA Uses cuBLAS, cuFFT, OpenACC Geforce (Workstations), Tesla (Servers) Tesla recommended ECC off ECC on

4 GPU-Accelerated Molecular Dynamics Apps Green Lettering Indicates Performance Slides Included

GPUGrid.net MELD ACEMD GROMACS NAMD AMBER HALMD OpenMM CHARMM HOOMD-Blue PolyFTS DESMOND LAMMPS ESPResSO mdcore Folding@Home

GPU Perf compared against dual multi-core CPU socket. 5 Benefits of MD GPU-Accelerated Computing Why wouldn’t you want to turbocharge your research?

• 3x-8x Faster than CPU only systems in all tests (on average) • Most major compute intensive aspects of classical MD ported • Large performance boost with marginal price increase • Energy usage cut by more than half • GPUs scale well within a node and/or over multiple nodes • K80 GPU is our fastest and lowest power high performance GPU yet

Try GPU accelerated MD apps for free – www..com/GPUTestDrive 6 ACEMD www.acellera.com

470 ns/day on 1 GPU for L-Iduronic acid (1362 atoms) 116 ns/day on 1 GPU for DHFR (23K atoms)

M. Harvey, G. Giupponi and G. De Fabritiis, ACEMD: Accelerated molecular dynamics in the microseconds timescale, J. Chem. Theory and Comput. 5, 1632 (2009) www.acellera.com

NVT, NPT, PME, TCL, PLUMED, CAMSHIFT1

1 M. J. Harvey and G. De Fabritiis, An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware, J. Chem. Theory Comput., 5, 2371–2377 (2009) 2 For a list of selected references see http://www.acellera.com/acemd/publications AMBER 16 June 2017 JAC_NVE on GP100s

450 23,558 atoms PME 2fs 404.09 400 370.32

350 320.19 320.14

300 Running AMBER version 16 250

ns/day The green nodes contain Dual Intel(R) 200 Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100s GPUs (PCIe and NVLink) 150

100

50

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

11 JAC_NVE on GP100s

900 23,558 atoms PME 4fs 800 782.11 714.23 700 614.42 613.16 600 Running AMBER version 16 500

ns/day The green nodes contain Dual Intel(R) 400 Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100s GPUs (PCIe and NVLink) 300

200

100

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

12 JAC_NPT on GP100s

400 23,558 atoms PME 2fs 360.64 350 333.03

295.75 295.42 300

250 Running AMBER version 16

200

ns/day The green nodes contain Dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + 150 Quadro GP100s GPUs (PCIe and NVLink)

100

50

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

13 JAC_NPT on GP100s

800

23,558 atoms PME 4fs 706.53 700 654.66

600 580.47 578.48

500 Running AMBER version 16

400

ns/day The green nodes contain Dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + 300 Quadro GP100s GPUs (PCIe and NVLink)

200

100

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

14 FactorIX_NVE on GP100s

180 90,906 atoms PME 166.61 160 142.45 140

120 106.23 105.98 Running AMBER version 16 100

ns/day The green nodes contain Dual Intel(R) 80 Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100s GPUs (PCIe and NVLink) 60

40

20

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

15 FactorIX_NPT on GP100s

160 90,906 atoms PME 146.34 140 126.75

120

102.27 102.26 100 Running AMBER version 16

80

ns/day The green nodes contain Dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + 60 Quadro GP100s GPUs (PCIe and NVLink)

40

20

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

16 Cellulose_NVE on GP100s

40 408,609 atoms PME 36.91 35 31.35

30

25 24.01 24.02 Running AMBER version 16

20

ns/day The green nodes contain Dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + 15 Quadro GP100s GPUs (PCIe and NVLink)

10

5

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

17 Cellulose_NPT on GP100s

35 408,609 atoms PME 32.37 30 28.76

25 22.76 22.8

20 Running AMBER version 16

ns/day The green nodes contain Dual Intel(R) 15 Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100s GPUs (PCIe and NVLink)

10

5

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

18 STMV_NPT on GP100s

25 1,067,095 atoms PME 23.13 20.22 20

15.64 15.43 15 Running AMBER version 16

ns/day The green nodes contain Dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + 10 Quadro GP100s GPUs (PCIe and NVLink)

5

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

19 TRPCAGE on GP100s

1500 304 atoms GB 1216.56 1250 1187.3

1000 Running AMBER version 16

750 The green nodes contain Dual Intel(R)

ns/day Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100s GPUs (PCIe and NVLink) 500

250

0 1 node + 1x GP100 1 node + 1x GP100 per node (PCIe) per node (NVLink)

20 Myoglobin on GP100s

600 2,492 atoms GB

470.41 458.28 443.49 447.23 450

Running AMBER version 16

300

ns/day The green nodes contain Dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + Quadro GP100s GPUs (PCIe and NVLink)

150

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

21 Nucleosome on GP100s

25

21.29 25,095 atoms GB 20.51 20

15 Running AMBER version 16

11.47 11.29 ns/day The green nodes contain Dual Intel(R) Core(TM) i7-4820K @ 3.70GHz CPUs + 10 Quadro GP100s GPUs (PCIe and NVLink)

5

0 1 node + 1x GP100 1 node + 1x GP100 1 node + 2x GP100 1 node + 2x GP100 per node (PCIe) per node (NVLink) per node (PCIe) per node (NVLink)

22 AMBER 16 February 2017 PME-Cellulose_NPT on K80s

20 PME-Cellulose_NPT 16 15.43 Running AMBER version 16.3

The blue node contains Dual Intel Xeon 6.6X E5-2699 [email protected] [3.6GHz Turbo] 12 11.36 (Broadwell) CPUs

The green nodes contain Dual Intel ns/day 4.8X Xeon E5-2699 [email protected] [3.6GHz 8 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

➢ 1x K80 is paired with Single Intel 4 Xeon E5-2699 [email protected] [3.6GHz 2.35 Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

24 PME-Cellulose_NPT on P100s PCIe

40 PME-Cellulose_NPT 35

30.00 Running AMBER version 16.3 30 The blue node contains Dual Intel Xeon 25 E5-2699 [email protected] [3.6GHz Turbo] 21.85 (Broadwell) CPUs

ns/day 12.8X 20 The green nodes contain Dual Intel 9.3X Xeon E5-2699 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 10 ➢ 1x P100 PCIe is paired with Single 5 Intel Xeon E5-2699 [email protected] 2.35 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

25 PME-Cellulose_NPT on P100s SXM2

40 PME-Cellulose_NPT 36.65 35 32.22 Running AMBER version 16.3 30 15.6X 13.7X The blue node contains Dual Intel Xeon 25 23.37 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 20

ns/day The green nodes contain Dual Intel 9.9X Xeon E5-2698 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 10 ➢ 1x P100 SXM2 is paired with Single 5 Intel Xeon E5-2698 [email protected] 2.35 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

26 PME-Cellulose_NVE on K80s

20 PME-Cellulose_NVE 16.53 16 Running AMBER version 16.3

6.7X The blue node contains Dual Intel Xeon 11.85 E5-2699 [email protected] [3.6GHz Turbo] 12 (Broadwell) CPUs

ns/day 4.8X The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 8 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

➢ 1x K80 is paired with Single Intel 4 2.47 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

27 PME-Cellulose_NVE on P100s PCIe

40 PME-Cellulose_NVE 35 32.55 Running AMBER version 16.3 30 13.2X The blue node contains Dual Intel Xeon 25 23.34 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

ns/day 20 The green nodes contain Dual Intel 9.4X Xeon E5-2699 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 10 ➢ 1x P100 PCIe is paired with Single 5 Intel Xeon E5-2699 [email protected] 2.47 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

28 PME-Cellulose_NVE on P100s SXM2

45 PME-Cellulose_NVE 40.88 40

35.16 35 Running AMBER version 16.3

30 16.6X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 24.94 25 14.2X (Broadwell) CPUs

ns/day 20 The green nodes contain Dual Intel 10.1X Xeon E5-2698 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

10 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 5 2.47 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

29 PME-FactorIX_NPT on K80s

80 PME-FactorIX_NPT 70 66.68 Running AMBER version 16.3 60 5.8X The blue node contains Dual Intel Xeon 50 48.54 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

40

ns/day The green nodes contain Dual Intel 4.2X Xeon E5-2699 [email protected] [3.6GHz 30 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

20 ➢ 1x K80 is paired with Single Intel 11.43 Xeon E5-2699 [email protected] [3.6GHz 10 Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

30 PME-FactorIX_NPT on P100s PCIe

140 PME-FactorIX_NPT 132.86 120 Running AMBER version 16.3 98.77 11.6X 100 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 80 (Broadwell) CPUs

ns/day 8.6X The green nodes contain Dual Intel 60 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 40 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single 20 Intel Xeon E5-2699 [email protected] 11.43 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

31 PME-FactorIX_NPT on P100s SXM2

180 PME-FactorIX_NPT 159.80 160 144.11 140 14.0X Running AMBER version 16.3

120 12.6X The blue node contains Dual Intel Xeon 106.25 E5-2699 [email protected] [3.6GHz Turbo] 100 (Broadwell) CPUs 9.3X ns/day 80 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz

60 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

40 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 20 11.43 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

32 PME-FactorIX_NVE on K80s

80 PME-FactorIX_NVE 71.49 70 6.0X Running AMBER version 16.3 60

51.14 The blue node contains Dual Intel Xeon 50 E5-2699 [email protected] [3.6GHz Turbo] 5.4X (Broadwell) CPUs 40 The green nodes contain Dual Intel ns/day Xeon E5-2699 [email protected] [3.6GHz 30 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

20 ➢ 1x K80 is paired with Single Intel 11.98 Xeon E5-2699 [email protected] [3.6GHz 10 Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

33 PME-FactorIX_NVE on P100s PCIe

160 PME-FactorIX_NVE 145.83 140

12.2X Running AMBER version 16.3 120 105.86 The blue node contains Dual Intel Xeon 100 E5-2699 [email protected] [3.6GHz Turbo] 8.8X (Broadwell) CPUs

ns/day 80 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 60 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 40 ➢ 1x P100 PCIe is paired with Single 20 Intel Xeon E5-2699 [email protected] 11.98 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

34 PME-FactorIX_NVE on P100s SXM2

200

178.02 180 PME-FactorIX_NVE 159.24 160 14.9X Running AMBER version 16.3 140 The blue node contains Dual Intel Xeon 120 114.88 13.3X E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 100

ns/day The green nodes contain Dual Intel 80 9.6X Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 60 SXM2 GPUs

40 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 20 11.98 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

35 PME-JAC_NPT on K80s

250 PME-JAC_NPT 216.78 200 Running AMBER version 16.3

162.09 4.7X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 150 (Broadwell) CPUs 3.5X The green nodes contain Dual Intel ns/day Xeon E5-2699 [email protected] [3.6GHz 100 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

45.89 ➢ 1x K80 is paired with Single Intel 50 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

36 PME-JAC_NPT on P100s PCIe

350 PME-JAC_NPT 327.69 300 283.60 7.1X Running AMBER version 16.3 250 The blue node contains Dual Intel Xeon 6.2X E5-2699 [email protected] [3.6GHz Turbo] 200 (Broadwell) CPUs

ns/day The green nodes contain Dual Intel 150 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single 45.89 50 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

37 PME-JAC_NPT on P100s SXM2

450 PME-JAC_NPT 423.09 400 360.64 350 9.2X Running AMBER version 16.3 310.52 300 7.9X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 250 6.8X (Broadwell) CPUs

ns/day 200 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz

150 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

100 ➢ 1x P100 SXM2 is paired with Single 45.89 Intel Xeon E5-2698 [email protected] 50 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe per node per node per node

38 PME-JAC_NVE on K80s

250 234.99 PME-JAC_NVE 200 Running AMBER version 16.3 173.20 4.9X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 150 (Broadwell) CPUs

3.6X The green nodes contain Dual Intel ns/day Xeon E5-2699 [email protected] [3.6GHz 100 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

47.90 ➢ 1x K80 is paired with Single Intel 50 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

39 PME-JAC_NVE on P100s PCIe

400 PME-JAC_NVE 363.79 350 308.46 7.6X Running AMBER version 16.3 300 The blue node contains Dual Intel Xeon 250 E5-2699 [email protected] [3.6GHz Turbo] 6.4X (Broadwell) CPUs 200

ns/day The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 150 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 100 ➢ 1x P100 PCIe is paired with Single 47.90 50 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node

40 PME-JAC_NVE on P100s SXM2

500 473.10

450 PME-JAC_NVE 402.18 9.9X 400 Running AMBER version 16.3

350 339.81 8.4X The blue node contains Dual Intel Xeon 300 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 250 7.1X

ns/day The green nodes contain Dual Intel 200 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 150 SXM2 GPUs

100 ➢ 1x P100 SXM2 is paired with Single 47.90 Intel Xeon E5-2698 [email protected] 50 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe per node per node per node

41 GB-Myoglobin on K80s

400 GB-Myoglobin 350 339.45

Running AMBER version 16.3 300 288.47 11.8X The blue node contains Dual Intel Xeon 250 E5-2699 [email protected] [3.6GHz Turbo] 10.0X (Broadwell) CPUs ns/day 200 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 150 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

100 ➢ 1x K80 is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz 50 Turbo] (Broadwell) 28.86

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

42 GB-Myoglobin on P100s PCIe

600 GB-Myoglobin 561.94 500 483.37 19.5X Running AMBER version 16.3 400 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 16.7X (Broadwell) CPUs

ns/day 300 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz

200 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

➢ 1x P100 PCIe is paired with Single 100 Intel Xeon E5-2699 [email protected] 28.86 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 4x P100 PCIe (16GB) per node per node

43 GB-Myoglobin on P100s SXM2

700 GB-Myoglobin 639.37 600 534.28 22.2X Running AMBER version 16.3 500 The blue node contains Dual Intel Xeon 18.5X E5-2699 [email protected] [3.6GHz Turbo] 400 (Broadwell) CPUs

ns/day The green nodes contain Dual Intel 300 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 200 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single 100 Intel Xeon E5-2698 [email protected] 28.86 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 4x P100 PCIe per node per node

44 GB-Nucleosome on K80s

25

GB-Nucleosome 20.55 20 Running AMBER version 16.3 51.4X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 15 (Broadwell) CPUs

11.31 ns/day The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 10 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs 5.84 28.3X ➢ 1x K80 is paired with Single Intel 5 Xeon E5-2699 [email protected] [3.6GHz 14.6X Turbo] (Broadwell) 0.40 0 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

45 GB-Nucleosome on P100s PCIe

50 45.92 45 GB-Nucleosome 39.91 114.8X 40 Running AMBER version 16.3

35 99.8X The blue node contains Dual Intel Xeon 30 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

ns/day 25 22.77 The green nodes contain Dual Intel 20 Xeon E5-2699 [email protected] [3.6GHz 56.9X Turbo] (Broadwell) CPUs + Tesla P100 15 PCIe (16GB) GPUs 11.91 10 ➢ 1x P100 PCIe is paired with Single 29.8X Intel Xeon E5-2699 [email protected] 5 [3.6GHz Turbo] (Broadwell) 0.40 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

46 GB-Nucleosome on P100s SXM2

60 GB-Nucleosome 50 48.29 46.29 Running AMBER version 16.3 120.7X 40 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 115.7X (Broadwell) CPUs 30

ns/day 25.53 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz

20 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 13.36 63.8X ➢ 1x P100 SXM2 is paired with Single 10 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) 0.40 33.4X 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

47 Rubisco-75K on K80s

1.6 Rubisco-75K 1.4 1.34 Running AMBER version 16.3 1.2 134.0X The blue node contains Dual Intel Xeon 1.0 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

ns/day 0.8 0.69 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 0.6 Turbo] (Broadwell) CPUs + Tesla K80 69.0X (autoboost) GPUs 0.4 0.35 ➢ 1x K80 is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz 0.2 Turbo] (Broadwell)

0.01 35.0X 0.0 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

48 Rubisco-75K on P100s PCIe

4.5 Rubisco-75K 4.20 4.0 420.0X 3.5 Running AMBER version 16.3

3.0 The blue node contains Dual Intel Xeon 2.69 E5-2699 [email protected] [3.6GHz Turbo] 2.5 (Broadwell) CPUs

ns/day 2.0 269.0X The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz

1.5 1.40 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

1.0 0.71 ➢ 1x P100 PCIe is paired with Single 140.0X Intel Xeon E5-2699 [email protected] 0.5 [3.6GHz Turbo] (Broadwell) 0.01 71.0X 0.0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

49 Rubisco-75K on P100s SXM2

5.0

4.46 4.5 Rubisco-75K

4.0 Running AMBER version 16.3

3.5 446.0X The blue node contains Dual Intel Xeon 3.06 3.0 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 2.5 306.0X

ns/day The green nodes contain Dual Intel 2.0 Xeon E5-2698 [email protected] [3.6GHz 1.57 Turbo] (Broadwell) CPUs + Tesla P100 1.5 SXM2 GPUs

1.0 0.80 157.0X ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 0.5 [3.6GHz Turbo] (Broadwell) 0.01 80.0X 0.0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

50 AMBER 14 AMBER 14 vs. AMBER 12

Courtesy of Scott Le Grand From GTC 2014 presentation

52 AMBER 14; large P2P and small Boost Clocks impacts

AMBER 14 (ns/day) on 4x K40; P2P and Boost Clocks Impact DHFR NVE PME, 2fs Benchmark (CUDA 6.0, ECC off) 250

215.18

196.68 200

150 132.97

125.77 ns/day 100

No Boost Boost No Boost Boost No P2P No P2P P2P P2P 50

0 2 x Xeon E5-2690 [email protected] + 4 x 2 x Xeon E5-2690 [email protected] + 4 x 2 x Xeon E5-2690 [email protected] + 4 x 2 x Xeon E5-2690 [email protected] + 4 x Tesla K40@745Mhz (no P2P) Tesla K40@875Mhz (no P2P) Tesla K40@745Mhz (P2P) Tesla K40@875Mhz (P2P) Series1 125.77 132.97 196.68 215.18

53 AMBER Performance Over Time

Courtesy of Scott Le Grand From GTC 2014 presentation 54 54 Cellulose on K40s, K80s and M6000s

20 Running AMBER version 14 PME-Cellulose_NVE The blue node contains Dual Intel E5- 15.38 16 14.90 2698 [email protected], 3.6GHz Turbo CPUs 13.67 The green nodes contain Dual Intel E5- 11.76 8.0X 12 7.7X 2698 [email protected], 3.6GHz Turbo CPUs + 10.49 7.1X either NVIDIA Tesla K40@875Mhz, Tesla 8.96 6.1X K80@562Mhz (autoboost), or Quadro 7.87 8 M6000@987Mhz GPUs 4.6X 5.4X 4.1X 4 Simulated Simulated Time (ns/day) 1.93

0 1 Haswell 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node Node + 1x K40 + 0.5x K80 + 1x K80 + 1x M6000 + 2x K40 + 2x K80 + 2x M6000

55 Factor IX on K40s, K80s and M6000s

80 Running AMBER version 14 PME-FactorIX_NVE 70 66.89 The blue node contains Dual Intel E5- 61.18 60.93 2698 [email protected], 3.6GHz Turbo CPUs 60 7.0X 50.70 6.4X The green nodes contain Dual Intel E5- 50 47.80 6.3X 2698 [email protected], 3.6GHz Turbo CPUs + 40.48 5.2X either NVIDIA Tesla K40@875Mhz, Tesla 40 5.0X K80@562Mhz (autoboost), or Quadro 33.59 M6000@987Mhz GPUs 30 4.2X 3.5X 20

Simulated Simulated Time (ns/day) 9.68 10

0 1 Haswell 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node Node + 1x K40 + 0.5x K80 + 1x K80 + 1x M6000 + 2x K40 + 2x K80 + 2x M6000

56 JAC on K40s, K80s and M6000s

250 Running AMBER version 14 225.34 PME-JAC_NVE 219.83 200.34 The blue node contains Dual Intel E5- 200 2698 [email protected], 3.6GHz Turbo CPUs 6.0X 5.9X 174.34 5.4X 161.53 The green nodes contain Dual Intel E5- 2698 [email protected], 3.6GHz Turbo CPUs + 150 134.82 4.7X 121.30 either NVIDIA Tesla K40@875Mhz, Tesla 4.3X K80@562Mhz (autoboost), or Quadro 3.6X 100 M6000@987Mhz GPUs 3.2X

50 37.38 Simulated Simulated Time (ns/day)

0 1 Haswell 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node Node + 1x K40 + 0.5x K80 + 1x K80 + 1x M6000 + 2x K40 + 2x K80 + 2x M6000

57 Cellulose on M40s

18

PME - Cellulose_NPT 15.90 16 Running AMBER version 14 14.40 14.9X 14 The blue node contain Single Intel Xeon 13.5X E5-2698 [email protected] (Haswell) CPUs 12 The green nodes contain Single Intel 10.12 10 Xeon E5-2697 [email protected] (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs 8 9.5X

6 Simulated Simulated Time (ns/Day)

4

2 1.07

0 1 Node 1 Node + 1 Node + 1 Node + 1x M40 per node 2x M40 per node 4x M40 per node

58 Cellulose on M40s

18 PME - Cellulose_NVE 17.13 16 15.41 16.0X Running AMBER version 14 14 The blue node contain Single Intel Xeon 14.4X E5-2698 [email protected] (Haswell) CPUs 12 10.50 The green nodes contain Single Intel 10 Xeon E5-2697 [email protected] (IvyBridge) 9.8X CPUs + Tesla M40 (autoboost) GPUs 8

6 Simulated Simulated Time (ns/Day)

4

2 1.07

0 1 Node 1 Node + 1 Node + 1 Node + 1x M40 per node 2x M40 per node 4x M40 per node

59 FactorIX on M40s

80 PME - FactorIX_NPT 72.96 70 67.37 Running AMBER version 14 13.6X 60 The blue node contain Single Intel Xeon 12.5X E5-2698 [email protected] (Haswell) CPUs 50 46.90 The green nodes contain Single Intel Xeon E5-2697 [email protected] (IvyBridge) 40 CPUs + Tesla M40 (autoboost) GPUs 8.7X

30 Simulated Simulated Time (ns/Day) 20

10 5.38

0 1 Node 1 Node + 1 Node + 1 Node + 1x M40 per node 2x M40 per node 4x M40 per node

60 FactorIX on M40s

90 PME - FactorIX_NVE 80.04 80 Running AMBER version 14 73.00 14.6X 70 The blue node contain Single Intel Xeon 13.3X E5-2698 [email protected] (Haswell) CPUs 60 The green nodes contain Single Intel 49.33 50 Xeon E5-2697 [email protected] (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs 40 9.0X

30 Simulated Simulated Time (ns/Day)

20

10 5.47

0 1 Node 1 Node + 1 Node + 1 Node + 1x M40 per node 2x M40 per node 4x M40 per node

61 JAC on M40s

250 PME - JAC_NPT 226.63 211.97 Running AMBER version 14 200 10.9X The blue node contain Single Intel Xeon 10.2X E5-2698 [email protected] (Haswell) CPUs 149.40 150 The green nodes contain Single Intel Xeon E5-2697 [email protected] (IvyBridge) 7.2X CPUs + Tesla M40 (autoboost) GPUs

100 Simulated Simulated Time (ns/Day)

50

20.88

0 1 Node 1 Node + 1 Node + 1 Node + 1x M40 per node 2x M40 per node 4x M40 per node

62 JAC on M40s

300 PME - JAC_NVE 246.15 Running AMBER version 14 250 230.18 The blue node contain Single Intel Xeon 11.7X E5-2698 [email protected] (Haswell) CPUs 200 The green nodes contain Single Intel 157.68 10.9X Xeon E5-2697 [email protected] (IvyBridge) 150 CPUs + Tesla M40 (autoboost) GPUs 7.5X

100 Simulated Simulated Time (ns/Day)

50 21.11

0 1 Node 1 Node + 1 Node + 1 Node + 1x M40 per node 2x M40 per node 4x M40 per node

63 Myoglobin on M40s

350 GB - Myoglobin 322.09 300.86 300 Running AMBER version 14 32.8X The blue node contain Single Intel Xeon 30.6X E5-2698 [email protected] (Haswell) CPUs 250 232.20 The green nodes contain Single Intel 200 Xeon E5-2697 [email protected] (IvyBridge) 23.6X CPUs + Tesla M40 (autoboost) GPUs 150

Simulated Simulated Time (ns/Day) 100

50

9.83 0 1 Node 1 Node + 1 Node + 1 Node + 1x M40 per node 2x M40 per node 4x M40 per node

64 Nucleosome on M40s

18 GB - Nucleosome 16.11 16 Running AMBER version 14

14 The blue node contain Single Intel Xeon 123.9X E5-2698 [email protected] (Haswell) CPUs 12 The green nodes contain Single Intel 10 Xeon E5-2697 [email protected] (IvyBridge) 9.05 CPUs + Tesla M40 (autoboost) GPUs 8 69.6X

6 Simulated Simulated Time (ns/Day) 4.67 4 35.9X 2

0.13 0 1 Node 1 Node + 1 Node + 1 Node + 1x M40 per node 2x M40 per node 4x M40 per node

65 TrpCage on M40s

900 831.91 GB - TrpCage 800 2.03X Running AMBER version 14 700 The blue node contain Single Intel Xeon E5-2698 [email protected] (Haswell) CPUs 600 551.36 The green nodes contain Single Intel 500 464.63 Xeon E5-2697 [email protected] (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs 408.88 1.3X 400 1.1X

300 Simulated Simulated Time (ns/Day)

200

100

0 1 Node 1 Node + 1 Node + 1 Node + 1x M40 per node 2x M40 per node 4x M40 per node

66 Recommended GPU Node Configuration for AMBER Workstation or Single Node Configuration # of CPU sockets 2

Cores per CPU socket 6+ (1 CPU core drives 1 GPU)

CPU speed (Ghz) 2.66+ System memory per node (GB) 16

GPUs Kepler K20, K40, K80, P100

1-4 # of GPUs per CPU socket

GPU memory preference (GB) 6 GPU to CPU connection PCIe 3.0 16x or higher Server storage 2 TB

Network configuration Infiniband QDR or better 67 Scale to multiple nodes with same single node configuration 67 CHARMM DOMDEC-GUI July 2016 CHARMM DOMDEC-GUI 465 K System Benchmark

4 465 K System

(Her1_HER1_membrane) Running CHARMM version c40a1 3 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs 2.15 The green nodes contain Dual Intel 2 ns/day Xeon E5-2698 [email protected] GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs 6.0X

1 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 0.36

0 1 Haswell node 1 node + 1x K80 per node

69 CHARMM DOMDEC-GUI 534 K System Benchmark

2.0 534 K System

(POPC_PSPC_CHL1mixture) Running CHARMM version c40a1 1.5 1.43 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs

The green nodes contain Dual Intel

1.0 Xeon E5-2698 [email protected] GHz (Haswell) ns/day 8.0X CPUs + Tesla K80 (autoboost) GPUs

0.5 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 0.18

0.0 1 Haswell node 1 node + 1x K80 per node

70 CHARMM DOMDEC-GUI 20 K System Benchmark

80 20 K System (Crambin)

59.68 Running CHARMM version c40a1 60 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs

The green nodes contain Dual Intel ns/day 40 Xeon E5-2698 [email protected] GHz (Haswell) CPUs + Tesla M40 GPUs 3.7X

20 16.00 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error.

0 1 Haswell node 1 node + 1x M40 per node

71 CHARMM DOMDEC-GUI 61 K System Benchmark

35 61 K System (GlnBP) 30 *Higher is better Running CHARMM version c40a1 25.08 25 The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs 20 The green nodes contain Dual Intel

6.4X Xeon E5-2698 [email protected] GHz (Haswell) ns/day 15 CPUs + Tesla M40 GPUs

10 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 5 3.90

0 1 Haswell node 1 node + 1x M40 per node

72 CHARMM DOMDEC-GUI 465 K System Benchmark

4 465 K System

(Her1_HER1_membrane) Running CHARMM version c40a1 3 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs 2.27 The green nodes contain Dual Intel 2 Xeon E5-2698 [email protected] GHz (Haswell) ns/day CPUs + Tesla M40 GPUs 6.3X

1 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 0.36

0 1 Haswell node 1 node + 1x M40 per node

73 GROMACS 2016 October 2016 Erik Lindahl (GROMACS developer) video

75 Water 1.5M on K80s

7 Water 1.5M 6.14 6 5.22 2.2X 5 Running GROMACS version 2016 1.9X The blue node contains Dual Intel Xeon 4 E5-2698 [email protected] [3.6GHz Turbo]

(Broadwell) CPUs ns/day 3 2.79 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz

2 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1

0 1 Broadwell node 1 node + 2x K80 per node 1 node + 4x K80 per node

76 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. Water 3M on K80s

4

Water 3M 3.05 3 2.66 2.3X 3 Running GROMACS version 2016 2.0X The blue node contains Dual Intel Xeon 2 E5-2698 [email protected] [3.6GHz Turbo]

(Broadwell) CPUs ns/day 2 1.32 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 1 (autoboost) GPUs

1

0 1 Broadwell node 1 node + 2x K80 per node 1 node + 4x K80 per node

77 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. Water 1.5M on M40s

8 Water 1.5M 7.60 7 6.15 2.7X 6 2.2X Running GROMACS version 2016 5 The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo]

4 (Broadwell) CPUs ns/day The green nodes contain Dual Intel 2.79 3 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla M40 2 (autoboost) GPUs

1

0 1 Broadwell node 1 node + 2x M40 per node 1 node + 4x M40 per node

78 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. Water 3M on M40s

4.5

3.94 4.0 Water 3M

3.5

2.97 3.0X Running GROMACS version 2016 3.0 The blue node contains Dual Intel Xeon 2.5 E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

ns/day 2.3X 2.0 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 1.5 1.32 Turbo] (Broadwell) CPUs + Tesla M40 (autoboost) GPUs 1.0

0.5

0.0 1 Broadwell node 1 node + 2x M40 per node 1 node + 4x M40 per node

79 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. Water 1.5M on P40s

9

8.07 8 Water 1.5M

7 6.60 2.9X Running GROMACS version 2016 6 2.4X The blue node contains Dual Intel Xeon 5 E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

ns/day 4 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 3 2.79 Turbo] (Broadwell) CPUs + Tesla P40 GPUs 2

1

0 1 Broadwell node 1 node + 2x P40 per node 1 node + 4x P40 per node

80 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. Water 3M on P40s

4.5 4.19

4.0 Water 3M

3.5 3.36 3.2X Running GROMACS version 2016 3.0 2.5X The blue node contains Dual Intel Xeon 2.5 E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

ns/day 2.0 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 1.5 1.32 Turbo] (Broadwell) CPUs + Tesla P40 GPUs 1.0

0.5

0.0 1 Broadwell node 1 node + 2x P40 per node 1 node + 4x P40 per node

81 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. Water 1.5M on P100 PCIes

8 Water 1.5M 7.11 7 6.34 2.5X 6 Running GROMACS version 2016 5 2.3X The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo]

4 (Broadwell) CPUs ns/day

3 2.79 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 2 Tesla P100 PCIe (16GB) GPUs

1

0 1 Broadwell node 1 node + 2x P100 PCIe (16GB) 1 node + 4x P100 PCIe (16GB) per node per node

82 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. Water 3M on P100 PCIes

4.0

3.5 Water 3M 3.43 3.16

3.0 2.6X 2.4X Running GROMACS version 2016 2.5 The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo]

2.0 (Broadwell) CPUs ns/day

1.5 1.32 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 1.0 Tesla P100 PCIe (16GB) GPUs

0.5

0.0 1 Broadwell node 1 node + 2x P100 PCIe (16GB) 1 node + 4x P100 PCIe (16GB) per node per node

83 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. GROMACS 5.1.2 February 2017 Water 1.5M on K80s

7 Water 1.5M 6 5.75 Running GROMACS version 5.1.2

5 The blue node contains Dual Intel Xeon 1.9X E5-2699 [email protected] [3.6GHz Turbo] 4 (Broadwell) CPUs 3.49

3.04 The green nodes contain Dual Intel ns/day 3 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 1.1X (autoboost) GPUs 2 ➢ 1x K80 is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz 1 Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

85 Water 1.5M on P100s PCIe

10 Water 1.5M 8 7.21 Running GROMACS version 5.1.2 6.96 The blue node contains Dual Intel Xeon 6 E5-2699 [email protected] [3.6GHz Turbo] 2.4X (Broadwell) CPUs 4.39 ns/day 2.3X The green nodes contain Dual Intel 4 Xeon E5-2699 [email protected] [3.6GHz 3.04 Turbo] (Broadwell) CPUs + Tesla P100 1.4X PCIe (16GB) GPUs 2 ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) 4x P100 PCIe (16GB) per node per node per node

86 Water 1.5M on P100s SXM2

9

7.88 8 Water 1.5M 7.18 7 6.70 2.6X Running GROMACS version 5.1.2

6 2.4X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 5 2.2X (Broadwell) CPUs

4.11 ns/day 4 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 3.04 3 1.4X Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

2 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 1 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x 100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

87 Water 3M on K80s

3.5

Water 3M 2.98 3.0 Running GROMACS version 5.1.2 2.5 The blue node contains Dual Intel Xeon 2.2X E5-2699 [email protected] [3.6GHz Turbo] 2.0 (Broadwell) CPUs

1.59 The green nodes contain Dual Intel ns/day 1.5 1.38 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 1.2X (autoboost) GPUs 1.0 ➢ 1x K80 is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz 0.5 Turbo] (Broadwell)

0.0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

88 Water 3M on P100s PCIe

4.0 Water 3M 3.80 3.5 3.43

2.8X Running GROMACS version 5.1.2 3.0 2.5X The blue node contains Dual Intel Xeon 2.5 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 1.96

ns/day 2.0 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 1.5 1.38 Turbo] (Broadwell) CPUs + Tesla P100 1.4X PCIe (16GB) GPUs 1.0 ➢ 1x P100 PCIe is paired with Single 0.5 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0.0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) 4x P100 PCIe (16GB) per node per node per node

89 Water 3M on P100s SXM2

4.5

4.0 Water 3M 3.82

3.50 3.5 Running GROMACS version 5.1.2

3.0 2.8X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 2.5 2.5X (Broadwell) CPUs

ns/day 2.0 1.84 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz

1.5 1.38 Turbo] (Broadwell) CPUs + Tesla P100 1.3X SXM2 GPUs 1.0 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 0.5 [3.6GHz Turbo] (Broadwell)

0.0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

90 Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or Single Node Configuration # of CPU sockets 2 Cores per CPU socket 6+ CPU speed (Ghz) 2.66+

System memory per socket (GB) 32

GPUs Kepler K20, K40, K80

1x # of GPUs per CPU socket Kepler GPUs: need fast Sandy Bridge or Ivy Bridge, or high-end AMD Opterons

GPU memory preference (GB) 6

GPU to CPU connection PCIe 3.0 or higher Server storage 500 GB or higher

91 Network configuration Gemini, InfiniBand 91 HOOMD-Blue 1.3.3 February 2017 lj-liquid on K80s

2500 lj-liquid 2000 1942.12 Running HOOMD-Blue version 1.3.3

1594.37 The blue node contains Dual Intel Xeon 5.9X E5-2699 [email protected] [3.6GHz Turbo] 1500 1324.84 (Broadwell) CPUs The green nodes contain Dual Intel timesteps/sec 4.9X Xeon E5-2699 [email protected] [3.6GHz 1000 avg Turbo] (Broadwell) CPUs + Tesla K80 4.1X (autoboost) GPUs ➢ 1x K80 is paired with Single Intel 500 326.52 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

93 lj-liquid on P100s PCIe

3500 lj-liquid 3217.68 3000 2912.66 9.9X Running HOOMD-Blue version 1.3.3 2500 The blue node contains Dual Intel Xeon 8.9X E5-2699 [email protected] [3.6GHz Turbo] /sec 2000 (Broadwell) CPUs

The green nodes contain Dual Intel timesteps 1500 Xeon E5-2699 [email protected] [3.6GHz

avg Turbo] (Broadwell) CPUs + Tesla P100 1000 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single 500 326.52 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 8x P100 PCIe (16GB) per node per node

94 lj-liquid on P100s SXM2

4000 lj-liquid 3500 3397.74 3129.11 Running HOOMD-Blue version 1.3.3 3000 10.4X The blue node contains Dual Intel Xeon 2500 E5-2699 [email protected] [3.6GHz Turbo] /sec 9.6X (Broadwell) CPUs 2000 The green nodes contain Dual Intel timesteps Xeon E5-2698 [email protected] [3.6GHz

avg 1500 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 1000 ➢ 1x P100 SXM2 is paired with Single 500 Intel Xeon E5-2698 [email protected] 326.52 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1x P100 SXM2 8x P100 SXM2 per node per node

95 lj_liquid_512k on K80s

600 lj_liquid_512k 526.47 500 12.1X Running HOOMD-Blue version 1.3.3 400 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 334.59 /sec (Broadwell) CPUs

300 The green nodes contain Dual Intel

timesteps Xeon E5-2699 [email protected] [3.6GHz 220.10 7.7X

avg Turbo] (Broadwell) CPUs + Tesla K80 200 (autoboost) GPUs 5.1X ➢ 1x K80 is paired with Single Intel 100 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) 43.43

0 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

96 lj_liquid_512k on P100s PCIe

1200 lj_liquid_512k 1045.50 1000 24.1X Running HOOMD-Blue version 1.3.3 800 770.18 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] /sec 17.7X (Broadwell) CPUs 600 534.54 The green nodes contain Dual Intel timesteps Xeon E5-2699 [email protected] [3.6GHz

avg 398.12 400 12.3X Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

9.2X ➢ 1x P100 PCIe is paired with Single 200 Intel Xeon E5-2699 [email protected] 43.43 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

97 lj_liquid_512k on P100s SXM2

1200 1119.76 lj_liquid_512k 1000 Running HOOMD-Blue version 1.3.3

793.36 25.8X 800 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] /sec 18.3X (Broadwell) CPUs 600 568.51

The green nodes contain Dual Intel timesteps

443.74 Xeon E5-2698 [email protected] [3.6GHz avg 400 13.1X Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 10.2X ➢ 1x P100 SXM2 is paired with Single 200 Intel Xeon E5-2698 [email protected] 43.43 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

98 lj_liquid_1m on K80s

350

lj_liquid_1m 303.00 300 Running HOOMD-Blue version 1.3.3 250 The blue node contains Dual Intel Xeon 13.7X E5-2699 [email protected] [3.6GHz Turbo]

/sec 200 (Broadwell) CPUs 181.42 The green nodes contain Dual Intel

timesteps 150 Xeon E5-2699 [email protected] [3.6GHz

Turbo] (Broadwell) CPUs + Tesla K80 avg 109.54 8.2X (autoboost) GPUs 100 ➢ 1x K80 is paired with Single Intel 5.0X Xeon E5-2699 [email protected] [3.6GHz 50 Turbo] (Broadwell) 22.07

0 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

99 lj_liquid_1m on P100s PCIe

800 lj_liquid_1m 700 672.46

Running HOOMD-Blue version 1.3.3 600 30.5X The blue node contains Dual Intel Xeon 500 465.58 E5-2699 [email protected] [3.6GHz Turbo]

/sec (Broadwell) CPUs 400 The green nodes contain Dual Intel timesteps 294.88 21.1X Xeon E5-2699 [email protected] [3.6GHz 300 avg Turbo] (Broadwell) CPUs + Tesla P100 204.67 13.4X PCIe (16GB) GPUs 200 9.3X ➢ 1x P100 PCIe is paired with Single 100 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) 22.07 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

100 lj_liquid_1m on P100s SXM2

800 lj_liquid_1m 707.73 700

Running HOOMD-Blue version 1.3.3 600 32.1X The blue node contains Dual Intel Xeon 488.04 500 E5-2699 [email protected] [3.6GHz Turbo] /sec (Broadwell) CPUs 400 22.1X The green nodes contain Dual Intel timesteps 315.07 Xeon E5-2698 [email protected] [3.6GHz

avg 300 Turbo] (Broadwell) CPUs + Tesla P100 221.02 14.3X SXM2 GPUs 200 10.0X ➢ 1x P100 SXM2 is paired with Single 100 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) 22.07 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

101 Microsphere on K80s

180 microsphere 166.74 160 9.5X 140 Running HOOMD-Blue version 1.3.3

120 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo]

/sec 98.43 (Broadwell) CPUs 100 The green nodes contain Dual Intel 80 timesteps Xeon E5-2699 [email protected] [3.6GHz 64.87 avg 5.6X Turbo] (Broadwell) CPUs + Tesla K80 60 (autoboost) GPUs 3.7X 40 ➢ 1x K80 is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz 17.53 20 Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

102 Microsphere on P100s PCIe

400 microsphere 371.24 350

21.2X Running HOOMD-Blue version 1.3.3 300 257.58 The blue node contains Dual Intel Xeon 250 E5-2699 [email protected] [3.6GHz Turbo] /sec 14.7X (Broadwell) CPUs 200

179.54 The green nodes contain Dual Intel timesteps 145.71 Xeon E5-2699 [email protected] [3.6GHz avg 150 Turbo] (Broadwell) CPUs + Tesla P100 10.2X PCIe (16GB) GPUs 100 8.3X ➢ 1x P100 PCIe is paired with Single 50 Intel Xeon E5-2699 [email protected] 17.53 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

103 Microsphere on P100s SXM2

450 microsphere 400 384.72

350 Running HOOMD-Blue version 1.3.3 21.9X 300 The blue node contains Dual Intel Xeon

271.21 E5-2699 [email protected] [3.6GHz Turbo] /sec 250 (Broadwell) CPUs 15.5X 200 186.01 The green nodes contain Dual Intel timesteps Xeon E5-2698 [email protected] [3.6GHz

avg 151.51 150 Turbo] (Broadwell) CPUs + Tesla P100 10.6X SXM2 GPUs 100 ➢ 1x P100 SXM2 is paired with Single 8.6X Intel Xeon E5-2698 [email protected] 50 17.53 [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

104 on K80s

1600 polymer 1518.99 1400

1209.45 4.2X Running HOOMD-Blue version 1.3.3 1200 The blue node contains Dual Intel Xeon 975.14 1000 E5-2699 [email protected] [3.6GHz Turbo] /sec 3.3X (Broadwell) CPUs 800 The green nodes contain Dual Intel

timesteps 2.7X Xeon E5-2699 [email protected] [3.6GHz

avg 600 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs 362.19 400 ➢ 1x K80 is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz 200 Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

105 Polymer on P100s PCIe

3000

polymer 2480.70 2500 Running HOOMD-Blue version 1.3.3 2143.15 1999.64 6.8X 2000 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] /sec 5.9X (Broadwell) CPUs 1500 The green nodes contain Dual Intel

timesteps 5.5X Xeon E5-2699 [email protected] [3.6GHz avg 1000 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

➢ 1x P100 PCIe is paired with Single 500 362.19 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe (16GB) 4x P100 PCIe (16GB) 8x P100 PCIe (16GB) per node per node per node

106 Polymer on P100s SXM2

3000 polymer 2651.56 2500 2272.27 Running HOOMD-Blue version 1.3.3 2111.99 7.3X 2000 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] /sec 6.3X (Broadwell) CPUs 1500 The green nodes contain Dual Intel

timesteps 5.8X Xeon E5-2698 [email protected] [3.6GHz avg 1000 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

➢ 1x P100 SXM2 is paired with Single 500 362.19 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node

107 Quasicrystal on K80s

1400 quasicrystal 1280.44 1200 16.3X Running HOOMD-Blue version 1.3.3 1000 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo]

/sec 800 767.90 (Broadwell) CPUs

The green nodes contain Dual Intel

timesteps 600 Xeon E5-2699 [email protected] [3.6GHz 502.53

avg Turbo] (Broadwell) CPUs + Tesla K80 9.8X (autoboost) GPUs 400 ➢ 1x K80 is paired with Single Intel 6.4X Xeon E5-2699 [email protected] [3.6GHz 200 Turbo] (Broadwell) 78.32

0 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

108 Quasicrystal on P100s PCIe

2500 quasicrystal 2261.72 2000 Running HOOMD-Blue version 1.3.3 1791.41 28.9X The blue node contains Dual Intel Xeon 1500 E5-2699 [email protected] [3.6GHz Turbo] /sec (Broadwell) CPUs 1199.64 22.9X

The green nodes contain Dual Intel timsteps 1000 Xeon E5-2699 [email protected] [3.6GHz avg 851.29 Turbo] (Broadwell) CPUs + Tesla P100 15.3X PCIe (16GB) GPUs 10.9X 500 ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) 78.32 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

109 Quasicrystal on P100s SXM2

3000 quasicrystal 2500 2429.68 Running HOOMD-Blue version 1.3.3

2000 1940.29 31.0X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] /sec 24.8X (Broadwell) CPUs 1500 1249.90 The green nodes contain Dual Intel

timsteps Xeon E5-2698 [email protected] [3.6GHz avg 1000 939.53 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

16.0X ➢ 1x P100 SXM2 is paired with Single 500 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) 78.32 12.0X 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

110 Triblock-copolymer on K80s

1600 triblock-copolymer 1492.01 1400 4.1X Running HOOMD-Blue version 1.3.3 1200 1170.47 The blue node contains Dual Intel Xeon 1000 953.01 E5-2699 [email protected] [3.6GHz Turbo] /sec 3.2X (Broadwell) CPUs 800 The green nodes contain Dual Intel

timesteps 2.6X Xeon E5-2699 [email protected] [3.6GHz

avg 600 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs 361.42 400 ➢ 1x K80 is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz 200 Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

111 Triblock-copolymer on P100s PCIe

3000 triblock-copolymer 2456.09 2500 Running HOOMD-Blue version 1.3.3 2155.27 1999.14 6.8X 2000 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo]

/sec 6.0X (Broadwell) CPUs 1500 5.5X The green nodes contain Dual Intel

timesteps Xeon E5-2699 [email protected] [3.6GHz avg 1000 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

➢ 1x P100 PCIe is paired with Single 500 361.42 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe (16GB) 4x P100 PCIe (16GB) 8x P100 PCIe (16GB) per node per node per node

112 Triblock-copolymer on P100s SXM2

3000.00

triblock-copolymer 2587.91 2500.00 2253.83 Running HOOMD-Blue version 1.3.3 2132.92 7.2X 2000.00 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] /sec 6.2X (Broadwell) CPUs 1500.00 The green nodes contain Dual Intel

timesteps 5.9X Xeon E5-2698 [email protected] [3.6GHz avg 1000.00 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

➢ 1x P100 SXM2 is paired with Single 500.00 361.42 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell)

0.00 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node

113 LAMMPS 2016 February 2017 Atomic-Fluid Lennard-Jones 2.5 Cutoff on K80s 1.00 Atomic-Fluid Lennard-Jones

0.80 2.5 Cutoff

Running LAMMPS version 2016

0.60 0.57 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

1/seconds 0.40 0.37 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 1.5X Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs 0.20

0.00 1 Broadwell node 1 node + 2x K80 per node

115 Atomic-Fluid Lennard-Jones 2.5 Cutoff on P100s PCIe 1.00 Atomic-Fluid Lennard-

0.80 Jones 2.5 Cutoff

0.62 Running LAMMPS version 2016 0.60 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 0.37 1/seconds 0.40 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 1.7X Turbo] (Broadwell) CPUs + Tesla P100 0.20 PCIe (16GB) GPUs

0.00 1 Broadwell node 1 node + 2x P100 PCIe (16GB) per node

116 Atomic-Fluid Lennard-Jones 2.5 Cutoff on P100s SXM2

1.00 Atomic-Fluid Lennard- Jones 2.5 Cutoff 0.75 Running LAMMPS version 2016 0.64 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 0.50 (Broadwell) CPUs

0.37 The green nodes contain Dual Intel 1/seconds 1.7X Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 0.25 SXM2 (autoboost) GPUs

0.00 1 Broadwell node 1 node + 2x P100 SXM2 per node

117 Atomic-Fluid Lennard-Jones 5.0 Cutoff on K80s 1.00 Atomic-Fluid Lennard-

0.80 Jones 5.0 Cutoff Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 0.60 (Broadwell) CPUs

The green nodes contain Dual Intel

1/seconds Xeon E5-2699 [email protected] [3.6GHz 0.40 0.36 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs 0.26 3.6X ➢ 1x K80 is paired with Single Intel 0.20 0.14 Xeon E5-2699 [email protected] [3.6GHz 0.10 2.6X Turbo] (Broadwell) 1.4X 0.00 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

118 Atomic-Fluid Lennard-Jones 5.0 Cutoff on P100s PCIe 1.00 Atomic-Fluid Lennard-Jones

0.80 5.0 Cutoff Running LAMMPS version 2016

The blue node contains Dual Intel Xeon 0.60 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel 0.37 0.38 Xeon E5-2699 [email protected] [3.6GHz 1/seconds 0.40 0.35 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 0.22 3.7X 3.8X 0.20 3.5X ➢ 1x P100 PCIe is paired with Single 0.10 Intel Xeon E5-2699 [email protected] 2.2X [3.6GHz Turbo] (Broadwell) 0.00 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

119 Atomic-Fluid Lennard-Jones 5.0 Cutoff on P100s SXM2

1.00 Atomic-Fluid Lennard-

Running LAMMPS version 2016 0.75 Jones 5.0 Cutoff The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 0.50 0.41 The green nodes contain Dual Intel

1/seconds 0.36 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 4.1X SXM2 GPUs 0.25 0.22 3.6X ➢ 1x P100 SXM2 is paired with Single 0.10 Intel Xeon E5-2698 [email protected] 2.2X [3.6GHz Turbo] (Broadwell)

0.00 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

120 Course-grain Water on K80s

0.0100 Course-grain Water 0.0080

Running LAMMPS version 2016

0.0060 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo]

0.00437 0.00444 (Broadwell) CPUs 1/seconds 0.0040 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 1.0X Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs 0.0020

0.0000 1 Broadwell node 1 node + 4x K80 per node

121 Course-grain Water on P100s PCIe

0.0100 0.0093 0.0090 Course-grain Water

0.0080

0.0070 2.1X Running LAMMPS version 2016 0.0061 0.0060 The blue node contains Dual Intel Xeon 0.0050 E5-2699 [email protected] [3.6GHz Turbo] 0.0044 (Broadwell) CPUs

1/seconds 0.0040 1.4X The green nodes contain Dual Intel 0.0030 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 0.0020 PCIe (16GB) GPUs

0.0010

0.0000 1 Broadwell node 1 node + 1 node + 4x P100 PCIe (16GB) 8x P100 PCIe (16GB) per node per node

122 Course-grain Water on P100s SXM2

0.0120 Course-grain Water 0.0110 0.0100 2.5X 0.0080 Running LAMMPS version 2016 0.0069 The blue node contains Dual Intel Xeon 0.0060 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 1/seconds 0.0044 1.6X 0.0040 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 0.0020

0.0000 1 Broadwell node 1 node + 1 node + 4x P100 SXM2 8x 100 SXM2 per node per node

123 EAM on K80s

0.07 0.07 EAM 0.06 7.0X Running LAMMPS version 2016 0.05 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 0.04 0.04 (Broadwell) CPUs

The green nodes contain Dual Intel 0.03 1/seconds Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 0.02 4.0X (autoboost) GPUs 0.02 ➢ 1x K80 is paired with Single Intel 0.01 2.0X Xeon E5-2699 [email protected] [3.6GHz 0.01 Turbo] (Broadwell)

0.00 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

124 EAM on P100s PCIe

0.14 EAM 0.13 0.12 13.0X Running LAMMPS version 2016 0.10 The blue node contains Dual Intel Xeon 0.08 E5-2699 [email protected] [3.6GHz Turbo] 0.08 (Broadwell) CPUs

The green nodes contain Dual Intel 0.06

1/seconds Xeon E5-2699 [email protected] [3.6GHz 0.05 8.0X Turbo] (Broadwell) CPUs + Tesla P100 0.04 PCIe (16GB) GPUs 0.03 5.0X ➢ 1x P100 PCIe is paired with Single 0.02 Intel Xeon E5-2699 [email protected] 0.01 3.0X [3.6GHz Turbo] (Broadwell) 0.00 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

125 EAM on P100s SXM2

0.14 EAM 0.13 0.12 13.0X Running LAMMPS version 2016 0.10 The blue node contains Dual Intel Xeon 0.08 E5-2699 [email protected] [3.6GHz Turbo] 0.08 (Broadwell) CPUs

8.0X The green nodes contain Dual Intel 0.06 Xeon E5-2698 [email protected] [3.6GHz 0.05

1/seconds Turbo] (Broadwell) CPUs + Tesla P100 0.04 SXM2 GPUs 0.03 5.0X ➢ 1x P100 SXM2 is paired with Single 0.02 Intel Xeon E5-2698 [email protected] 0.01 3.0X [3.6GHz Turbo] (Broadwell) 0.00 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

126 Gay-Berne on K80s

0.05 Gay-Berne 0.04 0.04 Running LAMMPS version 2016

0.03 The blue node contains Dual Intel Xeon 4.0X E5-2699 [email protected] [3.6GHz Turbo] 0.03 (Broadwell) CPUs

The green nodes contain Dual Intel 3.0X Xeon E5-2699 [email protected] [3.6GHz 1/seconds 0.02 Turbo] (Broadwell) CPUs + Tesla K80 0.02 0.01 (autoboost) GPUs

2.0X ➢ 1x K80 is paired with Single Intel 0.01 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0.00 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

127 Gay-Berne on P100s PCIe

0.05 Gay-Berne 0.05

0.04 5.0X 0.04 Running LAMMPS version 2016 4.0X The blue node contains Dual Intel Xeon 0.03 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel 0.02 0.02 Xeon E5-2699 [email protected] [3.6GHz 1/seconds Turbo] (Broadwell) CPUs + Tesla P100 0.01 2.0X PCIe (16GB) GPUs 0.01 ➢ 1x P100 PCIe is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0.00 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) 4x P100 PCIe (16GB) per node per node per node

128 Gay-Berne on P100s SXM2

0.05 Gay-Berne 0.05 0.04 5.0X 0.04 Running LAMMPS version 2016 4.0X The blue node contains Dual Intel Xeon 0.03 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel 0.02

1/seconds 0.02 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 0.01 2.0X SXM2 GPUs 0.01 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell)

0.00 1 Broadwell node 1 node + 1 node + 1 node + 1x SXM2 2x SXM2 4x SXM2 per node per node per node

129 Rhodopsin on K80s

0.40 Rhodopsin 0.38 0.35 0.31 1.7X Running LAMMPS version 2016 0.30 1.4X The blue node contains Dual Intel Xeon 0.25 E5-2699 [email protected] [3.6GHz Turbo] 0.22 0.22 (Broadwell) CPUs

0.20 The green nodes contain Dual Intel

1/seconds Xeon E5-2699 [email protected] [3.6GHz 0.15 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

0.10 ➢ 1x K80 is paired with Single Intel Xeon E5-2699 [email protected] [3.6GHz 0.05 Turbo] (Broadwell)

0.00 1 Broadwell node 1 node + 1 node + 1 node + 1x K80 per node 2x K80 per node 4x K80 per node

130 Rhodopsin on P100s PCIe

0.60

Rhodopsin 0.52 0.50 0.48 2.4X Running LAMMPS version 2016 0.40 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 0.33 2.2X (Broadwell) CPUs 0.30 0.29 1.5X The green nodes contain Dual Intel

0.22 Xeon E5-2699 [email protected] [3.6GHz 1/seconds 0.20 1.3X Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs

➢ 1x P100 PCIe is paired with Single 0.10 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)

0.00 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node

131 Rhodopsin on P100s SXM2

0.60

Rhodopsin 0.50 0.50 0.49 Running LAMMPS version 2016 2.2X 2.3X 0.40 0.38 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 0.30 1.7X 0.30 The green nodes contain Dual Intel

1/seconds 0.22 1.4X Xeon E5-2698 [email protected] [3.6GHz 0.20 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

➢ 1x P100 SXM2 is paired with Single 0.10 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell)

0.00 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node

132 Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration

# of CPU sockets 2 Cores per CPU socket 6+ CPU speed (Ghz) 2.66+ System memory per socket (GB) 32

GTX X, GPUs Kepler K20, K40, K80, M40

# of GPUs per CPU socket 1-2 GPU memory preference (GB) 6+ GPU to CPU connection PCIe 3.0 or higher

Server storage 500 GB or higher

Network configuration Gemini, InfiniBand Scale to thousands of nodes with same single node configuration 13 133 3 NAMD 2.12 July 2017 APOA1 on K80s

20 APOA1 17.73 16 14.92

Running NAMD version 2.12

12 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs 8 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla K80 4 3.45 (autoboost) GPUs

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

135 APOA1 on P100s PCIe

28 APOA1 24 22.58 22.85

20 Running NAMD version 2.12

16 The blue node contains Dual Intel Xeon

E5-2690 [email protected] [3.5GHz Turbo] ns/day 12 (Broadwell) CPUs The green nodes contain Dual Intel 8 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 4 3.45

0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe (16GB) per node (16GB) per node

136 APOA1 on P100s SXM2

30 APOA1 25 23.87 22.98 23.44

20 Running NAMD version 2.12

The blue node contains Dual Intel Xeon 15 E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs

10 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 5 3.45

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

137 F1ATPASE on K80s

8 F1ATPASE 6.27 6 Running NAMD version 2.12 4.81 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day 4 (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 2 Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs 1.15

0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

138 F1ATPASE on P100s PCIe

10 F1ATPASE 8 7.34 7.40 6.99 Running NAMD version 2.12 6 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs 4 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 2 PCIe (16GB) GPUs 1.15

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe (16GB) per node (16GB) per node (16GB) per node

139 F1ATPASE on P100s SXM2

10 F1ATPASE 8 7.11 7.11 6.85 Running NAMD version 2.12 6 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs 4 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 2 SXM2 GPUs 1.15

0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node

140 STMV on K80s

3.0 STMV 2.5

2.085 2.0 Running NAMD version 2.12

The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day 1.5 1.274 (Broadwell) CPUs The green nodes contain Dual Intel 1.0 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs 0.5 0.292

0.0 1 Broadwell node 1 node + 1 node + 1x K80 per node 2x K80 per node

141 STMV on P100s PCIe

3.0 STMV 2.5 2.32 2.15

2.0 Running NAMD version 2.12

The blue node contains Dual Intel Xeon 1.5 E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs

1.0 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 0.5 PCIe (16GB) GPUs 0.29

0.0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe (16GB) per node (16GB) per node

142 STMV on P100s SXM2

3.0 STMV 2.5

2.077 2.0 Running NAMD version 2.12

The blue node contains Dual Intel Xeon 1.5 E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs

1.0 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 0.5 SXM2 GPUs 0.292

0.0 1 Broadwell node 1 node + 1x P100 SXM2 per node

143 NAMD 2.11 – Up to 2X Faster New GPU features in NAMD 2.11 Selected Text from the NAMD website

• GPU-accelerated simulations up to twice as fast as NAMD 2.10 • Pressure calculation with fixed atoms on GPU works as on CPU • Improved scaling for GPU-accelerated particle-mesh Ewald calculation

• CPU-side operations overlap better and are parallelized across cores. • Improved scaling for GPU-accelerated simulations

• Nonbonded calculation results are streamed from the GPU for better overlap. • NVIDIA CUDA GPU-acceleration binaries for Mac OS X

145 NAMD 2.11 is up to 2x faster

25 APoA1 (92,224 atoms) 20 2.0X 1.6X

15

10 1.2X Simulated Simulated Time (ns/day) 5

0 1 Node 2 Nodes 4 Nodes

NAMD 2.10 & NAMD 2.11 contain Dual Intel E5-2697 [email protected] (IvyBridge) CPUs + 2 Tesla K80 (autoboost) GPUs 146 NAMD 2.11 APoA1 on 1 and 2 nodes

25 24.31 APoA1 4.7X Running NAMD version 2.11 (92,224 atoms) 19.73 20 The blue nodes contain Dual Intel E5- 2698 [email protected] (Haswell) CPUs 16.99 3.8X The green nodes contain Dual Intel E5- 15 2698 [email protected] (Haswell) CPUs + Tesla K80 (autoboost) GPUs 11.67 6.1X

10

4.2X Simulated Simulated Time (ns/day) 5.22 5 2.77

0 1 Node 1 Node + 1 Node + 2 Nodes 2 Nodes + 2 Nodes + 1x K80 2x K80 1x K80 2x K80

147 NAMD 2.11 APoA1 on 4 and 8 nodes

30 27.83 27.74 APoA1 Running NAMD version 2.11 (92,224 atoms) 25 23.52 The blue nodes contain Dual Intel E5- 1.7X 1.6X 2698 [email protected] (Haswell) CPUs 20.64 2.3X 20 The green nodes contain Dual Intel E5- 16.85 2698 [email protected] (Haswell) CPUs + Tesla 2.0X K80 (autoboost) GPUs 15

10.27

Simulated Simulated Time (ns/day) 10

5

0 4 Nodes 4 Nodes + 4 Nodes + 8 Nodes 8 Nodes + 8 Nodes + 1x K80 2x K80 1x K80 2x K80 148 NAMD 2.11 is up to 1.8x faster

10 F1-ATPase (327,506 atoms) 8 1.8X 1.4X

6

4 1.1X Simulated Simulated Time (ns/day) 2

0 1 Node 2 Nodes 4 Nodes

NAMD 2.10 & NAMD 2.11 contain Dual Intel E5-2697 [email protected] (IvyBridge) CPUs + 2 Tesla K80 (autoboost) GPUs 149 NAMD 2.11 F1-ATPase on 1 and 2 nodes

15 Running NAMD version 2.11

F1-ATPase The blue nodes contain Dual Intel E5- (327,506 atoms) 2698 [email protected] (Haswell) CPUs 10.58 10 The green nodes contain Dual Intel E5- 2698 [email protected] (Haswell) CPUs + Tesla 5.7X K80 (autoboost) GPUs 7.23 6.11 3.9X 5 3.87

Simulated Simulated Time (ns/day) 6.5X

4.1X 1.86 0.94

0 1 Node 1 Node + 1 Node + 2 Nodes 2 Nodes + 2 Nodes + 1x K80 2x K80 1x K80 2x K80

150 NAMD 2.11 F1-ATPase on 4 and 8 nodes

20 F1-ATPase Running NAMD version 2.11 15.74 The blue nodes contain Dual Intel E5- (327,506 atoms) 2698 [email protected] (Haswell) CPUs 15 14.22

12.62 2.3X The green nodes contain Dual Intel E5- 11.66 2.1X 2698 [email protected] (Haswell) CPUs + Tesla 3.5X K80 (autoboost) GPUs 10 3.2X

6.88 Simulated Simulated Time (ns/day)

5 3.63

0 4 Nodes 4 Nodes + 4 Nodes + 8 Nodes 8 Nodes + 8 Nodes + 1x K80 2x K80 1x K80 2x K80

151 NAMD 2.11 is up to 1.5x faster

4.0 STMV (1,066,628 atoms) 3.5 1.5X 3.0

2.5

2.0 1.5X

1.5

Simulated Simulated Time (ns/day) 1.0 1.1X

0.5

0.0 1 Node 2 Nodes 4 Nodes

NAMD 2.10 & NAMD 2.11 contain Dual Intel E5-2697 [email protected] (IvyBridge) CPUs + 2 Tesla K80 (autoboost) GPUs 152 NAMD 2.11 STMV on 1 and 2 nodes

4 STMV Running NAMD version 2.11 3.27 (1,066,628 atoms) The blue nodes contain Dual Intel E5- 3 2698 [email protected] (Haswell) CPUs 7.1X The green nodes contain Dual Intel E5- 2698 [email protected] CPUs (Haswell) + Tesla K80 (autoboost) GPUs 1.98 2 1.75 4.3X

7.6X Simulated Simulated Time (ns/day) 1.03 1

4.5X 0.46 0.23

0 1 Node 1 Node + 1 Node + 2 Nodes 2 Nodes + 2 Nodes + 1x K80 2x K80 1x K80 2x K80

153 NAMD 2.11 STMV on 4 and 8 nodes

8 STMV Running NAMD version 2.11 (1,066,628 atoms) 6.24 The blue nodes contain Dual Intel E5- 6 5.86 2698 [email protected] (Haswell) CPUs 3.6X The green nodes contain Dual Intel E5- 4.54 2698 [email protected] CPUs (Haswell) + Tesla 3.4X K80 (autoboost) GPUs 4 3.61 5.0X

4.0X Simulated Simulated Time (ns/day) 2 1.74

0.90

0 4 Nodes 4 Nodes + 4 Nodes + 8 Nodes 8 Nodes + 8 Nodes + 1x K80 2x K80 1x K80 2x K80

154 Benefits of MD GPU-Accelerated Computing Why wouldn’t you want to turbocharge your research?

• 3x-8x Faster than CPU only systems in all tests (on average) • Most major compute intensive aspects of classical MD ported • Large performance boost with marginal price increase • Energy usage cut by more than half • GPUs scale well within a node and/or over multiple nodes • K80 GPU is our fastest and lowest power high performance GPU yet

Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive 155 Molecular Dynamics (MD) on GPUs Dec. 19, 2016 GPU-Accelerated Apps Green Lettering Indicates Performance Slides Included

Abinit Gaussian Octopus Quantum SuperCharger ACES III GPAW ONETEP Library ADF LATTE Petot RMG BigDFT LSDalton Q-Chem TeraChem CP2K MOLCAS QMCPACK UNM GAMESS-US Mopac2012 Quantum VASP Espresso NWChem WL-LSMS

GPU Perf compared against dual multi-core x86 CPU socket. 157