Molecular Dynamics (MD) on GPUs
October 2017 and RELION too Accelerating Discoveries
Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid — “the perfect target for fighting the infection.”
Without gpu, the supercomputer would need to be 5x larger for similar performance.
2 Overview of Life & Material Accelerated Apps
MD: All key codes are GPU-accelerated QC: All key codes are ported or optimizing Great multi-GPU performance Focus on using GPU-accelerated math libraries, OpenACC directives Focus on dense (up to 16) GPU nodes &/or large # of GPU nodes GPU-accelerated and available today: ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResso, ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS- Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, OpenMM, PolyFTS, SOP-GPU* & more Quantum Espresso/PWscf, QUICK, TeraChem* Active GPU acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more
green* = application where >90% of the workload is on GPU 3 MD vs. QC on GPUs
“Classical” Molecular Dynamics Quantum Chemistry (MO, PW, DFT, Semi-Emp) Simulates positions of atoms over time; Calculates electronic properties; chemical-biological or ground state, excited states, spectral properties, chemical-material behaviors making/breaking bonds, physical properties
Forces calculated from simple empirical formulas Forces derived from electron wave function (bond rearrangement generally forbidden) (bond rearrangement OK, e.g., bond energies) Up to millions of atoms Up to a few thousand atoms Solvent included without difficulty Generally in a vacuum but if needed, solvent treated classically (QM/MM) or using implicit methods
Single precision dominated Double precision is important Uses cuFFT, CUDA Uses cuBLAS, cuFFT, OpenACC, Eigen / Tensor Solvers GeForce (Workstations), Tesla (Servers) Tesla recommended ECC off ECC on
4 GPU-Accelerated Molecular Dynamics Apps Green Lettering Indicates Performance Slides Included
GENESIS LAMMPS ACEMD GPUGrid.net mdcore AMBER GROMACS MELD CHARMM HALMD NAMD DESMOND HOOMD-Blue OpenMM ESPResSO HTMD PolyFTS Folding@Home
GPU Perf compared against dual multi-core x86 CPU socket. 5 Benefits of MD GPU-Accelerated Computing Why wouldn’t you want to turbocharge your research?
• 3x-8x Faster than CPU only systems in all tests (on average) • Most major compute intensive aspects of classical MD ported • Large performance boost and save “Big Money” on CPUs, networks • Energy usage cut by more than half • GPUs scale well within a node and/or over multiple nodes • P100 GPU is our fastest and lowest power high performance GPU yet
Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive 6 RELION 2.0.3 August 2017 RELION: Plasmodium ribosome on P100s PCIe
0.0140
0.0120 0.0120 0.0112 Running RELION version 2.0.3 0.0101 40.0X 0.0100 The blue node contains Dual Intel Xeon 37.3X E5-2690 [email protected] (Broadwell) CPUs 0.0080 0.0070 33.7X The green nodes contain Dual Intel 0.0060 Xeon E5-2690 [email protected] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 1/Minutes 0.0046 23.3X 0.0040 0.0027 15.3X Data Citation: 0.0020 http://en.community.dell.com/techcenter/hi 0.0003 gh-performance- 9.0X computing/b/general_hpc/archive/2017/03/1 0.0000 4/application-performance-on-p100-pcie-gpus 1 Broadwell 1 node + 1 node + 1 node + 2 nodes + 3 nodes + 4 nodes + node 1x P100 2x P100 4x P100 8x P100 12x P100 16x P100 PCIe (16GB)PCIe (16GB)PCIe (16GB) PCIe PCIe PCIe per node per node per node (16GB) (16GB) (16GB)
8 ACEMD : Extremely efficient and robust MD software built on GPUs
610 ns/day on 1 GPU for DHFR (23K atoms)
M. Harvey, G. Giupponi and G. de Fabritiis, ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale, J. Chem. Theory and Comput. 5, 1632 (2009) • Standardised and easy to use: ACEMD reads CHARMM/NAMD and AMBER input files and uses similar syntax to other MD software.
• Fully featured: NVT, NPT, PME, TCL, PLUMED.1
• Robust: ACEMD is a proven computational engine and is used in one of the largest distributed projects Worldwide: GPUGRID.
• Compatible: ACEMD works with CUDA and OpenCL, the new standard framework for parallel and high-performance computing.
• Validated: ACEMD is used in reputable academic and industrial institutions. Results describing its applications have appeared in peer-reviewed journals of high impact such as Nature Chemistry, PNAS, Scientific Reports, PLoS and JACS.2
1. M. J. Harvey, and G. de Fabritiis, An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware, J. Chem. Theory Comput., 5, 2371-2377 (2009) 2. For a list of selected references see http://www.acellera.com/science AMBER 16 on V100s October 2017 PME-Cellulose_NPT on V100s PCIe
50 47.67
40
(Untuned on Volta) 24.6X Running AMBER version 16.8 30 The blue node contains Dual Intel Xeon
ns/day E5-2690 [email protected] [3.5GHz Turbo] 20 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 10 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 1.94
0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
13 PME-Cellulose_NPT on V100s SXM2
60 54.74 55.52
50 28.2X 28.6X 40 (Untuned on Volta) Running AMBER version 16.8
The blue node contains Dual Intel Xeon 30 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs
20 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 10 Tesla V100 SXM2 (16GB) GPUs
1.94 0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)
14 PME-Cellulose_NVE on V100s PCIe
60 54.08
50 27.6X 40 (Untuned on Volta) Running AMBER version 16.8
The blue node contains Dual Intel Xeon 30 ns/day E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs
20 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 10 Tesla V100 PCIe (16GB) GPUs
1.96 0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
15 PME-Cellulose_NVE on V100s SXM2
70 65.02 63.04
60 33.2X 50 (Untuned on Volta) Running AMBER version 16.8 40 32.2X The blue node contains Dual Intel Xeon
ns/day E5-2698 [email protected] [3.6GHz Turbo] 30 (Broadwell) CPUs
The green nodes contain Dual Intel 20 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 10 Tesla V100 SXM2 (16GB) GPUs
1.96 0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)
16 PME-FactorIX_NPT on V100s PCIe
250
200 193.16
(Untuned on Volta) 20.7X Running AMBER version 16.8 150 The blue node contains Dual Intel Xeon
ns/day E5-2690 [email protected] [3.5GHz Turbo] 100 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 50 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 9.33
0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
17 PME-FactorIX_NPT on V100s SXM2
250 224.23 217.95
200
24.0X (Untuned on Volta) Running AMBER version 16.8 150 23.4X The blue node contains Dual Intel Xeon
ns/day E5-2698 [email protected] [3.6GHz Turbo] 100 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 50 Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs 9.33
0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)
18 PME-FactorIX_NVE on V100s PCIe
250
217.95
200
(Untuned on Volta) 22.7X Running AMBER version 16.8 150 The blue node contains Dual Intel Xeon
ns/day E5-2690 [email protected] [3.5GHz Turbo] 100 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 50 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 9.61
0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
19 PME-FactorIX_NVE on V100s SXM2
300
261.19 249.63 250 27.2X 200 (Untuned on Volta) Running AMBER version 16.8
26.0X The blue node contains Dual Intel Xeon 150 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs
100 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 50 Tesla V100 SXM2 (16GB) GPUs
9.61 0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)
20 PME-JAC_NPT on V100s PCIe
500
439.87
400
(Untuned on Volta) 12.8X Running AMBER version 16.8 300 The blue node contains Dual Intel Xeon
ns/day E5-2690 [email protected] [3.5GHz Turbo] 200 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 100 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 34.35
0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
21 PME-JAC_NPT on V100s SXM2
600
515.36 500 481.75 15.0X 400 (Untuned on Volta) Running AMBER version 16.8
14.0X The blue node contains Dual Intel Xeon 300 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs
200 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 100 Tesla V100 SXM2 (16GB) GPUs 34.35
0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)
22 PME-JAC_NVE on V100s PCIe
600
490.77 500 13.4X 400 (Untuned on Volta) Running AMBER version 16.8
The blue node contains Dual Intel Xeon 300 ns/day E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs
200 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 100 Tesla V100 PCIe (16GB) GPUs 36.53
0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
23 PME-JAC_NVE on V100s SXM2
700
600 583.33 539.78
500 (Untuned on Volta) 16.0X Running AMBER version 16.8 400 The blue node contains Dual Intel Xeon ns/day 14.8X E5-2698 [email protected] [3.6GHz Turbo] 300 (Broadwell) CPUs
The green nodes contain Dual Intel 200 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 100 Tesla V100 SXM2 (16GB) GPUs 36.53
0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)
24 PME-JAC_NPT_4fs on V100s PCIe
900 863.80
750 13.1X 600 (Untuned on Volta) Running AMBER version 16.8
The blue node contains Dual Intel Xeon 450 ns/day E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs
300 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 150 Tesla V100 PCIe (16GB) GPUs 65.74
0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
25 PME-JAC_NPT_4fs on V100s SXM2
1200
1006.32 1000 946.57 15.3X 800 (Untuned on Volta) Running AMBER version 16.8
14.4X The blue node contains Dual Intel Xeon 600 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs
400 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 200 Tesla V100 SXM2 (16GB) GPUs 65.74
0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)
26 PME-JAC_NVE_4fs on V100s PCIe
1050 940.32
900
750 (Untuned on Volta) 26.0X Running AMBER version 16.8 600 The blue node contains Dual Intel Xeon
ns/day E5-2690 [email protected] [3.5GHz Turbo] 450 (Broadwell) CPUs
The green nodes contain Dual Intel 300 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 150 Tesla V100 PCIe (16GB) GPUs 67.10
0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
27 PME-JAC_NVE_4fs on V100s SXM2
1200 1123.40
1027.44 1000 16.7X 800 (Untuned on Volta) Running AMBER version 16.8
15.3X The blue node contains Dual Intel Xeon 600 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs
400 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 200 Tesla V100 SXM2 (16GB) GPUs 67.10
0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)
28 PME-STMV_NPT_4fs on V100s PCIe
35 33.21
30 31.3X 25 (Untuned on Volta) Running AMBER version 16.8 20 The blue node contains Dual Intel Xeon
ns/day E5-2690 [email protected] [3.5GHz Turbo] 15 (Broadwell) CPUs
The green nodes contain Dual Intel 10 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 5 Tesla V100 PCIe (16GB) GPUs
1.06 0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
29 PME-STMV_NPT_4fs on V100s SXM2
40 37.24
35
30 35.1X (Untuned on Volta) 25 Running AMBER version 16.8
The blue node contains Dual Intel Xeon 20 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 15 The green nodes contain Dual Intel 10 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 5 Tesla V100 SXM2 (16GB) GPUs 1.06 0 1 Broadwell node 1 node + 2x V100 SXM2 per node (16GB)
30 GB-Myoglobin on V100s PCIe
750 699.21
600
(Untuned on Volta) 31.4X Running AMBER version 16.8 450 The blue node contains Dual Intel Xeon
ns/day E5-2690 [email protected] [3.5GHz Turbo] 300 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 150 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs
22.30 0 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
31 GB-Myoglobin on V100s SXM2
800 750.76
700 33.7X 600 (Untuned on Volta) 500 Running AMBER version 16.8
The blue node contains Dual Intel Xeon 400 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 300 The green nodes contain Dual Intel 200 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 100 Tesla V100 SXM2 (16GB) GPUs 22.30 0 1 Broadwell node 1 node + 2x V100 SXM2 per node (16GB)
32 GB-Nucleosome on V100s PCIe
85 78.39
68
(Untuned on Volta) 252.9X Running AMBER version 16.8 51 49.14 The blue node contains Dual Intel Xeon ns/day 158.5X E5-2690 [email protected] [3.5GHz Turbo] 34 (Broadwell) CPUs The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 17 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs
0.31 0 1 Broadwell node 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe per node (16GB) per node (16GB)
33 GB-Nucleosome on V100s SXM2
100 92.46
75 (Untuned on Volta) 298.3X Running AMBER version 16.8 52.89 The blue node contains Dual Intel Xeon 50 ns/day E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs
170.6X The green nodes contain Dual Intel 25 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs
0.31 0 1 Broadwell node 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 per node (16GB) per node (16GB)
34 Rubisco on V100s PCIe
8
7 6.78
6 5.22 678.0X (Untuned on Volta) 5 Running AMBER version 16.8
The blue node contains Dual Intel Xeon 4 ns/day E5-2690 [email protected] [3.5GHz Turbo] 522.0X (Broadwell) CPUs 3 2.79 The green nodes contain Dual Intel 2 Xeon E5-2690 [email protected] [3.5GHz 279.0X Turbo] (Broadwell) CPUs + 1 Tesla V100 PCIe (16GB) GPUs
0.01 0 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)
35 Rubisco on V100s SXM2
8
7.00 7
5.96 6 700.0X (Untuned on Volta) 5 Running AMBER version 16.8
The blue node contains Dual Intel Xeon 4 ns/day E5-2698 [email protected] [3.6GHz Turbo] 3.00 596.0X (Broadwell) CPUs 3 The green nodes contain Dual Intel 2 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 1 300.0X Tesla V100 SXM2 (16GB) GPUs
0.01 0 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)
36 AMBER 16 on P100s February 2017 PME-Cellulose_NPT on P100s PCIe
40 PME-Cellulose_NPT 35
30.00 Running AMBER version 16.3 30 The blue node contains Dual Intel Xeon 25 E5-2699 [email protected] [3.6GHz Turbo] 21.85 (Broadwell) CPUs
ns/day 12.8X 20 The green nodes contain Dual Intel 9.3X Xeon E5-2699 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 10 ➢ 1x P100 PCIe is paired with Single 5 Intel Xeon E5-2699 [email protected] 2.35 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node
38 PME-Cellulose_NPT on P100s SXM2
40 PME-Cellulose_NPT 36.65 35 32.22 Running AMBER version 16.3 30 15.6X 13.7X The blue node contains Dual Intel Xeon 25 23.37 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 20
ns/day The green nodes contain Dual Intel 9.9X Xeon E5-2698 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 10 ➢ 1x P100 SXM2 is paired with Single 5 Intel Xeon E5-2698 [email protected] 2.35 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node
39 PME-Cellulose_NVE on P100s PCIe
40 PME-Cellulose_NVE 35 32.55 Running AMBER version 16.3 30 13.2X The blue node contains Dual Intel Xeon 25 23.34 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs
ns/day 20 The green nodes contain Dual Intel 9.4X Xeon E5-2699 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 10 ➢ 1x P100 PCIe is paired with Single 5 Intel Xeon E5-2699 [email protected] 2.47 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node
40 PME-Cellulose_NVE on P100s SXM2
45 PME-Cellulose_NVE 40.88 40
35.16 35 Running AMBER version 16.3
30 16.6X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 24.94 25 14.2X (Broadwell) CPUs
ns/day 20 The green nodes contain Dual Intel 10.1X Xeon E5-2698 [email protected] [3.6GHz 15 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs
10 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 5 2.47 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node
41 PME-FactorIX_NPT on P100s PCIe
140 PME-FactorIX_NPT 132.86 120 Running AMBER version 16.3 98.77 11.6X 100 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 80 (Broadwell) CPUs
ns/day 8.6X The green nodes contain Dual Intel 60 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 40 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single 20 Intel Xeon E5-2699 [email protected] 11.43 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node
42 PME-FactorIX_NPT on P100s SXM2
180 PME-FactorIX_NPT 159.80 160 144.11 140 14.0X Running AMBER version 16.3
120 12.6X The blue node contains Dual Intel Xeon 106.25 E5-2699 [email protected] [3.6GHz Turbo] 100 (Broadwell) CPUs 9.3X ns/day 80 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz
60 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs
40 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 20 11.43 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node
43 PME-FactorIX_NVE on P100s PCIe
160 PME-FactorIX_NVE 145.83 140
12.2X Running AMBER version 16.3 120 105.86 The blue node contains Dual Intel Xeon 100 E5-2699 [email protected] [3.6GHz Turbo] 8.8X (Broadwell) CPUs
ns/day 80 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 60 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 40 ➢ 1x P100 PCIe is paired with Single 20 Intel Xeon E5-2699 [email protected] 11.98 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node
44 PME-FactorIX_NVE on P100s SXM2
200
178.02 180 PME-FactorIX_NVE 159.24 160 14.9X Running AMBER version 16.3 140 The blue node contains Dual Intel Xeon 120 114.88 13.3X E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 100
ns/day The green nodes contain Dual Intel 80 9.6X Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 60 SXM2 GPUs
40 ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 20 11.98 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node
45 PME-JAC_NPT on P100s PCIe
350 PME-JAC_NPT 327.69 300 283.60 7.1X Running AMBER version 16.3 250 The blue node contains Dual Intel Xeon 6.2X E5-2699 [email protected] [3.6GHz Turbo] 200 (Broadwell) CPUs
ns/day The green nodes contain Dual Intel 150 Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 100 PCIe (16GB) GPUs ➢ 1x P100 PCIe is paired with Single 45.89 50 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node
46 PME-JAC_NPT on P100s SXM2
450 PME-JAC_NPT 423.09 400 360.64 350 9.2X Running AMBER version 16.3 310.52 300 7.9X The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 250 6.8X (Broadwell) CPUs
ns/day 200 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz
150 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs
100 ➢ 1x P100 SXM2 is paired with Single 45.89 Intel Xeon E5-2698 [email protected] 50 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe per node per node per node
47 PME-JAC_NVE on P100s PCIe
400 PME-JAC_NVE 363.79 350 308.46 7.6X Running AMBER version 16.3 300 The blue node contains Dual Intel Xeon 250 E5-2699 [email protected] [3.6GHz Turbo] 6.4X (Broadwell) CPUs 200
ns/day The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz 150 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 100 ➢ 1x P100 PCIe is paired with Single 47.90 50 Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 2x P100 PCIe (16GB) per node per node
48 PME-JAC_NVE on P100s SXM2
500 473.10
450 PME-JAC_NVE 402.18 9.9X 400 Running AMBER version 16.3
350 339.81 8.4X The blue node contains Dual Intel Xeon 300 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 250 7.1X
ns/day The green nodes contain Dual Intel 200 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 150 SXM2 GPUs
100 ➢ 1x P100 SXM2 is paired with Single 47.90 Intel Xeon E5-2698 [email protected] 50 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe per node per node per node
49 GB-Myoglobin on P100s PCIe
600 GB-Myoglobin 561.94 500 483.37 19.5X Running AMBER version 16.3 400 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 16.7X (Broadwell) CPUs
ns/day 300 The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz
200 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs
➢ 1x P100 PCIe is paired with Single 100 Intel Xeon E5-2699 [email protected] 28.86 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe (16GB) 4x P100 PCIe (16GB) per node per node
50 GB-Myoglobin on P100s SXM2
700 GB-Myoglobin 639.37 600 534.28 22.2X Running AMBER version 16.3 500 The blue node contains Dual Intel Xeon 18.5X E5-2699 [email protected] [3.6GHz Turbo] 400 (Broadwell) CPUs
ns/day The green nodes contain Dual Intel 300 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 200 SXM2 GPUs ➢ 1x P100 SXM2 is paired with Single 100 Intel Xeon E5-2698 [email protected] 28.86 [3.6GHz Turbo] (Broadwell)
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 4x P100 PCIe per node per node
51 GB-Nucleosome on P100s PCIe
50 45.92 45 GB-Nucleosome 39.91 114.8X 40 Running AMBER version 16.3
35 99.8X The blue node contains Dual Intel Xeon 30 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs
ns/day 25 22.77 The green nodes contain Dual Intel 20 Xeon E5-2699 [email protected] [3.6GHz 56.9X Turbo] (Broadwell) CPUs + Tesla P100 15 PCIe (16GB) GPUs 11.91 10 ➢ 1x P100 PCIe is paired with Single 29.8X Intel Xeon E5-2699 [email protected] 5 [3.6GHz Turbo] (Broadwell) 0.40 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node
52 GB-Nucleosome on P100s SXM2
60 GB-Nucleosome 50 48.29 46.29 Running AMBER version 16.3 120.7X 40 The blue node contains Dual Intel Xeon E5-2699 [email protected] [3.6GHz Turbo] 115.7X (Broadwell) CPUs 30
ns/day 25.53 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz
20 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 13.36 63.8X ➢ 1x P100 SXM2 is paired with Single 10 Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) 0.40 33.4X 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node
53 Rubisco-75K on P100s PCIe
4.5 Rubisco-75K 4.20 4.0 420.0X 3.5 Running AMBER version 16.3
3.0 The blue node contains Dual Intel Xeon 2.69 E5-2699 [email protected] [3.6GHz Turbo] 2.5 (Broadwell) CPUs
ns/day 2.0 269.0X The green nodes contain Dual Intel Xeon E5-2699 [email protected] [3.6GHz
1.5 1.40 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs
1.0 0.71 ➢ 1x P100 PCIe is paired with Single 140.0X Intel Xeon E5-2699 [email protected] 0.5 [3.6GHz Turbo] (Broadwell) 0.01 71.0X 0.0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe 8x P100 PCIe (16GB) per node (16GB) per node (16GB) per node (16GB) per node
54 Rubisco-75K on P100s SXM2
5.0
4.46 4.5 Rubisco-75K
4.0 Running AMBER version 16.3
3.5 446.0X The blue node contains Dual Intel Xeon 3.06 3.0 E5-2699 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 2.5 306.0X
ns/day The green nodes contain Dual Intel 2.0 Xeon E5-2698 [email protected] [3.6GHz 1.57 Turbo] (Broadwell) CPUs + Tesla P100 1.5 SXM2 GPUs
1.0 0.80 157.0X ➢ 1x P100 SXM2 is paired with Single Intel Xeon E5-2698 [email protected] 0.5 [3.6GHz Turbo] (Broadwell) 0.01 80.0X 0.0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 8x P100 SXM2 per node per node per node per node
55 Recommended GPU Node Configuration for AMBER Computational Chemistry Workstation or Single Node Configuration # of CPU sockets 2
Cores per CPU socket 6+ (1 CPU core drives 1 GPU)
CPU speed (Ghz) 2.66+ System memory per node (GB) 16
GPUs P100, V100
1-4 # of GPUs per CPU socket
GPU memory preference (GB) 6 GPU to CPU connection PCIe 3.0 16x or higher Server storage 2 TB
Network configuration Infiniband QDR or better 56 Scale to multiple nodes with same single node configuration 56 CHARMM DOMDEC-GUI July 2016 CHARMM DOMDEC-GUI 465 K System Benchmark
4 465 K System
(Her1_HER1_membrane) Running CHARMM version c40a1 3 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs 2.15 The green nodes contain Dual Intel 2 ns/day Xeon E5-2698 [email protected] GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs 6.0X
1 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 0.36
0 1 Haswell node 1 node + 1x K80 per node
58 CHARMM DOMDEC-GUI 534 K System Benchmark
2.0 534 K System
(POPC_PSPC_CHL1mixture) Running CHARMM version c40a1 1.5 1.43 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs
The green nodes contain Dual Intel
1.0 Xeon E5-2698 [email protected] GHz (Haswell) ns/day 8.0X CPUs + Tesla K80 (autoboost) GPUs
0.5 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 0.18
0.0 1 Haswell node 1 node + 1x K80 per node
59 CHARMM DOMDEC-GUI 20 K System Benchmark
80 20 K System (Crambin)
59.68 Running CHARMM version c40a1 60 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs
The green nodes contain Dual Intel ns/day 40 Xeon E5-2698 [email protected] GHz (Haswell) CPUs + Tesla M40 GPUs 3.7X
20 16.00 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error.
0 1 Haswell node 1 node + 1x M40 per node
60 CHARMM DOMDEC-GUI 61 K System Benchmark
35 61 K System (GlnBP) 30 *Higher is better Running CHARMM version c40a1 25.08 25 The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs 20 The green nodes contain Dual Intel
6.4X Xeon E5-2698 [email protected] GHz (Haswell) ns/day 15 CPUs + Tesla M40 GPUs
10 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 5 3.90
0 1 Haswell node 1 node + 1x M40 per node
61 CHARMM DOMDEC-GUI 465 K System Benchmark
4 465 K System
(Her1_HER1_membrane) Running CHARMM version c40a1 3 *Higher is better The blue node contains Dual Intel Xeon E5-2698 [email protected] GHz (Haswell) CPUs 2.27 The green nodes contain Dual Intel 2 Xeon E5-2698 [email protected] GHz (Haswell) ns/day CPUs + Tesla M40 GPUs 6.3X
1 Benchmarks were done based on the STANDARD CHARMM c40a1 version by the Yang group (FSU), who is responsible for possible benchmarking error. 0.36
0 1 Haswell node 1 node + 1x M40 per node
62 GROMACS 2018 February 2018 GROMACS Water 1.5M on V100 vs P100 (PCIe 16GB)
14
11.86 12
10.37 10.41 4.9X 9.83 10 (Untuned on Volta) 4.3X 4.3X Running GROMACS version 2018 8.41 4.0X 8 3.5X The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day 6.06 6 (Broadwell) CPUs 2.5X The green nodes contain Dual Intel 4 Xeon E5-2690 [email protected] [3.5GHz 2.43 Turbo] (Broadwell) CPUs + Tesla P100 2 PCIe (16GB) or V100 PCIe (16GB) GPUs
0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 1x P100 2x P100 4x P100 1x V100 2x V100 4x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node
64 GROMACS Water 1.5M on V100 vs P100 (SXM2 16GB)
14 12.67 12 5.2X 11.02 10.92 10.19 10 4.5X 4.5X 9.16 (Untuned on Volta) 4.2X Running GROMACS version 2018
8 3.8X 6.88 The blue node contains Dual Intel Xeon
E5-2690 [email protected] [3.5GHz Turbo] ns/day 6 2.8X (Broadwell) CPUs The green nodes contain Dual Intel 4 Xeon E5-2698 [email protected] [3.6GHz 2.43 Turbo] (Broadwell) CPUs + Tesla P100 2 SXM2 (16GB) or V100 SXM2 (16GB) GPUs
0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 1x P100 2x P100 4x P100 1x V100 2x V100 4x V100 SXM2 per SXM2 per SXM2 per SXM2 per SXM2 per SXM2 per node node node node node node
65 GROMACS ADH Dodec on V100 vs P100 (PCIe 16GB)
180
160 154.62
137.66 4.9X 140 130.99 133.66 4.3X 4.2X (Untuned on Volta) 120 4.1X Running GROMACS version 2018 101.49 102.57 100 The blue node contains Dual Intel
3.2X 3.2X Xeon E5-2690 [email protected] [3.5GHz ns/day 80 Turbo] (Broadwell) CPUs
60 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 40 31.79 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PCIe 20 (16GB) GPUs
0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 2x P100 4x P100 8x P100 2x V100 4x V100 8x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node
66 GROMACS ADH Dodec on V100 vs P100 (SXM2 16GB)
180 167.33 156.86 160 156.23 147.93 5.3X 4.9X 4.9X 140 4.7X 131.92 123.17 (Untuned on Volta) 120 4.1X 3.9X Running GROMACS version 2018 100 The blue node contains Dual Intel
Xeon E5-2690 [email protected] [3.5GHz ns/day 80 Turbo] (Broadwell) CPUs
60 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 40 31.79 Turbo] (Broadwell) CPUs + Tesla P100 SXM2 (16GB) or V100 SXM2 20 (16GB) GPUs
0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 2x P100 4x P100 8x P100 2x V100 4x V100 8x V100 SXM2 per SXM2 per SXM2 per SXM2 per SXM2 per SXM2 per node node node node node node
67 GROMACS 2016.4 October 2017 Water 1.5M on V100 vs P100 (PCIe)
8 7.30 6.97 7
6 3.2X 5.34 (Untuned on Volta) 3.1X Running GROMACS version 2016.4 5 4.34 The blue node contains Dual Intel Xeon 4 2.3X E5-2690 [email protected] [3.5GHz Turbo] ns/day 1.9X (Broadwell) CPUs 3 The green nodes contain Dual Intel 2.28 Xeon E5-2690 [email protected] [3.5GHz 2 Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) or V100 PCIe 1 (16GB) GPUs
0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 1x V100 PCIe 2x V100 PCIe per node (16GB) per node (16GB) per node (16GB) per node (16GB)
69 Water 3M on V100 vs P100 (PCIe)
5
4 3.85 3.53 (Untuned on Volta) 3.4X Running GROMACS version 2016.4 3 The blue node contains Dual Intel Xeon 2.53 3.2X E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs 1.96 2 The green nodes contain Dual Intel 2.3X Xeon E5-2690 [email protected] [3.5GHz 1.12 Turbo] (Broadwell) CPUs + Tesla P100 1 PCIe (16GB) or V100 PCIe 1.8X (16GB) GPUs 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 1x V100 PCIe 2x V100 PCIe per node (16GB) per node (16GB) per node (16GB) per node (16GB)
70 Recommended GPU Node Configuration for GROMACS Computational Chemistry
Workstation or Single Node Configuration # of CPU sockets 2 Cores per CPU socket 6+ CPU speed (Ghz) 2.66+
System memory per socket (GB) 32
GPUs Tesla P100, V100
2x # of GPUs per CPU socket Volta GPUs: need fast Skylake or Broadwell
GPU memory preference (GB) 6
GPU to CPU connection PCIe 3.0 or higher Server storage 500 GB or higher
71 Network configuration Gemini, InfiniBand 71 HOOMD-Blue 2.2.2 March 2018 HOOMD-Blue lj-liquid on V100 vs P100 (PCIe 16GB)
4500 4261.93 Running HOOMD-Blue version 2.2.2
4000 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 3500 19.8X (Broadwell) CPUs
3000 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 2379.87 (Broadwell) CPUs + Tesla P100 PCIe 2500 (16GB) or V100 PCIe (16GB) GPUs 2000 Average TPS 11.1X 64,000 particles 1500 Force field/method: Lennard-Jones MD This is a synthetic benchmark for historical reasons and for making direct comparisons 1000 with LAMMPS
500 214.95
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 1x V100 PCIe per node per node
73 HOOMD-Blue microsphere on V100 vs P100 (PCIe 16GB)
500 Running HOOMD-Blue version 2.2.2 455.52 450 The blue node contains Dual Intel Xeon 55.6X E5-2690 [email protected] [3.5GHz Turbo] 400 373.55 (Broadwell) CPUs 348.39 350 45.6X The green nodes contain Dual Intel Xeon 293.90 300 42.5X E5-2690 [email protected] [3.5GHz Turbo] 257.89 (Broadwell) CPUs + Tesla P100 PCIe 250 35.8X (16GB) or V100 PCIe 31.5X 220.63 (16GB) GPUs 200 Average TPS 173.24 26.9X 1,428,364 particles Force field/method: Bead spring polymbers 150 21.1X 112.48 via dissipative particle dynamics Results published in: 100 http://dx.doi.org/10.1002/adma.201501329 13.7X 50 8.20 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1x P100 2x P100 4x P100 8x P100 1x V100 2x V100 4x V100 8x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node node node
74 HOOMD-Blue quasicrystal on V100 vs P100 (PCIe 16GB)
3000 Running HOOMD-Blue version 2.2.2
The blue node contains Dual Intel Xeon 2473.63 2500 E5-2690 [email protected] [3.5GHz Turbo] 2285.10 50.5X (Broadwell) CPUs 46.6X The green nodes contain Dual Intel Xeon 2000 1871.90 1768.09 E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe 38.2X 36.1X (16GB) or V100 PCIe 1500 (16GB) GPUs 1165.62
100,000 particles Average TPS 1000 861.78 23.8X Force field/method: Oscilating pair potential MD Results published in: 17.6X http://dx.doi.org/10.1038/NMAT4152 500
49.03 0 1 Broadwell 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + node 1x P100 2x P100 4x P100 1x V100 2x V100 4x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node
75 HOOMD-Blue triblock-copolymer on V100 vs P100 (PCIe 16GB)
3500 Running HOOMD-Blue version 2.2.2 3158.74
3000 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 15.9X (Broadwell) CPUs 2500 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 2000 (Broadwell) CPUs + Tesla P100 PCIe 1711.50 (16GB) or V100 PCIe (16GB) GPUs 1500 Average TPS 64,017 particles Force field/method: Bead-spring polymer MD 8.6X Results published in: 1000 http://dx.doi.org/10.1021/ma061120f
500 198.50
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 1x V100 PCIe per node per node
76 HOOMD-Blue dodecahedron on V100 vs P100 (PCIe 16GB)
300 Running HOOMD-Blue version 2.2.2
The blue node contains Dual Intel Xeon 250 234.20 239.26 E5-2690 [email protected] [3.5GHz Turbo] 10.6X (Broadwell) CPUs 200 186.73 10.4X The green nodes contain Dual Intel 180.04 Xeon E5-2690 [email protected] [3.5GHz 165.18 8.3X Turbo] (Broadwell) CPUs + Tesla P100 8.0X PCIe (16GB) or V100 PCIe 150 136.43 7.3X (16GB) GPUs
Average TPS 101.06 6.1X 131,072 particles 100 Force field/method: Hard dodecahedra 80.99 (via MC simulation) 4.5X This is a synethic benchmark to test scalability 3.6X and is much larger than anyone would 50 reasonably run 22.54
0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1x P100 2x P100 4x P100 8x P100 1x V100 2x V100 4x V100 8x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node node node
77 HOOMD-Blue hexagon on V100 vs P100 (PCIe 16GB)
120 Running HOOMD-Blue version 2.2.2 104.50 The blue node contains Dual Intel Xeon 100 E5-2690 [email protected] [3.5GHz Turbo] 89.48 19.0X (Broadwell) CPUs
80 The green nodes contain Dual Intel Xeon 16.3X E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe 59.15 60 (16GB) or V100 PCIe (16GB) GPUs 48.56 10.8X Average TPS 1,048,576 particles 40 8.8X 32.29 Force field/method: Hard hexagons 26.90 (via MC simulation) 5.9X Results published in: 16.99 http://dx.doi.org/10.1103/PhysRevX.7.021001 20 14.08 4.9X 5.50 2.6X 3.1X 0 1 Broadwell node 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1 node + 1x P100 2x P100 4x P100 8x P100 1x V100 2x V100 4x V100 8x V100 PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per PCIe per node node node node node node node node
78 LAMMPS 2017 October 2017 Atomic-Fluid Lennard-Jones 2.5 Cutoff on V100s PCIe
1.00 2,048,000 atoms
0.80 0.73 (Untuned on Volta) Running LAMMPS version 2017 0.60 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] 3.0X (Broadwell) CPUs
1/seconds 0.40 The green nodes contain Dual Intel 0.25 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + 0.20 Tesla V100 PCIe (16GB) GPUs
0.00 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
80 Atomic-Fluid Lennard-Jones 2.5 Cutoff on V100s SXM2
1.00
2,048,000 atoms 0.82 0.80
(Untuned on Volta) 3.3X Running LAMMPS version 2017 0.60 The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs
1/seconds 0.40 The green nodes contain Dual Intel 0.25 Xeon E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs + 0.20 Tesla V100 SXM2 (16GB) GPUs
0.00 1 Broadwell node 1 node + 2x V100 SXM2 per node (16GB)
81 Atomic-Fluid Lennard-Jones 5.0 Cutoff on V100s PCIe
0.80 2,048,000 atoms
0.60 0.60 (Untuned on Volta) 0.47 Running LAMMPS version 2017 0.45 10.0X The blue node contains Dual Intel Xeon 0.40 E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs
1/seconds 7.8X The green nodes contain Dual Intel 7.5X Xeon E5-2690 [email protected] [3.5GHz 0.20 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs 0.06
0.00 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)
82 Atomic-Fluid Lennard-Jones 5.0 Cutoff on V100s SXM2
0.80 2,048,000 atoms
0.60 0.55 0.56 (Untuned on Volta) 0.48 9.3X Running LAMMPS version 2017 9.2X The blue node contains Dual Intel Xeon 0.40 E5-2698 [email protected] [3.6GHz Turbo]
(Broadwell) CPUs 1/seconds 8.0X The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 0.20 Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs 0.06
0.00 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)
83 Course-grain Water on V100s PCIe
0.020
2,048,000 atoms 0.016
0.015 (Untuned on Volta) Running LAMMPS version 2017 0.011 5.3X The blue node contains Dual Intel Xeon 0.010 E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs
1/seconds 0.007 3.7X The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 0.005 Turbo] (Broadwell) CPUs + 0.003 2.3X Tesla V100 PCIe (16GB) GPUs
0.000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)
84 Course-grain Water on V100s SXM2
0.025 2,048,000 atoms
0.020 0.020
(Untuned on Volta) 6.7X Running LAMMPS version 2017 0.015 0.014 The blue node contains Dual Intel Xeon E5-2698 [email protected] [3.6GHz Turbo] 4.7X (Broadwell) CPUs
1/seconds 0.010 0.009 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 3.0X Turbo] (Broadwell) CPUs + 0.005 0.003 Tesla V100 SXM2 (16GB) GPUs
0.000 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)
85 Gay-Berne on V100s PCIe
0.06 2,097,152 atoms 0.05 0.05
(Untuned on Volta) 0.04 Running LAMMPS version 2017
7.5X The blue node contains Dual Intel Xeon 0.03 E5-2690 [email protected] [3.5GHz Turbo]
(Broadwell) CPUs 1/seconds 0.02 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 0.01 Turbo] (Broadwell) CPUs + 0.01 Tesla V100 PCIe (16GB) GPUs
0.00 1 Broadwell node 1 node + 2x V100 PCIe per node (16GB)
86 Gay-Berne on V100s SXM2
0.06 2,097,152 atoms 0.05 0.05
5.0X (Untuned on Volta) 0.04 Running LAMMPS version 2017
The blue node contains Dual Intel Xeon 0.03 E5-2698 [email protected] [3.6GHz Turbo]
(Broadwell) CPUs 1/seconds 0.02 The green nodes contain Dual Intel Xeon E5-2698 [email protected] [3.6GHz 0.01 Turbo] (Broadwell) CPUs + 0.01 Tesla V100 SXM2 (16GB) GPUs
0.00 1 Broadwell node 1 node + 2x V100 SXM2 per node (16GB)
87 Rhodopsin on V100s PCIe
0.80 256,000 atoms
0.60 0.58 0.53 (Untuned on Volta) 3.4X Running LAMMPS version 2017 0.44 3.1X The blue node contains Dual Intel Xeon 0.40 E5-2690 [email protected] [3.5GHz Turbo]
(Broadwell) CPUs 1/seconds 2.6X The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz 0.20 0.17 Turbo] (Broadwell) CPUs + Tesla V100 PCIe (16GB) GPUs
0.00 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 PCIe 4x V100 PCIe 8x V100 PCIe per node (16GB) per node (16GB) per node (16GB)
88 Rhodopsin on V100s SXM2
0.80 256,000 atoms 0.70 0.68
0.60 0.60 4.0X (Untuned on Volta) 0.50 Running LAMMPS version 2017 0.42 3.5X The blue node contains Dual Intel Xeon 0.40 E5-2698 [email protected] [3.6GHz Turbo] (Broadwell) CPUs 1/seconds 0.30 The green nodes contain Dual Intel 2.5X Xeon E5-2698 [email protected] [3.6GHz 0.20 0.17 Turbo] (Broadwell) CPUs + Tesla V100 SXM2 (16GB) GPUs 0.10
0.00 1 Broadwell node 1 node + 1 node + 1 node + 2x V100 SXM2 4x V100 SXM2 8x V100 SXM2 per node (16GB) per node (16GB) per node (16GB)
89 Recommended GPU Node Configuration for LAMMPS Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets 2 Cores per CPU socket 6+ CPU speed (Ghz) 2.66+ System memory per socket (GB) 32
GTX Titan X, GPUs Tesla P100, V100
# of GPUs per CPU socket 1-2 GPU memory preference (GB) 6+ GPU to CPU connection PCIe 3.0 or higher
Server storage 500 GB or higher
Network configuration Gemini, InfiniBand Scale to thousands of nodes with same single node configuration 90 90 NAMD 2.12 July 2017 APOA1 on P100s PCIe
28 APOA1 24 22.58 22.85
20 Running NAMD version 2.12
16 The blue node contains Dual Intel Xeon
E5-2690 [email protected] [3.5GHz Turbo] ns/day 12 (Broadwell) CPUs The green nodes contain Dual Intel 8 Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (16GB) GPUs 4 3.45
0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe (16GB) per node (16GB) per node
92 APOA1 on P100s SXM2
30 APOA1 25 23.87 22.98 23.44
20 Running NAMD version 2.12
The blue node contains Dual Intel Xeon 15 E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs
10 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs 5 3.45
0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node
93 F1ATPASE on P100s PCIe
10 F1ATPASE 8 7.34 7.40 6.99 Running NAMD version 2.12 6 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs 4 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 2 PCIe (16GB) GPUs 1.15
0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe 4x P100 PCIe (16GB) per node (16GB) per node (16GB) per node
94 F1ATPASE on P100s SXM2
10 F1ATPASE 8 7.11 7.11 6.85 Running NAMD version 2.12 6 The blue node contains Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs 4 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 2 SXM2 GPUs 1.15
0 1 Broadwell node 1 node + 1 node + 1 node + 1x P100 SXM2 2x P100 SXM2 4x P100 SXM2 per node per node per node
95 STMV on P100s PCIe
3.0 STMV 2.5 2.32 2.15
2.0 Running NAMD version 2.12
The blue node contains Dual Intel Xeon 1.5 E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs
1.0 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 0.5 PCIe (16GB) GPUs 0.29
0.0 1 Broadwell node 1 node + 1 node + 1x P100 PCIe 2x P100 PCIe (16GB) per node (16GB) per node
96 STMV on P100s SXM2
3.0 STMV 2.5
2.077 2.0 Running NAMD version 2.12
The blue node contains Dual Intel Xeon 1.5 E5-2690 [email protected] [3.5GHz Turbo] ns/day (Broadwell) CPUs
1.0 The green nodes contain Dual Intel Xeon E5-2690 [email protected] [3.5GHz Turbo] (Broadwell) CPUs + Tesla P100 0.5 SXM2 GPUs 0.292
0.0 1 Broadwell node 1 node + 1x P100 SXM2 per node
97 Benefits of MD GPU-Accelerated Computing Why wouldn’t you want to turbocharge your research?
• 3x-8x Faster than CPU only systems in all tests (on average) • Most major compute intensive aspects of classical MD ported • Large performance boost and save “Big Money” on CPUs, networks • Energy usage cut by more than half • GPUs scale well within a node and/or over multiple nodes • K80 GPU is our fastest and lowest power high performance GPU yet
Try GPU accelerated MD apps for free – www.nvidia.com/GPUTestDrive 98 Molecular Dynamics (MD) on GPUs Dec. 19, 2016 GPU-Accelerated Quantum Chemistry Apps Green Lettering Indicates Performance Slides Included
Abinit Gaussian Octopus Quantum SuperCharger ACES III GPAW ONETEP Library ADF LATTE Petot RMG BigDFT LSDalton Q-Chem TeraChem CP2K MOLCAS QMCPACK UNM GAMESS-US Mopac2012 Quantum VASP Espresso NWChem WL-LSMS
GPU Perf compared against dual multi-core x86 CPU socket. 100