Accelerating Materials-Simulation Computations on GPUs
François Courteille, Senior Solution Architect, Accelerated Computing
AGENDA
- Tesla, the NVIDIA compute platform
- Quantum Chemistry Applications Overview
- Molecular Dynamics Applications Overview
- Material Science Applications at the start of the Pascal Era
THE WORLD LEADER IN VISUAL COMPUTING
GAMING | AUTO | ENTERPRISE | HPC & CLOUD | TECHNOLOGY
OUR TEN YEARS IN HPC (2006-2016)
- CUDA launched
- World's first GPU Top500 system
- Fermi: world's first HPC GPU
- World's first 3-D mapping of the human genome
- Discovered how H1N1 mutates to resist drugs
- Oak Ridge deploys world's fastest supercomputer w/ GPUs
- Stanford builds AI machine using GPUs
- AlexNet beats expert code by a huge margin using GPUs
- Google outperforms humans in ImageNet
- GPU-trained AI machine beats world champion in Go
- World's first atomic model of the HIV capsid

CREDENTIALS BUILT OVER TIME
Industries served: academia, games, finance, manufacturing, internet, oil & gas, national labs, automotive, defense, M&E.

Number of GPU-accelerated applications by year: 113 (2011), 206 (2012), 242 (2013), 287 (2014), 370 (2015), 410 (2016).

Accelerated deep learning frameworks: Torch, Caffe, Theano, MatConvNet, Mocha.jl, Purine, Minerva, MXNet*, Big Sur, TensorFlow, Watson, CNTK.

The majority of HPC applications are GPU-accelerated (410 and growing); 300K CUDA developers (4x growth in 4 years); 100% of deep learning frameworks are accelerated.
www.nvidia.com/appscatalog

END-TO-END TESLA PRODUCT FAMILY
- HYPERSCALE HPC — Tesla M4, M40: hyperscale deployment for DL training, inference, video & image processing
- MIXED-APPS HPC — Tesla K80: HPC data centers running a mix of CPU and GPU workloads
- STRONG-SCALING HPC — Tesla P100: hyperscale & HPC data centers running apps that scale to multiple GPUs
- FULLY INTEGRATED DL SUPERCOMPUTER — DGX-1: for customers who need to get going now with a fully integrated solution
TESLA FOR SIMULATION
LIBRARIES DIRECTIVES LANGUAGES
ACCELERATED COMPUTING TOOLKIT
TESLA ACCELERATED COMPUTING
OVERVIEW OF LIFE & MATERIAL ACCELERATED APPS

MD: all key codes are GPU-accelerated, with great multi-GPU performance; focus on dense (up to 16) GPU nodes and/or large numbers of GPU nodes. GPU-accelerated and available today: ACEMD*, AMBER (PMEMD)*, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more.

QC: all key codes are ported or being optimized; focus on GPU-accelerated math libraries and OpenACC directives. GPU-accelerated and available today: ABINIT, ACES III, ADF, BAND, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, Q-Chem, QMCPack, Quantum Espresso/PWscf, QUICK, TeraChem* & more. Active GPU-acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more.

* = application where >90% of the workload is on the GPU

MD VS. QC ON GPUS
"Classical" Molecular Dynamics vs. Quantum Chemistry (MO, PW, DFT, semi-empirical):
- MD simulates positions of atoms over time (chemical-biological or chemical-material behaviors); QC calculates electronic properties (ground state, excited states, spectral properties, making/breaking bonds, physical properties)
- MD: forces calculated from simple empirical formulas (bond rearrangement generally forbidden); QC: forces derived from the electron wave function (bond rearrangement OK, e.g., bond energies)
- MD: up to millions of atoms; QC: up to a few thousand atoms
- MD: solvent included without difficulty; QC: generally in vacuum, but if needed the solvent is treated classically (QM/MM) or with implicit methods
- MD: single precision dominates; QC: double precision is important
- MD: uses cuBLAS, cuFFT, CUDA; QC: uses cuBLAS, cuFFT, OpenACC
- MD: GeForce (academics) or Tesla (servers), ECC off; QC: Tesla recommended, ECC on

GPU-ACCELERATED QUANTUM CHEMISTRY APPS
Green lettering indicates performance slides included.
ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, MOPAC2012, NWChem, Octopus, ONETEP, PEtot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, TeraChem, VASP, WL-LSMS
GPU performance compared against a dual multi-core x86 CPU socket.

ABINIT
6th International ABINIT Developer Workshop
FROM RESEARCH TO INDUSTRY April 15-18, 2013 - Dinard, France
USING ABINIT ON GRAPHICS PROCESSING UNITS (GPU)
Marc Torrent CEA, DAM, DIF, F-91297 Arpajon, France Florent Dahm C-S – 22 av. Galilée. 92350 Le Plessis-Robinson
With a contribution from Yann Pouillon for the build system
www.cea.fr

ABINIT ON GPU: OUTLINE
- Introduction to GPU computing: architecture, programming model
- Density-functional theory with plane waves: how to port it to GPU
- ABINIT on GPU: implementation, performance of ABINIT v7.0
- How to: compile, use
ABINIT on GPU | April 15, 2013 | PAGE 13

LOCAL OPERATOR: FAST FOURIER TRANSFORMS
- Our choice (for a first version): replace the FFT at the MPI plane-wave level by an FFT computed on the GPU. The local potential is applied in real space via forward/inverse FFTs:

  $\langle g|v_{\mathrm{eff}}|\tilde{c}\rangle = \int dr\, e^{-i g\cdot r}\, v_{\mathrm{eff}}(r) \sum_{g'} \tilde{c}_{g'}\, e^{i g'\cdot r}$

- Included in the CUDA package: the cuFFT library, interfaced with C and Fortran.
- A system-dependent efficiency is expected: small FFTs underuse the GPU and are dominated by the time required to transfer the data to/from the GPU. Within PAW, small plane-wave bases are used.
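The transfer-bound regime described above can be made concrete with a back-of-the-envelope model. The throughput and bandwidth constants below are illustrative assumptions for a Fermi-class card, not figures from the talk:

```python
# Rough cost model: GPU compute time vs. PCIe transfer time for an n^3
# complex-to-complex FFT. Constants are assumed order-of-magnitude values.
import math

GPU_FFT_GFLOPS = 100.0   # assumed sustained cuFFT throughput (GFLOP/s)
PCIE_GB_PER_S  = 6.0     # assumed effective host<->device bandwidth (GB/s)

def fft_time_ms(n):
    """Estimated compute time for an n^3 FFT (classic 5*N*log2(N) flop count)."""
    points = n ** 3
    return 5.0 * points * math.log2(points) / (GPU_FFT_GFLOPS * 1e9) * 1e3

def transfer_time_ms(n):
    """PCIe round trip for an n^3 double-complex grid (16 bytes per point)."""
    return 2 * 16 * n ** 3 / (PCIE_GB_PER_S * 1e9) * 1e3

for n in (32, 64, 128, 256):
    c, t = fft_time_ms(n), transfer_time_ms(n)
    print(f"n={n:3d}: compute {c:8.3f} ms, transfer {t:8.3f} ms, "
          f"transfer/compute = {t / c:.1f}")
```

With these assumptions a lone transfer always outweighs a single FFT, and the ratio is worst for the small grids typical of PAW; this is why the port batches bands and overlaps copies with computation.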
LOCAL OPERATOR: FAST FOURIER TRANSFORMS (continued)
- Implementation: a CUDA/Fortran interface with new routines that encapsulate the cuFFT calls; development of 3 small CUDA kernels; optimization of memory copies to overlap computation and data transfer.
- Speedup of fourwf (option 2) is highly dependent on the FFT size (chart).
- Test case: 107-gold-atom cell on TGCC-Curie (Intel Westmere + NVIDIA M2090).
IMPROVEMENT: MULTIPLE-BAND OPTIMIZATION
- The CUDA kernels are used more intensively and fewer host-device transfers are required.
- See the bandpp input variable, used when the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) algorithm is selected.
fourwf time on GPU (s) vs. bandpp, TGCC-Curie (Intel Westmere + NVIDIA M2090):
- 107 gold atoms: bandpp = 1: 112.94; 2: 84.55; 4: 70.55; 8: 61.91; 18: 54.79; 72: 50.55
- 31 copper atoms: bandpp = 1: 36.02; 2: 20.79; 4: 14.14; 10: 9.16; 20: 9.50; 100: 7.94

NON-LOCAL OPERATOR
- Implementation: two CUDA kernels for the projections

  $\langle \tilde{p}_i|\psi\rangle = \sum_{g'} \langle \tilde{p}_i|g'\rangle\langle g'|\psi\rangle$

  and two CUDA kernels for applying the non-local operator

  $\langle g|V_{NL}|\psi\rangle = \sum_{i,j} \langle g|\tilde{p}_i\rangle\, D_{ij}\, \langle \tilde{p}_j|\psi\rangle$

- Collaborative memory zones are used as much as possible (threads linked to the same plane wave lie in the same collaborative zone).
- Used for: the non-local operator, the overlap operator, forces and the stress tensor, and the PAW occupation matrix — in addition to all MPI parallelization levels.
- Needs a large number of plane waves and a large number of non-local projectors to be efficient.

Performance (non-local operator only), TGCC-Curie (Intel Westmere + NVIDIA M2090) — nonlop CPU time (s) / GPU time (s) / speedup:
- 107 gold atoms: 3142 / 1122 / 2.8
- 31 copper atoms: 85 / 42 / 2.0
- 3 UO2 atoms: 63 / 36 / 1.8
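The two projector sums on this slide are dense matrix products over the plane-wave and projector indices, which is why the operator maps naturally onto GEMM calls (cuBLAS on the GPU). A small NumPy stand-in, with synthetic shapes and data, sketches the structure:

```python
# Non-local (PAW-style) operator applied as dense matrix products.
# NumPy stands in for cuBLAS here; sizes and data are synthetic examples.
import numpy as np

rng = np.random.default_rng(0)
n_pw, n_proj, n_band = 2000, 50, 8      # plane waves, projectors, bands

# <g|p_i> projector matrix, symmetric D_ij coupling matrix, wavefunctions
P = rng.standard_normal((n_pw, n_proj)) + 1j * rng.standard_normal((n_pw, n_proj))
D = rng.standard_normal((n_proj, n_proj))
D = 0.5 * (D + D.T)
psi = rng.standard_normal((n_pw, n_band)) + 1j * rng.standard_normal((n_pw, n_band))

proj = P.conj().T @ psi          # <p_i|psi>: one GEMM over the plane-wave index
v_nl_psi = P @ (D @ proj)        # <g|V_NL|psi>: two more GEMMs

print(v_nl_psi.shape)            # (2000, 8)
```

The GEMMs only become large enough to keep a GPU busy when both the plane-wave and projector counts are large, matching the slide's observation.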
TGCC-Curie (Intel Westmere + Nvidia M2090) ABINIT on GPU| April 15, 2013 | PAGE 17 Algorithm 1 lobpcg Locally Optimal Block Require: 0 = { 0 , . . . , 0 } close to the minimum and K a pre- 1 m Preconditioned Conjugate (0) 0 conditioner; P = {P 1, . . . , P m} is initialized to 0. 1: for i=0,1, , .do. . Gradient 2: ⌥(i) = ⌥( (i) ) bandpp Total tim e Tim e Line ar Alge bra 3: R(i) = H (i) ⌥(i) O (i) 1 674 119 2 612 116 4: W(i) = KR(i) 10 1077 428 5: The Rayleigh-Ritz method is applied within the subspace ⌅ = n o 50 5816 5167 (i) (i) (i) (i) (i) (i) 1 ,P. .. , Pm , 1 , . .. , m , W1 , . . . , Wm 6: (i+1) = (i) (i) + ⇤(i) W(i) + (i) P(i) 7: P(i+1) = ⇤(i) W(i) + (i) P(i) 8: end for
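A bare-bones dense version of the iteration (with $O = I$ and $K = I$, i.e., no overlap matrix and no preconditioner) can be sketched in NumPy. This is an illustrative toy, not ABINIT's implementation, which runs these steps through cuBLAS/MAGMA on blocks of bandpp bands:

```python
# Toy LOBPCG for the smallest eigenpairs of a symmetric matrix H,
# mirroring Algorithm 1 with O = I and K = I. Illustrative only.
import numpy as np

def lobpcg(H, X0, iters=200):
    m = X0.shape[1]
    X, _ = np.linalg.qr(X0)                 # orthonormal starting block
    P = np.zeros_like(X)                    # conjugate directions, start at 0
    for _ in range(iters):
        lam = np.sum(X * (H @ X), axis=0)   # Rayleigh quotients of the block
        R = H @ X - X * lam                 # step 3: block residual
        W = R                               # step 4: W = K R with K = I
        Q, _ = np.linalg.qr(np.hstack([X, W, P]))  # step 5: trial subspace
        w, V = np.linalg.eigh(Q.T @ H @ Q)  # Rayleigh-Ritz: small dense problem
        X_new = Q @ V[:, :m]                # steps 6-7: update the block and
        P = X_new - X @ (X.T @ X_new)       # keep the new search direction
        X = X_new
    return np.sort(np.sum(X * (H @ X), axis=0)), X

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100))
A = A + A.T
lam, X = lobpcg(A, rng.standard_normal((100, 3)))
print(np.allclose(lam, np.linalg.eigvalsh(A)[:3], atol=1e-6))
```

Every step is either a tall-skinny GEMM or a small dense eigenproblem, which is exactly the split between cuBLAS and MAGMA described on the next slide.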
- Matrix-matrix and matrix-vector products; diagonalization/orthogonalization within blocks.
- These products are not parallelized: the size of the matrices increases when band parallelization is used, and the size of the vectors/matrices increases with the number of bands/plane waves per CPU core.
- Need for a free LAPACK-like package on GPU.
- Our choice: cuBLAS for the products; MAGMA for diagonalization/orthogonalization.

MAGMA: MATRIX ALGEBRA ON GPU AND MULTICORE ARCHITECTURES
- The MAGMA project aims to develop a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid architectures, starting with current "multicore + GPU" systems.
- The MAGMA research is based on the idea that, to address the complex challenges of emerging hybrid environments, optimal software solutions will themselves have to hybridize, combining the strengths of different algorithms within a single framework.
- A project by the Innovative Computing Laboratory (ICL), University of Tennessee; free to use. First stable release 25-08-2011; version 1.3 current at the time of this talk.
- Computation is cut up in order to offload the most consuming parts onto the GPU; the goal is to address hybrid environments.
ITERATIVE EIGENSOLVER: PERFORMANCE
Speedup of the cuBLAS/MAGMA code on a single MPI process, TGCC-Curie (Intel Westmere + NVIDIA M2090), as a function of bandpp (2, 4, 10; a factor related to the size of the blocks of bands), for 215 silicon atoms, 107 gold atoms, 31 copper atoms, and 3 UO2 atoms (chart).
- Highly dependent on the physical system.
- Depends only on the size of the blocks used in the LOBPCG algorithm.
PERFORMANCE ON A SINGLE GPU
Comparing architectures:
- TGCC-Titane: Intel Nehalem + NVIDIA S1070 (Tesla)
- TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi)

Time on process 0 (s), CPU / CPU+GPU / speedup:
- 107 gold atoms — Titane: 4230 / 1857 / 2.3; Curie: 5154 / 621 / 8.3
- 31 copper atoms — Titane: 153 / 102 / 1.5; Curie: 235 / 64 / 3.7
- 5 BaTiO3 atoms — Titane: 85 / 98 / 0.9; Curie: 86 / 73 / 1.2

Notes: strongly dependent on the architecture (Titane: fast CPU + old GPU); the BaTiO3 case is too small to benefit from the GPU.
PERFORMANCE ON MULTIPLE GPUs
TGCC-Curie (Intel Westmere + NVIDIA M2090), time on process 0 (s), CPU / CPU+GPU / speedup.

8 CPUs (MPI only) vs. 8 CPUs + 8 GPUs:
- 215 silicon atoms (1 iter.): 16160 / 3680 / 4.4
- 107 gold atoms: 866 / 621 / 4.7
- 31 copper atoms: 49 / 64 / 1.5

200 CPUs (MPI only) vs. 200 CPUs + 200 GPUs:
- 215 silicon atoms (45 iter.): 25856 / 6309 / 4.1

Less efficient as the number of MPI processes increases (the load per GPU decreases).
PERFORMANCE VS. OTHER CODES
Other DFT codes have also been ported to GPU; the speedups of plane-wave codes are similar:
- Quantum ESPRESSO (plane waves): 3-4x sequential, 2-3x parallel [1,2]
- VASP (plane waves): 3-4x [3,4]
- BigDFT (wavelets): 5-7x [5]
- GPAW (real space): 10-15x [1,6]

[1] Harju et al., Applied Parallel and Scientific Computing, Lecture Notes in Computer Science 7782, p. 3 (2013)
[2] Spiga et al., 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), p. 368 (2012)
[3] Maintz et al., Computer Physics Communications 182, p. 1421 (2011)
[4] Hacene et al., Journal of Computational Chemistry 33, p. 2581 (2012)
[5] Genovese et al., Journal of Chemical Physics 131, 034103 (2009)
[6] Hakala et al., PARA 2012, Lecture Notes in Computer Science 7782, p. 63, Springer, Heidelberg (2013)
BigDFT

GAUSSIAN
GTC, April 4-7, 2016 | Silicon Valley
ENABLING THE ELECTRONIC STRUCTURE PROGRAM GAUSSIAN ON GPGPUS USING OPENACC
Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian, Inc.), Giovanni Scalmani (Gaussian, Inc.), Brent Leback (NVIDIA/PGI)

EARLIER PRESENTATIONS
- GRC Poster 2012
- ACS Spring 2014
- GTC Spring 2014 (recording at http://on-demand.gputechconf.com/gtc/2014/video/S4613-enabling-gaussian-09-gpgpus.mp4)
- WATOC Fall 2014
- GTC Spring 2016 (this talk; full recording at http://mygtc.gputechconf.com/quicklink/4r13O5r, registration required)

CURRENT STATUS: SINGLE NODE
Implemented:
- Energies for closed- and open-shell HF and DFT (fewer than a handful of XC functionals missing)
- First derivatives for the same as above
- Second derivatives for the same as above

Using only:
- OpenACC
- CUDA library calls (BLAS)
GAUSSIAN PARALLELISM MODEL
CPU cluster → CPU node → GPU: OpenMP across the cores of a CPU node, OpenACC to offload to the GPU.
CLOSING REMARKS
- Significant progress has been made in enabling Gaussian on GPUs with OpenACC
- OpenACC is becoming increasingly versatile
- Significant work lies ahead to improve performance
- Expand the feature set: PBC, solvation, MP2, ONIOM, triples corrections
EARLY PERFORMANCE RESULTS (CCSD(T))
Ibuprofen CCSD(T) calculation; speedups relative to a CPU-only full node (bar labels give times in hours).
- Method: CCSD(T); 33 atoms; basis set 6-31G(d,p); 315 basis functions; 41 occupied and 259 virtual orbitals; 15 CCSD cycles
- Total time: 24.8 h (20 CPU threads, 0 GPUs) vs. 3.7 h (6 threads + 6 GPUs); the l913 "triples" and iterative-CCSD steps show comparable gains (chart)
- System: 2x Intel Xeon E5-2690 v2 (2x 10 cores @ 3.0 GHz), 128 GB DDR3-1600 RAM (108 GB used); 6x Tesla K40m (15 SMs @ 875 MHz, 12 GB global memory each)

VASP

GPU VASP COLLABORATION
Collaborators include the University of Chicago.

2013-2014 project scope:
- Minimization algorithms to calculate the electronic ground state: blocked Davidson (ALGO = Normal & Fast) and RMM-DIIS (ALGO = VeryFast & Fast)
- K-points
- Optimization of a critical step in exact-exchange calculations

Earlier work:
- Speeding up plane-wave electronic-structure calculations using graphics-processing units — Maintz, Eck, Dronskowski
- VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron — Hutchinson, Widom
- Accelerating VASP electronic structure calculations using graphic processing units — Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard

TARGET WORKLOADS
- Silica ("medium"): 7 Å thick slab of amorphous silica, 240 atoms (Si68O148H24); RMM-DIIS (ALGO = VeryFast)
- NiAl-MD ("large"): liquid-metal molecular dynamics sample of a nickel-based superalloy; 500 atoms, 9 chemical species in total; blocked Davidson (ALGO = Normal)
- INTERFACE ("large"): interface of platinum metal with water; 108 Pt atoms and 120 water molecules (468 atoms); blocked Davidson & RMM-DIIS (ALGO = Fast)

RUNTIME DISTRIBUTION FOR SILICA
Time in seconds for 1 K40 GPU + 1 IvyBridge core.
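For reference, the ALGO values named above are ordinary INCAR tags. A minimal INCAR in the spirit of the Silica run might look like the following sketch (illustrative values only; these are not the benchmark's actual input files):

```
SYSTEM = amorphous silica slab (illustrative)
ALGO   = VeryFast    ! RMM-DIIS; Normal = blocked Davidson, Fast = Davidson + RMM-DIIS
NSIM   = 4           ! bands treated simultaneously in RMM-DIIS; bigger blocks suit GPUs
NCORE  = 1           ! the GPU port expects one band group per MPI rank
EDIFF  = 1E-5        ! electronic convergence criterion
```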
Runtime breakdown (CPU, memcopy, GEMM, FFT, other) for the original vs. the optimized GPU port (chart; 0-3500 s scale).

VASP 5.4.1 with Patch #1 (Haswell & K80), March 2016

VASP INTERFACE BENCHMARK
Running VASP 5.4.1; elapsed time in seconds, lower is better. The CPU node contains Dual Intel Xeon E5-2698 v3 @ 2.30 GHz (Haswell) CPUs; GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix patch #1 for vasp.5.4.1.05Feb15, which yields up to 2.7x faster calculations (patch #1 available at VASP.at). Blocked Davidson + RMM-DIIS (ALGO = Fast):
- 1 Node: 828.27
- 1 Node + 1x K80: 580.63 (1.4x)
- 1 Node + 2x K80: 506.25 (1.6x)
- 1 Node + 2x K80 [Old Data]: 387.26 (2.1x)
- 1 Node + 4x K80 [Old Data]: 284.06 (3.0x)
VASP SILICA IFPEN BENCHMARK
Same system configuration (Haswell node + K80 GPUs); RMM-DIIS (ALGO = VeryFast). Elapsed time (s), lower is better:
- 1 Node: 542.87
- 1 Node + 1x K80: 432.48 (1.3x)
- 1 Node + 2x K80: 372.74 (1.5x)
- 1 Node + 2x K80 [Old Data]: 258.14 (2.1x)
- 1 Node + 4x K80 [Old Data]: 210.47 (2.6x)
VASP SI-HUGE BENCHMARK
Same system configuration; blocked Davidson + RMM-DIIS (ALGO = Fast). Elapsed time (s), lower is better:
- 1 Node: 6464.42
- 1 Node + 1x K80: 4296.15 (1.5x)
- 1 Node + 2x K80: 2871.09 (2.3x)
- 1 Node + 2x K80 [Old Data]: 2865.92 (2.3x)
- 1 Node + 4x K80 [Old Data]: 1987.94 (3.3x)
VASP SUPPORTEDSYSTEMS BENCHMARK
Same system configuration; blocked Davidson + RMM-DIIS (ALGO = Fast). Elapsed time (s), lower is better:
- 1 Node: 310.63
- 1 Node + 1x K80: 252.43 (1.2x)
- 1 Node + 2x K80: 228.90 (1.4x)
- 1 Node + 2x K80 [Old Data]: 164.21 (1.9x)
- 1 Node + 4x K80 [Old Data]: 129.99 (2.4x)
VASP NIAL-MD BENCHMARK
Same system configuration; blocked Davidson + RMM-DIIS (ALGO = Fast). Elapsed time (s), lower is better:
- 1 Node: 722.82
- 1 Node + 1x K80: 237.02 (3.0x)
- 1 Node + 2x K80: 197.19 (3.7x)
- 1 Node + 2x K80 [Old Data]: 184.93 (3.9x)
- 1 Node + 4x K80 [Old Data]: 142.98 (5.1x)
VASP B.hR105 BENCHMARK
Same system configuration; hybrid functional with blocked Davidson (ALGO = Normal). Elapsed time (s), lower is better:
- 1 Node: 1196.86
- 1 Node + 1x K80: 604.05 (2.0x)
- 1 Node + 2x K80: 334.44 (3.6x)
- 1 Node + 2x K80 [Old Data]: 343.69 (3.5x)
- 1 Node + 4x K80 [Old Data]: 236.13 (5.1x)
VASP B.aP107 BENCHMARK
Same system configuration; hybrid functional calculation (exact exchange) with blocked Davidson (ALGO = Normal), no k-point parallelization. Elapsed time (s), lower is better:
- 1 Node: 27993.62
- 1 Node + 1x K80: 7494.36 (3.7x)
- 1 Node + 2x K80: 7632.31 (3.7x)
- 1 Node + 2x K80 [Old Data]: 4432.31 (6.3x)
- 1 Node + 4x K80 [Old Data]: 4513.40 (6.2x)
GPU-ACCELERATED MOLECULAR DYNAMICS APPS
Green lettering indicates performance slides included.

ACEMD, AMBER, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, LAMMPS, mdcore, MELD, NAMD, OpenMM, PolyFTS
GPU performance compared against a dual multi-core x86 CPU socket.

NAMD 2.11: NEW GPU FEATURES
Selected text from the NAMD website:
- GPU-accelerated simulations up to twice as fast as NAMD 2.10
- Pressure calculation with fixed atoms on GPU works as on CPU
- Improved scaling for GPU-accelerated particle-mesh Ewald calculation: CPU-side operations overlap better and are parallelized across cores
- Improved scaling for GPU-accelerated simulations: non-bonded force calculation results are streamed from the GPU for better overlap
- NVIDIA CUDA GPU-acceleration binaries for Mac OS X
NAMD 2.11 IS UP TO 2X FASTER
APoA1 (92,224 atoms), simulated time (ns/day), NAMD 2.11 vs. NAMD 2.10 on nodes containing Dual Intel E5-2697 v2 @ 2.70 GHz (IvyBridge) CPUs + 2x Tesla K80 (autoboost) GPUs: speedups of 1.2x (1 node), 1.6x (2 nodes), and 2.0x (4 nodes).

NAMD 2.11: APOA1 ON 1 AND 2 NODES
APoA1 (92,224 atoms), NAMD 2.11, simulated time (ns/day). CPU nodes contain Dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; GPU nodes add Tesla K80 (autoboost) GPUs:
- 1 Node: 2.77
- 1 Node + 1x K80: 11.67 (4.2x)
- 1 Node + 2x K80: 16.99 (6.1x)
- 2 Nodes: 5.22
- 2 Nodes + 1x K80: 19.73 (3.8x)
- 2 Nodes + 2x K80: 24.31 (4.7x)
NAMD 2.11: APOA1 ON 4 AND 8 NODES
Same configuration; simulated time (ns/day):
- 4 Nodes: 10.27
- 4 Nodes + 1x K80: 20.64 (2.0x)
- 4 Nodes + 2x K80: 23.52 (2.3x)
- 8 Nodes: 16.85
- 8 Nodes + 1x K80: 27.74 (1.6x)
- 8 Nodes + 2x K80: 27.83 (1.7x)

NAMD 2.11: STMV ON 1 AND 2 NODES
STMV (1,066,628 atoms), NAMD 2.11, simulated time (ns/day). Same Haswell + K80 configuration:
- 1 Node: 0.23
- 1 Node + 1x K80: 1.03 (4.5x)
- 1 Node + 2x K80: 1.75 (7.6x)
- 2 Nodes: 0.46
- 2 Nodes + 1x K80: 1.98 (4.3x)
- 2 Nodes + 2x K80: 3.27 (7.1x)
NAMD 2.11: STMV ON 4 AND 8 NODES
Same configuration; simulated time (ns/day):
- 4 Nodes: 0.90
- 4 Nodes + 1x K80: 3.61 (4.0x)
- 4 Nodes + 2x K80: 4.54 (5.0x)
- 8 Nodes: 1.74
- 8 Nodes + 1x K80: 5.86 (3.4x)
- 8 Nodes + 2x K80: 6.24 (3.6x)

AMBER 14: JAC ON K40s, K80s, AND M6000s
58 AMBER 14 JAC on K40s, K80s and M6000s
250 Running AMBER version 14 225.34 PME-JAC_NVE 219.83 200.34 The blue node contains Dual Intel E5- 200 2698 [email protected], 3.6GHz Turbo CPUs 6.0X 5.9X 174.34 5.4X 161.53 The green nodes contain Dual Intel E5-
2698 [email protected], 3.6GHz Turbo CPUs + 150 134.82 4.7X 121.30 either NVIDIA Tesla K40@875Mhz, Tesla 4.3X K80@562Mhz (autoboost), or Quadro 3.6X 100 M6000@987Mhz GPUs 3.2X
50 37.38 Simulated Simulated Time (ns/day)
0 1 Haswell 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node 1 CPU Node Node + 1x K40 + 0.5x K80 + 1x K80 + 1x M6000 + 2x K40 + 2x K80 + 2x M6000
AMBER 14: CELLULOSE ON K40s, K80s, AND M6000s
PME-Cellulose_NVE, same configuration; simulated time (ns/day):
- 1 Haswell node: 1.93
- + 1x K40: 7.87 (4.1x)
- + 0.5x K80: 8.96 (4.6x)
- + 1x K80: 10.49 (5.4x)
- + 1x M6000: 11.76 (6.1x)
- + 2x K40: 13.67 (7.1x)
- + 2x K80: 15.38 (8.0x)
- + 2x M6000: 14.90 (7.7x)
AMBER 14: FACTOR IX ON K40s, K80s, AND M6000s
PME-FactorIX_NVE, same configuration; simulated time (ns/day):
- 1 Haswell node: 9.68
- + 1x K40: 33.59 (3.5x)
- + 0.5x K80: 40.48 (4.2x)
- + 1x K80: 47.80 (5.0x)
- + 1x M6000: 50.70 (5.2x)
- + 2x K40: 60.93 (6.3x)
- + 2x K80: 66.89 (7.0x)
- + 2x M6000: 61.18 (6.4x)
AMBER 14: JAC ON M40s
PME-JAC_NVE, simulated time (ns/day). The CPU node contains a single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; GPU nodes contain a single Intel Xeon E5-2697 v2 @ 2.70 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs:
- 1 Node: 21.11
- 1 Node + 1x M40: 157.68 (7.5x)
- 1 Node + 2x M40: 230.18 (10.9x)
- 1 Node + 4x M40: 246.15 (11.7x)
AMBER 14: CELLULOSE ON M40s
PME-Cellulose_NPT, same configuration; simulated time (ns/day):
- 1 Node: 1.07
- 1 Node + 1x M40: 10.12 (9.5x)
- 1 Node + 2x M40: 14.40 (13.5x)
- 1 Node + 4x M40: 15.90 (14.9x)
AMBER 14: MYOGLOBIN ON M40s
GB-Myoglobin, same configuration; simulated time (ns/day):
- 1 Node: 9.83
- 1 Node + 1x M40: 232.20 (23.6x)
- 1 Node + 2x M40: 300.86 (30.6x)
- 1 Node + 4x M40: 322.09 (32.8x)
AMBER 14: NUCLEOSOME ON M40s
GB-Nucleosome, same configuration; simulated time (ns/day):
- 1 Node: 0.13
- 1 Node + 1x M40: 4.67 (35.9x)
- 1 Node + 2x M40: 9.05 (69.6x)
- 1 Node + 4x M40: 16.11 (123.9x)
GROMACS 5.1 (October 2015): 3M WATERS ON K40s AND K80s
Water [PME] 3M, GROMACS 5.1, simulated time (ns/day). The CPU node contains Dual Intel E5-2698 v3 @ 2.3 GHz CPUs; GPU nodes add NVIDIA Tesla K40 @ 875 MHz or Tesla K80 @ 562 MHz (autoboost) GPUs:
- 1 Haswell node: 0.81
- + 1x K40: 1.32 (1.6x)
- + 1x K80: 1.88 (2.3x)
- + 2x K40: 1.85 (2.3x)
- + 2x K80: 2.72 (3.4x)
- + 4x K40: 2.76 (3.4x)
- + 4x K80: 3.23 (4.0x)
GROMACS 5.1: 3M WATERS ON TITAN X
Water [PME] 3M, simulated time (ns/day); Dual Intel E5-2698 v3 @ 2.3 GHz CPU node + GeForce GTX Titan X @ 1000 MHz GPUs:
- 1 Haswell node: 0.81
- + 1x Titan X: 1.53 (1.9x)
- + 2x Titan X: 2.36 (2.9x)
- + 4x Titan X: 2.99 (3.7x)
TESLA P100 ACCELERATORS
- Tesla P100 for NVLink-enabled servers: 5.3 TF DP, 10.6 TF SP, 21 TF HP; 720 GB/s memory bandwidth, 16 GB
- Tesla P100 for PCIe-based servers: 4.7 TF DP, 9.3 TF SP, 18.7 TF HP; config 1: 16 GB, 720 GB/s; config 2: 12 GB, 540 GB/s
PAGE MIGRATION ENGINE
Unified Memory spanning the CPU and Tesla P100: simpler parallel programming, virtually unlimited data size, performance with data locality.
EXTRAORDINARY STRONG SCALING
One strong node faster than lots of weak nodes.
- AMBER simulation of CRISPR, nature's tool for genome editing (AMBER 16 pre-release, based on PDB ID 5f9r, 336,898 atoms): a single node with 2x/4x/8x P100 GPUs delivers more ns/day than 32-48 CPU nodes of the Comet supercomputer (chart: ns/day vs. number of processors).
- VASP performance with 4 P100 GPUs: a single P100 PCIe node matches many weak CPU server nodes, with speedups up to roughly 10x vs. one CPU server node (chart: speedup vs. number of CPU server nodes; VASP 5.4.1_05Feb16, Si-Huge dataset; the 16- and 32-node points are estimated from the 4-to-8-node scaling).

CPU: dual-socket Intel E5-2680 v3, 12 cores, 128 GB DDR4 per node, FDR InfiniBand.

MASSIVE LEAP IN PERFORMANCE
Speedup vs. a dual-socket Broadwell CPU server (Xeon E5-2697 v4, 2.3 GHz, 3.6 GHz Turbo) for 2x K80, 2x P100 (PCIe), and 4x P100 (PCIe) configurations on NAMD, VASP, HOOMD-Blue, and AMBER (chart; 0-30x scale). Caffe AlexNet: batch size of 256 images; VASP, NAMD, HOOMD-Blue, and AMBER: average speedup across a basket of tests.

TCO OVERVIEW
EQUAL PERFORMANCE IN FEWER RACKS; A SMALLER, MORE EFFICIENT BUDGET
- 1 rack of 20 CPUs + 80 P100s (32 kW) delivers the same performance as 7 racks of 960 CPUs (260 kW) on VASP, or 9 racks of 1280 CPUs (350 kW) on Caffe AlexNet.
- The budget shifts toward compute: 85% compute servers / 15% non-compute for the GPU configuration, vs. 61% / 39% (rack, cabling, infrastructure, networking) for CPU-only.
- Chart scale: 2-10 racks (80-400 kW).
Source: Microsoft Research on datacenter costs.

NVIDIA DGX-1: WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER
- 42.5 TFLOPS FP64 / 85 TFLOPS FP32 / 170 TFLOPS FP16
- 8x Tesla P100 16 GB, NVLink hybrid cube mesh
- Accelerates major AI frameworks
- Dual Xeon; 7 TB SSD deep learning cache
- Dual 10 GbE, quad InfiniBand 100 Gb
- 3RU, 3200 W

THANK YOU! Q&A: WHAT CAN I HELP YOU BUILD?