Accelerating Materials Simulation Calculations on GPUs

Francois Courteille, Senior Solution Architect, Accelerated Computing

AGENDA
- Tesla, the NVIDIA compute platform
- Applications overview
- Material science applications at the start of the Pascal era

THE WORLD LEADER IN VISUAL COMPUTING

GAMING | AUTO | ENTERPRISE | HPC & CLOUD | TECHNOLOGY

OUR TEN YEARS IN HPC (2006-2016)

Milestones along the timeline:
- CUDA launched
- World's first GPU Top500 system
- Fermi: world's first HPC GPU
- Discovered how H1N1 mutates to resist drugs
- World's first 3-D mapping of the human genome
- Oak Ridge deploys world's fastest supercomputer with GPUs
- Stanford builds AI machine using GPUs
- AlexNet beats expert code by a huge margin using GPUs
- Google outperforms humans in ImageNet
- World's first atomic model of the HIV capsid
- GPU-trained AI machine beats world champion in Go

CREDENTIALS BUILT OVER TIME

- The majority of HPC applications are GPU-accelerated: 410 accelerated applications and growing (113 in 2011, 206 in 2012, 242 in 2013, 287 in 2014, 370 in 2015, 410 in 2016), spanning academia, games, finance, manufacturing, internet, oil & gas, national labs, automotive, defense, and M&E
- 300K CUDA developers: 4x growth in 4 years
- 100% of deep learning frameworks are GPU-accelerated: Torch, Caffe, Theano, MatConvNet, Mocha.jl, Purine, Minerva, MXNet*, Big Sur, TensorFlow, Watson, CNTK

www.nvidia.com/appscatalog

END-TO-END TESLA PRODUCT FAMILY

- HYPERSCALE HPC (Tesla M4, M40): hyperscale deployment for DL training, inference, video and image processing
- MIXED-APPS HPC (Tesla K80): HPC data centers running a mix of CPU and GPU workloads
- STRONG-SCALING HPC (Tesla P100): hyperscale and HPC data centers running apps that scale to multiple GPUs
- FULLY INTEGRATED DL SUPERCOMPUTER (DGX-1): for customers who need to get going now with a fully integrated solution

TESLA FOR SIMULATION

LIBRARIES DIRECTIVES LANGUAGES

ACCELERATED COMPUTING TOOLKIT

TESLA ACCELERATED COMPUTING

OVERVIEW OF LIFE & MATERIAL ACCELERATED APPS

MD: All key codes are GPU-accelerated, with great multi-GPU performance; the focus is on dense (up to 16) GPU nodes and/or large numbers of GPU nodes.
GPU-accelerated and available today: ACEMD*, AMBER (PMEMD)*, CHARMM, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more.

QC: All key codes are ported or being optimized; the focus is on GPU-accelerated math libraries and OpenACC directives.
GPU-accelerated and available today: ABINIT, ACES III, ADF, BAND, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*.
Active GPU acceleration projects: CASTEP, GAMESS, ONETEP, Quantum Supercharger Library*, VASP & more.

green* = application where >90% of the workload runs on the GPU

MD VS. QC ON GPUS

"Classical" Molecular Dynamics vs. Quantum Chemistry (MO, PW, DFT, semi-empirical):
- MD simulates positions of atoms over time (chemical-biological or chemical-material behaviors); QC calculates electronic properties (ground state, excited states, spectral properties, making/breaking bonds, physical properties)
- MD forces are calculated from simple empirical formulas (bond rearrangement generally forbidden); QC forces are derived from the wave function (bond rearrangement OK, e.g., bond energies)
- MD handles up to millions of atoms; QC up to a few thousand atoms
- MD includes solvent without difficulty; QC is generally in vacuum, but if needed solvent is treated classically (QM/MM) or with implicit methods
- MD is single-precision dominated; in QC double precision is important
- MD uses cuBLAS, cuFFT, CUDA; QC uses cuBLAS, cuFFT, OpenACC
- MD runs on GeForce (academics) or Tesla (servers) with ECC off; for QC, Tesla with ECC on is recommended

GPU-ACCELERATED QUANTUM CHEMISTRY APPS
Green lettering indicates performance slides included

ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, MOPAC2012, NWChem, Octopus, ONETEP, PEtot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, TeraChem, VASP, WL-LSMS

GPU performance compared against a dual multi-core CPU socket.

ABINIT

6th International ABINIT Developer Workshop

FROM RESEARCH TO INDUSTRY April 15-18, 2013 - Dinard, France

USING ABINIT ON GRAPHICS PROCESSING UNITS (GPU)

Marc Torrent (CEA, DAM, DIF, F-91297 Arpajon, France); Florent Dahm (C-S, 22 av. Galilée, 92350 Le Plessis-Robinson)

With a contribution from Yann Pouillon for the build system

ABINIT ON GPU: OUTLINE

- Introduction
  - Introduction to GPU computing: architecture, programming model
  - Density-Functional Theory with plane waves: how to port it to GPU
- ABINIT on GPU
  - Implementation
  - Performance of ABINIT v7.0
- How to…
  - Compile
  - Use

LOCAL OPERATOR: FAST FOURIER TRANSFORMS

Applying the local effective potential couples plane-wave components through a convolution, which is evaluated by going to real space and back:

$\langle \mathbf{g} \,|\, v_{\mathrm{eff}} \,|\, \tilde\psi \rangle = \int d\mathbf{r}\; e^{-i\mathbf{g}\cdot\mathbf{r}}\, v_{\mathrm{eff}}(\mathbf{r}) \sum_{\mathbf{g}'} c(\mathbf{g}')\, e^{i\mathbf{g}'\cdot\mathbf{r}}$

- Our choice (for a first version): replace the FFT at the MPI plane-wave level by an FFT computation on the GPU
- Included in the CUDA package: the cuFFT library, interfaced with C and Fortran (see the sketch below)
- A system-dependent efficiency is expected: small FFTs underuse the GPU and are dominated by the time required to transfer data to/from the GPU
- Within PAW, small plane-wave bases are used
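As a rough illustration of the pattern described above (not ABINIT's fourwf routine), the sketch below applies a local potential to a wave function with cuFFT: inverse-FFT the plane-wave coefficients to the real-space grid, multiply point-wise by v_eff, and FFT back. The grid size, data layout, and trivial test data are assumptions made for illustration only.

```cpp
// Minimal cuFFT sketch: apply a local potential v_eff to a wave function
// stored on a 3-D reciprocal-space grid, i.e.
//   psi(r) = IFFT(c(g));  psi(r) *= v_eff(r);  c'(g) = FFT(psi(r))
#include <cufft.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

__global__ void apply_veff(cufftDoubleComplex* psi, const double* veff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // point-wise multiply in real space
        psi[i].x *= veff[i];
        psi[i].y *= veff[i];
    }
}

int main()
{
    const int nx = 64, ny = 64, nz = 64, n = nx * ny * nz;   // assumed FFT grid

    // Host data: plane-wave coefficients c(g) and local potential v_eff(r)
    std::vector<cufftDoubleComplex> c(n, {1.0, 0.0});
    std::vector<double> veff(n, 0.5);

    cufftDoubleComplex* d_psi; double* d_veff;
    cudaMalloc(&d_psi, n * sizeof(cufftDoubleComplex));
    cudaMalloc(&d_veff, n * sizeof(double));
    cudaMemcpy(d_psi, c.data(), n * sizeof(cufftDoubleComplex), cudaMemcpyHostToDevice);
    cudaMemcpy(d_veff, veff.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);

    cufftExecZ2Z(plan, d_psi, d_psi, CUFFT_INVERSE);          // c(g) -> psi(r)
    apply_veff<<<(n + 255) / 256, 256>>>(d_psi, d_veff, n);   // v_eff(r) * psi(r)
    cufftExecZ2Z(plan, d_psi, d_psi, CUFFT_FORWARD);          // back to g-space

    cudaMemcpy(c.data(), d_psi, n * sizeof(cufftDoubleComplex), cudaMemcpyDeviceToHost);
    printf("c'(g=0) = (%g, %g)\n", c[0].x / n, c[0].y / n);   // cuFFT is unnormalized

    cufftDestroy(plan);
    cudaFree(d_psi); cudaFree(d_veff);
    return 0;
}
```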

LOCAL OPERATOR: FAST FOURIER TRANSFORMS (continued)

Implementation:
- CUDA/Fortran interface; new routines that encapsulate the cuFFT calls
- Development of 3 small kernels
- Optimization of memory copies to overlap computation and data transfer

Performance of fourwf (option 2) is highly dependent on the FFT size.
[Chart: speedup vs. FFT size. Test case: 107-gold-atom cell, TGCC-Curie (Intel Westmere + NVIDIA M2090).]

LOCAL OPERATOR: FAST FOURIER TRANSFORMS (continued)

Improvement: multiple-band optimization
- The CUDA kernels are used more intensively and fewer transfers are required
- See the bandpp variable, used when the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) algorithm is selected

fourwf time on GPU (s), TGCC-Curie (Intel Westmere + NVIDIA M2090):
- 107 gold atoms: bandpp 1: 112.94; 2: 84.55; 4: 70.55; 8: 61.91; 18: 54.79; 72: 50.55
- 31 copper atoms: bandpp 1: 36.02; 2: 20.79; 4: 14.14; 10: 9.16; 20: 9.50; 100: 7.94

NON-LOCAL OPERATOR

Implementation:
- Two CUDA kernels for the projections $\langle \tilde p_i | \tilde\psi \rangle = \sum_{\mathbf{g}'} \tilde p_i(\mathbf{g}')\, \langle \mathbf{g}' | \tilde\psi \rangle$
- Two CUDA kernels for $\langle \mathbf{g} | V_{NL} | \tilde\psi \rangle = \sum_{i,j} \tilde p_i(\mathbf{g})\, D_{ij}\, \langle \tilde p_j | \tilde\psi \rangle$
- Collaborative memory zones are used as much as possible (threads linked to the same plane wave lie in the same collaborative zone)
- Used for the non-local operator, the overlap operator, forces and the stress tensor, and the PAW occupation matrix
- Works in addition to all MPI parallelization levels
- Needs a large number of plane waves and a large number of non-local projectors

Performance (non-local operator only), nonlop CPU time (s) / nonlop GPU time (s) / speedup GPU vs. CPU:
- 107 gold atoms: 3142 / 1122 / 2.8
- 31 copper atoms: 85 / 42 / 2.0
- 3 UO2 atoms: 63 / 36 / 1.8

TGCC-Curie (Intel Westmere + NVIDIA M2090)

ITERATIVE EIGENSOLVER: LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient)

Algorithm 1 (LOBPCG):
Require: $\Psi^{(0)} = \{\psi_1^{(0)}, \dots, \psi_m^{(0)}\}$ close to the minimum and a preconditioner $K$; $P^{(0)} = \{P_1, \dots, P_m\}$ is initialized to 0.
1: for $i = 0, 1, \dots$ do
2:   $\Lambda^{(i)} = \Lambda(\Psi^{(i)})$
3:   $R^{(i)} = H\Psi^{(i)} - \Lambda^{(i)} O \Psi^{(i)}$
4:   $W^{(i)} = K R^{(i)}$
5:   The Rayleigh-Ritz method is applied within the subspace $\Xi = \{\Psi_1^{(i)}, \dots, \Psi_m^{(i)}, W_1^{(i)}, \dots, W_m^{(i)}, P_1^{(i)}, \dots, P_m^{(i)}\}$
6:   $\Psi^{(i+1)} = \alpha^{(i)} \Psi^{(i)} + \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
7:   $P^{(i+1)} = \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
8: end for

Total time / time in linear algebra (s) as a function of bandpp:
- bandpp 1: 674 / 119
- bandpp 2: 612 / 116
- bandpp 10: 1077 / 428
- bandpp 50: 5816 / 5167

- Matrix-matrix and matrix-vector products; diagonalization/orthogonalization within blocks
- These products are not parallelized; the size of the matrices increases when band parallelization is used
- The size of the vectors/matrices increases with the number of bands/plane waves per CPU core
- Need for a free LAPACK package on GPU
- Our choice: cuBLAS for the products, MAGMA for the diagonalization (see the sketch below)
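As a hedged sketch of the kind of block operation handed to cuBLAS at this step (not ABINIT's actual code), the example below builds a subspace overlap matrix S = Psi^H Psi for a block of bands with a single ZGEMM call; the matrix sizes and the uniform test data are assumptions for illustration.

```cpp
// Minimal cuBLAS sketch: subspace product S = Psi^H * Psi for a block of
// m bands expanded on npw plane waves (column-major storage, as in BLAS).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main()
{
    const int npw = 4096;   // plane waves per band (assumed)
    const int m   = 32;     // block size in bands, cf. bandpp (assumed)

    std::vector<cuDoubleComplex> psi(static_cast<size_t>(npw) * m,
                                     make_cuDoubleComplex(0.01, 0.0));
    std::vector<cuDoubleComplex> s(static_cast<size_t>(m) * m);

    cuDoubleComplex *d_psi, *d_s;
    cudaMalloc(&d_psi, psi.size() * sizeof(cuDoubleComplex));
    cudaMalloc(&d_s,   s.size()   * sizeof(cuDoubleComplex));
    cudaMemcpy(d_psi, psi.data(), psi.size() * sizeof(cuDoubleComplex),
               cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    // S (m x m) = Psi^H (m x npw) * Psi (npw x m)
    cublasZgemm(handle, CUBLAS_OP_C, CUBLAS_OP_N,
                m, m, npw,
                &one, d_psi, npw, d_psi, npw,
                &zero, d_s, m);

    cudaMemcpy(s.data(), d_s, s.size() * sizeof(cuDoubleComplex),
               cudaMemcpyDeviceToHost);
    printf("S(0,0) = %g\n", cuCreal(s[0]));   // expect npw * 0.01^2 = 0.4096

    cublasDestroy(handle);
    cudaFree(d_psi);
    cudaFree(d_s);
    return 0;
}
```

The block diagonalization itself would then be handed to a GPU LAPACK such as MAGMA, which is the choice named on the slide.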

MAGMA: MATRIX ALGEBRA ON GPU AND MULTICORE ARCHITECTURES

"The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current 'Multicore+GPU' systems. The MAGMA research is based on the idea that, to address the complex challenges of the emerging hybrid environments, optimal software solutions will themselves have to hybridize, combining the strengths of different algorithms within a single framework."

- A project by the Innovative Computing Laboratory, University of Tennessee
- "Free of use" project
- First stable release 25-08-2011; current version 1.3
- Goal: address hybrid environments; the computation is split so that the most consuming parts are offloaded to the GPU

ITERATIVE EIGENSOLVER

Speedup of the cuBLAS/MAGMA code on a single MPI process
[Chart: speedup (up to about 4x) vs. bandpp (block/band size) = 2, 4, 10 for 215 silicon atoms, 107 gold atoms, 31 copper atoms, and 3 UO2 atoms. TGCC-Curie (Intel Westmere + NVIDIA M2090).]
- Highly dependent on the physical system
- Depends only on the size of the blocks (bands) used in the LOBPCG algorithm

OUTLINE
- Introduction
  - Introduction to GPU computing: architecture, programming model
  - Density-Functional Theory with plane waves: how to port it to GPU
- ABINIT on GPU
  - Implementation
  - Performance of ABINIT v7.0
- How to…
  - Compile
  - Use

PERFORMANCE ON A SINGLE GPU

Comparing architectures:
- TGCC-Titane: Intel Nehalem + NVIDIA S1070 (Tesla)
- TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi)

Time on process 0 (s), CPU / CPU+GPU / speedup:
- 107 gold atoms: Titane 4230 / 1857 / 2.3; Curie 5154 / 621 / 8.3
- 31 copper atoms: Titane 153 / 102 / 1.5; Curie 235 / 64 / 3.7
- 5 BaTiO3 atoms: Titane 85 / 98 / 0.9; Curie 86 / 73 / 1.2

- Strongly dependent on the architecture (Titane: fast CPU + old GPU)
- The BaTiO3 case is too small to benefit from the GPU

PERFORMANCE ON MULTIPLE GPUs

TGCC-Curie (Intel Westmere + NVIDIA M2090, Fermi)

8 CPUs (MPI only) vs. 8 CPUs + 8 GPUs, time on process 0 (s), CPU / CPU+GPU / speedup:
- 215 silicon atoms (1 iter.): 16160 / 3680 / 4.4
- 107 gold atoms: 866 / 621 / 4.7
- 31 copper atoms: 49 / 64 / 1.5

200 CPUs (MPI only) vs. 200 CPUs + 200 GPUs:
- 215 silicon atoms (45 iter.): 25856 / 6309 / 4.1

- Less efficient when the number of MPI processes increases (the load per GPU decreases)

PERFORMANCE VS. OTHER CODES

- Other DFT codes have been ported to GPU; the speedups of plane-wave codes are similar
- Quantum ESPRESSO (plane waves): x3-x4 (sequential), x2-x3 (parallel) [1,2]
- VASP (plane waves): x3-x4 [3,4]
- BigDFT (wavelets): x5-x7 [5]
- GPAW (real space): x10-x15 [1,6]

[1] Harju et al., Applied Parallel and Scientific Computing, Lecture Notes in Computer Science 7782, p. 3 (2013)
[2] Spiga et al., 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), p. 368 (2012)
[3] Maintz et al., Computer Physics Communications 182, p. 1421 (2011)
[4] Hacene et al., Journal of Computational Chemistry 33, p. 2581 (2012)
[5] Genovese et al., Journal of Chemical Physics 131, 034103 (2009)
[6] Hakala et al., PARA 2012, LNCS vol. 7782, p. 63, Springer, Heidelberg (2013)

BigDFT

Gaussian
April 4-7, 2016 | Silicon Valley

ENABLING THE ELECTRONIC STRUCTURE PROGRAM GAUSSIAN ON GPGPUS USING OPENACC

Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian, Inc.), Giovanni Scalmani (Gaussian, Inc.), Brent Leback (NVIDIA/PGI)

PREVIOUSLY: EARLIER PRESENTATIONS
- GRC Poster 2012
- ACS Spring 2014
- GTC Spring 2014 (recording at http://on-demand.gputechconf.com/gtc/2014/video/S4613-enabling-gaussian-09-gpgpus.mp4)
- WATOC Fall 2014
- GTC Spring 2016 (full recording at http://mygtc.gputechconf.com/quicklink/4r13O5r; requires registration)

CURRENT STATUS: SINGLE NODE

Implemented

Energies for closed- and open-shell HF and DFT (fewer than a handful of XC functionals missing)

First derivatives for the same as above

Second derivatives for the same as above

Using only:

OpenACC

CUDA library calls (BLAS)

GAUSSIAN PARALLELISM MODEL

CPU cluster -> CPU node (OpenMP) -> GPU (OpenACC)

35 CLOSING REMARKS

- Significant progress has been made in enabling Gaussian on GPUs with OpenACC
- OpenACC is becoming increasingly versatile
- Significant work lies ahead to improve performance
- Expand the feature set: PBC, solvation, MP2, ONIOM, triples corrections

EARLY PERFORMANCE RESULTS (CCSD(T))

Ibuprofen CCSD(T) calculation; speedups relative to the CPU-only full node (bar labels are times in hours).
Method: CCSD(T); 33 atoms; 6-31G(d,p) basis; 315 basis functions; 41 occupied and 259 virtual orbitals; 15 cycles; 16 CCSD iterations.
[Chart: total, l913 ("triples"), and iterative CCSD times for 20 threads / 0 GPUs vs. 6 threads / 6 GPUs; time labels (hrs): 24.8, 3.7, 24.8, 3.6, 22.7, 1.3, 2.1, 2.3.]
System: 2 sockets E5-2690 v2 (2x 10 cores @ 3.0 GHz); 128 GB RAM (DDR3-1600), 108 GB used. GPUs: 6x Tesla K40m (15 SMs @ 875 MHz, 12 GB global memory).

VASP: GPU VASP COLLABORATION

Collaborators

2013-2014 project scope: minimization algorithms to calculate the electronic ground state
- Blocked Davidson (ALGO = NORMAL & FAST)
- RMM-DIIS (ALGO = VERYFAST & FAST)
- K-points
- Optimization of the critical step in exact-exchange calculations (U of Chicago)

Earlier work:
- Speeding up plane-wave electronic-structure calculations using graphics-processing units, Maintz, Eck, Dronskowski
- VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron, Hutchinson, Widom
- Accelerating VASP electronic structure calculations using graphic processing units, Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard

Target workloads:
- Silica ("medium"): 7 Å thick slab of amorphous silica, 240 atoms (Si68O148H24); RMM-DIIS (ALGO = VERYFAST)
- NiAl-MD ("large"): liquid-metal molecular dynamics sample of a nickel-based superalloy; 500 atoms, 9 chemical species total; Blocked Davidson (ALGO = NORMAL)
- INTERFACE ("large"): interface of platinum metal with water; 108 Pt atoms and 120 waters (468 atoms); Blocked Davidson & RMM-DIIS (ALGO = FAST)

RUNTIME DISTRIBUTION FOR SILICA (time in seconds for 1 K40 GPU + 1 IvyBridge core)

[Chart: runtime breakdown (CPU, memcopy, GEMM, FFT, other) for the original GPU port vs. the optimized GPU port, 0-3500 s.]

VASP 5.4.1 with patch #1 (Haswell & K80), March 2016

VASP Interface Benchmark

Interface dataset; elapsed time (s), lower is better. Running VASP version 5.4.1.
Blue node: Dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
"[Old Data]" = pre-bugfix patch #1 for vasp.5.4.1.05Feb15, which yields up to 2.7X faster calculations (patch #1 available at VASP.at).
Blocked Davidson + RMM-DIIS (ALGO=Fast).
Results: 1 Node (CPU only): 828.27 s; GPU configurations (1x, 2x, 2x [Old Data], and 4x K80 per node): 580.63 s (1.4X), 506.25 s (1.6X), 387.26 s (2.1X), 284.06 s (3.0X).

VASP Silica IFPEN Benchmark

Silica IFPEN dataset; elapsed time (s), lower is better. Running VASP version 5.4.1.
Blue node: Dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
"[Old Data]" = pre-bugfix patch #1 for vasp.5.4.1.05Feb15, which yields up to 2.7X faster calculations (patch #1 available at VASP.at).
RMM-DIIS (ALGO=Veryfast).
Results: 1 Node (CPU only): 542.87 s; GPU configurations (1x, 2x, 2x [Old Data], and 4x K80 per node): 432.48 s (1.3X), 372.74 s (1.5X), 258.14 s (2.1X), 210.47 s (2.6X).

VASP Si-Huge Benchmark

Si-Huge dataset; elapsed time (s), lower is better. Running VASP version 5.4.1.
Blue node: Dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
"[Old Data]" = pre-bugfix patch #1 for vasp.5.4.1.05Feb15, which yields up to 2.7X faster calculations (patch #1 available at VASP.at).
Blocked Davidson + RMM-DIIS (ALGO=Fast).
Results: 1 Node (CPU only): 6464.42 s; GPU configurations (1x, 2x, 2x [Old Data], and 4x K80 per node): 4296.15 s (1.5X), 2871.09 s (2.3X), 2865.92 s (2.3X), 1987.94 s (3.3X).

VASP SupportedSystems Benchmark

SupportedSystems dataset; elapsed time (s), lower is better. Running VASP version 5.4.1.
Blue node: Dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
"[Old Data]" = pre-bugfix patch #1 for vasp.5.4.1.05Feb15, which yields up to 2.7X faster calculations (patch #1 available at VASP.at).
Blocked Davidson + RMM-DIIS (ALGO=Fast).
Results: 1 Node (CPU only): 310.63 s; GPU configurations (1x, 2x, 2x [Old Data], and 4x K80 per node): 252.43 s (1.2X), 228.90 s (1.4X), 164.21 s (1.9X), 129.99 s (2.4X).

VASP NiAl-MD Benchmark

NiAl-MD dataset; elapsed time (s), lower is better. Running VASP version 5.4.1.
Blue node: Dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
"[Old Data]" = pre-bugfix patch #1 for vasp.5.4.1.05Feb15, which yields up to 2.7X faster calculations (patch #1 available at VASP.at).
Blocked Davidson + RMM-DIIS (ALGO=Fast).
Results: 1 Node (CPU only): 722.82 s; GPU configurations (1x, 2x, 2x [Old Data], and 4x K80 per node): 237.02 s (3.0X), 197.19 s (3.7X), 184.93 s (3.9X), 142.98 s (5.1X).

VASP B.hR105 Benchmark

B.hR105 dataset; elapsed time (s), lower is better. Running VASP version 5.4.1.
Blue node: Dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
"[Old Data]" = pre-bugfix patch #1 for vasp.5.4.1.05Feb15, which yields up to 2.7X faster calculations (patch #1 available at VASP.at).
Hybrid functional with blocked Davidson (ALGO=Normal).
Results: 1 Node (CPU only): 1196.86 s; GPU configurations (1x, 2x, 2x [Old Data], and 4x K80 per node): 604.05 s (2.0X), 334.44 s (3.6X), 343.69 s (3.5X), 236.13 s (5.1X).

VASP B.aP107 Benchmark

B.aP107 dataset; elapsed time (s), lower is better. Running VASP version 5.4.1.
Blue node: Dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
"[Old Data]" = pre-bugfix patch #1 for vasp.5.4.1.05Feb15, which yields up to 2.7X faster calculations (patch #1 available at VASP.at).
Hybrid functional calculation (exact exchange) with blocked Davidson (ALGO=Normal); no k-point parallelization.
Results: 1 Node (CPU only): 27993.62 s; GPU configurations (1x, 2x, 2x [Old Data], and 4x K80 per node): 7494.36 s (3.7X), 7632.31 s (3.7X), 4432.31 s (6.3X), 4513.40 s (6.2X).

GPU-ACCELERATED MOLECULAR DYNAMICS APPS

Green Lettering Indicates Performance Slides Included

ACEMD, AMBER, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, LAMMPS, mdcore, MELD, NAMD, OpenMM, PolyFTS

GPU performance compared against a dual multi-core x86 CPU socket.

NAMD 2.11: NEW GPU FEATURES
Selected text from the NAMD website

- GPU-accelerated simulations up to twice as fast as NAMD 2.10
- Pressure calculation with fixed atoms on GPU works as on CPU
- Improved scaling for GPU-accelerated particle-mesh Ewald calculation
- CPU-side operations overlap better and are parallelized across cores
- Improved scaling for GPU-accelerated simulations
- Nonbonded force calculation results are streamed from the GPU for better overlap (the generic technique is sketched below)
- NVIDIA CUDA GPU-acceleration binaries for Mac OS X
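The bullet about streaming results from the GPU refers to overlapping device computation with device-to-host transfers. The sketch below shows the generic CUDA pattern (streams, pinned host memory, asynchronous copies); it is not NAMD's code, and the kernel, sizes, and chunking are illustrative assumptions.

```cpp
// Generic CUDA overlap pattern: split the work into chunks and give each
// chunk its own stream, so copies of finished chunks stream back to the
// host while later chunks are still computing on the GPU.
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

__global__ void fake_force_kernel(float* f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i] = 0.5f * i;          // stand-in for a force calculation
}

int main()
{
    const int nChunks = 4, chunk = 1 << 20, n = nChunks * chunk;

    float *h_forces, *d_forces;
    cudaMallocHost(&h_forces, n * sizeof(float));   // pinned memory enables async copies
    cudaMalloc(&d_forces, n * sizeof(float));

    std::vector<cudaStream_t> streams(nChunks);
    for (auto& s : streams) cudaStreamCreate(&s);

    for (int c = 0; c < nChunks; ++c) {
        float* d = d_forces + c * chunk;
        float* h = h_forces + c * chunk;
        fake_force_kernel<<<(chunk + 255) / 256, 256, 0, streams[c]>>>(d, chunk);
        // This chunk's results stream back while later chunks still compute
        cudaMemcpyAsync(h, d, chunk * sizeof(float), cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();
    printf("f[1] = %f\n", h_forces[1]);

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFreeHost(h_forces);
    cudaFree(d_forces);
    return 0;
}
```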

NAMD 2.11 IS UP TO 2X FASTER

APoA1 (92,224 atoms), simulated time (ns/day): NAMD 2.11 vs. NAMD 2.10 on 1, 2, and 4 nodes, with speedup labels of 1.2X, 1.6X, and 2.0X.
Both the NAMD 2.10 and NAMD 2.11 runs use nodes with Dual Intel E5-2697 v2 @ 2.7 GHz (IvyBridge) CPUs + 2x Tesla K80 (autoboost) GPUs.

NAMD 2.11 APOA1 ON 1 AND 2 NODES

APoA1 (92,224 atoms), simulated time (ns/day). Running NAMD version 2.11.
Blue nodes: Dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
Results: 1 Node: 2.77; 1 Node + 1x K80: 11.67 (4.2X); 1 Node + 2x K80: 16.99 (6.1X); 2 Nodes: 5.22; 2 Nodes + 1x K80: 19.73 (3.8X); 2 Nodes + 2x K80: 24.31 (4.7X).

NAMD 2.11 APOA1 ON 4 AND 8 NODES

APoA1 (92,224 atoms), simulated time (ns/day). Running NAMD version 2.11.
Blue nodes: Dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
Results: 4 Nodes: 10.27; 4 Nodes + 1x K80: 20.64 (2.0X); 4 Nodes + 2x K80: 23.52 (2.3X); 8 Nodes: 16.85; 8 Nodes + 1x K80: 27.74 (1.6X); 8 Nodes + 2x K80: 27.83 (1.7X).

NAMD 2.11 STMV ON 1 AND 2 NODES

STMV (1,066,628 atoms), simulated time (ns/day). Running NAMD version 2.11.
Blue nodes: Dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
Results: 1 Node: 0.23; 1 Node + 1x K80: 1.03 (4.5X); 1 Node + 2x K80: 1.75 (7.6X); 2 Nodes: 0.46; 2 Nodes + 1x K80: 1.98 (4.3X); 2 Nodes + 2x K80: 3.27 (7.1X).

NAMD 2.11 STMV ON 4 AND 8 NODES

STMV (1,066,628 atoms), simulated time (ns/day). Running NAMD version 2.11.
Blue nodes: Dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; green nodes: same CPUs + Tesla K80 (autoboost) GPUs.
Results: 4 Nodes: 0.90; 4 Nodes + 1x K80: 3.61 (4.0X); 4 Nodes + 2x K80: 4.54 (5.0X); 8 Nodes: 1.74; 8 Nodes + 1x K80: 5.86 (3.4X); 8 Nodes + 2x K80: 6.24 (3.6X).

AMBER 14: JAC on K40s, K80s and M6000s

PME-JAC_NVE, simulated time (ns/day). Running AMBER version 14.
Blue node: Dual Intel E5-2698 v3 @ 2.3 GHz (3.6 GHz Turbo) CPUs; green nodes: same CPUs + either NVIDIA Tesla K40 @ 875 MHz, Tesla K80 @ 562 MHz (autoboost), or Quadro M6000 @ 987 MHz GPUs.
Results: 1 Haswell node (CPU only): 37.38; GPU configurations (1x K40, 0.5x K80, 1x K80, 1x M6000, 2x K40, 2x K80, 2x M6000): 121.30 (3.2X), 134.82 (3.6X), 161.53 (4.3X), 174.34 (4.7X), 200.34 (5.4X), 219.83 (5.9X), 225.34 (6.0X).

Cellulose on K40s, K80s and M6000s

PME-Cellulose_NVE, simulated time (ns/day). Running AMBER version 14.
Blue node: Dual Intel E5-2698 v3 @ 2.3 GHz (3.6 GHz Turbo) CPUs; green nodes: same CPUs + either NVIDIA Tesla K40 @ 875 MHz, Tesla K80 @ 562 MHz (autoboost), or Quadro M6000 @ 987 MHz GPUs.
Results: 1 Haswell node (CPU only): 1.93; GPU configurations (1x K40, 0.5x K80, 1x K80, 1x M6000, 2x K40, 2x K80, 2x M6000): 7.87 (4.1X), 8.96 (4.6X), 10.49 (5.4X), 11.76 (6.1X), 13.67 (7.1X), 14.90 (7.7X), 15.38 (8.0X).

Factor IX on K40s, K80s and M6000s

PME-FactorIX_NVE, simulated time (ns/day). Running AMBER version 14.
Blue node: Dual Intel E5-2698 v3 @ 2.3 GHz (3.6 GHz Turbo) CPUs; green nodes: same CPUs + either NVIDIA Tesla K40 @ 875 MHz, Tesla K80 @ 562 MHz (autoboost), or Quadro M6000 @ 987 MHz GPUs.
Results: 1 Haswell node (CPU only): 9.68; GPU configurations (1x K40, 0.5x K80, 1x K80, 1x M6000, 2x K40, 2x K80, 2x M6000): 33.59 (3.5X), 40.48 (4.2X), 47.80 (5.0X), 50.70 (5.2X), 60.93 (6.3X), 61.18 (6.4X), 66.89 (7.0X).

JAC on M40s

PME-JAC_NVE, simulated time (ns/day). Running AMBER version 14.
Blue node: Single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; green nodes: Single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.
Results: 1 Node (CPU only): 21.11; 1 Node + 1x M40: 157.68 (7.5X); 1 Node + 2x M40: 230.18 (10.9X); 1 Node + 4x M40: 246.15 (11.7X).

Cellulose on M40s

PME-Cellulose_NPT, simulated time (ns/day). Running AMBER version 14.
Blue node: Single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; green nodes: Single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.
Results: 1 Node (CPU only): 1.07; 1 Node + 1x M40: 10.12 (9.5X); 1 Node + 2x M40: 14.40 (13.5X); 1 Node + 4x M40: 15.90 (14.9X).

Myoglobin on M40s

GB-Myoglobin, simulated time (ns/day). Running AMBER version 14.
Blue node: Single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; green nodes: Single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.
Results: 1 Node (CPU only): 9.83; 1 Node + 1x M40: 232.20 (23.6X); 1 Node + 2x M40: 300.86 (30.6X); 1 Node + 4x M40: 322.09 (32.8X).

Nucleosome on M40s

GB-Nucleosome, simulated time (ns/day). Running AMBER version 14.
Blue node: Single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; green nodes: Single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.
Results: 1 Node (CPU only): 0.13; 1 Node + 1x M40: 4.67 (35.9X); 1 Node + 2x M40: 9.05 (69.6X); 1 Node + 4x M40: 16.11 (123.9X).

GROMACS 5.1 (October 2015): 3M Waters on K40s and K80s

Water [PME], 3M waters; simulated time (ns/day). Running GROMACS version 5.1.
Blue node: Dual Intel E5-2698 v3 @ 2.3 GHz CPUs; green nodes: same CPUs + either NVIDIA Tesla K40 @ 875 MHz or Tesla K80 @ 562 MHz (autoboost) GPUs.
Results: 1 Haswell node (CPU only): 0.81; GPU configurations (1x, 2x, or 4x K40 or K80 per node): 1.32 (1.6X), 1.85 (2.3X), 1.88 (2.3X), 2.72 (3.4X), 2.76 (3.4X), 3.23 (4.0X).

3M Waters on Titan X

Water [PME], 3M waters; simulated time (ns/day). Running GROMACS version 5.1.
Blue node: Dual Intel E5-2698 v3 @ 2.3 GHz CPUs; green nodes: same CPUs + GeForce GTX Titan X @ 1000 MHz GPUs.
Results: 1 Haswell node: 0.81; 1 Node + 1x Titan X: 1.53 (1.9X); 1 Node + 2x Titan X: 2.36 (2.9X); 1 Node + 4x Titan X: 2.99 (3.7X).

TESLA P100 ACCELERATORS

Tesla P100 for NVLink-enabled servers: 5.3 TF DP, 10.6 TF SP, 21 TF HP; 720 GB/s memory bandwidth, 16 GB.
Tesla P100 for PCIe-based servers: 4.7 TF DP, 9.3 TF SP, 18.7 TF HP; config 1: 16 GB, 720 GB/s; config 2: 12 GB, 540 GB/s.

PAGE MIGRATION ENGINE

Unified Memory spanning the CPU and Tesla P100: simpler parallel programming, virtually unlimited data size, performance with data locality (sketched below).
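A minimal sketch of the Unified Memory programming model that the Page Migration Engine supports (generic CUDA, not tied to any of the applications above); the array size and kernel are illustrative assumptions.

```cpp
// Unified Memory sketch: one pointer is valid on both CPU and GPU, and pages
// migrate on demand; on Pascal the dataset may exceed GPU memory.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(double* x, int n, double a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 24;
    double* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(double));   // single allocation, visible to CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0;      // initialize on the CPU

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0);  // pages migrate to the GPU on access
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                 // and back to the CPU on access
    cudaFree(x);
    return 0;
}
```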

EXTRAORDINARY STRONG SCALING: One Strong Node Faster Than Lots of Weak Nodes

[Left chart: AMBER simulation of CRISPR, nature's tool for genome editing; ns/day vs. number of processors (CPUs and GPUs, up to 32); 1 node with 4 P100 GPUs compared against the Comet supercomputer's CPU server nodes.]
[Right chart: VASP performance, a single P100 PCIe node (2x, 4x, 8x P100) vs. lots of weak nodes; speedup vs. one CPU server node plotted against the number of CPU server nodes (up to about 100), with 32- and 48-CPU-node marks for comparison.]
CPU: dual-socket Intel E5-2680 v3, 12 cores, 128 GB DDR4 per node, FDR IB. VASP 5.4.1_05Feb16, Si-Huge dataset; 16- and 32-node points are estimated from the same scaling observed from 4 to 8 nodes. AMBER 16 pre-release; CRISPR model based on PDB ID 5f9r, 336,898 atoms.

MASSIVE LEAP IN PERFORMANCE

[Chart: speedup vs. a dual-socket Broadwell CPU node (0x-30x) for 2x K80, 2x P100 (PCIe), and 4x P100 (PCIe) configurations, across NAMD, VASP, HOOMD-Blue, and AMBER.]
CPU: Xeon E5-2697 v4, 2.3 GHz (3.6 GHz Turbo). Caffe AlexNet: batch size of 256 images; VASP, NAMD, HOOMD-Blue, and AMBER: average speedup across a basket of tests.

TCO OVERVIEW

EQUAL PERFORMANCE IN FEWER RACKS
- 1 rack: 20 CPUs + 80 P100s / 32 kW
- 7 racks: VASP, 960 CPUs / 260 kW
- 9 racks: Caffe AlexNet, 1280 CPUs / 350 kW
[Chart axis: 2 to 10 racks (80 kW to 400 kW).]

BUDGET: SMALLER, MORE EFFICIENT
[Pie charts: compute servers vs. non-compute (rack, cabling, infrastructure, networking); splits of 85%/15% and 61%/39% are shown.]
Sources: Microsoft Research on datacenter costs.

NVIDIA DGX-1: WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER

- 42.5 TFLOPS FP64 / 85 TFLOPS FP32 / 170 TFLOPS FP16
- 8x Tesla P100 16 GB, NVLink hybrid cube mesh
- Accelerates major AI frameworks
- Dual Xeon, 7 TB SSD deep learning cache
- Dual 10 GbE, quad IB 100 Gb
- 3RU, 3200 W

THANK YOU! Q&A: WHAT CAN I HELP YOU BUILD?
