Updated: February 4, 2013

Molecular Dynamics (MD) Applications

Each entry lists: application; features supported; GPU perf; release status; notes/benchmarks.

AMBER: PMEMD explicit solvent & GB implicit solvent. Perf: >100 ns/day (JAC NVE on 2x K20s). Status: Released; AMBER 12, GPU support revision 12.2; multi-GPU, multi-node. Benchmarks: http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM: Implicit (5x) and explicit (2x) solvent via OpenMM. Perf: 2x C2070 equal 32-35x X5667 CPUs. Status: Released; release C37b1; single and multi-GPU in a single node. Notes: http://www.charmm.org/news/c37b1.html#postjump

DL_POLY: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV. Perf: 4x. Status: Released (source only); release V 4.03; multi-GPU, multi-node. Notes: results published, http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS: Implicit (5x) and explicit (2x) solvent. Perf: 165 ns/day (DHFR on 4x C2075s). Status: Released; release 4.6, first multi-GPU support; multi-GPU, multi-node.

LAMMPS: Lennard-Jones, Gay-Berne, Tersoff, and many more potentials. Perf: 3.5-18x on Titan. Status: Released; multi-GPU, multi-node. Benchmarks: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD: Full electrostatics with PME and most simulation features. Perf: 4.0 ns/day (F1-ATPase on 1x K20X). Status: Released; NAMD 2.9; multi-GPU, multi-node. Notes: 100M-atom capable.

New/Additional MD Applications Ramping

Abalone: Simulations. Perf: 4-29X (on 1060 GPU). Status: Released, Version 1.8.51; single GPU. Notes: Agile Molecule, Inc.

Ascalaph: Computation of non-valent interactions. Perf: 4-29X (on 1060 GPU). Status: Released, Version 1.1.4; single GPU. Notes: Agile Molecule, Inc.

ACEMD: Production bio-molecular dynamics (MD) software specially optimized to run on GPUs. Perf: 150 ns/day (DHFR on 1x K20). Status: Released; single and multi-GPU. Notes: written for use only on GPUs.

Folding@Home: Powerful distributed-computing molecular dynamics system; implicit solvent and folding. Perf: depends upon number of GPUs. Status: Released; GPUs and CPUs. Notes: http://folding.stanford.edu; GPUs get 4X the points of CPUs.

GPUGrid.net: High-performance all-atom biomolecular simulations; explicit solvent and binding. Perf: depends upon number of GPUs. Status: Released; NVIDIA GPUs only. Notes: http://www.gpugrid.net/

HALMD: Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations). Perf: up to 66x on 2090 vs. 1 CPU core. Status: Released, Version 0.2.0; single GPU. Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue: Written for use only on GPUs. Perf: Kepler 2X faster than Fermi. Status: Released, Version 0.11.2; single and multi-GPU on 1 node; multi-GPU w/ MPI in March 2013. Notes: http://codeblue.umich.edu/hoomd-blue/

OpenMM: Implicit and explicit solvent, custom forces. Perf: implicit 127-213 ns/day, explicit 18-55 ns/day (DHFR). Status: Released, Version 4.1.1; multi-GPU. Notes: library and application for high-performance molecular dynamics.

GPU perf is compared against a multi-core x86 CPU socket and benchmarked on GPU-supported features; it may be a kernel-to-kernel performance comparison.

Quantum Chemistry Applications

Abinit: Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization. Perf: 1.3-2.7X. Status: Released, Version 7.0.5; multi-GPU support. Notes: www.abinit.org

ACES III: Integrating GPU scheduling into the SIAL programming language and SIP runtime environment. Perf: 10X on kernels. Status: Under development; multi-GPU support. Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF: Fock matrix, Hessians. Perf: TBD. Status: Under development (pilot project completed); multi-GPU support. Notes: www.scm.com

BigDFT: DFT; Daubechies wavelets; part of Abinit. Perf: 5-25X (1 CPU core to GPU kernel). Status: Released June 2009, current release 1.6.0; multi-GPU support. Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf

Casino: Features TBD. Perf: TBD. Status: Under development, Spring 2013 release; multi-GPU support. Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CP2K: DBCSR (sparse matrix multiply library). Perf: 2-7X. Status: Under development; multi-GPU support. Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US: Libqc with Rys quadrature algorithm, Hartree-Fock; MP2 and CCSD in Q4 2012. Perf: 1.3-1.6X; 2.3-2.9x HF. Status: Released; next release Q4 2012; multi-GPU support. Notes: http://www.msg.ameslab.gov/gamess/index.html


GAMESS-UK: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics and inorganics. Perf: 8x. Status: Released in 2012; multi-GPU support. Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian: Joint PGI, NVIDIA & Gaussian collaboration. Perf: TBD. Status: Under development; multi-GPU support. Notes: announced Aug. 29, 2011, http://www.gaussian.com/g_press/nvidia_press.htm

GPAW: Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS). Perf: 8x. Status: Released; multi-GPU support. Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)

Jaguar: Investigating GPU acceleration. Perf: TBD. Status: Under development; multi-GPU support. Notes: Schrodinger, Inc., http://www.schrodinger.com/kb/278

MOLCAS: CUBLAS support. Perf: 1.1x. Status: Released, Version 7.8; single GPU; additional GPU support coming in Version 8. Notes: www.molcas.org

MOLPRO: Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT. Perf: 1.7-2.3X projected. Status: Under development; multiple GPUs. Notes: www.molpro.net, Hans-Joachim Werner


MOPAC2009: Pseudodiagonalization, full diagonalization, and density matrix assembling. Perf: 3.8-14X. Status: Under development; single GPU. Notes: academic port, http://openmopac.net

NWChem: Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers. Perf: 3-10X projected. Status: In development, release targeting March 2013; multiple GPUs. Notes: GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus: DFT and TDDFT. Perf: TBD. Status: Released. Notes: http://www.tddft.org/programs/octopus/

PEtot: Density functional theory (DFT) plane-wave pseudopotential calculations. Perf: 6-10X. Status: Released; multi-GPU. Notes: first-principles materials code that computes the behavior of the electron structures of materials.

Q-CHEM: RI-MP2. Perf: 8x-14x. Status: Released, Version 4.0. Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf


QMCPACK: Main features. Perf: 3-4x. Status: Released; multiple GPUs. Notes: NCSA, University of Illinois at Urbana-Champaign, http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs. Perf: 2.5-3.5x. Status: Released, Version 5.0; multiple GPUs. Notes: created by the Irish Centre for High-End Computing, http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem: "Full GPU-based solution"; completely redesigned to exploit GPU parallelism. Perf: 44-650X vs. GAMESS CPU version. Status: Released, Version 1.5; multi-GPU/single node. Notes: YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP: Hybrid Hartree-Fock DFT functionals including exact exchange. Perf: 2x; 2 GPUs comparable to 128 CPU cores. Status: Available on request; multiple GPUs. Notes: by Carnegie Mellon University, http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS: Generalized Wang-Landau method. Perf: 3x with 32 GPUs vs. 32 (16-core) CPUs. Status: Under development; multi-GPU support. Notes: NICS Electronic Structure Determination Workshop 2012, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

GPU perf is compared against a multi-core x86 CPU socket and benchmarked on GPU-supported features; it may be a kernel-to-kernel performance comparison.

Viz, "Docking" and Related Applications Growing


Amira 5®: 3D visualization of volumetric data and surfaces. Perf: 70x. Status: Released, Version 5.3.3; single GPU. Notes: visualization from Visage Imaging; the next release, 5.4, will use the GPU for general-purpose processing in some functions. http://www.visageimaging.com/overview.html

BINDSURF: High-throughput parallel blind virtual screening. Perf: 100X. Status: Available upon request to authors; single GPU. Notes: allows fast processing of large ligand databases. http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE: Empirical free-energy forcefield. Perf: 6.5-13.4X. Status: Released; single GPU. Notes: University of Bristol, http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping: GPU-accelerated application. Perf: 3.75-5000X. Status: Released, Suite 2011; single and multi-GPU. Notes: Schrodinger, Inc., http://www.schrodinger.com/products/14/32/

FastROCS: Real-time shape similarity searching/comparison. Perf: 800-3000X. Status: Released; single and multi-GPU. Notes: OpenEye Scientific Software, http://www.eyesopen.com/fastrocs

PyMol: Lines (460% increase), cartoons (1246%), surfaces (1746%), spheres (753%), ribbons (426%). Perf: 1700x. Status: Released, Version 1.5; single GPU. Notes: http://pymol.org/

VMD: High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple GPU support for display of molecular orbitals. Perf: 100-125X or greater on kernels. Status: Released, Version 1.9. Notes: visualization from University of Illinois at Urbana-Champaign, http://www.ks.uiuc.edu/Research/vmd/

GPU perf is compared against a multi-core x86 CPU socket and benchmarked on GPU-supported features; it may be a kernel-to-kernel performance comparison.

Bioinformatics Applications

BarraCUDA: Alignment of short sequencing reads. Speedup: 6-10x. Status: Version 0.6.2, 3/2012; multi-GPU, multi-node. Website: http://seqbarracuda.sourceforge.net/

CUDASW++: Parallel Smith-Waterman database search. Speedup: 10-50x. Status: Version 2.0.8, Q1/2012; multi-GPU, multi-node. Website: http://sourceforge.net/projects/cudasw/

CUSHAW: Parallel, accurate long-read aligner for large genomes. Speedup: 10x. Status: Version 1.0.40, 6/2012; multi-GPU. Website: http://cushaw.sourceforge.net/

GPU-BLAST: Protein alignment according to BLASTP. Speedup: 3-4x. Status: Version 2.2.26, 3/2012; single GPU. Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

GPU-HMMER: Parallel local and global search of hidden Markov models. Speedup: 60-100x. Status: Version 2.3.2, Q1/2012; multi-GPU, multi-node. Website: http://www.mpihmmer.org/installguideGPUHMMER.htm

mCUDA-MEME: Scalable motif discovery algorithm based on MEME. Speedup: 4-10x. Status: Version 3.0.12; multi-GPU, multi-node. Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

SeqNFind: Hardware and software for reference assembly, BLAST, Smith-Waterman, HMM, de novo assembly. Speedup: 400x. Status: Released; multi-GPU, multi-node. Website: http://www.seqnfind.com/

UGENE: Fast short-read alignment. Speedup: 6-8x. Status: Version 1.11, 5/2012; multi-GPU, multi-node. Website: http://ugene.unipro.ru/

WideLM: Parallel linear regression on multiple similarly-shaped models. Speedup: 150x. Status: Version 0.1-1, 3/2012; multi-GPU, multi-node. Website: http://insilicos.com/products/widelm

GPU speedup is compared against the same or similar code running on a single-CPU machine; performance measured internally or independently.

MD Average Speedups

[Chart: average speedup relative to a CPU-only node for CPU + K10, K20, K20X, 2x K10, 2x K20, and 2x K20X. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1 or 2 NVIDIA K10, K20, or K20X GPUs.]

Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases. Error bars show the maximum and minimum speedup for each hardware configuration.
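A minimal sketch of the aggregation used in the chart above: per-configuration speedups are averaged across the individual test cases, with error bars at the min and max. The per-test numbers below are placeholders for illustration, not the measured data behind the chart.

```python
# How the "MD Average Speedups" chart is built (a sketch): average the
# per-test speedups for each hardware configuration; error bars show the
# min and max. The values below are hypothetical, not measured data.
from statistics import mean

per_test_speedups = {
    # hypothetical per-test speedups vs. the CPU-only node
    "CPU + K20X": [7.4, 2.9, 6.2, 3.0],
    "CPU + 2x K20X": [7.8, 5.1, 6.6, 3.5],
}
for config, tests in per_test_speedups.items():
    print(f"{config}: avg {mean(tests):.1f}x, "
          f"min {min(tests):.1f}x, max {max(tests):.1f}x")
```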

Built from Ground Up for GPUs

What: Study disease and discover drugs; predict drug and protein interactions.

Why: Speed of simulations is critical. GPUs enable study of longer timeframes, larger systems, and more simulations.

How: GPUs increase throughput and accelerate simulations.

GPU-ready applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem.

AMBER 11 example: 4.6x performance increase with 2 GPUs at only a 54% added cost.*
• AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node).
• Cost of a CPU node assumed to be $9,333; cost of adding two C2090s to a single node assumed to be $5,333.
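As a rough check on the arithmetic above, here is a minimal Python sketch using the slide's assumed pricing; the slide's quoted 54% added cost depends on exact vendor pricing, which varies.

```python
# Back-of-envelope check of the AMBER 11 price/performance claim,
# using the slide's assumed pricing. Actual pricing varies by vendor.
cpu_node_cost = 9333.0    # 2x E5670 CPU node (assumed on the slide)
gpu_addon_cost = 5333.0   # adding 2x Tesla C2090 (assumed on the slide)
speedup = 4.6             # AMBER 11 Cellulose NPT, GPU node vs. CPU node

gpu_node_cost = cpu_node_cost + gpu_addon_cost
added_cost_fraction = gpu_addon_cost / cpu_node_cost
perf_per_dollar_gain = speedup * cpu_node_cost / gpu_node_cost

print(f"Added cost: {added_cost_fraction:.0%}")  # ~57% with these inputs
print(f"Throughput per dollar: {perf_per_dollar_gain:.1f}x the CPU-only node")
```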

AMBER 12, GPU Support Revision 12.2 (1/22/2013)

Kepler: Our Fastest Family of GPUs Yet

[Chart: Factor IX, ns/day, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA M2090, K10, K20, or K20X. CPU only: 3.42 ns/day; + M2090: 11.85 (3.5x); + K10: 18.90 (5.6x); + K20: 22.44 (6.6x); + K20X: 25.39 (7.4x).]

GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.
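The speedup labels in this chart can be reproduced from the reported ns/day values; a small sketch (note the slide rounds the K10 figure to 5.6x):

```python
# Recomputing the speedup labels in the Factor IX chart above from the
# ns/day values reported on the slide (AMBER 12, GPU support rev. 12.1).
ns_per_day = {
    "CPU only (2x E5-2687W)": 3.42,
    "CPU + M2090": 11.85,
    "CPU + K10": 18.90,
    "CPU + K20": 22.44,
    "CPU + K20X": 25.39,
}
baseline = ns_per_day["CPU only (2x E5-2687W)"]
for config, throughput in ns_per_day.items():
    print(f"{config}: {throughput:5.2f} ns/day ({throughput / baseline:.1f}x)")
```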

K10 Accelerates Simulations of All Sizes

[Chart: speedup vs. CPU-only, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA K10 GPU. Speedups: TRPcage (GB) 2.00; JAC NVE (PME) 5.50; Factor IX NVE (PME) 5.53; Cellulose NVE (PME) 5.04; Myoglobin (GB) 19.98; Nucleosome (GB) 24.00.]

Gain 24x performance (Nucleosome) by adding just 1 GPU, compared to dual-CPU performance.

K20 Accelerates Simulations of All Sizes

[Chart: speedup vs. CPU-only, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU. Speedups: TRPcage (GB) 2.66; JAC NVE (PME) 6.50; Factor IX NVE (PME) 6.56; Cellulose NVE (PME) 7.28; Myoglobin (GB) 25.56; Nucleosome (GB) 28.00.]

Gain 28x throughput/performance (Nucleosome) by adding just one K20 GPU, compared to dual-CPU performance.

Source: AMBER Benchmark Report, Revision 2.0, dated Nov. 5, 2012.

K20X Accelerates Simulations of All Sizes

[Chart: speedup vs. CPU-only, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA K20X GPU. Speedups: TRPcage (GB) 2.79; JAC NVE (PME) 7.15; Factor IX NVE (PME) 7.43; Cellulose NVE (PME) 8.30; Myoglobin (GB) 28.59; Nucleosome (GB) 31.30.]

Gain 31x performance (Nucleosome) by adding just one K20X GPU, compared to dual-CPU performance.

K10 Strong Scaling over Nodes

[Chart: Cellulose, 408K atoms (NPT), ns/day vs. number of nodes (1, 2, 4), running AMBER 12 with CUDA 4.2, ECC off. The blue nodes contain 2x Intel X5670 CPUs (6 cores per CPU); the green nodes add 2x NVIDIA K10 GPUs. GPU speedups: 5.1x at 1 node, 3.6x at 2 nodes, 2.4x at 4 nodes.]

GPUs significantly outperform CPUs while scaling over multiple nodes.

Kepler: Universally Faster

[Chart: speedup vs. CPU-only for JAC, Factor IX, and Cellulose, running AMBER 12 GPU Support Revision 12.1. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes contain dual E5-2687W CPUs and 1x NVIDIA K10, K20, or K20X GPU.]

The Kepler GPUs accelerated all simulations, up to 8x.

K10 Extreme Performance

[Chart: DHFR (JAC), 23K atoms (NVE), ns/day, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green node adds 2x NVIDIA K10 GPUs. CPU only: 12.47 ns/day; with 2x K10: 97.99 ns/day.]

Gain 7.8X performance by adding just 2 GPUs, compared to dual-CPU performance.

K20 Extreme Performance

[Chart: DHFR (JAC), 23K atoms (NVE), ns/day, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); the green node adds 2x NVIDIA K20 GPUs. CPU only: 12.47 ns/day; with 2x K20: 95.59 ns/day.]

Gain > 7.5X throughput/performance by adding just 2 K20 GPUs, compared to dual-CPU performance.

Replace 8 Nodes with 1 K20 GPU

[Chart: DHFR ns/day and system cost, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU): 65.00 ns/day for $32,000. The green node contains 2x Intel E5-2687W CPUs plus 1x NVIDIA K20 GPU: 81.09 ns/day for $6,500. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.]

Cut down simulation costs to ¼ and gain higher performance.
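A small sketch of the cost-efficiency arithmetic, pairing the chart's ns/day values with the listed prices (typical pricing; actual pricing varies):

```python
# Throughput per dollar for the two DHFR setups above, using the
# chart's ns/day values and listed prices (typical pricing; actual
# pricing varies by vendor and configuration).
configs = {
    "8 CPU nodes (2x E5-2687W each)": (65.00, 32_000),
    "1 CPU node + 1x K20": (81.09, 6_500),
}
for name, (ns_per_day, cost_usd) in configs.items():
    per_k = ns_per_day / cost_usd * 1_000
    print(f"{name}: {ns_per_day:.2f} ns/day, ${cost_usd:,} "
          f"-> {per_k:.1f} ns/day per $1000")
```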

Replace 7 Nodes with 1 K10 GPU

[Chart: JAC NVE performance and cost, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU): $32,000 total. The green node contains 2x Intel E5-2687W CPUs plus 1x NVIDIA K10 GPU: $7,000. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration.]

Cut down simulation costs to ¼ and increase performance by 70%.

Extra CPUs Decrease Performance

[Chart: Cellulose NVE, ns/day, running AMBER 12 GPU Support Revision 12.1. The orange bars use one E5-2687W CPU (8 cores); the blue bars use dual E5-2687W CPUs; each is shown CPU-only and with dual K20s.]

When used with GPUs, dual CPU sockets perform worse than single CPU sockets.

Kepler: Greener Science

[Chart: energy (kJ) used in simulating 1 ns of DHFR (JAC), running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 1x NVIDIA K10, K20, or K20X GPU (235W each). Lower is better; energy expended = power x time.]

The GPU-accelerated systems use 65-75% less energy.
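A worked example of the "energy expended = power x time" formula used throughout these energy charts. The TDP figures come from the slide; the ns/day throughputs below are placeholders, not measured values.

```python
# Worked example of "energy expended = power x time". TDPs are from the
# slide; the ns/day throughputs are hypothetical, for illustration only.
SECONDS_PER_DAY = 86_400.0

def energy_per_ns_kj(power_watts: float, ns_per_day: float) -> float:
    """Energy in kJ to simulate 1 ns at a sustained power draw."""
    seconds_per_ns = SECONDS_PER_DAY / ns_per_day
    return power_watts * seconds_per_ns / 1_000.0

cpu_power = 2 * 150.0          # dual E5-2687W at 150 W each (per slide)
gpu_power = cpu_power + 235.0  # plus one K20X at 235 W (per slide)

cpu_only = energy_per_ns_kj(cpu_power, 12.0)   # hypothetical ns/day
with_gpu = energy_per_ns_kj(gpu_power, 90.0)   # hypothetical ns/day
print(f"CPU only: {cpu_only:.0f} kJ/ns; with K20X: {with_gpu:.0f} kJ/ns")
print(f"Energy saved: {1 - with_gpu / cpu_only:.0%}")
```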

Recommended GPU Node Configuration for AMBER Computational Chemistry
Workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)
- CPU speed: 2.66 GHz+
- System memory per node: 16 GB
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1-2 (4 GPUs on 1 socket works well for 4 fast serial GPU runs)
- GPU memory preference: 6 GB
- GPU-to-CPU connection: PCIe 2.0 x16 or higher
- Server storage: 2 TB
- Network configuration: InfiniBand QDR or better
Scale to multiple nodes with the same single-node configuration.

Benefits of GPU-Accelerated Computing for AMBER

Faster than CPU only systems in all tests

Most major compute-intensive aspects of classical MD have been ported

Large performance boost with marginal price increase

Energy usage cut by more than half

GPUs scale well within a node and over multiple nodes

K20 GPU is our fastest and lowest power high performance GPU yet

Try GPU-accelerated AMBER for free: www.nvidia.com/GPUTestDrive

NAMD 2.9

Kepler: Our Fastest Family of GPUs Yet

[Chart: ApoA1 (Apolipoprotein A1), ns/day, running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA M2090, K10, K20, or K20X. CPU only: 1.37 ns/day; + M2090: 2.63 (1.9x); + K10: 3.45 (2.5x); + K20: 3.57 (2.6x); + K20X: 4.00 (2.9x).]

GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node.

Source: NAMD Benchmark Report, Revision 2.0, dated Nov. 5, 2012.

Accelerates Simulations of All Sizes

[Chart: speedup vs. CPU-only, running NAMD 2.9 with CUDA 4.0, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node adds 1x NVIDIA K20 GPU. Speedups of 2.4-2.7x across ApoA1, F1-ATPase, and STMV.]

Gain 2.5x throughput/performance by adding just 1 GPU, compared to dual-CPU performance.

Kepler: Universally Faster

[Chart: speedup vs. CPU-only for F1-ATPase, ApoA1, and STMV, running NAMD version 2.9. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs. Average accelerations: 2.4x (1x K10), 2.6x (1x K20), 2.9x (1x K20X), 4.3x (2x K10), 4.7x (2x K20), 5.1x (2x K20X).]

The Kepler GPUs accelerate all simulations, up to 5x.

Outstanding Strong Scaling with Multi-STMV

[Chart: 100 STMV (a concatenation of 100 Satellite Tobacco Mosaic Viruses), ns/day on 32-768 nodes, running NAMD version 2.9. Each blue XE6 CPU node contains 1x AMD Opteron 1600 (16 cores per CPU); each green XK6 CPU+GPU node adds 1x NVIDIA X2090 GPU. GPU speedups: 2.7x-3.8x.]

Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers.

Replace 3 Nodes with 1 M2090 GPU

[Chart: F1-ATPase, ns/day and cost, running NAMD version 2.9. Four blue nodes, each with 2x Intel Xeon X5550 CPUs (4 cores per CPU): 0.63 ns/day for $8,000. One green node with 2x Intel Xeon X5550 CPUs and 1x NVIDIA M2090 GPU: 0.74 ns/day for $4,000. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration.]

Speedup of 1.2x for 50% of the cost.

K20 - Greener: Twice the Science Per Watt

[Chart: energy (kJ) used in simulating 1 nanosecond of ApoA1, running NAMD version 2.9. Each blue node contains dual E5-2687W CPUs; each green node contains 2x Intel Xeon X5550 CPUs (95W, 4 cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU). Lower is better; energy expended = power x time.]

Cut down energy usage by ½ with GPUs.

Kepler - Greener: Twice the Science/Joule

[Chart: energy (kJ) used in simulating 1 ns of STMV, running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 2x NVIDIA K10, K20, or K20X GPUs (235W each). Lower is better; energy expended = power x time.]

Cut down energy usage by ½ with GPUs.

Recommended GPU Node Configuration for NAMD Computational Chemistry
Workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed: 2.66 GHz+
- System memory per socket: 32 GB
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1-2
- GPU memory preference: 6 GB
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
- Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.

Summary/Conclusions: Benefits of GPU-Accelerated Computing

Faster than CPU only systems in all tests

Large performance boost with small marginal price increase

Energy usage cut in half

GPUs scale very well within a node and over multiple nodes

Tesla K20 GPU is our fastest and lowest power high performance GPU to date

Try GPU-accelerated NAMD for free: www.nvidia.com/GPUTestDrive

LAMMPS (Jan. 2013 or later)

More Science for Your Money

[Chart: Embedded Atom Model, speedup vs. CPU-only. The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU); the green nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W). Speedups range from 1.7x (1x K10) up to 5.5x (2x K20X).]

Experience performance increases of up to 5.5x with Kepler GPU nodes.

K20X, the Fastest GPU Yet

[Chart: speedup relative to CPU-alone. The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU); the green nodes add 2x NVIDIA M2090s, 1x K20X, or 2x K20X GPUs (235W).]

Experience performance increases of up to 6.2x with Kepler GPU nodes. One K20X performs as well as two M2090s.

Get a CPU Rebate to Fund Part of Your GPU Budget

[Chart: acceleration in loop-time computation by additional GPUs, normalized to CPU-only, running NAMD version 2.9. The blue node contains dual X5670 CPUs (6 cores per CPU); the green nodes contain dual X5570 CPUs (4 cores per CPU) and 1-4 NVIDIA M2090 GPUs. Speedups: 5.31 (1x M2090), 9.88 (2x), 12.9 (3x), 18.2 (4x).]

Increase performance 18x when compared to CPU-only nodes.

Cheaper CPUs used with GPUs still deliver faster overall performance than more expensive CPUs alone.

Excellent Strong Scaling on Large Clusters

[Chart: LAMMPS Gay-Berne, 134M atoms, loop time (seconds) vs. number of nodes (300-900), GPU-accelerated XK6 vs. CPU-only XE6. Speedups of 3.45x-3.55x.]

From 300-900 nodes, the NVIDIA GPU-powered XK6 maintained ~3.5x performance compared to XE6 CPU nodes.

Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090 GPU.

Sustain 5x Performance for Weak Scaling

[Chart: weak scaling with 32K atoms per node, loop time (seconds) vs. number of nodes (1-729). GPU-accelerated nodes deliver speedups of 4.8x-6.7x.]

Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.

Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090.

Faster, Greener: Worth It!

[Chart: energy (kJ) consumed in one loop of EAM for 1 node, 1 node + 1 K20X, and 1 node + 2x K20X. Lower is better. Energy expended = power x time; power calculated by combining the components' TDPs. The blue node uses 2x E5-2687W CPUs (8 cores and 150W per CPU) with CUDA 4.2.9; the green nodes add 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.36.]

GPU-accelerated computing uses 53% less energy than CPU-only.

Try GPU-accelerated LAMMPS for free: www.nvidia.com/GPUTestDrive

Molecular Dynamics with LAMMPS on a Hybrid Cray Supercomputer

W. Michael Brown, National Center for Computational Sciences, Oak Ridge National Laboratory

NVIDIA Technology Theater, Supercomputing 2012, November 14, 2012

Early Kepler Benchmarks on Titan

[Charts: loop time (seconds) vs. number of nodes for Atomic Fluid, Bulk Copper, Protein, and Liquid Crystal benchmarks, comparing XK6 (CPU-only), XK6+GPU, and XK7+GPU node configurations. The XK7+GPU nodes are consistently fastest.]

Early Titan XK6/XK7 Benchmarks

Speedup with acceleration on XK6/XK7 nodes (1 node = 32K particles; 900 nodes = 29M particles):

                   Atomic Fluid    Atomic Fluid    Bulk Copper   Protein   Liquid Crystal
                   (cutoff 2.5σ)   (cutoff 5.0σ)
XK6 (1 node)       1.92            4.33            2.12          2.6       5.82
XK7 (1 node)       2.90            8.38            3.66          3.36      15.70
XK6 (900 nodes)    1.68            3.96            2.15          1.56      5.60
XK7 (900 nodes)    2.75            7.48            2.86          1.95      10.14

Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed: 2.66 GHz+
- System memory per socket: 32 GB
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1-2
- GPU memory preference: 6 GB
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
- Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.

GROMACS 4.6 (Final, Pre-Beta, and Beta)

Kepler: Our Fastest Family of GPUs Yet

[Chart: water box (192K atoms), ns/day, running the GROMACS 4.6 final release. The blue nodes contain single or dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x NVIDIA M2090, K10, K20, or K20X. GPU speedups: 1.7x (M2090) up to 3.0x (K20X).]

A single Sandy Bridge CPU per node with a single K10, K20, or K20X produces the best performance.

Great Scaling in Small Systems

[Chart: RNase in water (16,816 atoms, truncated dodecahedron box), ns/day vs. number of nodes (1-3), running GROMACS 4.6 pre-beta with CUDA 4.1. Each blue node contains 1x Intel X5550 CPU (95W TDP, 4 cores per CPU); each green node adds 1x NVIDIA M2090 (225W TDP per GPU). GPU speedups of 3.2x-3.7x.]

Get up to 3.7x performance compared to CPU-only nodes.

Additional Strong Scaling on a Larger System

[Chart: 128K water molecules, ns/day vs. number of nodes (8-128), running GROMACS 4.6 pre-beta with CUDA 4.1. Each blue node contains 1x Intel X5670 (95W TDP, 6 cores per CPU); each green node adds 1x NVIDIA M2070 (225W TDP per GPU). GPU speedups of 2x-3.1x.]

Up to 128 nodes, NVIDIA GPU-accelerated nodes deliver 2-3x the performance of CPU-only nodes.

Replace 3 Nodes with 2 GPUs

[Chart: ADH in water (134K atoms), ns/day and cost, running GROMACS 4.6 pre-beta with CUDA 4.1. Four blue CPU nodes, each with 2x Intel X5550 CPUs (95W TDP, 4 cores per CPU): 6.7 ns/day for $8,000. One green node with 2x Intel X5550 CPUs and 2x NVIDIA M2090 GPUs (225W TDP per GPU): 8.36 ns/day for $6,500. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration.]

Save thousands of dollars and perform 25% faster.
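A quick check of this claim from the chart's numbers (typical pricing from the slide; actual pricing varies by vendor and configuration):

```python
# Quick check of the ADH comparison above using the chart's values.
cpu_ns_day, cpu_cost = 6.7, 8_000    # four CPU-only nodes
gpu_ns_day, gpu_cost = 8.36, 6_500   # one node + 2x M2090

print(f"{gpu_ns_day / cpu_ns_day:.2f}x throughput "
      f"at ${gpu_cost:,} vs. ${cpu_cost:,}")  # ~1.25x for less money
```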

Greener Science

[Chart: energy (kJ) expended in simulating each nanosecond of ADH in water (134K atoms), running GROMACS 4.6 with CUDA 4.1. Lower is better; energy expended = power x time. Four blue nodes with 2x Intel X5550 CPUs each (95W TDP, 4 cores per CPU) draw 760 watts; one green node with 2x Intel X5550 CPUs and 2x NVIDIA M2090 GPUs (225W TDP per GPU) draws 640 watts.]

In simulating each nanosecond, the GPU-accelerated system uses 33% less energy.

The Power of Kepler

[Chart: RNase solvated protein (24K atoms), ns/day, running GROMACS version 4.6 beta. The grey bars use 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU) with 1 or 2 NVIDIA M2090s; the green bars use the same CPUs with 1 or 2 NVIDIA K20X GPUs (235W each). Configurations: 1 CPU + 1 GPU, 1 CPU + 2 GPUs, 2 CPUs + 1 GPU, 2 CPUs + 2 GPUs.]

Upgrading an M2090 to a K20X increases performance 10-45%.

K20X: Fast

[Chart: RNase solvated protein (24K atoms), ns/day, running GROMACS version 4.6 beta, for 1 or 2 CPUs, CPU-only vs. with 1 K20X. The blue bars use 1 or 2 E5-2687W CPUs (150W each, 8 cores per CPU); the green bars add 1 NVIDIA K20X (235W).]

Adding a K20X increases performance by up to 3x.

K20X, the Fastest Yet

[Chart: 192K water molecules, ns/day, running GROMACS version 4.6-beta2 and CUDA 5.0.35. The blue node contains 2 E5-2687W CPUs (150W each, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA K20X GPUs (235W each).]

Using K20X nodes increases performance by 2.5x.

Try GPU-accelerated GROMACS 4.6 for free: www.nvidia.com/GPUTestDrive

Recommended GPU Node Configuration for GROMACS Computational Chemistry
Workstation or single-node configuration:
- # of CPU sockets: 2
- Cores per CPU socket: 6+
- CPU speed: 2.66 GHz+
- System memory per socket: 32 GB
- GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
- # of GPUs per CPU socket: 1x Kepler-based GPU (K20X, K20, or K10); needs a fast Sandy Bridge, the very fastest Westmeres, or a high-end AMD Opteron
- GPU memory preference: 6 GB
- GPU-to-CPU connection: PCIe 2.0 or higher
- Server storage: 500 GB or higher
- Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.

CHARMM Release C37b1

GPUs Outperform CPUs

[Chart: Daresbury Crambin (19.6K atoms), ns/day, running CHARMM release C37b1. Blue: 44 X5667 CPUs (95W, 4 cores per CPU), $44,000. Green: 2 X5667 CPUs plus 1x NVIDIA C2070 ($3,000) or 2x C2070 ($4,000), 238W each. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration.]

1 GPU = 15 CPUs.

More Bang for Your Buck

[Chart: Daresbury Crambin (19.6K atoms), performance/price scaled to the CPU-only setup, running CHARMM release C37b1. Blue: 44 X5667 CPUs (95W, 4 cores per CPU). Green: 2 X5667 CPUs plus 1 or 2 NVIDIA C2070 GPUs (238W). Note: typical CPU and GPU node pricing used.]

Using GPUs delivers 10.6x the performance for the same cost.

Greener Science with NVIDIA

[Chart: energy (kJ) used in simulating 1 ns of Daresbury G1nBP (61.2K atoms), running CHARMM release C37b1. Lower is better; energy expended = power x time. Blue: 64 X5667 CPUs (95W, 4 cores per CPU). Green: 2 X5667 CPUs plus 1 or 2 NVIDIA C2070 GPUs (238W each).]

Using GPUs will decrease energy use by 75%.

ACEMD (www.acellera.com)

470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms); 116 ns/day on 1 GPU for DHFR (23K atoms).

Supported features: NVT, NPT, PME, TCL, PLUMED, CAMSHIFT [1]. For a list of selected references, see http://www.acellera.com/acemd/publications

M. Harvey, G. Giupponi and G. De Fabritiis, "ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale," J. Chem. Theory Comput. 5, 1632 (2009)
[1] M. J. Harvey and G. De Fabritiis, "An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware," J. Chem. Theory Comput. 5, 2371-2377 (2009)

BigDFT

CP2K: Kepler, It's Faster

[Chart: performance relative to CPU-only, running CP2K version 12413-trunk on CUDA 5.0.36. The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W each).]

Using GPUs delivers up to 12.6x the performance per node.

Strong Scaling

[Chart: speedup relative to 256 non-GPU cores, conducted on a Cray XK6 using matrix-matrix multiplication (NREP=6, N=159,000 with 50% occupation), with and without GPUs at 256, 512, and 768 cores. Speedups: 2.3x (256), 2.9x (512), 3x (768).]

Speedups increase as more cores are added, up to 3x at 768 cores.

Kepler, Keeping the Planet Green

[Chart: energy expended (kJ), running CP2K version 12413-trunk on CUDA 5.0.36. Lower is better; energy expended = power x time. The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA K20 GPUs (235W each).]

Using K20s will lower energy use by over 75% for the same simulation.

GAUSSIAN

Gaussian is a key quantum chemistry code. An ACS Fall 2011 press release announced a joint collaboration between Gaussian, NVIDIA, and PGI for GPU acceleration (http://www.gaussian.com/g_press/nvidia_press.htm); no such release exists for Intel MIC or AMD GPUs.

"Calculations using Gaussian are limited primarily by the available computing resources," said Dr. Michael Frisch, president of Gaussian, Inc. "By coordinating the development of hardware, compiler technology and application software among the three companies, the new application will bring the speed and cost-effectiveness of GPUs to the challenging problems and applications that Gaussian's customers need to address."


GAMESS

GAMESS Partnership Overview

Mark Gordon and Andrey Asadchev, key developers of GAMESS, are collaborating with NVIDIA; Mark Gordon is a recipient of an NVIDIA Professor Partnership Award. Quantum chemistry is one of the major consumers of CPU cycles at national supercomputer centers, and NVIDIA developer resources are fully allocated to the GAMESS code.

"We like to push the envelope as much as we can in the direction of highly scalable efficient codes. GPU technology seems like a good way to achieve this goal. Also, since we are associated with a DOE Laboratory, energy efficiency is important, and this is another reason to explore quantum chemistry on GPUs."
Prof. Mark Gordon, Distinguished Professor, Department of Chemistry, Iowa State University, and Director, Applied Mathematical Sciences Program, Ames Laboratory

GAMESS August 2011 GPU Performance

First GPU-supported GAMESS release via "libqc", a library for fast quantum chemistry on multiple NVIDIA GPUs in multiple nodes, with CUDA software: two-electron AO integrals and their assembly into a closed-shell Fock matrix.

[Chart: GAMESS Aug. 2011 release relative performance for two small molecules, Ginkgolide (53 atoms) and Vancomycin (176 atoms), comparing 4x E5640 CPUs vs. 4x E5640 CPUs + 4x Tesla C2070s.]

Upcoming GAMESS Q4 2012 Release

Multi-node, multi-GPU support. Rys quadrature Hartree-Fock: 8 CPU cores + M2070 yields a 2.3-2.9x speedup over 8 CPU cores (see the 2012 publication). Møller-Plesset perturbation theory (MP2): preliminary code completed; paper in development. CCSD(T): CCSD code completed, (T) in progress.

GAMESS - New Multithreaded Hybrid CPU/GPU Approach to H-F

[Chart: Hartree-Fock GPU speedups*. Adding 1x C2070 GPU speeds up computations by 2.3x to 2.9x across Taxol (6-31G, 6-31G(d), 6-31G(2d,2p), 6-31++G(d,p)) and Valinomycin (6-31G, 6-31G(d), 6-31G(2d,2p)).]

* A. Asadchev, M.S. Gordon, "New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock," Journal of Chemical Theory and Computation (2012)

GPAW

Used with permission from Samuli Hakala.

NWChem

Speedup of the non-iterative calculation for various configurations/tile sizes.

System: cluster consisting of dual-socket nodes constructed from:

• 8-core AMD Interlagos processors • 64 GB of memory • Tesla M2090 (Fermi) GPUs

The nodes are connected using a high-performance QDR Infiniband interconnect

Courtesy of Kowalski, K., Bhaskaran-Nair, et al. @ PNNL, JCTC (submitted).

Quantum Espresso/PWscf: Kepler, Fast Science

[Chart: AUsurf, performance relative to CPU-only, running Quantum Espresso version 5.0-build7 on CUDA 5.0.36. The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU); the green nodes add 1 or 2 NVIDIA M2090 or K10 GPUs (225W and 235W respectively).]

Using K10s delivers up to 11.7x the performance per node over CPUs, and 1.7x the performance of M2090s.

Extreme Performance/Price from 1 GPU

[Chart: price and performance scaled to the CPU-only system, for Shilu-3 and Water-on-Calcite. Simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU.]

Adding a GPU can improve performance by 3.7x while only increasing price by 25%.

Extreme Performance/Price from 1 GPU (AUSURF112)

[Chart: price and performance scaled to the CPU-only system, for AUSURF112 at the k-point and at the gamma-point (a gold surface of 112 atoms). Simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU.]

Adding a GPU can improve performance by 3.5x while only increasing price by 25%.

Replace 72 CPUs with 8 GPUs

[Chart: LSMO-BFO (120 atoms), 8 k-points, elapsed time in minutes. Simulations run on PLX @ CINECA; Intel 6-core 2.66 GHz X5550s were used for the CPUs and NVIDIA M2070s for the GPUs. 120 CPUs ($42,000): 223 minutes; 48 CPUs + 8 GPUs ($32,800): 219 minutes.]

The GPU-accelerated setup performs faster and costs 24% less.

QE/PWscf: Green Science

[Chart: power consumption (watts) for LSMO-BFO (120 atoms), 8 k-points. Lower is better. Simulations run on PLX @ CINECA; Intel 6-core 2.66 GHz X5550s were used for the CPUs and NVIDIA M2070s for the GPUs. Configurations: 120 CPUs ($42,000) vs. 48 CPUs + 8 GPUs ($32,800).]

Over a year, the lower power consumption would save $4,300 on energy bills.
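Rough arithmetic behind the annual-savings figure; the sustained power difference and electricity price below are illustrative assumptions, since the slide states only the resulting dollar amount.

```python
# Rough annual energy-bill arithmetic. The power delta and electricity
# price are assumptions for illustration; the slide gives only the
# resulting ~$4,300 figure.
power_delta_kw = 4.9      # assumed sustained draw difference between setups
hours_per_year = 24 * 365
price_per_kwh = 0.10      # assumed electricity price, $/kWh

annual_savings = power_delta_kw * hours_per_year * price_per_kwh
print(f"Annual savings: ${annual_savings:,.0f}")  # ~$4,300
```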

NVIDIA GPUs Use Less Energy

[Chart: energy consumption (kWh) on different tests. Lower is better. Simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. GPU-accelerated reductions of 54-58% across Shilu-3, AUSURF112, and Water-on-Calcite.]

In all tests, the GPU-accelerated system consumed less than half the energy of the CPU-only system.

QE/PWscf: Great Strong Scaling in Parallel

[Chart: CdSe-159 cadmium selenide nanodot, walltime in seconds of 1 full SCF vs. nodes (2-14; 16-112 total CPU cores). Lower is better. Simulations run on STONEY @ ICHEC; two quad-core 2.87 GHz Intel X5560s were used in each node, plus two NVIDIA M2090s per node for the CPU+GPU test. GPU speedups: 2.1x-2.5x.]

Speedups up to 2.5x with GPU acceleration.

QE/PWscf: More Powerful Strong Scaling

[Chart: GeSnTe134, walltime in seconds of a full SCF vs. nodes (4-44; 48-528 total CPU cores). Lower is better. Simulations run on PLX @ CINECA; two 6-core 2.4 GHz Intel E5645s were used in each node, plus two NVIDIA M2070s per node for the CPU+GPU test. GPU speedups: 1.6x-2.4x.]

Accelerate your cluster by up to 2.1x with NVIDIA GPUs.

Try GPU-accelerated Quantum Espresso for free: www.nvidia.com/GPUTestDrive

TeraChem

Supercomputer Speeds on GPUs

[Chart: time (seconds) for an SCF step on the giant fullerene C240 molecule: TeraChem running on 8 C2050s in 1 node vs. NWChem running on 4096 quad-core CPUs in the Chinook supercomputer.]

Similar performance from just a handful of GPUs.

TeraChem: Bang for the Buck

[Chart: performance/price relative to the supercomputer, giant fullerene C240 molecule: TeraChem on 8 C2050s in 1 node ($31,000) vs. NWChem on 4096 quad-core CPUs in the Chinook supercomputer ($19,000,000). Relative performance/price: 493x. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration.]

Dollars spent on GPUs do ~500x more science than those spent on CPUs.
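A sketch of the price/performance arithmetic behind this claim; the prices are from the slide, while the SCF-step times are approximate readings from the preceding chart and should be treated as assumptions.

```python
# Sketch of the price/performance ratio behind the "~500x" claim.
# Prices are from the slide; the SCF-step times are assumed, read
# approximately off the preceding chart.
chinook_time_s, chinook_cost = 70.0, 19_000_000   # 4096 quad-core CPUs
gpu_time_s, gpu_cost = 90.0, 31_000               # 1 node, 8x C2050

# "Science per dollar" ~ 1 / (time x cost); take the ratio GPU vs. CPU.
ratio = (chinook_time_s * chinook_cost) / (gpu_time_s * gpu_cost)
print(f"GPU node delivers ~{ratio:.0f}x more science per dollar")
```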

Kepler's Even Better

[Charts: TeraChem SCF time (seconds) for Olestra (453 atoms) on C2050 vs. K20C; the first chart uses BLYP/6-31G(d), the second B3LYP/6-31G(d).]

Kepler performs 2x faster than Fermi-generation Tesla.

FastROCS

OpenEye Japan Hideyuki Sato, Ph.D.

© 2012 OpenEye Scientific Software

ROCS on the GPU: FastROCS

[Chart: shape overlays per second, CPU vs. GPU.]

Riding Moore's Law

[Chart: shape overlays per second across GPU generations: C1060, C2050, C2075, C2090, K10, K20.]

FastROCS Scaling Across 4x K10s

[Chart: conformers per second vs. number of individual K10 GPUs (1-8; each K10 has 2 physical GPUs on the board), over 53 million conformers (10.9 million compounds of PubChem at 5 conformers per molecule).]

Benefits of GPU-Accelerated Computing

2000000 Conformers per Second per Conformers 1000000 0 1 2 3 4 5 6 7 8 Number of individual K10 GPUs (Note, each K10 has 2 physical GPUs on the board) Benefits of GPU Accelerated Computing

Faster than CPU only systems in all tests

Large performance boost with marginal price increase

Energy usage cut by more than half

GPUs scale well within a node and over multiple nodes

K20 GPU is our fastest and lowest power high performance GPU yet

Try GPU-accelerated TeraChem for free: www.nvidia.com/GPUTestDrive

GPU Test Drive: Experience GPU Acceleration

For Computational Chemistry Researchers, Biophysicists

Preconfigured with Molecular Dynamics Apps

Remotely Hosted GPU Servers

Free & Easy – Sign up, Log in and See Results

www.nvidia.com/gputestdrive
