Outline of tutorial at MSI

I. Intro to GPUs

A. MSI plans to add significantly more GPUs in the future; the current system is described at: https://www.msi.umn.edu/hpc/cascade

B. MSI currently has: i. 8 Westmere nodes with 4 Fermi M2070 GPGPUs/node (Cascade queue) ii. 4 SandyBridge nodes with 2 Kepler K20m GPGPUs/node (Kepler queue)

II. Computational and Biology Applications

A. Applications, including: i. AMBER, NAMD, LAMMPS, GROMACS, CHARMM, , DL_POLY, OpenMM ii. GPU only codes: ACEMD, HOOMD-Blue

B. , including: i. Abinit, BigDFT, CP2K, , GAMESS, GPAW, NWChem, Quantum Espresso, -CHEM, VASP & more ii. GPU only code: TeraChem (ParaChem)

C. Bioinformatics i. NVBIO (nvBowtie), BWA-MEM (Both currently in development) update

Updated: Oct. 21, 2013 Founded 1993 Invented GPU 1999 – Computer Graphics Visual Computing, Supercomputing, Cloud & Mobile Computing NVIDIA - Core Technologies and Brands

GPU Mobile Cloud

® ® GeForce Tegra GRID Quadro® , Tesla® Accelerated Computing Multi-core plus Many-cores

GPU Accelerator CPU Optimized for Many Optimized for Parallel Tasks Serial Tasks

3-10X+ Comp Thruput 7X Memory Bandwidth 5x Energy Efficiency How GPU Acceleration Works

Application Code

Compute-Intensive Functions Rest of Sequential 5% of Code CPU Code GPU CPU

+ GPUs : Two Year Heart Beat

32 Volta Stacked DRAM 16 Maxwell Unified Virtual Memory 8 Kepler Dynamic Parallelism 4

Fermi 2 FP64

DP GFLOPS GFLOPS per DP Watt 1

Tesla 0.5 CUDA

2008 2010 2012 2014 Kepler Features Make GPU Coding Easier Hyper-Q Dynamic Parallelism Speedup Legacy MPI Apps Less Back-Forth, Simpler Code FERMI 1 Work Queue CPU Fermi GPU CPU Kepler GPU

KEPLER 32 Concurrent Work Queues Developer Momentum Continues to Grow

100M 430M CUDA –Capable GPUs CUDA-Capable GPUs

150K 1.6M CUDA Downloads CUDA Downloads

1 50 Supercomputer Supercomputers

60 640 University Courses University Courses

4,000 37,000 Academic Papers Academic Papers

2008 2013 Explosive Growth of GPU Accelerated Apps

# of Apps Top Scientific Apps 200 61% Increase Molecular AMBER LAMMPS CHARMM NAMD Dynamics GROMACS DL_POLY 150 Quantum QMCPACK Gaussian 40% Increase Quantum Espresso NWChem Chemistry GAMESS-US VASP

CAM-SE 100 Climate & COSMO NIM GEOS-5 Weather WRF

Chroma GTS 50 Denovo ENZO GTC MILC

ANSYS Mechanical ANSYS Fluent 0 CAE MSC Nastran OpenFOAM 2010 2011 2012 SIMULIA Abaqus LS-DYNA

Accelerated, In Development Overview of Life & Material Accelerated Apps MD: All key codes are available AMBER, CHARMM, DESMOND, DL_POLY, GROMACS, LAMMPS, NAMD GPU only codes: ACEMD, HOOMD-Blue Great multi-GPU performance Focus: scaling to large numbers of GPUs / nodes

QC: All key codes are ported/optimizing: Active GPU acceleration projects: Abinit, BigDFT, CP2K, GAMESS, Gaussian, GPAW, NWChem, Quantum Espresso, VASP & more GPU only code: TeraChem Analytical instruments – actively recruiting Bioinformatics – market development

GPU Test Drive Experience GPU Acceleration

For Researchers, Biophysicists

Preconfigured with Molecular Dynamics Apps

Remotely Hosted GPU Servers

Free & Easy – Sign up, Log in and See Results

www.nvidia.com/gputestdrive

11 Molecular Dynamics (MD) Applications Features Application GPU Perf Release Status Notes/Benchmarks Supported

AMBER 12, GPU Revision Support 12.2 PMEMD Explicit Solvent & GB > 100 ns/day JAC Released http://ambermd.org/gpus/benchmarks. AMBER Implicit Solvent NVE on 2X K20s Multi-GPU, multi-node htm#Benchmarks

2x C2070 equals Release C37b1; Implicit (5x), Explicit (2x) Solvent Released 32-35x X5667 http://www.charmm.org/news/c37b1.html#pos CHARMM via OpenMM Single & Multi-GPU in single node CPUs tjump

Two-body Forces, Link-cell Pairs, Source only, Results Published Release V 4.04 Ewald SPME forces, 4x http://www.stfc.ac.uk/CSE/randd/ccg/softwar DL_POLY Multi-GPU, multi-node Shake VV e/DL_POLY/25526.aspx

165 ns/Day DHFR Released Release 4.6.3; 1st Multi-GPU support Implicit (5x), Explicit (2x) on GROMACS Multi-GPU, multi-node www..org 4X C2075s

http://lammps.sandia.gov/bench.html#desktop Lennard-Jones, Gay-Berne, 3.5-18x on Released and LAMMPS Tersoff & many more potentials ORNL Titan Multi-GPU, multi-node http://lammps.sandia.gov/bench.html#titan

4.0 ns/day Released Full electrostatics with PME and NAMD 2.9 F1-ATPase on 100M atom capable NAMD most simulation features http://www.ks.uiuc.edu/Research/namd/ 1x K20X Multi-GPU, multi-node GPU Perf compared against Multi-core CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison New/Additional MD Applications Ramping Features Application GPU Perf Release Status Notes Supported

Production biomolecular dynamics (MD) software specially 150 ns/day DHFR on 1x Released Written for use only on GPUs optimized to run on GPUs ACEMD K20 Single and Multi-GPUs http://www.acellera.com/

Bonded, pair, excluded interactions; Van Released, Version der Waals, electrostatic, nonbonded far TBD http://www.schrodinger.com/productpage/14/3/ DESMOND 3.4.0/0.7.2 interactions

Powerful distributed computing molecular Depends upon number of Released http://folding.stanford.edu dynamics system; implicit solvent and Folding@Home GPUs GPUs and CPUs GPUs get 4X the points of CPUs folding

High-performance all-atom biomolecular Depends upon number of http://www.gpugrid.net/ Released GPUGrid.net simulations; explicit solvent and binding GPUs

Simple fluids and binary mixtures (pair Up to 66x on 2090 vs. 1 Released, Version 0.2.0 http://halmd.org/benchmarks.html#supercooled-binary- potentials, high-precision NVE and NVT, HALMD CPU core Single GPU mixture-kob-andersen dynamic correlations)

Kepler 2X faster than Released, Version 0.11.3 http://codeblue.umich.edu/hoomd-blue/ Written for use only on GPUs HOOMD-Blue Fermi Single and Multi-GPU on 1 node Multi-GPU w/ MPI in late 2013

mdcore TBD TBD Released, Version 0.1.7 http://mdcore.sourceforge.net/download.html

Implicit: 127-213 ns/day Library and application for molecular dynamics on high- Implicit and explicit solvent, custom Released, Version 5.2 Explicit: 18-55 performance OpenMM forces Multi-GPU ns/day DHFR https://simtk.org/home/openmm_suite GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison Quantum Chemistry Applications

Application Features Supported GPU Perf Release Status Notes

Local Hamiltonian, non-local Released; Version 7.4.1 www..org Hamiltonian, LOBPCG algorithm, 1.3-2.7X ABINIT Multi-GPU support diagonalization / orthogonalization

Integrating scheduling GPU into SIAL http://www.olcf.ornl.gov/wp- Under development programming language and SIP 10X on kernels content/training/electronic-structure- ACES III Multi-GPU support runtime environment 2012/deumens_ESaccel_2012.pdf

Available Q1 2014 Fock Matrix, Hessians TBD www.scm.com ADF Multi-GPU support

DFT; Daubechies , Released, Version 1.7.0 2-5X http://bigdft.org BigDFT part of Abinit Multi-GPU support

Code for performing quantum Monte Carlo (QMC) electronic structure Under development TBD http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html Casino calculations for finite and periodic Multi-GPU support systems

CASTEP TBD TBD Under development http://www.castep.org/Main/HomePage

http://www.olcf.ornl.gov/wp- Released DBCSR (spare matrix multiply library) 2-7X content/training/ascc_2012/friday/ACSS_2012_VandeVon CP2K Multi-GPU support dele_s.pdf

Libqc with Rys Quadrature Algorithm, 1.3-1.6X, Released http://www.msg.ameslab.gov/gamess/index.html GAMESS-US Hartree-Fock, MP2 and CCSD 2.3-2.9x HF Multi-GPU support GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison Quantum Chemistry Applications

Application Features Supported GPU Perf Release Status Notes

(ss|ss) type integrals within calculations using Hartree Fock ab initio methods and Released, Version 7.0 8x http://www.ncbi.nlm.nih.gov/pubmed/21541963 GAMESS-UK density functional theory. Supports Multi-GPU support organics & inorganics.

Under development Joint PGI, NVIDIA & Gaussian TBD Multi-GPU support http://www.gaussian.com/g_press/nvidia_press.htm Gaussian Collaboration

Electrostatic poisson equation, Released https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.ht GPAW orthonormalizing of vectors, residual 8x Multi-GPU support ml, Samuli Hakala (CSC Finland) & minimization method (rmm-diis) Chris O’Grady (SLAC)

Under development Schrodinger, Inc. Investigating GPU acceleration TBD Multi-GPU support http://www.schrodinger.com/kb/278 http://www.schrodinger.com/productpage/14/7/32/

http://on- Released demand.gputechconf.com/gtc/2013/presentations/S31 CU_BLAS, SP2 algorithm TBD LATTE Multi-GPU support 95-Fast-Quantum-Molecular-Dynamics-in-LATTE.pdf

Released, Version 7.8 MOLCAS CU_BLAS support 1.1x Single GPU; Additional GPU support www..org coming in Version 8

Density-fitted MP2 (DF-MP2), density Under development www..net fitted local correlation methods (DF-RHF, 1.7-2.3X projected MOLPRO Multiple GPU Hans-Joachim Werner DF-KS), DFT GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison Quantum Chemistry Applications Features Application GPU Perf Release Status Notes Supported

Pseudodiagonalization, full Released Academic port. diagonalization, and density matrix 3.8-14X MOPAC2013 available Q1 2014 MOPAC2012 http://openmopac.net assembling Single GPU

Development GPGPU benchmarks: www.- sw.org Triples part of Reg-CCSD(T), CCSD Released, Version 6.3 7-8X And http://www.olcf.ornl.gov/wp- NWChem & EOMCCSD task schedulers Multiple GPUs content/training/electronic-structure- 2012/Krishnamoorthy-ESCMA12.pdf

Full GPU support for ground-state, real-time calculations; Kohn-Sham Hamiltonian, orthogonalization, 1.5-8X Released, Version 4.1.0 http://www.tddft.org/programs/octopus/ subspace diagonalization, poisson solver, time propagation

ONETEP TBD TBD Under development http://www2.tcm.phy.cam.ac.uk/onetep/

Density functional theory (DFT) Released First principles materials code that computes the plane wave 6-10X PEtot Multi-GPU behavior of the structures of materials calculations

Released, Version 4.0.1 http://www.q- RI-MP2 8x-14x Q-CHEM Multi-GPU support chem.com/doc_for_web/qchem_manual_4.0.pdf

GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison Quantum Chemistry Applications

Features Application GPU Perf Release Status Notes Supported

NCSA Released University of Illinois at Urbana-Champaign Main features 3-4x QMCPACK Multiple GPUs http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU _version_of_QMCPACK

Created by Irish Centre for PWscf package: linear algebra Quantum Released, Version 5.0 High-End Computing (matrix multiply), explicit 2.5-3.5x Multiple GPUs http://www.quantum-espresso.org/index.php and computational kernels, 3D FFTs Espresso/PWscf http://www.quantum-espresso.org/

Completely redesigned to 44-650X vs. exploit GPU parallelism. YouTube: Released, Version 1.5k “Full GPU-based solution” GAMESS CPU http://youtu.be/EJODzk6RFxE?hd=1 and TeraChem Multi-GPU/single node http://www.olcf.ornl.gov/wp- version content/training/electronic-structure-2012/Luehr- ESCMA.pdf

2x Hybrid Hartree-Fock DFT 2 GPUs Available on request functionals including exact By Carnegie Mellon University VASP comparable to Multiple GPUs http://arxiv.org/pdf/1111.0716.pdf exchange 128 CPU cores

3x NICS Electronic Structure Determination Workshop 2012: Under development http://www.olcf.ornl.gov/wp- Generalized Wang-Landau method with 32 GPUs vs. 32 WL-LSMS Multi-GPU support content/training/electronic-structure- (16-core) CPUs 2012/Eisenbach_OakRidge_February.pdf

GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison Viz, “” and Related Applications Growing

Related Features Supported GPU Perf Release Status Notes Applications

3D of volumetric data Released, Version 5.5 Visualization from Visage Imaging. 70x Amira 5® and surfaces Single GPU http://www.visageimaging.com/overview.html

High-Throughput parallel blind Virtual Screening, Available upon request to Allows fast processing of large http://www.biomedcentral.com/1471- 100X authors BINDSURF ligand databases 2105/13/S14/S13 Single GPU

University of Bristol Empirical Free Released 6.5-13.4X http://www.bris.ac.uk/biochemistry/cpfg/bude/b BUDE Energy Forcefield Single GPU ude.htm

Released Schrodinger, Inc. GPU accelerated application 3.75-5000X Core Hopping Single and Multi-GPUs http://www.schrodinger.com/products/14/32/

Real-time shape similarity Released Open Eyes Scientific Software 800-3000X FastROCS searching/comparison Single and Multi-GPUs http://www.eyesopen.com/fastrocs GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison Viz, “Docking” and Related Applications Growing

Related Features Supported GPU Perf Release Status Notes Applications

Interactive Released Written for use only on GPUs TBD http://www.molecular-visualization.com Visualizer Multi-GPU support

Molegro Virtual Energy grid computation, pose evaluation and guided differential 25-30x Single GPU http://www.molegro.com/gpu.php Docker 6 evolution

PIPER Protein http://www.bu.edu/caadlab/gpgpu09.pdf; Molecule docking 3-8x Released Docking http://www.schrodinger.com/productpage/14/40/

Lines: 460% increase Cartoons: 1246% increase Released, Version 1.6 Surface: 1746% increase 1700x http://pymol.org/ PyMol Single GPUs Spheres: 753% increase Ribbon: 426% increase

High quality rendering, large structures (100 million atoms), 100-125X or greater Released, Version 1.9.1 analysis and visualization tasks, Visualization from University of Illinois at Urbana-Champaign VMD on kernels Multi-GPU support http://www.ks.uiuc.edu/Research/vmd/ multiple GPU support for display of molecular orbitals GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison “Organic Growth” of Bioinformatics Applications Features GPU Application Release Status Website Supported Speedup Alignment of short sequencing Released, Version 0.6.2 6-10x http://seqbarracuda.sourceforge.net/ BarraCUDA reads Multi-GPU, multi-node Parallel search of Smith- Released, Version 2.0.10 10-50x http://sourceforge.net/projects/cudasw/ CUDASW++ Waterman database Multi-GPU, multi-node Parallel, accurate long read Released, Version 1.0.40 10x http://cushaw.sourceforge.net/ CUSHAW aligner for large genomes Multiple-GPU Protein alignment according to Released, Version 2.2.26 http://eudoxus.cheme.cmu.edu/gpublast/gpubla 3-4x GPU-BLAST BLASTP Single GPU st.html

Scalable motif discovery Released, Version 3.0.13 https://sites.google.com/site/yongchaosoftware/ mCUDA-MEME 4-10x algorithm based on MEME Multi-GPU, multi-node mcuda-meme

Aligns multiple query sequences MUMmer GPU 3-10x http://sourceforge.net/apps/mediawiki/mummer against reference sequence in Released, Version 3.0 gpu/index.php?title=MUMmerGPU parallel

Hardware and software for Released reference assembly, blast, SW, 400x http://www.seqnfind.com/ SeqNFind Multi-GPU, multi-node HMM, de novo assembly Short read alignment tool that is SOAP3 not heuristic based; reports all 10x Beta release, Version 0.01 http://soap.genomics.org.cn/soap3.html answers Released, Version 1.12 Fast short read alignment 6-8x http://ugene.unipro.ru/ UGENE Multi-GPU, multi-node Parallel linear regression on Released, Version 0.1-1 multiple similarly-shaped 150x http://insilicos.com/products/widelm WideLM Multi-GPU, multi-node models GPU Perf compared against same or similar code running on single CPU machine Performance measured internally or independently

MD Average Speedups

The blue node contains Dual E5-2687W CPUs 10 (8 Cores per CPU).

The green nodes contain Dual E5-2687W CPUs

(8 Cores per CPU) and 1 or 2 NVIDIA K10, K20, 8 or K20X GPUs.

6

4 Performance Performance Relative to CPU Only 2

0 CPU CPU + K10 CPU + K20 CPU + K20X CPU + 2x K10 CPU + 2x K20 CPU + 2x K20X

Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases. Error bars show the maximum and minimum speedup for each hardware configuration. Molecular Dynamics (MD) Applications Features Application GPU Perf Release Status Notes/Benchmarks Supported

AMBER 12, GPU Revision Support 12.2 PMEMD Explicit Solvent & GB > 100 ns/day JAC Released http://ambermd.org/gpus/benchmarks. AMBER Implicit Solvent NVE on 2X K20s Multi-GPU, multi-node htm#Benchmarks

2x C2070 equals Release C37b1; Implicit (5x), Explicit (2x) Solvent Released 32-35x X5667 http://www.charmm.org/news/c37b1.html#pos CHARMM via OpenMM Single & Multi-GPU in single node CPUs tjump

Two-body Forces, Link-cell Pairs, Source only, Results Published Release V 4.04 Ewald SPME forces, 4x http://www.stfc.ac.uk/CSE/randd/ccg/softwar DL_POLY Multi-GPU, multi-node Shake VV e/DL_POLY/25526.aspx

165 ns/Day DHFR Released Release 4.6.3; 1st Multi-GPU support Implicit (5x), Explicit (2x) on GROMACS Multi-GPU, multi-node www.gromacs.org 4X C2075s

http://lammps.sandia.gov/bench.html#desktop Lennard-Jones, Gay-Berne, 3.5-18x on Released and LAMMPS Tersoff & many more potentials ORNL Titan Multi-GPU, multi-node http://lammps.sandia.gov/bench.html#titan

4.0 ns/day Released Full electrostatics with PME and NAMD 2.9 F1-ATPase on 100M atom capable NAMD most simulation features http://www.ks.uiuc.edu/Research/namd/ 1x K20X Multi-GPU, multi-node GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison New/Additional MD Applications Ramping Features Application GPU Perf Release Status Notes Supported

Production biomolecular dynamics (MD) software specially 150 ns/day DHFR on 1x Released Written for use only on GPUs optimized to run on GPUs ACEMD K20 Single and Multi-GPUs http://www.acellera.com/

Bonded, pair, excluded interactions; Van Released, Version der Waals, electrostatic, nonbonded far TBD http://www.schrodinger.com/productpage/14/3/ DESMOND 3.4.0/0.7.2 interactions

Powerful distributed computing molecular Depends upon number of Released http://folding.stanford.edu dynamics system; implicit solvent and Folding@Home GPUs GPUs and CPUs GPUs get 4X the points of CPUs folding

High-performance all-atom biomolecular Depends upon number of http://www.gpugrid.net/ Released GPUGrid.net simulations; explicit solvent and binding GPUs

Simple fluids and binary mixtures (pair Up to 66x on 2090 vs. 1 Released, Version 0.2.0 http://halmd.org/benchmarks.html#supercooled-binary- potentials, high-precision NVE and NVT, HALMD CPU core Single GPU mixture-kob-andersen dynamic correlations)

Kepler 2X faster than Released, Version 0.11.3 http://codeblue.umich.edu/hoomd-blue/ Written for use only on GPUs HOOMD-Blue Fermi Single and Multi-GPU on 1 node Multi-GPU w/ MPI in late 2013

mdcore TBD TBD Released, Version 0.1.7 http://mdcore.sourceforge.net/download.html

Implicit: 127-213 ns/day Library and application for molecular dynamics on high- Implicit and explicit solvent, custom Released, Version 5.1 Explicit: 18-55 performance OpenMM forces Multi-GPU ns/day DHFR https://simtk.org/home/openmm_suite GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison Built from Ground Up for GPUs Computational Chemistry

What Study disease & discover drugs Predict drug and protein interactions

Speed of simulations is critical GPU READY Why APPLICATIONS Enables study of: ACEMD Longer timeframes AMBER Larger systems DL_PLOY More simulations GAMESS GROMACS How GPUs increase throughput & accelerate simulations LAMMPS NAMD AMBER 11 Application NWChem 4.6x performance increase with 2 GPUs with Q-CHEM only a 54% added cost* Quantum Espresso TeraChem

• AMBER 11 Cellulose NPT on 2x E5670 CPUS + 2x Tesla C2090s (per node) vs. 2xcE5670 CPUs (per node) • Cost of CPU node assumed to be $9333. Cost of adding two (2) 2090s to single node is assumed to be $5333 AMBER 12

26 Ross Walker (AMBER developer) video

27 Performance

AMBER 12

SAN DIEGO SUPERCOMPUTER CENTER 28 Explicit Solvent Performance (JAC-DHFR NVE Production Benchmark)

2xK20X 102.13 1xK20X 89.13 2xK20 95.59 1xK20 81.09 4xK10 (2 card) 97.95 3xK10 (1.5 card) 82.70 2xK10 (1 card) 67.02 1xK10 (0.5 card) 52.94 Single < $1000 GPU 2xM2090 (1 node) 58.28 1xM2090 43.74 1xGTX Titan 111.34 4xGTX680 (1 node) 118.88 3xGTX680 (1 node) 101.31 2xGTX680 (1 node) 86.84 1xGTX680 72.49 1xGTX580 54.46 96xE5-2670 (6 nodes) 41.55 64xE5-2670 (4 nodes) 58.19 48xE5-2670 (3 nodes) 50.25 32xE5-2670 (2 nodes) 38.49 16xE5-2670 (1 node) 21.13 0.00 20.00 40.00 60.00 80.00 100.00 120.00 140.00

Kepler K20 GPU Kepler K10 GPU M2090 GPU GTX GPU Gordon CPU

SAN DIEGO SUPERCOMPUTER CENTER Throughput in ns/day 29 Protein Folding Simulation With AMBER Accelerated By GPUs 2.95x Faster

168 ns/day 585 ns/day

2x8xE5-2670 CPU 2x8xE5-2670 CPU + Tesla K20X GPU

Data courtesy of AMBER.org AMBER Performance Example (TRP Cage) CPU 2x8xE5-2670 GPU K20

SAN DIEGO SUPERCOMPUTER CENTER 31 Historical Single Node / Single GPU AMBER Performance (DHFR Production NVE) PMEMD Peformance 120

100

80

GPU

60 CPU ns/day Linear (GPU) Linear (CPU) 40

20

0 2007 2008 2009 2010 2011 2012 2013 2014

SAN DIEGO SUPERCOMPUTER CENTER 32 Interactive MD? • Single nodes are now fast enough, GPU enabled cloud nodes actually make sense as a back end now.

SAN DIEGO SUPERCOMPUTER CENTER 33 Cost Comparison 4 simultaneous simulations, 23,000 atoms, 250ns each, 5 days maximum time to solution.

Traditional Cluster GPU Workstation Nodes Required 12 1 (4 GPUs) Interconnect QDR IB None Time to complete simulations 4.98 days 2.25 days

Power Consumption 5.7 kW (681.3 kWh) 1.0 kW (54.0 kWh) System Cost (per day) $96,800 ($88.40) $5200 ($4.75)

Simulation Cost (681.3 * 0.18) + (88.40 * 4.98) (54.0 * 0.18) + (4.75 * 2.25)

$562.87 $20.41

>25x cheaper AND solution obtained in less than half the time

SAN DIEGO SUPERCOMPUTER CENTER 34 Kepler - Our Fastest Family of GPUs Yet

30.00 Factor IX Running AMBER 12 GPU Support Revision 12.1 25.39 25.00 The blue node contains Dual E5-2687W CPUs 22.44 (8 Cores per CPU). 7.4x The green nodes contain Dual E5-2687W CPUs

20.00 18.90 (8 Cores per CPU) and either 1x NVIDIA M2090, 6.6x 1x K10 or 1x K20 for the GPU

15.00

11.85 5.6x

Nanoseconds Day / 10.00 3.5x 5.00 3.42

0.00 Factor IX 1 CPU Node 1 CPU Node + 1 CPU Node + K10 1 CPU Node + K20 1 CPU Node + K20X M2090 GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU only node 35 K10 Accelerates Simulations of All Sizes

30 Running AMBER 12 GPU Support Revision 12.1

The blue node contains Dual E5-2687W CPUs 24.00 25 (8 Cores per CPU).

19.98 The green nodes contain Dual E5-2687W CPUs 20 (8 Cores per CPU) and 1x NVIDIA K10 GPU

15

10

Speedup Speedup Compared to CPU Only 5.50 5.53 5.04 5 2.00

0 CPU TRPcage JAC NVE Factor IX NVE Cellulose NVE Myoglobin Nucleosome All GB PME PME PME GB GB Gain 24x performance by adding just 1 GPU Nucleosome when compared to dual CPU performance K20 Accelerates Simulations of All Sizes 30.00 28.00 Running AMBER 12 GPU Support Revision 12.1 25.56 SPFP with CUDA 4.2.9 ECC Off 25.00

The blue node contains 2x Intel E5-2687W CPUs (8 Cores per CPU) 20.00 Each green nodes contains 2x Intel E5-2687W CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPUs

15.00

10.00 7.28

6.50 6.56 Speedup Speedup Compared to CPU Only

5.00 2.66 1.00 0.00 CPU All TRPcage GB JAC NVE PME Factor IX NVE Cellulose NVE Myoglobin GB Nucleosome Molecules PME PME GB Gain 28x throughput/performance by adding just one K20 GPU Nucleosome when compared to dual CPU performance

37 K20X Accelerates Simulations of All Sizes 35 31.30 Running AMBER 12 GPU Support Revision 12.1

30 28.59 The blue node contains Dual E5-2687W CPUs (8 Cores per CPU). 25 The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 1x NVIDIA K20X GPU 20

15

10 8.30

7.15 7.43 Speedup Speedup Compared to CPU Only

5 2.79

0 CPU TRPcage JAC NVE Factor IX NVE Cellulose NVE Myoglobin Nucleosome All Molecules GB PME PME PME GB GB

Gain 31x performance by adding just one K20X GPU Nucleosome when compared to dual CPU performance

38 K10 Strong Scaling over Nodes

Cellulose 408K Atoms (NPT) Running AMBER 12 with CUDA 4.2 ECC Off 6

The blue nodes contains 2x Intel X5670 CPUs (6 Cores per CPU) 5

The green nodes contains 2x Intel X5670

CPUs (6 Cores per CPU) plus 2x NVIDIA 4 K10 GPUs 2.4x

3 CPU Only 3.6x With GPU

Nanoseconds Day / 2 5.1x 1

0 1 2 4 Cellulose Number of Nodes GPUs significantly outperform CPUs while scaling over multiple nodes K20 Scaling over 8 Nodes

Cellulose (NVE) Running AMBER 12 GPU Support 9 Revision 18

8 The blue nodes contain Dual E5- 2687W CPUs (8 Cores per CPU). 7 1.2x The green nodes contain Dual E5- 6 1.9x 2687W CPUs (8 Cores per CPU) and 5 2x NVIDIA K20 GPUs 3.1x CPU Only 4 5.3x With GPU

Nanoseconds/Day 3

2

1 Cellulose 0 1 2 4 8 Number of Nodes

GPUs significantly outperform CPUs while scaling over multiple nodes K20 Scaling over 8 Nodes

Factor IX Running AMBER 12 GPU Support 35 Revision 18

The blue nodes contain Dual E5- 30 1.3x 2687W CPUs (8 Cores per CPU). 1.8x 25 The green nodes contain Dual E5- 2.8x 2687W CPUs (8 Cores per CPU) and 20 2x NVIDIA K20 GPUs 4.7x CPU Only 15

With GPU Nanoseconds/Day

10

5 Factor IX

0 1 2 4 8 Number of Nodes GPUs significantly outperform CPUs while scaling over multiple nodes K20 Scaling over 8 Nodes

JAC (NVE) Running AMBER 12 GPU Support 140 Revision 18

120 The blue nodes contain Dual E5- 2687W CPUs (8 Cores per CPU).

100

The green nodes contain Dual E5- 1.7x 1.8x 2687W CPUs (8 Cores per CPU) and 80 2x NVIDIA K20 GPUs 2.4x CPU Only 60

4x With GPU Nanoseconds/Day 40

20

0 DHFR 1 2 4 8 Number of Nodes

GPUs significantly outperform CPUs while scaling over multiple nodes K20X Scaling over 8 Nodes

Cellulose (NVE) Running AMBER 12 GPU Support Revision 18 9 The blue nodes contain Dual E5- 8 2687W CPUs (8 Cores per CPU). 1.3x 7 The green nodes contain Dual E5-

2x 2687W CPUs (8 Cores per CPU) and

6 2x NVIDIA K20X GPUs 3.3x 5

5.8x CPU Only 4 With GPU

Nanoseconds/Day 3

2

1 Cellulose 0 1 2 4 8 Number of Nodes

GPUs significantly outperform CPUs while scaling over multiple nodes K20X Scaling over 8 Nodes

Factor IX Running AMBER 12 GPU Support Revision 18 40 The blue nodes contain Dual E5- 35 2687W CPUs (8 Cores per CPU).

1.4x 30 The green nodes contain Dual E5- 2687W CPUs (8 Cores per CPU) and 1.9x 25 2x NVIDIA K20X GPUs 2.9x 20 5.1x CPU Only With GPU

15 Nanoseconds/Day

10

5 Factor IX

0 1 2 4 8 Number of Nodes

GPUs significantly outperform CPUs while scaling over multiple nodes K20X Scaling over 8 Nodes

JAC (NVE) Running AMBER 12 GPU Support Revision 18 140 The blue nodes contain Dual E5- 120 2687W CPUs (8 Cores per CPU).

The green nodes contain Dual E5- 100 1.7x 2687W CPUs (8 Cores per CPU) and 1.9x 2x NVIDIA K20X GPUs 80 2.5x 4.3x CPU Only 60

With GPU Nanoseconds/Day 40

20 DHFR

0 1 2 4 8 Number of Nodes

GPUs significantly outperform CPUs while scaling over multiple nodes Kepler – Universally Faster

9 Running AMBER 12 GPU Support Revision 12.1

8 The CPU Only node contains Dual E5-2687W

CPUs (8 Cores per CPU). 7

The Kepler nodes contain Dual E5-2687W CPUs 6 (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPUs 5 JAC Factor IX 4 Cellulose

3

2 Speedups Speedups Compared CPU to Only

1

0 CPU Only CPU + K10 CPU + K20 CPU + K20X Cellulose

The Kepler GPUs accelerated all simulations, up to 8x K10 Extreme Performance

Running AMBER 12 GPU Support Revision 12.1 JAC 23K Atoms (NVE) 120 The blue node contains Dual E5-2687W CPUs (8 Cores per CPU). 97.99 100 The green node contain Dual E5-2687W CPUs

(8 Cores per CPU) and 2x NVIDIA K10 GPUs

80

60

Nanoseconds Day / 40

20 12.47

0 DHFR 1 Node 1 Node Gain 7.8X performance by adding just 2 GPUs when compared to dual CPU performance K20 Extreme Performance

DHRF JAC 23K Atoms (NVE) Running AMBER 12 GPU Support Revision 12.1 120 SPFP with CUDA 4.2.9 ECC Off

The blue node contains 2x Intel E5-2687W CPUs 95.59 100 (8 Cores per CPU)

Each green node contains 2x Intel E5-2687W

80 CPUs (8 Cores per CPU) plus 2x NVIDIA K20 GPU

60

40 Nanoseconds Day /

20 12.47

0 1 Node 1 Node DHFR

Gain > 7.5X throughput/performance by adding just 2 K20 GPUs when compared to dual CPU performance

48 Replace 8 Nodes with 1 K20 GPU

90.00 35000 $32,000.00 Running AMBER 12 GPU Support Revision 81.09 12.1 SPFP with CUDA 4.2.9 ECC Off 80.00 30000 The eight (8) blue nodes each contain 2x 70.00 65.00 Intel E5-2687W CPUs (8 Cores per CPU) 25000 60.00 Each green node contains 2x Intel E5- 2687W CPUs (8 Cores per CPU) plus 1x 50.00 20000 NVIDIA K20 GPU

Note: Typical CPU and GPU node pricing 40.00 15000 used. Pricing may vary depending on node configuration. Contact your preferred HW 30.00 vendor for actual pricing. 10000 20.00 $6,500.00

5000 10.00

0.00 0 Nanoseconds/Day Cost Cut down simulation costs to ¼ and gain higher performance DHFR

49 Replace 7 Nodes with 1 K10 GPU

Performance on JAC NVE Cost Running AMBER 12 GPU Support Revision 12.1 SPFP with CUDA 4.2.9 ECC Off 80 $35,000.00 $32,000 The eight (8) blue nodes each contain 2x 70 $30,000.00 Intel E5-2687W CPUs (8 Cores per CPU)

60 The green node contains 2x Intel E5-2687W $25,000.00

CPUs (8 Cores per CPU) plus 1x NVIDIA 50 K10 GPU $20,000.00 40 Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node $15,000.00 configuration. Contact your preferred HW 30

vendor for actual pricing. Nanoseconds Day / $10,000.00 20 $7,000

10 $5,000.00

0 $0.00 CPU Only GPU Enabled CPU Only GPU Enabled

Cut down simulation costs to ¼ and increase performance by 70% DHFR

50 Extra CPUs decrease Performance

Cellulose NVE Running AMBER 12 GPU Support Revision 12.1

8 The orange bars contains one E5-2687W CPUs (8 Cores per CPU). 7 The blue bars contain Dual E5-2687W CPUs (8

6 Cores per CPU)

5

4 1 E5-2687W 2 E5-2687W

3 Nanoseconds Day /

2

CPUs 2 GPUs 2 CPUs

1 CPU 2 GPUs 2 CPU 1 2 1

0 Cellulose CPU Only CPU with dual K20s

When used with GPUs, dual CPU sockets perform worse than single CPU sockets. Kepler - Greener Science

Running AMBER 12 GPU Support Revision 12.1 Energy used in simulating 1 ns of DHFR JAC 2500 The blue node contains Dual E5-2687W CPUs (150W each, 8 Cores per CPU).

The green nodes contain Dual E5-2687W CPUs 2000 Lower is better (8 Cores per CPU) and 1x NVIDIA K10, K20, or K20X GPUs (235W each).

1500 Energy Expended

1000 = Power x Time Energy Energy Expended (kJ)

500

0 CPU Only CPU + K10 CPU + K20 CPU + K20X The GPU Accelerated systems use 65-75% less energy Recommended GPU Node Configuration for AMBER Computational Chemistry Workstation or Single Node Configuration # of CPU sockets 2

Cores per CPU socket 4+ (1 CPU core drives 1 GPU)

CPU speed (Ghz) 2.66+ System memory per node (GB) 16

GPUs Kepler K10, K20, K20X

1-2 # of GPUs per CPU socket (4 GPUs on 1 socket is good to do 4 fast serial GPU runs)

GPU memory preference (GB) 6 GPU to CPU connection PCIe 2.0 16x or higher Server storage 2 TB

Network configuration Infiniband QDR or better 53 Scale to multiple nodes with same single node configuration Benefits of GPU AMBER Accelerated Computing

Faster than CPU only systems in all tests

Most major compute intensive aspects of classical MD ported

Large performance boost with marginal price increase

Energy usage cut by more than half

GPUs scale well within a node and over multiple nodes

K20 GPU is our fastest and lowest power high performance GPU yet

Try GPU accelerated AMBER for free – www.nvidia.com/GPUTestDrive 54 NAMD 2.9 K20 Accelerates Simulations of All Sizes

3 Running NAMD 2.9 with CUDA 4.0 ECC Off 2.7 2.6 The blue node contains 2x Intel E5-2687W CPUs 2.5 2.4 (8 Cores per CPU)

Each green node contains 2x Intel E5-2687W 2 CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPUs

1.5

1 Speedup Speedup Compared to CPU Only

0.5

0 CPU All Molecules ApoA1 F1-ATPase STMV

Gain 2.5x throughput/performance by adding just 1 GPU Apolipoprotein A1 when compared to dual CPU performance

56 Kepler - Our Fastest Family of GPUs Yet

4.50

ApoA1 4.00 Running NAMD version 2.9 4.00 The blue node contains Dual E5-2687W CPUs 3.57 (8 Cores per CPU). 3.45 3.50 2.9x The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA M2090, 3.00 1x K10 or 1x K20 for the GPU. 2.63 2.6x 2.50 2.5x 2.00

Nanoseconds/Day 1.9x 1.50 1.37

1.00

0.50

0.00 1 CPU Node 1 CPU Node + 1 CPU Node + K10 1 CPU Node + K20 1 CPU Node + K20X Apolipoprotein A1 M2090 GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU only node 57 Kepler - Our Fastest Family of GPUs Yet

1.4

F1-ATPase 1.25 Running NAMD version 2.9 1.2 The blue node contains Dual E5-2687W CPUs 1.12 (8 Cores per CPU).

1 0.96 The green nodes contain Dual E5-2687W 2.7x CPUs (8 Cores per CPU) and either 1x NVIDIA 0.83 M2090, 1x K10 or 1x K20 for the GPU. 0.8 2.4x

2.1x 0.6

0.46 1.8x Nanoseconds/Day

0.4

0.2

F1-ATPase 0 1 CPU Node 1 CPU Node + M2090 1 CPU Node + K10 1 CPU Node + K20 1 CPU Node + K20X

GPU speedup/throughput increased from 1.8x (with M2090) to 2.7x (with K20X) when compared to a CPU only node 58 Kepler - Our Fastest Family of GPUs Yet

0.4

STMV 0.35 Running NAMD version 2.9 0.35 The blue node contains Dual E5-2687W CPUs 0.31 (8 Cores per CPU).

0.29 0.3 The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 1x NVIDIA

3.0x 0.25 M2090, 1x K10 or 1x K20 for the GPU. 0.22 2.7x 0.2 2.5x

Nanoseconds/Day 0.15 0.12 1.9x

0.1

0.05

0 Satellite Tobacco Mosaic Virus 1 CPU Node 1 CPU Node + M2090 1 CPU Node + K10 1 CPU Node + K20 1 CPU Node + K20X GPU speedup/throughput increased from 1.9x (with M2090) to 3.0x (with K20X) when compared to a CPU only node 59 Kepler - Our Fastest Family of GPUs Yet

8 Running NAMD version 2.9 ApoA1 7.14 The blue node contains Dual E5-2687W CPUs 7 6.67 (8 Cores per CPU). 6.25 The green nodes contain Dual E5-2687W 6 5.2x CPUs (8 Cores per CPU) and either 2x NVIDIA 5.00 M2090, K10, K20, or K20X for the GPU. 5 4.9x

4 4.6x

Nanoseconds/Day 3 3.6x

2 1.37

1 Apolipoprotein A1 0 2 CPU Node 2 CPU Node + 2xM2090 2 CPU Node + 2xK10 2 CPU Node + 2xK20 2 CPU Node + 2xK20X

60 Kepler - Our Fastest Family of GPUs Yet

2.5 Running NAMD version 2.9 F1-ATPase 2.22 The blue node contains Dual E5-2687W CPUs 2.04 (8 Cores per CPU).

2 The green nodes contain Dual E5-2687W 1.79 CPUs (8 Cores per CPU) and either 2x NVIDIA 4.8x M2090, K10, K20, or K20X for the GPU.

1.56

1.5 4.4x

3.9x

1 3.4x Nanoseconds/Day

0.46 0.5

F1-ATPase

0 2 CPU Node 2 CPU Node + 2xM2090 2 CPU Node + 2xK10 2 CPU Node + 2xK20 2 CPU Node + 2xK20X

61 Kepler - Our Fastest Family of GPUs Yet

0.7 Running NAMD version 2.9 STMV 0.61 The blue node contains Dual E5-2687W CPUs 0.6 (8 Cores per CPU). 0.56 0.52 The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and either 2x NVIDIA 0.5 5.3x M2090, K10, K20, or K20X for the GPU.

0.41 0.4 4.9x

0.3 4.5x Nanoseconds/Day 3.6x 0.2

0.12 0.1

Satellite Tobacco Mosaic Virus

0 2 CPU Node 2 CPU Node + 2 CPU Node + 2xK10 2 CPU Node + 2xK20 2 CPU Node + 2xK20X 2xM2090

62 Kepler – Universally Faster

6 Running NAMD version 2.9

The CPU Only node contains Dual E5-2687W 5 CPUs (8 Cores per CPU). 5.1x The Kepler nodes contain Dual E5-2687W CPUs 4 4.7x (8 Cores per CPU) and 1 or two NVIDIA K10, 4.3x K20, or K20X GPUs. F1-ATPase 3 ApoA1 STMV 2.9x 2 2.6x

2.4x Speedup Speedup Compared to CPU Only

1

0 CPU Only 1x K10 1x K20 1x K20X 2x K10 2x K20 2x K20X | Kepler nodes use Dual CPUs | F1-ATPase The Kepler GPUs accelerate all simulations, up to 5.1x Average acceleration printed in bars ApoA1 Relative Performance 6

5.2 Running NAMD version 2.9 5 The blue node contains Single or Dual E5-2687W CPUs (8 Cores per CPU). 4.1 4 3.6 The green nodes contain Single or Dual E5-2687W CPUs (8 Cores per CPU) Relative to and either 1x or 2x NVIDIA M2090 or 2x CPU 2.9 3 2.7 K20X for the GPU.

1.9 2

1.0 1 0.4

0 1xCPU 2xCPU 1xCPU+ 1xCPU+ 1xCPU + 1xCPU + 1xCPU + 1xCPU + 1xGPU 2xGPU 1xGPU 2xGPU 1xGPU 2xGPU Apolipoprotein A1 E5-2687w K20x M2090 K20x CPU Single-Socket Dual-Socket

64 F1-ATPase Relative Performance 6

5.1 Running NAMD version 2.9 5 The blue node contains Single or Dual E5-2687W CPUs (8 Cores per CPU). 4.0 4 The green nodes contain Single or Dual 3.4 E5-2687W CPUs (8 Cores per CPU) Relative to and either 1x or 2x NVIDIA M2090 or 2.9 2x CPU 3 2.7 K20X for the GPU.

2 1.8

1.0 1 0.4

0 1xCPU 2xCPU 1xCPU+ 1xCPU+ 1xCPU + 1xCPU + 1xCPU + 1xCPU + F1-ATPase 1xGPU 2xGPU 1xGPU 2xGPU 1xGPU 2xGPU E5-2687w K20x M2090 K20x CPU Single-Socket Dual-Socket

65 STMV Relative Performance 6

5.1 Running NAMD version 2.9 5 The blue node contains Single or Dual E5-2687W CPUs (8 Cores per CPU). 4.0 4 The green nodes contain Single or Dual 3.4 E5-2687W CPUs (8 Cores per CPU) Relative to 2.9 and either 1x or 2x NVIDIA M2090 or 2x CPU 3 2.7 K20X for the GPU.

2 1.8

1.0 1 0.4

0 1xCPU 2xCPU 1xCPU+ 1xCPU+ 1xCPU + 1xCPU + 1xCPU + 1xCPU + Satellite Tobacco Mosaic 1xGPU 2xGPU 1xGPU 2xGPU 1xGPU 2xGPU Virus E5-2687w K20x M2090 K20x CPU Single-Socket Dual-Socket

66 NAMD CPU and GPU Scaling

Courtesy of Jim Phillips @ Beckman Institute UIUC

67 Outstanding Strong Scaling with Multi-STMV

Running NAMD version 2.9 100 STMV on Hundreds of Nodes Each blue XE6 CPU node contains 1x 1.2 AMD 1600 Opteron (16 Cores per CPU). Fermi XK6 1 Each green XK6 CPU+GPU node CPU XK6 2.7x contains 1x AMD 1600 Opteron (16 Cores per CPU) and an additional 1x 0.8 NVIDIA X2090 GPU.

2.9x 0.6

Nanoseconds Day / 0.4

0.2 3.6x 3.8x 0 32 64 128 256 512 640 768 # of Nodes Concatenation of 100 Satellite Tobacco Mosaic Virus Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers Replace 3 Nodes with 1 2090 GPU

Running NAMD version 2.9 F1-ATPase Each blue node contains 2x Intel Xeon X5550 CPUs (4 Cores per CPU). 4 CPU Nodes 0.8 9000 0.74 $8,000 The green node contains 2x Intel Xeon X5550 1 CPU Node +8000 0.7 1x M2090 GPUs CPUs (4 Cores per CPU) and 1x NVIDIA M2090 0.63 GPU 7000 0.6 Note: Typical CPU and GPU node pricing used. 6000 0.5 Pricing may vary depending on node configuration. 5000 Contact your preferred HW vendor for actual 0.4 $4,000 pricing. 4000 0.3 3000 0.2 2000

0.1 1000

0 0 Nanoseconds/Day Cost

Speedup of 1.2x for 50% the cost F1-ATPase K20 - Greener: Twice The Science Per Watt

1200000 Energy Used in Simulating 1 Nanosecond of ApoA1 Running NAMD version 2.9 1000000 Each blue node contains Dual E5-2687W CPUs (95W, 4 Cores per CPU).

800000 Each green node contains 2x Intel Xeon Lower is better X5550 CPUs (95W, 4 Cores per CPU) and 2x NVIDIA K20 GPUs (225W per GPU)

600000 Energy Expended

Energy Energy Expended (kJ) 400000 = Power x Time

200000

0 1 Node 1 Node + 2x K20 Cut down energy usage by ½ with GPUs

70 Kepler - Greener: Twice The Science/Joule Energy used in simulating 1 ns of SMTV 250000 Running NAMD version 2.9

The blue node contains Dual E5-2687W 200000 CPUs (150W each, 8 Cores per CPU).

Lower is better The green nodes contain Dual E5-2687W 150000 CPUs (8 Cores per CPU) and 2x NVIDIA K10, K20, or K20X GPUs (235W each).

100000 Energy Expended

Energy Energy Expended (kJ) = Power x Time

50000

0 CPU Only CPU + 2 K10s CPU + 2 K20s CPU + 2 K20Xs

Cut down energy usage by ½ with GPUs

Satellite Tobacco Mosaic Virus Recommended GPU Node Configuration for NAMD Computational Chemistry

Workstation or Single Node Configuration

# of CPU sockets 2 Cores per CPU socket 6+ CPU speed (Ghz) 2.66+ System memory per socket (GB) 32

Kepler K10, K20, K20X GPUs Fermi M2090, M2075, C2075

# of GPUs per CPU socket 1-2 GPU memory preference (GB) 6 GPU to CPU connection PCIe 2.0 or higher

Server storage 500 GB or higher

Network configuration Gemini, InfiniBand

72 Scale to multiple nodes with same single node configuration Summary/Conclusions Benefits of GPU Accelerated Computing

Faster than CPU only systems in all tests

Large performance boost with small marginal price increase

Energy usage cut in half

GPUs scale very well within a node and over multiple nodes

Tesla K20 GPU is our fastest and lowest power high performance GPU to date

Try GPU accelerated NAMD for free – www.nvidia.com/GPUTestDrive 73 LAMMPS More Science for Your Money (LAMMPS) Embedded Atom Model 6 5.5 Blue node uses 2x E5-2687W (8 Cores and 150W per CPU). 5 4.5 Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K10, K20, or K20X GPUs (235W). 4

3.3 2.92 3 2.47

2 1.7 Speedup Speedup Compared to CPU Only 1

0 CPU Only CPU + 1x CPU + 1x CPU + 1x CPU + 2x CPU + 2x CPU + 2x K10 K20 K20X K10 K20 K20X Experience performance increases of up to 5.5x with Kepler GPU nodes. K20X, the Fastest GPU Yet (LAMMPS)

7 Blue node uses 2x E5-2687W (8 Cores and 150W per CPU).

6 Green nodes have 2x E5-2687W and 2

NVIDIA M2090s or K20X GPUs (235W). 5

4

3

2 Speedup Speedup Relative to CPU Alone

1

0 CPU Only CPU + 2x M2090 CPU + K20X CPU + 2x K20X

Experience performance increases of up to 6.2x with Kepler GPU nodes. One K20X performs as well as two M2090s Get a CPU Rebate to Fund Part of Your GPU Budget

Acceleration in LAMMPS Loop Time Computation by Additional GPUs 20 The blue node contains Dual X5670 CPUs 18.2 (6 Cores per CPU). 18

16 The green nodes contain Dual X5570 CPUs (4 Cores per CPU) and 1-4 NVIDIA M2090 14 12.9 GPUs. 12 9.88 10

8

6 5.31 Normalized to CPU Only 4

2

0 1 Node 1 Node + 1x M2090 1 Node + 2x M2090 1 Node + 3x M2090 1 Node + 4x M2090

Increase performance 18x when compared to CPU-only nodes

Cheaper CPUs used with GPUs AND still faster overall performance when compared to more expensive CPUs! Excellent Strong Scaling on Large Clusters

LAMMPS Gay-Berne 134M Atoms

600 GPU Accelerated XK6 500

CPU only XE6 400 3.55x 300

200

LoopTime (seconds) 3.48x 3.45x 100

0 300 400 500 600 700 800 900 Nodes From 300-900 nodes, the NVIDIA GPU-powered XK6 maintained 3.5x performance compared to XE6 CPU nodes

Each blue Cray XE6 Nodes have 2x AMD Opteron CPUs (16 Cores per CPU) Each green Cray XK6 Node has 1x AMD Opteron 1600 CPU (16 Cores per CPU) and 1x NVIDIA X2090 GPUs Sustain 5x Performance for Weak Scaling

LAMMPS Weak Scaling with 32K Atoms per Node 45

40

35

30 6.7x 5.8x 4.8x 25

20

15 LoopTime (seconds) 10

5

0 1 8 27 64 125 216 343 512 729 Nodes Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone

Each blue Cray XE6 Node have 2x AMD Opteron CPUs (16 Cores per CPU) Each green Cray XK6 Node has 1x AMD Opteron 1600 CPU (16 Core per CPU) and 1x NVIDIA X2090 Faster, Greener — Worth It! (LAMMPS)

Energy Consumed in one loop of EAM 140

120 GPU-accelerated computing uses Lower is better 53% less energy than CPU only

100

80

60 Energy Expended = Power x Time Power calculated by combining the component’s TDPs Energy Energy Expended (kJ) 40

20

0 1 Node 1 Node + 1 K20X 1 Node + 2x K20X

Blue node uses 2x E5-2687W (8 Cores and 150W per CPU) and CUDA 4.2.9. Green nodes have 2x E5-2687W and 1 or 2 NVIDIA K20X GPUs (235W) running CUDA 5.0.35.

Try GPU accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive Accelerate LAMMPS Simulations with GPUs

7 6.6 “Summary of best speedups 6 versus running on a single XK7 CPU for CPU-only and 5 accelerated runs. Simulation is 400 timesteps for a 1

4 million molecule droplet. The

XK7 w/out GPU speedups are calculated

XK7 w/ GPU based on the single node Speedup 3 loop time of 440.3 seconds.”

2

1.0 1 *Brown, W.M., Yamada, M., “Implementing Molecule Dynamics on Hybrid High Performance Computers 0 1 Node – Three-Body Potentials,” Computer Physics Communications (2013, submitted) Accelerate LAMMPS Simulations with GPUs

250

211 “Summary of best speedups versus running on a single 200 XK7 CPU for CPU-only and accelerated runs. Simulation is 400 timesteps for a 1

150 million molecule droplet. The

speedups are calculated XK7 w/out GPU based on the single node

XK7 w/ GPU Speedup 100 loop time of 440.3 seconds.”

50 41.6

*Brown, W.M., Yamada, M., “Implementing Molecule Dynamics on Hybrid High Performance Computers 0 – Three-Body Potentials,” Computer 64 Nodes Physics Communications (2013, submitted) Molecular Dynamics with LAMMPS on a Hybrid Cray Supercomputer

W. Michael Brown National Center for Computational Sciences Oak Ridge National Laboratory

NVIDIA Technology Theater, Supercomputing 2012 November 14, 2012 Early Kepler LAMMPS Benchmarks on Titan 32.00 4 16.00 XK7+GPU 8.00

4.00 XK6 3

)

s )

( 2.00 s XK6+GPU

Atomic Fluid (

e

1.00 e

m 2 i

XK7+GPU m i

T 0.50 T 0.25 XK6 0.13 1 0.06 XK6+GPU 0.03 0

1 2 4 8 16 32 64 128 1 4 6 4 6 4 6 4 Nodes 1 6 5 2 9 8 2 0 0 3 1 4 6 1 3.0 8.00 XK7+GPU 2.5 4.00

2.0

)

s

2.00 (

)

s e ( XK6 1.5

Bulk Copper

m e

1.00 i

T m

i 1.0 T 0.50 XK6+GPU 0.5 0.25 0.0

0.13 1 4 6 4 6 4 6 4 Nodes 1 6 5 2 9 8 2 0 0 3 1 4 6 1 2 4 8 16 32 64 128 1 Early Kepler LAMMPS Benchmarks on Titan 64.00 32 32.00 XK7+GPU 16

16.00

)

)

s (

8 s

8.00 (

Protein e

XK6 e

m

i m

4.00 i

T 4 T 2.00 XK6+GPU 2

1.00

0.50

1

1 4 6 4 6 6

4

4

1 6 5 9 8

1 2 4 8 16 32 64 128 Nodes 2

2 0

3

0

4

6 1 128.00 1 16 64.00 14 32.00 XK7+GPU 12

16.00

) )

s 10 s

( 8.00

(

e

Liquid Crystal 4.00 XK6 8 e

m

m

i

i T 2.00 6 T 1.00 4 0.50 XK6+GPU 0.25 2 0.13 0

1 4 6 4 6 4 6 4 1 2 4 8 16 32 64 128 Nodes 1 6 5 2 9 8 2 0 0 3 1 4 6 1 Early Titan XK6/XK7 LAMMPS Benchmarks

18 Speedup with Acceleration on XK6/XK7 Nodes 16 1 Node = 32K Particles 14 900 Nodes = 29M Particles 12 10 8 6 4 2 0 Atomic Fluid (cutoff Atomic Fluid (cutoff Bulk Copper Protein Liquid Crystal = 2.5σ) = 5.0σ) XK6 (1 Node) 1.92 4.33 2.12 2.6 5.82 XK7 (1 Node) 2.90 8.38 3.66 3.36 15.70 XK6 (900 Nodes) 1.68 3.96 2.15 1.56 5.60 XK7 (900 Nodes) 2.75 7.48 2.86 1.95 10.14 CPU and GPU Scaling in Water Simulations (LAMMPS)

“Simulation rates for a fixed- size 256K molecule water box in parallel (Left) and a scaled-size water box with 32K molecules per node (Right). Parallel simulation rates with ideal efficiency would have a slope equal to that of the green line in each plot.”

*Brown, W.M., Yamada, M., “Implementing Molecule Dynamics on Hybrid High Performance Computers – Three-Body Potentials,” Computer Physics Communications (2013, submitted) Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration

# of CPU sockets 2 Cores per CPU socket 6+ CPU speed (Ghz) 2.66+ System memory per socket (GB) 32

GPUs Kepler K10, K20, K20X

# of GPUs per CPU socket 1-2 GPU memory preference (GB) 6 GPU to CPU connection PCIe 2.0 or higher

Server storage 500 GB or higher

Network configuration Gemini, InfiniBand

88 Scale to multiple nodes with same single node configuration GROMACS 4.6.1 and 4.6.3 Erik Lindahl (GROMACS developer) video Kepler – Our Fastest Family of GPUs Yet

16 GROMACS 4.6.1 Water (192K Atoms) Running GROMACS 4.6.1

14 The blue nodes contains either single or dual E5-2687W CPUs (8 Cores per CPU). 12

3.1x The green nodes contain either single or 10 2.8x dual E5-2687W CPUs (8 Cores per

CPU) and either 1x NVIDIA M2090, 1x K10 or 1x K20 for the GPU

8 3.2x ns/Day

6 1.7x 1.7x

4

2

0 1 CPU Node 1 CPU Node + M2090 1 CPU Node + K10 1 CPU Node + K20 1 CPU Node + K20X

Single Sandy Bridge CPU per node with single K10, K20 or K20X produces best performance

91 K20X, The Fastest Yet

192K Water Molecules 25 Running GROMACS version 4.6.1 and CUDA 5.0.35 22.27 The blue node contains 2 E5-2687W CPUs 20 (150W each, 8 Cores per CPU). 3.1x The green nodes contain 2 E5-2687W CPUs (8 15 Cores per CPU) and 1 or 2 NVIDIA K20X GPUs 13.65 (235W each).

10 1.9x

Nanoseconds/Day 7.26

5

0 CPU CPU + K20 CPU + 2x K20X

Using K20X nodes increases performance by 3.1x Water K20 Increases Performance

Water (192K Atoms) 70 Running GROMACS 4.6.3 66.29 The blue nodes contain 1x E5-2687W CPU 60 (150W TDP, 8 Cores per CPU).

2.7x The green nodes contain 1x E5-2687W CPU 50 (8 Cores per CPU) and 1x NVIDIA K20 GPU

(235W TDP per GPU). 39.12 40 1 CPU/Node (1 CPU + 1x K20)/Node 30 2.8x

Nanoseconds/Day 21.46 20 2.9x 11.70 10 2.7x

0 1 2 4 8

Get up to 2.9x performance compared to CPU-only nodes 2x K20 Increases Performance

Water (192K Atoms) 80 Running GROMACS 4.6.3 75.53 The blue nodes contain 1x E5-2687W CPU 70 (150W TDP, 8 Cores per CPU).

60 The green nodes contain 1x E5-2687W CPU 3.1x (8 Cores per CPU) and 2x NVIDIA K20 GPU

(235W TDP per GPU). 50 48.02

40 1 CPU/Node 3.5x (1 CPU + 2x K20)/Node

30

Nanoseconds/Day 26.84

20 15.02 3.7x 3.5x 10

0 1 2 4 8

Get up to 3.7x performance compared to CPU-only nodes Great Scaling in Large Systems

Water (3072K Atoms) 6 Running GROMACS 4.6.3

The blue nodes contain 1x E5-2687W CPU 5.00 (150W TDP, 8 Cores per CPU). 5

The green nodes contain 1x E5-2687W CPU (8 Cores per CPU) and 1x NVIDIA K20 GPU 4 (235W TDP per GPU). 2.9x

3 1 CPU/Node (1 CPU + 1x K20)/Node

2.35 Nanoseconds/Day 2 2.8x 1.19 1 0.70 2.8x 2.9x

0 1 2 4 8 Get up to 2.9x performance compared to CPU-only nodes Great Scaling in Large Systems

Water (3072K Atoms) 7 Running GROMACS 4.6.3 6.46 The blue nodes contain 1x E5-2687W CPU 6 (150W TDP, 8 Cores per CPU).

The green nodes contain 1x E5-2687W CPU 5 (8 Cores per CPU) and 2x NVIDIA K20 GPU

3.8x (235W TDP per GPU).

4 3.34 1 CPU/Node (1 CPU + 2x K20)/Node 3

Nanoseconds/Day 4x 2 1.64

3.9x 1 0.84 3.5x

0 1 2 4 8

Get up to 4x performance compared to CPU-only nodes Higher Performance with GPUs

Water (3072K Atoms) 10 Running GROMACS 4.6.3 2 M2090/node The blue nodes contain single E5-2687W CPUs (8 Cores per CPU).

1 M2090/node The green nodes contain single E5-2687W ns/day CPUs (8 Cores per CPU) and either 1x or (log scale) 2x NVIDIA M2090 for the GPU. Single E5-2687w/node 1 1 2 4 8

0.1 # Nodes (log scale) Higher Performance with GPUs Water (3072K Atoms) 10 Running GROMACS 4.6.3 2 M2090/node

The blue nodes contain dual E5-2687W 1 M2090/node CPUs (8 Cores per CPU).

The green nodes contain dual E5-2687W Dual E5-2687w/node CPUs (8 Cores per CPU) and either 1x or 2x NVIDIA M2090 for the GPU. ns/day (log scale) 1 1 2 4 8

0.1 # Nodes (log scale) Relative Performance Water (3072K Atoms) 2.5 Running GROMACS 4.6.3 2.2 The blue nodes contain either single or dual E5- 2 2687W CPUs (8 Cores per CPU).

The green nodes contain either single or dual E5- 1.5 2687W CPUs (8 Cores per CPU) and either 1x or 2x NVIDIA M2090 for the GPU. Relative to 1.2 2x CPU 1.0 1

0.6 0.5

0 1xCPU 2xCPU 1xCPU+ 1xCPU+ 1xCPU+ 1xCPU+ 1xCPU+ 1xCPU+ 1xGPU 2xGPU 1xGPU 2xGPU 1xGPU 2xGPU E5-2687w K20x M2090 K20x CPU Single-Socket Dual-Socket GROMACS Scaling; 1 CPU & 2 GPUs /node

Water (192K Atoms) Running GROMACS 4.6.3 100 The blue nodes contain single E5-2687W 2 K20/node CPUs (8 Cores per CPU).

The green nodes contain single E5- 2687W CPUs (8 Cores per CPU) and 2x NVIDIA K20 for the GPU. Single E5-2687

ns/day 10 (log scale)

1 1 2 4 8 Nodes (log scale) GROMACS Scaling; 2 CPUs & 2 GPUs /node

Water (192K Atoms) 100 2 K20/node Running GROMACS 4.6.3

The blue nodes contain dual E5-2687W CPUs (8 Cores per CPU).

Dual E5-2687 The green nodes contain dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA K20 for the GPU.

ns/day 10 (log scale)

1 1 2 4 8 # Nodes (log scale) Relative Performance 3.5 Water (192K Atoms) Running GROMACS 4.6.3 3 2.9 The blue nodes contain either single or dual E5-2687W CPUs (8 2.5 Cores per CPU).

2.0 2.0 The green nodes contain either 2 single or dual E5-2687W CPUs (8 Relative to 1.7 Cores per CPU) and either 1x or 2x 2x CPU 1.6 1.5 NVIDIA M2090 and K20 for the GPU. 1.1 1.0 1

0.6 0.5

0 1xCPU 2xCPU 1xCPU+ 1xCPU+ 1xCPU + 1xCPU + 1xCPU + 1xCPU + 1xGPU 2xGPU 1xGPU 2xGPU 1xGPU 2xGPU E5-2687w K20 M2090 K20 CPU Single-Socket Dual-Socket GROMACS Scaling; 1 CPU & 2 GPUs /node

Water (3072K Atoms)

10 Running GROMACS 4.6.3 2 K20/node The blue nodes contain single E5-2687W CPUs (8 Cores per CPU).

The green nodes contain single E5- 2687W CPUs (8 Cores per CPU) and 2x NVIDIA K20 for the GPU.

Single E5-2687 ns/day 1 (log scale) 1 2 4 8

0.1 Nodes (log scale) GROMACS Scaling; 2 CPUs & 2 GPUs /node Water (3072K Atoms) 10 Running GROMACS 4.6.3 2 K20/node The blue nodes contain dual E5-2687W CPUs (8 Cores per CPU).

The green nodes contain dual E5- Dual E5-2687 2687W CPUs (8 Cores per CPU) and 2x NVIDIA K20 for the GPU.

ns/day 1 (log scale) 1 2 4 8

0.1 # Nodes (log scale) Relative Performance

3 Water (3072K Atoms) 2.8 Running GROMACS 4.6.3

2.5 The blue nodes contain either 2.2 single or dual E5-2687W CPUs (8 2.0 Cores per CPU). 2 1.8 The green nodes contain either Relative to 1.7 2x CPU single or dual E5-2687W CPUs (8 1.5 Cores per CPU) and either 1x or 2x NVIDIA M2090 and K20 for the 1.2 GPU. 1.0 1

0.6 0.5

0 1xCPU 2xCPU 1xCPU+ 1xCPU+ 1xCPU + 1xCPU + 1xCPU + 1xCPU + 1xGPU 2xGPU 1xGPU 2xGPU 1xGPU 2xGPU E5-2687w K20 M2090 K20 CPU Single-Socket Dual-Socket K20X Scaling over 8 Nodes Water (192K Atoms) Running GROMACS 4.6.3 100 The blue nodes contain Dual E5-2687W 90 CPUs (8 Cores per CPU).

80 The green nodes contain Dual E5-2687W 70 CPUs (8 Cores per CPU) and 2x NVIDIA 2.3x K20X GPUs 60

50 2.7x CPU Only

40 With GPU Nanoseconds/Day 30 3x 20 3x 10

0 1 2 4 8 Number of Nodes

GPUs significantly outperform CPUs while scaling over multiple nodes K20X Scaling over 8 Nodes Water (3072K Atoms) Running GROMACS 4.6.3 9 The blue nodes contain Dual E5-2687W 8 CPUs (8 Cores per CPU).

7 The green nodes contain Dual E5-2687W CPUs (8 Cores per CPU) and 2x NVIDIA

6 K20X GPUs 2.8x 5

CPU Only 4 2.9x With GPU Nanoseconds/Day 3

2 2.8x 1 2.8x

0 1 2 4 8 Number of Nodes

GPUs significantly outperform CPUs while scaling over multiple nodes Additional Strong Scaling on Larger System 128K Water Molecules 160 Running GROMACS 4.6 pre-beta with CUDA 4.1 140 Each blue node contains 1x Intel X5670 (95W TDP, 6 Cores per CPU) 120

2x Each green node contains 1x Intel X5670 100 (95W TDP, 6 Cores per CPU) and 1x NVIDIA M2070 (225W TDP per GPU) 80 CPU Only With GPU 60

Nanoseconds Day / 2.8x 40

20 3.1x

0 8 16 32 64 128 Number of Nodes

Up to 128 nodes, NVIDIA GPU-accelerated nodes deliver 2-3x performance when compared to CPU-only nodes Replace 3 Nodes with 2 GPUs

Running GROMACS 4.6 pre-beta with ADH in Water (134K Atoms) CUDA 4.1

4 CPU Nodes 9 9000 The blue node contains 2x Intel X5550 8.36 CPUs (95W TDP, 4 Cores per CPU) $8,000 1 CPU Node + 8 2x M2090 GPUs8000 The green node contains 2x Intel X5550 6.7 7 $6,500 7000 CPUs (95W TDP, 4 Cores per CPU) and 2x NVIDIA M2090s as the GPU (225W TDP 6 6000 per GPU)

5 5000 Note: Typical CPU and GPU node pricing 4 4000 used. Pricing may vary depending on node configuration. Contact your preferred HW 3 3000 vendor for actual pricing.

2 2000

1 1000

0 0 Nanoseconds/Day Cost

Save thousands of dollars and perform 25% faster Greener Science

ADH in Water (134K Atoms) Running GROMACS 4.6 with CUDA 4.1 12000

The blue nodes contain 2x Intel X5550 CPUs (95W TDP, 4 Cores per CPU) 10000

Lower is better The green node contains 2x Intel X5550 CPUs, Consumed) 8000 4 Cores per CPU) and 2x NVIDIA M2090s GPUs (225W TDP per GPU)

6000

KiloJoules (

4000 Energy Expended = Power x Time

2000 Energy Energy Expended

0 4 Nodes 1 Node + 2x M2090 (760 Watts) (640 Watts)

In simulating each nanosecond, the GPU-accelerated system uses 33% less energy Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or Single Node Configuration # of CPU sockets 2 Cores per CPU socket 6+ CPU speed (Ghz) 2.66+

System memory per socket (GB) 32

GPUs Kepler K10, K20, K20X

1x Kepler-based GPUs (K20X, K20 or K10): need fast Sandy # of GPUs per CPU socket Bridge or perhaps the very fastest Westmeres, or high-end AMD Opterons

GPU memory preference (GB) 6

GPU to CPU connection PCIe 2.0 or higher Server storage 500 GB or higher

111 Network configuration Gemini, InfiniBand CHARMM Release C37b1 GPU Acceleration via OpenMM (GPUs supported on single node) GPUs Outperform CPUs

Daresbury Crambin 19.6k Atoms 70 Running CHARMM release C37b1 60 The blue nodes contains 44 X5667 CPUs 50 (95W, 4 Cores per CPU).

The green nodes contain 2 X5667 CPUs and 40 1 or 2 NVIDIA C2070 GPUs (238W each).

30 Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node

Nanoseconds Day / configuration. Contact your preferred HW 20 vendor for actual pricing.

10

0 44x X5667 2x X5667 + 1x C2070 2x X5667 + 2x C2070 $44,000 $3000 $4000

1 GPU = 15 CPUs More Bang for your Buck

Daresbury Crambin (19.6k Atoms), Running CHARMM release C37b1

The blue nodes contain 44 X5667 CPUs (95W, 4 cores per CPU). The green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each).

[Chart: performance/price scaled to the CPU-only system, for 44x X5667, 2x X5667 + 1x C2070, and 2x X5667 + 2x C2070.]

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Using GPUs delivers 10.6x the performance for the same cost.

Greener Science with NVIDIA
Energy Used in Simulating 1 ns, Daresbury G1nBP (61.2k Atoms)

Running CHARMM release C37b1

The blue nodes contain 64 X5667 CPUs (95W, 4 cores per CPU). The green nodes contain 2 X5667 CPUs and 1 or 2 NVIDIA C2070 GPUs (238W each).

[Chart: energy expended (kJ) per nanosecond simulated; lower is better. Energy expended = power x time. Configurations: 64x X5667, 2x X5667 + 1x C2070, 2x X5667 + 2x C2070.]

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

Using GPUs will decrease energy use by 75%.

CHARMM Future Release with Native GPU Improvements (Multiple GPUs supported on multiple nodes)
Courtesy of Antti-Pekka Hynninen @ NREL

ACEMD (www.acellera.com)

470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms); 116 ns/day on 1 GPU for DHFR (23K atoms)

M. Harvey, G. Giupponi and G. De Fabritiis, "ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale," J. Chem. Theory Comput. 5, 1632 (2009). www.acellera.com

Supported features: NVT, NPT, PME, TCL, PLUMED, CAMSHIFT [1]

[1] M. J. Harvey and G. De Fabritiis, "An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware," J. Chem. Theory Comput. 5, 2371-2377 (2009). [2] For a list of selected references see http://www.acellera.com/acemd/publications

Quantum Chemistry Applications

Each entry lists: features supported; GPU perf; release status; notes.

ABINIT - Features: local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization; GPU perf: 1.3-2.7x; Release status: released, Version 7.4.1, multi-GPU support; Notes: www.abinit.org

ACES III - Features: integrating GPU scheduling into the SIAL programming language and SIP runtime environment; GPU perf: 10x on kernels; Release status: under development, multi-GPU support; Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF - Features: Fock matrix, Hessians; GPU perf: TBD; Release status: available Q1 2014, multi-GPU support; Notes: www.scm.com

BigDFT - Features: DFT; Daubechies wavelets, part of ABINIT; GPU perf: 2-5x; Release status: released, Version 1.7.0, multi-GPU support; Notes: http://bigdft.org

Casino - Features: code for performing quantum Monte Carlo (QMC) electronic structure calculations for finite and periodic systems; GPU perf: TBD; Release status: under development, multi-GPU support; Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CASTEP - Features: TBD; GPU perf: TBD; Release status: under development; Notes: http://www.castep.org/Main/HomePage

CP2K - Features: DBCSR (sparse matrix multiply library); GPU perf: 2-7x; Release status: released, multi-GPU support; Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US - Features: libqc with Rys quadrature algorithm, Hartree-Fock, MP2 and CCSD; GPU perf: 1.3-1.6x; 2.3-2.9x for HF; Release status: released, multi-GPU support; Notes: http://www.msg.ameslab.gov/gamess/index.html

GAMESS-UK - Features: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics; GPU perf: 8x; Release status: released, Version 7.0, multi-GPU support; Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian - Features: joint PGI, NVIDIA & Gaussian collaboration; GPU perf: TBD; Release status: under development, multi-GPU support; Notes: http://www.gaussian.com/g_press/nvidia_press.htm

GPAW - Features: electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS); GPU perf: 8x; Release status: released, multi-GPU support; Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)

Jaguar - Features: investigating GPU acceleration; GPU perf: TBD; Release status: under development, multi-GPU support; Notes: Schrodinger, Inc., http://www.schrodinger.com/kb/278, http://www.schrodinger.com/productpage/14/7/32/

LATTE - Features: cuBLAS, SP2 algorithm; GPU perf: TBD; Release status: released, multi-GPU support; Notes: http://on-demand.gputechconf.com/gtc/2013/presentations/S3195-Fast-Quantum-Molecular-Dynamics-in-LATTE.pdf

MOLCAS - Features: cuBLAS support; GPU perf: 1.1x; Release status: released, Version 7.8, single GPU; additional GPU support coming in Version 8; Notes: www.molcas.org

MOLPRO - Features: density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT; GPU perf: 1.7-2.3x projected; Release status: under development, multiple GPUs; Notes: www.molpro.net, Hans-Joachim Werner

MOPAC2012 - Features: pseudodiagonalization, full diagonalization, and density matrix assembling; GPU perf: 3.8-14x; Release status: released (academic port), single GPU; MOPAC2013 available Q1 2014; Notes: http://openmopac.net

NWChem - Features: triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers; GPU perf: 7-8x; Release status: released, Version 6.3, multiple GPUs; Notes: development GPGPU benchmarks at www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus - Features: full GPU support for ground-state, real-time calculations; Kohn-Sham Hamiltonian, orthogonalization, subspace diagonalization, Poisson solver, time propagation; GPU perf: 1.5-8x; Release status: released, Version 4.1.0; Notes: http://www.tddft.org/programs/octopus/

ONETEP - Features: TBD; GPU perf: TBD; Release status: under development; Notes: http://www2.tcm.phy.cam.ac.uk/onetep/

PEtot - Features: density functional theory (DFT) plane-wave pseudopotential calculations; GPU perf: 6-10x; Release status: released, multi-GPU; Notes: first-principles materials code that computes the behavior of the electron structures of materials

Q-CHEM - Features: RI-MP2; GPU perf: 8-14x; Release status: released, Version 4.0.1, multi-GPU support; Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf

QMCPACK - Features: main features; GPU perf: 3-4x; Release status: released, multiple GPUs; Notes: NCSA, University of Illinois at Urbana-Champaign, http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf - Features: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs; GPU perf: 2.5-3.5x; Release status: released, Version 5.0, multiple GPUs; Notes: created by the Irish Centre for High-End Computing, http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem - Features: "full GPU-based solution", completely redesigned to exploit GPU parallelism; GPU perf: 44-650x vs. the GAMESS CPU version; Release status: released, Version 1.5K, multi-GPU/single node; Notes: YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP - Features: hybrid Hartree-Fock DFT functionals including exact exchange; GPU perf: 2x (2 GPUs comparable to 128 CPU cores); Release status: available on request, multiple GPUs; Notes: by Carnegie Mellon University, http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS - Features: generalized Wang-Landau method; GPU perf: 3x with 32 GPUs vs. 32 (16-core) CPUs; Release status: under development, multi-GPU support; Notes: NICS Electronic Structure Determination Workshop 2012, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel performance comparison.

ABINIT
ABINIT on GPUs

Speed in the parallel version: for ground-state calculations, GPUs can be used. This is based on CUDA+MAGMA.

For ground-state calculations, the BigDFT part of ABINIT is also very well parallelized: MPI band parallelism, combined with GPUs.

BigDFT
Courtesy of the BigDFT team @ CEA

GAUSSIAN
Gaussian: key quantum chemistry code
ACS Fall 2011 press release: joint collaboration between Gaussian, NVIDIA and PGI for GPU acceleration: http://www.gaussian.com/g_press/nvidia_press.htm
No such release exists for Intel MIC or AMD GPUs.
Mike Frisch quote: "Calculations using Gaussian are limited primarily by the available computing resources," said Dr. Michael Frisch, president of Gaussian, Inc. "By coordinating the development of hardware, compiler technology and application software among the three companies, the new application will bring the speed and cost-effectiveness of GPUs to the challenging problems and applications that Gaussian's customers need to address."

GAMESS
GAMESS Partnership Overview

Mark Gordon and Andrey Asadchev, key developers of GAMESS, in collaboration with NVIDIA. Mark Gordon is a recipient of an NVIDIA Professor Partnership Award.

Quantum chemistry is one of the major consumers of CPU cycles at national supercomputer centers.

NVIDIA developer resources are fully allocated to the GAMESS code.

"We like to push the envelope as much as we can in the direction of highly scalable efficient codes. GPU technology seems like a good way to achieve this goal. Also, since we are associated with a DOE Laboratory, energy efficiency is important, and this is another reason to explore quantum chemistry on GPUs." - Prof. Mark Gordon, Distinguished Professor, Department of Chemistry, Iowa State University, and Director, Applied Mathematical Sciences Program, Ames Laboratory

GAMESS August 2011 GPU Performance

First GPU-supported GAMESS release via "libqc", a library for fast quantum chemistry on multiple NVIDIA GPUs in multiple nodes, with CUDA software: two-electron AO integrals and their assembly into a closed-shell Fock matrix.

[Chart: relative performance of the GAMESS Aug. 2011 release for two small molecules, Ginkgolide (53 atoms) and Vancomycin (176 atoms), comparing 4x E5640 CPUs with 4x E5640 CPUs + 4x Tesla C2070s.]

Upcoming GAMESS Q4 2013 Release

Multi-node with multi-GPU support
Rys quadrature Hartree-Fock: 8 CPU cores vs. 8 CPU cores + M2070 yields a 2.3-2.9x speedup; see the 2012 publication
Møller–Plesset perturbation theory (MP2): preliminary code completed; paper in development
CCSD(T): CCSD code completed, (T) in progress

GAMESS - New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock

Hartree-Fock GPU Speedups*

Adding 1x C2070 GPU speeds up computations by 2.3x to 2.9x.

[Chart: speedups of 2.3-2.9x for Taxol (6-31G, 6-31G(d), 6-31G(2d,2p), 6-31++G(d,p)) and Valinomycin (6-31G, 6-31G(d), 6-31G(2d,2p)).]

* A. Asadchev, M.S. Gordon, "New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock," Journal of Chemical Theory and Computation (2012)

GPAW
Increase Performance with Kepler

Running GPAW 10258

The blue nodes contain 1x E5-2687W CPU (8 cores per CPU). The green nodes contain 1x E5-2687W CPU (8 cores per CPU) and 1x or 2x NVIDIA K20X GPUs.

[Chart: speedup compared to CPU only for Silicon K=1, K=2, and K=3; roughly 1.4-1.6x with one K20X and 2.5-3.0x with two.]

Increase Performance with Kepler

Running GPAW 10258

The blue nodes contain 1x E5-2687W CPU (8 cores per CPU). The green nodes contain 1x E5-2687W CPU (8 cores per CPU) and 2x NVIDIA K20 or K20X GPUs.

[Chart: speedup compared to CPU only for Silicon K=1, K=2, and K=3; roughly 1.7x, 2.2x, and 2.4x.]

Increase Performance with Kepler

Running GPAW 10258

The blue nodes contain 2x E5-2687W CPUs (8 cores per CPU). The green nodes contain 2x E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K20 or K20X GPUs.

[Chart: speedup compared to CPU only for Silicon K=1, K=2, and K=3; roughly 1.3x, 1.4x, and 1.4x.]

Used with permission from Samuli Hakala

NWChem
NWChem 6.3 Release with GPU Acceleration

Addresses large, complex and challenging molecular-scale scientific problems in the areas of catalysis, materials, geochemistry and biochemistry on highly scalable, parallel computing platforms to obtain the fastest time-to-solution. Researchers can, for the first time, perform large-scale coupled cluster with perturbative triples calculations utilizing NVIDIA GPU technology. A highly scalable multi-reference coupled cluster capability will also be available in NWChem 6.3. The software, released under the Educational Community License 2.0, can be downloaded from the NWChem website at www.nwchem-sw.org.

NWChem - Speedup of the non-iterative calculation for various configurations/tile sizes

System: cluster consisting of dual-socket nodes constructed from:

• 8-core AMD Interlagos processors • 64 GB of memory • Tesla M2090 (Fermi) GPUs

The nodes are connected using a high-performance QDR InfiniBand interconnect.

Courtesy of Kowalski, K., Bhaskaran-Nair, K., et al. @ PNNL, JCTC (submitted)

Kepler, Faster Performance (NWChem)

Uracil Molecule

[Chart: time to solution (seconds): CPU only 165 s, CPU + 1x K20X 81 s, CPU + 2x K20X 54 s.]

Performance improves by 2x with one GPU and by 3.1x with two GPUs.

Quantum Espresso/PWscf
Kepler, fast science

AUsurf, Running Quantum Espresso version 5.0-build7 on CUDA 5.0.36

The blue node contains 2 E5-2687W CPUs (150W, 8 cores per CPU). The green nodes contain 2 E5-2687W CPUs and 1 or 2 NVIDIA M2090 or K10 GPUs (225W and 235W respectively).

[Chart: performance relative to CPU only, for CPU only, CPU + M2090, CPU + K10, CPU + 2x M2090, and CPU + 2x K10.]

Using K10s delivers up to 11.7x the performance per node over CPUs, and 1.7x the performance compared to M2090s.

Extreme Performance/Price from 1 GPU

Quantum Espresso simulations run on FERMI @ ICHEC. A 6-core 2.66 GHz Intel X5650 was used for the CPU; an NVIDIA C2050 was used for the GPU.

[Chart: price and performance scaled to the CPU-only system, for the Shilu-3 and Water-on-Calcite benchmarks. Calcite structure shown.]

Adding a GPU can improve performance by 3.7x while only increasing price by 25%.

Extreme Performance/Price from 1 GPU

Price and performance scaled to the CPU-only system. Quantum Espresso simulations run on FERMI @ ICHEC. A 6-core 2.66 GHz Intel X5650 was used for the CPU; an NVIDIA C2050 was used for the GPU.

[Chart: price and performance for AUSURF112 at a k-point and at the gamma-point. Calculation done for a gold surface of 112 atoms.]

Adding a GPU can improve performance by 3.5x while only increasing price by 25%.

Replace 72 CPUs with 8 GPUs

LSMO-BFO (120 Atoms), 8 k-points. Quantum Espresso simulations run on PLX @ CINECA. Intel 6-core 2.66 GHz X5550s were used for the CPUs; NVIDIA M2070s were used for the GPUs.

[Chart: elapsed time (minutes): 120 CPUs ($42,000) take about 223 minutes; 48 CPUs + 8 GPUs ($32,800) take about 219 minutes.]

The GPU-accelerated setup performs faster and costs 24% less.

QE/PWscf - Green Science

LSMO-BFO (120 Atoms), 8 k-points. Quantum Espresso simulations run on PLX @ CINECA. Intel 6-core 2.66 GHz X5550s were used for the CPUs; NVIDIA M2070s were used for the GPUs.

[Chart: power consumption (watts); lower is better. 120 CPUs ($42,000) vs. 48 CPUs + 8 GPUs ($32,800).]

Over a year, the lower power consumption would save $4,300 on energy bills.

NVIDIA GPUs Use Less Energy

Energy Consumption on Different Tests. Quantum Espresso simulations run on FERMI @ ICHEC. A 6-core 2.66 GHz Intel X5650 was used for the CPU; an NVIDIA C2050 was used for the GPU.

[Chart: energy consumption (kWh), CPU only vs. CPU+GPU; lower is better. GPU reductions of 54-58% across Shilu-3, AUSURF112, and Water-on-Calcite.]

In all tests, the GPU-accelerated system consumed less than half the energy of the CPU-only system.

QE/PWscf - Great Strong Scaling in Parallel

CdSe-159, Walltime of 1 full SCF. Quantum Espresso simulations run on STONEY @ ICHEC. Two quad-core 2.87 GHz Intel X5560s were used in each node; two NVIDIA M2090s were used in each node for the CPU+GPU test.

[Chart: walltime (s), CPU vs. CPU+GPU, lower is better, at 2 (16), 4 (32), 6 (48), 8 (64), 10 (80), 12 (96), and 14 (112) nodes (total CPU cores); GPU speedups of roughly 2.1-2.5x.]

Speedups up to 2.5x with GPU acceleration. CdSe-159: cadmium selenide nanodots.

QE/PWscf - More Powerful Strong Scaling

GeSnTe134, Walltime of full SCF. Quantum Espresso simulations run on PLX @ CINECA. Two 6-core 2.4 GHz Intel E5645s were used in each node; two NVIDIA M2070s were used in each node for the CPU+GPU test.

[Chart: walltime (s), CPU vs. CPU+GPU, lower is better, at 4 (48), 8 (96), 12 (144), 16 (192), 24 (288), 32 (384), and 44 (528) nodes (total CPU cores); GPU speedups of roughly 1.6-2.4x.]

Accelerate your cluster by up to 2.1x with NVIDIA GPUs.

Try GPU accelerated Quantum Espresso for free – www.nvidia.com/GPUTestDrive

TeraChem
TeraChem Supercomputer Speeds on GPUs

Time for SCF Step. TeraChem running on 8 C2050s on 1 node; NWChem running on 4,096 quad-core CPUs in the Chinook supercomputer.

Giant fullerene C240 molecule.

[Chart: time (seconds) for the SCF step: 4,096 quad-core CPUs ($19,000,000) vs. 8 C2050s ($31,000).]

Similar performance from just a handful of GPUs.

TeraChem Bang for the Buck

Performance/Price. TeraChem running on 8 C2050s on 1 node; NWChem running on 4,096 quad-core CPUs in the Chinook supercomputer. Giant fullerene C240 molecule.

Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.

[Chart: price/performance relative to the supercomputer: 493x for 8 C2050s ($31,000) vs. 4,096 quad-core CPUs ($19,000,000).]

Dollars spent on GPUs do 500x more science than those spent on CPUs.

Kepler's Even Better

Olestra, 453 atoms. TeraChem running on C2050 and K20c. The first graph is BLYP/6-31G(d); the second is B3LYP/6-31G(d).

[Charts: time (seconds) on C2050 vs. K20c for each method.]

Kepler (K20c) performs 2x faster than Tesla (C2050).

Viz, "Docking" and Related Applications Growing

Each entry lists: features supported; GPU perf; release status; notes.

Amira 5 - Features: 3D visualization of volumetric data and surfaces; GPU perf: 70x; Release status: released, Version 5.5, single GPU; Notes: visualization from Visage Imaging, http://www.visageimaging.com/overview.html

BINDSURF - Features: high-throughput parallel blind virtual screening; GPU perf: 100x; Release status: available upon request to authors, single GPU; Notes: allows fast processing of large ligand databases, http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE - Features: University of Bristol empirical free energy forcefield; GPU perf: 6.5-13.4x; Release status: released, single GPU; Notes: http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping - Features: GPU-accelerated application; GPU perf: 3.75-5000x; Release status: released, single and multi-GPU; Notes: Schrodinger, Inc., http://www.schrodinger.com/products/14/32/

FastROCS - Features: real-time shape similarity searching/comparison; GPU perf: 800-3000x; Release status: released, single and multi-GPU; Notes: OpenEye Scientific Software, http://www.eyesopen.com/fastrocs

Interactive Molecule Visualizer - Features: written for use only on GPUs; GPU perf: TBD; Release status: released, multi-GPU support; Notes: http://www.molecular-visualization.com

Molegro Virtual Docker 6 - Features: energy grid computation, pose evaluation and guided differential evolution; GPU perf: 25-30x; Release status: single GPU; Notes: http://www.molegro.com/gpu.php

PIPER Protein Docking - Features: molecule docking; GPU perf: 3-8x; Release status: released; Notes: http://www.bu.edu/caadlab/gpgpu09.pdf; http://www.schrodinger.com/productpage/14/40/

PyMol - Features: lines (460% increase), cartoons (1246%), surface (1746%), spheres (753%), ribbon (426%); GPU perf: 1700x; Release status: released, Version 1.6, single GPU; Notes: http://pymol.org/

VMD - Features: high-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple-GPU support for display of molecular orbitals; GPU perf: 100-125x or greater on kernels; Release status: released, Version 1.9.1, multi-GPU support; Notes: visualization from University of Illinois at Urbana-Champaign, http://www.ks.uiuc.edu/Research/vmd/; John Stone (VMD developer) video

GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel performance comparison.

FastROCS

OpenEye Japan, Hideyuki Sato, Ph.D.
November 13, 2013. © 2012 OpenEye Scientific Software

ROCS on the GPU: FastROCS

[Chart: shape overlays per second, CPU vs. GPU.]

FastROCS scaling across 4x K10s (2 physical GPUs per K10); 53 million conformers (10.9 million compounds of PubChem at 5 conformers per molecule).

[Chart: conformers per second vs. number of individual K10 GPUs, 1 through 8. Note: each K10 has 2 physical GPUs on the board.]

"Organic Growth" of Bioinformatics Applications

Each entry lists: features supported; GPU speedup; release status; website.

BarraCUDA - Features: alignment of short sequencing reads; GPU speedup: 6-10x; Release status: released, Version 0.6.2, multi-GPU, multi-node; Website: http://seqbarracuda.sourceforge.net/

CUDASW++ - Features: parallel Smith-Waterman database search; GPU speedup: 10-50x; Release status: released, Version 2.0.10, multi-GPU, multi-node; Website: http://sourceforge.net/projects/cudasw/

CUSHAW - Features: parallel, accurate long-read aligner for large genomes; GPU speedup: 10x; Release status: released, Version 1.0.40, multiple GPUs; Website: http://cushaw.sourceforge.net/

GPU-BLAST - Features: protein alignment according to BLASTP; GPU speedup: 3-4x; Release status: released, Version 2.2.26, single GPU; Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

mCUDA-MEME - Features: scalable motif discovery algorithm based on MEME; GPU speedup: 4-10x; Release status: released, Version 3.0.13, multi-GPU, multi-node; Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

MUMmerGPU - Features: aligns multiple query sequences against a reference sequence in parallel; GPU speedup: 3-10x; Release status: released, Version 3.0; Website: http://sourceforge.net/apps/mediawiki/mummergpu/index.php?title=MUMmerGPU

SeqNFind - Features: hardware and software for reference assembly, BLAST, Smith-Waterman, HMM, de novo assembly; GPU speedup: 400x; Release status: released, multi-GPU, multi-node; Website: http://www.seqnfind.com/

SOAP3 - Features: short-read alignment tool that is not heuristic based; reports all answers; GPU speedup: 10x; Release status: beta release, Version 0.01; Website: http://soap.genomics.org.cn/soap3.html

UGENE - Features: fast short-read alignment; GPU speedup: 6-8x; Release status: released, Version 1.12, multi-GPU, multi-node; Website: http://ugene.unipro.ru/

WideLM - Features: parallel linear regression on multiple similarly-shaped models; GPU speedup: 150x; Release status: released, Version 0.1-1, multi-GPU, multi-node; Website: http://insilicos.com/products/widelm

GPU speedup compared against the same or similar code running on a single-CPU machine; performance measured internally or independently.

NVBIO: Efficient Sequence Analysis on GPUs

Jacopo Pantaleoni, NVIDIA

nvBowtie: re-engineering Bowtie2 on NVBIO

We took Bowtie2’s algorithms and rewrote them on top of NVBIO

Supports the full spectrum of features: single end and paired ends; end-to-end and local alignment; best mapping and all mapping.

Favors accuracy over speed

© NVIDIA 2013 - ISMB

nvBowtie - Architecture Overview

[Diagram: a CPU background input thread decompresses and reformats FASTQ (gzip) input into read batches; the GPU consumes each batch through seed mapping, scoring, and traceback stages, reseeding as needed until the batch is finished; a CPU background output thread reformats and compresses the alignments into SAM/BAM. See the sketch of this batched, overlapped pipeline below.]
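The sketch below is a minimal, self-contained illustration of that batched, double-buffered style, written for this tutorial; it is not nvBowtie source code, and the align_batch kernel, fill_batch reader, and buffer sizes are placeholder assumptions. It uses two CUDA streams so that copying one batch of reads can overlap processing of the other, which is the essence of the pipeline above.

```
// Hedged sketch of a batched GPU pipeline (placeholder work, not nvBowtie code).
// Compile with: nvcc batch_pipeline.cu
#include <stdio.h>
#include <cuda_runtime.h>

#define BATCH  1024   // reads per batch (toy size)
#define NBATCH 8      // total batches to process

__global__ void align_batch(const int *reads, int *scores, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        scores[i] = reads[i] * 2;   // placeholder for seed/score/traceback work
}

static void fill_batch(int *buf, int n, int batch)   // stands in for the FASTQ reader thread
{
    for (int i = 0; i < n; ++i) buf[i] = batch * n + i;
}

int main(void)
{
    int *h_reads[2], *h_scores[2], *d_reads[2], *d_scores[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMallocHost((void**)&h_reads[b],  BATCH * sizeof(int));  // pinned for async copies
        cudaMallocHost((void**)&h_scores[b], BATCH * sizeof(int));
        cudaMalloc((void**)&d_reads[b],  BATCH * sizeof(int));
        cudaMalloc((void**)&d_scores[b], BATCH * sizeof(int));
        cudaStreamCreate(&stream[b]);
    }

    for (int batch = 0; batch < NBATCH; ++batch) {
        int s = batch % 2;                          // alternate buffers and streams
        cudaStreamSynchronize(stream[s]);           // wait until this buffer is free again
        fill_batch(h_reads[s], BATCH, batch);       // "input thread" work on the host
        cudaMemcpyAsync(d_reads[s], h_reads[s], BATCH * sizeof(int),
                        cudaMemcpyHostToDevice, stream[s]);
        align_batch<<<(BATCH + 255) / 256, 256, 0, stream[s]>>>(d_reads[s], d_scores[s], BATCH);
        cudaMemcpyAsync(h_scores[s], d_scores[s], BATCH * sizeof(int),
                        cudaMemcpyDeviceToHost, stream[s]);
        // a real pipeline would hand h_scores[s] to an output thread for SAM/BAM writing
    }
    cudaDeviceSynchronize();
    printf("last score = %d\n", h_scores[(NBATCH - 1) % 2][BATCH - 1]);

    for (int b = 0; b < 2; ++b) {
        cudaFreeHost(h_reads[b]); cudaFreeHost(h_scores[b]);
        cudaFree(d_reads[b]);     cudaFree(d_scores[b]);
        cudaStreamDestroy(stream[b]);
    }
    return 0;
}
```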

nvBowtie - Simulated Reads

wgsim: 1M x 100bp reads - (wgsim -r 0.01 -1 100 -S11 -d0 -e0)

nvBowtie: K20

Bowtie2: 6-core SB

nvBowtie - Simulated Reads

Same sensitivity/accuracy at 3-4x the speed

Greater sensitivity/accuracy at same speed

nvBowtie - Real Datasets

Ion Proton, 100M x 175bp (8-350), end-to-end: speedup 4.3x; alignment rate +0.5%; disagreement 0.002%
Ion Proton, 100M x 175bp (8-350), local: speedup 7.6x; alignment rate -0.6%; disagreement 0.03%
Illumina HiSeq 2000 (ERR161544), 10M x 100bp x 2, end-to-end: speedup 2.4x; alignment rate unchanged; disagreement 0.006%
Illumina HiSeq 2000 (ERR161544), 10M x 100bp x 2, local: speedup 2.6x; alignment rate unchanged; disagreement 0.022%

nvBowtie

Same quality @ higher speed: best-mapping is 2-7x faster on a K20 than Bowtie2 on a 6-core SB

Opens new doors: All-mapping on K20 comparable to Best-mapping on 6-core SB

Possibly even faster: With more GPU memory, we could likely get 1.5x faster


Coming Soon: NVBIO: BWA-MEM, SeqAn; CRAM, Variant Calling …

Benefits of GPU Accelerated Computing

Faster than CPU only systems in all tests

Large performance boost with marginal price increase

Energy usage cut by more than half

GPUs scale well within a node and over multiple nodes

K20 GPU is our fastest and lowest power high performance GPU yet

Try GPU accelerated TeraChem for free – www.nvidia.com/GPUTestDrive

3 Ways to Accelerate Applications

Applications

Libraries: “drop-in” acceleration
OpenACC Directives: easily accelerate applications
Programming Languages: maximum flexibility

GPU Accelerated Libraries - “Drop-in” Acceleration for your Applications

Linear Algebra (FFT, BLAS, sparse matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
Numerical & Math (RAND, statistics): NVIDIA Math Lib, NVIDIA cuRAND
Data Structures & AI (sort, scan, zero sum): GPU AI - Board Games, GPU AI - Path Finding
Visual Processing (image & video): NVIDIA NPP, NVIDIA Video Encode
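As an illustration of the “drop-in” idea, the sketch below is a minimal example written for this tutorial (not taken from any application above): a single-precision matrix multiply offloaded to the GPU through cuBLAS, using the same GEMM call pattern a CPU BLAS library would provide.

```
// Minimal cuBLAS SGEMM sketch: C = alpha*A*B + beta*C on the GPU.
// Compile with: nvcc sgemm_example.c -lcublas
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 512;                         // square matrices for simplicity
    const float alpha = 1.0f, beta = 0.0f;
    size_t bytes = (size_t)n * n * sizeof(float);

    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; hC[i] = 0.0f; }

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Column-major GEMM, same semantics as the standard BLAS sgemm call.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Aside from allocating device buffers and copying the data, the cublasSgemm call stands in directly for the CPU sgemm it replaces, which is what makes the libraries route "drop-in".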

GPU Programming Languages

Numerical analytics: R, MATLAB, Mathematica, LabVIEW
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C (a minimal kernel example follows this list)
C++: Thrust, CUDA C++
Python: PyCUDA, Anaconda Accelerate

C#: GPU.NET
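At the other end of the spectrum from drop-in libraries, the fragment below is a minimal CUDA C kernel written for this tutorial (an illustration, not code from the applications above): a SAXPY loop expressed as a GPU kernel plus the host code that launches it.

```
// Minimal CUDA C example: y = a*x + y (SAXPY) on the GPU.
// Compile with: nvcc saxpy.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *hx = (float*)malloc(bytes), *hy = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc((void**)&dx, bytes);
    cudaMalloc((void**)&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block; enough blocks to cover all n elements.
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected 5.0)\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}
```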

Easiest Way to Learn CUDA

Learn from the Best

Anywhere, Any Time

It’s Free!

50K registered, 127 countries

Engage with an Active Community

Develop on GeForce, Deploy on Tesla

GeForce GTX Titan - Designed for Gamers & Developers: 1+ teraflop double-precision performance; Dynamic Parallelism; Hyper-Q for CUDA streams; available everywhere (develop on any GPU released in the last 3 years: Macs, PCs, workstations, clusters, supercomputers).

Tesla K20X/K20 - Designed for Cluster Deployment: ECC; 24x7 runtime; GPU monitoring; cluster management; GPUDirect-RDMA; Hyper-Q for MPI; 3-year warranty; integrated OEM systems; professional support.

TITAN: World’s Fastest Supercomputer - 18,688 Tesla K20X GPUs; 27 petaflops peak, 17.59 petaflops on Linpack; 90% of performance from GPUs.
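The 90% figure is consistent with the K20X's published double-precision peak of about 1.31 TFLOPS per GPU (a spec-sheet number, not one stated on the slide):

\[
18{,}688 \times 1.31\ \mathrm{TFLOPS} \approx 24.5\ \mathrm{PFLOPS},
\qquad
\frac{24.5\ \mathrm{PFLOPS}}{27\ \mathrm{PFLOPS}} \approx 0.90
\]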

Test Drive K20 GPUs! Experience The Acceleration - www.nvidia.com/GPUTestDrive

Run Computational Chemistry Apps on Tesla K20 GPUs today

Try Preconfigured Apps: AMBER, NAMD, GROMACS, LAMMPS, Quantum Espresso, TeraChem Or Load Your Own

Sign up for FREE GPU Test Drive on remotely hosted clusters