LIFE AND MATERIAL SCIENCES
Mark Berger; mberger@.com

Founded 1993. Invented the GPU in 1999.
Computer Graphics, Visual Computing, Supercomputing, Cloud & Mobile Computing

NVIDIA: Core Technologies and Brands

GPU: GeForce®, Quadro®, Tesla®
Mobile: Tegra®
Cloud: GRID

Accelerated Computing: Multi-core plus Many-core

GPU accelerator: optimized for many parallel tasks. CPU: optimized for serial tasks.
3-10x+ compute throughput, 7x memory bandwidth, 5x energy efficiency.

How GPU Acceleration Works

Application code: the compute-intensive functions (roughly 5% of the code) are offloaded to the GPU, while the rest of the sequential code runs on the CPU.
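As an illustration of this offload model, here is a minimal CUDA C sketch (a hypothetical SAXPY hot spot, not taken from the deck): the compute-intensive loop becomes a GPU kernel while the surrounding sequential code stays on the CPU.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical compute-intensive function rewritten as a GPU kernel:
    // each thread computes one element of y = a*x + y (SAXPY).
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        // Sequential CPU code: set up the data.
        float *h_x = new float[n], *h_y = new float[n];
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        // Offload the hot spot: copy inputs to the GPU, launch the kernel, copy results back.
        float *d_x, *d_y;
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
        cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

        // The rest of the sequential code continues on the CPU.
        printf("y[0] = %f\n", h_y[0]);   // expect 4.0
        cudaFree(d_x); cudaFree(d_y);
        delete[] h_x; delete[] h_y;
        return 0;
    }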

GPUs: A Two-Year Heartbeat

[Roadmap chart: double-precision GFLOPS per watt, 2008-2014]
Tesla (CUDA) → Fermi (FP64) → Kepler (Dynamic Parallelism) → Maxwell (Unified Virtual Memory) → Volta (Stacked DRAM)

Kepler Features Make GPU Coding Easier
Hyper-Q: speeds up legacy MPI apps. Fermi exposed 1 work queue to the CPU; Kepler supports 32 concurrent work queues.
Dynamic Parallelism: less back-and-forth between CPU and GPU, simpler code.
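A minimal sketch of Kepler dynamic parallelism (hypothetical kernels, not from the deck): the parent kernel decides on and launches further work directly on the GPU instead of returning to the CPU between steps. Assumes a compute capability 3.5+ GPU and compilation with nvcc -arch=sm_35 -rdc=true.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical child kernel: refine one tile of a grid.
    __global__ void refine_tile(int tile, float *data) {
        data[tile * blockDim.x + threadIdx.x] += 1.0f;   // stand-in for real work
    }

    // Parent kernel: with dynamic parallelism, the decision to do more work is
    // made on the GPU and the child kernel is launched from device code,
    // instead of going back to the CPU between steps.
    __global__ void process_tiles(int ntiles, float *data) {
        int tile = blockIdx.x * blockDim.x + threadIdx.x;
        if (tile < ntiles && data[tile * 64] >= 0.0f)     // GPU-side decision (stand-in test)
            refine_tile<<<1, 64>>>(tile, data);           // GPU-side launch
    }

    int main() {
        const int ntiles = 128, tile_size = 64;
        float *data;
        cudaMalloc(&data, ntiles * tile_size * sizeof(float));
        cudaMemset(data, 0, ntiles * tile_size * sizeof(float));

        process_tiles<<<(ntiles + 63) / 64, 64>>>(ntiles, data);
        cudaDeviceSynchronize();
        printf("done\n");
        cudaFree(data);
        return 0;
    }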

Developer Momentum Continues to Grow

2008 → 2013:
CUDA-capable GPUs: 100M → 430M
CUDA downloads: 150K → 1.6M
Supercomputers: 1 → 50
University courses: 60 → 640
Academic papers: 4,000 → 37,000

Explosive Growth of GPU-Accelerated Apps

[Chart: number of GPU-accelerated apps, 2010-2012, with year-over-year increases of 40% and 61%]

Top scientific apps by domain:
Molecular Dynamics: AMBER, LAMMPS, CHARMM, NAMD, GROMACS, DL_POLY
Quantum Chemistry: QMCPACK, Quantum Espresso, NWChem, GAMESS-US, VASP
Climate & Weather: CAM-SE, COSMO, NIM, GEOS-5, WRF
Physics: Chroma, GTS, Denovo, ENZO, GTC, MILC
CAE: ANSYS Mechanical, ANSYS Fluent, MSC Nastran, OpenFOAM, SIMULIA Abaqus, LS-DYNA

NVIDIA GPU Life Science Focus

Molecular Dynamics: all key codes are available (accelerated or in development)
- AMBER, CHARMM, DL_POLY, GROMACS, LAMMPS, NAMD: great multi-GPU performance
- GPU-only codes: ACEMD, HOOMD-Blue
- Focus: scaling to large numbers of GPUs

Quantum Chemistry: key codes ported or being optimized
- Active GPU acceleration projects: VASP, NWChem, Gaussian, GAMESS, ABINIT, Quantum Espresso, BigDFT, CP2K, GPAW, etc.
- GPU-only code: TeraChem

Additional focus areas: Analytical and Medical Imaging Instruments, Genomics/Bioinformatics

Molecular Dynamics (MD) Applications

Application | Features Supported | GPU Perf | Release Status | Notes/Benchmarks
AMBER | PMEMD explicit solvent & GB implicit solvent | >100 ns/day JAC NVE on 2x K20s | Released: AMBER 12, GPU Support Revision 12.2; multi-GPU, multi-node | http://ambermd.org/gpus/benchmarks.htm#Benchmarks
CHARMM | Implicit (5x) and explicit (2x) solvent via OpenMM | 2x C2070 equals 32-35x X5667 CPUs | Released: Release C37b1; single & multi-GPU in a single node | http://www.charmm.org/news/c37b1.html#postjump
DL_POLY | Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV | 4x | Released: Version 4.04 (source only, results published); multi-GPU, multi-node | http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx
GROMACS | Implicit (5x) and explicit (2x) solvent | 165 ns/day DHFR on 4x C2075s | Released: Release 4.6.2, first multi-GPU support; multi-GPU, multi-node | -
LAMMPS | Lennard-Jones, Gay-Berne, Tersoff & many more potentials | 3.5-18x on Titan | Released; multi-GPU, multi-node | http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan
NAMD | Full electrostatics with PME and most simulation features | 4.0 ns/day F1-ATPase on 1x K20X | Released: NAMD 2.9; multi-GPU, multi-node | 100M-atom capable

GPU Perf compared against a multi-core CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

New/Additional MD Applications Ramping

Application | Features Supported | GPU Perf | Release Status | Notes

ACEMD | Written for use only on GPUs | 150 ns/day DHFR on 1x K20 | Released; single and multi-GPU | Production bio-molecular dynamics (MD) software specially optimized to run on GPUs
DESMOND | TBD | TBD | Under development | http://www.deshawresearch.com/resources_whatdesmonddoes.html
Folding@Home | Implicit solvent and folding | Depends upon number of GPUs | Released; GPUs and CPUs | Powerful distributed-computing molecular dynamics system; GPUs get 4x the points of CPUs. http://folding.stanford.edu
GPUGrid.net | High-performance all-atom biomolecular simulations; explicit solvent and binding | Depends upon number of GPUs | Released; NVIDIA GPUs only | http://www.gpugrid.net/
HALMD | Simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations) | Up to 66x on a 2090 vs. 1 CPU core | Released, Version 0.2.0; single GPU | http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen
HOOMD-Blue | Written for use only on GPUs | Kepler 2x faster than Fermi | Released, Version 0.11.3; single and multi-GPU on 1 node | Multi-GPU with MPI in September 2013. http://codeblue.umich.edu/hoomd-blue/
mdcore | TBD | TBD | Released, Version 0.1.7 | http://mdcore.sourceforge.net/download.html
OpenMM | Implicit and explicit solvent, custom forces | Implicit: 127-213 ns/day; explicit: 18-55 ns/day DHFR | Released, Version 5.1; multi-GPU | Library and application for high-performance molecular dynamics

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
AMBER 12 GPU Support Revision 12.2, 1/22/2013.

Kepler: Our Fastest Family of GPUs Yet

[Chart: Factor IX benchmark in nanoseconds/day, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.]
1 CPU node: 3.42 ns/day
1 CPU node + M2090: 11.85 ns/day (3.5x)
1 CPU node + K10: 18.90 ns/day (5.6x)
1 CPU node + K20: 22.44 ns/day (6.6x)
1 CPU node + K20X: 25.39 ns/day (7.4x)
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.

K20 Accelerates Simulations of All Sizes

[Chart: speedup vs. a CPU-only node, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node contains 2x Intel E5-2687W CPUs plus 1x NVIDIA K20 GPU. Benchmarks: TRPcage (GB), JAC NVE (PME), Factor IX NVE (PME), Cellulose NVE (PME), Myoglobin (GB), Nucleosome (GB); speedups range from roughly 2.7x to 28x over the CPU-only baseline.]

Gain 28x throughput/performance on Nucleosome by adding just one K20 GPU, compared to dual-CPU performance.

Quantum Chemistry Applications

Application | Features Supported | GPU Perf | Release Status | Notes

Abinit | Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization / orthogonalization | 1.3-2.7x | Released, Version 7.2.2; multi-GPU support | www..org
ACES III | Integrating GPU scheduling into the SIAL programming language and SIP runtime environment | 10x on kernels | Under development; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf
ADF | Fock matrix, Hessians | TBD | Under development (pilot project completed); multi-GPU support | www.scm.com
BigDFT | DFT; Daubechies wavelets; part of Abinit | 5-25x (1 CPU core to GPU kernel) | Released, Version 1.6.0; multi-GPU support | http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf
Casino | Code for performing quantum Monte Carlo (QMC) electronic structure calculations for finite and periodic systems | TBD | Under development; multi-GPU support | http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html
CASTEP | TBD | TBD | Under development | http://www.castep.org/Main/HomePage
CP2K | DBCSR (sparse matrix multiply library) | 2-7x | Under development; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf
GAMESS-US | Libqc with Rys Quadrature algorithm, Hartree-Fock, MP2 and CCSD | 1.3-1.6x; 2.3-2.9x HF | Released; multi-GPU support | http://www.msg.ameslab.gov/gamess/index.html

GAMESS-UK | (ss|ss) type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics | 8x | Released, Version 7.0; multi-GPU support | http://www.ncbi.nlm.nih.gov/pubmed/21541963
Gaussian | Joint PGI, NVIDIA & Gaussian collaboration | TBD | Under development; multi-GPU support | http://www.gaussian.com/g_press/nvidia_press.htm
GPAW | Electrostatic Poisson equation, orthonormalization of vectors, residual minimization method (RMM-DIIS) | 8x | Released; multi-GPU support | https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)
Schrodinger, Inc. | Investigating GPU acceleration | TBD | Under development; multi-GPU support | http://www.schrodinger.com/kb/278 and http://www.schrodinger.com/productpage/14/7/32/
LATTE | cuBLAS, SP2 algorithm | TBD | Released; multi-GPU support | http://on-demand.gputechconf.com/gtc/2013/presentations/S3195-Fast-Quantum-Molecular-Dynamics-in-LATTE.pdf
MOLCAS | cuBLAS support | 1.1x | Released, Version 7.8; single GPU, additional GPU support coming in Version 8 | www..org
MOLPRO | Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT | 1.7-2.3x projected | Under development; multiple GPUs | www..net, Hans-Joachim Werner

MOPAC2012 | Pseudodiagonalization, full diagonalization, and density matrix assembling | 3.8-14x | Under development; single GPU | Academic port. http://openmopac.net
NWChem | Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers | 3-10x projected | Released, Version 6.3; multiple GPUs | Development GPGPU benchmarks: www.-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf
Octopus | Full GPU support for ground-state, real-time calculations; Kohn-Sham Hamiltonian, orthogonalization, subspace diagonalization, Poisson solver, time propagation | 1.5-8x | Released, Version 4.1.0 | http://www.tddft.org/programs/octopus/
ONETEP | TBD | TBD | Under development | http://www2.tcm.phy.cam.ac.uk/onetep/
PEtot | Density functional theory (DFT) plane-wave calculations | 6-10x | Released; multi-GPU | First-principles materials code that computes the behavior of the structures of materials
Q-CHEM | RI-MP2 | 8x-14x | Released, Version 4.0.1; multi-GPU support | http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf

QMCPACK | Main features | 3-4x | Released; multiple GPUs | NCSA, University of Illinois at Urbana-Champaign. http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK
Quantum Espresso/PWscf | PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs | 2.5-3.5x | Released, Version 5.0; multiple GPUs | Created by the Irish Centre for High-End Computing. http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/
TeraChem | "Full GPU-based solution"; completely redesigned to exploit GPU parallelism | 44-650x vs. GAMESS CPU version | Released, Version 1.5; multi-GPU/single node | YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf
VASP | Hybrid Hartree-Fock DFT functionals including exact exchange | 2x; 2 GPUs comparable to 128 CPU cores | Available on request; multiple GPUs | By Carnegie Mellon University. http://arxiv.org/pdf/1111.0716.pdf
WL-LSMS | Generalized Wang-Landau method | 3x with 32 GPUs vs. 32 (16-core) CPUs | Under development; multi-GPU support | NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Viz, "Docking" and Related Applications Growing

Related Application | Features Supported | GPU Perf | Release Status | Notes

Amira 5® | 3D visualization of volumetric data and surfaces | 70x | Released, Version 5.5; single GPU | Visualization from Visage Imaging. http://www.visageimaging.com/overview.html
BINDSURF | High-throughput parallel blind virtual screening | 100x | Available upon request to authors; single GPU | Allows fast processing of large ligand databases. http://www.biomedcentral.com/1471-2105/13/S14/S13
BUDE | Empirical free-energy forcefield | 6.5-13.4x | Released; single GPU | University of Bristol. http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm
Core Hopping | GPU-accelerated application | 3.75-5000x | Released; single and multi-GPU | Schrodinger, Inc. http://www.schrodinger.com/products/14/32/
FastROCS | Real-time shape similarity searching/comparison | 800-3000x | Released; single and multi-GPU | OpenEye Scientific Software. http://www.eyesopen.com/fastrocs


Interactive Visualizer | Written for use only on GPUs | TBD | Under development; multi-GPU support | http://www.molecular-visualization.com
Molegro Virtual Docker 6 | Energy grid computation, pose evaluation and guided differential evolution | 25-30x | Single GPU | http://www.molegro.com/gpu.php
PIPER Protein Docking | Molecule docking | 3-8x | Released | http://www.bu.edu/caadlab/gpgpu09.pdf; http://www.schrodinger.com/productpage/14/40/
PyMol | Lines (460% increase), cartoons (1246%), surfaces (1746%), spheres (753%), ribbons (426%) | 1700x | Released, Version 1.6; single GPU | http://pymol.org/
VMD | High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple-GPU support for display of molecular orbitals | 100-125x or greater on kernels | Released, Version 1.9.1; multi-GPU support | Visualization from University of Illinois at Urbana-Champaign. http://www.ks.uiuc.edu/Research/vmd/

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

"Organic Growth" of Bioinformatics Applications

Application | Features Supported | GPU Speedup | Release Status | Website

BarraCUDA | Alignment of short sequencing reads | 6-10x | Released, Version 0.6.2; multi-GPU, multi-node | http://seqbarracuda.sourceforge.net/
CUDASW++ | Parallel Smith-Waterman database search | 10-50x | Released, Version 2.0.10; multi-GPU, multi-node | http://sourceforge.net/projects/cudasw/
CUSHAW | Parallel, accurate long-read aligner for large genomes | 10x | Released, Version 1.0.40; multi-GPU | http://cushaw.sourceforge.net/
GPU-BLAST | Protein alignment according to BLASTP | 3-4x | Released, Version 2.2.26; single GPU | http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html
mCUDA-MEME | Scalable motif discovery algorithm based on MEME | 4-10x | Released, Version 3.0.13; multi-GPU, multi-node | https://sites.google.com/site/yongchaosoftware/mcuda-meme
MUMmerGPU | Aligns multiple query sequences against a reference sequence in parallel | 3-10x | Released, Version 3.0 | http://sourceforge.net/apps/mediawiki/mummergpu/index.php?title=MUMmerGPU
SeqNFind | Hardware and software for reference assembly, BLAST, Smith-Waterman, HMM, de novo assembly | 400x | Released; multi-GPU, multi-node | http://www.seqnfind.com/
SOAP3 | Short-read alignment tool that is not heuristic based; reports all answers | 10x | Beta release, Version 0.01 | http://soap.genomics.org.cn/soap3.html
UGENE | Fast short-read alignment | 6-8x | Released, Version 1.12; multi-GPU, multi-node | http://ugene.unipro.ru/
WideLM | Parallel linear regression on multiple similarly-shaped models | 150x | Released, Version 0.1-1; multi-GPU, multi-node | http://insilicos.com/products/widelm

GPU speedup compared against the same or similar code running on a single-CPU machine; performance measured internally or independently.

3 Ways to Accelerate Applications

Applications can be accelerated in three ways:
Libraries: "drop-in" acceleration
OpenACC directives: easily accelerate applications (see the sketch after this list)
Programming languages: maximum flexibility
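As a sketch of the directives approach (hypothetical loop, not from the deck), an existing C/C++ loop is annotated with an OpenACC directive and an OpenACC compiler such as PGI generates the GPU offload and the data movement described by the clauses; no CUDA kernel is written by hand.

    // Hypothetical example: the original CPU loop is unchanged except for the directive.
    void scale_energies(int n, double factor, const double *in, double *out) {
        #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
        for (int i = 0; i < n; ++i)
            out[i] = factor * in[i];
    }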

GPU Accelerated Libraries: "Drop-in" Acceleration for your Applications

Linear algebra: FFT (NVIDIA cuFFT), BLAS (NVIDIA cuBLAS), sparse matrix (NVIDIA cuSPARSE) (see the cuBLAS sketch below)
Numerical & math: RAND (NVIDIA cuRAND), statistics (NVIDIA Math Lib)
Data structures & AI: sort, scan, zero-sum (GPU AI - Board Games), path finding (GPU AI - Path Finding)
Visual processing: image & video (NVIDIA NPP), video encode (NVIDIA Video Encode)
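A hedged sketch of the "drop-in" idea using cuBLAS (illustrative matrix sizes, not from the deck): a CPU BLAS dgemm call is replaced by the corresponding cublasDgemm call, with the matrices staged on the GPU. Link with -lcublas.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>
    #include <cstdio>

    // Multiply two n x n matrices with cuBLAS instead of a CPU BLAS dgemm.
    int main() {
        const int n = 512;
        std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

        double *dA, *dB, *dC;
        cudaMalloc(&dA, n * n * sizeof(double));
        cudaMalloc(&dB, n * n * sizeof(double));
        cudaMalloc(&dC, n * n * sizeof(double));
        cublasHandle_t handle;
        cublasCreate(&handle);

        // cublasSetMatrix copies host matrices to the device (column-major, like BLAS).
        cublasSetMatrix(n, n, sizeof(double), A.data(), n, dA, n);
        cublasSetMatrix(n, n, sizeof(double), B.data(), n, dB, n);

        const double alpha = 1.0, beta = 0.0;
        // Same math as BLAS dgemm: C = alpha*A*B + beta*C, but executed on the GPU.
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);

        cublasGetMatrix(n, n, sizeof(double), dC, n, C.data(), n);
        printf("C[0] = %f\n", C[0]);   // expect 1024.0 for these inputs

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }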

GPU Programming Languages

Numerical analytics: R, MATLAB, Mathematica, LabVIEW
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++ (see the Thrust sketch after this list)
Python: PyCUDA, Anaconda Accelerate

C#: GPU.NET
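For the C++ route, a minimal Thrust sketch (illustrative data, not from the deck): STL-style algorithms such as sort and reduce execute as CUDA kernels without writing any kernel code by hand.

    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        // Generate data on the host, then move it to the GPU.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand() % 1000;
        thrust::device_vector<int> d = h;

        // STL-style algorithms run as CUDA kernels under the hood.
        thrust::sort(d.begin(), d.end());
        int total = thrust::reduce(d.begin(), d.end(), 0);

        printf("smallest = %d, sum = %d\n", (int)d[0], total);
        return 0;
    }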

Easiest Way to Learn CUDA

Learn from the best. Anywhere, any time. It's free!
50K registered, 127 countries.
Engage with an active community.

Develop on GeForce, Deploy on Tesla

GeForce GTX Titan (designed for gamers & developers):
- 1+ teraflop double-precision performance
- Dynamic Parallelism, Hyper-Q for CUDA streams
- Available everywhere! Develop on any GPU released in the last 3 years: Macs, PCs, workstations, clusters, supercomputers

Tesla K20X/K20 (designed for cluster deployment):
- ECC, 24x7 runtime
- GPU monitoring, cluster management
- GPUDirect-RDMA, Hyper-Q for MPI
- 3-year warranty, integrated OEM systems, professional support

TITAN: World's Fastest Supercomputer
- 18,688 Tesla K20X GPUs
- 27 petaflops peak, 17.59 petaflops on Linpack
- 90% of performance from GPUs

Test Drive K20 GPUs! Experience the Acceleration

Run on Tesla K20 GPUs today

Sign up for a FREE GPU Test Drive on remotely hosted clusters: www.nvidia.com/GPUTestDrive

LIFE AND MATERIAL SCIENCES
Mark Berger; [email protected]