LIFE AND MATERIAL SCIENCES
Mark Berger; mberger@nvidia.com

NVIDIA - Core Technologies and Brands
Founded 1993; invented the GPU in 1999
Computer Graphics, Visual Computing, Supercomputing, Cloud & Mobile Computing
GPU: GeForce®, Quadro®, Tesla®
Mobile: Tegra®
Cloud: GRID

Accelerated Computing: Multi-core plus Many-cores
CPU: optimized for serial tasks
GPU accelerator: optimized for many parallel tasks
3-10x+ compute throughput, 7x memory bandwidth, 5x energy efficiency

How GPU Acceleration Works
Application code is split: the compute-intensive functions (typically ~5% of the code) run on the GPU, while the rest of the sequential code stays on the CPU.
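The economics of this split follow Amdahl's law: overall speedup is bounded by the fraction of runtime (not lines of code) that the offloaded functions account for. A minimal sketch, assuming a hypothetical profile in which the compute-intensive 5% of the code consumes 90% of the runtime and the GPU runs it 10x faster (all figures illustrative, not measured):

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only a fraction of the runtime is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)

# Hypothetical profile: 90% of runtime accelerated 10x on the GPU.
print(round(amdahl_speedup(0.90, 10.0), 2))  # -> 5.26
```

Even a modest hot-spot fraction left on the CPU caps the overall gain, which is why profiling comes before porting.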
GPUs: A Two-Year Heartbeat
Double-precision GFLOPS per watt climbs with each generation (chart scale 0.5 to 32, 2008-2014), each adding a signature feature:
Tesla (2008): CUDA
Fermi (2010): FP64
Kepler (2012): Dynamic Parallelism
Maxwell: Unified Virtual Memory
Volta: Stacked DRAM

Kepler Features Make GPU Coding Easier
Hyper-Q: speeds up legacy MPI apps. Fermi offered 1 work queue; Kepler offers 32 concurrent work queues.
Dynamic Parallelism: less back-and-forth between CPU and GPU, simpler code.

Developer Momentum Continues to Grow
2008 → 2013
CUDA-capable GPUs: 100M → 430M
CUDA downloads: 150K → 1.6M
Supercomputers: 1 → 50
University courses: 60 → 640
Academic papers: 4,000 → 37,000

Explosive Growth of GPU Accelerated Apps
The number of top scientific apps accelerated or in development grew 40% (2010 to 2011) and a further 61% (2011 to 2012), approaching 200:
Molecular Dynamics: AMBER, CHARMM, NAMD, GROMACS, LAMMPS, DL_POLY
Quantum Chemistry: QMCPACK, Gaussian, Quantum Espresso, NWChem, GAMESS-US, VASP
Climate & Weather: CAM-SE, COSMO, NIM, GEOS-5, WRF
Physics: Chroma, GTS, Denovo, ENZO, GTC, MILC
CAE: ANSYS Mechanical, ANSYS Fluent, MSC Nastran, OpenFOAM, SIMULIA Abaqus, LS-DYNA
NVIDIA GPU Life Science Focus
Molecular Dynamics: all key codes are GPU-accelerated (AMBER, CHARMM, DESMOND, DL_POLY, GROMACS, LAMMPS, NAMD), with great multi-GPU performance. GPU-only codes: ACEMD, HOOMD-Blue. Focus: scaling to large numbers of GPUs.
Quantum Chemistry: key codes ported or being optimized. Active GPU acceleration projects: VASP, NWChem, Gaussian, GAMESS, ABINIT, Quantum Espresso, BigDFT, CP2K, GPAW, etc. GPU-only code: TeraChem.
Additional focus areas: analytical and medical imaging instruments, genomics/bioinformatics.
Molecular Dynamics (MD) Applications

AMBER
  Features: PMEMD explicit solvent & GB implicit solvent
  GPU perf: >100 ns/day on JAC NVE with 2x K20s
  Release status: Released, AMBER 12 GPU Support Revision 12.2; multi-GPU, multi-node
  Notes/benchmarks: http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM
  Features: implicit (5x) and explicit (2x) solvent via OpenMM
  GPU perf: 2x C2070 equals 32-35x X5667 CPUs
  Release status: Released, Release C37b1; single & multi-GPU in a single node
  Notes: http://www.charmm.org/news/c37b1.html#postjump

DL_POLY
  Features: two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
  GPU perf: 4x
  Release status: source only, results published; Release V 4.04; multi-GPU, multi-node
  Notes: http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS
  Features: implicit (5x) and explicit (2x) solvent
  GPU perf: 165 ns/day DHFR on 4x C2075s
  Release status: Released, Release 4.6.2 (first multi-GPU support); multi-GPU, multi-node

LAMMPS
  Features: Lennard-Jones, Gay-Berne, Tersoff & many more potentials
  GPU perf: 3.5-18x on Titan
  Release status: Released; multi-GPU, multi-node
  Notes: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD
  Features: full electrostatics with PME and most simulation features; 100M-atom capable
  GPU perf: 4.0 ns/day F1-ATPase on 1x K20X
  Release status: Released, NAMD 2.9; multi-GPU, multi-node

GPU perf compared against a multi-core x86 CPU socket; benchmarked on GPU-supported features and may be a kernel-to-kernel comparison.

New/Additional MD Applications Ramping
ACEMD
  Features: production bio-molecular dynamics software specially optimized to run on GPUs; written for use only on GPUs
  GPU perf: 150 ns/day DHFR on 1x K20
  Release status: Released; single and multi-GPU

DESMOND
  Release status: under development
  Notes: http://www.deshawresearch.com/resources_whatdesmonddoes.html

Folding@Home
  Features: powerful distributed-computing molecular dynamics system; implicit solvent and folding
  GPU perf: depends upon number of GPUs; GPUs get 4x the points of CPUs
  Release status: Released; GPUs and CPUs
  Notes: http://folding.stanford.edu

GPUGrid.net
  Features: high-performance all-atom biomolecular simulations; explicit solvent and binding
  GPU perf: depends upon number of GPUs
  Release status: Released; NVIDIA GPUs only
  Notes: http://www.gpugrid.net/

HALMD
  Features: simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations)
  GPU perf: up to 66x on 2090 vs. 1 CPU core
  Release status: Released, Version 0.2.0; single GPU
  Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue
  Features: written for use only on GPUs
  GPU perf: Kepler 2x faster than Fermi
  Release status: Released, Version 0.11.3; single and multi-GPU on one node; multi-GPU with MPI in September 2013
  Notes: http://codeblue.umich.edu/hoomd-blue/

mdcore
  Release status: Released, Version 0.1.7
  Notes: http://mdcore.sourceforge.net/download.html

OpenMM
  Features: implicit and explicit solvent, custom forces
  GPU perf: implicit 127-213 ns/day, explicit 18-55 ns/day (DHFR)
  Release status: Released, Version 5.1; multi-GPU
  Notes: library and application for high-performance molecular dynamics

Kepler - Our Fastest Family of GPUs Yet
Factor IX benchmark, running AMBER 12 GPU Support Revision 12.1 (ns/day; speedup vs. the CPU-only node):
  1 CPU node: 3.42
  1 CPU node + M2090: 11.85 (3.5x)
  1 CPU node + K10: 18.90 (5.6x)
  1 CPU node + K20: 22.44 (6.6x)
  1 CPU node + K20X: 25.39 (7.4x)
The blue (CPU-only) node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.

K20 Accelerates Simulations of All Sizes
Speedup vs. CPU-only, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off:
  CPU only (all molecules): 1.00
  TRPcage GB: 2.66
  JAC NVE PME: 6.50
  Factor IX NVE PME: 6.56
  Cellulose NVE PME: 7.28
  Myoglobin GB: 25.56
  Nucleosome GB: 28.00
The CPU-only node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each GPU node adds 1x NVIDIA K20.
Gain 28x throughput/performance on Nucleosome by adding just one K20 GPU, compared to dual-CPU performance.
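The speedup labels on these charts are simply each configuration's throughput divided by the CPU-only node's 3.42 ns/day; a quick Python check against the Factor IX numbers above (values transcribed from the chart):

```python
cpu_ns_per_day = 3.42  # dual E5-2687W node, Factor IX, AMBER 12
gpu_ns_per_day = {"M2090": 11.85, "K10": 18.90, "K20": 22.44, "K20X": 25.39}

# Speedup = GPU-node throughput / CPU-only throughput.
speedups = {gpu: nd / cpu_ns_per_day for gpu, nd in gpu_ns_per_day.items()}
for gpu, s in speedups.items():
    print(f"1 CPU node + {gpu}: {s:.1f}x vs. CPU-only")
```

This reproduces the quoted range of roughly 3.5x (M2090) up to 7.4x (K20X); small differences in the intermediate bars come from rounding in the published labels.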
Quantum Chemistry Applications

Abinit
  Features: local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization
  GPU perf: 1.3-2.7x
  Release status: Released, Version 7.2.2; multi-GPU support
  Notes: www.abinit.org

ACES III
  Features: integrating GPU scheduling into the SIAL programming language and SIP runtime environment
  GPU perf: 10x on kernels
  Release status: under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
  Features: Fock matrix, Hessians
  Release status: pilot project completed; under development; multi-GPU support
  Notes: www.scm.com

BigDFT
  Features: DFT; Daubechies wavelets; part of Abinit
  GPU perf: 5-25x (1 CPU core to GPU kernel)
  Release status: Released, Version 1.6.0; multi-GPU support
  Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html, http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-Formalism.pdf and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/BigDFT-HPC-tues.pdf

Casino
  Features: quantum Monte Carlo (QMC) electronic structure calculations for finite and periodic systems
  Release status: under development; multi-GPU support
  Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CASTEP
  Release status: under development
  Notes: http://www.castep.org/Main/HomePage

CP2K
  Features: DBCSR (sparse matrix multiply library)
  GPU perf: 2-7x
  Release status: under development; multi-GPU support
  Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US
  Features: Libqc with Rys quadrature algorithm; Hartree-Fock, MP2 and CCSD
  GPU perf: 1.3-1.6x, 2.3-2.9x HF
  Release status: Released; multi-GPU support
  Notes: http://www.msg.ameslab.gov/gamess/index.html
GAMESS-UK
  Features: (ss|ss)-type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics
  GPU perf: 8x
  Release status: Released, Version 7.0; multi-GPU support
  Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian
  Features: joint PGI, NVIDIA & Gaussian collaboration
  Release status: under development; multi-GPU support
  Notes: http://www.gaussian.com/g_press/nvidia_press.htm

GPAW
  Features: electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS)
  GPU perf: 8x
  Release status: Released; multi-GPU support
  Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html; Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)

Jaguar
  Features: investigating GPU acceleration
  Release status: under development; multi-GPU support
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/kb/278 and http://www.schrodinger.com/productpage/14/7/32/

LATTE
  Features: cuBLAS, SP2 algorithm
  Release status: Released; multi-GPU support
  Notes: http://on-demand.gputechconf.com/gtc/2013/presentations/S3195-Fast-Quantum-Molecular-Dynamics-in-LATTE.pdf

MOLCAS
  Features: cuBLAS support
  GPU perf: 1.1x
  Release status: Released, Version 7.8; single GPU; additional GPU support coming in Version 8
  Notes: www.molcas.org

MOLPRO
  Features: density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT
  GPU perf: 1.7-2.3x projected
  Release status: under development; multiple GPUs
  Notes: www.molpro.net; Hans-Joachim Werner
MOPAC2012
  Features: pseudodiagonalization, full diagonalization, and density matrix assembling
  GPU perf: 3.8-14x
  Release status: under development; single GPU; academic port
  Notes: http://openmopac.net

NWChem
  Features: triples part of Reg-CCSD(T); CCSD & EOMCCSD task schedulers
  GPU perf: 3-10x projected (development GPGPU benchmarks)
  Release status: Released, Version 6.3; multiple GPUs
  Notes: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus
  Features: full GPU support for ground-state, real-time calculations; Kohn-Sham Hamiltonian, orthogonalization, subspace diagonalization, Poisson solver, time propagation
  GPU perf: 1.5-8x
  Release status: Released, Version 4.1.0
  Notes: http://www.tddft.org/programs/octopus/

ONETEP
  Release status: under development
  Notes: http://www2.tcm.phy.cam.ac.uk/onetep/

PEtot
  Features: density functional theory (DFT) plane-wave pseudopotential calculations
  GPU perf: 6-10x
  Release status: Released; multi-GPU
  Notes: first-principles materials code that computes the behavior of the electron structures of materials

Q-CHEM
  Features: RI-MP2
  GPU perf: 8-14x
  Release status: Released, Version 4.0.1; multi-GPU support
  Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
QMCPACK
  Features: main features
  GPU perf: 3-4x
  Release status: Released; multiple GPUs
  Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf
  Features: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
  GPU perf: 2.5-3.5x
  Release status: Released, Version 5.0; multiple GPUs
  Notes: created by the Irish Centre for High-End Computing; http://www.quantum-espresso.org/index.php and http://www.quantum-espresso.org/

TeraChem
  Features: completely redesigned to exploit GPU parallelism; "full GPU-based solution"
  GPU perf: 44-650x vs. the GAMESS CPU version
  Release status: Released, Version 1.5; multi-GPU, single node
  Notes: YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP
  Features: hybrid Hartree-Fock DFT functionals including exact exchange
  GPU perf: 2 GPUs comparable to 128 CPU cores
  Release status: available on request; multiple GPUs
  Notes: by Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS
  Features: generalized Wang-Landau method
  GPU perf: 3x with 32 GPUs vs. 32 (16-core) CPUs
  Release status: under development; multi-GPU support
  Notes: NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

Viz, "Docking" and Related Applications Growing
Amira 5®
  Features: 3D visualization of volumetric data and surfaces
  GPU perf: 70x
  Release status: Released, Version 5.5; single GPU
  Notes: visualization from Visage Imaging; http://www.visageimaging.com/overview.html

BINDSURF
  Features: high-throughput parallel blind virtual screening; allows fast processing of large ligand databases
  GPU perf: 100x
  Release status: available upon request to authors; single GPU
  Notes: http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE
  Features: empirical free-energy forcefield
  GPU perf: 6.5-13.4x
  Release status: Released; single GPU
  Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping
  Features: GPU-accelerated application
  GPU perf: 3.75-5000x
  Release status: Released; single and multi-GPU
  Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/

FastROCS
  Features: real-time shape similarity searching/comparison
  GPU perf: 800-3000x
  Release status: Released; single and multi-GPU
  Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs
Interactive Molecule Visualizer
  Features: written for use only on GPUs
  Release status: under development; multi-GPU support
  Notes: http://www.molecular-visualization.com

Molegro Virtual Docker 6
  Features: energy grid computation, pose evaluation and guided differential evolution
  GPU perf: 25-30x
  Release status: single GPU
  Notes: http://www.molegro.com/gpu.php

PIPER Protein Docking
  Features: molecule docking
  GPU perf: 3-8x
  Release status: Released
  Notes: http://www.bu.edu/caadlab/gpgpu09.pdf; http://www.schrodinger.com/productpage/14/40/

PyMol
  Features: lines (460% increase), cartoons (1246%), surfaces (1746%), spheres (753%), ribbons (426%)
  GPU perf: 1700x
  Release status: Released, Version 1.6; single GPU
  Notes: http://pymol.org/

VMD
  Features: high-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple-GPU support for display of molecular orbitals
  GPU perf: 100-125x or greater on kernels
  Release status: Released, Version 1.9.1; multi-GPU support
  Notes: visualization from the University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/

"Organic Growth" of Bioinformatics Applications
BarraCUDA
  Features: alignment of short sequencing reads
  GPU speedup: 6-10x
  Release status: Released, Version 0.6.2; multi-GPU, multi-node
  Website: http://seqbarracuda.sourceforge.net/

CUDASW++
  Features: parallel Smith-Waterman database search
  GPU speedup: 10-50x
  Release status: Released, Version 2.0.10; multi-GPU, multi-node
  Website: http://sourceforge.net/projects/cudasw/

CUSHAW
  Features: parallel, accurate long-read aligner for large genomes
  GPU speedup: 10x
  Release status: Released, Version 1.0.40; multiple GPUs
  Website: http://cushaw.sourceforge.net/

GPU-BLAST
  Features: protein alignment according to BLASTP
  GPU speedup: 3-4x
  Release status: Released, Version 2.2.26; single GPU
  Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html

mCUDA-MEME
  Features: scalable motif discovery algorithm based on MEME
  GPU speedup: 4-10x
  Release status: Released, Version 3.0.13; multi-GPU, multi-node
  Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

MUMmerGPU
  Features: aligns multiple query sequences against a reference sequence in parallel
  GPU speedup: 3-10x
  Release status: Released, Version 3.0
  Website: http://sourceforge.net/apps/mediawiki/mummergpu/index.php?title=MUMmerGPU

SeqNFind
  Features: hardware and software for reference assembly, BLAST, Smith-Waterman, HMM, de novo assembly
  GPU speedup: 400x
  Release status: Released; multi-GPU, multi-node
  Website: http://www.seqnfind.com/

SOAP3
  Features: short-read alignment tool that is not heuristic based; reports all answers
  GPU speedup: 10x
  Release status: beta release, Version 0.01
  Website: http://soap.genomics.org.cn/soap3.html

UGENE
  Features: fast short-read alignment
  GPU speedup: 6-8x
  Release status: Released, Version 1.12; multi-GPU, multi-node
  Website: http://ugene.unipro.ru/

WideLM
  Features: parallel linear regression on multiple similarly-shaped models
  GPU speedup: 150x
  Release status: Released, Version 0.1-1; multi-GPU, multi-node
  Website: http://insilicos.com/products/widelm

GPU speedup compared against the same or similar code running on a single-CPU machine; performance measured internally or independently.

3 Ways to Accelerate Applications
Applications can take three routes to the GPU:
  Libraries: "drop-in" acceleration
  OpenACC directives: easily accelerate applications
  Programming languages: maximum flexibility

GPU Accelerated Libraries: "Drop-in" Acceleration for Your Applications
  Linear algebra (FFT, BLAS, SPARSE, matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
  Numerical & math (RAND, statistics): NVIDIA Math Lib, NVIDIA cuRAND
  Data structures & AI (sort, scan, zero sum): GPU AI - board games, GPU AI - path finding
  Visual processing (image & video): NVIDIA NPP, NVIDIA Video Encode

GPU Programming Languages
Numerical analytics: R, MATLAB, Mathematica, LabVIEW
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++
Python: PyCUDA, Anaconda Accelerate
C#: GPU.NET
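Whichever route is chosen (library, directive, or language), the workhorses are shared data-parallel primitives such as the sort and scan operations listed under the libraries above. As a plain-CPU reference point, here is a minimal Python sketch of the inclusive scan (prefix sum) that a GPU library like Thrust provides as `thrust::inclusive_scan`:

```python
from itertools import accumulate

def inclusive_scan(xs):
    """Sequential reference for a parallel prefix sum: out[i] = xs[0] + ... + xs[i]."""
    return list(accumulate(xs))

print(inclusive_scan([3, 1, 4, 1, 5]))  # -> [3, 4, 8, 9, 14]
```

On the GPU the same operation is computed in a logarithmic number of parallel steps, which is what makes scan a useful building block for stream compaction, sorting, and histogramming.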
Easiest Way to Learn CUDA

Learn from the Best
Anywhere, Any Time
It's Free!
50K registered students, 127 countries
Engage with an Active Community

Develop on GeForce, Deploy on Tesla

GeForce GTX Titan (designed for gamers & developers):
  1+ teraflop double-precision performance
  Dynamic Parallelism
  Hyper-Q for CUDA streams
  Available everywhere: Macs, PCs, workstations, clusters, supercomputers
  Develop on any GPU released in the last 3 years

Tesla K20X/K20 (designed for cluster deployment):
  ECC
  24x7 runtime
  GPU monitoring
  Cluster management
  GPUDirect-RDMA
  Hyper-Q for MPI
  3-year warranty
  Integrated OEM systems, professional support

TITAN: World's Fastest Supercomputer
  18,688 Tesla K20X GPUs
  27 petaflops peak, 17.59 petaflops on Linpack
  90% of performance from GPUs

Test Drive K20 GPUs! Experience the Acceleration
Run on Tesla K20 GPUs today
Sign up for a FREE GPU Test Drive on remotely hosted clusters: www.nvidia.com/GPUTestDrive