Outline of NVIDIA tutorial at MSI
I. Intro to GPUs
A. MSI plans to add significantly more GPUs in the future; the current system is described at: https://www.msi.umn.edu/hpc/cascade
B. MSI currently has: i. 8 Westmere nodes with 4 Fermi M2070 GPGPUs/node (Cascade queue) ii. 4 SandyBridge nodes with 2 Kepler K20m GPGPUs/node (Kepler queue)
II. Computational Chemistry and Biology Applications
A. Molecular Dynamics Applications, including: i. AMBER, NAMD, LAMMPS, GROMACS, CHARMM, DESMOND, DL_POLY, OpenMM ii. GPU only codes: ACEMD, HOOMD-Blue
B. Quantum Chemistry, including: i. Abinit, BigDFT, CP2K, Gaussian, GAMESS, GPAW, NWChem, Quantum Espresso, Q-CHEM, VASP & more ii. GPU only code: TeraChem (ParaChem)
C. Bioinformatics i. NVBIO (nvBowtie), BWA-MEM (both currently in development)
Updated: Oct. 21, 2013

NVIDIA - Core Technologies and Brands
Founded 1993. Invented the GPU in 1999 for computer graphics; today spans visual computing, supercomputing, cloud & mobile computing.
- GPU: GeForce®, Quadro®, Tesla®
- Mobile: Tegra®
- Cloud: GRID

Accelerated Computing: Multi-core plus Many-core
A GPU accelerator is optimized for parallel tasks, while the CPU is optimized for serial tasks. Relative to a CPU, the GPU delivers 3-10x+ compute throughput, 7x memory bandwidth, and 5x energy efficiency.

How GPU Acceleration Works
In application code, the compute-intensive functions (roughly 5% of the code) are moved to the GPU, while the rest of the sequential code continues to run on the CPU.
GPUs: Two-Year Heartbeat
[Roadmap chart, DP GFLOPS per watt, 2008-2014: Tesla (CUDA) ~0.5, Fermi (FP64) ~2, Kepler (Dynamic Parallelism) ~8, Maxwell (Unified Virtual Memory) ~16, Volta (Stacked DRAM) ~32]

Kepler Features Make GPU Coding Easier
- Hyper-Q: speeds up legacy MPI apps. Fermi exposed 1 work queue; Kepler supports 32 concurrent work queues.
- Dynamic Parallelism: less back-and-forth between CPU and GPU, simpler code.

Developer Momentum Continues to Grow
From 2008 to 2013:
- CUDA-capable GPUs: 100M to 430M
- CUDA downloads: 150K to 1.6M
- Supercomputers: 1 to 50
- University courses: 60 to 640
- Academic papers: 4,000 to 37,000

Explosive Growth of GPU-Accelerated Apps
[Chart: number of top scientific apps, 2010-2012, rising toward ~200, with 61% and 40% year-over-year increases]
- Molecular Dynamics: AMBER, CHARMM, DL_POLY, GROMACS, LAMMPS, NAMD
- Quantum Chemistry: GAMESS-US, Gaussian, NWChem, QMCPACK, Quantum Espresso, VASP
- Climate & Weather: CAM-SE, COSMO, GEOS-5, NIM, WRF
- Physics: Chroma, Denovo, ENZO, GTC, GTS, MILC
- CAE: ANSYS Fluent, ANSYS Mechanical, LS-DYNA, MSC Nastran, OpenFOAM, SIMULIA Abaqus
(Apps shown: accelerated or in development.)

Overview of Life & Material Science Accelerated Apps
MD: all key codes are available: AMBER, CHARMM, DESMOND, DL_POLY, GROMACS, LAMMPS, NAMD. GPU-only codes: ACEMD, HOOMD-Blue. Great multi-GPU performance; the focus is scaling to large numbers of GPUs/nodes.
QC: all key codes are being ported/optimized. Active GPU-acceleration projects: Abinit, BigDFT, CP2K, GAMESS, Gaussian, GPAW, NWChem, Quantum Espresso, VASP & more. GPU-only code: TeraChem. Analytical instruments: actively recruiting. Bioinformatics: market development.
GPU Test Drive: Experience GPU Acceleration
For computational chemistry researchers and biophysicists
Preconfigured with Molecular Dynamics Apps
Remotely Hosted GPU Servers
Free & Easy – Sign up, Log in and See Results
www.nvidia.com/gputestdrive
Molecular Dynamics (MD) Applications

AMBER
- Features: PMEMD explicit solvent & GB implicit solvent
- GPU Perf: > 100 ns/day JAC NVE on 2x K20s
- Release: AMBER 12, GPU Support Revision 12.2; released; multi-GPU, multi-node
- Notes: http://ambermd.org/gpus/benchmarks.htm#Benchmarks

CHARMM
- Features: implicit (5x) and explicit (2x) solvent via OpenMM
- GPU Perf: 2x C2070s equal 32-35x X5667 CPUs
- Release: C37b1; released; single & multi-GPU in a single node
- Notes: http://www.charmm.org/news/c37b1.html#postjump

DL_POLY
- Features: two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
- GPU Perf: 4x
- Release: V 4.04; source only, results published; multi-GPU, multi-node
- Notes: http://www.stfc.ac.uk/CSE/randd/ccg/software/DL_POLY/25526.aspx

GROMACS
- Features: implicit (5x), explicit (2x) solvent
- GPU Perf: 165 ns/day DHFR on 4x C2075s
- Release: 4.6.3; released; first multi-GPU support; multi-GPU, multi-node
- Notes: www.gromacs.org

LAMMPS
- Features: Lennard-Jones, Gay-Berne, Tersoff & many more potentials
- GPU Perf: 3.5-18x on ORNL Titan
- Release: released; multi-GPU, multi-node
- Notes: http://lammps.sandia.gov/bench.html#desktop and http://lammps.sandia.gov/bench.html#titan

NAMD
- Features: full electrostatics with PME and most simulation features
- GPU Perf: 4.0 ns/day F1-ATPase on 1x K20X
- Release: NAMD 2.9; released; 100M-atom capable; multi-GPU, multi-node
- Notes: http://www.ks.uiuc.edu/Research/namd/

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

New/Additional MD Applications Ramping
ACEMD
- Features: production biomolecular dynamics (MD) software specially optimized to run on GPUs
- GPU Perf: 150 ns/day DHFR on 1x K20
- Release: released; single and multi-GPU
- Notes: written for use only on GPUs; http://www.acellera.com/

DESMOND
- Features: bonded, pair, excluded interactions; van der Waals, electrostatic, nonbonded far interactions
- GPU Perf: TBD
- Release: released, Version 3.4.0/0.7.2
- Notes: http://www.schrodinger.com/productpage/14/3/

Folding@Home
- Features: powerful distributed-computing molecular dynamics system; implicit solvent and folding
- GPU Perf: depends upon number of GPUs
- Release: released; GPUs and CPUs
- Notes: GPUs get 4x the points of CPUs; http://folding.stanford.edu

GPUGrid.net
- Features: high-performance all-atom biomolecular simulations; explicit solvent and binding
- GPU Perf: depends upon number of GPUs
- Release: released
- Notes: http://www.gpugrid.net/

HALMD
- Features: simple fluids and binary mixtures (pair potentials, high-precision NVE and NVT, dynamic correlations)
- GPU Perf: up to 66x on 2090 vs. 1 CPU core
- Release: released, Version 0.2.0; single GPU
- Notes: http://halmd.org/benchmarks.html#supercooled-binary-mixture-kob-andersen

HOOMD-Blue
- Features: written for use only on GPUs
- GPU Perf: Kepler 2x faster than Fermi
- Release: released, Version 0.11.3; single and multi-GPU on 1 node; multi-GPU w/ MPI in late 2013
- Notes: http://codeblue.umich.edu/hoomd-blue/

mdcore
- Features: TBD
- GPU Perf: TBD
- Release: released, Version 0.1.7
- Notes: http://mdcore.sourceforge.net/download.html

OpenMM
- Features: library and application for molecular dynamics on high-performance GPUs; implicit and explicit solvent, custom forces
- GPU Perf: implicit: 127-213 ns/day; explicit: 18-55 ns/day DHFR
- Release: released, Version 5.2; multi-GPU
- Notes: https://simtk.org/home/openmm_suite

Quantum Chemistry Applications
ABINIT
- Features: local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization
- GPU Perf: 1.3-2.7x
- Release: released, Version 7.4.1; multi-GPU support
- Notes: www.abinit.org

ACES III
- Features: integrating GPU scheduling into the SIAL programming language and SIP runtime environment
- GPU Perf: 10x on kernels
- Release: under development; multi-GPU support
- Notes: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf

ADF
- Features: Fock matrix, Hessians
- GPU Perf: TBD
- Release: available Q1 2014; multi-GPU support
- Notes: www.scm.com

BigDFT
- Features: DFT; Daubechies wavelets; part of Abinit
- GPU Perf: 2-5x
- Release: released, Version 1.7.0; multi-GPU support
- Notes: http://bigdft.org

Casino
- Features: quantum Monte Carlo (QMC) electronic-structure calculations for finite and periodic systems
- GPU Perf: TBD
- Release: under development; multi-GPU support
- Notes: http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html

CASTEP
- Features: TBD
- GPU Perf: TBD
- Release: under development
- Notes: http://www.castep.org/Main/HomePage

CP2K
- Features: DBCSR (sparse matrix multiply library)
- GPU Perf: 2-7x
- Release: released; multi-GPU support
- Notes: http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf

GAMESS-US
- Features: Libqc with Rys quadrature algorithm; Hartree-Fock, MP2 and CCSD
- GPU Perf: 1.3-1.6x; 2.3-2.9x HF
- Release: released; multi-GPU support
- Notes: http://www.msg.ameslab.gov/gamess/index.html
GAMESS-UK
- Features: (ss|ss)-type integrals within Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics
- GPU Perf: 8x
- Release: released, Version 7.0; multi-GPU support
- Notes: http://www.ncbi.nlm.nih.gov/pubmed/21541963

Gaussian
- Features: joint PGI, NVIDIA & Gaussian collaboration
- GPU Perf: TBD
- Release: under development; multi-GPU support
- Notes: http://www.gaussian.com/g_press/nvidia_press.htm

GPAW
- Features: electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS)
- GPU Perf: 8x
- Release: released; multi-GPU support
- Notes: https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html, Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)

Jaguar
- Features: investigating GPU acceleration
- GPU Perf: TBD
- Release: under development; multi-GPU support
- Notes: Schrodinger, Inc.; http://www.schrodinger.com/kb/278 and http://www.schrodinger.com/productpage/14/7/32/

LATTE
- Features: CUBLAS, SP2 algorithm
- GPU Perf: TBD
- Release: released; multi-GPU support
- Notes: http://on-demand.gputechconf.com/gtc/2013/presentations/S3195-Fast-Quantum-Molecular-Dynamics-in-LATTE.pdf

MOLCAS
- Features: CUBLAS support
- GPU Perf: 1.1x
- Release: released, Version 7.8; single GPU; additional GPU support coming in Version 8
- Notes: www.molcas.org

MOLPRO
- Features: density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT
- GPU Perf: 1.7-2.3x projected
- Release: under development; multiple GPUs
- Notes: www.molpro.net, Hans-Joachim Werner
MOPAC2012
- Features: pseudodiagonalization, full diagonalization, and density-matrix assembling
- GPU Perf: 3.8-14x
- Release: released (academic port); single GPU; MOPAC2013 available Q1 2014
- Notes: http://openmopac.net

NWChem
- Features: triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers
- GPU Perf: 7-8x (development GPGPU benchmarks)
- Release: released, Version 6.3; multiple GPUs
- Notes: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf

Octopus
- Features: full GPU support for ground-state, real-time calculations; Kohn-Sham Hamiltonian, orthogonalization, subspace diagonalization, Poisson solver, time propagation
- GPU Perf: 1.5-8x
- Release: released, Version 4.1.0
- Notes: http://www.tddft.org/programs/octopus/

ONETEP
- Features: TBD
- GPU Perf: TBD
- Release: under development
- Notes: http://www2.tcm.phy.cam.ac.uk/onetep/

PEtot
- Features: density functional theory (DFT) plane-wave pseudopotential calculations
- GPU Perf: 6-10x
- Release: released; multi-GPU
- Notes: first-principles materials code that computes the behavior of the electron structures of materials

Q-CHEM
- Features: RI-MP2
- GPU Perf: 8x-14x
- Release: released, Version 4.0.1; multi-GPU support
- Notes: http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf
QMCPACK
- Features: main features
- GPU Perf: 3-4x
- Release: released; multiple GPUs
- Notes: NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK

Quantum Espresso/PWscf
- Features: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
- GPU Perf: 2.5-3.5x
- Release: released, Version 5.0; multiple GPUs
- Notes: created by the Irish Centre for High-End Computing; http://www.quantum-espresso.org/

TeraChem
- Features: "full GPU-based solution"; completely redesigned to exploit GPU parallelism
- GPU Perf: 44-650x vs. GAMESS CPU version
- Release: released, Version 1.5k; multi-GPU/single node
- Notes: YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf

VASP
- Features: hybrid Hartree-Fock DFT functionals including exact exchange
- GPU Perf: 2x (2 GPUs comparable to 128 CPU cores)
- Release: available on request; multiple GPUs
- Notes: by Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf

WL-LSMS
- Features: generalized Wang-Landau method
- GPU Perf: 3x with 32 GPUs vs. 32 (16-core) CPUs
- Release: under development; multi-GPU support
- Notes: NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf
Viz, "Docking" and Related Applications Growing

Amira 5®
- Features: 3D visualization of volumetric data and surfaces
- GPU Perf: 70x
- Release: released, Version 5.5; single GPU
- Notes: visualization from Visage Imaging; http://www.visageimaging.com/overview.html

BINDSURF
- Features: high-throughput parallel blind virtual screening; allows fast processing of large ligand databases
- GPU Perf: 100x
- Release: available upon request to authors; single GPU
- Notes: http://www.biomedcentral.com/1471-2105/13/S14/S13

BUDE
- Features: empirical free-energy forcefield
- GPU Perf: 6.5-13.4x
- Release: released; single GPU
- Notes: University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm

Core Hopping
- Features: GPU-accelerated application
- GPU Perf: 3.75-5000x
- Release: released; single and multi-GPU
- Notes: Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/

FastROCS
- Features: real-time shape-similarity searching/comparison
- GPU Perf: 800-3000x
- Release: released; single and multi-GPU
- Notes: OpenEye Scientific Software; http://www.eyesopen.com/fastrocs
Interactive Molecule Visualizer
- Features: TBD
- Release: released; multi-GPU support
- Notes: written for use only on GPUs; http://www.molecular-visualization.com

Molegro Virtual Docker 6
- Features: energy-grid computation, pose evaluation and guided differential evolution
- GPU Perf: 25-30x
- Release: single GPU
- Notes: http://www.molegro.com/gpu.php

PIPER Protein Docking
- Features: molecule docking
- GPU Perf: 3-8x
- Release: released
- Notes: http://www.bu.edu/caadlab/gpgpu09.pdf; http://www.schrodinger.com/productpage/14/40/

PyMol
- Features: lines (460% increase), cartoons (1246%), surface (1746%), spheres (753%), ribbon (426%)
- GPU Perf: 1700x
- Release: released, Version 1.6; single GPU
- Notes: http://pymol.org/

VMD
- Features: high-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple-GPU support for display of molecular orbitals
- GPU Perf: 100-125x or greater on kernels
- Release: released, Version 1.9.1; multi-GPU support
- Notes: visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/

"Organic Growth" of Bioinformatics Applications

BarraCUDA
- Features: alignment of short sequencing reads
- GPU Speedup: 6-10x
- Release: released, Version 0.6.2; multi-GPU, multi-node
- Website: http://seqbarracuda.sourceforge.net/

CUDASW++
- Features: parallel Smith-Waterman database search
- GPU Speedup: 10-50x
- Release: released, Version 2.0.10; multi-GPU, multi-node
- Website: http://sourceforge.net/projects/cudasw/

CUSHAW
- Features: parallel, accurate long-read aligner for large genomes
- GPU Speedup: 10x
- Release: released, Version 1.0.40; multiple GPUs
- Website: http://cushaw.sourceforge.net/

GPU-BLAST
- Features: protein alignment according to BLASTP
- GPU Speedup: 3-4x
- Release: released, Version 2.2.26; single GPU
- Website: http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html
mCUDA-MEME
- Features: scalable motif-discovery algorithm based on MEME
- GPU Speedup: 4-10x
- Release: released, Version 3.0.13; multi-GPU, multi-node
- Website: https://sites.google.com/site/yongchaosoftware/mcuda-meme

MUMmerGPU
- Features: aligns multiple query sequences against a reference sequence in parallel
- GPU Speedup: 3-10x
- Release: released, Version 3.0
- Website: http://sourceforge.net/apps/mediawiki/mummergpu/index.php?title=MUMmerGPU

SeqNFind
- Features: hardware and software for reference assembly, BLAST, Smith-Waterman, HMM, de novo assembly
- GPU Speedup: 400x
- Release: released; multi-GPU, multi-node
- Website: http://www.seqnfind.com/

SOAP3
- Features: short-read alignment tool that is not heuristic-based; reports all answers
- GPU Speedup: 10x
- Release: beta release, Version 0.01
- Website: http://soap.genomics.org.cn/soap3.html

UGENE
- Features: fast short-read alignment
- GPU Speedup: 6-8x
- Release: released, Version 1.12; multi-GPU, multi-node
- Website: http://ugene.unipro.ru/

WideLM
- Features: parallel linear regression on multiple similarly-shaped models
- GPU Speedup: 150x
- Release: released, Version 0.1-1; multi-GPU, multi-node
- Website: http://insilicos.com/products/widelm

GPU speedup compared against the same or similar code running on a single-CPU machine; performance measured internally or independently.
MD Average Speedups
The blue node contains dual E5-2687W CPUs (8 cores per CPU). The green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.
[Chart: performance relative to CPU-only for CPU, CPU + K10, CPU + K20, CPU + K20X, CPU + 2x K10, CPU + 2x K20, and CPU + 2x K20X]
Average speedup calculated from 4 AMBER, 3 NAMD, 3 LAMMPS, and 1 GROMACS test cases. Error bars show the maximum and minimum speedup for each hardware configuration.

Built from the Ground Up for GPUs: Computational Chemistry
What: study disease & discover drugs; predict drug and protein interactions.
Why: speed of simulation is critical. GPUs enable study of longer timeframes, larger systems, and more simulations.
How: GPUs increase throughput & accelerate simulations. For example, AMBER 11 shows a 4.6x performance increase with 2 GPUs at only 54% added cost.*

GPU-ready applications: ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD, NWChem, Q-CHEM, Quantum Espresso, TeraChem

* AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node). Cost of a CPU node is assumed to be $9,333; cost of adding two C2090s to a single node is assumed to be $5,333.

AMBER 12
Ross Walker (AMBER developer) video

AMBER 12 Performance
SAN DIEGO SUPERCOMPUTER CENTER

Explicit Solvent Performance (JAC-DHFR NVE Production Benchmark)
Throughput in ns/day:
- Kepler K20 GPUs: 2x K20X 102.13; 1x K20X 89.13; 2x K20 95.59; 1x K20 81.09
- Kepler K10 GPUs: 4x K10 (2 cards) 97.95; 3x K10 (1.5 cards) 82.70; 2x K10 (1 card) 67.02; 1x K10 (0.5 card) 52.94
- M2090 GPUs: 2x M2090 (1 node) 58.28; 1x M2090 43.74
- GTX GPUs (single < $1000 GPU): 1x GTX Titan 111.34; 4x GTX680 (1 node) 118.88; 3x GTX680 (1 node) 101.31; 2x GTX680 (1 node) 86.84; 1x GTX680 72.49; 1x GTX580 54.46
- Gordon CPUs: 96x E5-2670 (6 nodes) 41.55; 64x E5-2670 (4 nodes) 58.19; 48x E5-2670 (3 nodes) 50.25; 32x E5-2670 (2 nodes) 38.49; 16x E5-2670 (1 node) 21.13
Protein Folding Simulation with AMBER Accelerated by GPUs: 2.95x faster
AMBER performance example (TRP Cage):
- 2x 8-core E5-2670 CPUs: 168 ns/day
- 2x 8-core E5-2670 CPUs + Tesla K20X GPU: 585 ns/day
Data courtesy of AMBER.org.
Historical Single Node / Single GPU AMBER Performance (DHFR Production NVE)
[Chart: PMEMD performance in ns/day, 2007-2014, with linear trend lines for GPU and CPU; GPU throughput climbs past 100 ns/day while CPU throughput remains far lower]

Interactive MD?
Single nodes are now fast enough that GPU-enabled cloud nodes actually make sense as a back end.
Cost Comparison
4 simultaneous simulations, 23,000 atoms, 250 ns each, 5 days maximum time to solution.

                                 Traditional Cluster                GPU Workstation
Nodes required                   12                                 1 (4 GPUs)
Interconnect                     QDR IB                             None
Time to complete simulations     4.98 days                          2.25 days
Power consumption                5.7 kW (681.3 kWh)                 1.0 kW (54.0 kWh)
System cost (per day)            $96,800 ($88.40)                   $5,200 ($4.75)
Simulation cost                  (681.3 * 0.18) + (88.40 * 4.98)    (54.0 * 0.18) + (4.75 * 2.25)
                                 = $562.87                          = $20.41

>25x cheaper, AND the solution is obtained in less than half the time.
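The simulation-cost figures above combine metered energy with amortized system cost. A short script (assuming the $0.18/kWh electricity rate implied by the slide's arithmetic) reproduces them:

```python
# Simulation cost = energy (kWh) * electricity rate + system cost/day * days.
# The $0.18/kWh rate is inferred from the slide's own arithmetic.

def simulation_cost(energy_kwh, cost_per_day, days, rate=0.18):
    """Energy cost plus amortized system cost over the run."""
    return energy_kwh * rate + cost_per_day * days

cluster = simulation_cost(energy_kwh=681.3, cost_per_day=88.40, days=4.98)
workstation = simulation_cost(energy_kwh=54.0, cost_per_day=4.75, days=2.25)

print(f"Cluster:     ${cluster:.2f}")      # $562.87
print(f"Workstation: ${workstation:.2f}")  # $20.41
print(f"Cost ratio:  {cluster / workstation:.1f}x")  # 27.6x
```

The >25x ratio follows directly: the workstation is both cheaper per day and finishes sooner, so both cost terms shrink at once.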
Kepler - Our Fastest Family of GPUs Yet
Factor IX, running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and either 1x NVIDIA M2090, 1x K10, 1x K20, or 1x K20X.

Throughput (ns/day): 1 CPU node 3.42; + M2090 11.85 (3.5x); + K10 18.90 (5.6x); + K20 22.44 (6.6x); + K20X 25.39 (7.4x)
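The quoted speedups follow from dividing each configuration's throughput by the 3.42 ns/day CPU-only baseline; a quick check (note the K10 figure rounds to 5.5x rather than the slide's 5.6x, likely minor rounding in the original):

```python
# Factor IX throughput (ns/day) from the slide; the CPU-only node is the baseline.
baseline = 3.42  # dual E5-2687W node, CPU only
configs = {"M2090": 11.85, "K10": 18.90, "K20": 22.44, "K20X": 25.39}

for gpu, ns_day in configs.items():
    print(f"CPU + {gpu}: {ns_day / baseline:.1f}x")
```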
GPU speedup/throughput increased from 3.5x (with M2090) to 7.4x (with K20X) when compared to a CPU-only node.

K10 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10 GPU.

[Chart: speedup vs. CPU-only for TRPcage (GB), JAC (NVE PME), Factor IX (NVE PME), Cellulose (NVE PME), Myoglobin (GB), and Nucleosome (GB); speedups of 2.00, 5.04, 5.50, 5.53, 19.98, and 24.00]

Gain 24x performance (Nucleosome) by adding just 1 GPU when compared to dual-CPU performance.

K20 Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU.

[Chart: speedup vs. CPU-only for TRPcage (GB), JAC (NVE PME), Factor IX (NVE PME), Cellulose (NVE PME), Myoglobin (GB), and Nucleosome (GB); CPU baseline 1.00, speedups of 2.66, 6.50, 6.56, 7.28, 25.56, and 28.00]

Gain 28x throughput/performance (Nucleosome) by adding just one K20 GPU when compared to dual-CPU performance.
K20X Accelerates Simulations of All Sizes
Running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K20X GPU.

[Chart: speedup vs. CPU-only for TRPcage (GB), JAC (NVE PME), Factor IX (NVE PME), Cellulose (NVE PME), Myoglobin (GB), and Nucleosome (GB); speedups of 2.79, 7.15, 7.43, 8.30, 28.59, and 31.30]

Gain 31x performance (Nucleosome) by adding just one K20X GPU when compared to dual-CPU performance.
K10 Strong Scaling over Nodes
Cellulose, 408K atoms (NPT), running AMBER 12 with CUDA 4.2, ECC off. The blue nodes contain 2x Intel X5670 CPUs (6 cores per CPU); the green nodes contain 2x Intel X5670 CPUs (6 cores per CPU) plus 2x NVIDIA K10 GPUs.

[Chart: ns/day vs. number of nodes (1, 2, 4); GPU speedup over CPU-only is 5.1x at 1 node, 3.6x at 2 nodes, and 2.4x at 4 nodes]

GPUs significantly outperform CPUs while scaling over multiple nodes.

K20 Scaling over 8 Nodes
Cellulose (NVE), running AMBER 12 GPU Support Revision 18. The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K20 GPUs.

[Chart: ns/day vs. number of nodes (1, 2, 4, 8); GPU speedup over CPU-only is 5.3x at 1 node, 3.1x at 2, 1.9x at 4, and 1.2x at 8]

GPUs significantly outperform CPUs while scaling over multiple nodes.

K20 Scaling over 8 Nodes
Factor IX, running AMBER 12 GPU Support Revision 18. The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K20 GPUs.

[Chart: ns/day vs. number of nodes (1, 2, 4, 8); GPU speedup over CPU-only is 4.7x at 1 node, 2.8x at 2, 1.8x at 4, and 1.3x at 8]

GPUs significantly outperform CPUs while scaling over multiple nodes.

K20 Scaling over 8 Nodes
JAC/DHFR (NVE), running AMBER 12 GPU Support Revision 18. The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K20 GPUs.

[Chart: ns/day vs. number of nodes (1, 2, 4, 8); GPU speedup over CPU-only is 4x at 1 node, 2.4x at 2, 1.8x at 4, and 1.7x at 8]

GPUs significantly outperform CPUs while scaling over multiple nodes.

K20X Scaling over 8 Nodes
Cellulose (NVE), running AMBER 12 GPU Support Revision 18. The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K20X GPUs.

[Chart: ns/day vs. number of nodes (1, 2, 4, 8); GPU speedup over CPU-only is 5.8x at 1 node, 3.3x at 2, 2x at 4, and 1.3x at 8]

GPUs significantly outperform CPUs while scaling over multiple nodes.

K20X Scaling over 8 Nodes
Factor IX, running AMBER 12 GPU Support Revision 18. The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K20X GPUs.

[Chart: ns/day vs. number of nodes (1, 2, 4, 8); GPU speedup over CPU-only is 5.1x at 1 node, 2.9x at 2, 1.9x at 4, and 1.4x at 8]

GPUs significantly outperform CPUs while scaling over multiple nodes.

K20X Scaling over 8 Nodes
JAC/DHFR (NVE), running AMBER 12 GPU Support Revision 18. The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K20X GPUs.

[Chart: ns/day vs. number of nodes (1, 2, 4, 8); GPU speedup over CPU-only is 4.3x at 1 node, 2.5x at 2, 1.9x at 4, and 1.7x at 8]

GPUs significantly outperform CPUs while scaling over multiple nodes.

Kepler - Universally Faster
Running AMBER 12 GPU Support Revision 12.1. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU.

[Chart: speedup vs. CPU-only for JAC, Factor IX, and Cellulose across CPU + K10, CPU + K20, and CPU + K20X configurations]

The Kepler GPUs accelerated all simulations, up to 8x.

K10 Extreme Performance
JAC/DHFR, 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green node contains dual E5-2687W CPUs (8 cores per CPU) and 2x NVIDIA K10 GPUs.

Throughput: 12.47 ns/day (CPU-only node) vs. 97.99 ns/day (node with 2x K10).

Gain 7.8x performance by adding just 2 GPUs when compared to dual-CPU performance.

K20 Extreme Performance
DHFR JAC, 23K atoms (NVE), running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The blue node contains 2x Intel E5-2687W CPUs (8 cores per CPU); each green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 2x NVIDIA K20 GPUs.

Throughput: 12.47 ns/day (CPU-only node) vs. 95.59 ns/day (node with 2x K20).

Gain > 7.5x throughput/performance by adding just 2 K20 GPUs when compared to dual-CPU performance.
Replace 8 Nodes with 1 K20 GPU
DHFR, running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K20 GPU.

- 8 CPU nodes: 65.00 ns/day, $32,000
- 1 CPU node + K20: 81.09 ns/day, $6,500

Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred hardware vendor for actual pricing.

Cut simulation costs to 1/4 while gaining higher performance.
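Dollars per ns/day makes the comparison concrete. This sketch uses the slide's illustrative prices, which are not vendor quotes:

```python
# Illustrative node pricing and DHFR throughput from the slide (AMBER 12).
cpu_cluster = {"cost": 32_000, "ns_per_day": 65.00}  # 8 CPU-only nodes
gpu_node = {"cost": 6_500, "ns_per_day": 81.09}      # 1 node + 1x K20

for name, cfg in [("8 CPU nodes", cpu_cluster), ("1 node + K20", gpu_node)]:
    print(f"{name}: ${cfg['cost'] / cfg['ns_per_day']:.0f} per ns/day")
# ~$492 vs. ~$80 per ns/day: roughly 6x better price-performance for the GPU node.
```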
Replace 7 Nodes with 1 K10 GPU
Performance on JAC NVE (DHFR), running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off. The eight blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains 2x Intel E5-2687W CPUs (8 cores per CPU) plus 1x NVIDIA K10 GPU.

- CPU-only cluster: $32,000
- GPU-enabled node: $7,000

Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred hardware vendor for actual pricing.

Cut simulation costs to roughly 1/4 and increase performance by 70%.
Extra CPUs Decrease Performance
Cellulose NVE, running AMBER 12 GPU Support Revision 12.1. The orange bars use one E5-2687W CPU (8 cores per CPU); the blue bars use dual E5-2687W CPUs (8 cores per CPU).

[Chart: ns/day for CPU-only and CPU with dual K20s, each run with 1 vs. 2 E5-2687W CPUs]

When used with GPUs, dual CPU sockets perform worse than single CPU sockets.

Kepler - Greener Science
Energy used in simulating 1 ns of DHFR (JAC), running AMBER 12 GPU Support Revision 12.1. The blue node contains dual E5-2687W CPUs (150 W each, 8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1x NVIDIA K10, K20, or K20X GPU (235 W each). Lower is better.

Energy expended = power x time.
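That power x time relation explains the savings: the GPU node draws more power but finishes far sooner. A rough sketch; the 90 ns/day GPU throughput below is an assumption loosely based on JAC numbers quoted earlier in this deck, not this slide's measured data:

```python
# Energy per simulated ns = node power (W) x wall time (s) to simulate 1 ns.
# Node power from the slide: 2x 150 W CPUs, plus 235 W per Kepler GPU.
# Throughputs are assumptions for illustration, not this slide's data.

SECONDS_PER_DAY = 86_400

def energy_kj_per_ns(node_watts, ns_per_day):
    seconds_per_ns = SECONDS_PER_DAY / ns_per_day
    return node_watts * seconds_per_ns / 1000  # J -> kJ

cpu_only = energy_kj_per_ns(2 * 150, 12.47)       # dual-CPU node, JAC
gpu_node = energy_kj_per_ns(2 * 150 + 235, 90.0)  # assumed ~90 ns/day with a K20X

print(f"CPU only: {cpu_only:.0f} kJ/ns")
print(f"CPU+GPU:  {gpu_node:.0f} kJ/ns")
print(f"Saving:   {1 - gpu_node / cpu_only:.0%}")
```

Under these assumptions the GPU node uses about 75% less energy per simulated ns, consistent with the 65-75% range claimed on the slide.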
The GPU-accelerated systems use 65-75% less energy.

Recommended GPU Node Configuration for AMBER Computational Chemistry
Workstation or single-node configuration:
- CPU sockets: 2
- Cores per CPU socket: 4+ (1 CPU core drives 1 GPU)
- CPU speed: 2.66+ GHz
- System memory per node: 16 GB
- GPUs: Kepler K10, K20, K20X
- GPUs per CPU socket: 1-2 (4 GPUs on 1 socket works well for 4 fast serial GPU runs)
- GPU memory preference: 6 GB
- GPU-to-CPU connection: PCIe 2.0 x16 or higher
- Server storage: 2 TB
- Network configuration: InfiniBand QDR or better
Scale to multiple nodes with the same single-node configuration.

Benefits of GPU-Accelerated Computing with AMBER
- Faster than CPU-only systems in all tests
- Most major compute-intensive aspects of classical MD ported
- Large performance boost with marginal price increase
- Energy usage cut by more than half
- GPUs scale well within a node and over multiple nodes
- K20 GPU is our fastest and lowest-power high-performance GPU yet
- Try GPU-accelerated AMBER for free: www.nvidia.com/GPUTestDrive

NAMD 2.9: K20 Accelerates Simulations of All Sizes
3 Running NAMD 2.9 with CUDA 4.0 ECC Off 2.7 2.6 The blue node contains 2x Intel E5-2687W CPUs 2.5 2.4 (8 Cores per CPU)
Each green node contains 2x Intel E5-2687W 2 CPUs (8 Cores per CPU) plus 1x NVIDIA K20 GPUs
1.5
1 Speedup Speedup Compared to CPU Only
0.5
0 CPU All Molecules ApoA1 F1-ATPase STMV
Gain 2.5x throughput/performance by adding just 1 GPU Apolipoprotein A1 when compared to dual CPU performance
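The throughput comparisons in these charts reduce to simple ratios of ns/day figures; a minimal sketch of that arithmetic, using numbers taken from the NAMD charts in this deck (the function name is just illustrative):

```python
# Speedup of a GPU-accelerated node over a CPU-only node, using the
# throughput metric these slides report: simulated nanoseconds per day.
def speedup(accelerated_ns_per_day: float, baseline_ns_per_day: float) -> float:
    return accelerated_ns_per_day / baseline_ns_per_day

# ApoA1 figures from the charts below: dual-CPU node vs. the same node + 1x K20X.
cpu_only = 1.37   # ns/day
with_k20x = 4.00  # ns/day
print(f"{speedup(with_k20x, cpu_only):.1f}x")  # -> 2.9x
```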
Kepler - Our Fastest Family of GPUs Yet
ApoA1 (Apolipoprotein A1), running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: CPU-only node 1.37; + M2090 2.63 (1.9x); + K10 3.45 (2.5x); + K20 3.57 (2.6x); + K20X 4.00 (2.9x).
GPU speedup/throughput increased from 1.9x (with M2090) to 2.9x (with K20X) when compared to a CPU-only node.

Kepler - Our Fastest Family of GPUs Yet
F1-ATPase, running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: CPU-only node 0.46; + M2090 0.83 (1.8x); + K10 0.96 (2.1x); + K20 1.12 (2.4x); + K20X 1.25 (2.7x).
GPU speedup/throughput increased from 1.8x (with M2090) to 2.7x (with K20X) when compared to a CPU-only node.

Kepler - Our Fastest Family of GPUs Yet
STMV (Satellite Tobacco Mosaic Virus), running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: CPU-only node 0.12; + M2090 0.22 (1.9x); + K10 0.29 (2.5x); + K20 0.31 (2.7x); + K20X 0.35 (3.0x).
GPU speedup/throughput increased from 1.9x (with M2090) to 3.0x (with K20X) when compared to a CPU-only node.

Kepler - Our Fastest Family of GPUs Yet
ApoA1 (Apolipoprotein A1), running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 2x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: 2-CPU node only 1.37; + 2x M2090 5.00 (3.6x); + 2x K10 6.25 (4.6x); + 2x K20 6.67 (4.9x); + 2x K20X 7.14 (5.2x).

Kepler - Our Fastest Family of GPUs Yet
F1-ATPase, running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 2x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: 2-CPU node only 0.46; + 2x M2090 1.56 (3.4x); + 2x K10 1.79 (3.9x); + 2x K20 2.04 (4.4x); + 2x K20X 2.22 (4.8x).

Kepler - Our Fastest Family of GPUs Yet
STMV (Satellite Tobacco Mosaic Virus), running NAMD version 2.9. The blue node contains dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 2x NVIDIA M2090, K10, K20, or K20X.
Nanoseconds/day: 2-CPU node only 0.12; + 2x M2090 0.41 (3.6x); + 2x K10 0.52 (4.5x); + 2x K20 0.56 (4.9x); + 2x K20X 0.61 (5.3x).

Kepler – Universally Faster
Running NAMD version 2.9. The CPU-only node contains dual E5-2687W CPUs (8 cores per CPU); the Kepler nodes contain dual E5-2687W CPUs (8 cores per CPU) and 1 or 2 NVIDIA K10, K20, or K20X GPUs.
Average speedup compared to CPU only (across F1-ATPase, ApoA1, and STMV): 1x K10 2.4x; 1x K20 2.6x; 1x K20X 2.9x; 2x K10 4.3x; 2x K20 4.7x; 2x K20X 5.1x.
The Kepler GPUs accelerate all simulations, up to 5.1x (average acceleration printed in bars).

ApoA1 Relative Performance
Running NAMD version 2.9. The blue nodes contain single or dual E5-2687W CPUs (8 cores per CPU); the green nodes contain single or dual E5-2687W CPUs plus 1x or 2x NVIDIA M2090 or K20X GPUs.
Performance relative to the dual-CPU node (ApoA1, Apolipoprotein A1): 0.4 for a single CPU and 1.0 for dual CPUs, with the GPU configurations at 1.9, 2.7, 2.9, 3.6, 4.1, and up to 5.2 (dual socket + 2x K20X).
F1-ATPase Relative Performance
Running NAMD version 2.9. The blue nodes contain single or dual E5-2687W CPUs (8 cores per CPU); the green nodes contain single or dual E5-2687W CPUs plus 1x or 2x NVIDIA M2090 or K20X GPUs.
Performance relative to the dual-CPU node: 0.4 for a single CPU and 1.0 for dual CPUs, with the GPU configurations at 1.8, 2.7, 2.9, 3.4, 4.0, and up to 5.1 (dual socket + 2x K20X).
STMV Relative Performance
Running NAMD version 2.9, on STMV (Satellite Tobacco Mosaic Virus). The blue nodes contain single or dual E5-2687W CPUs (8 cores per CPU); the green nodes contain single or dual E5-2687W CPUs plus 1x or 2x NVIDIA M2090 or K20X GPUs.
Performance relative to the dual-CPU node: 0.4 for a single CPU and 1.0 for dual CPUs, with the GPU configurations at 1.8, 2.7, 2.9, 3.4, 4.0, and up to 5.1 (dual socket + 2x K20X).
NAMD CPU and GPU Scaling
Courtesy of Jim Phillips @ Beckman Institute, UIUC

Outstanding Strong Scaling with Multi-STMV
Running NAMD version 2.9 with 100 STMV (a concatenation of 100 Satellite Tobacco Mosaic Viruses) on hundreds of nodes. Each blue XE6 CPU node contains 1x AMD Opteron 1600 (16 cores per CPU); each green XK6 CPU+GPU node contains 1x AMD Opteron 1600 (16 cores per CPU) plus 1x NVIDIA X2090 GPU.
From 32 to 768 nodes, the GPU-accelerated nodes sustain 2.7x-3.8x the nanoseconds/day of the CPU-only nodes.
Accelerate your science by 2.7-3.8x when compared to CPU-based supercomputers.

Replace 3 Nodes with 1 M2090 GPU
Running NAMD version 2.9, F1-ATPase. Each blue node contains 2x Intel Xeon X5550 CPUs (4 cores per CPU); the green node contains the same CPUs plus 1x NVIDIA M2090 GPU.
Four CPU nodes: 0.63 nanoseconds/day at $8,000. One CPU node + 1x M2090: 0.74 nanoseconds/day at $4,000.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Speedup of 1.2x for 50% of the cost.

K20 - Greener: Twice The Science Per Watt
Energy used in simulating 1 nanosecond of ApoA1 (lower is better). Running NAMD version 2.9.
Each blue node contains dual E5-2687W CPUs (95 W, 4 cores per CPU); each green node contains 2x Intel Xeon X5550 CPUs (95 W, 4 cores per CPU) plus 2x NVIDIA K20 GPUs (225 W per GPU).
Energy Expended = Power x Time
Cut down energy usage by ½ with GPUs.
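The "Energy Expended = Power × Time" relation behind these charts is easy to reproduce; a minimal sketch, with node power taken as the sum of component TDPs (the wattages and throughputs below are illustrative values, not measurements from the slide):

```python
# Energy (kJ) consumed to simulate 1 ns, given node power draw and throughput.
SECONDS_PER_DAY = 86400.0

def energy_per_ns_kj(node_watts: float, ns_per_day: float) -> float:
    seconds_per_ns = SECONDS_PER_DAY / ns_per_day
    return node_watts * seconds_per_ns / 1000.0  # joules -> kilojoules

# Illustrative: a 2x 95 W CPU node alone, vs. the same node plus 2x 225 W GPUs
# running 5x faster. The GPU node draws more power but finishes much sooner.
cpu_node = energy_per_ns_kj(2 * 95, 0.9)            # ~18240 kJ per ns
gpu_node = energy_per_ns_kj(2 * 95 + 2 * 225, 4.5)  # ~12288 kJ per ns
print(f"GPU node uses {gpu_node / cpu_node:.0%} of the CPU-only energy")  # -> 67%
```

The GPU node wins on energy whenever its speedup exceeds its relative power increase, which is the pattern across all of the energy slides in this deck.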
Kepler - Greener: Twice The Science/Joule
Energy used in simulating 1 ns of STMV (Satellite Tobacco Mosaic Virus); lower is better. Running NAMD version 2.9.
The blue node contains dual E5-2687W CPUs (150 W each, 8 cores per CPU); the green nodes contain dual E5-2687W CPUs (8 cores per CPU) plus 2x NVIDIA K10, K20, or K20X GPUs (235 W each).
Energy Expended = Power x Time
Cut down energy usage by ½ with GPUs.

Recommended GPU Node Configuration for NAMD Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K10, K20, K20X; Fermi M2090, M2075, C2075
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.

Summary/Conclusions: Benefits of GPU Accelerated Computing
Faster than CPU only systems in all tests
Large performance boost with small marginal price increase
Energy usage cut in half
GPUs scale very well within a node and over multiple nodes
Tesla K20 GPU is our fastest and lowest power high performance GPU to date
Try GPU-accelerated NAMD for free – www.nvidia.com/GPUTestDrive

LAMMPS
More Science for Your Money (LAMMPS) — Embedded Atom Model
The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU); the green nodes have 2x E5-2687W CPUs plus 1 or 2 NVIDIA K10, K20, or K20X GPUs (235 W).
Speedup compared to CPU only: + 1x K10 1.7x; + 1x K20 2.47x; + 1x K20X 2.92x; + 2x K10 3.3x; + 2x K20 4.5x; + 2x K20X 5.5x.
Experience performance increases of up to 5.5x with Kepler GPU nodes.

K20X, the Fastest GPU Yet (LAMMPS)
The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU); the green nodes have 2x E5-2687W CPUs plus 2x NVIDIA M2090s or 1-2 K20X GPUs (235 W).
Speedup relative to CPU alone is shown for CPU only, CPU + 2x M2090, CPU + 1x K20X, and CPU + 2x K20X.
Experience performance increases of up to 6.2x with Kepler GPU nodes. One K20X performs as well as two M2090s.

Get a CPU Rebate to Fund Part of Your GPU Budget
Acceleration in LAMMPS loop-time computation by additional GPUs.
The blue node contains dual X5670 CPUs (6 cores per CPU); the green nodes contain dual X5570 CPUs (4 cores per CPU) plus 1-4 NVIDIA M2090 GPUs.
Performance normalized to CPU only: + 1x M2090 5.31; + 2x M2090 9.88; + 3x M2090 12.9; + 4x M2090 18.2.
Increase performance 18x when compared to CPU-only nodes.
Cheaper CPUs used with GPUs still deliver faster overall performance than more expensive CPUs alone!

Excellent Strong Scaling on Large Clusters
LAMMPS Gay-Berne, 134M atoms. Loop time (seconds) vs. node count for the GPU-accelerated XK6 and the CPU-only XE6.
From 300 to 900 nodes, the NVIDIA GPU-powered XK6 maintained about 3.5x (3.45x-3.55x) the performance of XE6 CPU nodes.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090 GPU.

Sustain 5x Performance for Weak Scaling
LAMMPS weak scaling with 32K atoms per node. Loop time (seconds) from 1 to 729 nodes, with GPU-accelerated speedups of 4.8x, 5.8x, and 6.7x marked along the curve.
Performance of 4.8x-6.7x with GPU-accelerated nodes when compared to CPUs alone.
Each blue Cray XE6 node has 2x AMD Opteron CPUs (16 cores per CPU); each green Cray XK6 node has 1x AMD Opteron 1600 CPU (16 cores per CPU) and 1x NVIDIA X2090 GPU.

Faster, Greener — Worth It! (LAMMPS)
Energy consumed in one loop of EAM (lower is better).
GPU-accelerated computing uses 53% less energy than CPU only.
Energy Expended = Power x Time, with power calculated by combining the components' TDPs.
The blue node uses 2x E5-2687W CPUs (8 cores and 150 W per CPU) and CUDA 4.2.9; the green nodes have 2x E5-2687W CPUs plus 1 or 2 NVIDIA K20X GPUs (235 W) running CUDA 5.0.35.
Try GPU-accelerated LAMMPS for free – www.nvidia.com/GPUTestDrive

Accelerate LAMMPS Simulations with GPUs
On a single node, the XK7 without GPU gives 1.0x and the XK7 with GPU gives a best speedup of 6.6x.
"Summary of best speedups versus running on a single XK7 CPU for CPU-only and accelerated runs. Simulation is 400 timesteps for a 1 million molecule droplet. The speedups are calculated based on the single node loop time of 440.3 seconds."
*Brown, W.M., Yamada, M., "Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials," Computer Physics Communications (2013, submitted)

Accelerate LAMMPS Simulations with GPUs
On 64 nodes, the XK7 without GPU reaches 41.6x and the XK7 with GPU reaches 211x (same benchmark and baseline: 400 timesteps for a 1 million molecule droplet, speedups relative to the 440.3-second single-node loop time).
*Brown, W.M., Yamada, M., "Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials," Computer Physics Communications (2013, submitted)

Molecular Dynamics with LAMMPS on a Hybrid Cray Supercomputer
W. Michael Brown, National Center for Computational Sciences, Oak Ridge National Laboratory
NVIDIA Technology Theater, Supercomputing 2012, November 14, 2012

Early Kepler LAMMPS Benchmarks on Titan
[Charts: loop time (seconds, log scale) vs. node count (1-128) for XK6, XK6+GPU, and XK7+GPU nodes, across four benchmarks: Atomic Fluid, Bulk Copper, Protein, and Liquid Crystal.]
Early Titan XK6/XK7 LAMMPS Benchmarks
Speedup with acceleration on XK6/XK7 nodes (1 node = 32K particles; 900 nodes = 29M particles):

                            XK6 (1 Node)  XK7 (1 Node)  XK6 (900 Nodes)  XK7 (900 Nodes)
Atomic Fluid (cutoff 2.5σ)      1.92          2.90           1.68             2.75
Atomic Fluid (cutoff 5.0σ)      4.33          8.38           3.96             7.48
Bulk Copper                     2.12          3.66           2.15             2.86
Protein                         2.60          3.36           1.56             1.95
Liquid Crystal                  5.82         15.70           5.60            10.14

CPU and GPU Scaling in Water Simulations (LAMMPS)
"Simulation rates for a fixed-size 256K molecule water box in parallel (left) and a scaled-size water box with 32K molecules per node (right). Parallel simulation rates with ideal efficiency would have a slope equal to that of the green line in each plot."
*Brown, W.M., Yamada, M., "Implementing Molecular Dynamics on Hybrid High Performance Computers – Three-Body Potentials," Computer Physics Communications (2013, submitted)

Recommended GPU Node Configuration for LAMMPS Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K10, K20, K20X
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand
Scale to multiple nodes with the same single-node configuration.

GROMACS 4.6.1 and 4.6.3
Erik Lindahl (GROMACS developer) video

Kepler – Our Fastest Family of GPUs Yet
GROMACS 4.6.1, Water (192K atoms). The blue nodes contain single or dual E5-2687W CPUs (8 cores per CPU); the green nodes contain single or dual E5-2687W CPUs plus 1x NVIDIA M2090, K10, K20, or K20X.
ns/day speedups over the CPU-only node range from 1.7x (with M2090) up to 2.8x-3.2x (with the Kepler GPUs).
A single Sandy Bridge CPU per node with a single K10, K20, or K20X produces the best performance.

K20X, The Fastest Yet
192K Water Molecules — Running GROMACS version 4.6.1 and CUDA 5.0.35.
The blue node contains 2x E5-2687W CPUs (150 W each, 8 cores per CPU); the green nodes contain 2x E5-2687W CPUs plus 1 or 2 NVIDIA K20/K20X GPUs (235 W each).
Nanoseconds/day: CPU only 7.26; CPU + K20 13.65 (1.9x); CPU + 2x K20X 22.27 (3.1x).
Using K20X nodes increases performance by 3.1x.

K20 Increases Performance
Water (192K atoms), running GROMACS 4.6.3. The blue nodes contain 1x E5-2687W CPU (150 W TDP, 8 cores per CPU); the green nodes contain 1x E5-2687W CPU (8 cores per CPU) plus 1x NVIDIA K20 GPU (235 W TDP per GPU).
With one K20 per node, nanoseconds/day scales from 11.70 (1 node) through 21.46 and 39.12 to 66.29 (8 nodes), staying 2.7x-2.9x ahead of the CPU-only nodes.
Get up to 2.9x performance compared to CPU-only nodes.

2x K20 Increases Performance
Water (192K atoms), running GROMACS 4.6.3. The blue nodes contain 1x E5-2687W CPU (150 W TDP, 8 cores per CPU); the green nodes contain 1x E5-2687W CPU (8 cores per CPU) plus 2x NVIDIA K20 GPUs (235 W TDP per GPU).
With two K20s per node, nanoseconds/day scales from 15.02 (1 node) through 26.84 and 48.02 to 75.53 (8 nodes), with speedups of 3.1x-3.7x over the CPU-only nodes.
Get up to 3.7x performance compared to CPU-only nodes.

Great Scaling in Large Systems
Water (3072K atoms), running GROMACS 4.6.3. The blue nodes contain 1x E5-2687W CPU (150 W TDP, 8 cores per CPU); the green nodes contain 1x E5-2687W CPU (8 cores per CPU) plus 1x NVIDIA K20 GPU (235 W TDP per GPU).
With one K20 per node, nanoseconds/day scales from 0.70 (1 node) through 1.19 and 2.35 to 5.00 (8 nodes), with speedups of 2.8x-2.9x over the CPU-only nodes.
Get up to 2.9x performance compared to CPU-only nodes.

Great Scaling in Large Systems
Water (3072K atoms), running GROMACS 4.6.3. The blue nodes contain 1x E5-2687W CPU (150 W TDP, 8 cores per CPU); the green nodes contain 1x E5-2687W CPU (8 cores per CPU) plus 2x NVIDIA K20 GPUs (235 W TDP per GPU).
With two K20s per node, nanoseconds/day scales from 0.84 (1 node) through 1.64 and 3.34 to 6.46 (8 nodes), with speedups of 3.5x-4x over the CPU-only nodes.
Get up to 4x performance compared to CPU-only nodes.

Higher Performance with GPUs
Water (3072K atoms), running GROMACS 4.6.3; ns/day vs. node count (1-8, both axes log scale). The blue nodes contain single E5-2687W CPUs (8 cores per CPU); the green nodes add 1x or 2x NVIDIA M2090 GPUs.

Higher Performance with GPUs
Water (3072K atoms), running GROMACS 4.6.3; ns/day vs. node count (1-8, both axes log scale). The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x or 2x NVIDIA M2090 GPUs.

Relative Performance
Water (3072K atoms), running GROMACS 4.6.3. The blue nodes contain single or dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x or 2x NVIDIA M2090 or K20X GPUs.
Performance relative to the dual-CPU node: 0.6 for a single CPU, 1.0 for dual CPUs, and 1.2 up to 2.2 for the GPU configurations.

GROMACS Scaling; 1 CPU & 2 GPUs per Node
Water (192K atoms), running GROMACS 4.6.3; ns/day vs. node count (1-8, both axes log scale). The blue nodes contain single E5-2687W CPUs (8 cores per CPU); the green nodes add 2x NVIDIA K20 GPUs.

GROMACS Scaling; 2 CPUs & 2 GPUs per Node
Water (192K atoms), running GROMACS 4.6.3; ns/day vs. node count (1-8, both axes log scale). The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes add 2x NVIDIA K20 GPUs.

Relative Performance
Water (192K atoms), running GROMACS 4.6.3. The blue nodes contain single or dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x or 2x NVIDIA M2090 or K20 GPUs.
Performance relative to the dual-CPU node: 0.6 for a single CPU, 1.0 for dual CPUs, and 1.1, 1.6, 1.7, 2.0, 2.0, and 2.9 for the GPU configurations.

GROMACS Scaling; 1 CPU & 2 GPUs per Node
Water (3072K atoms), running GROMACS 4.6.3; ns/day vs. node count (1-8, both axes log scale). The blue nodes contain single E5-2687W CPUs (8 cores per CPU); the green nodes add 2x NVIDIA K20 GPUs.

GROMACS Scaling; 2 CPUs & 2 GPUs per Node
Water (3072K atoms), running GROMACS 4.6.3; ns/day vs. node count (1-8, both axes log scale). The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes add 2x NVIDIA K20 GPUs.

Relative Performance
Water (3072K atoms), running GROMACS 4.6.3. The blue nodes contain single or dual E5-2687W CPUs (8 cores per CPU); the green nodes add 1x or 2x NVIDIA M2090 or K20 GPUs.
Performance relative to the dual-CPU node: 0.6 for a single CPU, 1.0 for dual CPUs, and 1.2, 1.7, 1.8, 2.0, 2.2, and 2.8 for the GPU configurations.

K20X Scaling over 8 Nodes
Water (192K atoms), running GROMACS 4.6.3. The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 2x NVIDIA K20X GPUs.
Speedups over CPU only at 1, 2, 4, and 8 nodes: 3x, 3x, 2.7x, and 2.3x.
GPUs significantly outperform CPUs while scaling over multiple nodes.

K20X Scaling over 8 Nodes
Water (3072K atoms), running GROMACS 4.6.3. The blue nodes contain dual E5-2687W CPUs (8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 2x NVIDIA K20X GPUs.
Speedups over CPU only at 1, 2, 4, and 8 nodes: 2.8x, 2.8x, 2.9x, and 2.8x.
GPUs significantly outperform CPUs while scaling over multiple nodes.

Additional Strong Scaling on a Larger System
128K water molecules, running GROMACS 4.6 pre-beta with CUDA 4.1. Each blue node contains 1x Intel X5670 (95 W TDP, 6 cores per CPU); each green node contains 1x Intel X5670 plus 1x NVIDIA M2070 (225 W TDP per GPU).
Speedups at 8 to 128 nodes range from 3.1x down to 2x as the node count grows.
Up to 128 nodes, NVIDIA GPU-accelerated nodes deliver 2-3x performance when compared to CPU-only nodes.

Replace 3 Nodes with 2 GPUs
Running GROMACS 4.6 pre-beta with CUDA 4.1; ADH in water (134K atoms). The blue node contains 2x Intel X5550 CPUs (95 W TDP, 4 cores per CPU); the green node contains the same CPUs plus 2x NVIDIA M2090 GPUs (225 W TDP per GPU).
Four CPU nodes: 6.7 nanoseconds/day at $8,000. One CPU node + 2x M2090: 8.36 nanoseconds/day at $6,500.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Save thousands of dollars and perform 25% faster.

Greener Science
ADH in Water (134K atoms), running GROMACS 4.6 with CUDA 4.1. Energy (kilojoules consumed) per nanosecond simulated; lower is better.
The blue nodes contain 2x Intel X5550 CPUs (95 W TDP, 4 cores per CPU); the green node contains 2x Intel X5550 CPUs plus 2x NVIDIA M2090 GPUs (225 W TDP per GPU).
Energy Expended = Power x Time
Comparing 4 nodes (760 Watts) against 1 node + 2x M2090 (640 Watts): in simulating each nanosecond, the GPU-accelerated system uses 33% less energy.

Recommended GPU Node Configuration for GROMACS Computational Chemistry
Workstation or Single Node Configuration
# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K10, K20, K20X
# of GPUs per CPU socket: 1x Kepler-based GPU (K20X, K20, or K10); this needs a fast Sandy Bridge, the very fastest Westmeres, or high-end AMD Opterons
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 2.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand

CHARMM
Release C37b1: GPU acceleration via OpenMM (GPUs supported on a single node)

GPUs Outperform CPUs
Daresbury Crambin (19.6k atoms), running CHARMM release C37b1.
The blue bar represents 44x X5667 CPUs (95 W, 4 cores per CPU) at $44,000; the green bars represent 2x X5667 CPUs plus 1x NVIDIA C2070 GPU ($3,000) or 2x C2070 GPUs ($4,000), at 238 W per GPU.
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
1 GPU = 15 CPUs.

More Bang for your Buck
Daresbury Crambin (19.6k atoms), running CHARMM release C37b1; scaled performance/price.
The blue bar represents 44x X5667 CPUs (95 W, 4 cores per CPU); the green bars represent 2x X5667 CPUs plus 1 or 2 NVIDIA C2070 GPUs (238 W).
Note: Typical CPU and GPU node pricing used. Pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
Using GPUs delivers 10.6x the performance for the same cost.

Greener Science with NVIDIA
Energy used in simulating 1 ns of Daresbury G1nBP (61.2k atoms), running CHARMM release C37b1; lower is better.
The blue bar represents 64x X5667 CPUs (95 W, 4 cores per CPU); the green bars represent 2x X5667 CPUs plus 1 or 2 NVIDIA C2070 GPUs (238 W each).
Energy Expended = Power x Time
Using GPUs will decrease energy use by 75%.

CHARMM Future Release with Native GPU Improvements (Multiple GPUs supported on multiple nodes)
Courtesy of Antti-Pekka Hynninen @ NREL

ACEMD (www.acellera.com)
470 ns/day on 1 GPU for L-Iduronic acid (1,362 atoms); 116 ns/day on 1 GPU for DHFR (23K atoms).
M. Harvey, G. Giupponi and G. De Fabritiis, "ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale," J. Chem. Theory and Comput. 5, 1632 (2009)
Features: NVT, NPT, PME, TCL, PLUMED, CAMSHIFT[1]
[1] M. J. Harvey and G. De Fabritiis, "An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware," J. Chem. Theory Comput., 5, 2371–2377 (2009)
[2] For a list of selected references see http://www.acellera.com/acemd/publications
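Throughput figures like these translate directly into wall-clock time-to-solution; a small sketch using the DHFR number quoted above (the 1 µs target length is an arbitrary example, not from the slide):

```python
# Wall-clock days needed to reach a target simulated length at a given throughput.
def days_to_simulate(target_ns: float, ns_per_day: float) -> float:
    return target_ns / ns_per_day

# DHFR (23K atoms) at 116 ns/day on one GPU, aiming for a
# 1 microsecond (1000 ns) trajectory.
print(f"{days_to_simulate(1000, 116):.1f} days")  # -> 8.6 days
```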
Quantum Chemistry Applications
Application | Features Supported | GPU Perf | Release Status | Notes

ABINIT | Local Hamiltonian, non-local Hamiltonian, LOBPCG algorithm, diagonalization/orthogonalization | 1.3-2.7x | Released, Version 7.4.1; multi-GPU support | www.abinit.org
ACES III | Integrating scheduling GPU into SIAL programming language and SIP runtime environment | 10x on kernels | Under development; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/deumens_ESaccel_2012.pdf
ADF | Fock matrix, Hessians | TBD | Available Q1 2014; multi-GPU support | www.scm.com
BigDFT | DFT; Daubechies wavelets; part of ABINIT | 2-5x | Released, Version 1.7.0; multi-GPU support | http://bigdft.org
Casino | Quantum Monte Carlo (QMC) electronic-structure calculations for finite and periodic systems | TBD | Under development; multi-GPU support | http://www.tcm.phy.cam.ac.uk/~mdt26/casino.html
CASTEP | TBD | TBD | Under development | http://www.castep.org/Main/HomePage
CP2K | DBCSR (sparse matrix multiply library) | 2-7x | Released; multi-GPU support | http://www.olcf.ornl.gov/wp-content/training/ascc_2012/friday/ACSS_2012_VandeVondele_s.pdf
GAMESS-US | Libqc with Rys Quadrature algorithm; Hartree-Fock, MP2 and CCSD | 1.3-1.6x (2.3-2.9x HF) | Released; multi-GPU support | http://www.msg.ameslab.gov/gamess/index.html

GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Quantum Chemistry Applications
Application | Features Supported | GPU Perf | Release Status | Notes

GAMESS-UK | (ss|ss) type integrals within calculations using Hartree-Fock ab initio methods and density functional theory; supports organics & inorganics | 8x | Released, Version 7.0; multi-GPU support | http://www.ncbi.nlm.nih.gov/pubmed/21541963
Gaussian | Joint PGI, NVIDIA & Gaussian collaboration | TBD | Under development; multi-GPU support | http://www.gaussian.com/g_press/nvidia_press.htm
GPAW | Electrostatic Poisson equation, orthonormalizing of vectors, residual minimization method (RMM-DIIS) | 8x | Released; multi-GPU support | https://wiki.fysik.dtu.dk/gpaw/devel/projects/gpu.html; Samuli Hakala (CSC Finland) & Chris O'Grady (SLAC)
Jaguar | Investigating GPU acceleration | TBD | Under development; multi-GPU support | Schrodinger, Inc.; http://www.schrodinger.com/kb/278; http://www.schrodinger.com/productpage/14/7/32/
LATTE | CU_BLAS, SP2 algorithm | TBD | Released; multi-GPU support | http://on-demand.gputechconf.com/gtc/2013/presentations/S3195-Fast-Quantum-Molecular-Dynamics-in-LATTE.pdf
MOLCAS | CU_BLAS support | 1.1x | Released, Version 7.8; single GPU, additional GPU support coming in Version 8 | www.molcas.org
MOLPRO | Density-fitted MP2 (DF-MP2), density-fitted local correlation methods (DF-RHF, DF-KS), DFT | 1.7-2.3x projected | Under development; multiple GPUs | www.molpro.net; Hans-Joachim Werner

GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Quantum Chemistry Applications
Application | Features Supported | GPU Perf | Release Status | Notes
MOPAC2012 | Pseudodiagonalization, full diagonalization, and density matrix assembling | 3.8-14x | Released; single GPU; academic port; MOPAC2013 available Q1 2014 | http://openmopac.net
NWChem | Triples part of Reg-CCSD(T); CCSD & EOMCCSD task schedulers | 7-8x | Released, Version 6.3; multiple GPUs | Development GPGPU benchmarks: www.nwchem-sw.org and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Krishnamoorthy-ESCMA12.pdf
Octopus | Full GPU support for ground-state, real-time calculations; Kohn-Sham Hamiltonian, orthogonalization, subspace diagonalization, Poisson solver, time propagation | 1.5-8x | Released, Version 4.1.0 | http://www.tddft.org/programs/octopus/
ONETEP | TBD | TBD | Under development | http://www2.tcm.phy.cam.ac.uk/onetep/
PEtot | Density functional theory (DFT) plane-wave pseudopotential calculations | 6-10x | Released; multi-GPU | First-principles materials code that computes the behavior of the electron structures of materials
Q-CHEM | RI-MP2 | 8x-14x | Released, Version 4.0.1; multi-GPU support | http://www.q-chem.com/doc_for_web/qchem_manual_4.0.pdf

GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Quantum Chemistry Applications
Application | Features Supported | GPU Perf | Release Status | Notes
QMCPACK | Main features | 3-4x | Released; multiple GPUs | NCSA, University of Illinois at Urbana-Champaign; http://cms.mcc.uiuc.edu/qmcpack/index.php/GPU_version_of_QMCPACK
Quantum Espresso/PWscf | PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs | 2.5-3.5x | Released, Version 5.0; multiple GPUs | Created by the Irish Centre for High-End Computing and http://www.quantum-espresso.org/
TeraChem | Completely redesigned to exploit GPU parallelism; "full GPU-based solution" | 44-650x vs. GAMESS CPU version | Released, Version 1.5k; multi-GPU/single node | YouTube: http://youtu.be/EJODzk6RFxE?hd=1 and http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Luehr-ESCMA.pdf
VASP | Hybrid Hartree-Fock DFT functionals including exact exchange | 2x (2 GPUs comparable to 128 CPU cores) | Available on request; multiple GPUs | By Carnegie Mellon University; http://arxiv.org/pdf/1111.0716.pdf
WL-LSMS | Generalized Wang-Landau method | 3x with 32 GPUs vs. 32 (16-core) CPUs | Under development; multi-GPU support | NICS Electronic Structure Determination Workshop 2012: http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/Eisenbach_OakRidge_February.pdf

GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

ABINIT
ABINIT on GPUs
Speed in the parallel version: For ground-state calculations, GPUs can be used. This is based on CUDA+MAGMA
For ground-state calculations, the wavelet part of ABINIT (which is BigDFT) is also very well parallelized: MPI band parallelism, combined with GPUs.

BigDFT
Courtesy of the BigDFT team @ CEA

GAUSSIAN

Gaussian: key quantum chemistry code.
ACS Fall 2011 press release: joint collaboration between Gaussian, NVIDIA and PGI for GPU acceleration: http://www.gaussian.com/g_press/nvidia_press.htm
No such release exists for Intel MIC or AMD GPUs.
Mike Frisch quote: "Calculations using Gaussian are limited primarily by the available computing resources," said Dr. Michael Frisch, president of Gaussian, Inc. "By coordinating the development of hardware, compiler technology and application software among the three companies, the new application will bring the speed and cost-effectiveness of GPUs to the challenging problems and applications that Gaussian's customers need to address."
GAMESS Partnership Overview

Mark Gordon and Andrey Asadchev, key developers of GAMESS, in collaboration with NVIDIA. Mark Gordon is a recipient of an NVIDIA Professor Partnership Award.
Quantum chemistry is one of the major consumers of CPU cycles at national supercomputer centers.
NVIDIA developer resources fully allocated to the GAMESS code.

"We like to push the envelope as much as we can in the direction of highly scalable efficient codes. GPU technology seems like a good way to achieve this goal. Also, since we are associated with a DOE laboratory, energy efficiency is important, and this is another reason to explore quantum chemistry on GPUs."
— Prof. Mark Gordon, Distinguished Professor, Department of Chemistry, Iowa State University, and Director, Applied Mathematical Sciences Program, Ames Laboratory

GAMESS August 2011 GPU Performance

First GPU-supported GAMESS release via "libqc", a library for fast quantum chemistry on multiple NVIDIA GPUs in multiple nodes, with CUDA software: two-electron AO integrals and their assembly into a closed-shell Fock matrix.
[Chart: GAMESS Aug. 2011 release, relative performance for two small molecules — Ginkgolide (53 atoms) and Vancomycin (176 atoms) — comparing 4x E5640 CPUs against 4x E5640 CPUs + 4x Tesla C2070s]

Upcoming GAMESS Q4 2013 Release
- Multi-node with multi-GPU supported
- Rys quadrature Hartree-Fock: 8 CPU cores + M2070 yields a 2.3-2.9x speedup over 8 CPU cores (see the 2012 publication)
- Møller–Plesset perturbation theory (MP2): preliminary code completed; paper in development
- Coupled cluster SD(T): CCSD code completed, (T) in progress
GAMESS - New Multithreaded Hybrid CPU/GPU Approach to H-F
[Chart: Hartree-Fock GPU speedups* — adding 1x 2070 GPU speeds up computations by 2.3x to 2.9x across Taxol and Valinomycin with the 6-31G, 6-31G(d), 6-31G(2d,2p) and 6-31++G(d,p) basis sets]

* A. Asadchev, M.S. Gordon, "New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock," Journal of Chemical Theory and Computation (2012)
GPAW: Increase Performance with Kepler

[Chart 1: Running GPAW 10258. Blue nodes: 1x E5-2687W CPU (8 cores per CPU). Green nodes: 1x E5-2687W CPU plus 1x or 2x NVIDIA K20X. Speedups over CPU-only of 1.4x-1.6x with one GPU and 2.5x-3.0x with two GPUs on the Silicon K=1, K=2 and K=3 tests.]

[Chart 2: Running GPAW 10258. Blue nodes: 1x E5-2687W CPU (8 cores per CPU). Green nodes: 1x E5-2687W CPU plus 2x NVIDIA K20 or K20X. Speedups of 1.7x, 2.2x and 2.4x on the Silicon K=1, K=2 and K=3 tests.]

[Chart 3: Running GPAW 10258. Blue nodes: 2x E5-2687W CPUs (8 cores per CPU). Green nodes: 2x E5-2687W CPUs plus 2x NVIDIA K20 or K20X. Speedups of 1.3x-1.4x on the Silicon K=1, K=2 and K=3 tests.]

Used with permission from Samuli Hakala
NWChem 6.3 Release with GPU Acceleration

NWChem addresses large, complex and challenging molecular-scale scientific problems in the areas of catalysis, materials, geochemistry and biochemistry on highly scalable, parallel computing platforms to obtain the fastest time-to-solution.
Researchers can for the first time perform large-scale coupled cluster with perturbative triples calculations utilizing NVIDIA GPU technology. A highly scalable multi-reference coupled cluster capability will also be available in NWChem 6.3.
The software, released under the Educational Community License 2.0, can be downloaded from the NWChem website at www.nwchem-sw.org

NWChem: Speedup of the non-iterative calculation for various configurations/tile sizes
System: cluster consisting of dual-socket nodes constructed from:
- 8-core AMD Interlagos processors
- 64 GB of memory
- Tesla M2090 (Fermi) GPUs
The nodes are connected using a high-performance QDR InfiniBand interconnect.

Courtesy of Kowalski, K., Bhaskaran-Nair, K., et al. @ PNNL, JCTC (submitted)

Kepler, Faster Performance (NWChem)
[Chart: uracil molecule, time to solution in seconds — 165 s CPU only, 81 s CPU + 1x K20X, 54 s CPU + 2x K20X]
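The reported speedups follow directly from the measured times to solution on the chart; as a quick arithmetic check:

```python
# Speedup = (CPU-only time) / (accelerated time), using the chart's times.
cpu_only = 165.0  # seconds, CPU only
one_gpu = 81.0    # seconds, CPU + 1x K20X
two_gpu = 54.0    # seconds, CPU + 2x K20X

print(round(cpu_only / one_gpu, 1))  # ~2.0x with one GPU
print(round(cpu_only / two_gpu, 1))  # ~3.1x with two GPUs
```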
Performance improves by 2x with one GPU and by 3.1x with two GPUs.

Quantum Espresso/PWscf
Kepler, Fast Science
[Chart: AUSURF performance relative to CPU only, running Quantum Espresso version 5.0-build7 on CUDA 5.0.36. The blue node contains 2x E5-2687W CPUs (150W, 8 cores per CPU); the green nodes contain 2x E5-2687W CPUs plus 1 or 2 NVIDIA M2090 or K10 GPUs (225W and 235W respectively). Configurations: CPU only, CPU + M2090, CPU + K10, CPU + 2x M2090, CPU + 2x K10]
Using K10s delivers up to 11.7x the performance per node over CPUs, and 1.7x the performance compared to M2090s.

Extreme Performance/Price from 1 GPU
[Chart: price and performance scaled to the CPU-only system. Quantum Espresso simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. Bars: price, performance on Shilu-3, and performance on Water-on-Calcite, for CPU only vs. CPU + GPU. Pictured: calcite structure]
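A useful way to read this slide is performance per dollar. Taking the slide's own figures (3.7x performance for a 25% price increase), a rough sketch:

```python
# Performance-per-dollar gain when adding a GPU, using the slide's figures.
speedup = 3.7        # CPU+GPU performance relative to CPU only
price_factor = 1.25  # CPU+GPU system price relative to CPU only (+25%)

perf_per_dollar_gain = speedup / price_factor
print(round(perf_per_dollar_gain, 2))  # ~2.96x more performance per dollar
```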
Adding a GPU can improve performance by 3.7x while only increasing price by 25%.

Extreme Performance/Price from 1 GPU
[Chart: price and performance scaled to the CPU-only system. Quantum Espresso simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. Bars: price, performance on AUSURF112 (k-point), and performance on AUSURF112 (gamma-point), for CPU only vs. CPU + GPU]

Calculation done for a gold surface of 112 atoms.
Adding a GPU can improve performance by 3.5x while only increasing price by 25%.

Replace 72 CPUs with 8 GPUs
[Chart: LSMO-BFO (120 atoms), 8 k-points — elapsed time in minutes. Quantum Espresso simulations run on PLX @ CINECA; Intel 6-core 2.66 GHz X5550s were used for the CPUs and NVIDIA M2070s for the GPUs. 120 CPUs ($42,000): 223 minutes; 48 CPUs + 8 GPUs ($32,800): 219 minutes]
The GPU-accelerated setup performs faster and costs 24% less.

QE/PWscf - Green Science
[Chart: LSMO-BFO (120 atoms), 8 k-points — power consumption in watts (lower is better). Quantum Espresso simulations run on PLX @ CINECA; Intel 6-core 2.66 GHz X5550s were used for the CPUs and NVIDIA M2070s for the GPUs. Configurations: 120 CPUs ($42,000) vs. 48 CPUs + 8 GPUs ($32,800)]
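The annual saving follows from the power gap between the two configurations. A minimal sketch, where the ~4.9 kW gap and the $0.10/kWh rate are assumptions chosen to illustrate how the slide's figure can arise, not values stated on the slide:

```python
# Annual energy-cost saving from a lower-power configuration (hypothetical inputs).
power_gap_kw = 4.9         # assumed difference in draw between the two setups, kW
hours_per_year = 24 * 365  # continuous operation
rate_per_kwh = 0.10        # assumed electricity price, $/kWh

annual_saving = power_gap_kw * hours_per_year * rate_per_kwh
print(round(annual_saving))  # ~4292, i.e. roughly the $4,300 quoted
```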
Over a year, the lower power consumption would save $4,300 on energy bills.

NVIDIA GPUs Use Less Energy
[Chart: energy consumption on different tests, in kWh (lower is better). Quantum Espresso simulations run on FERMI @ ICHEC; a 6-core 2.66 GHz Intel X5650 was used for the CPU and an NVIDIA C2050 for the GPU. CPU + GPU used 54-58% less energy than CPU only across Shilu-3, AUSURF112 and Water-on-Calcite]

In all tests, the GPU-accelerated system consumed less than half the power of the CPU-only system.

QE/PWscf - Great Strong Scaling in Parallel
[Chart: CdSe-159, walltime of 1 full SCF in seconds (lower is better). Quantum Espresso simulations run on STONEY @ ICHEC; two quad-core 2.87 GHz Intel X5560s were used in each node, with two NVIDIA M2090s per node for the CPU+GPU test. Speedups of 2.1x-2.5x for CPU+GPU over CPU at 2 (16), 4 (32), 6 (48), 8 (64), 10 (80), 12 (96) and 14 (112) nodes (total CPU cores)]
Speedups up to 2.5x with GPU acceleration on 159-atom cadmium selenide nanodots.

QE/PWscf - More Powerful Strong Scaling
GeSnTe134, walltime of full SCF

[Chart: walltime in seconds (lower is better). Quantum Espresso simulations run on PLX @ CINECA; two 6-core 2.4 GHz Intel E5645s were used in each node, with two NVIDIA M2070s per node for the CPU+GPU test. Speedups of 1.6x-2.4x for CPU+GPU over CPU at 4 (48), 8 (96), 12 (144), 16 (192), 24 (288), 32 (384) and 44 (528) nodes (total CPU cores)]

Accelerate your cluster by up to 2.1x with NVIDIA GPUs.
Try GPU-accelerated Quantum Espresso for free – www.nvidia.com/GPUTestDrive

TeraChem
TeraChem Supercomputer Speeds on GPUs

[Chart: time for SCF step in seconds, for the giant fullerene C240 molecule. TeraChem running on 8x C2050s in 1 node vs. NWChem running on 4,096 quad-core CPUs in the Chinook supercomputer. Configurations: 4,096 quad-core CPUs ($19,000,000) vs. 8x C2050 ($31,000)]
Similar performance from just a handful of GPUs.

TeraChem Bang for the Buck

[Chart: performance/price relative to the supercomputer — 493x for the GPU node. TeraChem running on 8x C2050s in 1 node vs. NWChem running on 4,096 quad-core CPUs in the Chinook supercomputer, for the giant fullerene C240 molecule. Configurations: 4,096 quad-core CPUs ($19,000,000) vs. 8x C2050 ($31,000). Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.]
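The headline number is driven almost entirely by the price gap between the two systems. As a rough check (the slide's ~493x additionally folds in the measured runtimes, which differ somewhat between the two systems):

```python
# Price ratio between the two systems, using the slide's prices.
supercomputer_price = 19_000_000  # 4,096 quad-core CPUs, Chinook
gpu_node_price = 31_000           # 1 node with 8x C2050

print(round(supercomputer_price / gpu_node_price))  # ~613x price difference
```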
Dollars spent on GPUs do 500x more science than those spent on CPUs.

Kepler's Even Better
[Charts: Olestra, 453 atoms — TeraChem running on C2050 and K20C, time in seconds. The first chart is BLYP/6-31G(d), the second B3LYP/6-31G(d)]
Kepler (K20C) performs 2x faster than Fermi (C2050).

Viz, "Docking" and Related Applications Growing
| Application | Features Supported | GPU Perf | Release Status | Notes |
| Amira 5® | 3D visualization of volumetric data and surfaces | 70x | Released, Version 5.5, Single GPU | Visualization from Visage Imaging; http://www.visageimaging.com/overview.html |
| BINDSURF | High-throughput parallel blind virtual screening; allows fast processing of large ligand databases | 100x | Available upon request to authors, Single GPU | http://www.biomedcentral.com/1471-2105/13/S14/S13 |
| BUDE | Empirical free energy forcefield | 6.5-13.4x | Released, Single GPU | University of Bristol; http://www.bris.ac.uk/biochemistry/cpfg/bude/bude.htm |
| Core Hopping | GPU-accelerated application | 3.75-5000x | Released, Single and Multi-GPUs | Schrodinger, Inc.; http://www.schrodinger.com/products/14/32/ |
| FastROCS | Real-time shape similarity searching/comparison | 800-3000x | Released, Single and Multi-GPUs | OpenEye Scientific Software; http://www.eyesopen.com/fastrocs |

GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

Viz, "Docking" and Related Applications Growing
| Application | Features Supported | GPU Perf | Release Status | Notes |
| Interactive Molecule Visualizer | Interactive molecule visualization | TBD | Released, Multi-GPU support | Written for use only on GPUs; http://www.molecular-visualization.com |
| Molegro Virtual Docker 6 | Energy grid computation, pose evaluation and guided differential evolution | 25-30x | Single GPU | http://www.molegro.com/gpu.php |
| PIPER Protein Docking | Molecule docking | 3-8x | Released | http://www.bu.edu/caadlab/gpgpu09.pdf; http://www.schrodinger.com/productpage/14/40/ |
| PyMol | Lines: 460% increase; cartoons: 1,246%; surface: 1,746%; spheres: 753%; ribbon: 426% | 1700x | Released, Version 1.6, Single GPU | http://pymol.org/ |
| VMD | High-quality rendering, large structures (100 million atoms), analysis and visualization tasks, multiple-GPU support for display of molecular orbitals | 100-125x or greater on kernels | Released, Version 1.9.1, Multi-GPU support | Visualization from University of Illinois at Urbana-Champaign; http://www.ks.uiuc.edu/Research/vmd/ |

GPU perf compared against a multi-core x86 CPU socket. GPU perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

John Stone (VMD developer) video

FastROCS
OpenEye Japan — Hideyuki Sato, Ph.D.
November 13, 2013 — © 2012 OpenEye Scientific Software

ROCS on the GPU: FastROCS

[Chart: shape overlays per second, CPU vs. GPU; axis up to 400,000]
FastROCS scaling across 4x K10s (2 physical GPUs per K10)
53 million conformers (10.9 million compounds of PubChem at 5 conformers per molecule)

[Chart: conformers per second vs. number of individual K10 GPUs, 1-8 (note, each K10 has 2 physical GPUs on the board); axis up to 9,000,000 conformers per second]
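To put the throughput in context, a small sketch; the 8M conformers/s figure is an assumed reading of the chart's 8-GPU bar, not a number stated in the text:

```python
# Time to screen the full conformer database at a given throughput (assumed figure).
conformers = 53_000_000  # database size from the slide
throughput = 8_000_000   # conformers/second on 8 GPUs (assumed from the chart)

print(round(conformers / throughput, 1))  # ~6.6 seconds for the whole database
```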
"Organic Growth" of Bioinformatics Applications

| Application | Features Supported | GPU Speedup | Release Status | Website |
| BarraCUDA | Alignment of short sequencing reads | 6-10x | Released, Version 0.6.2, Multi-GPU, multi-node | http://seqbarracuda.sourceforge.net/ |
| CUDASW++ | Parallel Smith-Waterman database search | 10-50x | Released, Version 2.0.10, Multi-GPU, multi-node | http://sourceforge.net/projects/cudasw/ |
| CUSHAW | Parallel, accurate long-read aligner for large genomes | 10x | Released, Version 1.0.40, Multiple GPUs | http://cushaw.sourceforge.net/ |
| GPU-BLAST | Protein alignment according to BLASTP | 3-4x | Released, Version 2.2.26, Single GPU | http://eudoxus.cheme.cmu.edu/gpublast/gpublast.html |
| mCUDA-MEME | Scalable motif discovery algorithm based on MEME | 4-10x | Released, Version 3.0.13, Multi-GPU, multi-node | https://sites.google.com/site/yongchaosoftware/mcuda-meme |
| MUMmerGPU | Aligns multiple query sequences against a reference sequence in parallel | 3-10x | Released, Version 3.0 | http://sourceforge.net/apps/mediawiki/mummergpu/index.php?title=MUMmerGPU |
| SeqNFind | Hardware and software for reference assembly, BLAST, Smith-Waterman, HMM, de novo assembly | 400x | Released, Multi-GPU, multi-node | http://www.seqnfind.com/ |
| SOAP3 | Short-read alignment tool that is not heuristic based; reports all answers | 10x | Beta release, Version 0.01 | http://soap.genomics.org.cn/soap3.html |
| UGENE | Fast short-read alignment | 6-8x | Released, Version 1.12, Multi-GPU, multi-node | http://ugene.unipro.ru/ |
| WideLM | Parallel linear regression on multiple similarly-shaped models | 150x | Released, Version 0.1-1, Multi-GPU, multi-node | http://insilicos.com/products/widelm |

GPU perf compared against the same or similar code running on a single-CPU machine; performance measured internally or independently.

NVBIO: Efficient Sequence Analysis on GPUs
Jacopo Pantaleoni, NVIDIA

nvBowtie: re-engineering Bowtie2 on NVBIO

We took Bowtie2's algorithms and rewrote them on top of NVBIO.
Supports the full spectrum of features:
- Single end / paired ends
- End-to-end / local alignment
- Best mapping / all mapping

Favors accuracy over speed.
© NVIDIA 2013 - ISMB
nvBowtie - Architecture Overview

[Diagram: a CPU background input thread decompresses and reformats FASTQ (gzip) input into read batches; the GPU consumes the batches, performing seed mapping, alignment scoring, reseeding and traceback; a CPU background output thread reformats and compresses the results to SAM/BAM]

nvBowtie - Simulated Reads
wgsim: 1M x 100bp reads - (wgsim -r 0.01 -1 100 -S11 -d0 -e0)
nvBowtie: K20
Bowtie2: 6-core SB (Sandy Bridge)

- Same sensitivity/accuracy at 3-4x the speed
- Greater sensitivity/accuracy at the same speed

nvBowtie - Real Datasets
| Dataset | Mode | Speedup | Alignment rate | Disagreement |
| Ion Proton, 100M x 175bp (8-350) | end-to-end | 4.3x | +0.5% | 0.002% |
| Ion Proton, 100M x 175bp (8-350) | local | 7.6x | -0.6% | 0.03% |
| Illumina HiSeq 2000 (ERR161544), 10M x 100bp x 2 | end-to-end | 2.4x | = | 0.006% |
| Illumina HiSeq 2000 (ERR161544), 10M x 100bp x 2 | local | 2.6x | = | 0.022% |
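The disagreement percentages translate into small absolute read counts; a quick sketch for the Ion Proton runs:

```python
# Absolute number of disagreeing alignments for the Ion Proton runs (100M reads).
reads = 100_000_000

print(round(reads * 0.002 / 100))  # end-to-end: ~2,000 reads out of 100M
print(round(reads * 0.03 / 100))   # local: ~30,000 reads out of 100M
```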
nvBowtie
- Same quality @ higher speed: best-mapping 2-7x faster on a K20 than Bowtie2 on a 6-core SB
- Opens new doors: all-mapping on a K20 comparable to best-mapping on a 6-core SB
- Possibly even faster: with more GPU memory, we could likely get 1.5x faster
Coming Soon — NVBIO: BWA-MEM, SeqAn, CRAM, Variant Calling …

Benefits of GPU Accelerated Computing
Faster than CPU only systems in all tests
Large performance boost with marginal price increase
Energy usage cut by more than half
GPUs scale well within a node and over multiple nodes
The K20 GPU is our fastest and lowest-power high-performance GPU yet
Try GPU-accelerated TeraChem for free – www.nvidia.com/GPUTestDrive

3 Ways to Accelerate Applications
Applications can be accelerated three ways:
- Libraries: "drop-in" acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility

GPU Accelerated Libraries
"Drop-in" Acceleration for your Applications
- Linear algebra (FFT, BLAS, SPARSE, matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
- Numerical & math (RAND, statistics): NVIDIA Math Lib, NVIDIA cuRAND
- Data structures & AI (sort, scan, zero sum): GPU AI for board games and path finding
- Visual processing (image & video): NVIDIA NPP, NVIDIA Video Encode

GPU Programming Languages
- Numerical analytics: R, MATLAB, Mathematica, LabVIEW
- Fortran: OpenACC, CUDA Fortran
- C: OpenACC, CUDA C
- C++: Thrust, CUDA C++
- Python: PyCUDA, Anaconda Accelerate
- C#: GPU.NET

Easiest Way to Learn CUDA
- Learn from the best
- Anywhere, any time
- It's free!
- 50K registered, 127 countries
Engage with an active community.

Develop on GeForce, Deploy on Tesla

GeForce GTX Titan — designed for gamers & developers:
- 1+ teraflop double-precision performance
- Dynamic parallelism, Hyper-Q for CUDA streams
- Available everywhere: Macs, PCs, workstations
- Develop on any GPU released in the last 3 years

Tesla K20X/K20 — designed for cluster deployment:
- ECC, 24x7 runtime, GPU monitoring, cluster management
- GPUDirect-RDMA, Hyper-Q for MPI
- 3-year warranty, integrated OEM systems, professional support
- Deploy on clusters and supercomputers

TITAN: World's Fastest Supercomputer
- 18,688 Tesla K20X GPUs
- 27 petaflops peak, 17.59 petaflops on Linpack
- 90% of performance from GPUs

Test Drive K20 GPUs!
Experience The Acceleration: www.nvidia.com/GPUTestDrive
Run Computational Chemistry Apps on Tesla K20 GPUs today
Try Preconfigured Apps: AMBER, NAMD, GROMACS, LAMMPS, Quantum Espresso, TeraChem Or Load Your Own
Sign up for FREE GPU Test Drive on remotely hosted clusters