
Duncan Poole

GPU Progress in Life Sciences

© NVIDIA, Feb 26th 2010

How researchers are engaging with NVIDIA

• CUDA Zone Developer Portal
• Collaboration
• Professor Partnership

• 1000+ Research Papers
• 120 Million CUDA GPUs
• 200+ universities teaching CUDA
• 60,000+ Active Developers

Fermi vs Tesla: The Computational GPU

Performance
• 8x Peak Double Precision of CPUs
• IEEE 754-2008 SP & DP Floating Point
• Increased Shared Memory from 16 KB to 64 KB
• Added L1 and L2 Caches

Flexibility
• ECC on all Internal and External Memories
• Enables up to 1 TeraByte of GPU Memory
• High Speed GDDR5 Memory Interface
• 512 Cores vs 240 Cores
• Multiple Simultaneous Tasks on GPU

Usability
• 10x Faster Atomic Operations
• C++ Support
• System Calls, printf support

Disclaimer: Specifications subject to change

Compiling C for CUDA Applications

Key kernels (C for CUDA) are compiled by NVCC (Open64) into CUDA object files; the rest of the C application goes through the host CPU compiler into CPU object files; the linker then combines both into a single CPU-GPU executable.

Example: a serial function, to be modified into parallel CUDA code:

    void saxpy_serial(float a, float *x, float *y, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    void main( )
    {
        ...
        saxpy_serial(...);
        ...
    }

Compute Tools

                 Windows                  Linux                     Mac OSX
Languages        C for CUDA (NVCC)        C for CUDA (NVCC)         C for CUDA (NVCC)
                 C++ for CUDA             C++ for CUDA              C++ for CUDA
                 Libraries                Libraries                 Libraries
                 OpenCL (OCG)             OpenCL                    OpenCL
                 DirectCompute            Fortran (PGI)

Debug &          NEXUS                    cuda-gdb                  cuda-gdb
Profile          Visual Profiler          Allinea DDT               Apple Shark
                                          TotalView                 Visual Profiler
                                          Visual Profiler

Dev              Visual Studio + NEXUS    Command-Line + cuda-gdb   Apple XCode + cuda-gdb
Environment

GPU Benefits to the Life Sciences Researcher

Reduced Time to Solution

Single runs currently take days, even on multiple nodes

Increase Throughput

Multiple runs at one time on a cluster

Larger Problem Sizes

Simulations will progress from 40-60K atoms to 500K+ atoms

Lowers System Costs (power costs)

Research workstations, departmental clusters, supercomputing centers

Need a critical mass of life science applications. Users ask:
• "Do my key codes run on a GPU?"
• "Does my code scale to multiple nodes?"
• "Can I run longer, run larger, or run multiple jobs?"

Tesla Bio Workbench Applications

Molecular dynamics & quantum chemistry
• AMBER (MD)
• ACEMD (MD)
• GROMACS (MD)
• GROMOS (MD)
• LAMMPS (MD)
• NAMD (MD)
• TeraChem (QC)
• VMD (MD & QC)

Docking
• GPU AutoDock

Sequence analysis
• CUDASW++ (Smith-Waterman)
• MUMmerGPU
• GPU-HMMER
• CUDA-MEME Motif Discovery
• CUDA-BLAST

Introducing Tesla Bio Workbench

Applications: TeraChem, LAMMPS, GPU-AutoDock, MUMmerGPU, ...
Community: Downloads, Technical papers, Discussion Forums, Benchmarks & Configurations, Documentation
Platforms: Tesla Personal Supercomputer, Tesla GPU Clusters

Molecular Dynamics and Quantum Chemistry Applications

AMBER: Assisted Model Building with Energy Refinement

Collaboration with NVIDIA to produce a CUDA version of AMBER's PMEMD engine:
• Implicit Solvent GB (v1.0 complete)
• Explicit Solvent PME (in progress)

Focus on accuracy. It MUST pass the AMBER regression tests. Energy conservation MUST be comparable to double precision CPU code.

The AMBER 10 CUDA patch is available now and will be standard in AMBER 11.

Source: "NVIDIA GPU Acceleration of Molecular Dynamics Simulations in AMBER" – Ross C. Walker

AMBER Molecular Dynamics

Roadmap
• Now: Generalized Born patch available
• Q1 2010: PME (Particle Mesh Ewald) beta release, Multi-GPU + MPI support
• Q2 2010: Beta 2 release

Performance (Generalized Born simulations, Tesla C1060 GPU vs 2x Intel Quad Core E5462 2.8 GHz)
• Myoglobin (2,492 atoms): 27.72 ns/day on GPU vs 4.04 ns/day on CPU (~7x)
• Nucleosome (25,095 atoms): 0.52 ns/day on GPU vs 0.06 ns/day on CPU (~8.6x)
• GB: 65-120x vs a single CPU core
• PME: comparable to 24 CPU cores

More info: http://www.nvidia.com/object/amber_on_tesla.html
Data courtesy of San Diego Supercomputing Center

GROMACS Molecular Dynamics

Roadmap
• Beta now: Particle Mesh Ewald (PME), Implicit solvent GB, Arbitrary forms of non-bonded interactions
• Q2 2010: Beta 2 release, Multi-GPU + MPI support

GROMACS on Tesla GPU vs CPU
• Reaction-field cutoffs: 3.5x-22x speedups
• PME: 1 Tesla GPU 3.5x-4.7x faster than CPU
• Benchmark systems: NaCl (894 atoms), Spectrin (37.8K atoms), Solvated Spectrin Dimer (75.6K atoms)

More info: http://www.nvidia.com/object/gromacs_on_tesla.html
Data courtesy of Stockholm Center for Biomembrane Research

NAMD: Scaling Molecular Dynamics on a GPU Cluster

Feature complete on CUDA; available in NAMD 2.7 Beta 2
• Full electrostatics with PME
• Multiple time-stepping
• 1-4 Exclusions

NAMD results on CPU vs GPU clusters (STMV, steps/sec, 1-16 quad-core CPUs or Tesla GPUs)
• 4 GPUs = 16 CPUs
• 4-GPU Tesla Personal Supercomputer outperforms 8 CPU servers
• Scales to a GPU cluster

More info: http://www.nvidia.com/object/namd_on_tesla.html
Data courtesy of Theoretical and Computational Biophysics Group, UIUC

NAMD: Scaling Molecular Dynamics on a GPU Cluster

STMV benchmark, NAMD 2.7 Beta 2 (steps/sec, CPU vs GPU clusters, 1-16 quad-core CPUs or Tesla GPUs)
• 4 GPUs = 16 CPUs
• 4-GPU Tesla Personal Supercomputer outperforms 8 CPU servers
• Researchers also noted a 2.7x better performance/watt

"We expect GPUs to maintain their current factor-of-10 advantage in peak performance relative to CPUs, while their obtained performance advantage for well-suited problems continues to grow." – Phillips, Stone, ACM Oct 2009

More info: http://www.nvidia.com/object/namd_on_tesla.html
Data courtesy of Theoretical and Computational Biophysics Group, UIUC

TeraChem: Quantum Chemistry Package for GPUs

Roadmap
• Beta now: HF, Kohn-Sham, DFT; multiple GPUs supported
• Q1 2010: Full release, MPI support

First QC software written ground-up for GPUs

Speedup of TeraChem on GPU vs GAMESS on CPU (4x Tesla C1060 GPU vs 2x Intel Quad Core 5520 CPU)
• Olestra: 50x
• Valinomycin: 11x
• Taxol: 7x
• Buckyball: 8x

4 Tesla GPUs outperform 256 quad-core CPUs

More info: http://www.nvidia.com/object/terachem_on_tesla.html

VMD: Acceleration using CUDA GPUs

Several CUDA-accelerated features in VMD 1.8.7
• Molecular orbital computation and display
• Coulomb-based ion placement
• Implicit ligand sampling

Speedups: 20x-100x
Multiple GPU support enabled

Chart: molecular orbital computation in VMD (10^3 lattice points/sec), C60 and threonine-17 test cases

CPU: Intel Core2 Q6600 (Quad) 2.4 GHz; GPU: Tesla C1060
More info: http://www.nvidia.com/object/vmd_on_tesla.html
Images and data courtesy of Beckman Institute for Advanced Science and Technology, UIUC

Bio-Informatics Applications

GPU-HMMER: Sequence Alignment

Protein sequence alignment using profile HMMs
• Available now; supports multiple GPUs
• Speedups range from 60-100x faster than CPU

Runtime by HMM size (# of states), log scale; 3 / 2 / 1 Tesla C1060 GPUs vs Opteron 2376 2.3 GHz CPU
• 2295 states: 27 / 37 / 69 mins on GPUs vs 28 hours on CPU
• 1431 states: 15 / 21 / 40 mins vs 25 hours
• 789 states: 8 / 11 / 21 mins vs 11 hours

Download: http://www.mpihmmer.org/releases.htm

MUMmerGPU: Genome Sequence Alignment

High-throughput pairwise local sequence alignment
• Designed for large sequences
• Drop-in replacement for the "mummer" component in MUMmer
• Speedups of 3.5x to 3.75x

Runtime, 1 Tesla GPU (8-series) vs Intel Xeon 5120 (3.0 GHz)
• S. suis: 3 mins vs 11 mins
• L. monocytogenes: 6 mins vs 22 mins
• C. briggsae: 10 mins vs 38 mins

Download: http://mummergpu.sourceforge.net

More Coming Soon

CUDA Smith Waterman: Sequence Alignment

CUDA MEME: Motif Discovery

GPU AutoDock: Protein Docking

Backup: Lessons Learned

Although seemingly harmless at first glance, one must not recycle random numbers in any way, or conformational sampling is greatly reduced. In addition, doing so, even if the numbers are permuted, causes positional and/or rotational aliasing of the system, because the random forces do not quite sum to zero.

On a related note, one cannot skimp on random number quality, or the system will rapidly heat up. CUDA is sophisticated enough to allow generation of high-quality random numbers.

Summation of forces *must* be deterministic. This means one cannot use atomic ops to perform this sum unless one first converts to a fixed-point representation, and one cannot use atomic ops to order neighbor lists or bins, or simulations will not be reproducible. Floating point math is not associative.

Verification: a lot of time was put into validating this work. The goal of this project was to port production-quality code to the GPU, not to do research, but to accelerate the use of this code for everyday science.

Fermi has 3x the shared memory, allowing the bonded force kernel to cache incoherent stores, and its faster sorting leads to faster nonbonded cell calculation.

Accelerated 64-bit shared memory atomic operations allow for faster (and simpler) charge interpolation as well as local accumulation of forces into shared memory (if converted to 64-bit fixed point first)

Vastly improved double-precision performance accelerates FFTs, force accumulation, and integration
