TESLA™ BIO WORKBENCH
ENABLING NEW SCIENCE

The NVIDIA® Tesla™ Bio Workbench enables biophysicists and computational chemists to push the boundaries of life sciences research. It turns a standard PC into a "computational laboratory" capable of running complex bioscience codes, in fields such as drug discovery and DNA sequencing, more than 10-20 times faster through the use of NVIDIA Tesla GPUs. It consists of bioscience applications; a community site for downloading, discussing, and viewing the results of these applications; and GPU-based platforms that enable these applications to run at 1/10th the cost of CPU-only computers.

Complex molecular simulations that had previously been possible only on supercomputing resources can now be run on an individual workstation, optimizing the scientific workflow and accelerating the pace of research. These simulations can also be scaled up to GPU-based clusters of servers to simulate large molecules and systems that would otherwise have required a supercomputer. Applications that are accelerated on GPUs include:

• Molecular Dynamics & Quantum Chemistry
- AMBER, GROMACS, HOOMD, LAMMPS, NAMD, TeraChem (Quantum Chemistry), VMD

• Bio Informatics
- CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDASW++ (Smith-Waterman), GPU-HMMER, MUMmerGPU

For more information, visit www.nvidia.com/bio_workbench

GPU SOLUTIONS

The Tesla Bio Workbench applications can be deployed on GPU-based desktop personal supercomputers or in data center solutions. Built on the revolutionary, massively parallel CUDA™ GPU computing architecture, these solutions are designed to accelerate the pace of computational science and are available today. Visit www.nvidia.com/tesla to find out more about:

WORKSTATION SOLUTIONS: TESLA PERSONAL SUPERCOMPUTER, for personal supercomputing at your desk

DATA CENTER SOLUTIONS: TESLA GPU COMPUTING CLUSTERS, for computing with large-scale installations

GPU-BASED MOLECULAR DYNAMICS & QUANTUM CHEMISTRY APPLICATIONS

AMBER

The explicit solvent and implicit solvent simulations in AMBER have been accelerated using CUDA-enabled GPUs. Paired with a Tesla GPU computing solution built on the CUDA architecture, AMBER achieves a more than 10x speedup compared to a single quad-core CPU. Benchmark data for explicit solvent PME and implicit solvent GB simulations can be found on the AMBER 11 NVIDIA benchmark page, courtesy of the San Diego Supercomputing Center.

HOOMD

HOOMD-blue is a general-purpose particle dynamics package written from the ground up to take advantage of the revolutionary CUDA architecture in NVIDIA GPUs. It contains a number of different force fields and integrators, and its object-oriented design makes it easy to add new ones. Detailed benchmarks are available on the HOOMD-blue benchmark page; one NVIDIA Tesla GPU running HOOMD-blue can outperform 32 CPU cores.

[Chart: AMBER benchmark, ns/day for explicit solvent PME (DHFR NVE, 23,558 atoms: 20.70 on 2 GPUs, 11.93 on 1 GPU, 5.94 on 8 CPU cores) and implicit solvent GB (Nucleosome, 25,095 atoms: 1.04 on 2 GPUs, 0.52 on 1 GPU, 0.06 on 8 CPU cores). CPU = 2x quad-core Intel Xeon E5462, ifort 10.1, MKL 10.1, MPICH2; GPU = Tesla C1060 / Tesla C2050, gfortran 4.1, nvcc v3.0.]

[Chart: HOOMD on GPUs vs LAMMPS on 32 CPU cores; time steps/sec for HOOMD-blue on 1 and 2 Tesla 10-series GPUs vs LAMMPS on 32 x86 CPU cores, for Polymer and Lennard-Jones Liquid systems (64K particles).]

Data courtesy of University of Michigan
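Packages such as HOOMD-blue and LAMMPS pair force-field evaluation with a time-stepping integrator, and it is the pairwise force loop that these codes offload to the GPU. As a rough illustration only, here is a hypothetical CPU sketch (not code from any of the packages named) of a Lennard-Jones force evaluation and one velocity-Verlet step in reduced units:

```python
import numpy as np

def lj_forces(pos, eps=1.0, sigma=1.0):
    """All-pairs Lennard-Jones forces and potential energy, O(N^2).

    Production MD codes use neighbor lists and evaluate this loop in
    parallel on the GPU; this is only a readable reference version.
    """
    n = len(pos)
    f = np.zeros_like(pos)
    pe = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[i] - pos[j]
            r2 = np.dot(r, r)
            inv6 = (sigma * sigma / r2) ** 3       # (sigma/r)^6
            # |F| / r = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2
            fmag = 24.0 * eps * (2.0 * inv6 * inv6 - inv6) / r2
            f[i] += fmag * r
            f[j] -= fmag * r
            pe += 4.0 * eps * (inv6 * inv6 - inv6)
    return f, pe

def velocity_verlet(pos, vel, dt=0.005, mass=1.0):
    """One velocity-Verlet integration step (kick-drift-kick)."""
    f, _ = lj_forces(pos)
    vel = vel + 0.5 * dt * f / mass
    pos = pos + dt * vel
    f, _ = lj_forces(pos)
    vel = vel + 0.5 * dt * f / mass
    return pos, vel
```

A quick sanity check: two particles separated by the potential minimum at r = 2^(1/6)·sigma feel zero force and have energy -eps.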

GROMACS

GROMACS is a molecular dynamics package designed primarily for the simulation of biochemical molecules like proteins, lipids, and nucleic acids that have many complicated bonded interactions. The CUDA port of GROMACS enabling GPU acceleration is now available in beta and supports Particle-Mesh-Ewald (PME), arbitrary forms of non-bonded interactions, and implicit solvent Generalized Born methods. The CUDA version of GROMACS currently supports a single GPU and produces the following results:

[Chart: GROMACS femtoseconds simulated per second on NaCl (894 atoms) with Particle-Mesh-Ewald (PME) and Reaction-Field cutoffs; a Tesla C1060 GPU is up to 22x faster than a 2x Intel quad-core E5462 system.]

Data courtesy of Stockholm Center for Biomembrane Research

LAMMPS

LAMMPS is a classical molecular dynamics package written to run well on parallel machines. The CUDA version of LAMMPS is accelerated by moving the force calculations to the GPU.

GPU-LAMMPS scales very well, as seen in the results below: two Tesla GPUs running GPU-LAMMPS outperform 24 CPUs.

[Chart: LAMMPS results on CPU vs GPU clusters; steps/sec for Solvated Spectrin (37.8K atoms) and Spectrin Dimer (75.6K atoms) vs. number of quad-core CPUs or Tesla GPUs (1, 8, 24); 2 GPUs = 24 CPUs.]

Data courtesy of Scott Hampton & Pratul K. Agarwal, Oak Ridge National Laboratory

NAMD

The team at the University of Illinois at Urbana-Champaign (UIUC) has been enabling CUDA acceleration in NAMD since 2007. They have performed scaling experiments on the NCSA Tesla-based Lincoln cluster and demonstrated that 4 Tesla GPUs can outperform a cluster with 16 quad-core CPUs. NAMD scales very well on a Tesla GPU cluster, as demonstrated by these results.

[Chart: NAMD results on CPU vs GPU clusters; STMV steps/s vs. number of quad-core CPUs or Tesla GPUs (1, 2, 4, 8, 16); 4 GPUs = 16 CPUs.]

Data courtesy of Theoretical and Computational Bio-physics Group, UIUC

VMD

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting. Several key kernels and applications in VMD now take advantage of the massively parallel CUDA architecture of NVIDIA's GPUs. These applications run 20x to 100x faster when using an NVIDIA CUDA GPU compared to running them on a CPU only.

[Chart: Molecular orbital computation in VMD; lattice points/sec for C60 (carbon-60) and Thr (threonine-17) test cases on a Tesla C1060 GPU vs. an Intel Core 2 Q6600 (quad) 2.4 GHz CPU.]

Data courtesy of Theoretical and Computational Bio-physics Group, UIUC

TeraChem

TeraChem is the first general-purpose quantum chemistry package designed from the ground up to take advantage of the massively parallel CUDA architecture of NVIDIA GPUs. TeraChem currently supports density functional theory (DFT) and Hartree-Fock (HF) methods.

TeraChem supports multiple GPUs using p-threads and will soon be MPI enabled. A workstation with 4 Tesla GPUs running TeraChem outperforms 256 quad-core CPUs running GAMESS.

[Chart: Speedup of TeraChem on GPU vs GAMESS on CPU; Olestra 50x, Valinomycin 11x, Taxol 7x, Buckyball 8x; 4x Tesla C1060 GPUs vs. 2x Intel quad-core 5520 CPUs.]

Data courtesy of PetaChem

GPU-BASED BIO-INFORMATICS APPLICATIONS

CUDA-BLASTP

CUDA-BLASTP is designed to accelerate NCBI BLAST for scanning protein sequence databases by taking advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs. CUDA-BLASTP also has a utility to convert a FASTA-format database into files readable by CUDA-BLASTP.

CUDA-BLASTP running on a workstation with two Tesla C1060 GPUs is 10x faster than NCBI BLAST (2.2.22) running on an Intel i7-920 CPU. This cuts compute time from minutes on CPUs to seconds using GPUs.

[Chart: CUDA-BLASTP vs NCBI BLASTP speedups vs. query sequence length (127 to 2026); NCBI BLASTP with 1 and 4 threads on an Intel i7-920 CPU vs. CUDA-BLASTP on 1 and 2 Tesla C1060 GPUs.]

Data courtesy of Nanyang Technological University, Singapore

CUDA-MEME

CUDA-MEME is a motif discovery software based on MEME (version 3.5.4). It accelerates MEME by taking advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs. It supports the OOPS and ZOOPS models in MEME.

CUDA-MEME running on one Tesla C1060 GPU is up to 23x faster than MEME running on an x86 CPU. This cuts compute time from hours on CPUs to minutes using GPUs. The data in the chart below are for the OOPS (one occurrence per sequence) model for 4 datasets.

[Chart: CUDA-MEME vs MEME speedups on the Mini-drosoph (4 sequences), HS_100 (100 sequences), HS_200 (200 sequences), and HS_400 datasets; Tesla C1060 GPU vs. AMD Opteron 2.2 GHz CPU.]

Data courtesy of Nanyang Technological University, Singapore

CUDA-EC

CUDA-EC is a fast parallel sequence error correction tool for short reads. It corrects sequencing errors in high-throughput short-read (HTSR) data by taking advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs. Error correction is a preprocessing step for many DNA fragment assembly tools and is very useful for the new high-throughput sequencing machines.

CUDA-EC running on one Tesla C1060 GPU is up to 20x faster than Euler-SR running on an x86 CPU. This cuts compute time from minutes on CPUs to seconds using GPUs. The data in the chart below are for error rates of 1%, 2%, and 3%, denoted by A1, A2, A3, and so on, for 5 different reference genomes.

[Chart: CUDA-EC vs Euler-SR speedups for 5 reference genomes: S. cer 5 (0.58M), S. cer 7 (1.1M), H. inf (1.9M), E. col (4.7M), and S. aureus (4.7M); read length = 35, error rates of 1%, 2%, 3%; CUDA-EC on a Tesla C1060 vs. Euler-SR on an AMD Opteron 2.2 GHz.]

Data courtesy of Nanyang Technological University, Singapore

CUDASW++ (Smith-Waterman)

CUDASW++ is a bio-informatics software for Smith-Waterman protein database searches that takes advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs to perform sequence searches 10x-50x faster than NCBI BLAST. CUDASW++ supports query lengths up to 59K.

CUDASW++ can take advantage of multiple Tesla GPUs, reaching up to 30 GCUPS on query lengths of over 5000. Whereas BLAST is a heuristic algorithm, CUDASW++ does optimal local alignment using the Smith-Waterman algorithm.

[Chart: CUDASW++ vs NCBI BLAST; GCUPS vs. query sequence length (144 to 5478) for CUDASW++ 2.0 on 1 and 2 Tesla C1060 GPUs (BLOSUM45) and NCBI BLAST on CPU (BLOSUM50 and BLOSUM62).]

Data courtesy of Nanyang Technological University, Singapore
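Whereas BLAST prunes the search space with heuristic seeding, Smith-Waterman fills the full dynamic-programming matrix. The following is a hypothetical minimal CPU sketch of that recurrence (not CUDASW++'s GPU code; simple match/mismatch scores stand in for a BLOSUM substitution matrix, and a linear rather than affine gap penalty is assumed):

```python
def smith_waterman(a, b, match=3, mismatch=-3, gap=-2):
    """Optimal local alignment score between sequences a and b.

    Fills the full (len(a)+1) x (len(b)+1) DP matrix -- the O(mn)
    cell updates that CUDASW++ spreads across thousands of GPU threads.
    """
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            rows[i][j] = max(
                0,                       # start a new local alignment
                rows[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                rows[i - 1][j] + gap,    # gap in b
                rows[i][j - 1] + gap,    # gap in a
            )
            best = max(best, rows[i][j])
    return best
```

For example, two identical 7-residue sequences score 7 x match, while sequences with nothing in common score 0, since a local alignment can always start fresh.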

GPU-HMMER

GPU-HMMER is a bioinformatics software that does protein sequence alignment using profile HMMs by taking advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs.

GPU-HMMER accelerates the hmmsearch tool using GPUs and achieves speed-ups of 60-100x over HMMER (2.0). GPU-HMMER can take advantage of multiple Tesla GPUs in a workstation to reduce a search from hours on a CPU to minutes on a GPU.

[Chart: GPU-HMMER vs HMMER speedups (60-100x) vs. number of HMM states (789, 1431, 2295); 1, 2, and 3 Tesla C1060 GPUs vs. an Opteron 2376 2.3 GHz CPU.]

Data courtesy of Scalable Informatics and University of Buffalo

MUMmerGPU

MUMmerGPU is a bio-informatics software for high-throughput sequence alignment using GPUs. It accelerates the alignment of multiple query sequences against a single reference sequence by taking advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs.

MUMmerGPU (version 2.0) provides a 3x-4x speedup compared to running MUMmer on CPUs. The data shown in the chart below is based on MUMmerGPU version 1.0 and will soon be updated.

[Chart: MUMmerGPU vs MUMmer; alignment time in minutes for S. suis, L. monocytogenes, and C. briggsae on 1 Tesla GPU (8-series) vs. an Intel Xeon 5120 (3.0 GHz) CPU.]

Data courtesy of University of Maryland

To learn more about the NVIDIA Tesla Bio Workbench, go to www.nvidia.com/bio_workbench.

© 2010 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, NVIDIA Tesla, and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation. All company and product names are trademarks or registered trademarks of the respective owners with which they are associated.