GPGPU Progress in Computational Chemistry
Total Page:16
File Type:pdf, Size:1020Kb
GPGPU Progress in Computational Chemistry Mark Berger, Life and Material Sciences Alliances Manager [email protected] Agenda • A Little about NVIDIA • Update on GPUs in Computational Chemistry • A Couple of Examples • Upcoming Events 2 PARALLEL PERSONAL VISUALIZATION COMPUTING COMPUTING c 3 NVIDIA and HPC Evolution of GPUs Public, based in Santa Clara, CA | ~$4B revenue | ~5,500 employees Founded in 1999 with primary business in semiconductor industry Products for graphics in workstations, notebooks, mobile devices, etc. Began R&D of GPUs for HPC in 2004, released first Tesla and CUDA in 2007 Development of GPUs as a co-processing accelerator for x86 CPUs HPC Evolution of GPUs 2004: Began strategic investments in GPU as HPC co-processor 2006: G80 first GPU with built-in compute features, 128 cores; CUDA SDK Beta 2007: Tesla 8-series based on G80, 128 cores – CUDA 1.0, 1.1 2008: Tesla 10-series based on GT 200, 240 cores – CUDA 2.0, 2.3 3 Years With 3 Generations 2009: Tesla 20-series, code named “Fermi” up to 512 cores – CUDA SDK 3.0, 3.2 4 Tesla Data Center & Workstation GPU Solutions Tesla M-series GPUs Tesla C-series GPUs M2090 | M2070 | M2050 C2070 | C2050 Integrated CPU-GPU Workstations Servers & Blades 2 to 4 Tesla GPUs M2090 M2070 M2050 C2070 C2050 Cores 512 448 448 448 448 Memory 6 GB 6 GB 3 GB 6 GB 3 GB Memory bandwidth 177.6 GB/s 150 GB/s 148.8 GB/s 148.8 GB/s 148.8 GB/s (ECC off) Single 1030 1030 1331 1030 1030 Peak Precision Perf 515 515 Double Gflops 665 515 515 Precision 5 GPUs are Disruptive Speed Throughput Insight Days to minutes Simulate more scenarios Real time analysis Minutes to seconds Add GPUs - Accelerate Computing CPU GPU Speedup Minimum Port, Big Speed-up Application Code Rest of Sequential Only Critical Functions CPU Code GPU Parallelize using CUDA CPU Programming Model + Tesla GPUs Power 3 of Top 5 Supercomputers #2 : Tianhe-1A #4 : Nebulae #5 : Tsubame 2.0 4650 Tesla GPU’s 1.2 PFLOPS 4224 Tesla GPU’s 1.194 PFLOPS 7168 Tesla GPU’s 2.5 PFLOPS “We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU, this is a new innovation. ” Premier Wen Jiabao Public comments acknowledging Tianhe-1A More Performance, Less Power Performance per Megawatts Power Fastest Top500 Systems 4.5 4 3.5 GPU Systems #2, Tienhe 1A 3 7168 Tesla GPUs 2.5 PFLOPS 2.5 #4, Nebulae 2 #5, Tsubame 2.0 4650 Tesla GPUs Petaflop/s 4224 Tesla GPUs 1.2 PFLOPS 1.5 1.194 PFLOPS Traditional Systems 1 0.5 0 0 1 2 3 4 5 6 7 8 Megawatts Power Strategic Focus on Applications Senior-level relationship and market managers Dedicated technical resources More than 150 people devoted to libraries, tools, application porting and market development Worldwide focus Reaching a Broad Range of Markets Scientific computing Creative pro Education / research Why Computational Chemistry? 13 Usage of TeraGrid National Supercomputing Grid 2008 TeraGrid Usage By Discipline Atmospheric All 15 Others Molecular Sciences 3% Biosciences 2% 29% Earth Sciences 5% Half of the cycles Advanced Chemistry Scientific 13% Computing 5% Astronomical Sciences Physics Chemical, 12% Thermal 21% Materials Systems Research 4% 6% 14 Molecular Dynamics/Mechanics Leading MD Applications Features Application GPU Perf Release Status Notes Supported PMEMD: Single and multi-GPUs. AMBER Explicit & Implicit 8X V11 Released Expect 2x more performance in Solvent V11 patch release (shortly) Implicit (5x), Explicit Single GPU released, Next release: 2H2011 2x-5x GROMACS (2x) Solvent Version 4.5.4 Better Explicit, MPI Lennard-Jones, Gay- 3.4-24x Released Single and multi-GPU. LAMMPS Berne Non-bond force 2x-7x Released, v2.8 Single and multi-GPU. NAMD calculation GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison 16 Additional MD/MM Applications Ramping Features Application GPU Perf Release Status Notes Supported TBD, Single GPU. 4-29X Released Abalone “Simulations” (on 1060 GPU) Agile Molecule, Inc. Production bio-molecular “µ-sec long Written for use on dynamics (MD) software specially trajectories on Released ACEMD GPUs optimized to run on single and workstation” multi-GPUs Two-body Forces, Link- V 4.0 Source only Next release: 2H2011 cell Pairs, Ewald SPME 4x DL_POLY Results Published Multi-GPU, multi-node supported forces, Shake VV HOOMD- Written for use on 2X Released, Version Single and multi-GPU. GPUs (32 CPU cores vs. 0.9.2 Blue 2 10XX GPUs) GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison 17 Viz and “Docking” Applications Related Features GPU Release Status Notes Applications Supported Perf Core GPU accelerated Up to Single and multi-GPUs. Released, Suite 2011 Hopping application 5000X Schrodinger, Inc. Real-time shape Single and multi-GPUs. similarity 800-3000X Released FastROCS Open Eyes Scientific Software searching/comparison High quality rendering, large structures (100 million atoms), GPU acceleration for computationally demanding 100-125X Visualization from University of analysis and visualization tasks, Released, Version 1.9 VMD multiple GPU support for very or greater Illinois at Urbana-Champaign fast display of molecular orbitals arising in quantum chemistry calculations GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison 18 Quantum Chemistry Quantum Chemistry Features GPU Application Release Status Notes Supported Perf Libqc with Rys Single GPU supported in 10/1/10 Quadrature Algorithm, GAMESS- release. integral evaluation, 2.5X Released Multi-GPU supported later closed shell Fock US 2011 release. matrix construction Triples part of Reg- Development GPGPU CCSD(T), CCSD & 3-8X Date TBA, benchmarks: www.nwchem- NWChem EOMCCSD task projected in development sw.org schedulers Date TBA, Various features 8-14x In development Significant porting already Q-CHEM including RI-MP2 projected 44-650X Single and Multi-GPU. “Full GPU-based vs. Completely redesigned to exploit Version 1.45 released TeraChem solution” GAMESS massive GPU parallelism CPU ver. GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison 20 Material Science Material Science Features GPU Application Release Status Notes Supported Perf BigDFT - 50% of the http://inac.cea.fr/L_Sim/BigDFT/ne Abinit program (short 6-30X Released June 2009 ws.html convolutions) PWscf package: linear Quantum- algebra (matrix Created by Irish Centre for High- multiply), explicit TBD Released May 5, 2011 Espresso/ End Computing computational kernels, PWscf 3D FFTs GPU Perf compared against Multi-core x86 CPU socket. GPU Perf benchmarked on GPU supported features and may be a kernel to kernel perf comparison 22 Make it easy to find: Tesla Bio WorkBench Molecular Dynamics & Quantum Chemistry Bio-Informatics TeraChem Hex (Docking) Applications LAMMPS CUDA-BLASTP CUDASW++ MUMmerGPU CUDA-EC Download, Technical Discussion Benchmarks Community Documentation papers Forums & Configurations Tesla Personal Supercomputer Tesla GPU Clusters Platforms 23 A Couple of Examples 24 AMBER and NVIDIA Collaboration AMBER 11 Collaboration supported both explicit solvent Particle Mesh Ewald Molecular Dynamics simulations (NVE, NVT, NPT) and implicit solvent Generalized Born simulations – CUDA support in AMBER 11 – Released April 2010 AMBER 11 plus patches Collaboration on improvements that double (2X) performance – Released August 2011 See the Performance Difference! Trp-Cage (304 atom) Protein Folding Simulation via GPU Accelerated AMBER 11 80 ns/day 367 ns/day 4 core CPU 4 core CPU + Tesla M2070 GPU Up to 4X Performance Increase Data courtesy of AMBER.org Outstanding AMBER Results — Just add GPUs JAC NVE 70 60 Higher is better 50 JAC DFHR — Dihydrofolate 40 reductase (DFHR) ns / Day ns — 23,558 atoms 30 — Joint AMBER CHARMM (JAC) benchmark JAC NVE - (2x CPU)/Node 20 — Water modeled as an JAC NVE - (2x CPU + 2x GPUs)/Node explicit TIP3P solvent 10 1 2 4 6 # of Nodes Base node configuration: Dual Xeon X5670s and Dual Tesla M2070 GPUs on each node And Get Up to 3.5X Performance Increase Adding GPUs Improves Performance on Small and Large Molecules 5.0 300% higher 4.0 (2x E5670 CPU + 2x Tesla C2070) per node 190% higher 3.0 2x E5670 CPUs per node 2.0 Base node configuration: Dual Xeon X5670s and Dual With GPU Tesla M2070 GPUs per node 1.0 No GPU Relative AMBER & AMBER Large for Small Molecules Performance Relative 0.0 JAC DFHR (23,558 atoms) Cellulose NPT (408,609 atoms) Make Research More Productive with GPUs 300% Higher 4.0 3.0 AMBER 11 on 2X E5670 CPUs + 2X Tesla C2070s (per node) AMBER 11 on 2X E5670 2.0 CPUs (per node) 54% Higher Base node configuration: With GPU Dual Xeon X5670s and Dual 1.0 Tesla M2070 GPUs per node No GPU Relative Cost and Performance Benefit Scale Benefit Performance and Cost Relative 0.0 Cost of 1 Node Performance Speed-up Adding Two 2070 GPUs to a Node Yields a 3x Performance Increase LAMMPS Gay-Berne Benchmark GPGPU Comparison 26 M2050 Node 24 Westmere– SLES 10 22 Processor: Xeon 5650 (2.66 GHz) GPU: Tesla M2050 20 Cache: 12MB/processor, shared 18 Node: 2-processor 12-core 3-GPU SL390s 16 M2090 Node 14 Westmere– SLES 10 Processor: Xeon 5650 (2.66 GHz) 12 GPU: Tesla M2090 10 High Cache: 12MB/processor, shared 8 is Node: 2-processor 12-core 8-GPU SL390s 6 Best 4 Each process uses 1 Core & 1 GPU