GPGPU Progress in Computational Chemistry

Mark Berger, Life and Material Sciences Alliances Manager
[email protected]

Agenda
• A Little about NVIDIA
• Update on GPUs in Computational Chemistry
• A Couple of Examples
• Upcoming Events

Parallel Computing | Personal Computing | Visualization

NVIDIA and HPC Evolution of GPUs
• Public, based in Santa Clara, CA | ~$4B revenue | ~5,500 employees
• Founded in 1993 with primary business in semiconductor industry
• Products for graphics in workstations, notebooks, mobile devices, etc.
• Began R&D of GPUs for HPC in 2004, released first Tesla and CUDA in 2007
• Development of GPUs as a co-processing accelerator for x86 CPUs

HPC Evolution of GPUs
• 2004: Began strategic investments in GPU as HPC co-processor
• 2006: G80, first GPU with built-in compute features, 128 cores; CUDA SDK Beta
• 2007: Tesla 8-series based on G80, 128 cores; CUDA 1.0, 1.1
• 2008: Tesla 10-series based on GT 200, 240 cores; CUDA 2.0, 2.3
• 2009: Tesla 20-series, code named "Fermi", up to 512 cores; CUDA SDK 3.0, 3.2
3 Years With 3 Generations (2007 to 2009)

Tesla Data Center & Workstation GPU Solutions
• Tesla M-series GPUs (M2090 | M2070 | M2050): Servers & Blades
• Tesla C-series GPUs (C2070 | C2050): Workstations, 2 to 4 Tesla GPUs
• Integrated CPU-GPU

                                   M2090       M2070      M2050       C2070       C2050
Cores                              512         448        448         448         448
Memory                             6 GB        6 GB       3 GB        6 GB        3 GB
Memory bandwidth (ECC off)         177.6 GB/s  150 GB/s   148.8 GB/s  148.8 GB/s  148.8 GB/s
Peak perf, single precision (Gflops)  1331     1030       1030        1030        1030
Peak perf, double precision (Gflops)  665      515        515         515         515

GPUs are Disruptive
• Speed: Days to minutes, minutes to seconds
• Throughput: Simulate more scenarios
• Insight: Real time analysis

Add GPUs - Accelerate Computing
CPU + GPU = Speedup

Minimum Port, Big Speed-up
• Application code is split in two.
• Only critical functions run on the GPU, parallelized using the CUDA programming model.
• The rest of the sequential code stays on the CPU.

Tesla GPUs Power 3 of Top 5 Supercomputers
• #2: Tianhe-1A, 7168 Tesla GPUs, 2.5 PFLOPS
• #4: Nebulae, 4650 Tesla GPUs, 1.2 PFLOPS
• #5: Tsubame 2.0, 4224 Tesla GPUs, 1.194 PFLOPS

"We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU, this is a new innovation."
Premier Wen Jiabao, public comments acknowledging Tianhe-1A

More Performance, Less Power
[Chart: Performance per Megawatt of Power, Fastest Top500 Systems. Y axis: Petaflop/s. X axis: Megawatts of power. GPU systems (#2 Tianhe-1A, #4 Nebulae, #5 Tsubame 2.0) plotted against traditional systems.]

Strategic Focus on Applications
• Senior-level relationship and market managers
• Dedicated technical resources
• More than 150 people devoted to libraries, tools, application porting and market development
• Worldwide focus

Reaching a Broad Range of Markets
• Scientific computing
• Creative pro
• Education / research

Why Computational Chemistry?

Usage of TeraGrid National Supercomputing Grid
2008 TeraGrid Usage By Discipline
• Molecular Biosciences: 29%
• Physics: 21%
• Chemistry: 13%
• Astronomical Sciences: 12%
• Materials Research: 6%
• Advanced Scientific Computing: 5%
• Earth Sciences: 5%
• Chemical, Thermal Systems: 4%
• Atmospheric Sciences: 3%
• All 15 Others: 2%
Callout: Half of the cycles

Molecular Dynamics/Mechanics

Leading MD Applications
Note for the application tables in this deck: GPU perf is compared against a multi-core x86 CPU socket. GPU perf is benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.

AMBER
  Features supported: PMEMD: Explicit & Implicit Solvent
  GPU perf: 8X
  Release status: V11 Released
  Notes: Single and multi-GPUs. Expect 2x more performance in V11 patch release (shortly)
GROMACS
  Features supported: Implicit (5x), Explicit (2x) Solvent
  GPU perf: 2x-5x
  Release status: Single GPU released, Version 4.5.4
  Notes: Next release: 2H2011. Better Explicit, MPI
LAMMPS
  Features supported: Lennard-Jones, Gay-Berne
  GPU perf: 3.4-24x
  Release status: Released
  Notes: Single and multi-GPU
NAMD
  Features supported: Non-bond force calculation
  GPU perf: 2x-7x
  Release status: Released, v2.8
  Notes: Single and multi-GPU

Additional MD/MM Applications Ramping

Abalone
  Features supported: TBD, "Simulations"
  GPU perf: 4-29X (on 1060 GPU)
  Release status: Released
  Notes: Single GPU. Agile Molecule, Inc.
ACEMD
  Features supported: Written for use on GPUs
  GPU perf: "µ-sec long trajectories on workstation"
  Release status: Released
  Notes: Production bio-molecular dynamics (MD) software specially optimized to run on single and multi-GPUs
DL_POLY
  Features supported: Two-body Forces, Link-cell Pairs, Ewald SPME forces, Shake VV
  GPU perf: 4x
  Release status: V 4.0 Source only, Results Published
  Notes: Next release: 2H2011. Multi-GPU, multi-node supported
HOOMD-Blue
  Features supported: Written for use on GPUs
  GPU perf: 2X (32 CPU cores vs. 2 10XX GPUs)
  Release status: Released, Version 0.9.2
  Notes: Single and multi-GPU

Viz and "Docking" Applications

Core Hopping
  Features supported: GPU accelerated application
  GPU perf: Up to 5000X
  Release status: Released, Suite 2011
  Notes: Single and multi-GPUs. Schrodinger, Inc.
FastROCS
  Features supported: Real-time shape similarity searching/comparison
  GPU perf: 800-3000X
  Release status: Released
  Notes: Single and multi-GPUs. OpenEye Scientific Software
VMD
  Features supported: High quality rendering, large structures (100 million atoms), GPU acceleration for computationally demanding analysis and visualization tasks, multiple GPU support for very fast display of molecular orbitals arising in quantum chemistry calculations
  GPU perf: 100-125X or greater
  Release status: Released, Version 1.9
  Notes: Visualization from University of Illinois at Urbana-Champaign

Quantum Chemistry

GAMESS-US
  Features supported: Libqc with Rys Quadrature Algorithm, integral evaluation, closed shell Fock matrix construction
  GPU perf: 2.5X
  Release status: Released
  Notes: Single GPU supported in 10/1/10 release. Multi-GPU supported in later 2011 release
NWChem
  Features supported: Triples part of Reg-CCSD(T), CCSD & EOMCCSD task schedulers
  GPU perf: 3-8X projected
  Release status: Date TBA, in development
  Notes: Development GPGPU benchmarks: www.nwchem-sw.org
Q-CHEM
  Features supported: Various features including RI-MP2
  GPU perf: 8-14x projected
  Release status: Date TBA, in development
  Notes: Significant porting already
TeraChem
  Features supported: "Full GPU-based solution"
  GPU perf: 44-650X vs. GAMESS CPU version
  Release status: Version 1.45 released
  Notes: Single and Multi-GPU. Completely redesigned to exploit massive GPU parallelism

Material Science

Abinit
  Features supported: BigDFT, 50% of the program (short convolutions)
  GPU perf: 6-30X
  Release status: Released June 2009
  Notes: http://inac.cea.fr/L_Sim/BigDFT/news.html
Quantum-Espresso/PWscf
  Features supported: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
  GPU perf: TBD
  Release status: Released May 5, 2011
  Notes: Created by Irish Centre for High-End Computing

Make it easy to find: Tesla Bio WorkBench
• Applications
  Molecular Dynamics & Quantum Chemistry: TeraChem, LAMMPS
  Bio-Informatics: Hex (Docking), CUDA-BLASTP, CUDASW++, MUMmerGPU, CUDA-EC
• Community: Download, Documentation | Technical papers | Discussion Forums | Benchmarks & Configurations
• Platforms: Tesla Personal Supercomputer | Tesla GPU Clusters

A Couple of Examples

AMBER and NVIDIA Collaboration
• AMBER 11: Collaboration supported both explicit solvent Particle Mesh Ewald Molecular Dynamics simulations (NVE, NVT, NPT) and implicit solvent Generalized Born simulations. CUDA support in AMBER 11. Released April 2010.
• AMBER 11 plus patches: Collaboration on improvements that double (2X) performance. Released August 2011.

See the Performance Difference!
Trp-Cage (304 atom) Protein Folding Simulation via GPU Accelerated AMBER 11
• 4 core CPU: 80 ns/day
• 4 core CPU + Tesla M2070 GPU: 367 ns/day
Up to 4X Performance Increase
Data courtesy of AMBER.org

Outstanding AMBER Results: Just Add GPUs
[Chart: JAC NVE, ns/day vs. number of nodes (1, 2, 4, 6). Higher is better. Series: JAC NVE - (2x CPU)/Node and JAC NVE - (2x CPU + 2x GPUs)/Node.]
JAC DHFR
• Dihydrofolate reductase (DHFR)
• 23,558 atoms
• Joint AMBER CHARMM (JAC) benchmark
• Water modeled as an explicit TIP3P solvent
Base node configuration: Dual Xeon X5670s and Dual Tesla M2070 GPUs on each node
And Get Up to 3.5X Performance Increase

Adding GPUs Improves Performance on Small and Large Molecules
[Chart: Relative AMBER Performance for Small & Large Molecules. Benchmarks: JAC DHFR (23,558 atoms) and Cellulose NPT (408,609 atoms). Series: With GPU, (2x E5670 CPU + 2x Tesla C2070) per node; No GPU, 2x E5670 CPUs per node. Callouts: 300% higher, 190% higher.]
Base node configuration: Dual Xeon X5670s and Dual Tesla M2070 GPUs per node

Make Research More Productive with GPUs
[Chart: Relative Cost and Performance Benefit Scale. Series: With GPU, AMBER 11 on 2X E5670 CPUs + 2X Tesla C2070s (per node); No GPU, AMBER 11 on 2X E5670 CPUs (per node).]
• Cost of 1 Node: 54% Higher
• Performance Speed-up: 300% Higher
Base node configuration: Dual Xeon X5670s and Dual Tesla M2070 GPUs per node
Adding Two 2070 GPUs to a Node Yields a 3x Performance Increase

LAMMPS Gay-Berne Benchmark GPGPU Comparison
[Chart: High is Best. Each process uses 1 Core & 1 GPU.]
M2050 Node
• Westmere, SLES 10
• Processor: Xeon 5650 (2.66 GHz)
• GPU: Tesla M2050
• Cache: 12MB/processor, shared
• Node: 2-processor 12-core 3-GPU SL390s
M2090 Node
• Westmere, SLES 10
• Processor: Xeon 5650 (2.66 GHz)
• GPU: Tesla M2090
• Cache: 12MB/processor, shared
• Node: 2-processor 12-core 8-GPU SL390s
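
Worked example: kernel speed-up vs. whole-application speed-up

The "Minimum Port, Big Speed-up" slide moves only the critical functions to the GPU, and the application tables warn that a quoted GPU perf figure may be a kernel-to-kernel comparison. One common way to relate the two is Amdahl's law. The sketch below is illustrative only: the deck does not state this model, the function name amdahl_speedup is hypothetical, and the fractions and kernel speed-ups are made-up inputs, not measurements from any application listed above.

```python
def amdahl_speedup(parallel_fraction, kernel_speedup):
    """Whole-application speed-up under Amdahl's law.

    parallel_fraction: share of the original runtime spent in the code
        that is moved to the GPU (0.0 to 1.0). Hypothetical input.
    kernel_speedup: how much faster that ported code runs on the GPU.
        Hypothetical input.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    # Time left on the CPU plus the shortened time of the ported part,
    # both expressed as fractions of the original runtime.
    remaining = (1.0 - parallel_fraction) + parallel_fraction / kernel_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Illustrative inputs only: a 10x kernel at different runtime shares.
    for fraction in (0.5, 0.9, 0.99):
        overall = amdahl_speedup(fraction, 10.0)
        print(f"{fraction:.0%} of runtime ported, 10x kernel: {overall:.2f}x overall")
```

Under this model, a large kernel speed-up becomes a large application speed-up only when the ported functions account for most of the runtime, which is one way to read the advice to port the critical functions first.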
