Accélération De Calculs De Simulations Des Matériaux Sur GPU
Total Page:16
File Type:pdf, Size:1020Kb
Accélération de calculs de simulations des matériaux sur GPU Francois Courteille Senior Solution Architect, Accelerated Computing NVIDIA TEGRA K1 Mar 2, 2014 NVIDIA Confidential1 Tesla, the NVIDIA compute platform Quantum Chemistry Applications Overview AGENDA Molecular Dynamics Applications Overview Material Science Applications at the start of the Pascal Era 2 GAMING AUTO ENTERPRISE HPC & CLOUD TECHNOLOGY THE WORLD LEADER IN VISUAL COMPUTING 3 OUR TEN YEARS IN HPC GPU-Trained AI Machine Beats World Champion in Go World’s First Atomic Model of HIV Capsid Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs Google Outperform Stanford Builds AI Humans in ImageNet Machine using GPUs Fermi: World’s AlexNet beats expert code First HPC GPU by huge margin using GPUs World’s First GPU Top500 System Discovered How H1N1 World’s First 3-D Mapping Mutates to Resist Drugs of Human Genome CUDA Launched 2006 2008 2010 2012 2014 20164 CREDENTIALS BUILT OVER TIME Academia Games Finance TORCH CAFFE # of Applications Manufacturing Internet Oil & Gas 450 410 National Labs Automotive Defense 400 370 M & E THEANO MATCONVNET 350 300 287 MOCHA.JL PURINE 242 250 206 MINERVA MXNET* 200 300K 150 113 BIG SUR TENSORFLOW 100 50 WATSON CNTK 0 2011 2012 2013 2014 2015 2016 Majority of HPC Applications are 300K CUDA Developers, 100% of Deep Learning GPU-Accelerated, 410 and Growing 4x Growth in 4 years Frameworks are Accelerated www.nvidia.com/appscatalog 5 END-TO-END TESLA PRODUCT FAMILY FULLY INTEGRATED DL HYPERSCALE HPC MIXED-APPS HPC STRONG-SCALING HPC SUPERCOMPUTER Tesla M4, M40 Tesla K80 Tesla P100 DGX-1 Hyperscale deployment for DL HPC data centers running mix Hyperscale & HPC data For customers who need to training, inference, video & of CPU and GPU workloads centers running apps that get going now with fully image processing scale to multiple GPUs integrated solution 6 TESLA FOR SIMULATION LIBRARIES DIRECTIVES LANGUAGES ACCELERATED COMPUTING TOOLKIT TESLA ACCELERATED COMPUTING 8 OVERVIEW OF LIFE & MATERIAL ACCELERATED APPS MD: All key codes are GPU-accelerated QC: All key codes are ported or optimizing Great multi-GPU performance Focus on using GPU-accelerated math libraries, OpenACC directives Focus on dense (up to 16) GPU nodes &/or large # of GPU nodes GPU-accelerated and available today: ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResso, ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS- Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, OpenMM, PolyFTS, SOP-GPU* & more Quantum Espresso/PWscf, QUICK, TeraChem* Active GPU acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more 10 green* = application where >90% of the workload is on GPU MD VS. QC ON GPUS “Classical” Molecular Dynamics Quantum Chemistry (MO, PW, DFT, Semi-Emp) Simulates positions of atoms over time; Calculates electronic properties; chemical-biological or ground state, excited states, spectral properties, chemical-material behaviors making/breaking bonds, physical properties Forces calculated from simple empirical formulas Forces derived from electron wave function (bond rearrangement generally forbidden) (bond rearrangement OK, e.g., bond energies) Up to millions of atoms Up to a few thousand atoms Solvent included without difficulty Generally in a vacuum but if needed, solvent treated classically (QM/MM) or using implicit methods Single precision dominated Double precision is important Uses cuBLAS, cuFFT, CUDA Uses cuBLAS, cuFFT, OpenACC Geforce (Accademics), Tesla (Servers) Tesla recommended ECC off ECC on 11 GPU-Accelerated Quantum Chemistry Apps Green Lettering Indicates Performance Slides Included Abinit Gaussian Octopus Quantum SuperCharger ACES III GPAW ONETEP Library ADF LATTE Petot TeraChem BigDFT LSDalton Q-Chem VASP CP2K MOLCAS QMCPACK WL-LSMS GAMESS-US Mopac2012 Quantum Espresso NWChem GPU Perf compared against dual multi-core x86 CPU socket. 12 ABINIT 6th International ABINIT Developer Workshop FROM RESEARCH TO INDUSTRY April 15-18, 2013 - Dinard, France USING ABINIT ON GRAPHICS PROCESSING UNITS (GPU) Marc Torrent CEA, DAM, DIF, F-91297 Arpajon, France Florent Dahm C-S – 22 av. Galilée. 92350 Le Plessis-Robinson With a contribution from Yann Pouillon for the build system www.cea.fr ABINIT ON GPU OUTLINE Introduction Introduction to GPU computing Architecture, programming model Density-Functional Theory with plane waves How to port it on GPU ABINIT on GPU Implementation Performances of ABINIT v7.0 How To… Compile Use ABINIT on GPU| April 15, 2013 | PAGE 13 LOCAL OPERATOR: FAST FOURIER TRANSFORMS ! Our choice (for a first version): g r g r g v dr e v r c e eff eff g ' Replace the MPI plane wave level g ' ~ = i ~ () F FT by a FFT computation on GPU i ( ) ! Included in Cuda package : cuFFT library Interfaced with c and Fortran ! FTT A system dependent efficiency is expected ! Small FFT underuse the GPU and are dominated by ! the time required to transfer the data to/from the GPU Within PAW, small plane wave basis are used ABINIT on GPU| April 15, 2013 | PAGE 14 LOCAL OPERATOR: FAST FOURIER TRANSFORMS ! Implementation ! Interface Cuda / Fortran ! New routines that encapsulate cufft calls ! ! Development of 3 small cuda kernels Optimization of memory copy to overlap fourwf – option 2 computation and data transfer Performance highly Speedup dependent on FFT size TEST CASE 107 gold atoms cell FFT size TGCC-Curie (Intel Westmere + Nvidia M2090) ABINIT on GPU| April 15, 2013 | PAGE 15 LOCAL OPERATOR: FAST FOURIER TRANSFORMS Improvement : multiple band optimization ! The Cuda kernels are more intensively used Requires less transfers ! See bandpp variable When Locally Optimal Block Preconditioned Conjugate Gradient algorithm is used 107 gold atoms 31 copper atoms bandpp fourwf on GPU bandpp fourwf on GPU 1 112,94 1 36,02 2 84,55 2 20,79 4 70,55 4 14,14 TEST CASE 8 61,91 10 9,16 TGCC-Curie 18 54,79 20 9,50 (Intel Westmere + Nvidia M2090) 72 50,55 100 7,94 ABINIT on GPU| April 15, 2013 | PAGE 16 NON LOCAL OPERATOR ! Implementation ~ ~ ! Development of 2 Cuda Kernels for p ψ = ∑ p g ' g ' ψ i g ' i ~ ! ψ ~ ψ Development of 2 Cuda Kernels for g VNL = ∑i , j g pi Dij p j ! d Using collaborative memory zones as much as possible (threads linke • Non-local operator Used• for Overlap: operator to the same plane wave lie in the same collaborative zone) •• ForcesNon-l and stress tensor Performances (NL operator only) • PAW occupation matrix nonlop CPU nonlop GPU Speedup time (s) time (s) GPU vs CPU In addition to all MPI 107 gold atoms 3142 1122 2.8 parallelization levels ! 31 copper atom 85 42 2.0 • Need a large number of PW s 3 UO2 atoms 63 36 1.8 • Need a large number of NL projectors TGCC-Curie (Intel Westmere + Nvidia M2090) ABINIT on GPU| April 15, 2013 | PAGE 17 Algorithm 1 lobpcg Locally Optimal Block Require: 0 = { 0 , . , 0 } close to the minimum and K a pre- 1 m Preconditioned Conjugate conditioner; P = {P(0) , . , P0 } is initialized to 0. 1 m Gradient 1: for i=0,1, , .do. 2: ⌥(i) = ⌥( (i) ) bandpp Total tim e Tim e Line ar Alge bra 3: R(i) = H (i) ⌥(i) O (i) 1 674 119 2 612 116 4: W(i) = KR(i) 10 1077 428 5: The Rayleigh-Ritz method is applied within the subspace ⌅ = n o 50 5816 5167 (i) (i) (i) (i) (i) (i) 1 ,P. .. , Pm , 1 , . .. , m , W1 , . , Wm 6: (i+1) = (i) (i) + ⇤(i) W(i) + (i) P(i) 7: P(i+1) = ⇤(i) W(i) + (i) P(i) 8: end for Matrix-matrix and matrix-vector products Diagonalization/othogonalization within blocks ! These products are not parallelized ! Size of matrix increases when b andparallelization is used ! Size of vectors/matrix increases with the ___ .z number of band/plane-wave CPU cores. ! Need for- a free LAPACK packag e on GPU ! Our choice : use of cuBlas Our choice : use of MAGMA ABINIT on GPU| April 15, 2013 | PAGE 18 Matrix Algebra on GPU and Home Multicore Architectures Overview Matrix Algebra on GPU and Multicore Latest MAGMA News Architectures 2013-03- 12 News Downloads The MAGMA project aims 1o oeveop a dense linear agebra library MAGMA MIC 1.0 Beta for Intel Xeon similar 1o LAPACK bu1 for heterogeneouslhybrd architectures, Phi Coprocessors Re~ased PublPeopcaletion s 2012-11- 14 starting with current 'Multcore-tGPU' systems. Dcx:umentatb MAGMA 1.3 Released Pan rtners The MAGMA research is based on 1he dlea 1hat, 1o address the complex challenges of 1heemerging hybrd environments, optimal software solutbns will themselves have to hybrdize, combining the 2012-11-13 User Forum strengths of differen1 agornhms wnhin a single framework. Buitling MAGMA MIC 0.3 for Intel Xeon Phi ! A project by Innovating Computing Laboratory, eonna bthleis a pdleap'ca, ton wes a im1o ftoully oes eqnxp oit li n eatherage powebraar g1hornhmat eacs h aondf th e h ybr d Coprocessors Released cfraomponmeworksents fooffersr hybr. d manycore and GPU systems that can University of Tenessee 2012-10-24 clMAGMA 1.0 Released 2012-06-29 ! “Free of use” project MAGMA 1.2.1 Released ! First stable release 25-08-2011; ICL{>l3r Current version 1.3 lver27 2013 Sponsored By: Industry Support From: AMD:I @ Mlaoeolt ,.............. ! Computation is cut in order to offload License consuming parts on GPU Copyrqht 2013 Th.. Uni ,:.r.;it of T..-,nn.. - e , All rt}h re .. ! Goal : address hybrid environments ing d. in conditbns are met: ABINIT on GPU| April 15, 2013 | PAGE 19 ITERATIVE EIGENSOLVER Speedup of cuBLAS/MAGMA code Performances