Automated Code Generation and Optimization for GPU Kernels Alexey Titov, Ivan Ufimtsev, Nathan Luehr and Todd Martinez

Department of Chemistry, Stanford University
GTC, May 2012

GPU computing ecosystem evolution
[Figure: timeline from 2007 to 2012 covering G80, GT200, Fermi, Kepler, Cayman, Tahiti and MIC]

TeraChem: selected feature list
• Restricted, unrestricted, and restricted open shell Hartree-Fock and grid-based Kohn-Sham energy and gradient calculations
• Full support of s, p and d-type basis functions
• Various DFT functionals, including range-corrected and Coulomb attenuated functionals (BLYP, B3LYP, PBE, PBE0, ωPBE, ωPBEh, ωB97, ωB97x, camB3LYP, etc) and DFT grids (800 - 80,000 grid points per atom)
• Static and dynamical DFT grids
• Empirical dispersion correction (DFT-D3 and DFT-D2)
• Geometry optimization (L-BFGS, Conjugate gradient, Steepest descent) and transition state search
• Reaction path and transition state search (through DL-FIND, Kastner)
• Ab initio molecular dynamics (NVE, NVT ensembles)
• Time reversible Born-Oppenheimer dynamics
• Spherical boundary conditions
• Support of multiple-GPU systems
• Single/Dynamical/Double precision accuracy
• QM/MM treatment of surrounding water molecules using TIP3P force field
• QM/MM with TeraChem/Amber – (w/ Ross Walker, UCSD/SDSC)
• Natural bond orbital analysis through integration with NBO6
• Polarizabilities for HF and closed-shell DFT methods

[Keyword cloud: TeraChem, GPU-accelerated quantum chemistry software, and related search terms; http://petachem.com/]

Quantum chemistry with TeraChem
3 journal covers, 8 peer-reviewed papers, 4000+ downloads of free beta

Quantum chemistry with TeraChem
Riding Advances in GPU Hardware:
[Figure: molecule and molecule size, 2009 and 2011]
• Is it possible to easily retune codes for new and older archs for better performance?
• How to simplify transitions between architectures (e.g. Fermi -> Kepler)?
• How to implement complex kernels performing efficiently for GPUs?
• What about other hardware architectures (Cayman, Tahiti, MIC, etc)?

Increase computational capabilities
+ 26 more elements

Managing d-functions
• Increased number of kernels to calculate electron repulsion integrals over Gaussian-type orbitals χ(r):

  $$(\mu\nu|\lambda\sigma) = \iint \chi_\mu(\mathbf{r}_1)\,\chi_\nu(\mathbf{r}_1)\,\frac{1}{|\mathbf{r}_1-\mathbf{r}_2|}\,\chi_\lambda(\mathbf{r}_2)\,\chi_\sigma(\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2$$

  J: 9 -> 36
  K: 10 -> 45
• Increased depth of calculation, J kernel for:
  ssss integrals batch: 63 loc (30 flops)
  pppp integrals batch: 306 loc (387 flops)
  dddd integrals batch: 2094 loc (3584 flops)

Our ideal case: automate kernel generation and optimization

Opening ‘combination’ lock for multiple targets

Batch of integrals going through the generation pipeline
Pipeline stages: Maple -> sed -> C++/CUDA
Intermediate representation:

    voidJSSSPclne = voidJSSSPclne;
    dbllData0 = 0.0e0;
    loopentry = loopentry;
    Rs = Rlev(R0000, R0001);
    gamm = Gamma1(R0000, R0001, T);
    R0001multi = -0.20e1 * alp1;
    fltR[-1][-1][0][-1] = c * R[-1][-1][-1][0];
    fltR[-1][0][-1][-1] = b * R[-1][-1][-1][0];
    fltR[0][-1][-1][-1] = a * R[-1][-1][-1][0];
    tmp = Temp(tmp0);
    blockindex = 0;
    collect0 = R[-1][-1][-1][-1] * P0;
    blockindex = 1;
    collect0minus = R[0][-1][-1][-1] * P1;
    blockindex = 2;
    collect0minus = R[-1][0][-1][-1] * P2;
    blockindex = 3;
    collect0minus = R[-1][-1][0][-1] * P3;
    Lambda = Lambda;
    lData0plus = tmp0 * Lambda;
    gthidXplus = BSIZEX;
    logg = logg;
    lData0contraction1 = clps;
    clse = clse;

Autogenerated J kernel example: dddd batch
Bytes per thread: 1880 (reg + lmem)
Mops: 47
Flops: 3583
Total 2090 lines

    …
    for( [bra • ket] > ε) {
        // load data
        Gamma8(…)
        // calculate a,b,c and auxiliary functions R000j
        float R0010 = c * R0001;                            // this block: 486 lines
        …
        float R3000 = a * R2001 + 2.0f * R1001;
        …
        float R0080 = c * R0071 + 7.0f * R0061;
        float P0 = fetch_data(preproP, g_thidX);
        tmp0 += R0000 * P0;                                 // this block: 35 lines
        tmp1 += R1000 * P0;
        …
        tmp34 += R0040 * P0;
        float P34 = fetch_data(preproP, g_thidX + ne*34);   // 35 lines × 35 = 1300 lines
        …
        //accumulate tmps in DP
    }
    // collect integrals and upload to global memory

JDDDD performance
[Chart: JDDDD performance. Labels recovered from the slide: GPU orig. + volatile; 3× GPU orig.; ~ 13.33 GPU; i7, 8 cores, SSE; ~ 0.63 GPU; i7, 1 core]

Architecture tuning: Empirically test different pathways

Code variant #1

    float R1000 = a * R0001;
    tmp_0 += R1000 * ((-PBx * (Pxz * PAx + PAz * Pzz + PAy * Pyz) * QDz
                       - PBx * (PAy * Pyx + PAz * Pzx + Pxx * PAx) * QCx
                       - PBx * (PAy * Pyx + PAz * Pzx + Pxx * PAx) * QDx
                       - PBx * (PAz * Pzy + PAy * Pyy + Pxy * PAx) * QDy) * rtaq
                      + ((-Pxz * QDz + PAz * Pzx - Pxy * QDy - Pxx * QDx - Pxx * QCx
                          + Pxx * PBx + Pxx * PAx + PAy * Pyx) * rtaq
                         + ((PAy * Pyx + PAz * Pzx + Pxx * PBx + Pxx * PAx) * QDx
                            + (Pxy * PAx + PAz * Pzy + PAy * Pyy + Pxy * PBx) * QDy
                            + (PAy * Pyz + Pxz * PAx + PAz * Pzz + Pxz * PBx) * QDz) * QCx) * rtap);

Legend (the color and font coding of the original slides is not preserved in this text)
Red: density matrix elements
RXXXX: variables containing values of auxiliary functions
Blue: hermite expansion coefficients (ket pair)
Green: hermite expansion coefficients (bra pair)
Bold: density contracted with ket coefficients
Italic: intermediates

Code variant #2

    float t650 = Pxy * R2100 + Pxz * R2010 + Pxx * R3000;
    float t652 = Pxy * R1100 + Pxz * R1010 + Pxx * R2000;
    float t639 = t650 * rtap + t652 * PAx;
    float t658 = -rtap * R2000 - PAx * R1000;
    float t660 = Pxx * QDx + QDy * Pxy + QDz * Pxz;
    float t659 = -Pxx * R1000 - Pxy * R0100 - Pxz * R0010;
    float t656 = rtap * R1000 + PAx * R0000;
    float t641 = t656 * PBx + (R0000 - t658) * rtap;
    float t640 = -t652 * rtap + t659 * PAx;
    float t624 = t640 * PBx + (t659 - t639) * rtap;
    tmp_0 += ((t639 * PBx
               + ((Pxy * R3100 + Pxz * R3010 + Pxx * R4000) * rtap + t650 * PAx + t652) * rtap) * rtaq
              + t624 * QCx + t641 * Pxx) * rtaq
           + t660 * ((t658 * PBx + (-rtap * R3000 - PAx * R2000 - R1000) * rtap) * rtaq
                     + t641 * QCx);

Code variant #3

    float PP_0 = Pxx * QDx + Pxy * QDy + Pxz * QDz;
    float PP_1 = Pxx * rtaq;
    float PP_2 = Pxy * rtaq;
    float PP_3 = Pxz * rtaq;
    tmp_0 += PBx * ( PAx * ( QCx * (R0000*PP_0 - R1000*PP_1 - R0100*PP_2 - R0010*PP_3)
                           - rtaq * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
                           + R0000 * PP_1)
                   + rtap * ( QCx * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
                            - rtaq * (R2000*PP_0 - R3000*PP_1 - R2100*PP_2 - R2010*PP_3)
                            + R1000 * PP_1))
           + rtap * ( PAx * ( QCx * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
                            - rtaq * (R2000*PP_0 - R3000*PP_1 - R2100*PP_2 - R2010*PP_3)
                            + R1000 * PP_1)
                    + rtap * ( QCx * (R2000*PP_0 - R3000*PP_1 - R2100*PP_2 - R2010*PP_3)
                             - rtaq * (R3000*PP_0 - R4000*PP_1 - R3100*PP_2 - R3010*PP_3)
                             + R2000 * PP_1))
           + rtap * ( QCx * (R0000*PP_0 - R1000*PP_1 - R0100*PP_2 - R0010*PP_3)
                    - rtaq * (R1000*PP_0 - R2000*PP_1 - R1100*PP_2 - R1010*PP_3)
                    + R0000 * PP_1);

Empirical testing of code variants
[Figure: timings of the code variants; colors distinguish different auto-generated kernels]

Empirical testing of code variants

    Code variant   C1060 Timing (ms)   C2050 Timing (ms)   Registers(a)   FLOPs(b)
    1              1025.26              822.57             115            1049
    2              1042.57              823.99             115            1083
    3              1112.64              988.97             114            1218
    4              1117.36             1151.79             120            2124
    5              2303.17             2511.44             145            1185
    6              2523.15             2780.31             171            2012
    7              2077.94             2852.86             141            1931

Development & execution stages
[Diagram: two software stacks side by side. Layers named on the slide: Computer Algebra System; C/C++, CUDA, OpenCL; Assembly language; Hardware. Each stack is annotated with: Algebraic part, Numerical part, Data input, Data output.]
First stack: Computation flow expressed and computed in C language
Second stack: Computational flow designed algebraically and then expressed & computed in C language

Conclusions
• Performance is sensitive to architecture-specific optimizations.
• There is no direct and meaningful relationship between performance and FLOPS on GPUs.
• Automatic code generation and performance tuning will provide code portability. It enables performance portability across various architectures: from the same or different vendors.

Acknowledgements
Nathan Luehr
Ivan Ufimtsev
The Boss
Not shown: Jeff Gour, Ed Hohenstein
Funding: STTR - AFOSR
Jason Quenneville, Spectral Sciences
Vlad Kindratenko
Guochun Shi
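Sketch: generating the auxiliary-function lines. The three auxiliary-function lines visible in the autogenerated dddd J kernel (`R0010`, `R3000`, `R0080`) all fit one pattern: lower one index by one, raise the last digit by one, and add a second term weighted by the lowered index minus one. The C++ sketch below emits such lines as text, which may help illustrate what "automated kernel generation" can mean at the smallest scale. It is not the authors' generator (the slides name Maple and sed for that); the function names `r_name` and `emit_hermite_line`, the rule of lowering the first nonzero index, and the choice to always print the numeric coefficient are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Variable name in the slide's pattern R<t><u><v><j>.
// Assumption: every index is a single digit, as in the names shown on the slides.
std::string r_name(int t, int u, int v, int j) {
    assert(t >= 0 && t < 10 && u >= 0 && u < 10 &&
           v >= 0 && v < 10 && j >= 0 && j < 10);
    return "R" + std::to_string(t) + std::to_string(u) +
           std::to_string(v) + std::to_string(j);
}

// Emit one line of the form seen in the kernel listing, e.g.
//   float R3000 = a * R2001 + 2.0f * R1001;
// Hypothetical rule: lower the first nonzero index among t, u, v, using the
// factor a, b or c respectively. Returns an empty string for the base values
// R000j, which the listing obtains elsewhere (the Gamma8(...) call).
std::string emit_hermite_line(int t, int u, int v, int j) {
    const int idx[3] = {t, u, v};
    const char* factor[3] = {"a", "b", "c"};
    for (int k = 0; k < 3; ++k) {
        const int n = idx[k];
        if (n == 0) continue;

        int lower1[3] = {t, u, v};
        lower1[k] = n - 1;  // index lowered by one
        std::string line = "float " + r_name(t, u, v, j) + " = " + factor[k] +
                           " * " + r_name(lower1[0], lower1[1], lower1[2], j + 1);

        if (n >= 2) {
            int lower2[3] = {t, u, v};
            lower2[k] = n - 2;  // index lowered by two
            line += " + " + std::to_string(n - 1) + ".0f * " +
                    r_name(lower2[0], lower2[1], lower2[2], j + 1);
        }
        return line + ";";
    }
    return "";
}
```

Under these assumptions, `emit_hermite_line(3, 0, 0, 0)` and `emit_hermite_line(0, 0, 8, 0)` produce the same text as the corresponding lines in the kernel listing. A real generator would also have to order the lines so that every value is defined before it is used.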

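Sketch: choosing a variant from measured timings. The code-variant timing table suggests a simple selection step: measure every generated variant on each target and keep the fastest one, rather than predicting speed from the FLOP count. The C++ sketch below applies that step to the numbers from the table. The type and function names (`VariantTiming`, `fastest_variant`, `flop_time_inversions`) are hypothetical, and the slides do not describe how the authors' tuning harness is organized, so this is only one possible arrangement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One row of the code-variant table from the slides.
struct VariantTiming {
    int variant;
    double c1060_ms;
    double c2050_ms;
    int registers;
    int flops;
};

enum class Device { C1060, C2050 };

// Values copied from the table on the "Empirical testing of code variants" slide.
const std::vector<VariantTiming>& slide_timings() {
    static const std::vector<VariantTiming> rows = {
        {1, 1025.26,  822.57, 115, 1049},
        {2, 1042.57,  823.99, 115, 1083},
        {3, 1112.64,  988.97, 114, 1218},
        {4, 1117.36, 1151.79, 120, 2124},
        {5, 2303.17, 2511.44, 145, 1185},
        {6, 2523.15, 2780.31, 171, 2012},
        {7, 2077.94, 2852.86, 141, 1931},
    };
    return rows;
}

double time_on(const VariantTiming& row, Device d) {
    return d == Device::C1060 ? row.c1060_ms : row.c2050_ms;
}

// Empirical selection: return the variant number with the lowest measured time.
int fastest_variant(const std::vector<VariantTiming>& rows, Device d) {
    assert(!rows.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < rows.size(); ++i) {
        if (time_on(rows[i], d) < time_on(rows[best], d)) best = i;
    }
    return rows[best].variant;
}

// Count pairs of variants where the one with fewer FLOPs is nevertheless slower.
// A nonzero count means the FLOP count alone does not order the variants by speed.
int flop_time_inversions(const std::vector<VariantTiming>& rows, Device d) {
    int count = 0;
    for (const auto& x : rows) {
        for (const auto& y : rows) {
            if (x.flops < y.flops && time_on(x, d) > time_on(y, d)) ++count;
        }
    }
    return count;
}
```

With the table's numbers, variant 1 is selected for both cards, while variants 5 and 7 swap order between the C1060 and the C2050, which is the kind of architecture dependence the conclusions point to.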