
S7255: CUTT: A HIGH-PERFORMANCE TENSOR LIBRARY FOR GPUS
Antti-Pekka Hynninen, 5/10/2017, GTC 2017, San Jose CA

MOTIVATION
Tensor contractions

Tensor contractions are the most computationally intensive part of the quantum many-body methods used in NWCHEM, DIRAC, LS-DALTON, and ACES-IV:

D(a,b,i) += L(a,c,i,k) R(k,b,c)

where the sum runs over the repeated indices c and k.

Evaluating tensor contractions directly requires a lot of hard-to-write custom code. The indirect approach instead transposes the tensors and uses efficient libraries (such as cuBLAS) to perform the multiply.

TENSOR CONTRACTIONS
Indirect approach

A tensor contraction is a reduction over a pair of indices shared by two tensors, e.g.

D(a,b,i) += L(a,c,i,k) R(k,b,c)

This can be evaluated as

L(a,c,i,k) → L(a,i,k,c)            # tensor transpose
R(k,b,c)   → R(k,c,b)              # tensor transpose
D(a,i,b)  += L(a,i,k,c) R(k,c,b)   # matrix multiply
D(a,i,b)   → D(a,b,i)              # tensor transpose

This is able to take advantage of the high-performance matrix multiply routines provided by cuBLAS.
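To make the indirect (transpose-transpose-GEMM-transpose) scheme concrete, here is a minimal host-side C++ sketch of the same contraction, using plain loops in place of cuTT and cuBLAS; the function and variable names are illustrative only, not part of any library.

// Minimal host-side sketch of the indirect contraction
//   D(a,b,i) += L(a,c,i,k) R(k,b,c)
// using explicit transposes plus a plain matrix multiply in place of
// cuTT + cuBLAS. First index is fastest (column-major style), as in the slides.
#include <vector>

void contract(int na, int nb, int ni, int nc, int nk,
              const std::vector<float>& L,   // L(a,c,i,k)
              const std::vector<float>& R,   // R(k,b,c)
              std::vector<float>& D)         // D(a,b,i)
{
  // L(a,c,i,k) -> Lt(a,i,k,c)   # tensor transpose
  std::vector<float> Lt(L.size());
  for (int k = 0; k < nk; k++)
    for (int i = 0; i < ni; i++)
      for (int c = 0; c < nc; c++)
        for (int a = 0; a < na; a++)
          Lt[a + na*(i + ni*(k + nk*c))] = L[a + na*(c + nc*(i + ni*k))];

  // R(k,b,c) -> Rt(k,c,b)       # tensor transpose
  std::vector<float> Rt(R.size());
  for (int c = 0; c < nc; c++)
    for (int b = 0; b < nb; b++)
      for (int k = 0; k < nk; k++)
        Rt[k + nk*(c + nc*b)] = R[k + nk*(b + nb*c)];

  // Dt(a,i,b) += Lt(a,i,k,c) Rt(k,c,b)   # matrix multiply:
  // (na*ni) x (nk*nc) times (nk*nc) x nb, both column-major
  std::vector<float> Dt((size_t)na*ni*nb, 0.0f);
  int M = na*ni, K = nk*nc, N = nb;
  for (int n = 0; n < N; n++)
    for (int kk = 0; kk < K; kk++)
      for (int m = 0; m < M; m++)
        Dt[m + M*n] += Lt[m + M*kk] * Rt[kk + K*n];

  // Dt(a,i,b) -> D(a,b,i)       # tensor transpose, accumulated into D
  for (int i = 0; i < ni; i++)
    for (int b = 0; b < nb; b++)
      for (int a = 0; a < na; a++)
        D[a + na*(b + nb*i)] += Dt[a + na*(i + ni*b)];
}

On the GPU, the two input transposes and the output transpose are exactly the operations cuTT accelerates, and the matrix multiply maps onto a single cuBLAS GEMM call.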

PREVIOUS WORK

No runtime high-performance tensor transpose library exists for GPUs

Previous implementation by my co-author [1] was sub-optimal on GPU platforms

Work in [2] relies on a compiler to build custom kernels, i.e. it is not a runtime approach

[1] Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84-91. DOI: http://dx.doi.org/10.1016/j.cpc.2014.12.013

[2] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249

TENSOR TRANSPOSE


MATRIX TRANSPOSE: TILED ALGORITHM

Step 1: Read a 32x32 tile from global memory into shared memory

__syncthreads()

Step 2: Read the shared memory in transposed order and write to global memory

Mark Harris, "An Efficient Matrix Transpose in CUDA C/C++", Parallel Forall Blog: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc
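A minimal sketch of this kernel, along the lines of Harris's blog post (the matrix is assumed square with a side that is a multiple of 32; the +1 padding avoids shared-memory bank conflicts):

#define TILE_DIM   32
#define BLOCK_ROWS 8

// Transpose of a width x width row-major matrix, one 32x32 tile per thread block.
// Launch with dim3 grid(width/TILE_DIM, width/TILE_DIM), dim3 block(TILE_DIM, BLOCK_ROWS).
__global__ void transposeTiled(float* odata, const float* idata, int width) {
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;

  // Step 1: read a 32x32 tile from global memory into shared memory
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

  __syncthreads();

  // Step 2: read shared memory in transposed order and write to global memory
  x = blockIdx.y * TILE_DIM + threadIdx.x;   // swapped block indices
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}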

TILED ALGORITHM

[Diagram: example permutation (1 2 3 4 5 6) → (5 4 1 6 3 2); dimensions d1 and d5 form the shared memory tile, the remaining dimensions are looped over by the thread block (TB).]

Constant shared memory usage (~32x32)
Performs well when d1 and d5 are fairly large (~32)
Poor performance for small dimensions (2-8)
Would it be possible to pack multiple small dimensions into shared memory?

PACKED ALGORITHM

[Diagram: example permutation (1 2 3 4 5 6) → (5 4 1 6 3 2); dimensions d1, d2, d5 form the shared memory volume, dimensions d3, d4, d6 the TB loop volume.]

No longer uses a 32x32 shared memory tile
Loads entire dimensions into shared memory (not tiled)
As much shared memory is allocated as it takes to store the elements
Must choose which dimensions to pack

New problem: what if e.g. d5 is very large?

PACKED-SPLIT ALGORITHM

[Diagram: example permutation (1 2 3 4 5 6) → (5 4 1 6 3 2); shared memory volume (d1, d2, d5) with the largest dimension split, TB loop volume (d3, d4, d6).]

Splits the largest dimension
Number of splits is determined by the shared memory size
Must choose which dimensions to pack, and the number of splits

MEMORY POSITION CALCULATION

GLOBAL MEMORY POSITION CALCULATION

[Diagram: input tensor (1 2 3 4 5 6) read (glRead) into the shared memory volume (1 2 5, shRead), then written (glWrite) to the output tensor (5 4 1 6 3 2).]

H = number of elements in the shared memory volume, positions s = 0, ..., H-1
M = number of elements in the loop volume, positions p = 0, ..., M-1

Need to convert the positions s and p into global memory positions:
glRead  = global memory read position
glWrite = global memory write position

The global memory position is split into a minor part (within the shared memory volume) and a major part (within the loop volume):
glRead  = glMinorRead(s)  + glMajorRead(p)
glWrite = glMinorWrite(s) + glMajorWrite(p)

MAJOR POSITION CALCULATION

Loop volume dimensions of the input tensor: (d3, d4, d6), p = 0, ..., M-1

glMajorRead(p) = \sum_{i=1}^{n} \mathrm{mod}(p / c_i, d_i) \cdot t_i

// int p = 0,...,M-1
// int c[n] = {1, d3, d3*d4}
// int d[n] = {d3, d4, d6}
// int t[n] = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}
int glMajorRead = 0;
for (int i = 0; i < n; i++) {
  glMajorRead += ((p / c[i]) % d[i]) * t[i];
}

O(n) operations
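For reference, the c, d, t lookup arrays can be assembled on the host from the input dimensions and the list of dimensions assigned to the loop volume. The sketch below is hypothetical helper code (not cuTT's internal implementation); it reproduces the values shown above when the loop volume holds d3, d4, d6:

#include <vector>

// Build the c, d, t arrays used by glMajorRead for the *input* tensor.
// dim[]     : full input tensor dimensions, first index fastest
// loopDims[]: indices (into dim) of the loop-volume dimensions, in input order
void buildMajorReadArrays(const std::vector<int>& dim,
                          const std::vector<int>& loopDims,
                          std::vector<int>& c, std::vector<int>& d,
                          std::vector<int>& t) {
  int n = (int)loopDims.size();
  c.resize(n); d.resize(n); t.resize(n);

  // Input-tensor strides: stride[j] = dim[0]*...*dim[j-1]
  std::vector<int> stride(dim.size(), 1);
  for (size_t j = 1; j < dim.size(); j++) stride[j] = stride[j - 1] * dim[j - 1];

  int cum = 1;
  for (int i = 0; i < n; i++) {
    c[i] = cum;                  // cumulative product of preceding loop dimensions
    d[i] = dim[loopDims[i]];     // size of this loop dimension
    t[i] = stride[loopDims[i]];  // stride of this dimension in the input tensor
    cum *= d[i];
  }
}

// Example: dim = {d1,d2,d3,d4,d5,d6}, loopDims = {2,3,5} (i.e. d3,d4,d6)
// gives c = {1, d3, d3*d4}, d = {d3, d4, d6},
//       t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}, matching the slide.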

Observation: p is constant within a thread block (and therefore within a warp)

WARP-PARALLEL POSITION CALCULATION

Each warp lane holds one entry of c, d, and t (entries beyond n are padded with c = d = 1, so those lanes contribute zero), and the per-lane terms of

glMajorRead(p) = \sum_{i=1}^{n} \mathrm{mod}(p / c_i, d_i) \cdot t_i

are summed across the warp with shuffles:

// int p = 0,...,M-1
// int c = {1, d3, d3*d4, 1, ..., 1}
// int d = {d3, d4, d6,   1, ..., 1}
// int t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5, ...}
int glMajorRead = ((p / c) % d) * t;
for (int i = 16; i >= 1; i /= 2) {
  glMajorRead += __shfl_xor(glMajorRead, i);
}

A single divide, modulo, and multiply per lane: O(1), i.e. performance independent of the tensor rank

Works up to n = 32
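A self-contained version of this idea might look as follows; this is a sketch rather than cuTT's actual kernel code, and it uses the CUDA 9 __shfl_xor_sync in place of the older __shfl_xor shown above:

// Warp-parallel major-position calculation (sketch).
// Lane i of the warp holds c[i], d[i], t[i]; unused lanes hold c = d = 1
// so their terms are zero. Every lane returns the full sum.
__device__ int glMajorPos(int p, int c, int d, int t) {
  // One divide, modulo, and multiply per lane
  int term = ((p / c) % d) * t;
  // Butterfly reduction over the 32 lanes of the warp
  for (int i = 16; i >= 1; i /= 2) {
    term += __shfl_xor_sync(0xffffffff, term, i);
  }
  return term;
}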

POSITION CALCULATION

For the Tiled algorithm the minor positions (glMinorRead(s), shRead(s), glMinorWrite(s), s = 0, ..., H-1) are trivial to compute

For Packed and Packed-Split, pre-compute the positions and store them in registers

Number of registers per thread:
numReg = (H - 1)/blockDim.x + 1

int glMinorRead[numReg]
int shRead[numReg]
int glMinorWrite[numReg]

The kernels are templated on numReg
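As a rough illustration of this pattern, the sketch below (hypothetical names, not cuTT's kernel code) precomputes each thread's glMinorRead values with the same divide/modulo scheme used for the major positions; shRead and glMinorWrite would be filled analogously using the shared-memory and output-tensor strides:

// Sketch: per-thread pre-computation of the minor read positions.
// Each thread owns the shared-memory elements s = threadIdx.x + j*blockDim.x,
// j = 0..numReg-1.
template <int numReg>
__device__ void precomputeGlMinorRead(int H, int nMinor,
                                      const int* c,  // cumulative products of packed dims
                                      const int* d,  // packed dimension sizes
                                      const int* t,  // input-tensor strides of packed dims
                                      int glMinorRead[numReg]) {
  #pragma unroll
  for (int j = 0; j < numReg; j++) {
    int s = threadIdx.x + j * blockDim.x;
    glMinorRead[j] = 0;
    if (s < H) {
      for (int i = 0; i < nMinor; i++) {
        glMinorRead[j] += ((s / c[i]) % d[i]) * t[i];  // same scheme as the major positions
      }
    }
  }
}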

ALGORITHM & PARAMETER CHOICE

CHOOSING THE BEST ALGORITHM

Algorithm choice: Tiled, Packed, or Packed-Split

Tiled: no free parameters


Packed: input and output ranks


Packed-Split: input and output ranks, number of splits


Large performance differences between different algorithm and parameter choices

CUTT PLANS

cuttResult cuttPlanMeasure(cuttHandle* handle, int rank, int* dim, int* permutation,
                           size_t sizeofType, cudaStream_t stream, void* idata, void* odata);

cuttResult cuttPlan(cuttHandle* handle, int rank, int* dim, int* permutation, size_t sizeofType, cudaStream_t stream);

Measure plans (cuttPlanMeasure) perform all possible tensor transposes and choose the best performing plan: LARGE overhead.

Heuristic plans (cuttPlan) choose the best plan by estimating the transpose runtime with an analytical GPU performance model: SMALL overhead.

Heuristic plans must be used in QM calculations. Getting the heuristic planning to work accurately was a major hurdle; a better approach for choosing the heuristic plans may be needed (?)
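A typical call sequence looks like the sketch below. It is based on the public API in the cuTT repository (cuttExecute and cuttDestroy are not shown on this slide), and the permutation convention in the comment is an assumption taken from the repository documentation:

#include <cutt.h>

// Plan, execute, and destroy a single-precision tensor transpose.
// Assumed convention: output dimension i is input dimension permutation[i].
void transposeExample(float* idata, float* odata, cudaStream_t stream) {
  int dim[4]         = {32, 48, 64, 24};  // input tensor extents, first index fastest
  int permutation[4] = {2, 0, 3, 1};      // e.g. (d1,d2,d3,d4) -> (d3,d1,d4,d2)

  cuttHandle plan;
  // Heuristic plan (small overhead); cuttPlanMeasure could be used instead
  // when its large planning overhead is acceptable.
  cuttPlan(&plan, 4, dim, permutation, sizeof(float), stream);

  // idata and odata are separate device buffers of dim[0]*...*dim[3] elements
  // (out-of-place transpose).
  cuttExecute(plan, idata, odata);

  cuttDestroy(plan);
}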

BENCHMARKS

BENCHMARK 1

Tensor ranks 2 to 7
Ratio between largest and smallest tensor dimensions 1:1, 5:1, and 15:1
Tensor volume normally distributed with an average of 200M elements and a standard deviation of 20M elements
500 random permutations for each tensor rank and ratio
9000 tensor transposes in total

TESLA K20X

[Plot: Benchmark 1 bandwidth on Tesla K20X, compared against the maximum achievable memory bandwidth*]

* Maximum bandwidth measured using GPU-STREAM: Tom Deakin, James Price, Matt J. Martineau, and Simon N. McIntosh-Smith. 2016. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. Paper presented at the P^3MA Workshop at ISC High Performance, Frankfurt, Germany.

TESLA M40

TESLA P100

BENCHMARK 2

Tensor ranks 8 and 12
Rank 8: (5, 3, 2, 4, 35, 33, 37, 40), 200M elements
Rank 12: (2, 3, 4, 3, 2, 2, 3, 2, 20, 18, 22, 24), 328M elements
500 random permutations for both tensor ranks
Simulates a realistic workload in quantum chemistry calculations

TESLA K20X

TESLA M40

TESLA P100

PERFORMANCE DISTRIBUTION

BENCHMARK 3

Set of 57 tensor transposes from TTC: P. Springer, J. R. Hammond, and P. Bientinesi. TTC: A high-performance compiler for tensor transpositions. CoRR, 2016. http://arxiv.org/abs/1603.02297
Somewhat "easy" benchmark due to the small number of permutations

TESLA K40M

TTC average: 140 GiB/s
cuTT average: 144 GiB/s

TTC data from: Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249

BENCHMARK 4

Real-world tensor contractions performed with TAL-SH (Tensor Algebra Library for Shared Memory Computers) by Dmitry I. Lyakh at Oak Ridge National Laboratory
9306 random permutations on tensors up to rank 8
Matrix multiply performed using cuBLAS

TESLA K20X GPU

[Plots (a) and (b): GFlop/s and percentage of maximum performance vs. arithmetic intensity for the best, average, and worst cases; (a) single precision, (b) double precision.]

\mathrm{Arithmetic\ Intensity} = \frac{2\sqrt{\mathrm{vol}(D)\,\mathrm{vol}(L)\,\mathrm{vol}(R)}}{\mathrm{vol}(D) + \mathrm{vol}(L) + \mathrm{vol}(R)}, \quad \text{for } D = D + L \cdot R

TESLA M40

[Plot: GFlop/s and percentage of maximum performance vs. arithmetic intensity, single precision.]

TESLA P100

[Plots (a) and (b): GFlop/s and percentage of maximum performance vs. arithmetic intensity for the best, average, and worst cases; (a) single precision, (b) double precision.]

CONCLUSIONS & ACKNOWLEDGEMENTS

CONCLUSIONS

Fully runtime library for high-performance tensor transposes on NVIDIA GPUs
Extensive benchmarking
Achieves a median of 70-80% of the maximum achievable memory bandwidth
Performance equals or exceeds that of the compiler-based approach (TTC)
Enables close-to-peak-FLOP tensor contractions on the P100
Integrated as part of TAL-SH (https://github.com/DmitryLyakh/TAL_SH)
Work underway to use it in NWCHEM, DIRAC, LS-DALTON, and ACES-IV
Source code available at: https://github.com/ap-hynninen/cutt
Manuscript available at: https://arxiv.org/abs/1705.01598

ACKNOWLEDGEMENTS

Dmitry I. Lyakh at the Oak Ridge Leadership Computing Facility at ORNL
ORNL, where 80% of the work was done
