
S7255: CUTT: A HIGH-PERFORMANCE TENSOR LIBRARY FOR GPUS
Antti-Pekka Hynninen, 5/10/2017, GTC 2017, San Jose CA

MOTIVATION
Tensor contractions

Tensor contractions are the most computationally intensive part of the quantum many-body methods used in NWCHEM, DIRAC, LS-DALTON, and ACES-IV:

D(a,b,i) += L(a,c,i,k) R(k,b,c)

where the sum runs over the repeated indices c and k.

Evaluating tensor contractions directly requires a lot of hard-to-write custom code. The indirect approach instead transposes the tensors and uses efficient libraries (such as cuBLAS) to perform the multiply.

TENSOR CONTRACTIONS
Indirect approach

A tensor contraction is a reduction over a pair of indices shared by two tensors, e.g.

D(a,b,i) += L(a,c,i,k) R(k,b,c)

This can be evaluated as

L(a,c,i,k) → L(a,i,k,c)            # tensor transpose
R(k,b,c)   → R(k,c,b)              # tensor transpose
D(a,i,b)  += L(a,i,k,c) R(k,c,b)   # matrix multiply
D(a,i,b)   → D(a,b,i)              # tensor transpose

This is able to take advantage of the high-performance matrix multiply routines provided by cuBLAS.
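To make the indirect (transpose-transpose-GEMM-transpose) scheme concrete, here is a minimal host-side C++ sketch of the same contraction, using plain loops in place of cuTT and cuBLAS; the function and variable names are illustrative only, not part of any library.

// Minimal host-side sketch of the indirect contraction
//   D(a,b,i) += L(a,c,i,k) R(k,b,c)
// using explicit transposes plus a plain matrix multiply in place of
// cuTT + cuBLAS. First index is fastest (column-major style), as in the slides.
#include <vector>

void contract(int na, int nb, int ni, int nc, int nk,
              const std::vector<float>& L,   // L(a,c,i,k)
              const std::vector<float>& R,   // R(k,b,c)
              std::vector<float>& D)         // D(a,b,i)
{
  // L(a,c,i,k) -> Lt(a,i,k,c)   # tensor transpose
  std::vector<float> Lt(L.size());
  for (int k = 0; k < nk; k++)
    for (int i = 0; i < ni; i++)
      for (int c = 0; c < nc; c++)
        for (int a = 0; a < na; a++)
          Lt[a + na*(i + ni*(k + nk*c))] = L[a + na*(c + nc*(i + ni*k))];

  // R(k,b,c) -> Rt(k,c,b)       # tensor transpose
  std::vector<float> Rt(R.size());
  for (int c = 0; c < nc; c++)
    for (int b = 0; b < nb; b++)
      for (int k = 0; k < nk; k++)
        Rt[k + nk*(c + nc*b)] = R[k + nk*(b + nb*c)];

  // Dt(a,i,b) += Lt(a,i,k,c) Rt(k,c,b)   # matrix multiply:
  // (na*ni) x (nk*nc) times (nk*nc) x nb, both column-major
  std::vector<float> Dt((size_t)na*ni*nb, 0.0f);
  int M = na*ni, K = nk*nc, N = nb;
  for (int n = 0; n < N; n++)
    for (int kk = 0; kk < K; kk++)
      for (int m = 0; m < M; m++)
        Dt[m + M*n] += Lt[m + M*kk] * Rt[kk + K*n];

  // Dt(a,i,b) -> D(a,b,i)       # tensor transpose, accumulated into D
  for (int i = 0; i < ni; i++)
    for (int b = 0; b < nb; b++)
      for (int a = 0; a < na; a++)
        D[a + na*(b + nb*i)] += Dt[a + na*(i + ni*b)];
}

On the GPU, the two input transposes and the output transpose are exactly the operations cuTT accelerates, and the matrix multiply maps onto a single cuBLAS GEMM call.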

PREVIOUS WORK

No runtime high-performance tensor transpose library exists for GPUs

Previous implementation by my co-author [1] was sub-optimal on GPU platforms

Work in [2] relies on a compiler to build custom kernels, i.e. it is not a runtime approach

[1] Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84-91. DOI: http://dx.doi.org/10.1016/j.cpc.2014.12.013

[2] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249

TENSOR TRANSPOSE


MATRIX TRANSPOSE: TILED ALGORITHM

Step 1: Read a 32x32 tile from global memory into shared memory

__syncthreads()

Step 2: Read the shared memory in transposed order and write to global memory

Mark Harris, "An Efficient Matrix Transpose in CUDA C/C++", Parallel Forall Blog: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc
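A minimal sketch of this kernel, along the lines of Harris's blog post (the matrix is assumed square with a side that is a multiple of 32; the +1 padding avoids shared-memory bank conflicts):

#define TILE_DIM   32
#define BLOCK_ROWS 8

// Transpose of a width x width row-major matrix, one 32x32 tile per thread block.
// Launch with dim3 grid(width/TILE_DIM, width/TILE_DIM), dim3 block(TILE_DIM, BLOCK_ROWS).
__global__ void transposeTiled(float* odata, const float* idata, int width) {
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;

  // Step 1: read a 32x32 tile from global memory into shared memory
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

  __syncthreads();

  // Step 2: read shared memory in transposed order and write to global memory
  x = blockIdx.y * TILE_DIM + threadIdx.x;   // swapped block indices
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}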

TILED ALGORITHM

[Diagram: example permutation (1 2 3 4 5 6) → (5 4 1 6 3 2); dimensions d1 and d5 form the shared memory tile, the remaining dimensions are looped over by the thread block (TB).]

Constant shared memory usage (~32x32)
Performs well when d1 and d5 are fairly large (~32)
Poor performance for small dimensions (2-8)
Would it be possible to pack multiple small dimensions into shared memory?

PACKED ALGORITHM

[Diagram: example permutation (1 2 3 4 5 6) → (5 4 1 6 3 2); dimensions d1, d2, d5 form the shared memory volume, dimensions d3, d4, d6 the TB loop volume.]

No longer uses a 32x32 shared memory tile
Loads entire dimensions into shared memory (not tiled)
As much shared memory is allocated as it takes to store the elements
Must choose which dimensions to pack

New problem: what if e.g. d5 is very large?

PACKED-SPLIT ALGORITHM

[Diagram: example permutation (1 2 3 4 5 6) → (5 4 1 6 3 2); shared memory volume (d1, d2, d5) with the largest dimension split, TB loop volume (d3, d4, d6).]

Splits the largest dimension
Number of splits is determined by the shared memory size
Must choose which dimensions to pack, and the number of splits

MEMORY POSITION CALCULATION

GLOBAL MEMORY POSITION CALCULATION

[Diagram: input tensor (1 2 3 4 5 6) read (glRead) into the shared memory volume (1 2 5, shRead), then written (glWrite) to the output tensor (5 4 1 6 3 2).]

H = number of elements in the shared memory volume, positions s = 0, ..., H-1
M = number of elements in the loop volume, positions p = 0, ..., M-1

Need to convert the positions s and p into global memory positions:
glRead  = global memory read position
glWrite = global memory write position

The global memory position is split into a minor part (within the shared memory volume) and a major part (within the loop volume):
glRead  = glMinorRead(s)  + glMajorRead(p)
glWrite = glMinorWrite(s) + glMajorWrite(p)

MAJOR POSITION CALCULATION

Loop volume dimensions of the input tensor: (d3, d4, d6), p = 0, ..., M-1

glMajorRead(p) = \sum_{i=1}^{n} \mathrm{mod}(p / c_i, d_i) \cdot t_i

// int p = 0,...,M-1
// int c[n] = {1, d3, d3*d4}
// int d[n] = {d3, d4, d6}
// int t[n] = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}
int glMajorRead = 0;
for (int i = 0; i < n; i++) {
  glMajorRead += ((p / c[i]) % d[i]) * t[i];
}

O(n) operations
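For reference, the c, d, t lookup arrays can be assembled on the host from the input dimensions and the list of dimensions assigned to the loop volume. The sketch below is hypothetical helper code (not cuTT's internal implementation); it reproduces the values shown above when the loop volume holds d3, d4, d6:

#include <vector>

// Build the c, d, t arrays used by glMajorRead for the *input* tensor.
// dim[]     : full input tensor dimensions, first index fastest
// loopDims[]: indices (into dim) of the loop-volume dimensions, in input order
void buildMajorReadArrays(const std::vector<int>& dim,
                          const std::vector<int>& loopDims,
                          std::vector<int>& c, std::vector<int>& d,
                          std::vector<int>& t) {
  int n = (int)loopDims.size();
  c.resize(n); d.resize(n); t.resize(n);

  // Input-tensor strides: stride[j] = dim[0]*...*dim[j-1]
  std::vector<int> stride(dim.size(), 1);
  for (size_t j = 1; j < dim.size(); j++) stride[j] = stride[j - 1] * dim[j - 1];

  int cum = 1;
  for (int i = 0; i < n; i++) {
    c[i] = cum;                  // cumulative product of preceding loop dimensions
    d[i] = dim[loopDims[i]];     // size of this loop dimension
    t[i] = stride[loopDims[i]];  // stride of this dimension in the input tensor
    cum *= d[i];
  }
}

// Example: dim = {d1,d2,d3,d4,d5,d6}, loopDims = {2,3,5} (i.e. d3,d4,d6)
// gives c = {1, d3, d3*d4}, d = {d3, d4, d6},
//       t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}, matching the slide.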

Observation: p is constant within a thread block (and therefore within a warp)

WARP-PARALLEL POSITION CALCULATION

Each warp lane holds one entry of c, d, and t (entries beyond n are padded with c = d = 1, so those lanes contribute zero), and the per-lane terms of

glMajorRead(p) = \sum_{i=1}^{n} \mathrm{mod}(p / c_i, d_i) \cdot t_i

are summed across the warp with shuffles:

// int p = 0,...,M-1
// int c = {1, d3, d3*d4, 1, ..., 1}
// int d = {d3, d4, d6,   1, ..., 1}
// int t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5, ...}
int glMajorRead = ((p / c) % d) * t;
for (int i = 16; i >= 1; i /= 2) {
  glMajorRead += __shfl_xor(glMajorRead, i);
}

A single divide, modulo, and multiply per lane: O(1), i.e. performance independent of the tensor rank

Works up to n = 32
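A self-contained version of this idea might look as follows; this is a sketch rather than cuTT's actual kernel code, and it uses the CUDA 9 __shfl_xor_sync in place of the older __shfl_xor shown above:

// Warp-parallel major-position calculation (sketch).
// Lane i of the warp holds c[i], d[i], t[i]; unused lanes hold c = d = 1
// so their terms are zero. Every lane returns the full sum.
__device__ int glMajorPos(int p, int c, int d, int t) {
  // One divide, modulo, and multiply per lane
  int term = ((p / c) % d) * t;
  // Butterfly reduction over the 32 lanes of the warp
  for (int i = 16; i >= 1; i /= 2) {
    term += __shfl_xor_sync(0xffffffff, term, i);
  }
  return term;
}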

POSITION CALCULATION

For the Tiled algorithm the minor positions (glMinorRead(s), shRead(s), glMinorWrite(s), s = 0, ..., H-1) are trivial to compute

For Packed and Packed-Split, pre-compute the positions and store them in registers

Number of registers per thread:
numReg = (H - 1)/blockDim.x + 1

int glMinorRead[numReg]
int shRead[numReg]
int glMinorWrite[numReg]

The kernels are templated on numReg
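As a rough illustration of this pattern, the sketch below (hypothetical names, not cuTT's kernel code) precomputes each thread's glMinorRead values with the same divide/modulo scheme used for the major positions; shRead and glMinorWrite would be filled analogously using the shared-memory and output-tensor strides:

// Sketch: per-thread pre-computation of the minor read positions.
// Each thread owns the shared-memory elements s = threadIdx.x + j*blockDim.x,
// j = 0..numReg-1.
template <int numReg>
__device__ void precomputeGlMinorRead(int H, int nMinor,
                                      const int* c,  // cumulative products of packed dims
                                      const int* d,  // packed dimension sizes
                                      const int* t,  // input-tensor strides of packed dims
                                      int glMinorRead[numReg]) {
  #pragma unroll
  for (int j = 0; j < numReg; j++) {
    int s = threadIdx.x + j * blockDim.x;
    glMinorRead[j] = 0;
    if (s < H) {
      for (int i = 0; i < nMinor; i++) {
        glMinorRead[j] += ((s / c[i]) % d[i]) * t[i];  // same scheme as the major positions
      }
    }
  }
}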

ALGORITHM & PARAMETER CHOICE

CHOOSING THE BEST ALGORITHM

Algorithm choice: Tiled, Packed, or Packed-Split

Tiled: no free parameters


Packed: input and output ranks


Packed-Split: input and output ranks, number of splits


Large performance differences between different algorithm and parameter choices

CUTT PLANS

cuttResult cuttPlanMeasure(cuttHandle* handle, int rank, int* dim, int* permutation,
                           size_t sizeofType, cudaStream_t stream, void* idata, void* odata);

cuttResult cuttPlan(cuttHandle* handle, int rank, int* dim, int* permutation, size_t sizeofType, cudaStream_t stream);

Measure plans (cuttPlanMeasure) perform all possible tensor transposes and choose the best performing plan: LARGE overhead.

Heuristic plans (cuttPlan) choose the best plan by estimating the transpose runtime with an analytical GPU performance model: SMALL overhead.

Heuristic plans must be used in QM calculations. Getting the heuristic planning to work accurately was a major hurdle; a better approach for choosing the heuristic plans may be needed (?)
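A typical call sequence looks like the sketch below. It is based on the public API in the cuTT repository (cuttExecute and cuttDestroy are not shown on this slide), and the permutation convention in the comment is an assumption taken from the repository documentation:

#include <cutt.h>

// Plan, execute, and destroy a single-precision tensor transpose.
// Assumed convention: output dimension i is input dimension permutation[i].
void transposeExample(float* idata, float* odata, cudaStream_t stream) {
  int dim[4]         = {32, 48, 64, 24};  // input tensor extents, first index fastest
  int permutation[4] = {2, 0, 3, 1};      // e.g. (d1,d2,d3,d4) -> (d3,d1,d4,d2)

  cuttHandle plan;
  // Heuristic plan (small overhead); cuttPlanMeasure could be used instead
  // when its large planning overhead is acceptable.
  cuttPlan(&plan, 4, dim, permutation, sizeof(float), stream);

  // idata and odata are separate device buffers of dim[0]*...*dim[3] elements
  // (out-of-place transpose).
  cuttExecute(plan, idata, odata);

  cuttDestroy(plan);
}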

BENCHMARKS

BENCHMARK 1

Tensor ranks 2 to 7
Ratio between largest and smallest tensor dimensions 1:1, 5:1, and 15:1
Tensor volume normally distributed with an average of 200M elements and a standard deviation of 20M elements
500 random permutations for each tensor rank and ratio
9000 tensor transposes in total

TESLA K20X

[Plot: Benchmark 1 bandwidth on Tesla K20X, compared against the maximum achievable memory bandwidth*]

* Maximum bandwidth measured using GPU-STREAM: Tom Deakin, James Price, Matt J. Martineau, and Simon N. McIntosh-Smith. 2016. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. Paper presented at the P^3MA Workshop at ISC High Performance, Frankfurt, Germany.

TESLA M40

TESLA P100

BENCHMARK 2

Tensor ranks 8 and 12
Rank 8: (5, 3, 2, 4, 35, 33, 37, 40), 200M elements
Rank 12: (2, 3, 4, 3, 2, 2, 3, 2, 20, 18, 22, 24), 328M elements
500 random permutations for both tensor ranks
Simulates a realistic workload in quantum chemistry calculations

TESLA K20X

TESLA M40

TESLA P100

PERFORMANCE DISTRIBUTION

BENCHMARK 3

Set of 57 tensor transposes from TTC: P. Springer, J. R. Hammond, and P. Bientinesi. TTC: A high-performance compiler for tensor transpositions. CoRR, 2016. http://arxiv.org/abs/1603.02297
Somewhat "easy" benchmark due to the small number of permutations

TESLA K40M

TTC average: 140 GiB/s
cuTT average: 144 GiB/s

TTC data from: Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249

BENCHMARK 4

Real-world tensor contractions performed with TAL-SH (Tensor Algebra Library for Shared Memory Computers) by Dmitry I. Lyakh at Oak Ridge National Laboratory
9306 random permutations on tensors up to rank 8
Matrix multiply performed using cuBLAS

TESLA K20X GPU

[Plots (a) and (b): GFlop/s and percentage of maximum performance vs. arithmetic intensity for the best, average, and worst cases; (a) single precision, (b) double precision.]

\mathrm{Arithmetic\ Intensity} = \frac{2\sqrt{\mathrm{vol}(D)\,\mathrm{vol}(L)\,\mathrm{vol}(R)}}{\mathrm{vol}(D) + \mathrm{vol}(L) + \mathrm{vol}(R)}, \quad \text{for } D = D + L \cdot R

TESLA M40

[Plot: GFlop/s and percentage of maximum performance vs. arithmetic intensity, single precision.]

TESLA P100

[Plots (a) and (b): GFlop/s and percentage of maximum performance vs. arithmetic intensity for the best, average, and worst cases; (a) single precision, (b) double precision.]

CONCLUSIONS & ACKNOWLEDGEMENTS

CONCLUSIONS

Fully runtime library for high-performance tensor transposes on NVIDIA GPUs
Extensive benchmarking
Achieves a median of 70-80% of the maximum achievable memory bandwidth
Performance equals or exceeds that of the compiler-based approach (TTC)
Enables close-to-peak-FLOP tensor contractions on the P100
Integrated as part of TAL-SH (https://github.com/DmitryLyakh/TAL_SH)
Work underway to use it in NWCHEM, DIRAC, LS-DALTON, and ACES-IV
Source code available at: https://github.com/ap-hynninen/cutt
Manuscript available at: https://arxiv.org/abs/1705.01598

ACKNOWLEDGEMENTS

Dmitry I. Lyakh at the Oak Ridge Leadership Computing Facility at ORNL
ORNL, where 80% of the work was done
