
Modeling Performance of Tensor Transpose using Regression Techniques

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Rohit Kumar Srivastava, B.E.

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Master’s Examination Committee:

Dr. P. Sadayappan, Advisor

Dr. Radu Teodorescu

Copyright by

Rohit Kumar Srivastava

2018

Abstract

Tensor transposition is an important primitive in many libraries. For example, tensor contractions are implemented using the TTGT (Transpose-Transpose-GEMM-Transpose) approach. Performing an efficient transpose of an arbitrary tensor requires different optimization techniques depending on the required permutation.

Exhaustive evaluation of all parameter choices like slice size and blocking is prohibitively expensive. We present an approach to model the performance of the different kernels inside TTLG, a Tensor Transpose Library for GPUs, for different parameters like slice size, blocking, and resultant warp efficiency. Predictions made by this model are then used to guide kernel selection and its parameter selection.

To my mother, father and brother, for their unconditional love and support.

Acknowledgments

This thesis wouldn’t have been possible without the guidance and support of many people.

First of all I would like to express my gratitude to my advisor Prof. P. Sadayappan for his guidance, feedback, patience and critical discussions throughout the process.

I’m grateful to him for providing me the opportunity to work with him on this project.

A special thanks to Aravind. Regular discussions with him helped me gain deeper insight into the problem and develop a fundamentally better understanding of the domain. This improved my technical abilities. I would like to thank my lab mates Jinsung, Vineeth, Kunal, Emre, Rui, Changwan, Prashant, Israt, Wenlie and Gordon for making my past year at HPCRL eventful and memorable. I am thankful to Akshay Mehra, Aaditya Chauhan, Akhil Guliyani, Anhad Mohananey and Dushyanta Dhyani for their constant support, trust and faith in me, and for always being there during the tough times of grad school life. I thank all my friends Prithvi, Sankeerth, Deepankar, Piyush, Pravar, Anant, Ajit, Sayam, Pragya and Anu for making Columbus my home and making my transition to the United States much easier.

Finally, all of this wouldn’t be possible without the sacrifices and hard work of my parents and my brother. Without their support, love and encouragement, I wouldn’t have made it this far in life. I am nothing without them.

Vita

August 2008 – May 2012 ...... Bachelor of Engineering, Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, India.

July 2012 – August 2015 ...... SDE, Infibeam.com, Gurugram, Haryana.

September 2015 – August 2016 ...... SDE-1, Expedia Inc., Gurugram, Haryana.

January 2017 – May 2017 ...... Graduate Teaching Associate, The Ohio State University, Columbus, Ohio.

May 2017 – August 2017 ...... Software Developer Intern, Amazon Web Services, Seattle, Washington.

August 2017 – present ...... Graduate Research Associate, The Ohio State University, Columbus, Ohio.

Publications

Research Publications

Jyothi Vedurada, Arjun Suresh, Aravind Sukumaran-Rajam, Jinsung Kim, Changwan Hong, Sriram Krishnamoorthy, Ajay Panyala, V. Krishna Nandivada, Rohit Kumar Srivastava, and P. Sadayappan. TTLG: An Efficient Tensor Transpose Library for GPUs. IPDPS, May 2018.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

List of Tables

List of Figures

1. Introduction
   1.1 TTLG: Tensor Transpose Library for GPUs
   1.2 Regression Analysis
       1.2.1 External Model
       1.2.2 Internal Model
   1.3 Contribution
   1.4 Organization of the Thesis

2. Background
   2.1 GPU Architecture and CUDA Programming
   2.2 Kernel Selection inside TTLG
   2.3 TTLG Kernels
       2.3.1 FVINoMatchG32: Fastest varying indices do not match and their sizes are greater than 32
       2.3.2 FVIMatchL32: Fastest varying indices match and size less than 32
       2.3.3 FVIMatchG32: Fastest varying indices match and their sizes are greater than 32
       2.3.4 FVINoMatchGeneral: Fastest varying indices do not match and there is no overlap between indices of input and output slice
       2.3.5 FVINoMatchOverlap: General scheme that can handle both matching and non-matching fvi of input and output tensors; indices mapped to the slice from the input and output tensor can overlap
   2.4 Regression Models
       2.4.1 Linear Regression
       2.4.2 Random Forest Regression

3. Challenges

4. Linear Regression Model
   4.1 Efficiency Calculation
   4.2 Feature Engineering
       4.2.1 Features
   4.3 Derived Features
   4.4 Data Collection
   4.5 Utilizing regression model to improve performance of tensor transposition

5. Experiments and Results

6. Conclusion and Future Work

Bibliography

List of Tables

5.1 Hardware Configuration

5.2 Mean of Absolute Error Percentage for Linear and Random Forest Regression

List of Figures

1.1 Model using Volume as input feature

2.1 TTLG Kernel selection flowchart

2.2 FVINoMatchG32 Scheme

2.3 FVIMatchL32 Scheme

2.4 FVINoMatchGeneral Scheme

2.5 FVINoMatchOverlap Scheme

2.6 Linear Regression

2.7 Decision Tree

2.8 Random Forest

4.1 Read slice from Global Memory

4.2 Write slice to Global Memory

4.3 Type of sub-slices in a Slice

4.4 Types of slices in Tensor

4.5 Regression Model for TTLG

5.1 Performance Comparison of previous versus new implementation of FVIMatchG32

5.2 MAEP during training phase and prediction phase

5.3 TTLG Performance on All 15

5.4 Kernel Prediction by Model for All 15

5.5 Error Frequencies for All 15

5.6 Performance on All 16

5.7 Kernel Prediction by Model for All 16

5.8 Error Frequencies for All 16

5.9 Performance on All 17

5.10 Kernel Prediction by Model for All 17

5.11 Error Frequencies for All 17

Chapter 1: Introduction

Tensor transposition is an important layout transformation primitive for many domains, such as tensor contractions (TTGT [1]) and computational chemistry, that use tensors as a core data structure. It involves a permutation of the indices of an input tensor:

$$B_{\rho(i_0, i_1, i_2, \ldots, i_{d-1})} \leftarrow A_{i_0, i_1, i_2, \ldots, i_{d-1}}$$

where A and B are the input and output tensors, respectively, and ρ denotes a permutation function that maps output indices to input indices. An arbitrary transposition of a d-dimensional tensor can be achieved by d nested loops, or by computing the memory offset of each element and using a single loop over the complete volume of the tensor in a 1D fashion.
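As a minimal illustration of the second approach, the following Python sketch (names and structure are assumptions for this example, not TTLG code) transposes a row-major d-dimensional tensor with a single loop over its volume, decoding and re-encoding memory offsets:

```python
import numpy as np

def transpose_1d(a_flat, sizes, perm):
    """Transpose a flat, row-major d-dimensional tensor with one loop
    over the full volume, using explicit offset arithmetic."""
    d = len(sizes)
    out_sizes = [sizes[perm[k]] for k in range(d)]

    def strides(s):
        # Row-major strides: the last dimension varies fastest.
        st = [1] * len(s)
        for i in range(len(s) - 2, -1, -1):
            st[i] = st[i + 1] * s[i + 1]
        return st

    in_st, out_st = strides(sizes), strides(out_sizes)
    b_flat = np.empty_like(a_flat)
    for lin in range(a_flat.size):
        # Decode the flat input offset into a multi-index (i0, ..., i_{d-1}).
        rem, idx = lin, [0] * d
        for k in range(d):
            idx[k], rem = rem // in_st[k], rem % in_st[k]
        # Output dimension k takes its index from input dimension perm[k].
        out_off = sum(idx[perm[k]] * out_st[k] for k in range(d))
        b_flat[out_off] = a_flat[lin]
    return b_flat

# Example: transpose a 2 x 3 matrix stored flat.
print(transpose_1d(np.arange(6), [2, 3], [1, 0]).reshape(3, 2))
```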

Both of the above approaches can be inefficient when using GPUs to perform the tensor transpose. For example, when transposing a 2D tensor, successive elements read from a column of the input tensor can have very large strides, which can lead to uncoalesced memory accesses.

1.1 TTLG: Tensor Transpose Library for GPUs

TTLG [4] is a library developed to perform tensor transpose efficiently on a GPU. It divides the total work into slices, each of which is transposed by a thread block and written to its appropriate position in global memory. It performs an in-memory transpose on the GPU; this requires the GPU to have enough memory for both the tensor and its transpose. We will refer to the tensor being transposed as the input tensor and the transposed tensor as the output tensor. The library uses various techniques like thread coarsening to improve thread occupancy and shared-memory padding to provide conflict-free access to shared memory. In order to provide coalesced memory reads and writes, indirection arrays are used to read elements from the input tensor and write them to the output tensor. It utilizes different GPU kernels to perform tensor transposition.

For certain tensor sizes and output permutations, the choice of the kernel to be executed is simple and based on a few conditional checks. But there are certain output permutations whose transpose can be performed by multiple kernels; these are the cases where the fastest varying index (fvi) of the input and output tensors do not match, and different types of kernels inside TTLG try to optimize the transpose using different techniques. One way to find the best performing kernel for a given input is to evaluate all possible kernels and then select the best one, like TTC [8]. This approach works well when the use case requires repeated transposition of the same tensor size and output permutation; for single use, it can consume a significant amount of time and slow down the library. Another approach is to use heuristics to prune the parameter search space, like cuTT [3]; this may not achieve the best possible bandwidth for the given input. TTLG uses efficiency-based calculations to predict the performance of the candidate kernels and chooses the one it predicts will perform best.

1.2 Regression Analysis

Regression analysis is a statistical technique used to find the relationship between independent variables (like shared memory, input and output slice sizes, stride, warp efficiency, fvi of the input and output tensors, etc.) and dependent variables (performance metrics like operations/sec and bandwidth). It helps us understand variations in the dependent variables with respect to changes made to the independent variables.

1.2.1 External Model

Externally, the input to the predictive model is the same as the input provided to the library: the number of dimensions, the size of each dimension, and the output permutation. This is not good enough. Without internal knowledge of TTLG, only the volume of the input tensor can be calculated, and using just the volume makes the model predict the same bandwidth for every output permutation of a given tensor. This is depicted in Figure 1.1, which plots the actual performance of different permutations and slice choices for rank 6 tensors of sizes all 15 (15 × 15 × 15 × 15 × 15 × 15), all 16 (16 × 16 × 16 × 16 × 16 × 16), and all 17 (17 × 17 × 17 × 17 × 17 × 17). The X-axis shows the test case number and the Y-axis shows bandwidth. As can be seen from the graph, the model (shown by the blue line, in the form of 3 steps) predicts a constant bandwidth for each test case with identical volume. Thus, without knowledge of the internal implementation of each kernel, such a basic model is unable to capture the efficiency and/or inefficiency of each kernel. This leads to inaccurate predictions by the model and poor performance by the library.

Figure 1.1: Model using Volume as input feature (bandwidth in GB/s on the Y-axis versus test case number on the X-axis)

1.2.2 Internal Model

Within the TTLG library, different features like fvi, thread block size, and coarsening factor can be calculated based on the required output permutation and tensor size, and passed as parameters to the model corresponding to a kernel, in order to decide which kernel to execute. If more than one kernel can perform the given transpose, the TTLG input is passed to the model of each of those kernels, and the kernel with the best predicted bandwidth is then selected to execute the transpose operation. Therefore, an internal prediction model for each kernel, using attributes relevant to that particular kernel, can predict its performance with higher accuracy.

This work primarily focuses on:

1. Building a predictive model for each kernel inside TTLG and using predictions

made by these models to select the kernel and its configuration.

2. Automating the process of building the Linear Regression model for different GPUs

to adapt to different hardware environments.

3. Improving the performance of the kernel FVIMatchG32 using thread coarsening.

1.4 Organization of the Thesis

The rest of the thesis is organized as follows:

Chapter 2 provides background about all kernels inside TTLG and a brief introduction to Linear and Random Forest Regression. Chapter 3 explains the advantages of using linear regression and the need for auto-tuning the library. Chapter 4 describes in detail the feature selection process for building the predictive model for each kernel inside TTLG. It also describes the datasets that were used for generating the training and evaluation data for the model and the setup used for performance evaluation of the library using the trained model. Chapter 5 presents the results obtained from the experimental setup and attempts to explain the performance achieved by the internal model. Chapter 6 discusses alternate approaches to building the predictive model for TTLG, regression techniques that could be used instead of linear regression, and other areas that can benefit from similar predictive models.

Chapter 2: Background

2.1 GPU Architecture and CUDA Programming

From an architectural standpoint, an NVIDIA GPU comprises a set of streaming multiprocessors (SMs), where each SM consists of streaming processors (SPs). The SPs inside an SM share registers and shared memory. Shared memory is divided into memory banks; accessing data from shared-memory banks is very fast as long as each thread accesses a location in a different bank (i.e., conflict-free access).

In an NVIDIA GPU, the smallest execution unit is a warp; threads in an SM are grouped into warps. In modern GPUs, each warp consists of 32 threads that execute in lock-step fashion. Warps, threads, and blocks in an SM are uniquely identified by warpId, threadId, and blockId respectively.

A CUDA program comprises various phases that can execute on either the CPU or the GPU. This decision is made by the programmer; the general practice is to execute parallel segments on the GPU and sequential ones on the CPU. The segments that are executed on a GPU are written inside functions called kernels, and executing them is referred to as a kernel launch. At the beginning of a kernel launch a grid is formed, which is composed of thread blocks, which in turn are composed of warps.

Thread blocks are mapped to different SMs. Depending on the available shared memory and registers, an SM can execute more than one thread block simultaneously.

2.2 Kernel Selection inside TTLG

The input to the TTLG library is the number of dimensions of the tensor, the size of each dimension, and the output permutation. Based on this input, it internally performs index fusion: if there are indices that are consecutive in both the input and the output tensor, they are merged together into a single dimension. For example, consider a rank 6 tensor of size [16 × 16 × 16 × 16 × 16 × 16] with output permutation {2 3 0 1 4 5}. After index fusion the resultant tensor becomes [256 × 256 × 256] and the output permutation reduces to {1 0 2}.
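A minimal Python sketch of the fusion step (an illustrative reimplementation under the permutation convention above, not TTLG's actual code):

```python
from math import prod

def fuse_indices(sizes, perm):
    """Merge runs of indices that are consecutive in both the input and
    the output tensor into single fused dimensions."""
    d = len(perm)
    # Group consecutive output positions whose input dimensions are consecutive.
    groups, k = [], 0
    while k < d:
        start = k
        while k + 1 < d and perm[k + 1] == perm[k] + 1:
            k += 1
        groups.append(list(range(start, k + 1)))
        k += 1
    # Order the fused dimensions by their position in the input tensor.
    in_order = sorted(groups, key=lambda g: perm[g[0]])
    new_sizes = [prod(sizes[perm[p]] for p in g) for g in in_order]
    # New permutation: for each group in output order, its fused input dim.
    fused_dim = {perm[g[0]]: i for i, g in enumerate(in_order)}
    new_perm = [fused_dim[perm[g[0]]] for g in groups]
    return new_sizes, new_perm

# The example from the text: extent-16 rank 6 tensor, permutation {2 3 0 1 4 5}.
assert fuse_indices([16] * 6, [2, 3, 0, 1, 4, 5]) == ([256, 256, 256], [1, 0, 2])
```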

After index fusion, TTLG checks whether the fvi of the input and output tensors match. If they do not, it uses efficiency-based calculations to predict the performance of the transpose for FVINoMatchG32, and for 64 different slice choices each for FVINoMatchGeneral and FVINoMatchOverlap; for the latter two it selects the slice choice with maximum efficiency among the 64, and then finally executes the best of the three kernels. If the fvi's of the input and output tensors do match, it checks whether the fvi is <32. If so, the following condition is evaluated:

size(0) × size(1) ≥ 32 && size(ρ(0)) × size(ρ(1)) ≥ 32

When the above condition evaluates to true, FVIMatchL32 is executed; otherwise FVINoMatchOverlap executes.

When the fvi's match and their sizes are ≥32, the FVIMatchG32 kernel is executed. This is the most efficient transpose operation and requires no additional optimization technique, because there is a direct copy from the input tensor (coalesced read) to the output tensor (coalesced write). Figure 2.1 gives a flowchart representation of the kernel selection criteria inside the TTLG library.


Figure 2.1: TTLG Kernel selection flowchart

2.3 TTLG Kernels

This section explains the different kernels implemented inside TTLG. Understanding each of them plays an important role in feature selection for each kernel and is necessary for building the desired predictive model.

2.3.1 FVINoMatchG32: Fastest varying indices do not match and their sizes are greater than 32

This tensor transpose scheme is similar to a 2D transpose scheme. Since the fvi of both the input and output tensors are greater than the warp size, consecutive elements along the input fvi are read into shared memory, while the output fvi is varied along the rows of a column. Since all elements along a column are mapped to the same memory bank, reading successive elements along a column can lead to severe bank conflicts, which could affect the performance of the transpose. Hence a padding of size 1 is added and a shared-memory size of 32 × 33 is chosen. This results in conflict-free access of consecutive elements along a column in shared memory. Thus both reads from and writes to global memory are coalesced in this scheme. This scheme is depicted in Figure 2.2.


Figure 2.2: FVINoMatchG32 Scheme

2.3.2 FVIMatchL32: Fastest varying indices match and size less than 32

This kernel is executed when the fvi of both the input and output tensors are the same but <32 in size. It is not chosen if size(0) × size(1) < 32. A blocking factor b for the index next to the fvi is chosen based on the predicted performance of the kernel. Elements can be thought of as being read from a 3D block of size b × b × N0, where N0 is the fvi of the input and output tensors. This 3D block can be represented in 2D fashion as (b × N0) × b, as shown in Figure 2.3. A single warp can then perform a coalesced read of b rows, each of size N0, into shared memory, in block-cyclic fashion depending on whether the value of b × N0 is >32. Similarly, other warps bring in successive segments along the (b × N0) dimension. Padding is applied in such a way that the dimension of size (b × N0) becomes 33 after padding; this type of padding was empirically found to give better performance than a padding of size 1. Next, a warp writes out consecutive pencils (column-major access) to global memory in coalesced fashion. Thus the kernel ensures coalesced memory access during both read and write operations.

2.3.3 FVIMatchG32: Fastest varying indices match and their sizes are greater than 32

This kernel is executed only when the fvi's of the input and output tensors are >32. In such a case, a warp can directly copy elements from the input to the output tensor. Here both read and write operations access global memory in a coalesced manner. This kernel performs the fastest tensor transpose in the library.


Figure 2.3: FVIMatchL32 Scheme

2.3.4 FVINoMatchGeneral: Fastest varying indices do not match and there is no overlap between indices of input and output slice

This scheme is a generalization of the FVINoMatchG32 scheme without the constraint that both the input and output fvi's be >32, but it requires that the combinations of indices being brought from the input and output tensors into shared memory do not overlap (i.e., the combined indices from the input and output tensors shouldn't have common components). Since the fvi's of both the input and output tensors are smaller than the warp size, we combine indices next to the fvi in the input and output tensors to improve thread occupancy within warps. Next, the same 2D matrix transpose scheme (as described in Section 2.3.1) is utilized to perform the transpose operation. Here, indirection arrays are used to read from and write to global memory in a coalesced fashion. These operations are not always fully coalesced (coalescing is partial when the combination of indices being brought to shared memory is <32). Refer to Figure 2.4 for a pictorial representation of this scheme.


Figure 2.4: FVINoMatchGeneral Scheme

2.3.5 FVINoMatchOverlap: General scheme that can handle both matching and non-matching fvi of input and output tensors. Indices mapped to the slice from the input and output tensor can be overlapping

This kernel removes the restriction of FVINoMatchGeneral that the combination of indices of the input and output tensors be non-overlapping. Global memory reads and writes are not fully coalesced when the size of the non-overlapping combined indices is <32, which can degrade the performance of the tensor transposition; in such cases we may have to combine overlapping indices to achieve better efficiency. Since the overlapped indices cannot be mapped to both the rows and the columns of shared memory, they are mapped along with the input indices to columns in shared memory, and the non-overlapping indices of the output tensor are mapped along the rows. Now, in order to achieve coalesced memory access for the write operation, a separate indirection array is used to read elements from shared memory, which are then written out to global memory in coalesced fashion. In this way more complex data access patterns can be handled, because of the flexibility of accessing elements in a staggered fashion from shared memory. This scheme is depicted in Figure 2.5.

2.4 Regression Models

There are different types of regression analyses, depending on whether one wants to capture a linear or non-linear relationship between the independent and dependent variables. Here two types of regression, namely Linear and Random Forest, are explained briefly.

2.4.1 Linear Regression

Linear regression [5] attempts to capture the relationship between the dependent and independent variables using linear mathematical functions, as depicted in Figure 2.6. A simple example would be a straight line of the form y = mx + c, with one dependent variable y, one independent variable x, and an intercept c. This model uses least-squares error to fit the model to the data. Variants also exist that perform regularization, imposing a penalty on increasing model complexity to improve generalization: Lasso [10] regression uses an L1 penalty, Ridge [2] regression uses an L2 norm penalty, and Elastic [11] regression uses an (L1 + L2) norm penalty.

2.4.2 Random Forest Regression

Random Forest [6] regression is an ensemble of multiple decision trees [9]. A decision tree is a tree whose internal nodes represent conditions, edges represent evaluations of those conditions, and leaf nodes represent final predictions. Figure 2.7 shows a decision tree depicting the different decisions one might consider before buying a car. Decision trees are normally used for classification problems, but since regression requires prediction of a continuous value, a regression tree simply gives the median or mean of values that minimizes the overall sum-of-squares error. In a random forest there are many such decision trees, and the final value is the weighted average of the values predicted by all the trees. It uses techniques like bagging, which leverage the model's instability to make it more robust, and it can easily identify important features, since each feature gets several chances to be part of the final model through random selection of a subset of features during each iteration of fitting the data, as shown in Figure 2.8.
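A corresponding sketch with scikit-learn's random forest regressor, on the same made-up data as the linear example; max_depth here is illustrative (the experiments in Chapter 5 tune it starting from 2):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[24.0, 16, 0, 0.87],
                    [27.5, 0, 8, 0.93],
                    [26.1, 4, 4, 0.71]])
y_train = np.array([310.0, 420.0, 255.0])

rf = RandomForestRegressor(n_estimators=100, max_depth=2, random_state=0)
rf.fit(X_train, y_train)
print(rf.predict(np.array([[25.3, 8, 0, 0.80]])))
```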


Figure 2.5: FVINoMatchOverlap Scheme

Source: https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/linear_regression.pdf

Figure 2.6: Linear Regression

Source: https://towardsdatascience.com/decision-trees-and-random-forests-for-classification-and-regression-pt-1-dbb65a458df

Figure 2.7: Decision Tree

Image Source: https://towardsdatascience.com/decision-trees-and-random-forests-for-classification-and-regression-pt-2-2b1fcd03e342

Figure 2.8: Random Forest

Chapter 3: Challenges

The task of creating a linear regression model for each of the five kernels comes with its own set of challenges. The first is feature engineering: it requires selection or generation of relevant features that have a strong influence on the output, such as tensor size, slice choices, warp efficiency, and thread occupancy. Since only the number of dimensions, the size of each dimension of the tensor, and the output permutation are given as input, these need to be converted into numerical attributes that correlate with tensor transpose performance. This requires in-depth knowledge of each kernel of TTLG and of the factors on which their performance depends. For example, FVIMatchG32 is independent of shared memory; it only depends on thread occupancy and the size of the tensor. On the other hand, a scheme like FVINoMatchOverlap relies heavily on three different types of indirection arrays, stored in the texture memory of the GPU, that are used to provide staggered access to shared memory, which in turn helps provide fully or partially coalesced access to global memory. In such cases the sizes of the fvi's of the input and output tensors also play an important role in determining the overall performance of the tensor transpose.

The prediction time of the regression model is also of prime importance. A complex model that yields better results but has a longer prediction time degrades the overall performance of the library. Hence, we need to strike a balance between the error in bandwidth prediction and the model's prediction time. Linear regression with a proper choice of input attributes provides fast prediction with high accuracy.

Another challenge is tuning the model parameters for optimal performance on GPUs with different architectures. The performance of the transpose also depends on hardware limitations like the shared-memory limit per SM, the maximum number of active threads per thread block, the maximum shared memory per thread block per SM, and the total number of SMs in a GPU. These values vary with the architecture and compute capability of the GPU. Hence, the linear regression models differ across architectures (different coefficients for the same attribute on different GPU architectures), which creates the need to automate the tuning process of the model so it can adjust its coefficients based on the underlying architecture.

Chapter 4: Linear Regression Model

This section describes in detail the process of building the linear regression model.

The sub-sections elaborate upon the following points:

1. Features used for building the model and their relevance.

2. The process of input data collection for building the predictive model

3. Use of the trained model inside TTLG to select the correct kernel and the optimal sizes for the input and output slices.

4.1 Efficiency Calculation

The slice that is being transposed can be thought of as consisting of 4 sub-slices; please refer to Figure 4.3 to better understand the sub-slices described below:

• Type I: these are the perfect sub-slices whose dimensions are a perfect multiple of 32; efficiency is 100% because no threads in a warp are idle at any time during the execution of such slices.

• Type II: these sub-slices have idle threads in a warp when accessed in row-major fashion, but have zero or very few idle threads when accessed in column-major fashion.

• Type III: these are the reverse of Type II sub-slices. Here row-major access gives higher efficiency than column-major access.

• Type IV: since both the rows and columns of this type of sub-slice are small in number, both read and write accesses are inefficient. There exists either a single Type IV sub-slice or none for a chosen slice size.

The efficiency of each sub-slice, in general, can be calculated as follows:

$$\mathrm{eff}_{\text{sub-slice}} = \frac{\text{active threads}}{\text{total threads}}$$

The efficiency of each sub-slice type is different, because each has a different number of idle threads in a warp depending on whether the threads are reading from or writing to global memory. This is depicted in Figures 4.1 and 4.2. The transpose efficiency for each sub-slice type can be computed as follows:

$$f_1 = 1, \qquad f_2 = \frac{ilimit \bmod 32}{32}, \qquad f_3 = \frac{olimit \bmod 32}{32}, \qquad f_4 = \frac{(ilimit \bmod 32) \times (olimit \bmod 32)}{32 \times 32}$$

The total efficiency of transposing a slice is calculated using the weighted average of the efficiencies of all four types of sub-slices. If $n_1, n_2, n_3, n_4$ refer to the number of sub-slices of Types I, II, III and IV respectively, then the total efficiency of the transpose of a complete slice is given by:

$$F_i = \frac{n_1 f_1 + n_2 f_2 + n_3 f_3 + n_4 f_4}{n_1 + n_2 + n_3 + n_4}$$


Figure 4.1: Read slice from Global Memory

Similarly, the net efficiency of the tensor transpose operation is given by taking a weighted mean over all four types of slices (refer to Figure 4.4):

$$F = \frac{N_1 F_1 + N_2 F_2 + N_3 F_3 + N_4 F_4}{N_1 + N_2 + N_3 + N_4}$$

Here, $N_1, N_2, N_3, N_4$ represent the total number of Type I, II, III and IV slices respectively.
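The arithmetic above can be transcribed directly; the following Python sketch assumes the sub-slice and slice counts are already known:

```python
def subslice_fractions(ilimit, olimit):
    """Per-type efficiencies f1..f4 for the four sub-slice types."""
    f1 = 1.0
    f2 = (ilimit % 32) / 32.0
    f3 = (olimit % 32) / 32.0
    f4 = ((ilimit % 32) * (olimit % 32)) / (32.0 * 32.0)
    return f1, f2, f3, f4

def weighted_efficiency(fracs, counts):
    """Weighted mean, used per slice (f_i with n_i) and per tensor (F_i with N_i)."""
    return sum(f * n for f, n in zip(fracs, counts)) / sum(counts)
```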

4.2 Feature Engineering

4.2.1 Features

• Volume: the total size of the tensor, i.e., the product of the sizes of all dimensions. This directly relates to the total work to be done; more work helps keep all the SMs of the GPU active most of the time.


Figure 4.2: Write slice to Global Memory

• Numthreads: the number of threads in a thread block. This is used in the warp efficiency calculations, since there are sub-slices within a slice being transposed. When not all the threads in a thread block are active, thread occupancy is reduced.

• FVI input (fastest varying index of input): the size of the fvi of the input tensor. If the fvi of the input tensor is large enough, global memory reads can be coalesced.

• FVI output (fastest varying index of output): the size of the fvi of the output tensor. If the fvi of the output tensor is large enough, global memory writes can be coalesced.


Figure 4.3: Type of sub-slices in a Slice


• NumBlocks: the total number of thread blocks that will be launched and executed on the GPU to perform the transpose operation. Not all of them need to execute simultaneously. This affects the load balancing of the total work on the GPU.

• ilimit: the extent of a slice along the columns in a row. A larger ilimit means more Type I and Type II sub-slices compared to Type III.

• olimit: the extent of a slice along the rows in a column. A larger olimit means more Type I and Type III sub-slices compared to Type II.

Figure 4.4: Types of slices in Tensor. (ilimit: length along indices mapped from input dimensions that is processed by a thread block; olimit: length along indices mapped from output dimensions that is processed by a thread block; asize: total size of input dimensions mapped to the 2D superslice; bsize: total size of output dimensions mapped to the 2D superslice.)

• blockA or blockB: the blocking factor for the indices next to the fvi in the input/output tensor. This is used to increase the total size of the input indices being mapped to a slice, such that the size of a single dimension of the slice is greater than 32 or a multiple of it. This is the factor referred to as b in the FVIMatchL32 scheme.

• F: the net efficiency with which a slice is transposed. The efficiency calculation is explained in Section 4.1.

4.3 Derived Features

The above features capture different types of information about the transpose scheme to be used, derived from the given input sizes and desired output permutation. But more complex schemes like FVIMatchL32, FVINoMatchGeneral and FVINoMatchOverlap can benefit further from certain attributes derived from those features. These derived attributes can be linear or non-linear functions of the features described in the previous section. The following is a list of synthesized attributes found to improve performance, along with the types of transpose kernels they are useful for; a sketch of their construction follows the list:

1. log(totalsize): this non-linear transformation of the total size of the tensor was found to reduce the prediction error of all models significantly, compared to not using it.

2. ilimit % 32: indicates the number of threads in a warp that remain active towards the end of the combined dimension comprising the fvi and next-to-fvi indices of the input tensor.

3. olimit % 32: indicates the number of threads in a warp that remain active towards the end of the combined dimension comprising the fvi and next-to-fvi indices of the output tensor.

4. (b × N0) % 128: useful only in the FVIMatchL32 kernel, which uses a fixed thread block size of 128 threads.

5. fvi % 256: useful only for the FVIMatchG32 kernel, which uses a fixed thread block size of 256 threads.

The number of threads per block was chosen based on the hardware specifications mentioned in the CUDA programming guide [7], in order to maximize thread occupancy.
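A sketch of how these derived attributes could be assembled into a feature vector (the function and key names are assumptions for illustration; the library computes these internally):

```python
import math

def derived_features(total_size, ilimit, olimit, b, n0, fvi):
    return {
        "log_total_size": math.log(total_size),
        "ilimit_mod_32": ilimit % 32,
        "olimit_mod_32": olimit % 32,
        "b_n0_mod_128": (b * n0) % 128,  # FVIMatchL32: 128-thread blocks
        "fvi_mod_256": fvi % 256,        # FVIMatchG32: 256-thread blocks
    }
```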

27 4.4 Data Collection

This section describes the process of collecting data for training and testing the predictive model. In order to collect enough data for regression modelling, an input set of 26,100 different cases was used as input to the TTLG library. To collect data for FVIMatchL32 and FVIMatchG32, only cases that could be handled by them were given as input to the library, and their output bandwidth along with the other input attributes was recorded.

To collect data for FVINoMatchG32, cases where the fvi's of input and output do not match and are greater than 32 were supplied, and TTLG was forced to use only this kernel. Output bandwidth along with the relevant input attributes was recorded.

To build the predictive model for FVINoMatchGeneral, input cases where the fvi's do not match were considered, and each input tensor and output permutation was evaluated for 64 different possible combinations of indices to be mapped to a slice for the transpose. The TTLG library was forced to use only this kernel to perform the transpose operation, so the data generated was 64 times the input size. Since some of the slice choices were not suitable for execution on the GPU due to memory constraints, the corresponding kernel calls failed; such cases were removed before passing the data as input to the model.

To collect data for the FVINoMatchOverlap model, all input cases except the ones used for FVIMatchG32 were considered, and each input tensor and output permutation was again evaluated for 64 different possible combinations of indices mapped to a slice. The TTLG library was forced to use only this kernel to perform the transpose operation. Again, the data generated was 64 times the input size, and slice choices that resulted in kernel call failures were removed before passing the data as input to the model.

4.5 Utilizing regression model to improve performance of tensor transposition

One use of the predictive model is to give the user a rough estimate of the speed at which TTLG will perform the transpose of a given tensor and desired output permutation. For the kernels FVIMatchL32 and FVIMatchG32 the choice is simply based on a few conditions, and these are the best performing kernels for such cases. But when the fvi's of the input and output tensors do not match, there is more than one possible kernel that can execute the given permutation, and we need the predicted best performance of each kernel in order to select the one that performs better of the two or three. For the kernel FVINoMatchG32 the slice size is fixed, but for FVINoMatchGeneral and FVINoMatchOverlap there are 64 different possible choices of slice sizes, and each slice choice gives different performance. Hence the model predicts the performance of all 64 choices and reports the best one, thereby eliminating the evaluation of the tensor transpose for each slice choice and saving a lot of time; a sketch of this selection step appears after this paragraph. Refer to Figure 4.5 for a better understanding of the model: here b1, b2, b3 and b4 represent boolean conditions which, when evaluated to true, cause TTLG to select that kernel (if there is a clear choice) or set of kernels (if more than one kernel can perform the transpose, in which case the one with the best predicted performance is selected). Being a predictive model, its predictions are not 100% accurate; as we shall see in the results section, there are cases where it selects a suboptimal kernel and the difference in performance between the ideal kernel and the selected one is significant.
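The selection step amounts to an argmax over predicted bandwidths, sketched below under the assumption of a per-kernel model and a hypothetical feature builder (no GPU runs are needed):

```python
def select_slice(model, slice_choices, make_features):
    """Return the slice choice with the highest predicted bandwidth.
    `model` and `make_features` are placeholders for the per-kernel
    regression model and its feature construction."""
    preds = [model.predict([make_features(c)])[0] for c in slice_choices]
    best = max(range(len(slice_choices)), key=preds.__getitem__)
    return slice_choices[best], preds[best]
```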


Figure 4.5: Regression Model for TTLG

Chapter 5: Experiments and Results

The experiments used the hardware described in Table 5.1. To create a diverse dataset, we consider several transpose test cases that cover different ranks, volumes, extents (sizes of dimensions) and orderings among the extents. Tensor ranks range from 3 to 6 and include all possible permutations. Volumes of the tensors range from 16MB to 2GB. The different orderings among the extents include (examples shown in brackets for a 3D tensor with indices i0, i1, i2): (1) all the same (i0 = i1 = i2), (2) monotonically increasing (i0 < i1 < i2), (3) monotonically decreasing (i0 > i1 > i2), (4) increasing till the center dimension and then decreasing (i0 < i1 > i2), (5) decreasing till the center dimension and then increasing (i0 > i1 < i2). We randomly select three-fourths of all the test cases to form the training data and use the remainder as test data.

The internal model for each kernel was trained on the training data, and its performance during training was evaluated on the test data. To ensure that there is no overfitting on the training data, the generated model was also evaluated on a completely separate evaluation dataset, consisting of rank 6 tensors with sizes all 15 (15 × 15 × 15 × 15 × 15 × 15), all 16 (16 × 16 × 16 × 16 × 16 × 16) and all 17 (17 × 17 × 17 × 17 × 17 × 17), for all 6! = 720 output permutations and 64 slice choices (for kernels whose performance relies on slice sizes). The results obtained are summarized in Table 5.2. They show that for linear regression the error decreases on the evaluation dataset, which suggests that the model is generic and does not overfit the training data. This is also the case for random forest regression, except for the kernel FVINoMatchGeneral, where the model seems to overfit the training data, as its evaluation error is high relative to its test error. Figure 5.2 is a pictorial representation of Table 5.2, from which this case of overfitting can be easily identified. The maximum depth of the decision trees inside the random forest for each kernel was initialized to 2 and increased until the error on the evaluation dataset became very high compared to that on the training data. The performance of each model during the training phase is given under the test-train split column, and during the evaluation phase under the evaluation dataset column.
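Assuming the standard definition, the mean of absolute error percentage (MAEP) reported in Table 5.2, over $n$ test cases with measured bandwidths $y_k$ and predicted bandwidths $\hat{y}_k$, is:

$$\mathrm{MAEP} = \frac{100}{n} \sum_{k=1}^{n} \left| \frac{\hat{y}_k - y_k}{y_k} \right|$$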

Next, the trained models for each of the five kernels of TTLG were coded into the library, and the performance of the library was evaluated on rank 6 tensors of sizes all 15, all 16 and all 17. Figures 5.3, 5.6 and 5.9 compare TTLG using the regression model, the best achievable performance of TTLG, and cuTT on tensors of sizes all 15, 16 and 17 respectively. These figures show that the performance decreases as we move from 15 to 16 and is worst for 17.

To further investigate the cases where TTLG with the regression model did not perform well, the predicted kernels and the ideal kernels for each test case are plotted as line graphs (Figures 5.4, 5.7 and 5.10). Together with the performance graphs, these show that the cases with a significant difference between the maximum bandwidth (GB/s) achievable by TTLG during the transpose and that achieved using the predictive model are mostly due to incorrect kernel prediction, which is worst for tensors of size all 17.

Type     | Specification
CPU      | Intel Xeon CPU E5-2680 v4 @ 2.40GHz
GPU      | Nvidia Tesla P100-PCIE, 16GB global memory, ECC off
Software | Red Hat Enterprise Linux Server 7.3, CUDA 8.0.44, gcc 4.8.5, Nvidia Driver 387.26

Table 5.1: Hardware Configuration

Figures 5.5, 5.8 and 5.11 show the frequency of the percentage error (the difference between model-guided performance and best performance), grouped into buckets of width 5%. They indicate that most of the cases fall within the range 0 to 25 percent, but there still exist a finite number of cases with errors from 25% up to 60%. The errors within the 0-5% range are due to small variations in performance across repeated runs. The ones between 5% and 25% are due to incorrect selection of kernel parameters, and the remaining ones are due to execution of an incorrect kernel.

To improve the coarsening in the kernel FVIMatchG32, the maximum number of active thread blocks is calculated based on hardware specifications. Then, if including another dimension in the slice still keeps the total number of thread blocks to be launched greater than the maximum number of active thread blocks, the next dimension is also included in the slice. Figure 5.1 compares the performance of the previous implementation of the kernel FVIMatchG32 with the new implementation.
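A heavily hedged Python sketch of this coarsening rule (all names and the exact stopping condition are assumptions; the real logic lives in the kernel launch code):

```python
def coarsen_slice(dim_sizes, num_blocks, max_active_blocks):
    """Fold successive dimensions into the slice while the launch would
    still exceed the number of thread blocks the GPU can keep active."""
    included = []
    for d, extent in enumerate(dim_sizes):
        if num_blocks // extent > max_active_blocks:
            included.append(d)       # fold this dimension into the slice
            num_blocks //= extent    # correspondingly fewer blocks launched
        else:
            break
    return included, num_blocks
```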

33 600

500

400

300

Bandwidth(GB/s) 200

100

0

0 1 2 3 5 4 5 3 2 1 0 5 3 4 2 1 0 3 5 4 2 1 0 4 3 5 2 1 0 3 4 5 2 1 0 5 4 2 3 1 0 4 5 2 3 1 0 5 2 4 3 1 0 2 5 4 3 1 0 4 2 5 3 1 0 2 4 5 3 1 0 5 3 2 4 1 0 3 5 2 4 1 0 5 2 3 4 1 0 2 5 3 4 1 0 3 2 5 4 1 0 2 3 5 4 1 0 4 3 2 5 1 0 3 4 2 5 1 0 4 2 3 5 1 0 2 4 3 5 1 0 3 2 4 5 1 0 2 3 4 5 1 0 ttlg_best_prev ttlg_best

Figure 5.1: Performance Comparison of previous versus new implementation of FVIMatchG32

Kernel            | MAEP (Linear): Test-Train split (25:75) | MAEP (Linear): Evaluation Dataset | MAEP (Random Forest): Test-Train split (25:75) | MAEP (Random Forest): Evaluation Dataset
FVIMatchL32       | 11.279 | 9.11   | 14.269 | 9.875
FVIMatchG32       | 14.295 | 12.24  | 10.153 | 9.184
FVINoMatchG32     | 24.234 | 20.897 | 17.395 | 16.563
FVINoMatchGeneral | 24.367 | 19.015 | 10.963 | 18.627
FVINoMatchOverlap | 19.537 | 14.374 | 16.997 | 11.734

Table 5.2: Mean of Absolute Error Percentage for Linear and Random Forest Regression

34 30

25

20

15 MAEP

10

5

0 FVIMatchL32 FVIMatchG32 FVINoMatchG32 FVINoMatchGeneral FVINoMatchOverlap MAEP(Linear Regression) Test train split(25:75) MAEP(Linear Regression) Evaluation dataset MAEP(Random Forest) Test train split(25:75) MAEP(Random Forest) Evaluation dataset

Figure 5.2: MAEP during training phase and prediction phase


Figure 5.3: TTLG Performance on All 15

35 9

8

7

6

5 Kernel Kernel Name 4 No. Kernel No. Kernel 0 FVIMatchL32 3 1 FVIMatchG32 2 6 FVINoMatchG32 7 FVINoMatchGeneral 1 8 FVINoMatchOverlap

0

1 2 5 0 4 3 4 0 5 2 1 2 5 0 1 3 4 0 1 2 3 5 4 5 3 2 1 0 3 4 2 5 1 0 3 5 1 4 2 0 4 5 1 2 3 0 3 5 2 1 4 0 2 3 1 5 4 0 2 4 1 3 5 0 4 5 2 3 0 1 4 5 3 0 2 1 2 5 0 4 3 1 3 5 0 2 4 1 3 4 2 0 5 1 2 3 0 4 5 1 3 5 1 4 0 2 4 5 0 3 1 2 4 5 1 0 3 2 1 4 0 5 3 2 1 5 0 3 4 2 3 4 0 1 5 2 4 5 2 1 0 3 2 4 1 5 0 3 2 5 0 4 1 3 4 5 0 1 2 3 2 5 1 0 4 3 1 2 0 5 4 3 1 4 0 2 5 3 3 5 1 2 0 4 3 5 2 0 1 4 2 3 0 5 1 4 1 5 0 3 2 4 2 3 1 0 5 4 1 2 0 3 5 4 2 4 1 3 0 5 3 4 0 2 1 5 3 4 1 0 2 5 1 3 0 4 2 5 1 4 0 2 3 5 2 3 0 1 4 5 best_kernel lr_kernel

Figure 5.4: Kernel Prediction by Model for All 15


Figure 5.5: Error Frequencies for All 15


Figure 5.6: Performance on All 16


Figure 5.7: Kernel Prediction by Model for All 16

37 450

400

350

300

250

200 Frequency

150

100

50

0 0-5 5-10 10-15 15-20 20-25 45-50 50-55 Error(%) range

Figure 5.8: Error Frequencies for All 16


Figure 5.9: Performance on All 17

38 9

8

7

6

5 Kernel Kernel Name No. 4

Kernel No. Kernel 0 FVIMatchL32 3 1 FVIMatchG32 2 6 FVINoMatchG32 7 FVINoMatchGeneral 1 8 FVINoMatchOverlap

0

0 4 5 1 3 2 3 1 5 4 0 3 4 2 0 5 1 1 4 0 2 3 5 0 1 2 3 5 4 5 3 2 1 0 3 4 2 5 1 0 3 5 1 4 2 0 4 5 1 2 3 0 3 5 2 1 4 0 2 4 1 3 5 0 4 5 2 3 0 1 4 5 3 0 2 1 3 4 0 5 2 1 2 5 0 4 3 1 3 5 0 2 4 1 2 3 0 4 5 1 3 5 1 4 0 2 4 5 0 3 1 2 4 5 1 0 3 2 1 4 0 5 3 2 1 5 0 3 4 2 3 4 0 1 5 2 4 5 2 1 0 3 2 4 1 5 0 3 2 5 0 4 1 3 4 5 0 1 2 3 2 5 1 0 4 3 1 2 0 5 4 3 1 4 0 2 5 3 3 5 1 2 0 4 3 5 2 0 1 4 2 3 0 5 1 4 1 5 0 3 2 4 2 5 0 1 3 4 2 3 1 0 5 4 1 2 0 3 5 4 2 4 1 3 0 5 3 4 0 2 1 5 3 4 1 0 2 5 1 3 0 4 2 5 2 3 0 1 4 5 Repeated Use best_kernel 1 Repeated Use lr_kernel 1

Figure 5.10: Kernel Prediction by Model for All 17


Figure 5.11: Error Frequencies for All 17

Chapter 6: Conclusion and Future Work

The results show that TTLG's performance with the regression model is best for the all 15 size tensors compared to the all 16 and all 17 size tensors. In most cases the model correctly predicts the optimal kernel and its parameters, but there are a finite number of cases where it does not select the right kernel for the transpose operation. This can be observed in Figures 5.3, 5.6 and 5.9 as deep troughs in the blue line in all three performance graphs. Although the performance using the model is still better than cuTT in most cases, the objective is to bring the model-guided performance of the library as close as possible to the absolute best for all test cases. This requires further analysis of nvprof metrics; a comparison between ideal and predicted kernel parameters should be done to come up with new features that capture the changes in performance more accurately.

An alternative approach could be to use two separate classifier models, one for kernel selection and the other for parameter selection. This way the regression problem reduces to a multi-class classification problem. The previous regression model can still be used for giving bandwidth estimates, which can be utilized for optimizing other complex operations that use transpose as an intermediate step. Random Forest regression can be explored further, as it has shown promising results in terms of accuracy and is faster than most of the other regression methods available; using it for classification would be a good choice.

A similar model can be built for tensor contraction based on the TTGT (transpose-transpose-GEMM-transpose) approach. We already have a model for the transpose step; we would need to build another performance prediction model for the GEMM (General Matrix Multiplication) operation.

Bibliography

[1] So Hirata. Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. The Journal of Physical Chemistry A, 107(46):9887–9897, 2003.

[2] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

[3] Antti-Pekka Hynninen and Dmitry I Lyakh. cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs. arXiv preprint arXiv:1705.01598, 2017.

[4] Jyothi Vedurada, Arjun Suresh, Aravind Sukumaran-Rajam, Jinsung Kim, Changwan Hong, Sriram Krishnamoorthy, Ajay Panyala, Rohit Kumar Srivastava, and P. Sadayappan. TTLG: An Efficient Tensor Transpose Library for GPUs. IPDPS, May 2018.

[5] Michael H Kutner, Chris Nachtsheim, and John Neter. Applied linear regression models. McGraw-Hill/Irwin, 2004.

[6] Andy Liaw, Matthew Wiener, et al. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

[7] Nvidia. CUDA Programming Guide, 2010.

[8] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. TTC: A tensor transposition compiler for multiple architectures. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, pages 41–46. ACM, 2016.

[9] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. 2005.

[10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

42 [11] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
