
Modeling Performance of Tensor Transpose using Regression Techniques

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Rohit Kumar Srivastava, B.E.

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Master’s Examination Committee:

Dr. P. Sadayappan, Advisor

Dr. Radu Teodorescu

Copyright by

Rohit Kumar Srivastava

2018

Abstract

Tensor transposition is an important primitive in many libraries. For example, tensor contractions are implemented using the TTGT (Transpose-Transpose-GEMM-Transpose) approach. Performing an efficient transpose of an arbitrary tensor requires different optimization techniques depending on the required permutation.

Exhaustive evaluation of all parameter choices like slice size and blocking is prohibitively expensive. We present an approach to model the performance of the different kernels inside TTLG, a Tensor Transpose Library for GPUs, for different parameters like slice size, blocking, and resultant warp efficiency. Predictions made by this model are then used to guide kernel selection and its parameter selection.

To my mother, father and brother, for their unconditional love and support.

Acknowledgments

This thesis wouldn’t have been possible without the guidance and support of many people.

First of all I would like to express my gratitude to my advisor Prof. P. Sadayappan for his guidance, feedback, patience and critical discussions throughout the process.

I’m grateful to him for providing me the opportunity to work with him on this project.

A special thanks to Aravind. Regular discussions with him helped me gain deeper insight into the problem and develop a fundamentally better understanding of the domain. This improved my technical abilities. I would like to thank my lab mates Jinsung, Vineeth, Kunal, Emre, Rui, Changwan, Prashant, Israt, Wenlie and Gordon for making my past year at HPCRL eventful and memorable. I am thankful to Akshay Mehra, Aaditya Chauhan, Akhil Guliyani, Anhad Mohananey and Dushyanta Dhyani for their constant support, trust and faith in me, and for always being there during the tough times of grad school life. I thank all my friends Prithvi, Sankeerth, Deepankar, Piyush, Pravar, Anant, Ajit, Sayam, Pragya and Anu for making Columbus my home and making my transition to the United States much easier.

Finally, all of this wouldn’t be possible without the sacrifices and hard work of my parents and my brother. Without their support, love and encouragement, I wouldn’t have made it this far in life. I am nothing without them.

Vita

August 2008 – May 2012 ...... Bachelor of Engineering, Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, India.

July 2012 – August 2015 ...... SDE, Infibeam.com, Gurugram, Haryana.

September 2015 – August 2016 ...... SDE-1, Expedia Inc., Gurugram, Haryana.

January 2017 – May 2017 ...... Graduate Teaching Associate, The Ohio State University, Columbus, Ohio.

May 2017 – August 2017 ...... Software Developer Intern, Amazon Web Services, Seattle, Washington.

August 2017 – present ...... Graduate Research Associate, The Ohio State University, Columbus, Ohio.

Publications

Research Publications

Jyothi Vedurada, Arjun Suresh, Aravind Sukumaran-Rajam, Jinsung Kim, Changwan Hong, Sriram Krishnamoorthy, Ajay Panyala, V. Krishna Nandivada, Rohit Kumar Srivastava, and P. Sadayappan. TTLG: An Efficient Tensor Transpose Library for GPUs. IPDPS, May 2018.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

List of Tables

List of Figures

1. Introduction
   1.1 TTLG: Tensor Transpose Library for GPUs
   1.2 Regression Analysis
       1.2.1 External Model
       1.2.2 Internal Model
   1.3 Contribution
   1.4 Organization of the Thesis

2. Background
   2.1 GPU Architecture and CUDA Programming
   2.2 Kernel Selection inside TTLG
   2.3 TTLG Kernels
       2.3.1 FVINoMatchG32: Fastest varying indices do not match and their sizes are greater than 32
       2.3.2 FVIMatchL32: Fastest varying indices match and size less than 32
       2.3.3 FVIMatchG32: Fastest varying indices match and their sizes are greater than 32
       2.3.4 FVINoMatchGeneral: Fastest varying indices do not match and there is no overlap between indices of input and output slice
       2.3.5 FVINoMatchOverlap: General scheme that can handle both matching and non-matching fvi of input and output tensors; indices mapped to the slice from the input and output tensor can overlap
   2.4 Regression Models
       2.4.1 Linear Regression
       2.4.2 Random Forest Regression

3. Challenges

4. Linear Regression Model
   4.1 Efficiency Calculation
   4.2 Feature Engineering
       4.2.1 Features
   4.3 Derived Features
   4.4 Data Collection
   4.5 Utilizing regression model to improve performance of tensor transposition

5. Experiments and Results

6. Conclusion and Future Work

Bibliography

List of Tables

5.1 Hardware Configuration

5.2 Mean of Absolute Error Percentage for Linear and Random Forest Regression

List of Figures

1.1 Model using Volume as input feature

2.1 TTLG Kernel selection flowchart

2.2 FVINoMatchG32 Scheme

2.3 FVIMatchL32 Scheme

2.4 FVINoMatchGeneral Scheme

2.5 FVINoMatchOverlap Scheme

2.6 Linear Regression

2.7 Decision Tree

2.8 Random Forest

4.1 Read slice from Global Memory

4.2 Write slice to Global Memory

4.3 Type of sub-slices in a Slice

4.4 Types of slices in Tensor

4.5 Regression Model for TTLG

5.1 Performance Comparison of previous versus new implementation of FVIMatchG32

5.2 MAEP during training phase and prediction phase

5.3 TTLG Performance on All 15

5.4 Kernel Prediction by Model for All 15

5.5 Error Frequencies for All 15

5.6 Performance on All 16

5.7 Kernel Prediction by Model for All 16

5.8 Error Frequencies for All 16

5.9 Performance on All 17

5.10 Kernel Prediction by Model for All 17

5.11 Error Frequencies for All 17

Chapter 1: Introduction

Tensor transposition is an important layout transformation primitive for many domains, such as tensor contractions (TTGT [1]) and computational chemistry, that use tensors as a core data structure. It involves a permutation of the indices of an input tensor:

$$B_{\rho(i_0, i_1, i_2, \ldots, i_{d-1})} \leftarrow A_{i_0, i_1, i_2, \ldots, i_{d-1}}$$

where A and B are the input and output tensors, respectively, and ρ denotes a permutation function that maps output indices to input indices. An arbitrary transposition of a d-dimensional tensor can be achieved by d nested loops, or by computing the memory offset of each element and using a single loop over the complete volume of the tensor in a 1D fashion.
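As a minimal illustration of the second approach, the following Python sketch (names and structure are assumptions for this example, not TTLG code) transposes a row-major d-dimensional tensor with a single loop over its volume, decoding and re-encoding memory offsets:

```python
import numpy as np

def transpose_1d(a_flat, sizes, perm):
    """Transpose a flat, row-major d-dimensional tensor with one loop
    over the full volume, using explicit offset arithmetic."""
    d = len(sizes)
    out_sizes = [sizes[perm[k]] for k in range(d)]

    def strides(s):
        # Row-major strides: the last dimension varies fastest.
        st = [1] * len(s)
        for i in range(len(s) - 2, -1, -1):
            st[i] = st[i + 1] * s[i + 1]
        return st

    in_st, out_st = strides(sizes), strides(out_sizes)
    b_flat = np.empty_like(a_flat)
    for lin in range(a_flat.size):
        # Decode the flat input offset into a multi-index (i0, ..., i_{d-1}).
        rem, idx = lin, [0] * d
        for k in range(d):
            idx[k], rem = rem // in_st[k], rem % in_st[k]
        # Output dimension k takes its index from input dimension perm[k].
        out_off = sum(idx[perm[k]] * out_st[k] for k in range(d))
        b_flat[out_off] = a_flat[lin]
    return b_flat

# Example: transpose a 2 x 3 matrix stored flat.
print(transpose_1d(np.arange(6), [2, 3], [1, 0]).reshape(3, 2))
```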

Both of the above approaches can be inefficient when using GPUs to perform the tensor transpose. For example, when transposing a 2D tensor, successive elements read from a column of the input tensor can have very large strides, which can lead to uncoalesced memory accesses.

1.1 TTLG: Tensor Transpose Library for GPUs

TTLG [4] is a library developed to perform tensor transpose efficiently on a GPU. It divides the total work into slices, each of which is transposed by a thread block and written to its appropriate position in global memory. It performs an in-memory transpose on the GPU; this requires the GPU to have enough memory for both the tensor and its transpose. We will refer to the tensor being transposed as the input tensor and the transposed tensor as the output tensor. The library uses various techniques like thread coarsening to improve thread occupancy and shared-memory padding to provide conflict-free access to shared memory. In order to provide coalesced memory reads and writes, indirection arrays are used to read elements from the input tensor and write them to the output tensor. It utilizes different GPU kernels to perform tensor transposition.

For certain tensor sizes and output permutations, the choice of the kernel to be executed is simple and based on a few conditional checks. But there are certain output permutations whose transpose can be performed by multiple kernels; these are the cases where the fastest varying index (fvi) of the input and output tensors do not match, and different types of kernels inside TTLG try to optimize the transpose using different techniques. One way to find the best performing kernel for a given input is to evaluate all possible kernels and then select the best one, like TTC [8]. This approach works well when the use case requires repeated transposition of the same tensor size and output permutation; for single use, it can consume a significant amount of time and slow down the library. Another approach is to use heuristics to prune the parameter search space, like cuTT [3]; this may not achieve the best possible bandwidth for the given input. TTLG uses efficiency-based calculations to predict the performance of the candidate kernels and chooses the one it predicts will perform best.

1.2 Regression Analysis

Regression analysis is a statistical technique used to find the relationship between independent variables (like shared memory, input and output slice sizes, stride, warp efficiency, fvi of the input and output tensors, etc.) and dependent variables (performance metrics like operations/sec and bandwidth). It helps us understand variations in the dependent variables with respect to changes made to the independent variables.

1.2.1 External Model

Externally, the input to the predictive model is the same as the input provided to the library: the number of dimensions, the size of each dimension, and the output permutation. This is not good enough. Without internal knowledge of TTLG, only the volume of the input tensor can be calculated, and using just the volume makes the model predict the same bandwidth for every output permutation of a given tensor. This is depicted in Figure 1.1, which plots the actual performance of different permutations and slice choices for rank 6 tensors of sizes all 15 (15 × 15 × 15 × 15 × 15 × 15), all 16 (16 × 16 × 16 × 16 × 16 × 16), and all 17 (17 × 17 × 17 × 17 × 17 × 17). The X-axis shows the test case number and the Y-axis shows bandwidth. As can be seen from the graph, the model (shown by the blue line, in the form of 3 steps) predicts a constant bandwidth for each test case with identical volume. Thus, without knowledge of the internal implementation of each kernel, such a basic model is unable to capture the efficiency and/or inefficiency of each kernel. This leads to inaccurate predictions by the model and poor performance by the library.

Figure 1.1: Model using Volume as input feature (bandwidth in GB/s on the Y-axis versus test case number on the X-axis)

1.2.2 Internal Model

Within the TTLG library, different features like fvi, thread block size, and coarsening factor can be calculated based on the required output permutation and tensor size, and passed as parameters to the model corresponding to a kernel, in order to decide which kernel to execute. If more than one kernel can perform the given transpose, the TTLG input is passed to the model of each of those kernels, and the kernel with the best predicted bandwidth is then selected to execute the transpose operation. Therefore, an internal prediction model for each kernel, using attributes relevant to that particular kernel, can predict its performance with higher accuracy.

This work primarily focuses on:

1. Building a predictive model for each kernel inside TTLG and using predictions

made by these models to select the kernel and its configuration.

2. Automating the process of building the Linear Regression model for different GPUs

to adapt to different hardware environments.

3. Improving the performance of the kernel FVIMatchG32 using thread coarsening.

1.4 Organization of the Thesis

The rest of the thesis is organized as follows:

Chapter 2 provides background about all kernels inside TTLG and a brief introduction to Linear and Random Forest Regression. Chapter 3 explains the advantages of using linear regression and the need for auto-tuning the library. Chapter 4 describes in detail the feature selection process for building the predictive model for each kernel inside TTLG. It also describes the datasets that were used for generating the training and evaluation data for the model and the setup used for performance evaluation of the library using the trained model. Chapter 5 presents the results obtained from the experimental setup and attempts to explain the performance achieved by the internal model. Chapter 6 discusses alternate approaches to building the predictive model for TTLG, regression techniques that could be used instead of linear regression, and other areas that can benefit from similar predictive models.

Chapter 2: Background

2.1 GPU Architecture and CUDA Programming

From an architectural standpoint, an NVIDIA GPU comprises a set of streaming multiprocessors (SMs), where each SM consists of streaming processors (SPs). The SPs inside an SM share registers and shared memory. Shared memory is divided into memory banks; accessing data from shared-memory banks is very fast as long as each thread accesses a location in a different bank (i.e., conflict-free access).

In an NVIDIA GPU, the smallest execution unit is a warp; threads in an SM are grouped into warps. In modern GPUs, each warp consists of 32 threads that execute in lock-step fashion. Warps, threads, and blocks in an SM are uniquely identified by warpId, threadId, and blockId respectively.

A CUDA program comprises various phases that can execute on either the CPU or the GPU. This decision is made by the programmer; the general practice is to execute parallel segments on the GPU and sequential ones on the CPU. The segments that are executed on a GPU are written inside functions called kernels, and executing them is referred to as a kernel launch. At the beginning of a kernel launch a grid is formed, which is composed of thread blocks, which in turn are composed of warps.

Thread blocks are mapped to different SMs. Depending on the available shared memory and registers, an SM can execute more than one thread block simultaneously.

2.2 Kernel Selection inside TTLG

The input to the TTLG library is the number of dimensions of the tensor, the size of each dimension, and the output permutation. Based on this input, it internally performs index fusion: if there are indices that are consecutive in both the input and the output tensor, they are merged together into a single dimension. For example, consider a rank 6 tensor of size [16 × 16 × 16 × 16 × 16 × 16] with output permutation {2 3 0 1 4 5}. After index fusion the resultant tensor becomes [256 × 256 × 256] and the output permutation reduces to {1 0 2}.
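A minimal Python sketch of the fusion step (an illustrative reimplementation under the permutation convention above, not TTLG's actual code):

```python
from math import prod

def fuse_indices(sizes, perm):
    """Merge runs of indices that are consecutive in both the input and
    the output tensor into single fused dimensions."""
    d = len(perm)
    # Group consecutive output positions whose input dimensions are consecutive.
    groups, k = [], 0
    while k < d:
        start = k
        while k + 1 < d and perm[k + 1] == perm[k] + 1:
            k += 1
        groups.append(list(range(start, k + 1)))
        k += 1
    # Order the fused dimensions by their position in the input tensor.
    in_order = sorted(groups, key=lambda g: perm[g[0]])
    new_sizes = [prod(sizes[perm[p]] for p in g) for g in in_order]
    # New permutation: for each group in output order, its fused input dim.
    fused_dim = {perm[g[0]]: i for i, g in enumerate(in_order)}
    new_perm = [fused_dim[perm[g[0]]] for g in groups]
    return new_sizes, new_perm

# The example from the text: extent-16 rank 6 tensor, permutation {2 3 0 1 4 5}.
assert fuse_indices([16] * 6, [2, 3, 0, 1, 4, 5]) == ([256, 256, 256], [1, 0, 2])
```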

After index fusion, TTLG checks whether the fvi of the input and output tensors match. If they do not, it uses efficiency-based calculations to predict the performance of the transpose for FVINoMatchG32, and for 64 different slice choices each for FVINoMatchGeneral and FVINoMatchOverlap; for the latter two it selects the slice choice with maximum efficiency among the 64, and then finally executes the best of the three kernels. If the fvi's of the input and output tensors do match, it checks whether the fvi is <32. If so, the following condition is evaluated:

size(0) × size(1) ≥ 32 && size(ρ(0)) × size(ρ(1)) ≥ 32

When the above condition evaluates to true, FVIMatchL32 is executed; otherwise FVINoMatchOverlap executes.

When the fvi's match and their sizes are ≥32, the FVIMatchG32 kernel is executed. This is the most efficient transpose operation and requires no additional optimization technique, because there is a direct copy from the input tensor (coalesced read) to the output tensor (coalesced write). Figure 2.1 gives a flowchart representation of the kernel selection criteria inside the TTLG library.


Figure 2.1: TTLG Kernel selection flowchart

2.3 TTLG Kernels

This section explains the different kernels implemented inside TTLG. Understanding each of them plays an important role in feature selection for each kernel and is necessary for building the desired predictive model.

2.3.1 FVINoMatchG32: Fastest varying indices do not match and their sizes are greater than 32

This tensor transpose scheme is similar to a 2D transpose scheme. Since the fvi of both the input and output tensors are greater than the warp size, consecutive elements along the input fvi are read into shared memory, while the output fvi is varied along the rows of a column. Since all elements along a column are mapped to the same memory bank, reading successive elements along a column can lead to severe bank conflicts, which could affect the performance of the transpose. Hence a padding of size 1 is added and a shared-memory size of 32 × 33 is chosen. This results in conflict-free access of consecutive elements along a column in shared memory. Thus both reads from and writes to global memory are coalesced in this scheme. This scheme is depicted in Figure 2.2.


Figure 2.2: FVINoMatchG32 Scheme

2.3.2 FVIMatchL32: Fastest varying indices match and size less than 32

This kernel is executed when the fvi of both the input and output tensors are the same but <32 in size. It is not chosen if size(0) × size(1) < 32. A blocking factor b for the index next to the fvi is chosen based on the predicted performance of the kernel. Elements can be thought of as being read from a 3D block of size b × b × N0, where N0 is the fvi of the input and output tensors. This 3D block can be represented in 2D fashion as (b × N0) × b, as shown in Figure 2.3. A single warp can then perform a coalesced read of b rows, each of size N0, into shared memory, in block-cyclic fashion depending on whether the value of b × N0 is >32. Similarly, other warps bring in successive segments along the (b × N0) dimension. Padding is applied in such a way that the dimension of size (b × N0) becomes 33 after padding; this type of padding was empirically found to give better performance than a padding of size 1. Next, a warp writes out consecutive pencils (column-major access) to global memory in coalesced fashion. Thus the kernel ensures coalesced memory access during both read and write operations.

2.3.3 FVIMatchG32: Fastest varying indices match and their sizes are greater than 32

This kernel is executed only when the fvi's of the input and output tensors are >32. In such a case, a warp can directly copy elements from the input to the output tensor. Here both read and write operations access global memory in a coalesced manner. This kernel performs the fastest tensor transpose in the library.


Figure 2.3: FVIMatchL32 Scheme

2.3.4 FVINoMatchGeneral: Fastest varying indices do not match and there is no overlap between indices of input and output slice

This scheme is a generalization of the FVINoMatchG32 scheme without the constraint that both the input and output fvi's be >32, but it requires that the combinations of indices being brought from the input and output tensors into shared memory do not overlap (i.e., the combined indices from the input and output tensors shouldn't have common components). Since the fvi's of both the input and output tensors are smaller than the warp size, we combine indices next to the fvi in the input and output tensors to improve thread occupancy within warps. Next, the same 2D matrix transpose scheme (as described in Section 2.3.1) is utilized to perform the transpose operation. Here, indirection arrays are used to read from and write to global memory in a coalesced fashion. These operations are not always fully coalesced (coalescing is partial when the combination of indices being brought to shared memory is <32). Refer to Figure 2.4 for a pictorial representation of this scheme.


Figure 2.4: FVINoMatchGeneral Scheme

2.3.5 FVINoMatchOverlap: General scheme that can handle both matching and non-matching fvi of input and output tensors. Indices mapped to the slice from the input and output tensor can be overlapping

This kernel removes the restriction of FVINoMatchGeneral that the combination of indices of the input and output tensors be non-overlapping. Global memory reads and writes are not fully coalesced when the size of the non-overlapping combined indices is <32, which can degrade the performance of the tensor transposition; in such cases we may have to combine overlapping indices to achieve better efficiency. Since the overlapped indices cannot be mapped to both the rows and the columns of shared memory, they are mapped along with the input indices to columns in shared memory, and the non-overlapping indices of the output tensor are mapped along the rows. Now, in order to achieve coalesced memory access for the write operation, a separate indirection array is used to read elements from shared memory, which are then written out to global memory in coalesced fashion. In this way more complex data access patterns can be handled, because of the flexibility of accessing elements in a staggered fashion from shared memory. This scheme is depicted in Figure 2.5.

2.4 Regression Models

There are different types of regression analyses, depending on whether one wants to capture a linear or non-linear relationship between the independent and dependent variables. Here two types of regression, namely Linear and Random Forest, are explained briefly.

2.4.1 Linear Regression

Linear regression [5] attempts to capture the relationship between the dependent and independent variables using linear mathematical functions, as depicted in Figure 2.6. A simple example would be a straight line of the form y = mx + c, with one dependent variable y, one independent variable x, and an intercept c. This model uses least-squares error to fit the model to the data. Variants also exist that perform regularization, imposing a penalty on increasing model complexity to improve generalization: Lasso [10] regression uses an L1 penalty, Ridge [2] regression uses an L2 norm penalty, and Elastic [11] regression uses an (L1 + L2) norm penalty.

2.4.2 Random Forest Regression

Random Forest [6] regression is an ensemble of multiple decision trees [9]. A decision tree is a tree whose internal nodes represent conditions, edges represent evaluations of those conditions, and leaf nodes represent final predictions. Figure 2.7 shows a decision tree depicting the different decisions one might consider before buying a car. Decision trees are normally used for classification problems, but since regression requires prediction of a continuous value, a regression tree simply gives the median or mean of values that minimizes the overall sum-of-squares error. In a random forest there are many such decision trees, and the final value is the weighted average of the values predicted by all the trees. It uses techniques like bagging, which leverage the model's instability to make it more robust, and it can easily identify important features, since each feature gets several chances to be part of the final model through random selection of a subset of features during each iteration of fitting the data, as shown in Figure 2.8.
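A corresponding sketch with scikit-learn's random forest regressor, on the same made-up data as the linear example; max_depth here is illustrative (the experiments in Chapter 5 tune it starting from 2):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[24.0, 16, 0, 0.87],
                    [27.5, 0, 8, 0.93],
                    [26.1, 4, 4, 0.71]])
y_train = np.array([310.0, 420.0, 255.0])

rf = RandomForestRegressor(n_estimators=100, max_depth=2, random_state=0)
rf.fit(X_train, y_train)
print(rf.predict(np.array([[25.3, 8, 0, 0.80]])))
```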


Figure 2.5: FVINoMatchOverlap Scheme

Source: https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/linear_regression.pdf

Figure 2.6: Linear Regression

Source: https://towardsdatascience.com/decision-trees-and-random-forests-for-classification-and-regression-pt-1-dbb65a458df

Figure 2.7: Decision Tree

Image Source: https://towardsdatascience.com/decision-trees-and-random-forests-for-classification-and-regression-pt-2-2b1fcd03e342

Figure 2.8: Random Forest

Chapter 3: Challenges

The task of creating a linear regression model for each of the five kernels comes with its own set of challenges. The first is feature engineering: it requires selection or generation of relevant features that have a strong influence on the output, such as tensor size, slice choices, warp efficiency, and thread occupancy. Since only the number of dimensions, the size of each dimension of the tensor, and the output permutation are given as input, these need to be converted into numerical attributes that correlate with tensor transpose performance. This requires in-depth knowledge of each kernel of TTLG and of the factors on which their performance depends. For example, FVIMatchG32 is independent of shared memory; it only depends on thread occupancy and the size of the tensor. On the other hand, a scheme like FVINoMatchOverlap relies heavily on three different types of indirection arrays, stored in the texture memory of the GPU, that are used to provide staggered access to shared memory, which in turn helps provide fully or partially coalesced access to global memory. In such cases the sizes of the fvi's of the input and output tensors also play an important role in determining the overall performance of the tensor transpose.

The prediction time of the regression model is also of prime importance. A complex model that yields better results but has a longer prediction time degrades the overall performance of the library. Hence, we need to strike a balance between the error in bandwidth prediction and the model's prediction time. Linear regression with a proper choice of input attributes provides fast prediction with high accuracy.

Another challenge is tuning the model parameters for optimal performance on GPUs with different architectures. The performance of the transpose also depends on hardware limitations like the shared-memory limit per SM, the maximum number of active threads per thread block, the maximum shared memory per thread block per SM, and the total number of SMs in a GPU. These values vary with the architecture and compute capability of the GPU. Hence, the linear regression models differ across architectures (different coefficients for the same attribute on different GPU architectures), which creates the need to automate the tuning process of the model so it can adjust its coefficients based on the underlying architecture.

Chapter 4: Linear Regression Model

This section describes in detail the process of building the linear regression model.

The sub-sections elaborate upon the following points:

1. Features used for building the model and their relevance.

2. The process of input data collection for building the predictive model

3. Use of the trained model inside TTLG to select the correct kernel and the optimal sizes for the input and output slices.

4.1 Efficiency Calculation

The slice that is being transposed can be thought of as consisting of 4 sub-slices; please refer to Figure 4.3 to better understand the sub-slices described below:

• Type I: these are the perfect sub-slices whose dimensions are a perfect multiple of 32; efficiency is 100% because no threads in a warp are idle at any time during the execution of such slices.

• Type II: these sub-slices have idle threads in a warp when accessed in row-major fashion, but have zero or very few idle threads when accessed in column-major fashion.

• Type III: these are the reverse of Type II sub-slices. Here row-major access gives higher efficiency than column-major access.

• Type IV: since both the rows and columns of this type of sub-slice are small in number, both read and write accesses are inefficient. There exists either a single Type IV sub-slice or none for a chosen slice size.

The efficiency of each sub-slice, in general, can be calculated as follows:

$$\mathrm{eff}_{\text{sub-slice}} = \frac{\text{active threads}}{\text{total threads}}$$

The efficiency of each sub-slice type is different, because each has a different number of idle threads in a warp depending on whether the threads are reading from or writing to global memory. This is depicted in Figures 4.1 and 4.2. The transpose efficiency for each sub-slice type can be computed as follows:

$$f_1 = 1, \qquad f_2 = \frac{ilimit \bmod 32}{32}, \qquad f_3 = \frac{olimit \bmod 32}{32}, \qquad f_4 = \frac{(ilimit \bmod 32) \times (olimit \bmod 32)}{32 \times 32}$$

The total efficiency of transposing a slice is calculated using the weighted average of the efficiencies of all four types of sub-slices. If $n_1, n_2, n_3, n_4$ refer to the number of sub-slices of Types I, II, III and IV respectively, then the total efficiency of the transpose of a complete slice is given by:

$$F_i = \frac{n_1 f_1 + n_2 f_2 + n_3 f_3 + n_4 f_4}{n_1 + n_2 + n_3 + n_4}$$


Figure 4.1: Read slice from Global Memory

Similarly, the net efficiency of the tensor transpose operation is given by taking a weighted mean over all four types of slices (refer to Figure 4.4):

$$F = \frac{N_1 F_1 + N_2 F_2 + N_3 F_3 + N_4 F_4}{N_1 + N_2 + N_3 + N_4}$$

Here, $N_1, N_2, N_3, N_4$ represent the total number of Type I, II, III and IV slices respectively.
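The arithmetic above can be transcribed directly; the following Python sketch assumes the sub-slice and slice counts are already known:

```python
def subslice_fractions(ilimit, olimit):
    """Per-type efficiencies f1..f4 for the four sub-slice types."""
    f1 = 1.0
    f2 = (ilimit % 32) / 32.0
    f3 = (olimit % 32) / 32.0
    f4 = ((ilimit % 32) * (olimit % 32)) / (32.0 * 32.0)
    return f1, f2, f3, f4

def weighted_efficiency(fracs, counts):
    """Weighted mean, used per slice (f_i with n_i) and per tensor (F_i with N_i)."""
    return sum(f * n for f, n in zip(fracs, counts)) / sum(counts)
```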

4.2 Feature Engineering

4.2.1 Features

• Volume: the total size of the tensor, i.e., the product of the sizes of all dimensions. This directly relates to the total work to be done; more work helps keep all the SMs of the GPU active most of the time.


Figure 4.2: Write slice to Global Memory

• Numthreads: the number of threads in a thread block. This is used in the warp efficiency calculations, since there are sub-slices within a slice being transposed. When not all the threads in a thread block are active, thread occupancy is reduced.

• FVI input (fastest varying index of input): the size of the fvi of the input tensor. If the fvi of the input tensor is large enough, global memory reads can be coalesced.

• FVI output (fastest varying index of output): the size of the fvi of the output tensor. If the fvi of the output tensor is large enough, global memory writes can be coalesced.


Figure 4.3: Type of sub-slices in a Slice


• NumBlocks: the total number of thread blocks that will be launched and executed on the GPU to perform the transpose operation. Not all of them need to execute simultaneously. This affects the load balancing of the total work on the GPU.

• ilimit: the extent of a slice along the columns in a row. A larger ilimit means more Type I and Type II sub-slices compared to Type III.

• olimit: the extent of a slice along the rows in a column. A larger olimit means more Type I and Type III sub-slices compared to Type II.

Figure 4.4: Types of slices in Tensor. (ilimit: length along indices mapped from input dimensions that is processed by a thread block; olimit: length along indices mapped from output dimensions that is processed by a thread block; asize: total size of input dimensions mapped to the 2D superslice; bsize: total size of output dimensions mapped to the 2D superslice.)

• blockA or blockB: the blocking factor for the indices next to the fvi in the input/output tensor. This is used to increase the total size of the input indices being mapped to a slice, such that the size of a single dimension of the slice is greater than 32 or a multiple of it. This is the factor referred to as b in the FVIMatchL32 scheme.

• F: the net efficiency with which a slice is transposed. The efficiency calculation is explained in Section 4.1.

4.3 Derived Features

The above features capture different types of information about the transpose scheme to be used, derived from the given input sizes and desired output permutation. But more complex schemes like FVIMatchL32, FVINoMatchGeneral and FVINoMatchOverlap can benefit further from certain attributes derived from those features. These derived attributes can be linear or non-linear functions of the features described in the previous section. The following is a list of synthesized attributes found to improve performance, along with the types of transpose kernels they are useful for; a sketch of their construction follows the list:

1. log(totalsize): this non-linear transformation of the total size of the tensor was found to reduce the prediction error of all models significantly, compared to not using it.

2. ilimit % 32: indicates the number of threads in a warp that remain active towards the end of the combined dimension comprising the fvi and next-to-fvi indices of the input tensor.

3. olimit % 32: indicates the number of threads in a warp that remain active towards the end of the combined dimension comprising the fvi and next-to-fvi indices of the output tensor.

4. (b × N0) % 128: useful only in the FVIMatchL32 kernel, which uses a fixed thread block size of 128 threads.

5. fvi % 256: useful only for the FVIMatchG32 kernel, which uses a fixed thread block size of 256 threads.

The number of threads per block was chosen based on the hardware specifications mentioned in the CUDA programming guide [7], in order to maximize thread occupancy.
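A sketch of how these derived attributes could be assembled into a feature vector (the function and key names are assumptions for illustration; the library computes these internally):

```python
import math

def derived_features(total_size, ilimit, olimit, b, n0, fvi):
    return {
        "log_total_size": math.log(total_size),
        "ilimit_mod_32": ilimit % 32,
        "olimit_mod_32": olimit % 32,
        "b_n0_mod_128": (b * n0) % 128,  # FVIMatchL32: 128-thread blocks
        "fvi_mod_256": fvi % 256,        # FVIMatchG32: 256-thread blocks
    }
```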

27 4.4 Data Collection

This section describes the process of collecting data for training and testing the predictive model. In order to collect enough data for regression modelling, an input set of 26,100 different cases was used as input to the TTLG library. To collect data for FVIMatchL32 and FVIMatchG32, only cases that could be handled by them were given as input to the library, and their output bandwidth along with the other input attributes was recorded.

To collect data for FVINoMatchG32, cases where the fvi's of input and output do not match and are greater than 32 were supplied, and TTLG was forced to use only this kernel. Output bandwidth along with the relevant input attributes was recorded.

To build the predictive model for FVINoMatchGeneral, input cases where the fvi's do not match were considered, and each input tensor and output permutation was evaluated for 64 different possible combinations of indices to be mapped to a slice for the transpose. The TTLG library was forced to use only this kernel to perform the transpose operation, so the data generated was 64 times the input size. Since some of the slice choices were not suitable for execution on the GPU due to memory constraints, the corresponding kernel calls failed; such cases were removed before passing the data as input to the model.

To collect data for the FVINoMatchOverlap model, all input cases except the ones used for FVIMatchG32 were considered, and each input tensor and output permutation was again evaluated for 64 different possible combinations of indices mapped to a slice. The TTLG library was forced to use only this kernel to perform the transpose operation. Again, the data generated was 64 times the input size, and slice choices that resulted in kernel call failures were removed before passing the data as input to the model.

4.5 Utilizing regression model to improve performance of tensor transposition

One use of the predictive model is to give the user a rough estimate of the speed at which TTLG will perform the transpose of a given tensor and desired output permutation. For the kernels FVIMatchL32 and FVIMatchG32 the choice is simply based on a few conditions, and these are the best performing kernels for such cases. But when the fvi's of the input and output tensors do not match, there is more than one possible kernel that can execute the given permutation, and we need the predicted best performance of each kernel in order to select the one that performs better of the two or three. For the kernel FVINoMatchG32 the slice size is fixed, but for FVINoMatchGeneral and FVINoMatchOverlap there are 64 different possible choices of slice sizes, and each slice choice gives different performance. Hence the model predicts the performance of all 64 choices and reports the best one, thereby eliminating the evaluation of the tensor transpose for each slice choice and saving a lot of time; a sketch of this selection step appears after this paragraph. Refer to Figure 4.5 for a better understanding of the model: here b1, b2, b3 and b4 represent boolean conditions which, when evaluated to true, cause TTLG to select that kernel (if there is a clear choice) or set of kernels (if more than one kernel can perform the transpose, in which case the one with the best predicted performance is selected). Being a predictive model, its predictions are not 100% accurate; as we shall see in the results section, there are cases where it selects a suboptimal kernel and the difference in performance between the ideal kernel and the selected one is significant.
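The selection step amounts to an argmax over predicted bandwidths, sketched below under the assumption of a per-kernel model and a hypothetical feature builder (no GPU runs are needed):

```python
def select_slice(model, slice_choices, make_features):
    """Return the slice choice with the highest predicted bandwidth.
    `model` and `make_features` are placeholders for the per-kernel
    regression model and its feature construction."""
    preds = [model.predict([make_features(c)])[0] for c in slice_choices]
    best = max(range(len(slice_choices)), key=preds.__getitem__)
    return slice_choices[best], preds[best]
```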


Figure 4.5: Regression Model for TTLG

Chapter 5: Experiments and Results

The experiments used the hardware described in Table 5.1. To create a diverse dataset, we consider several transpose test cases that cover different ranks, volumes, extents (sizes of dimensions) and orderings among the extents. Tensor ranks range from 3 to 6 and include all possible permutations. Volumes of the tensors range from 16MB to 2GB. The different orderings among the extents include (examples shown in brackets for a 3D tensor with indices i0, i1, i2): (1) all the same (i0 = i1 = i2), (2) monotonically increasing (i0 < i1 < i2), (3) monotonically decreasing (i0 > i1 > i2), (4) increasing till the center dimension and then decreasing (i0 < i1 > i2), (5) decreasing till the center dimension and then increasing (i0 > i1 < i2). We randomly select three-fourths of all the test cases to form the training data and use the remainder as test data.

The internal model for each kernel was trained on the training data, and its performance during training was evaluated on the test data. To ensure that there is no overfitting on the training data, the generated model was also evaluated on a completely separate evaluation dataset, consisting of rank 6 tensors with sizes all 15 (15 × 15 × 15 × 15 × 15 × 15), all 16 (16 × 16 × 16 × 16 × 16 × 16) and all 17 (17 × 17 × 17 × 17 × 17 × 17), for all 6! = 720 output permutations and 64 slice choices (for kernels whose performance relies on slice sizes). The results obtained are summarized in Table 5.2. They show that for linear regression the error decreases on the evaluation dataset, which suggests that the model is generic and does not overfit the training data. This is also the case for random forest regression, except for the kernel FVINoMatchGeneral, where the model seems to overfit the training data, as its evaluation error is high relative to its test error. Figure 5.2 is a pictorial representation of Table 5.2, from which this case of overfitting can be easily identified. The maximum depth of the decision trees inside the random forest for each kernel was initialized to 2 and increased until the error on the evaluation dataset became very high compared to that on the training data. The performance of each model during the training phase is given under the test-train split column, and during the evaluation phase under the evaluation dataset column.
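Assuming the standard definition, the mean of absolute error percentage (MAEP) reported in Table 5.2, over $n$ test cases with measured bandwidths $y_k$ and predicted bandwidths $\hat{y}_k$, is:

$$\mathrm{MAEP} = \frac{100}{n} \sum_{k=1}^{n} \left| \frac{\hat{y}_k - y_k}{y_k} \right|$$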

Next, the trained models for each of the five kernels of TTLG were coded into the library, and the performance of the library was evaluated on rank 6 tensors of sizes all 15, all 16 and all 17. Figures 5.3, 5.6 and 5.9 compare TTLG using the regression model, the best achievable performance of TTLG, and cuTT on tensors of sizes all 15, 16 and 17 respectively. These figures show that the performance decreases as we move from 15 to 16 and is worst for 17.

To further investigate the cases where TTLG with the regression model did not perform well, the predicted kernels and the ideal kernels for each test case are plotted as line graphs (Figures 5.4, 5.7 and 5.10). Together with the performance graphs, these show that the cases with a significant difference between the maximum bandwidth (GB/s) achievable by TTLG during the transpose and that achieved using the predictive model are mostly due to incorrect kernel prediction, which is worst for tensors of size all 17.

Type     | Specification
CPU      | Intel Xeon CPU E5-2680 v4 @ 2.40GHz
GPU      | Nvidia Tesla P100-PCIE, 16GB global memory, ECC off
Software | Red Hat Enterprise Linux Server 7.3, CUDA 8.0.44, gcc 4.8.5, Nvidia Driver 387.26

Table 5.1: Hardware Configuration

Figures 5.5, 5.8 and 5.11 show the frequency of the percentage error (the difference between model-guided performance and best performance), grouped into buckets of width 5%. They indicate that most of the cases fall within the range 0 to 25 percent, but there still exist a finite number of cases with errors from 25% up to 60%. The errors within the 0-5% range are due to small variations in performance across repeated runs. The ones between 5% and 25% are due to incorrect selection of kernel parameters, and the remaining ones are due to execution of an incorrect kernel.

To improve the coarsening in the kernel FVIMatchG32, the maximum number of active thread blocks is calculated based on hardware specifications. Then, if including another dimension in the slice still keeps the total number of thread blocks to be launched greater than the maximum number of active thread blocks, the next dimension is also included in the slice. Figure 5.1 compares the performance of the previous implementation of the kernel FVIMatchG32 with the new implementation.
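A heavily hedged Python sketch of this coarsening rule (all names and the exact stopping condition are assumptions; the real logic lives in the kernel launch code):

```python
def coarsen_slice(dim_sizes, num_blocks, max_active_blocks):
    """Fold successive dimensions into the slice while the launch would
    still exceed the number of thread blocks the GPU can keep active."""
    included = []
    for d, extent in enumerate(dim_sizes):
        if num_blocks // extent > max_active_blocks:
            included.append(d)       # fold this dimension into the slice
            num_blocks //= extent    # correspondingly fewer blocks launched
        else:
            break
    return included, num_blocks
```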

33 600

500

400

300

Bandwidth(GB/s) 200

100

0

0 1 2 3 5 4 5 3 2 1 0 5 3 4 2 1 0 3 5 4 2 1 0 4 3 5 2 1 0 3 4 5 2 1 0 5 4 2 3 1 0 4 5 2 3 1 0 5 2 4 3 1 0 2 5 4 3 1 0 4 2 5 3 1 0 2 4 5 3 1 0 5 3 2 4 1 0 3 5 2 4 1 0 5 2 3 4 1 0 2 5 3 4 1 0 3 2 5 4 1 0 2 3 5 4 1 0 4 3 2 5 1 0 3 4 2 5 1 0 4 2 3 5 1 0 2 4 3 5 1 0 3 2 4 5 1 0 2 3 4 5 1 0 ttlg_best_prev ttlg_best

Figure 5.1: Performance Comparison of previous versus new implementation of FVIMatchG32

Kernel            | MAEP (Linear): Test-Train split (25:75) | MAEP (Linear): Evaluation Dataset | MAEP (Random Forest): Test-Train split (25:75) | MAEP (Random Forest): Evaluation Dataset
FVIMatchL32       | 11.279 | 9.11   | 14.269 | 9.875
FVIMatchG32       | 14.295 | 12.24  | 10.153 | 9.184
FVINoMatchG32     | 24.234 | 20.897 | 17.395 | 16.563
FVINoMatchGeneral | 24.367 | 19.015 | 10.963 | 18.627
FVINoMatchOverlap | 19.537 | 14.374 | 16.997 | 11.734

Table 5.2: Mean of Absolute Error Percentage for Linear and Random Forest Regression

34 30

25

20

15 MAEP

10

5

0 FVIMatchL32 FVIMatchG32 FVINoMatchG32 FVINoMatchGeneral FVINoMatchOverlap MAEP(Linear Regression) Test train split(25:75) MAEP(Linear Regression) Evaluation dataset MAEP(Random Forest) Test train split(25:75) MAEP(Random Forest) Evaluation dataset

Figure 5.2: MAEP during training phase and prediction phase


Figure 5.3: TTLG Performance on All 15

35 9

8

7

6

5 Kernel Kernel Name 4 No. Kernel No. Kernel 0 FVIMatchL32 3 1 FVIMatchG32 2 6 FVINoMatchG32 7 FVINoMatchGeneral 1 8 FVINoMatchOverlap

0

1 2 5 0 4 3 4 0 5 2 1 2 5 0 1 3 4 0 1 2 3 5 4 5 3 2 1 0 3 4 2 5 1 0 3 5 1 4 2 0 4 5 1 2 3 0 3 5 2 1 4 0 2 3 1 5 4 0 2 4 1 3 5 0 4 5 2 3 0 1 4 5 3 0 2 1 2 5 0 4 3 1 3 5 0 2 4 1 3 4 2 0 5 1 2 3 0 4 5 1 3 5 1 4 0 2 4 5 0 3 1 2 4 5 1 0 3 2 1 4 0 5 3 2 1 5 0 3 4 2 3 4 0 1 5 2 4 5 2 1 0 3 2 4 1 5 0 3 2 5 0 4 1 3 4 5 0 1 2 3 2 5 1 0 4 3 1 2 0 5 4 3 1 4 0 2 5 3 3 5 1 2 0 4 3 5 2 0 1 4 2 3 0 5 1 4 1 5 0 3 2 4 2 3 1 0 5 4 1 2 0 3 5 4 2 4 1 3 0 5 3 4 0 2 1 5 3 4 1 0 2 5 1 3 0 4 2 5 1 4 0 2 3 5 2 3 0 1 4 5 best_kernel lr_kernel

Figure 5.4: Kernel Prediction by Model for All 15


Figure 5.5: Error Frequencies for All 15


Figure 5.6: Performance on All 16


Figure 5.7: Kernel Prediction by Model for All 16

37 450

400

350

300

250

200 Frequency

150

100

50

0 0-5 5-10 10-15 15-20 20-25 45-50 50-55 Error(%) range

Figure 5.8: Error Frequencies for All 16


Figure 5.9: Performance on All 17

38 9

8

7

6

5 Kernel Kernel Name No. 4

Kernel No. Kernel 0 FVIMatchL32 3 1 FVIMatchG32 2 6 FVINoMatchG32 7 FVINoMatchGeneral 1 8 FVINoMatchOverlap

0

0 4 5 1 3 2 3 1 5 4 0 3 4 2 0 5 1 1 4 0 2 3 5 0 1 2 3 5 4 5 3 2 1 0 3 4 2 5 1 0 3 5 1 4 2 0 4 5 1 2 3 0 3 5 2 1 4 0 2 4 1 3 5 0 4 5 2 3 0 1 4 5 3 0 2 1 3 4 0 5 2 1 2 5 0 4 3 1 3 5 0 2 4 1 2 3 0 4 5 1 3 5 1 4 0 2 4 5 0 3 1 2 4 5 1 0 3 2 1 4 0 5 3 2 1 5 0 3 4 2 3 4 0 1 5 2 4 5 2 1 0 3 2 4 1 5 0 3 2 5 0 4 1 3 4 5 0 1 2 3 2 5 1 0 4 3 1 2 0 5 4 3 1 4 0 2 5 3 3 5 1 2 0 4 3 5 2 0 1 4 2 3 0 5 1 4 1 5 0 3 2 4 2 5 0 1 3 4 2 3 1 0 5 4 1 2 0 3 5 4 2 4 1 3 0 5 3 4 0 2 1 5 3 4 1 0 2 5 1 3 0 4 2 5 2 3 0 1 4 5 Repeated Use best_kernel 1 Repeated Use lr_kernel 1

Figure 5.10: Kernel Prediction by Model for All 17


Figure 5.11: Error Frequencies for All 17

Chapter 6: Conclusion and Future Work

The results show that TTLG's performance with the regression model is best for the all 15 size tensors compared to the all 16 and all 17 size tensors. In most cases the model correctly predicts the optimal kernel and its parameters, but there are a finite number of cases where it does not select the right kernel for the transpose operation. This can be observed in Figures 5.3, 5.6 and 5.9 as deep troughs in the blue line in all three performance graphs. Although the performance using the model is still better than cuTT in most cases, the objective is to bring the model-guided performance of the library as close as possible to the absolute best for all test cases. This requires further analysis of nvprof metrics; a comparison between ideal and predicted kernel parameters should be done to come up with new features that capture the changes in performance more accurately.

An alternative approach could be to use two separate classifier models, one for kernel selection and the other for parameter selection. This way the regression problem reduces to a multi-class classification problem. The previous regression model can still be used for giving bandwidth estimates, which can be utilized for optimizing other complex operations that use transpose as an intermediate step. Random Forest regression can be explored further, as it has shown promising results in terms of accuracy and is faster than most of the other regression methods available; using it for classification would be a good choice.

A similar model can be built for tensor contraction based on the TTGT (transpose-transpose-GEMM-transpose) approach. We already have a model for the transpose step; we would need to build another performance prediction model for the GEMM (General Matrix Multiplication) operation.

Bibliography

[1] So Hirata. Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. The Journal of Physical Chemistry A, 107(46):9887–9897, 2003.

[2] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

[3] Antti-Pekka Hynninen and Dmitry I Lyakh. cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs. arXiv preprint arXiv:1705.01598, 2017.

[4] Jyothi Vedurada, Arjun Suresh, Aravind Sukumaran-Rajam, Jinsung Kim, Changwan Hong, Sriram Krishnamoorthy, Ajay Panyala, Rohit Kumar Srivastava, and P. Sadayappan. TTLG: An Efficient Tensor Transpose Library for GPUs. IPDPS, May 2018.

[5] Michael H Kutner, Chris Nachtsheim, and John Neter. Applied linear regression models. McGraw-Hill/Irwin, 2004.

[6] Andy Liaw, Matthew Wiener, et al. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

[7] Nvidia. CUDA Programming Guide, 2010.

[8] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. TTC: A tensor transposition compiler for multiple architectures. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, pages 41–46. ACM, 2016.

[9] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. 2005.

[10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

42 [11] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
