
High-Performance Sparse Matrix Multi-Vector Multiplication on Multi-Core Architectures

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Kunal Singh,

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Master’s Examination Committee:

Dr. P. Sadayappan, Advisor

Dr. Atanas Rountev

© Copyright by

Kunal Singh

2018

Abstract

SpMM is a widely used primitive in many domains like Fluid Dynamics, Data Analytics, Economic Modelling and Graph Analytics. In the Machine Learning and Artificial Neural Network domain, SpMM is used iteratively and is the main bottleneck in many kernels. Due to its prime importance, many Machine Learning frameworks like Tensorflow, PyTorch, etc. offer SpMM as a primitive. When compared to SpMV, SpMM has a higher theoretical operational intensity. However, the fraction of roofline performance achieved by SpMM is lower than that of SpMV, suggesting possible improvements. In this thesis, we systematically explore different design choices for the SpMM primitive and develop a high-performance SpMM implementation targeted at multi-core and many-core architectures. In addition, we also developed an analytical model to guide the tile size selection. As shown in our experimental section, we achieve up to 3.4x speedup when compared to the Intel MKL library.

This thesis is dedicated to my parents

Acknowledgments

Foremost, I would like to express my sincerest gratitude to my advisor Professor

P. Sadayappan for giving me an opportunity to work with him. This thesis would not have been possible without his immense support and guidance. I am grateful for his invaluable advice, motivation and his ever-positive spirit. His hard-working attitude motivates everyone around him to work harder and push oneself beyond the limit.

I am grateful to Professor Atanas Rountev for his valuable insight and guidance.

Valuable discussions with him helped me immensely in this thesis. He has been an amazing teacher and got me interested in compilers. I cannot thank him enough.

I will always be indebted to Dr. Aravind Sukumaran Rajam for his support and guidance. He has gone above and beyond to help me in every part of my thesis from its inception to the writing.

I would also like to thank my lab mates Changwan, Vineeth, Israt, Prashant, Emre, Gordon and Rohit for the intellectual discussions and support. I would also like to thank my friends Harsh, Piyush, Pragya, Pravar, Anshu, Anand, Shashank and Atul for being such awesome people. They have always supported me through my difficult times and I am indebted to them in every way. I have enjoyed the time I spent with them playing, traveling and studying. They are some of the smartest and most hardworking people I have ever met and I will always look up to them.

My acknowledgment would be incomplete without thanking my colleagues Ian Cottingham, Konstantin Tereshko, Nitin Sinha, Madhavi Anand and Deepak Jha for their guidance and support. They have helped me grow professionally and provided me invaluable life experience.

Finally, I would like to thank my parents for their immense support and love.

Words could never be enough to express my gratitude towards them. They have always supported me unconditionally and have been my inspiration in life. I would not be the person I am today without their encouragement and confidence in me.

Vita

February 23, 1992 ...... Born - Ranchi, India

2013 ...... B.E.
2013-2016 ...... Senior Systems Engineer, Siemens Healthcare, India
2017-present ...... Graduate Research Associate, The Ohio State University

Fields of Study

Major : Computer Science and Engineering

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

List of Tables

List of Figures

1. Introduction
   1.1 Motivation
   1.2 SpMM formulation
   1.3 Challenges in parallel SpMM
   1.4 Contributions
   1.5 Organization of this thesis

2. Background
   2.1 Related Work
       2.1.1 Taco
       2.1.2 Intel Math Kernel Library
       2.1.3 Compressed Sparse Blocks based SpMM

3. Analysis of SpMM

4. SpMM: Sparse Matrix Multi-Vector Multiplication
   4.0.1 Overview
   4.0.2 Data structure
   4.0.3 Algorithm
   4.0.4 Optimizations
   4.0.5 Model
   4.0.6 Reordering Matrices

5. Experiments
   5.0.1 Dataset
   5.0.2 SpMM CPU
   5.0.3 SpMM comparison with CuSparse on GPU
   5.0.4 Reordering
   5.0.5 Pre-processing

6. Conclusion and Future Work
   6.1 Conclusion
   6.2 Future Work

Bibliography

List of Tables

5.1 Data set

List of Figures

1.1 SpMM
3.1 Streaming SpMM
3.2 Data Movement Analysis
4.1 Blocked double compressed sparse column representation
4.2 Our SpMM Algorithm
4.3 Heat Map for mip1 in GFLOPS
4.4 Heat Map for Si41Ge41H72 in GFLOPS
5.1 SpMM KNL double precision lower K
5.2 SpMM KNL double precision higher K
5.3 SpMM Xeon E5 double precision lower K
5.4 SpMM Xeon E5 double precision higher K
5.5 Our algorithm on Intel KNL vs CuSparse on Nvidia K80 and Nvidia P100
5.6 Performance after reordering matrices
5.7 Normalized pre-processing cost

Chapter 1: Introduction

1.1 Motivation

Sparse Matrix Multi-vector multiplication (SpMM) or Sparse Matrix Dense Matrix multiplication (SpMDM) is a widely used kernel in many domains like Fluid Dynamics, Data Analytics, Economic Modelling, Graph BLAS [15, 16] and Machine Learning. In areas like Machine Learning and Artificial Neural Networks, SpMM is used iteratively, making it an important kernel in Machine Learning frameworks like Tensorflow [2] and PyTorch [29]. Several recent efforts have sought to exploit sparsity in neural networks using an SpMM formulation [23, 13, 17].

Examples of the use of SpMM from numerical simulation include the Locally

Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method for finding eigenvalues of a matrix [22, 4], and iterative solvers with multiple right-hand sides, like the Krylov sub-space iterative solvers which have SpMV at their core. Although

SpMM can be implemented as a sequential iterative loop over SpMV, better data reuse and higher performance can be achieved by a direct implementation of SpMM.

[Figure: the sparse matrix S (M x N) is multiplied with the dense input matrix I (N x K) to form the dense output matrix O (M x K); only the non-zero elements of S contribute to the result.]

Figure 1.1: SpMM

1.2 SpMM formulation

In Sparse Matrix Multi-vector multiplication (SpMM), a sparse matrix is multi- plied with a dense matrix to form a dense output matrix. SpMM can be defined as

O(M,K) = S(M,N) × I(N,K), where matrix S is the input sparse matrix of size

M × N, I is the input dense matrix of size N × K and O is the output dense matrix of size M × K. The majority of elements of the sparse matrix S are zeros, therefore it is generally stored in compressed formats like Compressed Sparse Row (CSR),

Compressed Sparse Column (CSC) and Compressed Co-ordinate (COO)[31]. Only the non-zero elements of the sparse matrix contribute to the result, as can be seen in

figure 1.1.

SpMM can also be performed by executing Sparse Matrix Vector multiplication

(SpMV) repeatedly by using different columns of input dense matrix I as vectors.

This method can never achieve optimal performance due to the fact that SpMV itself has a lower roofline performance than SpMM. The actual performance achieved by SpMV is even lower [10].

1.3 Challenges in parallel SpMM

There are two major challenges in designing a parallel SpMM algorithm. The first is achieving work-load balance across the threads. Unlike dense matrix-matrix multiplication,

SpMM has to multiply a sparse matrix with a dense matrix. The major issue here is that the sparse matrices can have very diverse sparsity patterns, some can have power law distribution and some may have non-zeros clustered only in a few rows and columns. This can create an imbalance when assigning rows or columns to threads and makes it difficult for an algorithm to work efficiently on all matrices.

The other major challenge is data reuse. When using CSR or CSC representation for SpMM it is very difficult to get data reuse on both the input I and output O dense matrices. In case of CSR we get reuse on the output matrix while going row-wise, but the elements of the input dense matrix are evicted from the cache before being reused.

Similarly, in CSC we get reuse on the input dense matrix while traversing along columns, but we lose the reuse on the output dense matrix. Depending upon the algorithm, the elements of the sparse or dense matrix have to be read multiple times. If these elements are not cached then a slower main memory access is required which considerably reduces the performance.

1.4 Contributions

This thesis makes the following contributions:

• Systematic exploration of streaming choices for SpMM

• New SpMM algorithm

• Model-driven tile size selection for SpMM

• Extensive evaluation over a wide range of sparse matrices

1.5 Organization of this thesis

The rest of this thesis is organized in the following manner: Chapter 2 presents the background of SpMM along with the architecture of Intel Xeon Phi. Chapter 3 presents the analysis of streaming choices for SpMM and formulates the data movement. Chapter 4 presents our SpMM algorithm and its implementation details.

Chapter 5 presents the experimental evaluation and comparisons. Chapter 6 provides the conclusion and future work.

Chapter 2: Background

SpMM can be defined as O(M,K) = S(M,N) × I(N,K), where matrix S is

a sparse matrix and I and O are dense matrices. A majority of elements of real-

world matrices are zeros and these elements do not contribute to the matrix product.

Representing them as a dense array uses up a huge amount of unnecessary memory

and precious CPU cycles are wasted in computing zero valued product. This moti-

vates one to use a different representation for sparse matrices to save storage space

and improve performance. The most commonly used sparse is

Compressed Sparse Row or CSR representation[31]. In CSR we maintain three arrays

rowptr, colidx and values. The index of rowptr array denotes the row and

its value represents the starting offset of elements in colidx and values array which

corresponds to this row. The values array stores the actual non-zero values and the colidx array stores the column numbers of the non-zero elements. Since the representation of a sparse matrix is different, we also need a different specialized algorithm to effectively exploit this structure. Algorithm 1 shows a naive algorithm to multiply a CSR matrix with a dense matrix.
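For illustration (this small example is not part of the thesis data), consider a 4 × 4 sparse matrix and its CSR arrays:

// A small example matrix (dots denote zeros):
//   [ 5 . 1 . ]
//   [ . 2 . . ]
//   [ . . . 3 ]
//   [ 4 . . 6 ]
// CSR arrays as described above:
int   rowptr[] = {0, 2, 3, 4, 6};     // rowptr[i] .. rowptr[i+1]-1 index the entries of row i
int   colidx[] = {0, 2, 1, 3, 0, 3};  // column of each stored non-zero
float values[] = {5, 1, 2, 3, 4, 6};  // the non-zero values themselves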

This fairly simple algorithm faces many issues when executed which severely degrade its performance. One of the most common issues is that the elements of the input matrix

Algorithm 1: Sequential SpMM
input : CSR S[M][N], float I[N][K]
output: float O[M][K]
for i = 0 to M-1 do
    for j = S.rowptr[i] to S.rowptr[i+1]-1 do
        for k = 0 to K-1 do
            O[i][k] += S.value[j] * I[S.colidx[j]][k]
        end
    end
end

I are evicted from the cache and get no reuse. The whole matrix I is used for calculating one row of matrix O, which puts a lot of pressure on the cache as matrix I has to be read again and again. This plays a major role in diminishing the performance.
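A minimal C++ sketch of this naive CSR-based SpMM is given below; the CSRMatrix struct and the row-major layout of I and O are assumptions made for illustration, not the implementation developed later in this thesis.

#include <vector>

// Minimal CSR container; the field names mirror the rowptr/colidx/values arrays described above.
struct CSRMatrix {
    int M, N;                      // number of rows and columns
    std::vector<int>   rowptr;     // size M+1
    std::vector<int>   colidx;     // size nnz
    std::vector<float> values;     // size nnz
};

// Naive sequential SpMM, O = S * I, following Algorithm 1.
// I (N x K) and O (M x K) are stored row-major; O is assumed to be zero-initialized.
void spmm_naive(const CSRMatrix& S, const float* I, float* O, int K) {
    for (int i = 0; i < S.M; ++i)
        for (int j = S.rowptr[i]; j < S.rowptr[i + 1]; ++j)
            for (int k = 0; k < K; ++k)
                O[i * K + k] += S.values[j] * I[S.colidx[j] * K + k];
}

// For every non-zero of row i, a full row of I is touched, which is why elements of I
// are evicted before they can be reused when M and N are large.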

To improve performance and to alleviate the above-mentioned issues, several optimizations have been proposed for both CPUs and GPUs [35, 21, 33]. Due to the highly SIMD structure of SpMM, Graphics Processing Units (GPUs) are a popular choice and a lot of previous work has been done to optimize SpMM on GPUs [24, 28].

But GPUs have one major drawback: they have a radically different architecture and x86-compliant programs have to be re-written according to the GPU APIs.

To write an optimized program for GPU one must be familiar with its APIs like

CUDA [26] or OpenCL [34] and have an in-depth knowledge of the GPU's architecture like streaming multiprocessors, thread blocks, shared memory, warps, etc. Moreover, in many applications we require serial operations and branched instructions in between matrix operations. In such cases a lot of time can be wasted in copying the data to and from the GPU. The Xeon Phi platform solves these issues by allowing us to run the same x86 code without any modifications. The OpenMP "parallel" pragma used on traditional x86 CPUs is enough to run parallel code on Xeon Phi [30].

Furthermore, enabling vector instructions (using -xMIC-AVX512) while compiling will exploit SIMD parallelization on the 512-bit vector units. The data transfer between host and device is not required as the Xeon Phi can efficiently execute sequential instructions. Moreover, the high-bandwidth MCDRAM [30] further reduces the cost of accessing the slower main memory. Even with so many automatic optimizations, there is still a lot of scope left to optimize SpMM on Xeon Phi and very few previous research works [33, 11, 3] have considered this.

2.1 Related Work

Since SpMM is a widely used kernel, there has been a lot of previous research done to optimize it. A few state-of-the-art frameworks are described below.

2.1.1 Taco

Taco [18] is a C++ library which uses compiler techniques to generate kernels for tensor algebra operations. These operations can be on sparse or dense tensors of any dimensionality. The kernels generated are already optimized and use the OpenMP parallel pragma for parallelization. This library and its online code generation tool can be used to generate an SpMM kernel, where all the tensors are 2D.

2.1.2 Intel Math Kernel Library

Intel MKL is one of the most commonly used BLAS and Sparse BLAS libraries for

CPUs. This library has highly optimized kernels for many sparse BLAS operations like SpMM, SpMV and SpGEMM. MKL supports various matrix representations like

CSR, CSC, COO, etc. MKL library also supports AVX512 instructions and has

kernels optimized especially for the Xeon Phi architecture, which results in significant performance gains [1].

2.1.3 Compressed Sparse Blocks based SpMM

Compressed Sparse Blocks (CSB) is a sparse matrix storage format which parti- tions and stores the matrices in smaller square blocks. This representation does not require any extra space than the commonly used CSR or CSC representations. Using

CSB format for SpMM kernels shows significant improvement for SpMM as well as for SpMM with the transposed sparse matrix (SpMM_T) [3].

Chapter 3: Analysis of SpMM

This chapter highlights the effects of sparsity pattern on SpMM and presents an

algorithm to optimally tile the computation to minimize data movement from main

memory.

SpMM for multiplying an M × N sparse matrix with nnz non-zero elements with

an N × K dense matrix to produce a dense matrix of size M × K has a maximum operational intensity OImax = 2 × K × nnz / (4MK + 4NK + 12nnz), i.e., counting 4 bytes per dense-matrix element and 12 bytes per stored non-zero.

Algorithm 2: Tiled SpMM
input : CSR S[M][N], float I[N][K]
output: float O[M][K]
for ii = 0 to M-1 step Ti do
    for jj = 0 to N-1 step Tj do
        for kk = 0 to K-1 step Tk do
            for i = ii to min(ii + Ti, M)-1 do
                for j = s_tile[jj].rowptr[i] to s_tile[jj].rowptr[i+1]-1 do
                    for k = kk to min(kk + Tk, K)-1 do
                        O[i][k] += S.value[j] * I[S.colidx[j]][k]
                    end
                end
            end
        end
    end
end

[Figure: the three streaming choices for tiled SpMM; in each case a tile of one matrix (I, O, or S) is kept in fast memory while the computation streams along i, j, or k respectively, and the other two matrices are streamed.]

Figure 3.1: Streaming SpMM

Algorithm 1 shows the sequential pseudo-code for SpMM (O = S ∗ I) for a Com- pressed Sparse Row matrix format, where S is an M × N sparse input matrix, I is a dense input matrix (N × K), and O is the resulting dense output matrix (M × K).

Figure 1.1 depicts the operations involved in SpMM.

Many dense algorithms such as Dense-Dense matrix multiplication (DGeMM) employ a streaming approach to reduce the data movement volume. With streaming, one of the three matrices is held stationary and the loop index that does not explicitly appear in the indexing of that matrix is chosen as the streaming dimension.

The streaming choices for SpMM (and their impact) can be explained with the help of the tiled SpMM algorithm shown in Algorithm 2. Each of the three matrices (S, I, O) can be chosen as the stationary matrix; each matrix is indexed by two of the three loop dimensions and has one "independent dimension". The streaming choices are:

10 • Streaming along M(i): A tile of I of size T j × T k is kept stationary in fast

memory/ cache

• Streaming along N(j): A tile of O of size T i × T k is kept stationary in fast

memory/ cache

• Streaming along K(k): A tile of S of size T i × T j is kept stationary in fast

memory/ cache

In the tiled version, the streaming dimension can be represented by the innermost tile dimension. A tile of the matrix that is not characterized by the streaming dimension is kept stationary in fast memory. Figure 3.1 depicts the streaming choices.

i) Streaming along M(i): In this case each I element is only read once and gets

the full reuse. Hence the total volume of data moved for I is N × K. Each S

element is read in once for every T k tile. Hence the data movement volume for S

is (nnz × K)/T k. A simple approximation for the number of times an element of

O has to be read and written is (2 × N)/T j; in other words, for each tile of size

T j each O element is read and written once. However, depending on the sparsity

level and sparsity structure, there may be many empty rows in a tile of size T j, in which case the corresponding O elements are not brought into memory. The total volume of O elements, after accounting for empty rows of S, can be expressed as

(2 × nars(T j) × K) where nars(T j) represents the number of active rows, which is a function of T j. In other words, for every active row, we have to read and write K elements of O. Thus the total volume is:

(N + nnz/T k + 2 × nars(T j)) × K

11 ii) Streaming along N(j): This scheme is similar to streaming along M. This scheme keeps O stationary. Hence, the total volume of data transferred for O is

M × K. Each S element is brought into memory once for every T k tile. Hence, the data movement volume for S is (nnz × K)/T k. Similar to O when streaming along

M, the total data transfer volume for I can be expressed as (nacs(T i) × K) where nacs(T i) represents the number of active column-segments, which is a function of T i.

Thus the total volume is:

(M + nnz/T k + nacs(T i)) × K iii) Streaming along K(k): In this scheme the S matrix is kept stationary, as depicted in Algorithm 3. Hence, the total volume of data transferred for S is nnz.

Similar to streaming along M, the total data volume transferred for O is

(2 × nars(T j) × K). Similar to when streaming along N, the total data transfer volume for I is (nacs(T i) × K). Thus the total volume is:

nnz + (2 × nars(T j) + nacs(T i)) × K

Algorithm 3: SpMM streaming along K(k)
input : CSR S[M][N], float I[N][K]
output: float O[M][K]
for ii = 0 to M-1 step Ti do
    for jj = 0 to N-1 step Tj do
        for k = 0 to K-1 step 1 do
            for i = ii to min(ii + Ti, M)-1 do
                for j = s_tile[jj].rowptr[i] to s_tile[jj].rowptr[i+1]-1 do
                    O[i][k] += S.value[j] * I[S.colidx[j]][k]
                end
            end
        end
    end
end
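The three total-volume expressions can be evaluated together; the sketch below (an illustration, not code from this thesis) computes them in matrix elements, where nars and nacs must be measured from the sparsity structure for the chosen T j and T i.

// Data-movement volumes (in matrix elements) for the three streaming schemes.
// nars = number of active rows per Tj-wide column block of S,
// nacs = number of active column segments per Ti-tall row panel of S.
struct Volumes { double streamM, streamN, streamK; };

Volumes data_movement(double M, double N, double K, double nnz,
                      double Tk, double nars, double nacs) {
    Volumes v;
    v.streamM = (N + nnz / Tk + 2.0 * nars) * K;   // case (i):  I stationary
    v.streamN = (M + nnz / Tk + nacs) * K;         // case (ii): O stationary
    v.streamK = nnz + (2.0 * nars + nacs) * K;     // case (iii): S stationary
    return v;
}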

The analysis for case (i) and case (ii) is similar. Here, we present the analysis for case (ii). In case (ii), O is kept in fast memory. The tile size for O is limited by the fast memory capacity. Thus T i × T k < C, where C is the capacity of fast memory. The higher

T i is, the lower the data movement cost for I. Similarly, increasing T k lowers the amount of data movement for S. Let nnz_per_col_seg(T i) be the average number of non-zero elements per active column segment of height T i in S. Then, the total number of non-zero elements is:

nnz = nacs(T i) × nnz_per_col_seg(T i) (3.1)

From 3.1:

nacs(T i) = nnz / nnz_per_col_seg(T i) (3.2)

Our objective is:

Min((M + nnz/T k + nacs(T i)) × K) (3.3)

Subjected to the constraint

T i × T k ≤ C (3.4)

Since M and K are constants they can be removed from the minimization objec- tive. Thus the minimization objective from 3.3 can be re-written as:

Min((nnz)/T k + nacs(T i)) (3.5)

Equation 3.2 can be substituted in Equation 3.5 to obtain:

Min(nnz × (1/T k + 1/nnz_per_col_seg(T i))) (3.6)

Since nnz is constant, the minimization objective can be written as

Min(1/T k + 1/nnz_per_col_seg(T i)) (3.7)

The optimal sizes of T i and T k (if treated as real variables to enable analytical closed form solution of the optimization problem) will result in equal contributions to the above expression to be minimized.

Figure 3.2 contains the data movement calculation and operational intensity for two matrices, Bone010 and Cage12 for K = 64 and 128. These two matrices are chosen for this evaluation as they have the most contrasting performance, Bone010 shows very high GFLOPS for SpMM and Cage12 on the other hand shows the lowest.

From Figure 3.2 it can be observed that the matrix Cage12 has a low operational intensity and a low FLOPS/data movement ratio, which is one of the factors causing its low performance.

BONE010
K    TI   TK   NACS     DATA MOVEMENT  Operational Intensity  FLOPS/DATA MOVEMENT
64   64   64   8730231  693550000      6.72                   13.23
64   64   32   8730231  765216000      6.72                   11.99
64   64   16   8730231  908549000      6.72                   10.10
64   64   8    8730231  1195210000     6.72                   7.68
64   128  64   7392162  607914000      6.72                   15.09
64   128  32   7392162  679580000      6.72                   13.50
64   128  16   7392162  822913000      6.72                   11.15
64   128  8    7392162  1109580000     6.72                   8.27
128  64   8    8730231  2390430000     9.81                   7.68
128  64   16   8730231  1817100000     9.81                   10.10
128  64   32   8730231  1530430000     9.81                   11.99
128  64   64   8730231  1387100000     9.81                   13.23
128  64   128  8730231  1315430000     9.81                   13.95
128  128  8    7392162  2219160000     9.81                   8.27
128  128  16   7392162  1645830000     9.81                   11.15
128  128  32   7392162  1359160000     9.81                   13.50
128  128  64   7392162  1215830000     9.81                   15.09
128  128  128  7392162  1144160000     9.81                   16.03

CAGE12
K    TI   TK   NACS     DATA MOVEMENT  Operational Intensity  FLOPS/DATA MOVEMENT
64   64   64   1403454  100188000      2.86                   2.60
64   64   32   1403454  102221000      2.86                   2.55
64   64   16   1403454  106286000      2.86                   2.45
64   64   8    1403454  114416000      2.86                   2.27
64   128  64   1268461  91548600       2.86                   2.84
64   128  32   1268461  93581200       2.86                   2.78
64   128  16   1268461  97646200       2.86                   2.66
64   128  8    1268461  105776000      2.86                   2.46
128  64   8    1403454  228832000      3.30                   2.27
128  64   16   1403454  212572000      3.30                   2.45
128  64   32   1403454  204441000      3.30                   2.55
128  64   64   1403454  200376000      3.30                   2.60
128  64   128  1403454  198344000      3.30                   2.62
128  128  8    1268461  211553000      3.30                   2.46
128  128  16   1268461  195292000      3.30                   2.66
128  128  32   1268461  187162000      3.30                   2.78
128  128  64   1268461  183097000      3.30                   2.84
128  128  128  1268461  181065000      3.30                   2.87

Figure 3.2: Data Movement Analysis

Chapter 4: SpMM: Sparse Matrix Multi-Vector Multiplication

4.0.1 Overview

This section discusses the new algorithm, the data structures used and the opti- mization techniques involved.

The performance of a SpMM algorithm depends on several factors ranging from the processor architecture to the characteristics of the matrices. For Intel Xeon Phi one of the most important factors to achieve high performance is to make sure the code is vectorized and the 512-bit vector units are optimally used. We also need to make sure all the cores of the processor have equal work allocated to them, otherwise the performance deteriorates due to load imbalance caused by the varied sparsity patterns.

Latency hiding should also be incorporated as much as possible by designing schemes to optimally reuse elements already in the cache memory.

In this section scheme 2 from Figure 3.1 is used as it proved to be the fastest for

Intel KNL in the experiments conducted. The algorithm optimizes SpMM by reducing data communication and by exploiting temporal locality of the output matrix. Intel

Xeon Phi Processor 7250 has a total of 34 MB of L2 cache, where every tile has 1 MB of L2 cache shared between its two cores. Our algorithm revolves around effectively using this cache and creating data access patterns to aid automatic vectorization.

16 4.0.2 Data structure

The data structure is designed to promote the maximum reuse of the output matrix

elements. The reuse of the output matrix is of utmost importance as it requires both

read and write operations. The lack of an L3 cache in Intel Xeon Phi causes a few restrictions in designing a good data structure while maintaining a high cache hit ratio. The 1 MB L2 cache of Intel KNL is shared between the two cores of a tile and

there are a total of 34 such tiles. One shortcoming of this design is that when the same

cache is being used by multiple cores, each core will maintain its own copy of the

cache line in its local L2 cache. This can reduce the overall L2 cache capacity for the

shared memory parallel application. The work distribution and data access pattern

promoted by the data structure helps to alleviate this issue.

The sparse matrix S is split into several row_panels (T i × N) [Figure 3.1, case

2] based on a model discussed later and is stored in blocked double compressed CSC

representation. We maintain a simple array segment whose elements contain offsets to a row_panel. Each row_panel is stored in double compressed sparse column (dCSC) format [6, 7, 5, 12]. For each dCSC matrix we maintain 4 arrays, namely column_number, column_index, row_number and values, shown in Figure 4.1. The array column_number stores the column number only if there are non-zeros in that column of that row_panel. The array column_index points to the elements of the column in the arrays row_number and values using offsets. The array row_number stores the row number of the non-zero element and the values array stores the actual non-zero element. Both the column_number and column_index arrays have the size nacs (number of active column segments) for the corresponding row_panel, whereas row_number and values have the size nnz, i.e., the number of non-zero elements present in the row_panel. Using a double compressed representation in our case saves a significant amount of storage space, especially for hyper-sparse matrices, as in this representation we don't have to maintain the index of columns which do not have any elements in a row_panel. Using the dCSC representation also aids in increasing the cache hits on the input matrix, which is not possible in case of the CSR representation. This helps us to hide the latency and also improves the L1 cache reuse. Since the input matrix is dense, achieving vectorization is quite simple if the innermost loop iterates over the K dimension.

Figure 4.1: Blocked double compressed sparse column representation
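A C++ sketch of this layout is shown below; the struct wrappers are illustrative assumptions, but the four array names follow the description above.

#include <vector>

// One row_panel in double compressed sparse column (dCSC) form.
// column_number and column_index have one entry per active column segment (nacs);
// row_number and values have one entry per non-zero of the panel.
struct RowPanelDCSC {
    std::vector<int>   column_number;  // column id of each active column segment
    std::vector<int>   column_index;   // offset of that segment's entries in row_number/values
    std::vector<int>   row_number;     // row of each non-zero
    std::vector<float> values;         // the non-zero values
};

// The blocked matrix: in the text, the segment array gives the offset of each
// row_panel; here each panel is simply stored as one struct for clarity.
struct BlockedDCSC {
    int Ti;                            // row_partition width (panel height)
    std::vector<RowPanelDCSC> panels;
};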

4.0.3 Algorithm

Algorithm 4: Our SpMM algorithm
input : CSR S[M][N], float I[N][K]
output: float O[M][K]
for row_panel = 0 to nseg-1 step 1 do
    for kt = 0 to K-1 step Tk do
        for j = segment[row_panel] to segment[row_panel+1]-1 step 1 do
            for i = 0 to column_index[j+1] - column_index[j] - 1 step 1 do
                for k = kt to kt + Tk - 1 step 1 do
                    O[row_number[i + column_index[j]] * K + k] +=
                        S[i + column_index[j]] * I[column_number[j] * K + k]
                end
            end
        end
    end
end

Once we split the sparse matrix S into multiple row_panels and store them using our representation, the row_panels are distributed among the threads using

OpenMP’s dynamic schedule. Each row_panel is processed by only one thread to

[Figure: the sparse matrix S is split into row panels of height Ti, each assigned to one thread; the dense matrix I is split into vertical slices of width Tk; each thread streams along j over one slice at a time, producing the corresponding Ti × Tk block of O, where O(i, k) = S(i, j) x I(j, k).]

Figure 4.2: Our SpMM Algorithm

20 avoid any contention. The matrix I is also split, theoretically, into slices and a thread

working on a row_panel goes over these slices one by one.

The product computation for one thread goes as follows: The thread iterates

over one column segment of the row_panel and multiplies it with one row of the

current slice of the input dense matrix I. This results in a partial product of output

matrix O’s (T i×T k) block. The thread then moves to the next column segment with non-zeros and multiplies it with the corresponding row of matrix I’s slice. Once the

thread finishes iteration over all the column segments of the row_panel we get the

final (T i × T k) block of the output matrix. The thread repeats the above steps using

the next slice of matrix I. As soon as a thread completes processing its row_panel,

it moves to the next row_panel assigned dynamically.

Listing 4.1: SpMM code
 1  #pragma ivdep
 2  #pragma vector aligned
 3  #pragma omp parallel for num_threads(136) schedule(dynamic, 1)
 4  for(int row_panel=0; row_panel
        // ... lines 5-34 omitted ...
35            {
36              O[row_number[i + 0 + column_index[j]] * K + k] +=
37                S[i + 0 + column_index[j]] * I[colNumber * K + k];
38              O[row_number[i + 1 + column_index[j]] * K + k] +=
39                S[i + 1 + column_index[j]] * I[colNumber * K + k];
40              O[row_number[i + 2 + column_index[j]] * K + k] +=
41                S[i + 2 + column_index[j]] * I[colNumber * K + k];
42              O[row_number[i + 3 + column_index[j]] * K + k] +=
43                S[i + 3 + column_index[j]] * I[colNumber * K + k];
44              O[row_number[i + 4 + column_index[j]] * K + k] +=
45                S[i + 4 + column_index[j]] * I[colNumber * K + k];
46              O[row_number[i + 5 + column_index[j]] * K + k] +=
47                S[i + 5 + column_index[j]] * I[colNumber * K + k];
48              O[row_number[i + 6 + column_index[j]] * K + k] +=
49                S[i + 6 + column_index[j]] * I[colNumber * K + k];
50              O[row_number[i + 7 + column_index[j]] * K + k] +=
51                S[i + 7 + column_index[j]] * I[colNumber * K + k];
52            }
53          }
54        }
55      }
56  }

4.0.4 Optimizations

The code with all optimizations is present in Listing 4.1.

This algorithm is designed to use several optimization techniques while tuning

them to benefit from Intel KNL’s architecture.

1. Blocking/Tiling: Blocking is performed on input matrix S by splitting it into

several row_partitions. The first loop [line 4, Listing 4.1] chooses one block

or row_panel of input matrix S. Blocking helps in exploiting both spatial and

temporal locality. Blocking input sparse matrix S also automatically splits the

output matrix O into blocks. This reduces the relevant size of the output matrix O for a thread, making it possible to store the block of matrix O in the L2

cache and reap the benefits of spatial locality. In our case, blocking also helps

to avoid contention as only one thread works on one row_panel at a time.

2. Streaming and Slicing: Streaming as explained earlier in Chapter 3 is done along

the N(j) dimension (rows of matrix S) in line 10 of Listing 4.1. Slicing is performed by the second loop, line 8 in Listing 4.1. Slicing, similar to blocking, divides

the input matrix I into vertical slices. The resultant block of dense output

matrix O after blocking/tiling step is still often bigger than the available L2

cache. This happens when the input dense matrix I is wide (K greater than

512). So, slicing matrix I, in turn, splits the block of output matrix O into even

smaller vertical blocks. This gives us enough reduction to maintain the block

of output matrix in the cache.

One thing to notice here is that we can also reduce the width of row_partition

of S to reduce the size of the output matrix’s block instead of slicing matrix I.

But this approach reduces the performance as matrix I has to be read again

and again for each row_partition and this approach would increase the number

of row_panels.

3. Unroll and Jam: The innermost loops can be explicitly unrolled or we can use

#pragma unroll. Unrolling can sometimes help the compiler in vectorization,

though Intel's ICC compiler generally has no trouble in generating vector instructions. When the innermost loop is small, i.e., K is very low, jamming

can help to improve the performance by providing more work in the innermost

loop. Jamming in our situation is a bit tricky as the number of rows in a column

is not predefined. We split the "i" loop [line 15] into two parts: in the first part the peel loop runs for number_of_rows mod 8, and the second loop runs in multiples of 8 with the multiplication instructions explicitly jammed (a sketch of this structure is given after this list). This unrolling and jamming also enables us to get reuse on the rows of matrix I.

23 4. Temporal/ nontemporal: The pragma vector temporal [1] instructs the ICC

compiler to use temporal or non-streaming stores. This enables us to maintain

the T i × T k portion of the O matrix in the L2 cache. The nontemporal argument

instructs the compiler to use non-temporal or streaming stores [1]. This is

useful to evict the elements of S matrix as they won’t be used ever again.

5. Parallelization: Parallelization is performed in the first loop [line 3] which is

preceded by the omp parallel for pragma [8]. Each thread works on one row_panel of

the sparse matrix S, at a time, which is scheduled using OpenMP’s dynamic

scheduling. This removes the necessity of any costly atomic operations as only

one thread works on the part of output matrix O corresponding to a row_panel

(thread 1 marked with green in Figure 4.2). The input dense matrix I is

also partitioned into slices, so one thread completes multiplying row_panel(1)

with slice(1), then moves on to multiply the same row_panel(1) with the next

slice(2). This way there are no write conflicts.

6. Vectorization: Vectorization (SIMD) provides us data level parallelism and is

done in the innermost loop. The innermost loops iterate for K or the slice_size

Kt, which is predetermined, and this gives us ample opportunity to exploit the

Advanced Vector Extensions 512 instructions (AVX-512). Our code is compiled

using the xMIC-AVX512 flag which instructs Intel’s ICC compiler to generate

AVX-512 instructions and we rely on its automatic vectorization. For other

compilers, their specific 512-bit vectorization flag should be used; for example, for GCC use -mavx512f.
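The sketch below illustrates the peel-plus-jammed structure of item 3 for one active column segment of a row_panel (compare lines 35-51 of Listing 4.1); the function and parameter names are assumptions made for illustration.

// Processes one active column segment j of a row_panel for one Tk-wide slice of I.
// S_values, row_number and column_index follow the dCSC layout; colNumber is the
// dense-matrix row corresponding to segment j.
inline void process_segment(const float* S_values, const int* row_number,
                            const int* column_index, int j, int colNumber,
                            const float* I, float* O, int K, int kt, int Tk) {
    const int base  = column_index[j];
    const int nrows = column_index[j + 1] - column_index[j];
    const int peel  = nrows % 8;

    // Peel loop: leftover rows handled one at a time.
    for (int i = 0; i < peel; ++i)
        for (int k = kt; k < kt + Tk; ++k)
            O[row_number[base + i] * K + k] += S_values[base + i] * I[colNumber * K + k];

    // Jammed loop: 8 rows of the segment reuse the same I element for each k,
    // which the compiler can keep in a vector register.
    for (int i = peel; i < nrows; i += 8)
        for (int k = kt; k < kt + Tk; ++k)
            for (int u = 0; u < 8; ++u)
                O[row_number[base + i + u] * K + k] +=
                    S_values[base + i + u] * I[colNumber * K + k];
}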

4.0.5 Model

The choice of correct row_partition_width T i and slice width T k is essential for the optimal performance of our algorithm. Our Model chooses the optimal row_partition_width T i and slice width T k based on the input sparse matrix and dense matrix characteristics. We measure the standard deviation of the number of elements in the rows of the sparse matrix. If the standard deviation is found to be very high ( > 200 ) then the small row panel is selected ( row_panel_width =

16). Choosing small row panels helps in load-balancing as the dense rows are divided among different row_panels and adjacent panels are processed by different threads. Otherwise, a panel width of 256, 128 or 64 rows is selected depending upon the K dimension of the input dense matrix. As the K dimension increases, we decrease the row_partition_width. The effect of different T i and T k values can be seen in Figures 4.3 and 4.4. Matrix "mip1" has a high standard deviation of elements in a row (350), therefore higher performance can be seen when T i is small and work is evenly distributed. In matrices such as "Si41Ge41H72" the matrix elements are evenly distributed and the standard deviation of elements in a row is lower (126). In this case a larger T i shows better performance as the reuse on the input dense matrix increases.

Since our approach relies on the output matrix being in the L2 cache, we choose the slice size T k such that the total size of the portion of the output matrix being processed by a thread remains in its local L2 cache. This maximizes the L2 cache hits on the thread's own tile. Xeon Phi has 1 MB of L2 cache shared by 2 cores, so we assume 256 KB or 512 KB for each thread depending on the number of threads being used. For K ≤ 128 we use 136 threads and for K > 128 we use 64 threads. If K is large (> 512) then we set T k to 512 so that the block of the output matrix still fits in the L2 cache.

Figure 4.3: Heat Map for mip1 in GFLOPS

Figure 4.4: Heat Map for Si41Ge41H72 in GFLOPS

The values of T i and T k are selected by calculating data movement and using performance data collected by executing SpMM with varying choices of T i and T k

over a large set of sparse matrices.
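A sketch of this selection heuristic is given below; the thresholds stated in the text (standard deviation > 200, panel widths of 16/256/128/64, and T k capped at 512) are kept, while the exact K break-points separating the 256/128/64 cases and the function name are assumptions made for illustration.

// Chooses the row_partition_width Ti and slice width Tk.
void choose_tiles(double row_nnz_stddev, int K, int& Ti, int& Tk) {
    if (row_nnz_stddev > 200.0) {
        Ti = 16;                  // small panels for load balance on skewed matrices
    } else if (K <= 64) {         // assumed break-point
        Ti = 256;
    } else if (K <= 256) {        // assumed break-point
        Ti = 128;
    } else {
        Ti = 64;                  // wider dense matrix -> narrower row panel
    }
    Tk = (K > 512) ? 512 : K;     // slice I so the O block stays in the L2 cache
}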

4.0.6 Reordering Matrices

In previous sections we saw that increasing the non-zeros per active column segment ratio can increase reuse and the performance of the SpMM kernel. Thus, the rows of the sparse matrix are reordered to increase this ratio. For reordering the matrix we

used hypergraph partitioning [20]. For a sparse matrix S, we define the hypergraph

H(V,N) as following:

V = {v_i : ∃ j such that (i, j) ∈ S},
N = {n_j : ∃ i such that (i, j) ∈ S}, where n_j = {v_i : (i, j) ∈ S}

Each row is a vertex and each column is a hyper-edge; the problem then turns into a dual problem: partition the hypergraph into parts with an equal number of vertices while minimizing the total number of hyper-edge cuts. The Partitioning Tool for Hypergraphs (PaToH) does exactly this optimization; therefore, PaToH is used to reorder the sparse matrices. After reordering, rows that fall into the same partition are moved next to each other. Since only the rows are reordered, the result matrix is permuted, but the dense input matrix does not have to be reordered.
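A sketch of constructing this hypergraph from the non-zero coordinates is shown below (illustrative only; the actual partitioning is delegated to PaToH):

#include <vector>
#include <utility>

// Build the hypergraph H(V, N): rows are vertices and each column j becomes a
// hyper-edge (net) containing every row i that has a non-zero in column j.
std::vector<std::vector<int>> build_column_nets(
        const std::vector<std::pair<int, int>>& nonzeros,  // (row, column) pairs of S
        int num_cols) {
    std::vector<std::vector<int>> nets(num_cols);
    for (const auto& rc : nonzeros)
        nets[rc.second].push_back(rc.first);   // add vertex (row) to net (column)
    return nets;                               // nets can then be passed to a partitioner such as PaToH
}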

Chapter 5: Experiments

The experiments were performed on 4 different machines with different architectures. For the CPU experiments, 2 machines were used:

• Intel Xeon Phi 7250 (68 cores at 1.40 GHz with 34 MB L2 cache)

• Intel(R) Xeon(R) CPU E5-2680 v4 (28 cores at 2.40 GHz with 35 MB L3 cache)

GPUs used for experiments:

• NVIDIA Tesla K80

• NVIDIA Tesla P100 (Pascal)

We have not included the cudaMemcpy time for any GPU experiments. Also, the pre-processing time for creating our data structure is not included in the performance benchmarks.

5.0.1 Dataset

We use datasets from two previous papers on sparse matrix multiplication for our experiments [32, 25]. All these data sets are available at the SuiteSparse Matrix

Collection [9]. Some matrices whose dimensions were so large that they caused the program to crash with dense matrix width 2048 have been removed to maintain uniformity. The matrices are listed in Table 5.1.

Table 5.1: Data set

MATRIX            Rows     Columns  nnz
2cubes_sphere     101492   101492   1647264
bmw3_2            227362   227362   11288630
bone010           986703   986703   47851783
cage12            130228   130228   2032536
cant              62451    62451    4007383
cop20k_A          121192   121192   2624331
crankseg_2        63838    63838    14148858
F1                343791   343791   26837113
hood              220542   220542   9895422
inline_1          503712   503712   36816170
ldoor             952203   952203   42493817
mac_econ_fwd500   206500   206500   1273389
mip1              66463    66463    10352819
msdoor            415863   415863   19173163
nd24k             72000    72000    28715634
pdb1HYS           36417    36417    4344765
pre2              659033   659033   5834044
pwtk              217918   217918   11524432
scircuit          170998   170998   958936
shallow_water1    81920    81920    327680
shipsec1          140874   140874   3568176
Si41Ge41H72       185639   185639   15011265
webbase-1M        1000005  1000005  3105536

5.0.2 SpMM CPU

In this section, we compare our algorithm against the latest Intel MKL library [Intel Math Kernel Library (Intel MKL) version 2018], SpMM based on Compressed Sparse Blocks, and the Taco library [19]. We use different dense matrix widths, increasing in powers of two, for our experiments. Figures 5.1 and 5.2 show the performance of our algorithm on Intel Xeon Phi using double precision. Figures 5.3 and 5.4 show the performance of our algorithm on Xeon E5 using double precision. The mkl_dcsrmm and mkl_scsrmm routines from Intel MKL [14] were used for the experiments. The code generated by the Taco compiler was used for the experiments. For CSB based SpMM we used the test spmm Cilk Plus [3] implementation code. From our experiments, we observed that when the dense matrix width K is small (< 32) the MKL library performs similarly to our algorithm. But when we increase K, we can see that our algorithm performs significantly better than MKL, CSB or Taco. This is due to the fact that as K increases it becomes exceedingly difficult to get cache hits on matrix I while getting good vectorization. We can see similar trends on the Xeon E5 machine.

5.0.3 SpMM comparison with CuSparse on GPU

In Figure 5.5 we compare the performance of our algorithm with that of CuSparse cusparseDcsrmm2 [26, 27] on Nvidia K80 and Nvidia P100 GPUs. Here we used double precision for all the computations and used the NON_TRANSPOSE [26, 27] option for both the input and output matrices in CuSparse. Though we are comparing

CPU performance with GPU which has a very different architecture, both Intel KNL and GPGPU are used as accelerators for matrix computation and so the comparison is

[Chart: GFLOPS achieved by OURS, MKL, TACO and CSB on each data-set matrix; four panels for K = 16, 32, 64 and 128.]

Figure 5.1: SpMM KNL double precision lower K

[Chart: GFLOPS achieved by OURS, MKL, TACO and CSB on each data-set matrix; four panels for K = 256, 512, 1024 and 2048.]

Figure 5.2: SpMM KNL double precision higher K

[Chart: GFLOPS on Xeon E5 for OURS, MKL, TACO and CSB on each data-set matrix; four panels for K = 16, 32, 64 and 128.]

Figure 5.3: SpMM Xeon E5 double precision lower K

[Chart: GFLOPS on Xeon E5 for OURS, MKL, TACO and CSB on each data-set matrix; four panels for K = 256, 512, 1024 and 2048.]

Figure 5.4: SpMM Xeon E5 double precision higher K

35 300

250

200

150 GFLOPS

100

50

0 F1 mac cant pre2 mip1 pwtk hood ldoor inline nd24k bmw3 cage12 2cubes scircuit cop20k msdoor shallow shipsec1 crankseg bone010 webbase pdb1HYS Si41Ge41 OURS Intel KNL CuSparse Nvidia K80 CuSparse Nvidia P100

Figure 5.5: Our algorithm on Intel KNL vs CuSparse on Nvidia K80 and Nvidia P100

[Chart: GFLOPS at K = 128 for OURS, Reordered Ours, MKL and Reordered MKL.]

Figure 5.6: Performance after reordering matrices

justified. We can observe that our algorithm performs relatively better than CuSparse even on Nvidia P100 which is better than Xeon Phi in terms of maximum processing power and memory bandwidth.

5.0.4 Reordering

As mentioned in Chapter 4, reordering helps to reduce the number of active column segments. As shown in Figure 5.6, the ratio of performance improvement from reordering is higher for MKL than for ours. In our scheme, since we keep the data in blocked double compressed CSC format, we get full reuse of the dense matrix elements within a row panel. Hence, our performance only improves for cases where the number of active column segments is significantly reduced. Reordering seems to aid MKL more significantly than our SpMM, possibly because its implementation has not been designed to be as locality-aware as our design.

37 120 100 80 60 40 20 0 F1 1M - cant pre2 pwtk mip1 hood ldoor nd24k cage12 scircuit msdoor inline_1 shipsec1 bmw3_2 bone010 pdb1HYS cop20k_A crankseg_2 webbase Si41Ge41H72 2cubes_sphere shallow_water1 mac_econ_fwd500

Normalized K = 128 Normalized K = 1024

Figure 5.7: Normalized pre-processing cost

5.0.5 Pre-processing

The calculation of the standard deviation of the number of elements in the rows and the transformation of the input sparse matrix into our data structure incur an additional overhead. This overhead, shown in Figure 5.7, is a one-time cost and will be amortized over the next few iterative SpMM calculations. Certain neural networks and machine learning algorithms which use SpMM perform SpMM on the same sparse matrix iteratively, where only the input dense matrix changes. In these applications, the pre-processing cost is easily compensated by the increase in performance.

Chapter 6: Conclusion and Future Work

6.1 Conclusion

A systematic exploration of different streaming choices for SpMM was done and the data movement for different tile choices was calculated. A model was built to choose the optimal tile sizes based upon the performance measurement experiments and data movement calculations. An efficient SpMM algorithm based upon 2D tiling was developed for the CPU using the tile choices from the aforementioned model. This new algorithm demonstrated better performance than the state-of-the-art Intel Math Kernel

Library on Intel Xeon and Xeon Phi processors using the matrices from the data-set.

6.2 Future Work

The streaming strategies used in this thesis show a very good performance improvement over the existing state-of-the-art frameworks. These strategies can be applied to Graphics Processing Units and other accelerators, and a library can be created for SpMM on different architectures. The new SpMM algorithm can be used in different applications like multi-source BFS and Betweenness Centrality to show the performance improvement in real-world applications.

Bibliography

[1] “Intel Vector”. Intel Developer Documentation, https://software.intel.com/en-us/node/524559.

[2] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jef- frey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Mur- ray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.

[3] H. Metin Aktulga, Aydin Buluç, Samuel Williams, and Chao Yang. Optimizing sparse matrix-multiple vector multiplication for nuclear configuration interaction calculations. May 2014.

[4] Hartwig Anzt, Stanimire Tomov, and Jack Dongarra. Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product. In Proceedings of the Symposium on High Performance Computing, HPC ’15, pages 75–82, San Diego, CA, USA, 2015. Society for Computer Simulation International.

[5] Ariful Azad, Grey Ballard, Aydin Buluç, James W. Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, and Samuel Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. 38, October 2015.

[6] A. Buluc and J. R. Gilbert. On the representation and multiplication of hy- persparse matrices. In 2008 IEEE International Symposium on Parallel and Distributed Processing, pages 1–11, April 2008.

[7] Aydin Buluç and John R. Gilbert. Highly parallel sparse matrix-matrix multi- plication. CoRR, abs/1006.2183, 2010.

[8] Leonardo Dagum and Ramesh Menon. Openmp: An industry-standard api for shared-memory programming. IEEE Comput. Sci. Eng., 5(1):46–55, January 1998.

[9] Timothy A. Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, December 2011.

[10] Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq Malas, Jean-Luc Vay, and Henri Vincenti. Applying the roofline performance model to the intel xeon phi knights landing processor. 9945:339–353, 06 2016.

[11] Douglas Doerfler, Jack Deslippe, Samuel Williams, Leonid Oliker, Brandon Cook, Thorsten Kurth, Mathieu Lobet, Tareq M. Malas, Jean-Luc Vay, and Henri Vincenti. Applying the roofline performance model to the intel xeon phi knights landing processor. In ISC Workshops, 2016.

[12] M. Eleyat, L. Natvig, and J. Amundsen. Cache-aware matrix multiplication on multicore systems for ipm-based lp solvers. In 2011 Federated Conference on Computer Science and Information Systems (FedCSIS), pages 431–438, Sept 2011.

[13] Benjamin Graham. Spatially-sparse convolutional neural networks. CoRR, abs/1409.6070, 2014.

[14] Intel. “Developer Reference for Intel Math Kernel Library - C”. https://software.intel.com/en-us/mkl-developer-reference-c.

[15] Jeremy Kepner, Peter Aaltonen, David A. Bader, Aydin Buluç, Franz Franchetti, John R. Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Hen- ning Meyerhenke, Scott McMillan, José E. Moreira, John D. Owens, Carl Yang, Marcin Zalewski, and Timothy G. Mattson. Mathematical foundations of the graphblas. CoRR, abs/1606.05790, 2016.

[16] Jeremy Kepner, David A. Bader, Aydin Buluç, John R. Gilbert, Timothy G. Mattson, and Henning Meyerhenke. Graphs, matrices, and the graphblas: Seven good reasons. CoRR, abs/1504.01039, 2015.

[17] Martin Kiefel, Varun Jampani, and Peter V. Gehler. Sparse convolutional net- works using the permutohedral lattice. CoRR, abs/1503.04949, 2015.

[18] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. The tensor algebra compiler. Proc. ACM Program. Lang., 1(OOPSLA):77:1–77:29, October 2017.

[19] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. The tensor algebra compiler. Proc. ACM Program. Lang., 1(OOPSLA):77:1–77:29, October 2017.

[20] S. E. Kurt, V. Thumma, C. Hong, A. Sukumaran-Rajam, and P. Sadayappan. Characterization of data movement requirements for sparse matrix computations on gpus. In 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pages 283–293, Dec 2017.

[21] Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. The cache perfor- mance and optimizations of blocked algorithms. SIGPLAN Not., 26(4):63–74, April 1991.

[22] Ilya Lashuk, Merico Argentati, Evgueni Ovtchinnikov, and Andrew Knyazev. Preconditioned eigensolver LOBPCG in hypre and PETSc. In Olof B. Widlund and David E. Keyes, editors, Domain Decomposition Methods in Science and Engineering XVI, pages 635–642, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

[23] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[24] K. Matam, S. R. Krishna Bharadwaj Indarapu, and K. Kothapalli. Sparse matrix-matrix multiplication on modern architectures. In 2012 19th Interna- tional Conference on High Performance Computing, pages 1–10, Dec 2012.

[25] D. Merrill and M. Garland. Merge-based parallel sparse matrix-vector multipli- cation. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 678–689, Nov 2016.

[26] Nvidia. “CUDA C Programming Guide”. http://docs.nvidia.com/cuda/cuda-c- programming-guide/index.html.

[27] Nvidia. “cuSPARSE”. http://docs.nvidia.com/cuda/cusparse/index.html.

[28] Gloria Ortega, Francisco Vázquez, Inmaculada García, and Ester M. Garzón. FastSpMM: An efficient library for sparse matrix matrix product on GPUs. The Computer Journal, 57(7):968–979, 2014.

[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[30] Rezaur Rahman. Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers. Apress, Berkely, CA, USA, 1st edition, 2013.

[31] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, second edition, 2003.

[32] Erik Saule, Kamer Kaya, and Ümit V. Çatalyürek. Performance evaluation of sparse matrix multiplication kernels on intel xeon phi. CoRR, abs/1302.1078, 2013.

[33] Erik Saule, Kamer Kaya, and Ümit V. Çatalyürek. Performance evaluation of sparse matrix multiplication kernels on intel xeon phi. In Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, and Jerzy Waśniewski, editors, Parallel Processing and Applied Mathematics, pages 559–570, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.

[34] John E. Stone, David Gohara, and Guochun Shi. Opencl: A parallel program- ming standard for heterogeneous computing systems. IEEE Des. Test, 12(3):66– 73, May 2010.

[35] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of sparse matrix–vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009. Revolutionary Technologies for Acceleration of Emerging Petascale Applications.
