Optimizing and Auto-Tuning Scale-Free Sparse Matrix-Vector Multiplication on Intel Xeon Phi
Wai Teng Tang†, Ruizhe Zhao§, Mian Lu†, Yun Liang§, Huynh Phung Huynh†, Xibai Li§, Rick Siow Mong Goh†
†Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
§Center for Energy-Efficient Computing and Applications, School of EECS, Peking University, China

2015 IEEE/ACM International Symposium on Code Generation and Optimization. 978-1-4799-8161-8/15/$31.00 ©2015 IEEE

Abstract

Recently, the Intel Xeon Phi coprocessor has received increasing attention in high performance computing due to its simple programming model and highly parallel architecture. In this paper, we implement sparse matrix-vector multiplication (SpMV) for scale-free matrices on the Xeon Phi architecture and optimize its performance. Scale-free sparse matrices are widely used in various application domains, such as in the study of social networks, gene networks and web graphs. We propose a novel SpMV format called vectorized hybrid COO+CSR (VHCC). Our SpMV implementation employs 2D jagged partitioning, tiling and vectorized prefix sum computations to improve hardware resource utilization, and thus overall performance. As the achieved performance depends on the number of vertical panels, we also develop a performance tuning method to guide its selection. Experimental results demonstrate that our SpMV implementation achieves an average 3× speedup over Intel MKL for a wide range of scale-free matrices.

1. Introduction

Sparse matrix-vector multiplication (SpMV) is a critical kernel that finds applications in many high performance computing (HPC) domains, including structural mechanics, fluid dynamics, social network analysis and data mining. Many algorithms use SpMV iteratively in their computation. For example, the conjugate gradient method [16] uses SpMV to solve a system of linear equations, whereas the PageRank algorithm [14] uses SpMV to determine the ranks of web pages. SpMV computation is a performance bottleneck for many of these algorithms [10, 20, 21]. However, efficient implementation of the SpMV kernel remains a challenging task due to its irregular memory access behavior.

In this paper, we focus on optimizing SpMV for scale-free sparse matrices on the Xeon Phi architecture. Scale-free sparse matrices arise in many practical applications, such as in the study of web links, social networks and transportation networks [2]. Unlike sparse matrices from engineering applications, which are more regular in nature (i.e. the number of non-zeros in each row is similar), a sparse matrix that exhibits scale-free properties is highly irregular. It has many rows with very few non-zeros but only a few rows with a large number of non-zeros. As such, SpMV computation on such matrices is particularly challenging due to the highly irregular distribution of non-zeros. Many existing implementations such as Intel MKL perform well for regular matrices, but are inefficient for scale-free sparse matrices. Previous works have also studied partitioning algorithms for scale-free SpMV computation on distributed memory computers [6, 13]. However, such partitioning schemes are expensive and do not scale well for applications that require online analysis, e.g. dynamically changing web graphs. More importantly, these implementations are not designed for the Intel Xeon Phi architecture.

In this work, we present an efficient scale-free SpMV implementation named VHCC for Intel Xeon Phi. VHCC makes use of a vector format that is designed for efficient vector processing and load balancing. Furthermore, we employ a 2D jagged partitioning method together with tiling in order to improve cache locality and reduce the overhead of expensive gather and scatter operations. We also employ efficient prefix sum computations using SIMD and masked operations that are specially supported by the Xeon Phi hardware. The optimal panel number in the 2D jagged partitioning method varies for different matrices due to their differences in non-zero distribution. Therefore, we develop a performance tuning technique to guide its selection. Experiments indicate that our SpMV implementation achieves an average 3× speedup over Intel MKL for scale-free matrices, and the performance tuning method achieves within 10% of the optimal configuration.

The remainder of the paper is organized as follows. In Section 2, we present background details on Intel Xeon Phi and scale-free sparse matrices. We also discuss the bottlenecks that these matrices face during SpMV computation. Section 3 presents our implementation for computing scale-free SpMV on Xeon Phi, as well as our proposed performance tuning method. Experiment results are then presented in Section 4. Section 5 discusses prior research which is related to ours, and Section 6 concludes the paper.

2. Background and Motivation

2.1 The Intel Xeon Phi Coprocessor

The Intel Xeon Phi coprocessor is based on the Intel Many Integrated Core (MIC) architecture. In this paper, we use the Xeon Phi 5110P coprocessor, which integrates 60 cores on the same package, each running at 1.05GHz. Up to 4 hardware threads per core are supported, and a maximum of 240 threads can be scheduled on the coprocessor. The MIMD (Multiple Instructions, Multiple Data) execution model allows different workloads to be assigned to different threads. The amount of off-chip memory available on the coprocessor is 8GB. There are 8 memory controllers which support 16 GDDR5 channels, and altogether they operate at a peak theoretical bandwidth of 320GB/s.

[Figure 1: (a) FEM/cantilever, (b) Stanford, (c) mouse gene, (d) R-MAT(18, 16).]
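The 320GB/s peak bandwidth figure puts a hard ceiling on SpMV throughput, since SpMV performs very little arithmetic per byte moved. As a rough illustration (a back-of-envelope sketch, not a figure from the paper), a bandwidth-bound upper limit for double-precision CSR SpMV can be computed as follows; the 12-bytes-per-non-zero traffic model is an assumption that ignores row pointers and accesses to the input and output vectors:

```python
# Back-of-envelope roofline estimate for double-precision CSR SpMV on the
# Xeon Phi 5110P bandwidth figure quoted above (assumed traffic model,
# not a measurement from the paper).

PEAK_BW = 320e9  # peak theoretical memory bandwidth, bytes/s


def csr_spmv_bw_bound_gflops(bw_bytes_per_s=PEAK_BW):
    """Upper bound on CSR SpMV throughput if limited only by bandwidth.

    Per non-zero, CSR streams at least one 8-byte double value and one
    4-byte column index; each non-zero contributes one multiply and one
    add (2 flops).  Row pointer and x/y vector traffic are ignored, so
    this bound is optimistic.
    """
    bytes_per_nnz = 8 + 4   # matrix value + column index
    flops_per_nnz = 2       # one multiply and one add per non-zero
    return bw_bytes_per_s / bytes_per_nnz * flops_per_nnz / 1e9


print(f"bandwidth-bound CSR SpMV limit: {csr_spmv_bw_bound_gflops():.1f} GFLOPS")
```

Even this optimistic bound (roughly 53 GFLOPS) sits far below the coprocessor's ∼1 TFlops double-precision compute peak, which is why SpMV performance is dominated by memory behavior rather than arithmetic throughput.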
Every core contains a local 64KB L1 cache, equally divided between the instruction and data caches, and a 512KB unified L2 cache. All the 512KB L2 caches on the 60 cores are fully coherent via the tag directories, and are interconnected by a 512-bit wide bidirectional ring bus. When an L2 cache miss occurs on a core, requests will be forwarded to other cores via the ring network. The L2 cache miss penalty is on the order of hundreds of cycles, so optimizing data locality for the L2 cache is important on Xeon Phi.

One of the major features of Xeon Phi is the wide 512-bit vector processing unit (VPU) present on each of the cores, effectively doubling the 256-bit vector width of the latest Intel Xeon CPUs. In addition, new SIMD (Single Instruction, Multiple Data) instructions including scatter and gather, swizzle, maskable operations, and fused multiply-add (FMA) operations are supported. To achieve high performance on Xeon Phi, it is crucial to utilize the VPUs effectively. The Xeon Phi is capable of a peak double precision performance of ∼1 TFlops. Compared to alternative coprocessor architectures such as that of GPUs, Xeon Phi features low-cost atomic operations that can be used for efficient parallel algorithm implementations.

2.2 Scale-Free Sparse Matrices

The occurrence of sparse matrices (or equivalently, networks) exhibiting scale-free nature in practical settings is attributed to a self-organizing behavior called preferential attachment, which has attracted a lot of studies [2]. Figure 1 shows the differences between regular and scale-free matrices in terms of the distribution of the non-zeros per row of a matrix. The plots are logarithmic in both axes; the horizontal axis denotes the number of non-zeros per row and the vertical axis denotes the number of rows having that specified number of non-zeros. Figure 1a shows a matrix from the engineering sciences (e.g. constructed using FEM). Such matrices tend to have multi-modal distributions and their structures are more regular in nature. We will call these matrices "regular matrices". In contrast, matrices such as those derived from web graphs (see Fig. 1b) and networks (see Fig. 1c) tend to exhibit scale-free properties. These matrices are termed "irregular matrices" or "scale-free matrices". As a comparison, Fig. 1d shows the distribution of the non-zeros per row of a scale-free sparse matrix generated from a Kronecker graph model [7]. In this recursively defined model, two key parameters, s and e, define a scale-free R-MAT(s, e) matrix. For a given s and e, a square matrix with dimensions 2^s × 2^s and an average of e non-zeros per row is obtained. One can easily see that Fig. 1b and Fig. 1c look very similar to Fig. 1d, whereas there is no resemblance at all between the regular matrix in Fig. 1a and the R-MAT in Fig. 1d.

Figure 1: Comparison of regular and irregular (scale-free) matrices. x-axis: non-zeros per row; y-axis: frequency.

2.3 Performance Bottlenecks of Intel MKL's SpMV

The coordinate (COO) and compressed sparse row (CSR) formats are two commonly used SpMV formats that are provided by Intel MKL [1]. COO stores both the row and column indices of all the non-zeros. On the other hand, CSR does not store the row indices; instead, it stores a pointer to the start of each row. Because of this, CSR uses less memory bandwidth and thus performs better than COO. In this paper, we will

Figure 2: Performance of MKL CSR SpMV for regular and scale-free matrices. MKL throughput (GFLOPS): Si41Ge41H72 10.9, crankseg_2 15.9, torso1 8.7, mip1 8.4, ldoor 15.2, xenon2 11.5, cant 18.6, consph 14.5 (regular); Stanford 0.83 (scale-free).
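The difference between the two formats described above can be made concrete with a minimal sketch (plain Python for illustration, not Intel MKL's implementation). Note that for the 5-non-zero example below, COO stores five row indices while CSR stores only four row pointers, which is the bandwidth saving the text refers to:

```python
# Minimal sketch of SpMV in the COO and CSR storage formats discussed
# above (illustrative Python, not Intel MKL's implementation).

def spmv_coo(rows, cols, vals, x, n_rows):
    """y = A*x with A in COO form: one (row, col, value) triple per non-zero."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def spmv_csr(row_ptr, cols, vals, x):
    """y = A*x with A in CSR form: row_ptr[r]..row_ptr[r+1] spans row r's
    non-zeros, so explicit per-non-zero row indices are not stored."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for r in range(n_rows):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += vals[k] * x[cols[k]]
    return y


# Example matrix A = [[2, 0, 1],
#                     [0, 3, 0],
#                     [0, 4, 5]]
x = [1.0, 2.0, 3.0]
y_coo = spmv_coo([0, 0, 1, 2, 2], [0, 2, 1, 1, 2],
                 [2.0, 1.0, 3.0, 4.0, 5.0], x, 3)
y_csr = spmv_csr([0, 2, 3, 5], [0, 2, 1, 1, 2],
                 [2.0, 1.0, 3.0, 4.0, 5.0], x)
print(y_coo, y_csr)  # both give [5.0, 6.0, 23.0]
```

The inner loop of the CSR kernel also shows why scale-free matrices are problematic: with a power-law row-length distribution, a few rows iterate over thousands of non-zeros while most iterate over only a handful, which unbalances any row-parallel schedule.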