Towards Efficient SpMV on Sunway Many-core Architectures

Changxi Liu, School of Computer Science and Engineering, Beihang University, China, [email protected]
Biwei Xie, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China, [email protected]
Xin Liu, National Research Centre of Parallel Computer Engineering and Technology, China, [email protected]
Wei Xue, Department of Computer Science and Technology, Tsinghua University, China, [email protected]
Hailong Yang, School of Computer Science and Engineering, Beihang University, China, [email protected]
Xu Liu, Department of Computer Science, College of William and Mary, USA, [email protected]

ABSTRACT

Sparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity of SpMV makes it challenging to achieve high performance, especially when porting it to new architectures. In this paper, we present our work on designing and implementing efficient SpMV algorithms on Sunway, a novel architecture with many unique features. To fully exploit the Sunway architecture, we have designed a dual-side multi-level partition mechanism on both sparse matrices and hardware resources to improve locality and parallelism. On one hand, we partition sparse matrices into blocks, tiles, and slices at different granularities. On the other hand, we partition the cores in a Sunway processor into fleets, and further dedicate part of the cores in a fleet as computation cores and I/O cores. Moreover, we have optimized the communication between partitions to further improve the performance. Our scheme is generally applicable to different SpMV formats and implementations. For evaluation, we have applied our techniques atop a popular SpMV format, CSR. Experimental results on 18 datasets show that our optimization yields up to 15.5× (12.3× on average) speedups.

CCS CONCEPTS

• Mathematics of computing → Mathematical software performance; Computations on matrices; • Theory of computation → Parallel algorithms;

KEYWORDS

Sunway Architecture, Sparse Matrices, SpMV, Parallelism, Locality

ACM Reference Format:
Changxi Liu, Biwei Xie, Xin Liu, Wei Xue, Hailong Yang, and Xu Liu. 2018. Towards Efficient SpMV on Sunway Many-core Architectures. In ICS '18: 2018 International Conference on Supercomputing, June 12-15, 2018, Beijing, China. ACM, New York, NY, USA, Article 4, 11 pages. https://doi.org/10.1145/3205289.3205313

1 INTRODUCTION

Sparse Matrix-vector Multiply (SpMV) is an indispensable kernel for many applications from various domains. In the domain of high performance computing (HPC), applications such as computational fluid dynamics (CFD) and molecular dynamics (MD) rely heavily on linear algebra algorithms, where SpMV plays an important role. Moreover, many machine learning algorithms such as support vector machines (SVM) and sparse convolutional neural networks (CNN) extensively invoke SpMV computation. Finally, graph algorithms such as PageRank and breadth-first search can be abstracted as an SpMV problem.

SpMV is well known for its irregular computation and memory access patterns. Such irregularity arises from random memory references, which make it difficult to exploit data locality. From the compiler and programmer's perspective, the intrinsic irregular pattern is unpredictable at compile time because it highly depends on the sparsity pattern of the input matrices. From the hardware perspective, the irregular pattern incurs potential write conflicts, limiting instruction- and thread-level parallelism. Thus, it is challenging to implement SpMV algorithms efficiently.

It becomes even more challenging when porting SpMV algorithms to new architectures. In this paper, we target Sunway [12], an emerging architecture that is developed for clusters in the HPC domain. The Sunway TaihuLight supercomputer is powered by 10,649,600 cores of SW26010 many-core RISC processors based on the Sunway architecture [12]. Sunway TaihuLight achieves a peak performance of 125 PFlops and has been ranked first on the Top500 list since June 2016. Sunway TaihuLight has demonstrated its powerful computing capacity; two applications running on the whole system of Sunway TaihuLight, Dynamic Model [31] and Earthquake Simulation [11], have received the ACM Gordon Bell Prize. Recent efforts on porting various computation kernels, such as DNN [10], BFS [17], SpTRSV [27], and Stencil [1], reveal unique techniques required to optimize application performance on the Sunway architecture. However, the Sunway architecture still lacks basic computation libraries, such as SpMV. This paper is the first effort to study porting SpMV to the Sunway architecture.

Known as an unconventional architecture, Sunway differs significantly from existing architectures such as GPGPUs, Intel Xeon Phi, and general-purpose CPUs, so SpMV algorithms such as CSR [22], CSR5 [18], and Block Ellpack [8] that are designed for existing architectures cannot adapt to the cache-less design of the Sunway architecture for high performance. We detail the architectural differences and highlight the challenges in the next section.

To address these challenges, such as load balance with massive parallelism and efficient memory access with a cache-less design, we present a novel SpMV scheme on the Sunway architecture. Our technique is generally applicable to any SpMV format that is designed for existing platforms and bridges the gap between existing SpMV designs and the Sunway architecture. To fully exploit Sunway's new architectural features for the SpMV algorithm, we have designed a dual-side multi-level partition mechanism. From the hardware perspective, we partition the cores in a single Sunway processor (also known as a core group) into eight fleets, each of which consists of eight cores as a basic processing unit. Cores in each fleet are further partitioned into seven computation cores and one I/O core. The computation cores perform the SpMV computation, while the I/O cores write the results back to the main memory.

From the software perspective, we first partition the input sparse matrix into blocks, which are assigned to fleets. Each block is further decomposed into several tiles, which are processed by the computation cores in a fleet one by one. Moreover, we partition a tile into a few slices to benefit from the vectorization and register communication provided by the Sunway architecture. Intuitively, our novel partition technique naturally maps the SpMV algorithm to the Sunway architecture, benefiting from both parallelism and locality.

While our technique is general to various SpMV formats, for evaluation, we apply it to a popular SpMV format, CSR. We denote our new SpMV implementation as BT-CSR. We evaluate BT-CSR on Sunway using 18 matrices, covering both scale-free and HPC datasets. We compare BT-CSR with existing SpMV implementations: CSR, CSR5, and Block-Ellpack [3]. Experimental results show that BT-CSR achieves the highest throughput and scalability, yielding speedups up to 15.5× (12.3× on average) over the baseline CSR algorithm.

The remainder of this paper is organized as follows. In Section 2, we describe the background and summarize the challenges of implementing SpMV on the Sunway architecture. Section 3 presents the design of our methodology: the dual-side multi-level partition mechanism. Section 4 gives the details of the SpMV implementation based on BT-CSR. Section 5 elaborates on the experiment setup and analyzes the experimental results. Section 6 describes related work and Section 7 concludes this paper.

2 BACKGROUND AND CHALLENGES

In this section, we introduce the Sunway architecture, give an overview of the SpMV algorithm, and highlight the challenges in porting SpMV to Sunway.

2.1 Sunway SW26010 Many-Core Processor

The Sunway SW26010 many-core processor is the basic building block of the Sunway TaihuLight supercomputer. Figure 1 shows the Sunway architecture in detail.

Figure 1: The architecture of the Sunway processor.

The whole Sunway processor consists of four core groups (CG). Each CG has 765 GFlops double-precision peak performance and 34.1 GB/s theoretical memory bandwidth. One CG includes a DDR3 memory controller (MC), a Management Processing Element (MPE), and a Computing Processing Element (CPE) cluster with 64 CPEs connected through an 8×8 mesh. The MPE and CPE, running at the same 1.45 GHz frequency, have different architectures that are designed for different purposes. The MPE is designed for task management because it supports complete interrupt functions and out-of-order execution, similar to most mainstream processors. In contrast, the CPEs are a set of simplified 64-bit RISC cores for high computing throughput. Each CPE core consists of two pipelines, P0 and P1; P0 is used for floating-point and vector operations, while P1 is dedicated to memory-related operations. Moreover, Sunway introduces a register communication mechanism through which CPEs in the same row or column of the mesh can communicate with each other within ten cycles, which is much lower than a memory access. This register communication mechanism exchanges data between CPEs without moving data across other costly layers in the memory hierarchy.

As for Sunway's memory hierarchy, each MPE has a 32KB L1 data cache and a 256KB L2 data/instruction cache, while each CPE has a 16KB L1 instruction cache and a 64KB scratchpad memory (SPM). The SPM can be configured as a programmable buffer or an automatic data cache. The programmable SPM, also named the local device memory (LDM), must be managed explicitly by software. The data movement between the LDM and the memory is performed through direct memory access (DMA) to guarantee efficiency. The automatic data cache is based on global load/store (Gload/Gstore) operations, which are transparent to the programmers and invoked automatically. The difference between the LDM and the automatic data cache is that DMA is suitable for moving large data blocks, while Gload/Gstore prefers small and random data references.
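The contrast between these two access paths can be sketched in plain C. The snippet below is not the Sunway DMA/athread interface; memcpy simply stands in for a bulk DMA transfer into a small local buffer (playing the role of the LDM), while the indexed loop stands in for per-element Gload-style accesses. All names (ldm_buf, sum_random, sum_batched) are illustrative.

```c
/* Generic sketch of the two access styles described above: per-element
 * "Gload/Gstore"-like random reads from main memory versus a bulk
 * "DMA"-like copy of a contiguous batch into a small local buffer
 * (standing in for the 64KB LDM). Plain C and memcpy are used here;
 * this is NOT the Sunway athread/DMA interface. */
#include <stdio.h>
#include <string.h>

#define LDM_BUF_ELEMS 1024          /* pretend local buffer: 8KB of doubles */

static double ldm_buf[LDM_BUF_ELEMS];

/* Random, element-wise reads: every access goes to "main memory". */
static double sum_random(const double *mem, const int *idx, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += mem[idx[i]];           /* analogous to one Gload per element */
    return s;
}

/* Batched reads: copy a contiguous chunk once, then work out of the buffer. */
static double sum_batched(const double *mem, int offset, int n) {
    memcpy(ldm_buf, mem + offset, n * sizeof(double)); /* analogous to one DMA */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += ldm_buf[i];            /* all subsequent accesses are local */
    return s;
}

int main(void) {
    double mem[4096];
    int idx[LDM_BUF_ELEMS];
    for (int i = 0; i < 4096; i++) mem[i] = i * 0.5;
    for (int i = 0; i < LDM_BUF_ELEMS; i++) idx[i] = (i * 37) % 4096;

    printf("random:  %f\n", sum_random(mem, idx, LDM_BUF_ELEMS));
    printf("batched: %f\n", sum_batched(mem, 128, LDM_BUF_ELEMS));
    return 0;
}
```

The batched path pays one transfer for many subsequent local accesses, which is exactly why SpMV data that can be made contiguous should flow through DMA and the LDM rather than through Gload/Gstore.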

In summary, we identify the following features that are critical for applications to fully exploit the computation capability of the Sunway architecture.

CPE - To take advantage of the massive parallelism of the CPEs, applications should be carefully parallelized.

Register communication - Since register communication only supports data exchange within the same row/column of the mesh, adapting the communication pattern of the application is important to achieve efficient data communication across CPEs.

LDM - The LDM provides much higher bandwidth as well as shorter access latency than the main memory. Therefore, a delicate mechanism is required to leverage the limited size of the LDM to speed up data access at runtime.

2.2 Sparse Matrix-vector Multiplication (SpMV)

In this paper, we use matrix A, vector x, and vector y to describe the computation of SpMV (y = y + A × x). We illustrate the algorithmic procedure of SpMV and analyze the characteristics of the corresponding SpMV implementation with the widely used CSR format [22]. There are three vectors in CSR: vector vals stores the values of the non-zero elements; vector col_idx stores the column indices of the non-zero elements; and vector row_ptr stores the indices of the first non-zero element of each row in vector vals and vector col_idx. Algorithm 1 shows the pseudo code of an SpMV implementation based on the CSR format. As shown in Algorithm 1, the intrinsic characteristics of SpMV, including poor locality, write conflicts, and load imbalance, raise challenges in achieving high performance on modern multi-core and many-core architectures.

Algorithm 1 scalar-SpMV with the CSR format.
1: for i = 0 to numRows − 1 do
2:   sum ← 0
3:   for j = row_ptr[i] to row_ptr[i + 1] − 1 do
4:     y[i] ← y[i] + vals[j] × x[col_idx[j]]
5:   end for
6: end for

Poor locality - SpMV is inherently memory bound due to the random memory references of vector x, which also lead to poor cache locality and low memory bandwidth utilization. The memory access pattern of SpMV is highly dependent on the sparsity of the matrix, which is unpredictable at compile time.

Write conflict - The writes to vector y from multiple threads or SIMD lanes may lead to conflicts at runtime, especially when multiple threads or lanes write to the same location of vector y simultaneously. Although write conflicts can be resolved by using atomic operations, this depends on special hardware support; otherwise the overhead is unaffordable.

Load imbalance - Due to the irregularity of the sparse matrix, the number of non-zero elements in each row of the matrix can be imbalanced. Moreover, the distribution of non-zero elements can also vary from row to row. Thus, it is challenging to devise an efficient SpMV implementation due to the inherent load imbalance.

2.3 Challenges in Porting SpMV to Sunway

Given the unique characteristics of the Sunway architecture, existing SpMV algorithms are far from the bare-metal performance. We mainly study three state-of-the-art algorithms, CSR, Block Ellpack, and CSR5, and show in Section 5 that all of them, with naive porting efforts, achieve poor performance. We further identify the reasons for such poor performance as follows:

• The large number of cores in a Sunway processor requires fine-grained parallelism management in the SpMV algorithm. With a careless design, it is easy to introduce load imbalance across cores.
• As the Sunway architecture provides a unique shared-memory communication strategy via registers, only using the default communication through main memory without the register communication does not yield high performance on the Sunway architecture.
• The LDM in the Sunway processor requires manual efforts for data placement. SpMV algorithms without explicit data management of the LDM can suffer from significant performance degradation. Moreover, the software-managed LDM incurs a new data coherence problem in the whole memory hierarchy, which introduces overhead that requires careful control to guarantee program correctness in addition to performance.

In the next section, we describe our approach that addresses all of these challenges raised by the Sunway architecture.

3 METHODOLOGY

In this section, we present our dual-side multi-level partitioning technique, specifically designed for SpMV running on the Sunway architecture. The high-level idea is to partition the computation of SpMV into three levels from both sides, the input matrix and the hardware resources. The computation at each level is naturally mapped to the corresponding level of hardware partitions to benefit from both parallelism and locality. The input matrix is divided into three-level partitions: block, tile, and slice, which provide different granularities for task management. In the meantime, the many cores on the Sunway processor are separated into fleets, computation cores, and I/O cores. In the rest of this section, we elaborate on this partitioning technique.

3.1 Partitioning the Sparse Matrix

As shown in Figure 2, we partition the input matrix (M × N) with our multi-level strategy. We introduce three concepts to describe the data partitions at different levels: block, tile, and slice. First, we partition the original sparse matrix into blocks, each of which consists of θ rows of the input matrix; thus the size of a block is θ × N. The block is further divided into tiles, each with the size of θ × δ. Finally, the tile is divided into slices, each with the size of ω × δ. We defer the discussion of how to choose appropriate values for these parameters to Section 4.4. An empty block, tile, or slice means that all elements are zero within that partition, and it does not need to be processed. We elaborate on how we map these data partitions onto the Sunway processor in Section 3.3.

Figure 2: The multi-level partition of a sparse matrix. M and N are the numbers of rows and columns of the input matrix, respectively.

3.2 Partitioning the Cores

Due to the high latency of memory accesses on the Sunway processor, writing the result to memory every time a non-zero element is processed deteriorates the performance significantly. Fortunately, the Sunway processor supports a unique feature, register communication, based on which we can design a partitioning method for better memory efficiency. We divide all the cores of a Sunway processor into eight fleets, each of which consists of eight CPE cores in the same row of the CPE mesh. These fleets are assigned different rows of the input matrix and are thus independent from each other when performing the SpMV computation. We further assign the cores in the same fleet two different roles: computation core and I/O core. The computation cores are responsible for the computation of the SpMV, whereas the I/O core is used to buffer the intermediate results and write the final results back to memory when the computation is done.

In Figure 3(a), we show the details of how to partition the hardware resources. Each computation core processes the corresponding input slices and transfers the results to the I/O core through register communication. The I/O core is dedicated to writing the computation results to vector y in memory. The I/O core maintains a buffer to store θ values of vector y, which are frequently accessed during the computation of the corresponding block.

We further design a data format to facilitate the vectorization and the data transfer from the computation cores to the I/O core. The size of a message in the register communication on the Sunway processor is 32 bytes, which is also the width of a register for vectorization. We divide the register into two parts: one for the auxiliary information and the other for the result. As shown in Figure 3(b), the first 8 bytes are occupied by two variables, Fin and RowIdx. Fin denotes whether there are still tiles left in this block for processing. RowIdx indicates the index of the first row in the current slice.

Figure 3: The multi-level partition of the hardware resources: (a) the partition of the cores; (b) the format for register communication.

3.3 Mapping SpMV to Many-core Sunway

In our approach, the input matrix and the hardware resources are partitioned separately. We perform the SpMV computation by mapping the partitions to the Sunway processor at each level, as shown in Figure 4. At Level 1, the original sparse matrix is partitioned into (M/θ) blocks. These blocks are stored in a global block queue, which is shared by all the fleets. Thus, each fleet can fetch another block from this queue when its resources become available. When there is no block left in this queue, the entire SpMV finishes processing. At Level 2, a fleet processes the blocks. First, the block is split into (N/δ) tiles, which form a tile queue. This tile queue is shared among the cores within the same fleet, including seven computation cores and one I/O core. Each computation core fetches a tile from the tile queue when it is available. At Level 3, the tile is further divided into (θ/ω) slices to facilitate the vectorization and data transfer from the computation cores to the I/O core. The slice is the basic unit processed by a computation core in our method. The work sharing mechanism in the block and tile queues guarantees the workload balance across fleets and cores.
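As a rough illustration of the dual-side bookkeeping described in Sections 3.1 through 3.3, the sketch below computes the block, tile, and slice counts from (M, N, θ, δ, ω) and mimics the shared block queue with a C11 atomic counter. It is a simplified stand-in, not the BT-CSR implementation; the helper names (make_partition, fetch_block) are hypothetical, and the θ = 8192, δ = 256, ω = 3 values are the ones chosen in Section 4.4.

```c
/* A minimal, generic sketch of the dual-side partition bookkeeping described
 * in Section 3 (not the actual BT-CSR code): the matrix side is split into
 * blocks (theta rows), tiles (theta x delta), and slices (omega x delta),
 * and the hardware side shares blocks through a global queue. C11 atomics
 * stand in for whatever queue mechanism the real runtime uses. */
#include <stdatomic.h>
#include <stdio.h>

#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))

typedef struct {
    int num_blocks;          /* level 1: ceil(M / theta), shared by all fleets  */
    int tiles_per_block;     /* level 2: ceil(N / delta), shared inside a fleet */
    int slices_per_tile;     /* level 3: ceil(theta / omega), per compute core  */
} partition_t;

static partition_t make_partition(int M, int N, int theta, int delta, int omega) {
    partition_t p;
    p.num_blocks      = CEIL_DIV(M, theta);
    p.tiles_per_block = CEIL_DIV(N, delta);
    p.slices_per_tile = CEIL_DIV(theta, omega);
    return p;
}

/* Global block queue: each fleet fetches the next unprocessed block. */
static atomic_int next_block = 0;

static int fetch_block(const partition_t *p) {
    int bid = atomic_fetch_add(&next_block, 1);
    return (bid < p->num_blocks) ? bid : -1;   /* -1: no block left, SpMV done */
}

int main(void) {
    /* theta = 8192 rows per block, delta = 256 columns per tile, omega = 3
     * rows per slice, as chosen in Section 4.4. */
    partition_t p = make_partition(1000000, 1000000, 8192, 256, 3);
    printf("blocks=%d tiles/block=%d slices/tile=%d\n",
           p.num_blocks, p.tiles_per_block, p.slices_per_tile);
    int bid;
    while ((bid = fetch_block(&p)) >= 0 && bid < 3)
        printf("fleet fetched block %d\n", bid);
    return 0;
}
```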

Figure 4: Mapping the computation of a sparse matrix to the hardware resources using the three-level partitions.

4 IMPLEMENTATION DETAILS

To demonstrate the effectiveness of our dual-side multi-level partitioning approach, we apply it to CSR, one of the most popular SpMV formats. We refer to our customized format as BT-CSR. It is worth noting that our technique is generally applicable to other SpMV formats. In this section, we elaborate on the implementation details of BT-CSR, including the processing logic of the computation cores and the I/O core in each fleet and how they collaborate with each other. Naturally, our multi-level partitioning maps the computation to the hardware at each level. At the top level, the fleets take charge of processing blocks. Within each fleet, the computation cores and the I/O core form a logical pipeline, where the computation cores iterate through the lower-level partitions (tile and slice) of the input matrix to perform the calculation and the I/O core buffers the intermediate results before writing them to memory.

4.1 Processing Logic of Computation Core

As aforementioned, each fleet is divided into computation cores and an I/O core, with the computation cores responsible for performing the SpMV computation and the I/O core for writing results back to memory. One critical factor that affects the performance of the computation procedure is the frequent data accesses to memory. There are two ways on Sunway to reference memory: GLOAD and DMA. GLOAD supports loading random data from memory, while DMA supports loading continuous data in batches. Since our approach prefetches vector x, DMA is preferred over GLOAD. Although the LDM has the potential to reduce the data access latency, it requires careful management due to its limited size.

As discussed in Section 3, we divide the sparse matrix into blocks and further into tiles. Tiles are the basic task units that we assign to different computation cores. When a computation core processes a tile, it further divides the tile into slices for vectorization and register communication. Prefetching the data into the LDM before computation can speed up the memory access; however, the tile is too large to be loaded, whereas the slice is too small and loses efficiency. Therefore, we combine multiple slices into a batch to achieve optimal data transfer between the memory and the LDM. We pre-load the data from memory to the LDM in batches and make sure that there is always enough data for processing in the LDM. For example, batch n+1 is pre-loaded while we are processing batch n. Note that the computation core uses the batch to pre-load data, but still uses the slice as its basic computation unit.

Algorithm 2 elaborates the implementation details. We first describe all the notations used in the algorithm. bl_set, tl_set, and sl_batch_set denote the set of blocks in the current matrix, the set of tiles in the current block, and the set of slices in the current batch, respectively. bln, tln, and sln_batch denote the sizes of the block set, tile set, and slice set, respectively. The values of x_tile are pre-loaded according to the values of vector x that will be used in the current tile. dt_batch and ci_batch store the values and column indices of the non-zero elements in the current batch. msg stores the message that will be transferred to the I/O core.

Each slice contains ω rows, which means there are ω results after the current slice is processed. Here we use vector Data, whose size is ω, to store the intermediate results. Every time the computation core finishes processing a non-zero element, it adds the result to the corresponding element in vector Data. When the computation core finishes processing a slice, vector Data, which stores the intermediate results, needs to be written back to vector y (lines 21-25). We use the format shown in Figure 3(b) to store the results before transferring them to the I/O core (line 27). Note that Fin is the flag indicating whether the current computation core needs to fetch another block for processing. RowIdx is the index of the starting row in the current slice. If the Fin in msg is set to zero, all tiles in the current block have been processed by other computation cores. Therefore, no extra work needs to be done by the current computation core (lines 31-32).

A batch contains sln slices to process. After the computation core finishes processing a batch (lines 18-28), it triggers the pre-load action for the next batch and moves the data pointer to the pre-loaded batch on the computation core (lines 12-29). After the computation core finishes processing all the batches of a tile, it carries on to the next tile until there are no tiles left for processing in the current block (lines 6-30). Then the current computation core stays idle until it receives the restart notification from the I/O core. This notification is a special message sent from the I/O core to the computation cores in the same fleet, indicating that all non-zero elements in the current block have been processed, and the computation core proceeds to the next block until there are no blocks left in the block queue. When all blocks have been processed, the computation core finishes its work (lines 1-35).

Algorithm 2 Processing logic on the computation core.
1: for bid = 1 → bln do
2:   /* Iterate through all the blocks */
3:   Fin ← RUN
4:   tln ← bl_set(bid).tln
5:   tl_set ← bl_set(bid).tl_set
6:   for tid = 1 → tln do
7:     /* Iterate through all the tiles in a block */
8:     tl ← tl_set(tid)
9:     x_tile ← tl.x
10:    sl_batch_set ← tl.sl_batch_set
11:    sln_batch ← tl.sln_batch
12:    for slid_batch = 1 → sln_batch do
13:      /* Iterate through all the batches in a tile */
14:      sl_batch ← sl_batch_set(slid_batch)
15:      dt_batch ← sl_batch.dt
16:      ci_batch ← sl_batch.ci
17:      sln ← sl_batch.sln
18:      for slid = 1 → sln do
19:        /* Iterate through all the slices in a batch */
20:        RowIdx ← sl_batch(slid).ri
21:        for ω_id = 1 → ω do
22:          /* Store intermediate results */
23:          Data(ω_id) ←
24:            dt_batch(slid)(ω_id) × x_tile(ci_batch(slid)(ω_id))
25:        end for
26:        msg ← {Fin, RowIdx, Data(1 : ω)}
27:        RegSend(msg, ResRegIndex) /* Send message to I/O core */
28:      end for
29:    end for
30:  end for
31:  Fin ← EXIT
32:  msg ← {Fin, 0}
33:  RegSend(msg, ResRegIndex) /* Send finish message to I/O core */
34:  RegRecv(rcf) /* Receive synchronization message */
35: end for

4.2 Processing Logic of I/O Core

On the Sunway processor, the latency of directly accessing memory is quite high. The random memory accesses of SpMV exacerbate this performance penalty. To solve this problem, we leverage the I/O core to reduce the number of writes to memory. The I/O core is dedicated to buffering the intermediate results that are received from the computation cores, and writes the results back to vector y in memory when an entire block has been processed. We assign one core as the I/O core within each fleet and introduce vector ty to store the intermediate results. The size of vector ty is θ, since there are at most θ rows in each block.

Algorithm 3 shows the implementation details of the I/O procedure. We use vector ty to buffer the intermediate results that will finally be written back to vector y. In other words, we cache a segment of vector y in vector ty in the LDM. yb indicates the position of the first element of vector ty in vector y. yn denotes the size of vector ty. Each time a computation core finishes a slice, the intermediate result is sent to the I/O core and then written to ty. CCN is the number of computation cores in a fleet. fnc is a counter that records the number of computation cores that have finished their processing of the current block. Each time the I/O core receives a message from a computation core, it first examines the Fin field (line 8). If Fin is non-zero, the message is a reduction request, which carries the data that needs to be reduced into ty. Otherwise, it is a finish notification message indicating that there are no tiles left for processing in the current block. A reduction request is raised by a computation core when it finishes processing a slice. The format of the reduction request is shown in Figure 3(b). The rowIdx field in the reduction request indicates the row index of the slice that triggers the current reduction request. To deal with a reduction request, the I/O core stores data1, data2, and data3 to vector ty using the indices rowIdx, rowIdx+1, and rowIdx+2, respectively (line 16). If the Fin field of the received message equals zero, the counter fnc is increased by one to record the number of computation cores that have finished their work for the current block (line 9). When fnc equals CCN, which is the total number of computation cores in a fleet, all computation cores have finished their work. Then the I/O core notifies the computation cores in the same fleet to process the next block and resets fnc to zero (lines 10-12). The fleet continues to process the next block until exhausting all the blocks (lines 1-19).

Algorithm 3 Processing logic on the I/O core.
1: for bid = 1 → bln do
2:   yb ← bl_set(bid).yb
3:   yn ← bl_set(bid).yn
4:   ty ← y(yb : yn)
5:   while (1) do
6:     RegRecv(RecvInfo)
7:     {Fin, RowIdx, Data(1 : ω)} ← RecvInfo
8:     if Fin == EXIT then
9:       fnc ← fnc + 1
10:      if fnc == CCN then
11:        RegSend(SendInfo, 1 : CCN)
12:        fnc ← 0
13:        Break
14:      end if
15:    end if
16:    ty(RowIdx : ω) += RecvInfo(1 : ω)
17:  end while
18:  y(yb : yn) ← ty
19: end for

4.3 Synchronization across Cores

We design a synchronization mechanism based on the Bulk Synchronous Parallel (BSP) model to enable synchronization between the computation cores and the I/O core in the same fleet, whereas the computation across different fleets is performed independently. There are two situations that trigger synchronization between the computation cores and the I/O core: the reduction request and the finish notification. On the Sunway processor, we use register communication to send messages from the computation cores to the I/O core. Even though the I/O core may receive multiple messages from different computation cores simultaneously, the message processing mechanism on Sunway guarantees that these messages are received reliably, which ensures the correctness of the computational results. Each finish notification message indicates that no tiles in the block need to be processed by the current computation core. After sending the message, the computation core stays idle until it receives the restart notification from the I/O core, indicating that the whole fleet is to process the next block. The I/O core uses a variable fnc to record the number of finish notifications it receives. Thus, when fnc reaches CCN, all the computation cores have finished their work. Then the I/O core broadcasts the restart notifications to each computation core, writes vector ty back to vector y in main memory, and restores its data structures. The procedure of restoring the data structures includes: 1) moving the pointer of the task queue to the next block; and 2) resetting the range of vector ty on the I/O core.
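The 32-byte message of Figure 3(b) and the I/O-core reduction of Algorithm 3 can be mocked up in ordinary C as follows. This is only a sketch: the struct layout mirrors the described format (4-byte Fin, 4-byte RowIdx, three 8-byte results), but an ordinary function call replaces the actual register communication (RegSend/RegRecv), and the constants RUN/EXIT as well as all helper names are illustrative.

```c
/* A plain-C mock of the 32-byte reduction message from Figure 3(b) and of the
 * I/O-core reduction in Algorithm 3. The real implementation moves this
 * struct between CPEs with register communication; here an ordinary function
 * call stands in for RegSend/RegRecv, and EXIT/RUN are illustrative values. */
#include <stdint.h>
#include <stdio.h>

#define OMEGA 3            /* rows per slice = doubles per message (Sec. 4.4) */
#define RUN   1
#define EXIT  0

typedef struct {
    uint32_t fin;           /* 4B: more tiles left in this block?            */
    uint32_t row_idx;       /* 4B: first row covered by this slice           */
    double   data[OMEGA];   /* 3 x 8B: partial results for rows row_idx..+2  */
} reg_msg_t;                /* 32 bytes, the register-communication width    */

/* I/O-core side: fold one reduction request into the buffered ty segment. */
static void reduce_into_ty(double *ty, const reg_msg_t *msg) {
    if (msg->fin == EXIT)   /* finish notification carries no data           */
        return;
    for (int k = 0; k < OMEGA; k++)
        ty[msg->row_idx + k] += msg->data[k];
}

int main(void) {
    double ty[8] = {0};     /* stand-in for the theta-sized y buffer in LDM  */

    /* Compute-core side: pack a slice's partial results and "send" them.   */
    reg_msg_t msg = { .fin = RUN, .row_idx = 2, .data = {1.5, 2.5, 3.5} };
    reduce_into_ty(ty, &msg);

    reg_msg_t done = { .fin = EXIT, .row_idx = 0, .data = {0} };
    reduce_into_ty(ty, &done);

    for (int i = 0; i < 8; i++)
        printf("ty[%d] = %.1f\n", i, ty[i]);
    return 0;
}
```

Packing Fin and RowIdx into the first 8 bytes leaves exactly ω = 3 doubles of payload, which is why a slice carries three rows of partial results per message.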

4.4 Parameter Tuning

There are three parameters in our dual-side multi-level partitioning method: θ, δ, and ω. θ indicates the number of buffered elements of vector ty on the I/O core. δ determines the number of pre-loaded elements of vector x during the calculation procedure. ω is the number of rows in a slice and also the number of intermediate results that a reduction request carries. When a fleet has finished processing a block, the intermediate results are buffered in vector ty in the LDM of the I/O core. It is straightforward to figure out the size of vector ty, which is the number of rows (θ) in a block. As the maximum size of vector ty is limited by the size of the LDM (64KB), the value of θ is also determined: for double-precision floating-point data, θ = 64KB / 8B = 8192. The value of δ relies on the sparsity pattern of the input matrix; we discuss its impact on performance in Section 5.4. On the Sunway processor, a reduction request transferred through register communication can carry at most three floating-point values. We set ω to three, so that each time a slice is processed, the intermediate results of this slice can be written back at once.
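For concreteness, the parameter arithmetic above can be written down directly. The sketch below only restates the constants from the text (a 64KB LDM budget for ty and a 32-byte register message with an 8-byte header) and is not part of the actual implementation; δ = 256 is the empirically tuned value from Section 5.4.

```c
/* A small sketch of the parameter arithmetic in Section 4.4, assuming the
 * whole 64KB LDM of the I/O core is given to the ty buffer; the constants
 * are taken from the text, the helper itself is illustrative. */
#include <stdio.h>

#define LDM_BYTES        (64 * 1024)   /* LDM size per CPE                    */
#define REG_MSG_BYTES    32            /* register-communication message size */
#define MSG_HEADER_BYTES 8             /* Fin + RowIdx                        */

int main(void) {
    /* theta: rows of y buffered on the I/O core per block. */
    int theta = LDM_BYTES / (int)sizeof(double);                          /* 8192 */
    /* omega: partial results carried by one reduction request. */
    int omega = (REG_MSG_BYTES - MSG_HEADER_BYTES) / (int)sizeof(double); /* 3 */
    /* delta: pre-loaded x elements per tile; tuned empirically (Section 5.4). */
    int delta = 256;

    printf("theta=%d omega=%d delta=%d\n", theta, omega, delta);
    return 0;
}
```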

Table 1: The datasets used for evaluation.

Matrix             row × col        nnz     nnz/row
dense              2K × 2K          4M      2K
crankseg_2         63K × 63K        14.1M   221
F1                 343K × 343K      26.8M   78
nd24k              72K × 72K        28.7M   398
pdb1HYS            36K × 36K        4.3M    119
cant               62K × 62K        4.0M    64
pwtk               218K × 218K      11.5M   52
ldoor              952K × 952K      42.5M   44
qcd5_4             49K × 49K        1.9M    39
cop20k_A           121K × 121K      2.6M    21
cage14             1.5M × 1.5M      27.1M   18
2cubes_sphere      101K × 101K      1.6M    16
atmosmodd          1.2M × 1.2M      8.8M    6
mac_econ_fwd500    206K × 206K      1.3M    6
scircuit           171K × 171K      959K    5
shallow_water1     82K × 82K        327K    4
webbase-1M         1M × 1M          3.1M    3
bcsstm38           8K × 8K          10K     1

5 EVALUATION

5.1 Experiment Setup

Our experiments are conducted on a CG of a Sunway SW26010 processor. The performance of SpMV on a CG is important for scientific applications, since many of them decompose the computation problem at the granularity of a CG and perform SpMV intensively within the CG. We use the native compilers swcc and sw5cc on Sunway for C and C++ programs, respectively. To evaluate our method BT-CSR, we select 18 representative sparse matrices from the University of Florida Sparse Matrix Collection [9], which are listed in Table 1. Note that the SW26010 provides two different methods to fetch data from memory: one is to access memory through Gload/Gstore instructions controlled by the compiler, and the other is to use DMA controlled by the programmer. In our implementations, we try to make the memory access as efficient as possible with either Gload/Gstore instructions or DMA. All the experiments use double precision. We run each experiment five times and report the average result; based on our observation, the variance of execution time across runs is quite small on the Sunway processor. For comparison, we also implement four other SpMV formats on the SW26010. We use CSR-MPE as our baseline for performance comparison.

• CSR-MPE, which implements SpMV on the MPE of the Sunway processor based on Algorithm 1.
• CSR-CPE, which is also based on Algorithm 1 but leverages the CPEs of the Sunway processor. The matrix data (values, column indices, and row indices) and vector y are transferred from memory to the LDM using DMA. Since the access pattern of vector x is irregular, its accesses go through Gload/Gstore instructions.
• CSR5-CPE, which is implemented on Sunway with CPEs based on CSR5 [18]. CSR5 is the most cutting-edge SpMV implementation on various platforms. All the data required for computation is transferred from memory to the LDM using DMA except vector x.
• Block-Ellpack-CPE, which is implemented on Sunway with CPEs based on Block Ellpack [8]. Block Ellpack has proved quite efficient on many-core architectures such as GPUs. All the data required for computation is transferred from memory to the LDM using DMA. We set the block size to 8.

5.2 Isolated SpMV Performance Analysis

Figure 5 presents the isolated SpMV performance of our approach and four other SpMV implementations on the Sunway processor. It is clear that our approach gives the best performance across all datasets compared to the other implementations. In general, our approach BT-CSR achieves a 12.3× speedup on average compared to the baseline when using 64 cores. It is also interesting to notice that, except for our approach, the other implementations actually experience performance degradation to a certain extent in quite a few cases compared to the baseline. The reason is that although the CPEs provide much higher bandwidth than the MPE, the irregular access pattern of vector x restricts the implementations to using Gload/Gstore instructions to transfer data from/to memory instead of the more efficient DMA with the LDM. For the CSR5-CPE implementation, another problem that limits its performance on the Sunway processor is that the original CSR5 implementation heavily relies on SIMD instructions to boost its performance. However, to the best of our knowledge, the Sunway processor supports quite a limited number of SIMD instructions, which makes CSR5 less appealing for performing SpMV efficiently on the Sunway processor.
Figure 5: Comparison of isolated SpMV performance in GFLOPS (sharing the same y-axis on the left) between BT-CSR-CPE and four other SpMV implementations (CSR-MPE, CSR-CPE, CSR5-CPE and Block-Ellpack-CPE) on the Sunway processor, with the number of cores equal to 8, 16, 32 and 64 (along the x-axis). The panels cover the datasets (a) dense2, (b) nd24k, (c) crankseg_2, (d) pdb1HYS, (e) F1, (f) cant, (g) pwtk, (h) ldoor, (i) qcd5_4, (j) cop20k_A, (k) cage14, (l) 2cubes_sphere, (m) atmosmodd, (n) mac_econ_fwd500, (o) scircuit, (p) shallow_water1, (q) webbase-1M, and (r) bcsstm38. We use CSR-MPE as our baseline for performance comparison, which performs SpMV on the MPE sequentially. It is worth noting that without thorough optimization, SpMV parallelized on the CPEs runs even slower than the serial version on the MPE.

To address the irregular accesses of vector x, previous literature [2, 4, 8, 20, 30] has pointed out that blocking techniques are effective. However, as shown in Figure 5, Block-Ellpack does not behave well across all datasets. Based on the observation in [8], we set the block size of Block-Ellpack to 8. The performance of Block-Ellpack is much better than the baseline on datasets such as dense2 and nd24k. However, on the dataset webbase-1M, the performance of Block-Ellpack is even worse than the baseline. This is because the limited size of the block prevents the efficient reuse of vector x; if we set the size of the block too large, a lot of useless data will be transferred into the LDM due to the sparsity of the matrix. Our approach solves this problem of traditional blocking methods: vector y is shared by the entire fleet, and vector x is shared by the slice. As seen in Figure 5, BT-CSR achieves better performance on datasets such as webbase-1M and 2cubes_sphere, with 10.3× and 10× speedups respectively compared to the baseline, where Block-Ellpack performs poorly. Moreover, our experiments show that the overhead of preprocessing can be amortized by tens of iterations for most matrices.

Figure 5 also shows the scalability of each approach. As the number of cores increases, the number of fleets available to BT-CSR also increases, which enables BT-CSR to process more blocks simultaneously. It is clear in Figure 5 that the performance of BT-CSR increases as the number of cores increases, which demonstrates the good scalability of our approach across all datasets. For instance, with dense2, BT-CSR achieves 2.0×, 4.0×, and 7.6× speedups (compared to running on 8 cores) as the number of cores scales from 16 to 64. In contrast, Block-Ellpack does not scale well on datasets such as cop20k_A, 2cubes_sphere, and webbase-1M compared to BT-CSR. Similarly, CSR5 exhibits poor scalability on all datasets except scircuit and webbase-1M. Both CSR-MPE and CSR-CPE show similar scalability trends to CSR5 on most datasets.


Figure 6: The locality inefficiency of all SpMV implementations across all datasets. Lower is better.

5.3 Locality Analysis

To further understand the performance of SpMV, we analyze the locality of memory accesses of all approaches. However, due to the unique design of the LDM and the limited support of hardware performance counters on Sunway, it is infeasible for us to report traditional locality metrics such as cache misses. Instead, we measure the number of memory accesses during the SpMV computation as the locality metric, since all cache misses eventually require memory accesses. We also define the locality inefficiency as shown in Equation 1, which represents the ratio of the actual number of memory accesses to the theoretical number of memory accesses for the SpMV computation. The theoretical number of memory accesses (memory_accesses_the) is easy to calculate; it equals the sum of the number of elements within matrix A, vector x, and vector y. For a specific dataset, memory_accesses_the is the same across all SpMV implementations. The actual number of memory accesses (memory_accesses_act) is difficult to measure directly due to the limited support of performance counters on Sunway. Therefore, we manually instrument all SpMV implementations to record memory_accesses_act on each dataset, which includes the accesses to matrix A, vector x, and vector y, as well as the auxiliary data structures used in each SpMV implementation.

Locality_Inefficiency = memory_accesses_act / memory_accesses_the    (1)

Figure 6 shows the locality inefficiency of the four SpMV implementations on all datasets. BT-CSR achieves the lowest locality inefficiency across all datasets, which demonstrates that our approach is effective in reducing the memory traffic of SpMV on the Sunway processor. Note that the actual number of memory accesses is commonly larger than the theoretical number due to the auxiliary data structures used in each SpMV implementation. For instance, CSR5 requires additional data structures such as bitflag and tile_ptr to facilitate its methodology. We also notice that Block-Ellpack-CPE exhibits much worse locality inefficiency when dealing with datasets such as cop20k_A, 2cubes_sphere, mac_econ_fwd500, and scircuit. These datasets are so sparse that Block-Ellpack-CPE needs to pad more zeroes for them, which leads to more useless memory traffic. The fundamental reason for the better locality efficiency of our approach can be attributed to the ability to reuse the data within the LDM, which effectively reduces the number of accesses to memory. In addition, as we further break down the actual memory accesses into DMA and Gload/Gstore instructions through instrumentation, the results show that DMA requests dominate the memory accesses of BT-CSR, whereas Gload/Gstore instructions dominate the other three approaches. In sum, BT-CSR achieves the best locality by not only performing fewer accesses to memory, but also using the more efficient DMA to access memory.

5.4 Parameter Sensitivity Analysis

Figure 7: The performance sensitivity of SpMV using different values of δ and different numbers of fleets. Each cell in the heatmap is the harmonic mean of performance deviations from the optimal settings across all datasets, calculated by Equation 2.

The value of δ is important for the performance of BT-CSR, because δ determines the number of elements of vector x pre-loaded during the computation of a tile. If it is too small, there will be more tiles, which increases the accesses to vector x. In contrast, if it is too large, there will be fewer tiles, which could lead to load imbalance among cores within a fleet as well as more useless values of vector x being loaded.

To evaluate the performance sensitivity of SpMV to the setting of δ, we test all datasets by varying the value of δ from 32 to 1024 as well as the number of fleets from 1 to 8. We use γ to denote the number of fleets. Figure 7 shows the performance heatmap of SpMV using different values of δ and γ. Each cell within Figure 7 represents the harmonic mean of the performance deviations from the optimal settings across all datasets. The harmonic mean is calculated based on Equation 2. Harmonic_{γ,δ} is the harmonic mean of the performance deviations (Ratio_{γ,δ,α}) across all datasets under a specific setting of δ and γ, which is effective for measuring the performance impact of δ across datasets. We define DS as the set of all datasets and α as a specific dataset in DS. NumDS denotes the number of elements in DS. ∆ is the set of δ values, which is {32, 64, 128, 256, 512, 1024}. Γ is the set of γ values, which is {1, 2, 4, 8}.

Harmonic_{γ,δ} = NumDS / Σ_{α∈DS} (1 / Ratio_{γ,δ,α})    (2)

Ratio_{γ,δ,α} represents the performance deviation on a specific dataset under a specific setting of δ and γ. It is calculated based on Equation 3, where T_{γ,δ,α} denotes the execution time on dataset α.

Ratio_{γ,δ,α} = min_{i∈Γ, j∈∆} (T_{i,j,α}) / T_{γ,δ,α}    (3)

As Figure 7 shows, the performance difference is small across different settings of δ as long as γ is fixed. This means the performance of SpMV using our approach is not very sensitive to the setting of δ. In our evaluation, we set δ to 256, with which BT-CSR achieves the best performance across all datasets.

6 RELATED WORK

Plenty of work has been published on SpMV optimization from diverse perspectives [6, 13, 15, 19, 28]. Many new SpMV formats, techniques, and auto-tuners have been proposed to fully exploit the underlying architectures.

CSR5 [18] can be applied to multiple platforms, with a delicate vectorization and tiling method based on segmented sum. BSR [5] introduces a blocking mechanism based on the classical CSR format. CVR [29] is a vectorization-oriented format, aiming at better vectorization efficiency and memory locality. Liu et al. [20] introduce sorting and blocking based on ELLPACK and present a new format named ESB. Kourtis et al. [15] propose Compressed Sparse eXtended (CSX) to compress metadata by exploiting substructures within the matrix. Buluç et al. [6] introduce compressed sparse blocks (CSB), which can deal with both Ax and A^T x efficiently. Yan et al. [30] present blocked compressed common coordinate (BCCOO), also known as yet another SpMV framework, which uses bit flags to compress the data and greatly reduces the memory references. Ashari et al. [2] introduce a two-dimensional blocking mechanism and present a format named blocked row-column (BRC). Tang et al. [26] propose VHCC, using a 2D jagged partition mechanism for better locality and segmented sum for vectorization. Greathouse et al. [14] present CSR-adaptive, aiming at better load balance and memory reference efficiency. Merrill et al. [21] propose a merge-based parallel method, aiming at better SpMV performance on GPUs. Liu et al. [19] present a method for SpMV on CPU+GPU using speculative segmented sum. Buono et al. [7] optimize SpMV for scale-free matrices on POWER8 using a two-phase method.

As the formats show various performance on sparse matrices with different sparsity patterns, there is also some work focusing on selecting the optimal format for the input matrix by analyzing its sparsity pattern. SMAT [16] extracts features from the input matrix and uses a decision tree to predict the optimal format. Sedaghati et al. [23] use features extracted from both the input matrix and the hardware platform for model training. Su et al. [25] present the clSpMV framework based on OpenCL and propose the Cocktail format, which consists of multiple formats for automatic selection. Zhao et al. [32] apply deep learning to SpMV format selection by treating a sparse matrix as an image. Sedaghati et al. [24] propose a decision model using machine learning to automatically select the best format for a sparse matrix on GPU platforms.

To the best of our knowledge, this is the first work that proposes a dual-side multi-level partitioning mechanism for efficient SpMV implementation on the Sunway architecture. It leverages the unique features of the Sunway architecture at each level with a set of new techniques to efficiently map the computation of SpMV to the hardware resources.

7 CONCLUSIONS

This paper presents a novel SpMV scheme targeting the Sunway architecture. The novelty lies in the multiple levels of partitions, designed according to the new Sunway architecture, on both the software and hardware sides, which enhances data locality and fully exploits the hardware parallelism. Our technique is generally applicable to any existing SpMV format and is able to efficiently map SpMV algorithms to the Sunway architecture. To demonstrate the effectiveness of our technique, we have applied it to one of the most popular SpMV formats, CSR, and developed its variant BT-CSR. We evaluate BT-CSR with 18 representative sparse matrices on the Sunway TaihuLight supercomputer. Although our approach is designed for Sunway, it is generally applicable to other emerging many-core architectures, especially those with a cache-less design. Experimental results show that BT-CSR can efficiently utilize Sunway's parallel resources with balanced workloads across cores. BT-CSR outperforms existing SpMV approaches running on the Sunway architecture, specifically yielding speedups up to 15.5× (12.3× on average) over the baseline CSR approach.

ACKNOWLEDGMENTS

The authors would like to thank all anonymous reviewers for their insightful comments and suggestions. This work is partially supported by the National Key R&D Program of China (Grant No. 2016YFB1000304, 2016YFA0602100, 2017YFA0604500 and 2016YFA0602200), the National Natural Science Foundation of China (Grant No. 61502019, 91530323, 41776010 and 61732002), and the National Science Foundation (NSF) under Grant No. 1618620. Hailong Yang is the corresponding author.

REFERENCES
[1] Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, Haohuan Fu, Fangfang Liu, Lin Gan, Ping Xu, and Wenjing Ma. 2017. 26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), Orlando, FL, USA, May 29 - June 2, 2017. 535-544. https://doi.org/10.1109/IPDPS.2017.9
[2] Arash Ashari, Naser Sedaghati, John Eisenlohr, and P. Sadayappan. 2014. An Efficient Two-dimensional Blocking Strategy for Sparse Matrix-vector Multiplication on GPUs. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS '14). ACM, New York, NY, USA, 273-282.
[3] Nathan Bell and Michael Garland. 2009. Implementing Sparse Matrix-vector Multiplication on Throughput-oriented Processors. In Proceedings of the ACM/IEEE Conference on High Performance Computing Networking, Storage and Analysis (SC '09). ACM, New York, NY, USA, Article 18, 11 pages.
[4] Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the conference on high performance computing networking, storage and analysis. ACM, 18.
[5] Luc Buatois, Guillaume Caumon, and Bruno Levy. 2009. Concurrent number cruncher: a GPU implementation of a general sparse linear solver. International Journal of Parallel, Emergent and Distributed Systems 24, 3 (2009), 205-223.
[6] Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures. ACM, 233-244.
[7] Daniele Buono, Fabrizio Petrini, Fabio Checconi, Xing Liu, Xinyu Que, Chris Long, and Tai-Ching Tuan. 2016. Optimizing Sparse Matrix-Vector Multiplication for Large-Scale Data Analytics. In Proceedings of the 30th International Conference on Supercomputing (ICS '16). ACM, New York, NY, USA, Article 37, 12 pages. https://doi.org/10.1145/2925426.2926278
[8] Jee W. Choi, Amik Singh, and Richard W. Vuduc. 2010. Model-driven Autotuning of Sparse Matrix-vector Multiply on GPUs. SIGPLAN Not. 45, 5 (Jan. 2010), 115-126. https://doi.org/10.1145/1837853.1693471
[9] Timothy A. Davis. 1997. The University of Florida sparse matrix collection. NA DIGEST (1997).
[10] J. Fang, H. Fu, W. Zhao, B. Chen, W. Zheng, and G. Yang. 2017. swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 615-624. https://doi.org/10.1109/IPDPS.2017.20
[11] Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang, and Xiaofei Chen. 2017. 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-meter Scenarios. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). ACM, New York, NY, USA, Article 2, 12 pages. https://doi.org/10.1145/3126908.3126910
[12] Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, Wei Zhao, Xunqiang Yin, Chaofeng Hou, Chenglong Zhang, Wei Ge, Jian Zhang, Yangang Wang, Chunbo Zhou, and Guangwen Yang. 2016. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59, 7 (21 Jun 2016), 072001. https://doi.org/10.1007/s11432-016-5588-7
[13] Georgios Goumas, Kornilios Kourtis, Nikos Anastopoulos, Vasileios Karakasis, and Nectarios Koziris. 2009. Performance evaluation of the sparse matrix-vector multiplication on modern architectures. The Journal of Supercomputing 50, 1 (01 Oct 2009), 36-77. https://doi.org/10.1007/s11227-008-0251-8
[14] Joseph L. Greathouse and Mayank Daga. 2014. Efficient Sparse Matrix-vector Multiplication on GPUs Using the CSR Storage Format. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, Piscataway, NJ, USA, 769-780. https://doi.org/10.1109/SC.2014.68
[15] Kornilios Kourtis, Vasileios Karakasis, Georgios Goumas, and Nectarios Koziris. 2011. CSX: An Extended Compression Format for SpMV on Shared Memory Systems. SIGPLAN Not. 46, 8 (Feb. 2011), 247-256. https://doi.org/10.1145/2038037.1941587
[16] Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2013. SMAT: An Input Adaptive Auto-tuner for Sparse Matrix-vector Multiplication. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 117-126. https://doi.org/10.1145/2462156.2462181
[17] Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. 2017. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), Orlando, FL, USA, May 29 - June 2, 2017. 635-645. https://doi.org/10.1109/IPDPS.2017.53
[18] Weifeng Liu and Brian Vinter. 2015. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 339-350. https://doi.org/10.1145/2751205.2751209
[19] Weifeng Liu and Brian Vinter. 2015. Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors. Parallel Comput. 49 (2015), 179-193. https://doi.org/10.1016/j.parco.2015.04.004
[20] Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient Sparse Matrix-vector Multiplication on x86-based Many-core Processors. In Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13). ACM, New York, NY, USA, 273-282. https://doi.org/10.1145/2464996.2465013
[21] Duane Merrill and Michael Garland. 2016. Merge-based Parallel Sparse Matrix-vector Multiplication. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). IEEE, Piscataway, NJ, USA, Article 58, 12 pages. https://doi.org/10.1109/SC.2016.57
[22] Y. Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
[23] Naser Sedaghati, Te Mu, Louis-Noel Pouchet, Srinivasan Parthasarathy, and P. Sadayappan. 2015. Automatic Selection of Sparse Matrix Representation on GPUs. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 99-108. https://doi.org/10.1145/2751205.2751244
[24] Naser Sedaghati, Te Mu, Louis-Noel Pouchet, Srinivasan Parthasarathy, and P. Sadayappan. 2015. Automatic Selection of Sparse Matrix Representation on GPUs. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 99-108. https://doi.org/10.1145/2751205.2751244
[25] Bor-Yiing Su and Kurt Keutzer. 2012. clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12). ACM, New York, NY, USA, 353-364. https://doi.org/10.1145/2304576.2304624
[26] Wai Teng Tang, Ruizhe Zhao, Mian Lu, Yun Liang, Huynh Phung Huynh, Xibai Li, and Rick Siow Mong Goh. 2015. Optimizing and Auto-tuning Scale-free Sparse Matrix-vector Multiplication on Intel Xeon Phi. In Proceedings of the 13th IEEE/ACM International Symposium on Code Generation and Optimization (CGO '15). IEEE Computer Society, Washington, DC, USA, 136-145. https://doi.org/10.1109/CGO.2015.7054194
[27] Xinliang Wang, Weifeng Liu, Wei Xue, and Li Wu. 2018. swSpTRSV: A Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, New York, NY, USA, 338-353. https://doi.org/10.1145/3178487.3178513
[28] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2007. Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms. In Proceedings of the 21st ACM/IEEE Conference on Supercomputing (SC '07). ACM, New York, NY, USA, Article 38, 12 pages. https://doi.org/10.1145/1362622.1362674
[29] Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He, and Lixin Zhang. 2018. CVR: Efficient Vectorization of SpMV on x86 Processors. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO '18). ACM, New York, NY, USA, 149-162. https://doi.org/10.1145/3168818
[30] Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. 2014. yaSpMV: Yet Another SpMV Framework on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14). ACM, New York, NY, USA, 107-118. https://doi.org/10.1145/2555243.2555255
[31] Jian Zhang, Chunbao Zhou, Yangang Wang, Lili Ju, Qiang Du, Xuebin Chi, Dongsheng Xu, Dexun Chen, Yong Liu, and Zhao Liu. 2016. Extreme-scale phase field simulations of coarsening dynamics on the Sunway TaihuLight supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 4.
[32] Yue Zhao, Jiajia Li, Chunhua Liao, and Xipeng Shen. 2018. Bridging the Gap Between Deep Learning and Sparse Matrix Format Selection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, New York, NY, USA, 94-108.