Towards Efficient SpMV on Sunway Many-core Architectures

Changxi Liu, School of Computer Science and Engineering, Beihang University, China, [email protected]
Biwei Xie, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China, [email protected]
Xin Liu, National Research Centre of Parallel Computer Engineering and Technology, China, [email protected]
Wei Xue, Department of Computer Science and Technology, Tsinghua University, China, [email protected]
Hailong Yang, School of Computer Science and Engineering, Beihang University, China, [email protected]
Xu Liu, Department of Computer Science, College of William and Mary, USA, [email protected]

ABSTRACT

Sparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity of SpMV makes it challenging to achieve high performance, especially when porting it to new architectures. In this paper, we present our work on designing and implementing efficient SpMV algorithms on Sunway, a novel architecture with many unique features. To fully exploit the Sunway architecture, we have designed a dual-side multi-level partition mechanism on both sparse matrices and hardware resources to improve locality and parallelism. On one hand, we partition sparse matrices into blocks, tiles, and slices at different granularities. On the other hand, we partition the cores in a Sunway processor into fleets, and further dedicate part of the cores in a fleet as computation cores and I/O cores. Moreover, we have optimized the communication between partitions to further improve the performance. Our scheme is generally applicable to different SpMV formats and implementations. For evaluation, we have applied our techniques atop a popular SpMV format, CSR. Experimental results on 18 datasets show that our optimization yields up to 15.5× (12.3× on average) speedups.

CCS CONCEPTS

• Mathematics of computing → Mathematical software performance; Computations on matrices; • Theory of computation → Parallel algorithms;

KEYWORDS

Sunway Architecture, Sparse Matrices, SpMV, Parallelism, Locality

ACM Reference Format:
Changxi Liu, Biwei Xie, Xin Liu, Wei Xue, Hailong Yang, and Xu Liu. 2018. Towards Efficient SpMV on Sunway Many-core Architectures. In ICS '18: 2018 International Conference on Supercomputing, June 12-15, 2018, Beijing, China. ACM, New York, NY, USA, Article 4, 11 pages. https://doi.org/10.1145/3205289.3205313

1 INTRODUCTION

Sparse Matrix-vector Multiply (SpMV) is an indispensable kernel for many applications from various domains. In the domain of high performance computing (HPC), applications such as computational fluid dynamics (CFD) and molecular dynamics (MD) rely heavily on linear algebra algorithms, where SpMV plays an important role. Moreover, many machine learning algorithms such as support vector machines (SVM) and sparse convolutional neural networks (CNN) extensively invoke SpMV computation. Finally, graph algorithms such as PageRank and breadth-first search can be abstracted as an SpMV problem.

SpMV is well known for its irregular computation and memory access patterns. Such irregularity arises from random memory references, which make it difficult to exploit data locality. From the compiler and programmer's perspective, the intrinsic irregular pattern is unpredictable at compile time because it highly depends on the sparsity pattern of the input matrices. From the hardware perspective, the irregular pattern incurs potential write conflicts, limiting instruction- and thread-level parallelism. Thus, it is challenging to implement SpMV algorithms efficiently.

It becomes even more challenging when porting SpMV algorithms to new architectures. In this paper, we target Sunway [12], an emerging architecture that is developed for clusters in the HPC domain. The Sunway TaihuLight supercomputer is powered by 10,649,600 cores of SW26010 many-core RISC processors based on the Sunway architecture [12]. Sunway TaihuLight achieves a peak performance of 125 PFlops and has been ranked first on the Top500 list since June 2016. Sunway TaihuLight has demonstrated its powerful computing capacity; two applications running on the whole system of Sunway TaihuLight, Dynamic Model [31] and Earthquake Simulation [11], have received the ACM Gordon Bell Prize. Recent efforts on porting various computation kernels, such as DNN [10], BFS [17], SpTRSV [27], and Stencil [1], reveal unique techniques required to optimize application performance on the Sunway architecture. However, the Sunway architecture still lacks basic computation libraries, such as SpMV. This paper is the first effort to study porting SpMV to the Sunway architecture.

Known as an unconventional architecture, Sunway differs significantly from existing architectures such as GPGPUs, Intel Xeon Phi, and general-purpose CPUs, so SpMV algorithms such as CSR [22], CSR5 [18], and Block Ellpack [8] that are designed for existing architectures cannot adapt to the cache-less design of the Sunway architecture for high performance. We detail the architectural differences and highlight the challenges in the next section.

To address these challenges, such as load balance with massive parallelism and efficient memory access with a cache-less design, we present a novel SpMV scheme on the Sunway architecture. Our technique is generally applicable to any SpMV format that is designed for existing platforms and bridges the gap between existing SpMV designs and the Sunway architecture. To fully exploit Sunway's new architectural features for the SpMV algorithm, we have designed a dual-side multi-level partition mechanism. From the hardware perspective, we partition the cores in a single Sunway processor (also known as a core group) into eight fleets, each of which consists of eight cores as a basic processing unit. Cores in each fleet are further partitioned into seven computation cores and one I/O core. The computation cores perform the SpMV computation, while the I/O cores write the results back to the main memory.

From the software perspective, we first partition the input sparse matrix into blocks, which are assigned to fleets. Each block is further decomposed into several tiles, which are processed by the computation cores in a fleet one by one. Moreover, we partition a tile into a few slices to benefit from the vectorization and register communication provided by the Sunway architecture. Intuitively, our novel partition technique naturally maps the SpMV algorithm to the Sunway architecture, benefiting from both parallelism and locality.

While our technique is general to various SpMV formats, for evaluation, we apply it to a popular SpMV format, CSR. We denote our new SpMV implementation as BT-CSR. We evaluate BT-CSR on Sunway using 18 matrices, covering both scale-free and HPC datasets. We compare BT-CSR with existing SpMV implementations: CSR, CSR5, and Block-Ellpack [3]. Experimental results show that BT-CSR achieves the highest throughput and scalability, yielding speedups up to 15.5× (12.3× on average) over the baseline CSR algorithm.

The remainder of this paper is organized as follows. In Section 2, we describe the background and summarize the challenges of implementing SpMV on the Sunway architecture. Section 3 presents the design of our methodology: the dual-side multi-level partition mechanism. Section 4 gives the details of the SpMV implementation based on BT-CSR. Section 5 elaborates on the experiment setup and analyzes the experimental results. Section 6 describes related work and Section 7 concludes this paper.

2 BACKGROUND AND CHALLENGES

In this section, we introduce the Sunway architecture, give an overview of the SpMV algorithm, and highlight the challenges in porting SpMV to Sunway.

2.1 Sunway SW26010 Many-Core Processor

The Sunway SW26010 many-core processor is the basic building block of the Sunway TaihuLight supercomputer. Figure 1 shows the Sunway architecture in detail.

Figure 1: The architecture of the Sunway processor.

The whole Sunway processor consists of four core groups (CG). Each CG has 765 GFlops double-precision peak performance and 34.1 GB/s theoretical memory bandwidth. One CG includes a DDR3 memory controller (MC), a Management Processing Element (MPE), and a Computing Processing Element (CPE) cluster with 64 CPEs connected through an 8×8 mesh. The MPE and CPE, running at the same 1.45 GHz frequency, have different architectures that are designed for different purposes. The MPE is designed for task management because it supports complete interrupt functions and out-of-order execution, similar to most mainstream processors. In contrast, the CPEs are a set of simplified 64-bit RISC cores for high computing throughput. Each CPE core consists of two pipelines, P0 and P1; P0 is used for floating-point and vector operations, while P1 is dedicated to memory-related operations. Moreover, Sunway introduces a register communication mechanism through which CPEs in the same row or column of the mesh can communicate with each other within ten cycles, which is much lower than a memory access. This register communication mechanism exchanges data between CPEs without moving data across other costly layers in the memory hierarchy.

As for Sunway's memory hierarchy, each MPE has a 32KB L1 data cache and a 256KB L2 data/instruction cache, while each CPE has a 16KB L1 instruction cache and a 64KB scratchpad memory (SPM). The SPM can be configured as a programmable buffer or an automatic data cache. The programmable SPM, also named the local device memory (LDM), must be managed explicitly by software. The data movement between the LDM and the memory is performed through direct memory access (DMA) to guarantee efficiency. The automatic data cache is based on global load/store (Gload/Gstore) operations, which are transparent to the programmers and invoked automatically. The difference between the LDM and the automatic data cache is that DMA is suitable for moving large data blocks, while Gload/Gstore prefers small and random data references.
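The contrast between these two access paths can be sketched in plain C. The snippet below is not the Sunway DMA/athread interface; memcpy simply stands in for a bulk DMA transfer into a small local buffer (playing the role of the LDM), while the indexed loop stands in for per-element Gload-style accesses. All names (ldm_buf, sum_random, sum_batched) are illustrative.

```c
/* Generic sketch of the two access styles described above: per-element
 * "Gload/Gstore"-like random reads from main memory versus a bulk
 * "DMA"-like copy of a contiguous batch into a small local buffer
 * (standing in for the 64KB LDM). Plain C and memcpy are used here;
 * this is NOT the Sunway athread/DMA interface. */
#include <stdio.h>
#include <string.h>

#define LDM_BUF_ELEMS 1024          /* pretend local buffer: 8KB of doubles */

static double ldm_buf[LDM_BUF_ELEMS];

/* Random, element-wise reads: every access goes to "main memory". */
static double sum_random(const double *mem, const int *idx, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += mem[idx[i]];           /* analogous to one Gload per element */
    return s;
}

/* Batched reads: copy a contiguous chunk once, then work out of the buffer. */
static double sum_batched(const double *mem, int offset, int n) {
    memcpy(ldm_buf, mem + offset, n * sizeof(double)); /* analogous to one DMA */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += ldm_buf[i];            /* all subsequent accesses are local */
    return s;
}

int main(void) {
    double mem[4096];
    int idx[LDM_BUF_ELEMS];
    for (int i = 0; i < 4096; i++) mem[i] = i * 0.5;
    for (int i = 0; i < LDM_BUF_ELEMS; i++) idx[i] = (i * 37) % 4096;

    printf("random:  %f\n", sum_random(mem, idx, LDM_BUF_ELEMS));
    printf("batched: %f\n", sum_batched(mem, 128, LDM_BUF_ELEMS));
    return 0;
}
```

The batched path pays one transfer for many subsequent local accesses, which is exactly why SpMV data that can be made contiguous should flow through DMA and the LDM rather than through Gload/Gstore.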

In summary, we identify the following features that are critical for applications to fully exploit the computation capability of the Sunway architecture.

CPE - To take advantage of the massive parallelism of the CPEs, applications should be carefully parallelized.

Register communication - Since register communication only supports data exchange within the same row/column of the mesh, adapting the communication pattern of the application is important to achieve efficient data communication across CPEs.

LDM - The LDM provides much higher bandwidth as well as shorter access latency than the main memory. Therefore, a delicate mechanism is required to leverage the limited size of the LDM to speed up data access at runtime.

2.2 Sparse Matrix-vector Multiplication (SpMV)

In this paper, we use matrix A, vector x, and vector y to describe the computation of SpMV (y = y + A × x). We illustrate the algorithmic procedure of SpMV and analyze the characteristics of the corresponding SpMV implementation with the widely used CSR format [22]. There are three vectors in CSR: vector vals stores the values of the non-zero elements; vector col_idx stores the column indices of the non-zero elements; and vector row_ptr stores the indices of the first non-zero element of each row in vector vals and vector col_idx. Algorithm 1 shows the pseudo code of an SpMV implementation based on the CSR format. As shown in Algorithm 1, the intrinsic characteristics of SpMV, including poor locality, write conflicts, and load imbalance, raise challenges in achieving high performance on modern multi-core and many-core architectures.

Algorithm 1 scalar-SpMV with the CSR format.
1: for i = 0 to numRows − 1 do
2:   sum ← 0
3:   for j = row_ptr[i] to row_ptr[i + 1] − 1 do
4:     y[i] ← y[i] + vals[j] × x[col_idx[j]]
5:   end for
6: end for

Poor locality - SpMV is inherently memory bound due to the random memory references of vector x, which also lead to poor cache locality and low memory bandwidth utilization. The memory access pattern of SpMV is highly dependent on the sparsity of the matrix, which is unpredictable at compile time.

Write conflict - The writes to vector y from multiple threads or SIMD lanes may lead to conflicts at runtime, especially when multiple threads or lanes write to the same location of vector y simultaneously. Although write conflicts can be resolved by using atomic operations, this depends on special hardware support; otherwise the overhead is unaffordable.

Load imbalance - Due to the irregularity of the sparse matrix, the number of non-zero elements in each row of the matrix can be imbalanced. Moreover, the distribution of non-zero elements can also vary from row to row. Thus, it is challenging to devise an efficient SpMV implementation due to the inherent load imbalance.

2.3 Challenges in Porting SpMV to Sunway

Given the unique characteristics of the Sunway architecture, existing SpMV algorithms are far from the bare-metal performance. We mainly study three state-of-the-art algorithms, CSR, Block Ellpack, and CSR5, and show in Section 5 that all of them, with naive porting efforts, achieve poor performance. We further identify the reasons for such poor performance as follows:

• The large number of cores in a Sunway processor requires fine-grained parallelism management in the SpMV algorithm. With a careless design, it is easy to introduce load imbalance across cores.
• As the Sunway architecture provides a unique shared-memory communication strategy via registers, only using the default communication through main memory without the register communication does not yield high performance on the Sunway architecture.
• The LDM in the Sunway processor requires manual efforts for data placement. SpMV algorithms without explicit data management of the LDM can suffer from significant performance degradation. Moreover, the software-managed LDM incurs a new data coherence problem in the whole memory hierarchy, which introduces overhead that requires careful control to guarantee program correctness in addition to performance.

In the next section, we describe our approach that addresses all of these challenges raised by the Sunway architecture.

3 METHODOLOGY

In this section, we present our dual-side multi-level partitioning technique, specifically designed for SpMV running on the Sunway architecture. The high-level idea is to partition the computation of SpMV into three levels from both sides, the input matrix and the hardware resources. The computation at each level is naturally mapped to the corresponding level of hardware partitions to benefit from both parallelism and locality. The input matrix is divided into three-level partitions: block, tile, and slice, which provide different granularities for task management. In the meantime, the many cores on the Sunway processor are separated into fleets, computation cores, and I/O cores. In the rest of this section, we elaborate on this partitioning technique.

3.1 Partitioning the Sparse Matrix

As shown in Figure 2, we partition the input matrix (M × N) with our multi-level strategy. We introduce three concepts to describe the data partitions at different levels: block, tile, and slice. First, we partition the original sparse matrix into blocks, each of which consists of θ rows of the input matrix; thus the size of a block is θ × N. The block is further divided into tiles, each with the size of θ × δ. Finally, the tile is divided into slices, each with the size of ω × δ. We defer the discussion of how to choose appropriate values for these parameters to Section 4.4. An empty block, tile, or slice means that all elements are zero within that partition, and it does not need to be processed. We elaborate on how we map these data partitions onto the Sunway processor in Section 3.3.

Figure 2: The multi-level partition of a sparse matrix. M and N are the numbers of rows and columns of the input matrix, respectively.

3.2 Partitioning the Cores

Due to the high latency of memory accesses on the Sunway processor, writing the result to memory every time a non-zero element is processed deteriorates the performance significantly. Fortunately, the Sunway processor supports a unique feature, register communication, based on which we can design a partitioning method for better memory efficiency. We divide all the cores of a Sunway processor into eight fleets, each of which consists of eight CPE cores in the same row of the CPE mesh. These fleets are assigned different rows of the input matrix and are thus independent from each other when performing the SpMV computation. We further assign the cores in the same fleet two different roles: computation core and I/O core. The computation cores are responsible for the computation of the SpMV, whereas the I/O core is used to buffer the intermediate results and write the final results back to memory when the computation is done.

In Figure 3(a), we show the details of how to partition the hardware resources. Each computation core processes the corresponding input slices and transfers the results to the I/O core through register communication. The I/O core is dedicated to writing the computation results to vector y in memory. The I/O core maintains a buffer to store θ values of vector y, which are frequently accessed during the computation of the corresponding block.

We further design a data format to facilitate the vectorization and the data transfer from the computation cores to the I/O core. The size of a message in the register communication on the Sunway processor is 32 bytes, which is also the width of a register for vectorization. We divide the register into two parts: one for the auxiliary information and the other for the result. As shown in Figure 3(b), the first 8 bytes are occupied by two variables, Fin and RowIdx. Fin denotes whether there are still tiles left in this block for processing. RowIdx indicates the index of the first row in the current slice.

Figure 3: The multi-level partition of the hardware resources: (a) the partition of the cores; (b) the format for register communication.

3.3 Mapping SpMV to Many-core Sunway

In our approach, the input matrix and the hardware resources are partitioned separately. We perform the SpMV computation by mapping the partitions to the Sunway processor at each level, as shown in Figure 4. At Level 1, the original sparse matrix is partitioned into (M/θ) blocks. These blocks are stored in a global block queue, which is shared by all the fleets. Thus, each fleet can fetch another block from this queue when its resources become available. When there is no block left in this queue, the entire SpMV finishes processing. At Level 2, a fleet processes the blocks. First, the block is split into (N/δ) tiles, which form a tile queue. This tile queue is shared among the cores within the same fleet, including seven computation cores and one I/O core. Each computation core fetches a tile from the tile queue when it is available. At Level 3, the tile is further divided into (θ/ω) slices to facilitate the vectorization and data transfer from the computation cores to the I/O core. The slice is the basic unit processed by a computation core in our method. The work sharing mechanism in the block and tile queues guarantees the workload balance across fleets and cores.
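As a rough illustration of the dual-side bookkeeping described in Sections 3.1 through 3.3, the sketch below computes the block, tile, and slice counts from (M, N, θ, δ, ω) and mimics the shared block queue with a C11 atomic counter. It is a simplified stand-in, not the BT-CSR implementation; the helper names (make_partition, fetch_block) are hypothetical, and the θ = 8192, δ = 256, ω = 3 values are the ones chosen in Section 4.4.

```c
/* A minimal, generic sketch of the dual-side partition bookkeeping described
 * in Section 3 (not the actual BT-CSR code): the matrix side is split into
 * blocks (theta rows), tiles (theta x delta), and slices (omega x delta),
 * and the hardware side shares blocks through a global queue. C11 atomics
 * stand in for whatever queue mechanism the real runtime uses. */
#include <stdatomic.h>
#include <stdio.h>

#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))

typedef struct {
    int num_blocks;          /* level 1: ceil(M / theta), shared by all fleets  */
    int tiles_per_block;     /* level 2: ceil(N / delta), shared inside a fleet */
    int slices_per_tile;     /* level 3: ceil(theta / omega), per compute core  */
} partition_t;

static partition_t make_partition(int M, int N, int theta, int delta, int omega) {
    partition_t p;
    p.num_blocks      = CEIL_DIV(M, theta);
    p.tiles_per_block = CEIL_DIV(N, delta);
    p.slices_per_tile = CEIL_DIV(theta, omega);
    return p;
}

/* Global block queue: each fleet fetches the next unprocessed block. */
static atomic_int next_block = 0;

static int fetch_block(const partition_t *p) {
    int bid = atomic_fetch_add(&next_block, 1);
    return (bid < p->num_blocks) ? bid : -1;   /* -1: no block left, SpMV done */
}

int main(void) {
    /* theta = 8192 rows per block, delta = 256 columns per tile, omega = 3
     * rows per slice, as chosen in Section 4.4. */
    partition_t p = make_partition(1000000, 1000000, 8192, 256, 3);
    printf("blocks=%d tiles/block=%d slices/tile=%d\n",
           p.num_blocks, p.tiles_per_block, p.slices_per_tile);
    int bid;
    while ((bid = fetch_block(&p)) >= 0 && bid < 3)
        printf("fleet fetched block %d\n", bid);
    return 0;
}
```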

Figure 4: Mapping the computation of a sparse matrix to the hardware resources using the three-level partitions.

4 IMPLEMENTATION DETAILS

To demonstrate the effectiveness of our dual-side multi-level partitioning approach, we apply it to CSR, one of the most popular SpMV formats. We refer to our customized format as BT-CSR. It is worth noting that our technique is generally applicable to other SpMV formats. In this section, we elaborate on the implementation details of BT-CSR, including the processing logic of the computation cores and the I/O core in each fleet and how they collaborate with each other. Naturally, our multi-level partitioning maps the computation to the hardware at each level. At the top level, the fleets take charge of processing blocks. Within each fleet, the computation cores and the I/O core form a logical pipeline, where the computation cores iterate through the lower-level partitions (tile and slice) of the input matrix to perform the calculation and the I/O core buffers the intermediate results before writing them to memory.

4.1 Processing Logic of Computation Core

As aforementioned, each fleet is divided into computation cores and an I/O core, with the computation cores responsible for performing the SpMV computation and the I/O core for writing results back to memory. One critical factor that affects the performance of the computation procedure is the frequent data accesses to memory. There are two ways on Sunway to reference memory: GLOAD and DMA. GLOAD supports loading random data from memory, while DMA supports loading continuous data in batches. Since our approach prefetches vector x, DMA is preferred over GLOAD. Although the LDM has the potential to reduce the data access latency, it requires careful management due to its limited size.

As discussed in Section 3, we divide the sparse matrix into blocks and further into tiles. Tiles are the basic task units that we assign to different computation cores. When a computation core processes a tile, it further divides the tile into slices for vectorization and register communication. Prefetching the data into the LDM before computation can speed up the memory access; however, the tile is too large to be loaded, whereas the slice is too small and loses efficiency. Therefore, we combine multiple slices into a batch to achieve optimal data transfer between the memory and the LDM. We pre-load the data from memory to the LDM in batches and make sure that there is always enough data for processing in the LDM. For example, batch n+1 is pre-loaded while we are processing batch n. Note that the computation core uses the batch to pre-load data, but still uses the slice as its basic computation unit.

Algorithm 2 elaborates the implementation details. We first describe all the notations used in the algorithm. bl_set, tl_set, and sl_batch_set denote the set of blocks in the current matrix, the set of tiles in the current block, and the set of slices in the current batch, respectively. bln, tln, and sln_batch denote the sizes of the block set, tile set, and slice set, respectively. The values of x_tile are pre-loaded according to the values of vector x that will be used in the current tile. dt_batch and ci_batch store the values and column indices of the non-zero elements in the current batch. msg stores the message that will be transferred to the I/O core.

Each slice contains ω rows, which means there are ω results after the current slice is processed. Here we use vector Data, whose size is ω, to store the intermediate results. Every time the computation core finishes processing a non-zero element, it adds the result to the corresponding element in vector Data. When the computation core finishes processing a slice, vector Data, which stores the intermediate results, needs to be written back to vector y (lines 21-25). We use the format shown in Figure 3(b) to store the results before transferring them to the I/O core (line 27). Note that Fin is the flag indicating whether the current computation core needs to fetch another block for processing. RowIdx is the index of the starting row in the current slice. If the Fin in msg is set to zero, all tiles in the current block have been processed by other computation cores. Therefore, no extra work needs to be done by the current computation core (lines 31-32).

A batch contains sln slices to process. After the computation core finishes processing a batch (lines 18-28), it triggers the pre-load action for the next batch and moves the data pointer to the pre-loaded batch on the computation core (lines 12-29). After the computation core finishes processing all the batches of a tile, it carries on to the next tile until there are no tiles left for processing in the current block (lines 6-30). Then the current computation core stays idle until it receives the restart notification from the I/O core. This notification is a special message sent from the I/O core to the computation cores in the same fleet, indicating that all non-zero elements in the current block have been processed, and the computation core proceeds to the next block until there are no blocks left in the block queue. When all blocks have been processed, the computation core finishes its work (lines 1-35).

Algorithm 2 Processing logic on the computation core.
1: for bid = 1 → bln do
2:   /* Iterate through all the blocks */
3:   Fin ← RUN
4:   tln ← bl_set(bid).tln
5:   tl_set ← bl_set(bid).tl_set
6:   for tid = 1 → tln do
7:     /* Iterate through all the tiles in a block */
8:     tl ← tl_set(tid)
9:     x_tile ← tl.x
10:    sl_batch_set ← tl.sl_batch_set
11:    sln_batch ← tl.sln_batch
12:    for slid_batch = 1 → sln_batch do
13:      /* Iterate through all the batches in a tile */
14:      sl_batch ← sl_batch_set(slid_batch)
15:      dt_batch ← sl_batch.dt
16:      ci_batch ← sl_batch.ci
17:      sln ← sl_batch.sln
18:      for slid = 1 → sln do
19:        /* Iterate through all the slices in a batch */
20:        RowIdx ← sl_batch(slid).ri
21:        for ω_id = 1 → ω do
22:          /* Store intermediate results */
23:          Data(ω_id) ←
24:            dt_batch(slid)(ω_id) × x_tile(ci_batch(slid)(ω_id))
25:        end for
26:        msg ← {Fin, RowIdx, Data(1 : ω)}
27:        RegSend(msg, ResRegIndex) /* Send message to I/O core */
28:      end for
29:    end for
30:  end for
31:  Fin ← EXIT
32:  msg ← {Fin, 0}
33:  RegSend(msg, ResRegIndex) /* Send finish message to I/O core */
34:  RegRecv(rcf) /* Receive synchronization message */
35: end for

4.2 Processing Logic of I/O Core

On the Sunway processor, the latency of directly accessing memory is quite high. The random memory accesses of SpMV exacerbate this performance penalty. To solve this problem, we leverage the I/O core to reduce the number of writes to memory. The I/O core is dedicated to buffering the intermediate results that are received from the computation cores, and writes the results back to vector y in memory when an entire block has been processed. We assign one core as the I/O core within each fleet and introduce vector ty to store the intermediate results. The size of vector ty is θ, since there are at most θ rows in each block.

Algorithm 3 shows the implementation details of the I/O procedure. We use vector ty to buffer the intermediate results that will finally be written back to vector y. In other words, we cache a segment of vector y in vector ty in the LDM. yb indicates the position of the first element of vector ty in vector y. yn denotes the size of vector ty. Each time a computation core finishes a slice, the intermediate result is sent to the I/O core and then written to ty. CCN is the number of computation cores in a fleet. fnc is a counter that records the number of computation cores that have finished their processing of the current block. Each time the I/O core receives a message from a computation core, it first examines the Fin field (line 8). If Fin is non-zero, the message is a reduction request, which carries the data that needs to be reduced into ty. Otherwise, it is a finish notification message indicating that there are no tiles left for processing in the current block. A reduction request is raised by a computation core when it finishes processing a slice. The format of the reduction request is shown in Figure 3(b). The rowIdx field in the reduction request indicates the row index of the slice that triggers the current reduction request. To deal with a reduction request, the I/O core stores data1, data2, and data3 to vector ty using the indices rowIdx, rowIdx+1, and rowIdx+2, respectively (line 16). If the Fin field of the received message equals zero, the counter fnc is increased by one to record the number of computation cores that have finished their work for the current block (line 9). When fnc equals CCN, which is the total number of computation cores in a fleet, all computation cores have finished their work. Then the I/O core notifies the computation cores in the same fleet to process the next block and resets fnc to zero (lines 10-12). The fleet continues to process the next block until exhausting all the blocks (lines 1-19).

Algorithm 3 Processing logic on the I/O core.
1: for bid = 1 → bln do
2:   yb ← bl_set(bid).yb
3:   yn ← bl_set(bid).yn
4:   ty ← y(yb : yn)
5:   while (1) do
6:     RegRecv(RecvInfo)
7:     {Fin, RowIdx, Data(1 : ω)} ← RecvInfo
8:     if Fin == EXIT then
9:       fnc ← fnc + 1
10:      if fnc == CCN then
11:        RegSend(SendInfo, 1 : CCN)
12:        fnc ← 0
13:        Break
14:      end if
15:    end if
16:    ty(RowIdx : ω) += RecvInfo(1 : ω)
17:  end while
18:  y(yb : yn) ← ty
19: end for

4.3 Synchronization across Cores

We design a synchronization mechanism based on the Bulk Synchronous Parallel (BSP) model to enable synchronization between the computation cores and the I/O core in the same fleet, whereas the computation across different fleets is performed independently. There are two situations that trigger synchronization between the computation cores and the I/O core: the reduction request and the finish notification. On the Sunway processor, we use register communication to send messages from the computation cores to the I/O core. Even though the I/O core may receive multiple messages from different computation cores simultaneously, the message processing mechanism on Sunway guarantees that these messages are received reliably, which ensures the correctness of the computational results. Each finish notification message indicates that no tiles in the block need to be processed by the current computation core. After sending the message, the computation core stays idle until it receives the restart notification from the I/O core, indicating that the whole fleet is to process the next block. The I/O core uses a variable fnc to record the number of finish notifications it receives. Thus, when fnc reaches CCN, all the computation cores have finished their work. Then the I/O core broadcasts the restart notifications to each computation core, writes vector ty back to vector y in main memory, and restores its data structures. The procedure of restoring the data structures includes: 1) moving the pointer of the task queue to the next block; and 2) resetting the range of vector ty on the I/O core.
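The 32-byte message of Figure 3(b) and the I/O-core reduction of Algorithm 3 can be mocked up in ordinary C as follows. This is only a sketch: the struct layout mirrors the described format (4-byte Fin, 4-byte RowIdx, three 8-byte results), but an ordinary function call replaces the actual register communication (RegSend/RegRecv), and the constants RUN/EXIT as well as all helper names are illustrative.

```c
/* A plain-C mock of the 32-byte reduction message from Figure 3(b) and of the
 * I/O-core reduction in Algorithm 3. The real implementation moves this
 * struct between CPEs with register communication; here an ordinary function
 * call stands in for RegSend/RegRecv, and EXIT/RUN are illustrative values. */
#include <stdint.h>
#include <stdio.h>

#define OMEGA 3            /* rows per slice = doubles per message (Sec. 4.4) */
#define RUN   1
#define EXIT  0

typedef struct {
    uint32_t fin;           /* 4B: more tiles left in this block?            */
    uint32_t row_idx;       /* 4B: first row covered by this slice           */
    double   data[OMEGA];   /* 3 x 8B: partial results for rows row_idx..+2  */
} reg_msg_t;                /* 32 bytes, the register-communication width    */

/* I/O-core side: fold one reduction request into the buffered ty segment. */
static void reduce_into_ty(double *ty, const reg_msg_t *msg) {
    if (msg->fin == EXIT)   /* finish notification carries no data           */
        return;
    for (int k = 0; k < OMEGA; k++)
        ty[msg->row_idx + k] += msg->data[k];
}

int main(void) {
    double ty[8] = {0};     /* stand-in for the theta-sized y buffer in LDM  */

    /* Compute-core side: pack a slice's partial results and "send" them.   */
    reg_msg_t msg = { .fin = RUN, .row_idx = 2, .data = {1.5, 2.5, 3.5} };
    reduce_into_ty(ty, &msg);

    reg_msg_t done = { .fin = EXIT, .row_idx = 0, .data = {0} };
    reduce_into_ty(ty, &done);

    for (int i = 0; i < 8; i++)
        printf("ty[%d] = %.1f\n", i, ty[i]);
    return 0;
}
```

Packing Fin and RowIdx into the first 8 bytes leaves exactly ω = 3 doubles of payload, which is why a slice carries three rows of partial results per message.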

4.4 Parameter Tuning

There are three parameters in our dual-side multi-level partitioning method: θ, δ, and ω. θ indicates the number of buffered elements of vector ty on the I/O core. δ determines the number of pre-loaded elements of vector x during the calculation procedure. ω is the number of rows in a slice and also the number of intermediate results that a reduction request carries. When a fleet has finished processing a block, the intermediate results are buffered in vector ty in the LDM of the I/O core. It is straightforward to figure out the size of vector ty, which is the number of rows (θ) in a block. As the maximum size of vector ty is limited by the size of the LDM (64KB), the value of θ is also determined: for double-precision floating-point data, θ = 64KB / 8B = 8192. The value of δ relies on the sparsity pattern of the input matrix; we discuss its impact on performance in Section 5.4. On the Sunway processor, a reduction request transferred through register communication can carry at most three floating-point values. We set ω to three, so that each time a slice is processed, the intermediate results of this slice can be written back at once.
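For concreteness, the parameter arithmetic above can be written down directly. The sketch below only restates the constants from the text (a 64KB LDM budget for ty and a 32-byte register message with an 8-byte header) and is not part of the actual implementation; δ = 256 is the empirically tuned value from Section 5.4.

```c
/* A small sketch of the parameter arithmetic in Section 4.4, assuming the
 * whole 64KB LDM of the I/O core is given to the ty buffer; the constants
 * are taken from the text, the helper itself is illustrative. */
#include <stdio.h>

#define LDM_BYTES        (64 * 1024)   /* LDM size per CPE                    */
#define REG_MSG_BYTES    32            /* register-communication message size */
#define MSG_HEADER_BYTES 8             /* Fin + RowIdx                        */

int main(void) {
    /* theta: rows of y buffered on the I/O core per block. */
    int theta = LDM_BYTES / (int)sizeof(double);                          /* 8192 */
    /* omega: partial results carried by one reduction request. */
    int omega = (REG_MSG_BYTES - MSG_HEADER_BYTES) / (int)sizeof(double); /* 3 */
    /* delta: pre-loaded x elements per tile; tuned empirically (Section 5.4). */
    int delta = 256;

    printf("theta=%d omega=%d delta=%d\n", theta, omega, delta);
    return 0;
}
```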

Table 1: The datasets used for evaluation.

Matrix             row × col        nnz     nnz/row
dense              2K × 2K          4M      2K
crankseg_2         63K × 63K        14.1M   221
F1                 343K × 343K      26.8M   78
nd24k              72K × 72K        28.7M   398
pdb1HYS            36K × 36K        4.3M    119
cant               62K × 62K        4.0M    64
pwtk               218K × 218K      11.5M   52
ldoor              952K × 952K      42.5M   44
qcd5_4             49K × 49K        1.9M    39
cop20k_A           121K × 121K      2.6M    21
cage14             1.5M × 1.5M      27.1M   18
2cubes_sphere      101K × 101K      1.6M    16
atmosmodd          1.2M × 1.2M      8.8M    6
mac_econ_fwd500    206K × 206K      1.3M    6
scircuit           171K × 171K      959K    5
shallow_water1     82K × 82K        327K    4
webbase-1M         1M × 1M          3.1M    3
bcsstm38           8K × 8K          10K     1

5 EVALUATION

5.1 Experiment Setup

Our experiments are conducted on a CG of a Sunway SW26010 processor. The performance of SpMV on a CG is important for scientific applications, since many of them decompose the computation problem at the granularity of a CG and perform SpMV intensively within the CG. We use the native compilers swcc and sw5cc on Sunway for C and C++ programs, respectively. To evaluate our method BT-CSR, we select 18 representative sparse matrices from the University of Florida Sparse Matrix Collection [9], which are listed in Table 1. Note that the SW26010 provides two different methods to fetch data from memory: one is to access memory through Gload/Gstore instructions controlled by the compiler, and the other is to use DMA controlled by the programmer. In our implementations, we try to make the memory access as efficient as possible with either Gload/Gstore instructions or DMA. All the experiments use double precision. We run each experiment five times and report the average result; based on our observation, the variance of execution time across runs is quite small on the Sunway processor. For comparison, we also implement four other SpMV formats on the SW26010. We use CSR-MPE as our baseline for performance comparison.

• CSR-MPE, which implements SpMV on the MPE of the Sunway processor based on Algorithm 1.
• CSR-CPE, which is also based on Algorithm 1 but leverages the CPEs of the Sunway processor. The matrix data (values, column indices, and row indices) and vector y are transferred from memory to the LDM using DMA. Since the access pattern of vector x is irregular, its accesses go through Gload/Gstore instructions.
• CSR5-CPE, which is implemented on Sunway with CPEs based on CSR5 [18]. CSR5 is the most cutting-edge SpMV implementation on various platforms. All the data required for computation is transferred from memory to the LDM using DMA except vector x.
• Block-Ellpack-CPE, which is implemented on Sunway with CPEs based on Block Ellpack [8]. Block Ellpack has proved quite efficient on many-core architectures such as GPUs. All the data required for computation is transferred from memory to the LDM using DMA. We set the block size to 8.

5.2 Isolated SpMV Performance Analysis

Figure 5 presents the isolated SpMV performance of our approach and four other SpMV implementations on the Sunway processor. It is clear that our approach gives the best performance across all datasets compared to the other implementations. In general, our approach BT-CSR achieves a 12.3× speedup on average compared to the baseline when using 64 cores. It is also interesting to notice that, except for our approach, the other implementations actually experience performance degradation to a certain extent in quite a few cases compared to the baseline. The reason is that although the CPEs provide much higher bandwidth than the MPE, the irregular access pattern of vector x restricts the implementations to using Gload/Gstore instructions to transfer data from/to memory instead of the more efficient DMA with the LDM. For the CSR5-CPE implementation, another problem that limits its performance on the Sunway processor is that the original CSR5 implementation heavily relies on SIMD instructions to boost its performance. However, to the best of our knowledge, the Sunway processor supports quite a limited number of SIMD instructions, which makes CSR5 less appealing for performing SpMV efficiently on the Sunway processor.
Figure 5: Comparison of isolated SpMV performance in GFLOPS (sharing the same y-axis on the left) between BT-CSR-CPE and four other SpMV implementations (CSR-MPE, CSR-CPE, CSR5-CPE and Block-Ellpack-CPE) on the Sunway processor, with the number of cores equal to 8, 16, 32 and 64 (along the x-axis). The panels cover the datasets (a) dense2, (b) nd24k, (c) crankseg_2, (d) pdb1HYS, (e) F1, (f) cant, (g) pwtk, (h) ldoor, (i) qcd5_4, (j) cop20k_A, (k) cage14, (l) 2cubes_sphere, (m) atmosmodd, (n) mac_econ_fwd500, (o) scircuit, (p) shallow_water1, (q) webbase-1M, and (r) bcsstm38. We use CSR-MPE as our baseline for performance comparison, which performs SpMV on the MPE sequentially. It is worth noting that without thorough optimization, SpMV parallelized on the CPEs runs even slower than the serial version on the MPE.

To address the irregular accesses of vector x, previous literature [2, 4, 8, 20, 30] has pointed out that blocking techniques are effective. However, as shown in Figure 5, Block-Ellpack does not behave well across all datasets. Based on the observation in [8], we set the block size of Block-Ellpack to 8. The performance of Block-Ellpack is much better than the baseline on datasets such as dense2 and nd24k. However, on the dataset webbase-1M, the performance of Block-Ellpack is even worse than the baseline. This is because the limited size of the block prevents the efficient reuse of vector x; if we set the size of the block too large, a lot of useless data will be transferred into the LDM due to the sparsity of the matrix. Our approach solves this problem of traditional blocking methods: vector y is shared by the entire fleet, and vector x is shared by the slice. As seen in Figure 5, BT-CSR achieves better performance on datasets such as webbase-1M and 2cubes_sphere, with 10.3× and 10× speedups respectively compared to the baseline, where Block-Ellpack performs poorly. Moreover, our experiments show that the overhead of preprocessing can be amortized by tens of iterations for most matrices.

Figure 5 also shows the scalability of each approach. As the number of cores increases, the number of fleets available to BT-CSR also increases, which enables BT-CSR to process more blocks simultaneously. It is clear in Figure 5 that the performance of BT-CSR increases as the number of cores increases, which demonstrates the good scalability of our approach across all datasets. For instance, with dense2, BT-CSR achieves 2.0×, 4.0×, and 7.6× speedups (compared to running on 8 cores) as the number of cores scales from 16 to 64. In contrast, Block-Ellpack does not scale well on datasets such as cop20k_A, 2cubes_sphere, and webbase-1M compared to BT-CSR. Similarly, CSR5 exhibits poor scalability on all datasets except scircuit and webbase-1M. Both CSR-MPE and CSR-CPE show similar scalability trends to CSR5 on most datasets.


Figure 6: The locality inefficiency of all SpMV implementations across all datasets. Lower is better.

5.3 Locality Analysis

To further understand the performance of SpMV, we analyze the locality of memory accesses of all approaches. However, due to the unique design of the LDM and the limited support of hardware performance counters on Sunway, it is infeasible for us to report traditional locality metrics such as cache misses. Instead, we measure the number of memory accesses during the SpMV computation as the locality metric, since all cache misses eventually require memory accesses. We also define the locality inefficiency as shown in Equation 1, which represents the ratio of the actual number of memory accesses to the theoretical number of memory accesses for the SpMV computation. The theoretical number of memory accesses (memory_accesses_the) is easy to calculate; it equals the sum of the number of elements within matrix A, vector x, and vector y. For a specific dataset, memory_accesses_the is the same across all SpMV implementations. The actual number of memory accesses (memory_accesses_act) is difficult to measure directly due to the limited support of performance counters on Sunway. Therefore, we manually instrument all SpMV implementations to record memory_accesses_act on each dataset, which includes the accesses to matrix A, vector x, and vector y, as well as the auxiliary data structures used in each SpMV implementation.

Locality_Inefficiency = memory_accesses_act / memory_accesses_the    (1)

Figure 6 shows the locality inefficiency of the four SpMV implementations on all datasets. BT-CSR achieves the lowest locality inefficiency across all datasets, which demonstrates that our approach is effective in reducing the memory traffic of SpMV on the Sunway processor. Note that the actual number of memory accesses is commonly larger than the theoretical number due to the auxiliary data structures used in each SpMV implementation. For instance, CSR5 requires additional data structures such as bitflag and tile_ptr to facilitate its methodology. We also notice that Block-Ellpack-CPE exhibits much worse locality inefficiency when dealing with datasets such as cop20k_A, 2cubes_sphere, mac_econ_fwd500, and scircuit. These datasets are so sparse that Block-Ellpack-CPE needs to pad more zeroes for them, which leads to more useless memory traffic. The fundamental reason for the better locality efficiency of our approach can be attributed to the ability to reuse the data within the LDM, which effectively reduces the number of accesses to memory. In addition, as we further break down the actual memory accesses into DMA and Gload/Gstore instructions through instrumentation, the results show that DMA requests dominate the memory accesses of BT-CSR, whereas Gload/Gstore instructions dominate the other three approaches. In sum, BT-CSR achieves the best locality by not only performing fewer accesses to memory, but also using the more efficient DMA to access memory.

5.4 Parameter Sensitivity Analysis

Figure 7: The performance sensitivity of SpMV using different values of δ and different numbers of fleets. Each cell in the heatmap is the harmonic mean of performance deviations from the optimal settings across all datasets, calculated by Equation 2.

The value of δ is important for the performance of BT-CSR, because δ determines the number of elements of vector x pre-loaded during the computation of a tile. If it is too small, there will be more tiles, which increases the accesses to vector x. In contrast, if it is too large, there will be fewer tiles, which could lead to load imbalance among cores within a fleet as well as more useless values of vector x being loaded.

To evaluate the performance sensitivity of SpMV to the setting of δ, we test all datasets by varying the value of δ from 32 to 1024 as well as the number of fleets from 1 to 8. We use γ to denote the number of fleets. Figure 7 shows the performance heatmap of SpMV using different values of δ and γ. Each cell within Figure 7 represents the harmonic mean of the performance deviations from the optimal settings across all datasets. The harmonic mean is calculated based on Equation 2. Harmonic_{γ,δ} is the harmonic mean of the performance deviations (Ratio_{γ,δ,α}) across all datasets under a specific setting of δ and γ, which is effective for measuring the performance impact of δ across datasets. We define DS as the set of all datasets and α as a specific dataset in DS. NumDS denotes the number of elements in DS. ∆ is the set of δ values, which is {32, 64, 128, 256, 512, 1024}. Γ is the set of γ values, which is {1, 2, 4, 8}.

Harmonic_{γ,δ} = NumDS / Σ_{α∈DS} (1 / Ratio_{γ,δ,α})    (2)

Ratio_{γ,δ,α} represents the performance deviation on a specific dataset under a specific setting of δ and γ. It is calculated based on Equation 3, where T_{γ,δ,α} denotes the execution time on dataset α.

Ratio_{γ,δ,α} = min_{i∈Γ, j∈∆} (T_{i,j,α}) / T_{γ,δ,α}    (3)

As Figure 7 shows, the performance difference is small across different settings of δ as long as γ is fixed. This means the performance of SpMV using our approach is not very sensitive to the setting of δ. In our evaluation, we set δ to 256, with which BT-CSR achieves the best performance across all datasets.

6 RELATED WORK

Plenty of work has been published on SpMV optimization from diverse perspectives [6, 13, 15, 19, 28]. Many new SpMV formats, techniques, and auto-tuners have been proposed to fully exploit the underlying architectures.

CSR5 [18] can be applied to multiple platforms, with a delicate vectorization and tiling method based on segmented sum. BSR [5] introduces a blocking mechanism based on the classical CSR format. CVR [29] is a vectorization-oriented format, aiming at better vectorization efficiency and memory locality. Liu et al. [20] introduce sorting and blocking based on ELLPACK and present a new format named ESB. Kourtis et al. [15] propose Compressed Sparse eXtended (CSX) to compress metadata by exploiting substructures within the matrix. Buluç et al. [6] introduce compressed sparse blocks (CSB), which can deal with both Ax and A^T x efficiently. Yan et al. [30] present blocked compressed common coordinate (BCCOO), also known as yet another SpMV framework, which uses bit flags to compress the data and greatly reduces the memory references. Ashari et al. [2] introduce a two-dimensional blocking mechanism and present a format named blocked row-column (BRC). Tang et al. [26] propose VHCC, using a 2D jagged partition mechanism for better locality and segmented sum for vectorization. Greathouse et al. [14] present CSR-adaptive, aiming at better load balance and memory reference efficiency. Merrill et al. [21] propose a merge-based parallel method, aiming at better SpMV performance on GPUs. Liu et al. [19] present a method for SpMV on CPU+GPU using speculative segmented sum. Buono et al. [7] optimize SpMV for scale-free matrices on POWER8 using a two-phase method.

As the formats show various performance on sparse matrices with different sparsity patterns, there is also some work focusing on selecting the optimal format for the input matrix by analyzing its sparsity pattern. SMAT [16] extracts features from the input matrix and uses a decision tree to predict the optimal format. Sedaghati et al. [23] use features extracted from both the input matrix and the hardware platform for model training. Su et al. [25] present the clSpMV framework based on OpenCL and propose the Cocktail format, which consists of multiple formats for automatic selection. Zhao et al. [32] apply deep learning to SpMV format selection by treating a sparse matrix as an image. Sedaghati et al. [24] propose a decision model using machine learning to automatically select the best format for a sparse matrix on GPU platforms.

To the best of our knowledge, this is the first work that proposes a dual-side multi-level partitioning mechanism for efficient SpMV implementation on the Sunway architecture. It leverages the unique features of the Sunway architecture at each level with a set of new techniques to efficiently map the computation of SpMV to the hardware resources.

7 CONCLUSIONS

This paper presents a novel SpMV scheme targeting the Sunway architecture. The novelty lies in the multiple levels of partitions, designed according to the new Sunway architecture, on both the software and hardware sides, which enhances data locality and fully exploits the hardware parallelism. Our technique is generally applicable to any existing SpMV format and is able to efficiently map SpMV algorithms to the Sunway architecture. To demonstrate the effectiveness of our technique, we have applied it to one of the most popular SpMV formats, CSR, and developed its variant BT-CSR. We evaluate BT-CSR with 18 representative sparse matrices on the Sunway TaihuLight supercomputer. Although our approach is designed for Sunway, it is generally applicable to other emerging many-core architectures, especially those with a cache-less design. Experimental results show that BT-CSR can efficiently utilize Sunway's parallel resources with balanced workloads across cores. BT-CSR outperforms existing SpMV approaches running on the Sunway architecture, specifically yielding speedups up to 15.5× (12.3× on average) over the baseline CSR approach.

ACKNOWLEDGMENTS

The authors would like to thank all anonymous reviewers for their insightful comments and suggestions. This work is partially supported by the National Key R&D Program of China (Grant No. 2016YFB1000304, 2016YFA0602100, 2017YFA0604500 and 2016YFA0602200), the National Natural Science Foundation of China (Grant No. 61502019, 91530323, 41776010 and 61732002), and the National Science Foundation (NSF) under Grant No. 1618620. Hailong Yang is the corresponding author.

REFERENCES
[1] Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, Haohuan Fu, Fangfang Liu, Lin Gan, Ping Xu, and Wenjing Ma. 2017. 26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), Orlando, FL, USA, May 29 - June 2, 2017. 535-544. https://doi.org/10.1109/IPDPS.2017.9
[2] Arash Ashari, Naser Sedaghati, John Eisenlohr, and P. Sadayappan. 2014. An Efficient Two-dimensional Blocking Strategy for Sparse Matrix-vector Multiplication on GPUs. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS '14). ACM, New York, NY, USA, 273-282.
[3] Nathan Bell and Michael Garland. 2009. Implementing Sparse Matrix-vector Multiplication on Throughput-oriented Processors. In Proceedings of the ACM/IEEE Conference on High Performance Computing Networking, Storage and Analysis (SC '09). ACM, New York, NY, USA, Article 18, 11 pages.
[4] Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the conference on high performance computing networking, storage and analysis. ACM, 18.
[5] Luc Buatois, Guillaume Caumon, and Bruno Levy. 2009. Concurrent number cruncher: a GPU implementation of a general sparse linear solver. International Journal of Parallel, Emergent and Distributed Systems 24, 3 (2009), 205-223.
[6] Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures. ACM, 233-244.
[7] Daniele Buono, Fabrizio Petrini, Fabio Checconi, Xing Liu, Xinyu Que, Chris Long, and Tai-Ching Tuan. 2016. Optimizing Sparse Matrix-Vector Multiplication for Large-Scale Data Analytics. In Proceedings of the 30th International Conference on Supercomputing (ICS '16). ACM, New York, NY, USA, Article 37, 12 pages. https://doi.org/10.1145/2925426.2926278
[8] Jee W. Choi, Amik Singh, and Richard W. Vuduc. 2010. Model-driven Autotuning of Sparse Matrix-vector Multiply on GPUs. SIGPLAN Not. 45, 5 (Jan. 2010), 115-126. https://doi.org/10.1145/1837853.1693471
[9] Timothy A. Davis. 1997. The University of Florida sparse matrix collection. NA DIGEST (1997).
[10] J. Fang, H. Fu, W. Zhao, B. Chen, W. Zheng, and G. Yang. 2017. swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 615-624. https://doi.org/10.1109/IPDPS.2017.20
[11] Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang, and Xiaofei Chen. 2017. 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-meter Scenarios. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). ACM, New York, NY, USA, Article 2, 12 pages. https://doi.org/10.1145/3126908.3126910
[12] Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, Wei Zhao, Xunqiang Yin, Chaofeng Hou, Chenglong Zhang, Wei Ge, Jian Zhang, Yangang Wang, Chunbo Zhou, and Guangwen Yang. 2016. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59, 7 (21 Jun 2016), 072001. https://doi.org/10.1007/s11432-016-5588-7
[13] Georgios Goumas, Kornilios Kourtis, Nikos Anastopoulos, Vasileios Karakasis, and Nectarios Koziris. 2009. Performance evaluation of the sparse matrix-vector multiplication on modern architectures. The Journal of Supercomputing 50, 1 (01 Oct 2009), 36-77. https://doi.org/10.1007/s11227-008-0251-8
[14] Joseph L. Greathouse and Mayank Daga. 2014. Efficient Sparse Matrix-vector Multiplication on GPUs Using the CSR Storage Format. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, Piscataway, NJ, USA, 769-780. https://doi.org/10.1109/SC.2014.68
[15] Kornilios Kourtis, Vasileios Karakasis, Georgios Goumas, and Nectarios Koziris. 2011. CSX: An Extended Compression Format for SpMV on Shared Memory Systems. SIGPLAN Not. 46, 8 (Feb. 2011), 247-256. https://doi.org/10.1145/2038037.1941587
[16] Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2013. SMAT: An Input Adaptive Auto-tuner for Sparse Matrix-vector Multiplication. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 117-126. https://doi.org/10.1145/2462156.2462181
[17] Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. 2017. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), Orlando, FL, USA, May 29 - June 2, 2017. 635-645. https://doi.org/10.1109/IPDPS.2017.53
[18] Weifeng Liu and Brian Vinter. 2015. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 339-350. https://doi.org/10.1145/2751205.2751209
[19] Weifeng Liu and Brian Vinter. 2015. Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors. Parallel Comput. 49 (2015), 179-193. https://doi.org/10.1016/j.parco.2015.04.004
[20] Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient Sparse Matrix-vector Multiplication on x86-based Many-core Processors. In Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13). ACM, New York, NY, USA, 273-282. https://doi.org/10.1145/2464996.2465013
[21] Duane Merrill and Michael Garland. 2016. Merge-based Parallel Sparse Matrix-vector Multiplication. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). IEEE, Piscataway, NJ, USA, Article 58, 12 pages. https://doi.org/10.1109/SC.2016.57
[22] Y. Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
[23] Naser Sedaghati, Te Mu, Louis-Noel Pouchet, Srinivasan Parthasarathy, and P. Sadayappan. 2015. Automatic Selection of Sparse Matrix Representation on GPUs. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 99-108. https://doi.org/10.1145/2751205.2751244
[24] Naser Sedaghati, Te Mu, Louis-Noel Pouchet, Srinivasan Parthasarathy, and P. Sadayappan. 2015. Automatic Selection of Sparse Matrix Representation on GPUs. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 99-108. https://doi.org/10.1145/2751205.2751244
[25] Bor-Yiing Su and Kurt Keutzer. 2012. clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12). ACM, New York, NY, USA, 353-364. https://doi.org/10.1145/2304576.2304624
[26] Wai Teng Tang, Ruizhe Zhao, Mian Lu, Yun Liang, Huynh Phung Huynh, Xibai Li, and Rick Siow Mong Goh. 2015. Optimizing and Auto-tuning Scale-free Sparse Matrix-vector Multiplication on Intel Xeon Phi. In Proceedings of the 13th IEEE/ACM International Symposium on Code Generation and Optimization (CGO '15). IEEE Computer Society, Washington, DC, USA, 136-145. https://doi.org/10.1109/CGO.2015.7054194
[27] Xinliang Wang, Weifeng Liu, Wei Xue, and Li Wu. 2018. swSpTRSV: A Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, New York, NY, USA, 338-353. https://doi.org/10.1145/3178487.3178513
[28] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2007. Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms. In Proceedings of the 21st ACM/IEEE Conference on Supercomputing (SC '07). ACM, New York, NY, USA, Article 38, 12 pages. https://doi.org/10.1145/1362622.1362674
[29] Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He, and Lixin Zhang. 2018. CVR: Efficient Vectorization of SpMV on x86 Processors. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO '18). ACM, New York, NY, USA, 149-162. https://doi.org/10.1145/3168818
[30] Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. 2014. yaSpMV: Yet Another SpMV Framework on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14). ACM, New York, NY, USA, 107-118. https://doi.org/10.1145/2555243.2555255
[31] Jian Zhang, Chunbao Zhou, Yangang Wang, Lili Ju, Qiang Du, Xuebin Chi, Dongsheng Xu, Dexun Chen, Yong Liu, and Zhao Liu. 2016. Extreme-scale phase field simulations of coarsening dynamics on the Sunway TaihuLight supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 4.
[32] Yue Zhao, Jiajia Li, Chunhua Liao, and Xipeng Shen. 2018. Bridging the Gap Between Deep Learning and Sparse Matrix Format Selection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, New York, NY, USA, 94-108.