Auto-tuning shared memory parallel Sparse BLAS operations in librsb-1.2
Michele MARTONE ([email protected])
Max Planck Institute for Plasma Physics, Garching bei München, Germany

Intro

• The Sparse BLAS (Basic Linear Algebra Subroutines) [2] specifies the main kernels for iterative methods:
  – sparse Multiply by Matrix ("MM")
  – sparse triangular Solve by Matrix ("SM")
• Focus on MM: C ← C + α op(A) × B, with:
  – A of dimensions m × k and sparse (nnz nonzeroes)
  – op(A) either of {A, A^T, A^H} (parameter transA)
  – the left hand side (LHS) B of dimensions k × n and the right hand side (RHS) C of dimensions m × n (n = NRHS), both dense (possibly with strided access incB, incC, ...)
  – α a scalar
  – either single or double precision, either real or complex
• librsb implements the Sparse BLAS using the RSB (Recursive Sparse Blocks) data structure [3].

RSB: Recursive Sparse Blocks

• Is there an optimal default blocking?
  – Irregular matrix patterns!
  – Operands (especially transA, NRHS) change the memory footprint!
• Sparse blocks are stored in COO or CSR [1]...
• ...possibly with 16-bit indices ("HCOO" or "HCSR").
• Blocks are sized for the cache and suitable for thread parallelism.
• Recursive partitioning of submatrices results in Z-ordered blocks.
[Figure: RSB instance of matrix bayer02 (ca. 14k × 14k, 64k nonzeroes), with nine sparse blocks labelled "1/9" to "9/9", each marked with its format (HCOO or HCSR) and its nonzero count.]
• The black-bordered boxes are sparse blocks: greener ones have fewer nnz than average, redder ones have more.
• A block's rows and columns determine the ranges of the LHS and RHS operands accessed during MM.
• Larger submatrices like "9/9" can have fewer nonzeroes than smaller ones like "4/9".

Sparse BLAS autotuning extension

• Hand tuning for each operation variant is impossible.
• We propose empirical auto-tuning for librsb-1.2:

  ! Matrix-Vector Multiply: y ← alpha*op(A)*x+y
  call USMV(transA, alpha, A, x, incx, y, incy, istat)
  ! Tuning request for the next operation
  call USSP(A, blas_autotune_next_operation, istat)
  ! Matrix structure and threads tuning
  call USMV(transA, alpha, A, x, incx, y, incy, istat)
  ...
  do ! A is now tuned for y ← alpha*op(A)*x+y
    call USMV(transA, alpha, A, x, incx, y, incy, istat)
    ...
  end do
  ! Request autotuning again
  call USSP(A, blas_autotune_next_operation, istat)
  ! Now tune for C ← C + alpha * op(A) * B
  call USMM(order, transA, nrhs, alpha, A, B, ldB, C, ldC, istat)
  ! The RSB representation of A is probably different from what it was before USMM

Merge / split based autotuning

• Empirical auto-tuning:
  – Given a Sparse BLAS operation, probe for a better performing blocking.
  – Search among slightly coarser or finer blockings.

Tuning example on symmetric matrix audikw_1

• Only the lower triangle is stored here; ca. 1M × 1M, 39M nonzeroes.
• On a machine with a 256 KB L2 cache.
[Figure: three RSB instances of audikw_1 — untuned, tuned for NRHS=1, tuned for NRHS=3.]
• The left instance (625 blocks, avg 491 KB) is before tuning.
• The middle instance (271 blocks, avg 1133 KB) is after tuning for MV (MM with NRHS=1): 1.057x speedup, 6436.6 operations to amortize.
• The right instance (1319 blocks, avg 233 KB) is after tuning for MM with NRHS=3: 1.050x speedup, 3996.2 operations to amortize.
• The finer subdivision at NRHS=3 is a consequence of the increased cache occupation of the per-block LHS/RHS operands.
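The speedup and "operations to amortize" figures quoted above can be estimated from user code with a pattern like the following sketch. It is only an illustration under stated assumptions, not librsb code: the subroutine and variable names (report_amortization, nrep, ...) are made up here; the USMV/USSP calls and the tuning property name are taken verbatim from the listing above (check librsb's blas_sparse headers for the exact identifier); the matrix handle A is assumed to be already assembled; timing uses omp_get_wtime(), and the call that triggers tuning is counted as the tuning cost.

  subroutine report_amortization(transA, alpha, A, x, incx, y, incy)
    use blas_sparse                  ! Sparse BLAS Fortran module (assumed, as in the listing above)
    use omp_lib, only: omp_get_wtime
    implicit none
    integer, intent(in) :: transA, A, incx, incy
    double precision, intent(in)    :: alpha, x(*)
    double precision, intent(inout) :: y(*)
    integer, parameter :: nrep = 100
    integer :: istat, i
    double precision :: t0, t_untuned, t_tune, t_tuned

    ! Average time of the untuned operation.
    t0 = omp_get_wtime()
    do i = 1, nrep
      call USMV(transA, alpha, A, x, incx, y, incy, istat)
    end do
    t_untuned = (omp_get_wtime() - t0) / nrep

    ! Request tuning; the next USMV also performs structure/threads tuning.
    call USSP(A, blas_autotune_next_operation, istat)
    t0 = omp_get_wtime()
    call USMV(transA, alpha, A, x, incx, y, incy, istat)
    t_tune = omp_get_wtime() - t0

    ! Average time of the tuned operation.
    t0 = omp_get_wtime()
    do i = 1, nrep
      call USMV(transA, alpha, A, x, incx, y, incy, istat)
    end do
    t_tuned = (omp_get_wtime() - t0) / nrep

    ! Speedup, and how many operations are needed to repay the tuning time.
    print *, 'speedup:         ', t_untuned / t_tuned
    if (t_untuned > t_tuned) &
      print *, 'ops to amortize: ', t_tune / (t_untuned - t_tuned)
  end subroutine report_amortization

With a small per-operation saving (e.g. the 1.057x case above), the tuning time is only repaid after several thousand further operations, which is the order of magnitude of the amortization counts reported in the example.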
Experiment in MM tuning and comparison to MKL

[Figure: per-matrix results for NRHS=1 and NRHS=2, grouped into three rows (symmetric, general transposed, general untransposed) and five columns: speedup of tuned vs. untuned RSB; speedup of tuned RSB vs. "best threads" MKL; ratio of RSB autotuning time to autotuned MM time; ratio of RSB blocks count (tuned to untuned); ratio of RSB index bytes per nnz (tuned to untuned). Each panel is annotated with its average.]

Setup

• librsb (icc -O3 -xAVX, v15) vs Intel MKL (v11.2) CSR.
• 2 × "Intel Xeon E5-2680", 16 OpenMP threads.
• MM with NRHS ∈ {1, 2}, four BLAS numerical types.
• 27 matrices in total (as in [3]), including symmetric ones.
• (A sketch of the per-variant measurement loop is given after the references.)

Results Summary

• A few dozen percent improvement over the untuned case, costing a few thousand operations.
• Significantly faster than Intel MKL on symmetric and on transposed operation with NRHS=2 (i.e., NRHS > 1).
• Autotuning is more effective on symmetric, and on unsymmetric untransposed with NRHS=1.
• Tuning mostly subdivided further for NRHS=2.

Highlight: symmetric MM vs MV performance

• RSB symmetric MM performance increases when NRHS > 1, while Intel MKL's CSR falls; see the figure below for NRHS=2 and NRHS=4.
[Figure: symmetric MM time relative to MV time, for RSB and for MKL; annotations: 1.19x speedup (RSB) vs. 0.32x slowdown (MKL) at NRHS=2, and 1.10x speedup vs. 0.15x slowdown at NRHS=4.]

Outlook

One may improve via:
• Reversible in-place merge and split: no need for copies while tuning.
• The best merge/split choice is not obvious: different merge and split rules.
• Non-time efficiency criteria (e.g. using an energy measuring API when picking the better performing variant).

References

[1] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. V. der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.
[2] BLAS Forum. Sparse BLAS specification (BLAS specification, chapter 3). Technical report.
[3] M. Martone. Efficient multithreaded untransposed, transposed or symmetric sparse matrix-vector multiplication with the recursive sparse blocks format. Parallel Computing, 40:251–270, 2014.

EXASCALE15, Workshop Sparse Solvers for Exascale: From Building Blocks to Applications, 23-25 March 2015, Greifswald, Germany
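To make the setup above concrete: each matrix is tuned and timed separately for every (transA, NRHS) combination. The following is a minimal sketch of such a per-variant measurement loop, under the same assumptions as the earlier sketch (blas_sparse module, already assembled handle A, illustrative names, B and C dimensioned for the largest NRHS, named constants blas_no_trans/blas_trans/blas_colmajor as in the standard binding). Note that the actual experiment restarts from an untuned matrix for each variant, while here variants are simply timed in sequence.

  subroutine bench_variants(A, B, ldB, C, ldC)
    use blas_sparse                  ! assumed Sparse BLAS module, as in the poster listing
    use omp_lib, only: omp_get_wtime
    implicit none
    integer, intent(in) :: A, ldB, ldC
    double precision, intent(in)    :: B(ldB,*)
    double precision, intent(inout) :: C(ldC,*)
    integer, parameter :: nrep = 100
    double precision, parameter :: alpha = 1.0d0
    integer :: transA, nrhs, istat, i, it
    double precision :: t0, t_ref, t_tuned

    do it = 1, 2
      if (it == 1) then
        transA = blas_no_trans
      else
        transA = blas_trans
      end if
      do nrhs = 1, 2
        ! Reference (untuned) timing of this operation variant.
        t0 = omp_get_wtime()
        do i = 1, nrep
          call USMM(blas_colmajor, transA, nrhs, alpha, A, B, ldB, C, ldC, istat)
        end do
        t_ref = (omp_get_wtime() - t0) / nrep

        ! Tune specifically for this (transA, NRHS) combination, then re-time.
        call USSP(A, blas_autotune_next_operation, istat)
        call USMM(blas_colmajor, transA, nrhs, alpha, A, B, ldB, C, ldC, istat)
        t0 = omp_get_wtime()
        do i = 1, nrep
          call USMM(blas_colmajor, transA, nrhs, alpha, A, B, ldB, C, ldC, istat)
        end do
        t_tuned = (omp_get_wtime() - t0) / nrep

        print *, 'transA=', transA, ' NRHS=', nrhs, ' speedup=', t_ref / t_tuned
      end do
    end do
  end subroutine bench_variants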
1. Introduction, definitions

The Sparse BLAS (Basic Linear Algebra Subroutines) standard [2] specifies the programming interface to performance critical computing kernels on sparse matrices: multiply by dense matrix ("MM") and triangular solve by dense matrix ("SM"). Here, MM is defined as C ← C + α op(A) × B, where:

• A has dimensions m × k and is sparse (nnz nonzeroes);
• op(A) can be either of {A, A^T, A^H} (parameter transA);
• the left hand side (LHS) B is k × n;
• the right hand side (RHS) C is m × n;
• both are dense (possibly with strided access incB, incC);
• α is a scalar;
• either single or double precision, either real or complex.

The matrix A can be general (A ≠ A^T), symmetric (A = A^T), or Hermitian (A = A^H). In the following, using the BLAS jargon, we will refer to n as NRHS. If NRHS=1, MM is equivalent to the sparse matrix-vector multiply "MV". SM is defined as B ← α op(T)^{-1} B, where T is sparse triangular.

Gearing a Sparse BLAS implementation to be efficient in all the above cases, with all the different possible inputs, is challenging.

librsb is a library based on the RSB (Recursive Sparse Blocks) data structure [3] and implements the Sparse BLAS with the aim of offering efficient multithreaded operation. Our approach in seeking both generality and efficiency employs empirical auto-tuning. Namely, the user specifies a Sparse BLAS operation to be iterated many times (using the same fixed-pattern matrix). Then a tuning procedure attempts to rearrange the sparse matrix data structure to perform that operation faster. If it fails, the time taken by the autotuning procedure is lost. But if it succeeds, the now faster iterations will ultimately amortize the invested tuning time.

This poster introduces the idea of tuning RSB for Sparse BLAS operations. Although the autotuning

Algorithm 2: Sketch of the PROBE procedure, determining the reference performance of a specified Sparse BLAS operation Ω on matrix A and operands from ω. The thread count specification θ may request thread probing as well:

• If θ is specified, that thread count will be used to measure the execution time of operation Ω on instance A and operands from ω.
• If θ is not specified, it will be determined by repeating the above over a reasonably restricted range of thread counts.
• A minimum of 10 executions is required. (A sketch of this probing loop is given below, after Figure 2.)

With Algorithm 1 it shall be possible to remedy overly fine or coarse partitionings. Subdividing or coarsening is thread parallel, and one full step takes the time of a few memory copies of the affected area. Coarsening converts and combines quadrants of possibly different formats; subdividing requires repeated searches to split a sparse submatrix structure.

Figure 2 shows the results of tuning for MV and for MM with NRHS=3, leading to two different tuned partitionings.

[Figure 2 panels: Untuned. Tuned for NRHS=1. Tuned for NRHS=3.]
Figure 2: Sample instances of symmetric matrix audikw_1 (ca. 1M × 1M, 39M nonzeroes) on a machine with a 256 KB L2 cache.
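Since Algorithm 2 is only described informally here, the following is a hedged sketch of the idea behind PROBE for the MV case, not librsb's internal code. The function and variable names are made up; it reuses the USMV call from the poster listing, takes the minimum of 10 executions from the description above, assumes that the operation honours the OpenMP thread count set via omp_set_num_threads(), and probes powers of two up to omp_get_max_threads() as its "reasonably restricted range" when θ is not specified.

  function probe_usmv(theta, transA, alpha, A, x, incx, y, incy) result(t_best)
    use blas_sparse                  ! assumed Sparse BLAS module, as in the poster listing
    use omp_lib
    implicit none
    integer, intent(in) :: theta     ! requested thread count; <= 0 means: probe a restricted range
    integer, intent(in) :: transA, A, incx, incy
    double precision, intent(in)    :: alpha, x(*)
    double precision, intent(inout) :: y(*)
    double precision :: t_best, t, t0
    integer, parameter :: minreps = 10   ! "a minimum of 10 executions is required"
    integer :: nt, ntmax, istat, i

    ntmax = omp_get_max_threads()    ! upper end of the probed thread range
    t_best = huge(t_best)
    nt = 1
    do
      if (theta > 0) nt = theta      ! a specified θ is used as-is
      call omp_set_num_threads(nt)
      ! Average the operation time over the required minimum of executions.
      t0 = omp_get_wtime()
      do i = 1, minreps
        call USMV(transA, alpha, A, x, incx, y, incy, istat)
      end do
      t = (omp_get_wtime() - t0) / minreps
      if (t < t_best) t_best = t
      if (theta > 0) exit            ! no thread probing requested
      nt = nt * 2                    ! otherwise probe a restricted range: 1, 2, 4, ...
      if (nt > ntmax) exit
    end do
  end function probe_usmv

Returning the best time over the probed thread counts is only one possible reading of "reference performance"; the actual procedure may well average or take medians instead.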