Auto-tuning shared memory parallel Sparse BLAS operations in librsb-1.2

Michele MARTONE ([email protected]), Max Planck Institute for Plasma Physics, Garching bei München, Germany

Intro

• Sparse BLAS (Basic Linear Algebra Subroutines) [2] specifies the main kernels for iterative methods:
  – sparse Multiply by Matrix: "MM"
  – sparse triangular Solve by Matrix: "SM"
• Focus on MM: C ← C + α op(A) B (spelled out in the reference sketch below), with
  – A of dimensions m × k and sparse (nnz nonzeroes)
  – op(A) either of {A, A^T, A^H} (parameter transA)
  – the left hand side (LHS) B of dimensions k × n and the right hand side (RHS) C of dimensions m × n (n = NRHS), both dense (possibly with strided access: incB, incC, ...)
  – α a scalar
  – either single or double precision, either real or complex
• librsb implements the Sparse BLAS using the RSB (Recursive Sparse Blocks) data structure [3].

RSB: Recursive Sparse Blocks

• Sparse blocks in COO or CSR [1]...
• ...possibly with 16-bit indices ("HCOO" or "HCSR").
• Blocks are cache-sized and suitable for parallelism.
• Recursive partitioning of submatrices results in Z-ordered blocks.

[Figure: instance of matrix bayer02 (ca. 14k × 14k, 64k nonzeroes). The black-bordered boxes are sparse blocks, each labelled with its index, format and nonzeroes count (e.g. "2/9 HCSR 1.5e+04"); greener blocks have fewer nonzeroes than average, redder ones have more. A block's rows (columns) determine the range of the LHS (RHS) operand it touches during MM. Larger submatrices like "9/9" can have fewer nonzeroes than smaller ones like "4/9".]

Sparse BLAS autotuning extension

• Hand tuning for each operation variant is impossible.
• We propose empirical auto-tuning for librsb-1.2.

  ! Matrix-Vector Multiply: y ← alpha*op(A)*x + y
  call USMV(transA, alpha, A, x, incx, y, incy, istat)
  ! Tuning request for the next operation
  call USSP(A, blas_autotune_next_operation, istat)
  ! Matrix structure and threads tuning
  call USMV(transA, alpha, A, x, incx, y, incy, istat)
  ...
  do ! A is now tuned for y ← alpha*op(A)*x + y
    call USMV(transA, alpha, A, x, incx, y, incy, istat)
    ...
  end do
  ! Request autotuning again
  call USSP(A, blas_autotune_next_operation, istat)
  ! Now tune for C ← C + alpha * op(A) * B
  call USMM(order, transA, nrhs, alpha, A, B, ldB, C, ldC, istat)
  ! The RSB representation of A is probably different than before USMM

Merge / split based autotuning

• Optimal default blocking?
  – Irregular matrix patterns!
  – Operands (especially transA, NRHS) change the memory footprint!
• Empirical auto-tuning:
  – Given a Sparse BLAS operation, probe for a better performing blocking.
  – Search among slightly coarser or finer ones (see the probing sketch below).

[Figure: tuning example on the symmetric matrix audikw_1, here only the lower triangle, ca. 1M × 1M, 39M nonzeroes, on a machine with a 256 KB L2 cache. Left panel ("Untuned"): 625 blocks, avg 491 KB, before tuning. Middle panel ("Tuned for NRHS=1"): 271 blocks, avg 1133 KB, after tuning for MV (MM with NRHS=1): 1.057x speedup, 6436.6 operations to amortize. Right panel ("Tuned for NRHS=3"): 1319 blocks, avg 233 KB, after tuning for MM with NRHS=3: 1.050x speedup, 3996.2 operations to amortize. The finer subdivision at NRHS=3 is a consequence of the increased cache occupation of the per-block LHS/RHS operands.]
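As a concrete reference for the MM kernel defined in the Intro, here is a minimal plain-Fortran sketch (illustrative only, not librsb code; the program and routine names are invented) spelling out what C ← C + α op(A) B computes for a COO-stored A, including the transposed case and multiple right hand side columns. With NRHS=1 it reduces to the USMV update y ← alpha*op(A)*x + y used in the listing above.

  program usmm_reference_sketch
    implicit none
    integer, parameter :: m = 3, k = 3, nrhs = 2, nnz = 4
    integer :: ia(nnz) = [1, 2, 3, 3]              ! row indices of the nonzeroes
    integer :: ja(nnz) = [1, 2, 1, 3]              ! column indices of the nonzeroes
    double precision :: va(nnz) = [2.0d0, 3.0d0, 1.0d0, 4.0d0]
    double precision :: B(k,nrhs), C(m,nrhs)

    B = 1.0d0
    C = 0.0d0
    call ref_mm('N', 1.0d0, ia, ja, va, B, C)      ! C ← C + 1.0 * A * B
    call ref_mm('T', 0.5d0, ia, ja, va, C, B)      ! B ← B + 0.5 * A^T * C (ok here since m = k)
    print '(a,6f7.2)', ' C =', C
    print '(a,6f7.2)', ' B =', B

  contains

    subroutine ref_mm(transA, alpha, ia, ja, va, B, C)
      character, intent(in) :: transA
      double precision, intent(in) :: alpha, va(:), B(:,:)
      integer, intent(in) :: ia(:), ja(:)
      double precision, intent(inout) :: C(:,:)
      integer :: p, j
      do j = 1, size(B, 2)                         ! loop over the NRHS columns
        do p = 1, size(va)                         ! one update per stored nonzero
          if (transA == 'N') then
            C(ia(p), j) = C(ia(p), j) + alpha * va(p) * B(ja(p), j)
          else                                     ! 'T': use op(A) = transpose of A
            C(ja(p), j) = C(ja(p), j) + alpha * va(p) * B(ia(p), j)
          end if
        end do
      end do
    end subroutine ref_mm

  end program usmm_reference_sketch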

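The probing idea behind merge/split autotuning can be pictured with the following self-contained toy (plain Fortran; the 1-D row blocking, the banded matrix and all names are invented for illustration and do not reflect librsb's internal data structures or API): time the operation at the current block count, at a twice coarser (merged) and at a twice finer (split) blocking, and keep the fastest. librsb-1.2 applies the analogous search to its two-dimensional, Z-ordered sparse blocks and also tunes the thread count.

  program probe_blocking_sketch
    use iso_fortran_env, only: int64
    implicit none
    integer, parameter :: m = 20000, nnzpr = 8, nnz = m*nnzpr, reps = 50
    integer :: rp(m+1), ja(nnz), cand(3)
    integer :: i, k, p, nb, best_nb
    double precision :: va(nnz), x(m), y(m), t, best_t

    ! build a banded toy matrix in CSR form (row pointers rp, column indices ja)
    p = 0
    do i = 1, m
      rp(i) = p + 1
      do k = 1, nnzpr
        p = p + 1
        ja(p) = mod(i + k - 2, m) + 1
        va(p) = 1.0d0/dble(k)
      end do
    end do
    rp(m+1) = p + 1
    x = 1.0d0

    nb = 16                              ! current number of row blocks
    cand = [ max(nb/2, 1), nb, 2*nb ]    ! merged, unchanged and split candidates
    best_t = huge(1.0d0)
    best_nb = nb
    do k = 1, 3
      t = timed_spmv(cand(k))
      print '(a,i6,a,f10.6,a)', ' row blocks:', cand(k), '  time:', t, ' s'
      if (t < best_t) then
        best_t = t
        best_nb = cand(k)
      end if
    end do
    print '(a,i6)', ' keeping the blocking with row blocks =', best_nb

  contains

    function timed_spmv(nblocks) result(el)
      integer, intent(in) :: nblocks
      double precision :: el
      integer :: rep, b, lo, hi, r, q
      integer(int64) :: c0, c1, cr
      call system_clock(c0, cr)
      do rep = 1, reps
        y = 0.0d0
        do b = 1, nblocks                ! one contiguous block of rows at a time
          lo = (b - 1)*m/nblocks + 1
          hi = b*m/nblocks
          do r = lo, hi
            do q = rp(r), rp(r+1) - 1
              y(r) = y(r) + va(q)*x(ja(q))
            end do
          end do
        end do
      end do
      call system_clock(c1)
      el = dble(c1 - c0)/dble(cr)
    end function timed_spmv

  end program probe_blocking_sketch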
Experiment in MM tuning and comparison to MKL

[Figure: per-matrix results for NRHS=1 (left half) and NRHS=2 (right half), with one row of panels per matrix group (Symmetric; General, transposed; General, untransposed) and one column per quantity: speedup of tuned vs untuned RSB; speedup of tuned RSB vs "best threads" MKL; ratio of RSB autotuning time to autotuned MM time; ratio of RSB blocks count (tuned to untuned); ratio of RSB index bytes per nnz (tuned to untuned). Each panel is annotated with its average.]

Setup

• librsb (icc -O3 -xAVX, v.15) vs Intel MKL (v.11.2) CSR.
• 2 × "Intel Xeon E5-2680", 16 OpenMP threads.
• MM with NRHS = {1, 2}, four BLAS numerical types.
• 27 matrices in total (as in [3]), including symmetric ones.

Results Summary

• A few dozen percent improvement over the untuned version, costing a few thousand operations to amortize.
• Significantly faster than Intel MKL on symmetric and transposed operation with NRHS=2 (>1×).
• Autotuning more effective on symmetric and unsymmetric untransposed operation with NRHS=1.
• Tuning mostly subdivided blocks further for NRHS=2.

Highlight: symmetric MM vs MV performance

• RSB symmetric MM performance increases when NRHS>1, while Intel MKL's CSR falls (shown here for NRHS=2 and NRHS=4).

[Figure: histograms of symmetric MM time to MV time, for RSB and for MKL, at NRHS=2 and NRHS=4. Annotated averages: 1.19x speedup and 0.32x slowdown at NRHS=2, 1.10x speedup and 0.15x slowdown at NRHS=4, for RSB and MKL respectively.]

Outlook

One may improve via:
• Reversible in-place merge and split: no need for a copy while tuning.
• Best merge/split choice is not obvious: different merge and split rules.
• Non-time efficiency criteria (e.g. use an energy measuring API when picking the better performing variant).

References

[1] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. V. der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.
[2] BLAS Forum. Sparse BLAS specification (BLAS specification, chapter 3). Technical report.
[3] M. Martone. Efficient multithreaded untransposed, transposed or symmetric sparse matrix-vector multiplication with the recursive sparse blocks format. Parallel Computing, 40:251–270, 2014.

EXASCALE15, Workshop Sparse Solvers for Exascale: From Building Blocks to Applications, 23-25 March 2015, Greifswald, Germany