Numerical Linear Algebra for High Performance Computing

SBD Workshop: Transfer of HPC and Data Know-How to Scientific Communities, June 11th/12th 2018, Juelich Supercomputing Centre (JSC)

Hartwig Anzt, Terry Cojean, Goran Flegar, Thomas Grützmacher, Pratik Nayak, Tobias Ribizel
Steinbuch Centre for Computing (SCC)

KIT – The Research University in the Helmholtz Association, www.kit.edu

Algorithms reflecting hardware evolution

• Explosion in core count.
• Compute power (#FLOPs) grows much faster than bandwidth.

"Parallelism needed – synchronization kills performance."
"Operations are free, memory access is what counts."

[Figure: cost of a 64-bit read relative to one DP FLOP; John D. McCalpin (TACC).]

Task-based algorithms
• Define work packages as "tasks".
• Identify dependencies between tasks.
• Break the fork-join model.
• Synchronize locally, avoid global synchronization.
• Runtimes: OpenMP tasks, OmpSs, PaRSEC, StarPU.

[Figure: task DAG scheduled over time.]
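To make the local-synchronization idea concrete, here is a minimal OpenMP task sketch (my illustration, not from the talk): the depend clauses express the task graph, so task C synchronizes only with its producers A and B instead of waiting at a global barrier.

```cpp
#include <cstdio>

int main() {
    double a = 0.0, b = 0.0, c = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1.0;                      // task A
        #pragma omp task depend(out: b)
        b = 2.0;                      // task B, independent of task A
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                    // task C waits only on A and B
        #pragma omp taskwait          // local join for the tasks created here
    }
    std::printf("c = %f\n", c);
}
```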


Reformulate algorithms in terms of fixed-point iterations
• A fixed-point iteration that converges in the asymptotic sense can tolerate a lack of synchronization.
• Element-wise independent iterations allow scaling on multi- and many-core architectures.
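As a minimal sketch of this idea (assuming a strictly diagonally dominant A, for which the Jacobi iteration converges): one element-parallel Jacobi-type sweep that updates in place. Running it without a barrier yields an asynchronous iteration; components may read their neighbors' old or new values and the method still converges asymptotically. (Strictly, the unsynchronized reads/writes would need atomics in C++; they are kept plain for brevity.)

```cpp
#include <vector>

// One element-parallel sweep of x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii.
void async_jacobi_sweep(const std::vector<std::vector<double>>& A,
                        const std::vector<double>& b,
                        std::vector<double>& x) {
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int j = 0; j < n; ++j) {
            if (j != i) s -= A[i][j] * x[j];  // may see old or new values
        }
        x[i] = s / A[i][i];
    }
}
```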

Fixed-point based algorithms: ParILU

We want to solve a linear problem of the form Ax = b. For this, we factorize A into the product of a lower triangular matrix L and an upper triangular matrix U.


Exact LU Factorization
• Decompose the system matrix into the product A = L · U.
• Based on Gaussian elimination.
• Triangular solves to solve a system Ax = b: Ly = b ⇒ y, then Ux = y ⇒ x.
• De-facto standard for solving dense problems.
• What about sparse? Often significant fill-in…
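To spell out the two triangular solves, a minimal dense sketch (assuming a unit diagonal in L, as produced by Gaussian elimination):

```cpp
#include <vector>

// Solve Ax = b with A = L·U: forward substitution Ly = b, then
// backward substitution Ux = y. The vector b is overwritten with x.
std::vector<double> lu_solve(const std::vector<std::vector<double>>& L,
                             const std::vector<std::vector<double>>& U,
                             std::vector<double> b) {
    const int n = static_cast<int>(b.size());
    for (int i = 0; i < n; ++i)            // Ly = b  =>  y (unit diagonal)
        for (int j = 0; j < i; ++j) b[i] -= L[i][j] * b[j];
    for (int i = n - 1; i >= 0; --i) {     // Ux = y  =>  x
        for (int j = i + 1; j < n; ++j) b[i] -= U[i][j] * b[j];
        b[i] /= U[i][i];
    }
    return b;
}
```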

Incomplete LU Factorization (ILU): A ≈ L · U
• Focused on restricting fill-in to a specific sparsity pattern S:
  L ∈ R^(n×n) lower (unit-)triangular, sparse; U ∈ R^(n×n) upper triangular, sparse;
  L_ij = U_ij = 0 for all (i, j) ∉ S; R = A − L · U with R_ij = 0 for all (i, j) ∈ S.
• For ILU(0), S is the sparsity pattern of A. Works well for many problems.
• Is this the best we can get for this nonzero count?
• Fill-in in threshold ILU (ILUT) is based on the significance of elements (e.g., their magnitude).
• Often better preconditioners than level-based ILU, but difficult to parallelize.

How to generate an ILU in parallel?
• Gaussian elimination is naturally sequential.
• Level scheduling brings only limited parallelism.
• A ≈ L · U is usually only a rough approximation anyway.
• We would be better off if we could cheaply generate an approximation in parallel.

• Generate an incomplete factorization preconditioner via an iterative process.

• Exploit the property (A − L · U)|_S = 0.

[Illustration: sparsity patterns of A, L, and U in A = L × U.]

• Approximate the values in the incomplete factors L and U via the element-parallel fixed-point iteration

$$
F(l_{ij},u_{ij}) =
\begin{cases}
\dfrac{1}{u_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} l_{ik}u_{kj}\right), & i > j,\\[4pt]
a_{ij} - \sum_{k=1}^{i-1} l_{ik}u_{kj}, & i \le j.
\end{cases}
$$

• This fixed-point iteration converges in the asymptotic sense¹.

1Chow and Patel. “Fine-grained Parallel Incomplete LU Factorization”. In: SIAM J. on Sci. Comp. (2015).
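A minimal sketch of one such sweep (dense array indexing for readability; the actual algorithm¹ operates directly on the sparse factors, and all pattern entries can be updated in parallel and even asynchronously):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Entry { int i, j; };  // one location (i, j) of the sparsity pattern S

// One ParILU fixed-point sweep; L has a unit diagonal (l_ii = 1, not stored).
void parilu_sweep(const std::vector<std::vector<double>>& A,
                  std::vector<std::vector<double>>& L,
                  std::vector<std::vector<double>>& U,
                  const std::vector<Entry>& pattern) {
    #pragma omp parallel for
    for (std::size_t e = 0; e < pattern.size(); ++e) {
        const int i = pattern[e].i, j = pattern[e].j;
        double s = A[i][j];
        for (int k = 0; k < std::min(i, j); ++k) s -= L[i][k] * U[k][j];
        if (i > j)
            L[i][j] = s / U[j][j];  // strictly lower part
        else
            U[i][j] = s;            // upper part including the diagonal
    }
}
```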

• ParILU may need a few fixed-point sweeps to converge in the ILU residual norm².
• The generated incomplete factors are competitive with those of the classical algorithm.

²Chow, Anzt, and Dongarra. "Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs". In: LNCS. 2015.

[Plot: speedup from iteratively updating ILU preconditioners in a Model Order Reduction (MOR) setting³, over the discretization granularity (1,357 to 1,265,537 elements per direction).]

³Anzt et al. "Updating incomplete factorization preconditioners for model order reduction". In: Numerical Algorithms (2016).

Algorithms reflecting hardware evolution

• Compute power (#FLOPs) grows much faster than bandwidth.

"Operations are free, memory access is what counts."

• The arithmetic operations should use high precision formats natively supported by hardware.
• Data access should be as cheap as possible: reduce the precision.
• Radically decouple the storage format from the arithmetic format:
  • double precision in all arithmetic operations;
  • algorithm-aware precision when accessing data.


Challenges when using double precision + IEEE single/half precision:
• Need explicit conversion.
• Data range reduction: must protect against under-/overflow.
• Need to duplicate data in memory (half/single/double).

• Better: the Modular Precision Format.

Modular Precision Format

• Split the IEEE double precision format into segments (2-segment modular precision, 4-segment modular precision, …).

[Bit layout: IEEE double = sign | exponent (11 bit) | mantissa (52 bit), bits 63…0. The 2-segment split: head (32 bit) = sign | exponent (11 bit) | mantissa (20 bit); tail (32 bit) = the remaining mantissa bits.]

• Special "conversion" routines to double precision.
• The head's mantissa (20 bit) is shorter than in IEEE single precision, but the full 11-bit exponent is kept.
• No under-/overflow.
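A minimal sketch of the 2-segment split (plain bit manipulation, assuming IEEE-754 binary64 and 32-bit segments; not the paper's implementation): all arithmetic still happens on double values in registers, only the memory traffic shrinks when the tails are skipped.

```cpp
#include <cstdint>
#include <cstring>

// Split a double into a 32-bit head (sign, 11-bit exponent, top 20 mantissa
// bits) and a 32-bit tail (the remaining 32 mantissa bits).
void split(double v, std::uint32_t& head, std::uint32_t& tail) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    head = static_cast<std::uint32_t>(bits >> 32);
    tail = static_cast<std::uint32_t>(bits);
}

// Full-precision reconstruction from head + tail.
double join(std::uint32_t head, std::uint32_t tail) {
    std::uint64_t bits = (static_cast<std::uint64_t>(head) << 32) | tail;
    double v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}

// Reduced-precision read: only the head is fetched; the mantissa is
// truncated, but the full exponent range is kept (no under-/overflow).
double head_only(std::uint32_t head) { return join(head, 0u); }
```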

• For efficient data access (coalesced reads), interleave the data in memory.

[Memory layout: all heads stored in one contiguous block, all tails in another.]

• Data can be accessed much faster if lower precision is acceptable.

[Plot: runtime of a streaming access over the dataset size (up to 1.4·10^8 elements) on an NVIDIA P100 "Pascal" (5.3 TFLOP/s DP, 16 GB RAM @ 720 GB/s), comparing IEEE double with 2-segment accesses at 64/32 bit and 4-segment accesses at 64/48/32/16 bit.]

• Split the IEEE double precision format into segments; interleave the data in memory for coalesced reads; data can be accessed much faster if lower precision is acceptable.
• This needs precision-aware algorithms, e.g., the adaptive precision Jacobi¹: use low precision data reads whenever the accuracy allows for it.

1Anzt, Dongarra, Quintana-Orti. ”Adaptive precision solvers for sparse linear systems”. In: Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing. (2015).
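A rough sketch of the adaptive read precision (my illustration; the switching criterion and bookkeeping of the actual solvers¹ are more refined): once a component stops changing beyond a tolerance, subsequent sweeps fetch only its head. It reuses split/join/head_only from the modular precision sketch above.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Declared in the modular precision sketch above.
void split(double v, std::uint32_t& head, std::uint32_t& tail);
double join(std::uint32_t head, std::uint32_t tail);
double head_only(std::uint32_t head);

// One Jacobi sweep with precision-adaptive reads; "converged" is a
// simplified stand-in for the switching criterion.
void adaptive_jacobi_sweep(const std::vector<std::vector<double>>& A,
                           const std::vector<double>& b,
                           std::vector<std::uint32_t>& heads,
                           std::vector<std::uint32_t>& tails,
                           std::vector<char>& converged, double tol) {
    const int n = static_cast<int>(b.size());
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            double xj = converged[j] ? head_only(heads[j])      // cheap read
                                     : join(heads[j], tails[j]);
            s -= A[i][j] * xj;
        }
        const double xi_new = s / A[i][i];  // arithmetic stays in double
        const double xi_old = join(heads[i], tails[i]);
        converged[i] = std::abs(xi_new - xi_old) < tol * std::abs(xi_new);
        split(xi_new, heads[i], tails[i]);
    }
}
```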

[Plots: read precision per vector component (10,000 components) over the Jacobi iterations for the 2-segment and 4-segment formats²; as components converge, more and more of them are read in reduced precision.]

²Grützmacher, Anzt. "A Modular Precision Format for decoupling Arithmetic Format and Storage Format". Submitted to the HeteroPar workshop 2018.

[Heatmap: speedup of the adaptive precision Jacobi in the 2-segment modular precision format over vanilla Jacobi, for relative residual stopping thresholds 1e-2…1e-12 and 5…257 elements per row; observed speedups range from about 0.56 to 1.31.]

Where is the link to domain scientists?

• The scientific community only benefits if new algorithms are available in production code.
• Disseminate the algorithms in numerical linear algebra libraries.
• We should accept the general trends and use cases in the community.
• Interface and documentation have to allow for easy integration into existing software ecosystems.
• For a sustainable software effort, we need a healthy software lifecycle!

https://bssw.io/ https://xsdk.info/

Sustainable Software

• Open source software, community effort.
• Open license (MIT / modified BSD).
• Public repository, e.g., git.
• Code review process for internal / external contributions.

• User-friendly installation with CMake / autoconf.
• Clearly listed dependencies.
• Documentation of all functionality and central design principles.
• Tutorials and code integration examples.

• Unit testing for all functionality.
• Reference implementation for all functionality ensuring correctness (not tuned for performance).
• Platform portability, continuous integration (CI).
• Dashboard reporting routine status.

Ginkgo
• Open source sparse linear algebra library.
• Generic algorithm implementations + architecture-specific, highly optimized kernels.
• Focused on GPU accelerators (i.e., NVIDIA GPUs); an OpenMP backend is in the works.
• Git repository; Doxygen, CMake.
• Modified BSD license (3-clause).
• Unit testing with googletest; CI system.

https://github.com/ginkgo-project/ginkgo

• Based on C++11, C bindings planned.
• Templated precision (ValueType, Integer); by default, compiles Z, C, D, S.
• Smart pointers to avoid memory leaks (unique/shared pointers).
• Runtime polymorphism.
• Kernels have the same signature for different architectures; the Executor determines which kernel is used, where the data lives, and where the operation is executed.
• LinOp class for any linear operator (matrices, solvers, preconditioners, …), with generate and apply.
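A minimal usage sketch along these lines, based on Ginkgo's public interface as it appears in the repository (the exact API of the 2018 development state may differ): the executor is the only place where the target architecture enters.

```cpp
#include <iostream>
#include <ginkgo/ginkgo.hpp>

int main() {
    // The executor determines where the data lives and which kernels run;
    // swapping in a CudaExecutor moves the whole solve to the GPU.
    auto exec = gko::ReferenceExecutor::create();

    // Read the system matrix and the vectors from matrix-market input.
    auto A = gko::read<gko::matrix::Csr<double>>(std::cin, exec);
    auto b = gko::read<gko::matrix::Dense<double>>(std::cin, exec);
    auto x = gko::read<gko::matrix::Dense<double>>(std::cin, exec);

    // LinOp pattern: a solver factory generates a solver from the matrix...
    auto solver =
        gko::solver::Cg<double>::build()
            .with_criteria(
                gko::stop::Iteration::build().with_max_iters(1000u).on(exec))
            .on(exec)
            ->generate(gko::share(A));
    // ... and the generated LinOp is applied: x = A^{-1} b.
    solver->apply(gko::lend(b), gko::lend(x));
}
```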


├── core                  "core" contains the algorithms, like ILU, CG, GMRES, etc.;
│   ├── base              anything that is not device-specific.
│   ├── matrix
│   ├── preconditioner
│   ├── solver
│   └── test              Unit tests ensure correctness.
│       ├── base
│       ├── matrix
│       ├── preconditioner
│       ├── solver
│       └── utils
├── reference             "reference" kernels ensure correctness.
│   ├── matrix
│   ├── preconditioner
│   ├── solver
│   └── test
│       ├── base
│       ├── matrix
│       ├── preconditioner
│       └── solver
└── gpu                   "gpu" contains the hardware-tuned numerical kernels.
    ├── matrix
    ├── preconditioner
    ├── solver
    └── test
        ├── matrix
        ├── preconditioner
        └── solver

Helmholtz-Young-Investigator-Group FiNE @ SCC

• Development of new algorithms targeting exascale hardware architectures:
  • asynchronous methods,
  • task-based algorithms,
  • mixed precision kernels.

• Implementation and optimization of the novel methods in numerical linear algebra packages.

• Bridging the gap to application scientists by dissemination of sustainable software. https://github.com/ginkgo-project/ginkgo

Pratik Nayak, Terry Cojean, Goran Flegar, Thomas Grützmacher, Tobias Ribizel.
Helmholtz Impuls und Vernetzungsfond VH-NG-1241.

Motivation

We are looking for a factorization-based preconditioner such that A ≈ L · U is a good approximation with moderate nonzero count (e.g., nnz(L + U) = nnz(A)).

• Where should these nonzero elements be located?
• How can we compute the preconditioner in a highly parallel fashion?


Rethink the overall strategy!

• Use a parallel iterative process to generate the factors.
• The preconditioner should have a moderate number of nonzero elements, but we don't care too much about intermediate data.

1. Select a set of nonzero locations.
2. Compute values in those locations such that A ≈ L · U is a "good" approximation.
3. Maybe change some locations in favor of locations that result in a better preconditioner.
4. Repeat until the preconditioner quality stagnates.

Considerations

• This is an optimization problem with nnz(A − L · U) equations and nnz(L + U) variables.

[Illustration: sparsity patterns of A, L, and U, and the ILU residual R = A − L · U; the residual is generally nonzero also outside the sparsity pattern S of L + U.]

• We may want to compute the values in L, U such that R|_S = (A − L · U)|_S = 0: the approximation is exact in the locations included in S, but not outside. This gives nnz(L + U) equations for nnz(L + U) variables.

• This is the underlying idea of Edmond Chow's parallel ILU algorithm¹: a fixed point of

$$
F(l_{ij},u_{ij}) =
\begin{cases}
\dfrac{1}{u_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} l_{ik}u_{kj}\right), & i > j,\\[4pt]
a_{ij} - \sum_{k=1}^{i-1} l_{ik}u_{kj}, & i \le j,
\end{cases}
$$

satisfies $L \cdot U|_{\mathcal{S}} = A|_{\mathcal{S}}$.

¹Chow and Patel. "Fine-grained Parallel Incomplete LU Factorization". In: SIAM J. on Sci. Comp. (2015).

• The iteration converges in the asymptotic sense towards incomplete factors L, U such that R|_S = (A − L · U)|_S = 0.

• We may not need high accuracy here, because we may change the pattern again…
• One single fixed-point sweep.

• Comparing sparsity patterns is extremely difficult; maybe use the ILU residual norm as a convergence check.

• The sparsity pattern of A might be a good initial start for the nonzero locations.

• Then the approximation will be exact for all locations S₀ = S(L₀ + U₀) and the residual nonzero in the locations S₁ = (S(A) ∪ S(L₀ · U₀)) \ S(L₀ + U₀)¹.

¹Saad. "Iterative Methods for Sparse Linear Systems, 2nd Edition". (2003).

• Adding all these locations (level fill!) might be a good idea, but adding them will again generate new nonzero residuals in S₂ = (S(A) ∪ S(L₁ · U₁)) \ S(L₁ + U₁).

• At some point we should remove some locations again, e.g., the smallest elements, and start over looking at the locations of R_k = A − L_k · U_k…

ParILUT

Interleaving fixed-point sweeps approximating values with pattern-changing symbolic routines.
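Schematically, one ParILUT iteration interleaves the blocks as follows (a control-flow sketch with placeholder functions and hypothetical names; the real building blocks are discussed next):

```cpp
#include <cstdio>

void candidate_search()  { std::puts("S* = (S(A) u S(L*U)) \\ S(L+U)"); }
void add_locations()     { std::puts("extend the pattern by the candidates"); }
void fixed_point_sweep() { std::puts("update l_ij, u_ij on the current pattern"); }
void select_and_remove() { std::puts("threshold-select, drop smallest entries"); }

int main() {
    for (int step = 0; step < 3; ++step) {  // until the quality stagnates
        candidate_search();    // symbolic: find new nonzero locations
        add_locations();       // symbolic: grow the sparsity pattern
        fixed_point_sweep();   // numeric: values incl. the new locations
        select_and_remove();   // symbolic: keep the nonzero budget
        fixed_point_sweep();   // numeric: re-approximate the pruned factors
    }
}
```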

Parallelism inside the building blocks.

Parallelism inside the blocks: Fixed-point sweeps

Fixed-point sweeps approximate the values in the ILU factors and the residual¹ via the element-parallel formula F(l_ij, u_ij) given above:
• Inherently parallel operation; the bilinear fixed-point iteration can be parallelized by elements.
• Elements can be updated asynchronously.
• We can expect 100% parallel efficiency as long as the number of cores < number of elements.
• The residual norm is a global reduction.

1Chow and Patel. “Fine-grained Parallel Incomplete LU Factorization”. In: SIAM J. on Sci. Comp. (2015).

Parallelism inside the blocks: Candidate search

Identify the locations that are symbolically nonzero:

$$\mathcal{S}^* = (\mathcal{S}(A) \cup \mathcal{S}(L \cdot U)) \setminus \mathcal{S}(L + U)$$

• A combination of a sparse matrix product and sparse matrix sums.
• The building blocks are available in Sparse BLAS.
• The blocks can be combined into one kernel for higher (memory) efficiency.
• The kernel can be parallelized by rows.
• The cost depends heavily on the sparsity pattern.
• Kernel performance is bound by the memory bandwidth.
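A row-parallel sketch of the candidate search (CSR structures assumed, std::set used for clarity; a tuned kernel would merge sorted index lists instead): row i of S(L·U) is the union of the U-rows selected by the L-entries of row i.

```cpp
#include <set>
#include <vector>

struct Csr {  // minimal CSR structure for the sketch
    int n;
    std::vector<int> row_ptr, col_idx;
};

// Candidate locations S* = (S(A) ∪ S(L·U)) \ S(L+U), computed row by row.
std::vector<std::set<int>> candidates(const Csr& A, const Csr& L, const Csr& U) {
    std::vector<std::set<int>> cand(A.n);
    #pragma omp parallel for
    for (int i = 0; i < A.n; ++i) {
        std::set<int> pattern;  // S(A) ∪ S(L·U), restricted to row i
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            pattern.insert(A.col_idx[k]);
        for (int k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
            const int r = L.col_idx[k];          // row of U hit by l_ir != 0
            for (int t = U.row_ptr[r]; t < U.row_ptr[r + 1]; ++t)
                pattern.insert(U.col_idx[t]);
        }
        // Subtract the locations already present in L + U.
        for (int k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k)
            pattern.erase(L.col_idx[k]);
        for (int k = U.row_ptr[i]; k < U.row_ptr[i + 1]; ++k)
            pattern.erase(U.col_idx[k]);
        cand[i] = std::move(pattern);
    }
    return cand;
}
```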

Parallelism inside the blocks: Selecting thresholds

A threshold separating the smallest elements is needed for removing insignificant locations and keeping sparsity.
• Standard approach: sort / selection algorithms.
  • High computational cost.
  • Memory-intensive.
  • Hard to parallelize.
• Thresholds do not need to be exact:
  • Inaccurate thresholds result in a few additional / fewer elements.
  • We can use sampling to get reasonable approximations.
  • Multiple sampling-based selection runs allow generating thresholds of reasonable quality in parallel.
• Is this appropriate for many-core architectures with 5K threads executing simultaneously?

Threshold selection on parallel architectures

Selection algorithms are traditionally based on re-arranging elements in memory:
• SelectionSort: O(n²) comparisons, O(n) element swaps.
• QuickSelect: O(n) comparisons on average, O(n) element swaps.
• Floyd-Rivest algorithm: n + min(k, n − k) + O(√n) comparisons.
• IntroSelect: O(n) comparisons and O(n) element swaps in the worst case.

• Compute power (#FLOPs) grows much faster than memory bandwidth, so data-rearranging selection algorithms become inefficient.

"Operations are free, memory access is what counts."

Threshold selection on parallel architectures: StreamSelect

Rethink the overall strategy!

• Primary goal: reduce the memory traffic.
• Account for high core counts, each core having a set of registers (oversubscribe with threads to hide latency).
• Assume a "nice" (~uniform) distribution of the values.
• Accept some inaccuracy in the generated threshold (approximation).

1. Find the largest and smallest elements to get the data range.
2. Generate a fine grid of thresholds and distribute them to the cores.
3. Stream all data one single time.
4. Each core handles a set of thresholds and counts how many elements are larger/smaller.
5. Select the threshold with the element count closest to the target value.

[Hardware: NVIDIA V100 "Volta", 7.8 TFLOP/s DP, 16 GB RAM @ 900 GB/s; 80 multiprocessors, each with 32 FP64 / 64 FP32 cores.]
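A sequential stand-in for the StreamSelect kernel (a sketch under the stated assumptions: non-empty input, roughly uniform magnitudes; on the GPU, each core keeps the counters for its few thresholds in registers):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Approximate the threshold below which k of the magnitudes fall, using a
// fixed grid of candidate thresholds and one single pass over the data.
double stream_select(const std::vector<double>& vals, std::size_t k,
                     int num_thresholds = 1024) {
    double lo = std::abs(vals[0]), hi = lo;   // 1. data range (one pass)
    for (double v : vals) {
        lo = std::min(lo, std::abs(v));
        hi = std::max(hi, std::abs(v));
    }
    const double step = (hi - lo) / num_thresholds;  // 2. threshold grid
    if (step == 0.0) return hi;
    std::vector<std::size_t> count(num_thresholds, 0);
    for (double v : vals) {                   // 3./4. stream and count
        int cell = std::min<int>(num_thresholds - 1,
                                 static_cast<int>((std::abs(v) - lo) / step));
        ++count[cell];
    }
    for (int t = 1; t < num_thresholds; ++t)  // prefix sum: elements below
        count[t] += count[t - 1];             //   threshold lo + (t+1)*step
    auto miss = [&](int t) {                  // distance to the target count
        return count[t] > k ? count[t] - k : k - count[t];
    };
    int best = 0;                             // 5. closest element count
    for (int t = 1; t < num_thresholds; ++t)
        if (miss(t) < miss(best)) best = t;
    return lo + (best + 1) * step;
}
```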

68 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Threshold selection on parallel architectures: StreamSelect

Rethink the overall strategy!

• Primary goal: reduce the memory traffic. 80 Multiprocessors, each with 64 FP32 cores • Account for high core counts, each having a set of registers. [oversubscribe with threads to hide latency] • Assume “nice” distribution of values ( ~uniform ). • Accept some inaccuracy in the generated threshold (approximation). Run 5120 threads (bind to cores) each a few thresholds:

1. Find the largest and smallest elements to get the data range. 2. Generate a fine grid of thresholds, distribute them to the cores. 40 3. Stream all data one single time. Linear StreamSelect 4. Each core handles a set of thresholds and counts how many 30 elements are larger/smaller. 5. Select the threshold with the element count closest to the 20 target value. Runtime 10

0 NVIDIA V100 “Volta” 0 5 10 15 20 25 30 35 7.8 TFLOP/s DP Thresholds per thread 16GB RAM @900 GB/s 32 thresholds gives mesh granularity of 6.1035e-06

69 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Threshold selection on parallel architectures: StreamSelect

Rethink the overall strategy! • Set size m, subset size n* • Runtime increases linear with set size. • Primary goal: reduce the memory traffic. On a V100 GPU: 80 Multiprocessors • For uniform distribution: quality ~ mesh granularity. • Runtime independent of subset size. • Account for high core counts, each having a set of registers. each with 32 FP64 cores 2 • Assume “nice” distribution of values ( ~uniform ).100 10 64 FP32 cores (n-n*) / m • Accept some inaccuracy in the generated threshold (approximation). [oversubscribe with threads to hide latency]

0 1. Find the largest and smallest elements to get the data range. 10Run 5120 threads (bind to cores) each a few thresholds: 2. 10Generate a -5 fine grid of thresholds, distribute them to the cores. 40

Runtime Linear 3.Deviation Stream all data one single time. -2 10 StreamSelect 4. Each core handles a set of thresholds and counts how many 30 elements are larger/smaller. -4 5.10-10Select the threshold with the element count closest to the 10 20 3 4 5 6 7 8 103 104 105 106 107 108 target value.10 10 10 10 10 10 Runtime Subset size n* 10 Set size m

0 NVIDIA V100 “Volta” 0 5 10 15 20 25 30 35 7.8 TFLOP/s DP Thresholds per thread 16GB RAM @900 GB/s 32 thresholds gives mesh granularity of 6.1035e-06

Is this a future-oriented algorithm?

Interleaving fixed-point sweeps approximating values with pattern-changing symbolic routines; parallelism inside the building blocks.

A bulk-synchronous algorithm!

[Diagram: dependencies between the building blocks of consecutive ParILUT iterations.]

• Strong dependency: we cannot start before the previous block has finished.
• Weak dependency: if we start earlier, we get a few nonzeros more or less.
• We can already start with the next iteration.

[Diagram: pipelined ParILUT iterations, with parts offloaded to a GPU.]

Excellent candidate for hybrid hardware? Asynchronous execution?

80 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing ParILUT – A New Parallel Threshold ILU

Next steps:

• Hybrid ParILUT version utilizing GPU and CPU, overlapping communication & computation.

• Asynchronous version relaxing dependencies.

• Use a different sparsity-pattern generator:
  • Randomized?
  • Machine learning techniques?

• Increasing fill-in towards “full” factorization.

• ParILUT routines available in MAGMA-sparse – they will be in Ginkgo!

This research is in cooperation with Edmond Chow (GaTech) and Jack Dongarra (University of Tennessee). Helmholtz Impuls- und Vernetzungsfonds VH-NG-1241.

81 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Scalability: Intel Xeon Phi 7250 “Knights Landing” (68 cores @ 1.40 GHz, 16 GB MCDRAM @ 490 GB/s), thermal2 matrix from SuiteSparse, RCM ordering, 8 el/row.

[Figure: left, speedup of the ParILUT building blocks (Candidates, Residuals, ILU-norm, Select, Add, Sweeps, Remove, and CSR/CSC transposition) vs. number of threads; right, runtime fraction of each building block vs. number of threads.]

• Building blocks scale with 15% - 100% parallel efficiency.
• Transposition and sort are the bottlenecks.
• Overall speedup ~35x when using 68 KNL cores.

82 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Scalability: Intel Xeon Phi 7250 “Knights Landing” (68 cores @ 1.40 GHz, 16 GB MCDRAM @ 490 GB/s), topopt120 matrix from topology optimization, 67 el/row.

[Figure: left, speedup of the ParILUT building blocks (Candidates, Residuals, ILU-norm, Select, Add, Sweeps, Remove, and CSR/CSC transposition) vs. number of threads; right, runtime fraction of each building block vs. number of threads.]

• Building blocks scale with 15% - 100% parallel efficiency.
• Dominated by candidate search.
• Overall speedup ~52x when using 68 KNL cores.

83 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Performance: Intel Xeon Phi 7250 “Knights Landing” (68 cores @ 1.40 GHz, 16 GB MCDRAM @ 490 GB/s). Runtime of 5 ParILUT / ParICT steps and speedup over SuperLU ILUT*.

Matrix       | Origin                        | Rows      | Nonzeros  | nnz/row | SuperLU  | ParILUT          | ParICT
ani7         | 2D Anisotropic Diffusion      | 203,841   | 1,407,811 | 6.91    | 10.48 s  | 0.45 s (23.34x)  | 0.30 s (35.16x)
apache2      | SuiteSparse Matrix Collection | 715,176   | 4,817,870 | 6.74    | 62.27 s  | 1.24 s (50.22x)  | 0.65 s (95.37x)
cage11       | SuiteSparse Matrix Collection | 39,082    | 559,722   | 14.32   | 60.89 s  | 0.54 s (112.56x) | --
jacobianMat9 | Fun3D Fluid Flow Problem      | 90,708    | 5,047,042 | 55.64   | 153.84 s | 7.26 s (21.19x)  | --
thermal2     | Thermal Problem (SuiteSparse) | 1,228,045 | 8,580,313 | 6.99    | 91.83 s  | 1.23 s (74.66x)  | 0.68 s (134.25x)
tmt_sym      | SuiteSparse Matrix Collection | 726,713   | 5,080,961 | 6.97    | 53.42 s  | 0.70 s (76.21x)  | 0.41 s (131.25x)
topopt120    | Geometry Optimization         | 132,300   | 8,802,544 | 66.53   | 44.22 s  | 14.40 s (3.07x)  | 8.24 s (5.37x)
torso2       | SuiteSparse Matrix Collection | 115,967   | 1,033,473 | 8.91    | 10.78 s  | 0.27 s (39.92x)  | --
venkat01     | SuiteSparse Matrix Collection | 62,424    | 1,717,792 | 27.52   | 8.53 s   | 0.74 s (11.54x)  | --

The ParILUT and ParICT columns show the runtime and, in parentheses, the speedup over SuperLU ILUT.

*We thank Sherry Li and Meiyue Shao for technical help in generating the performance numbers.

84 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing How about GPUs? NVIDIA V100 “Volta”: 7.8 TFLOP/s DP, 16 GB RAM @ 900 GB/s

• Fine-grained parallelism.
• High bandwidth for coalesced reads.
• No deep cache hierarchy.
• We need to oversubscribe cores to hide latency.

85 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing How about GPUs?

thermal2 matrix from SuiteSparse, RCM ordering, 8 el/row.

[Figure: runtime of the individual ParILUT building blocks (Residuals, Sort, Add, Transpose, Candidates, Transpose, Threshold, Remove, Sweep 1, Sweep 2) on an NVIDIA P100 GPU vs. a 20-core Intel Haswell CPU.]

86 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing How about GPUs?

topopt120 matrix from topology optimization, 67 el/row.

[Figure: runtime of the individual ParILUT building blocks (Residuals, Sort, Add, Transpose, Candidates, Transpose, Threshold, Remove, Sweep 1, Sweep 2) on an NVIDIA P100 GPU vs. a 20-core Intel Haswell CPU.]

87 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing ParILUT quality: anisotropic fluid flow problem (n = 741, nz = 4,951)

[Figure: left, CG iterations vs. number of ParICT steps (2 sweeps per step) for IC(0), ICT, and ParICT; right, pattern discrepancy between the ParICT pattern and the ILU(0) / ILUT patterns vs. number of ParICT steps.]

• Top-level solver iterations as quality metric.
• Pattern stagnates after a few sweeps.
• A few sweeps give a “better” preconditioner than ILU(0).
• Pattern is “more like” ILUT than ILU(0).
• Better than ILUT?

90 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Test matrices

91 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Convergence: GMRES iterations

92 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Convergence: CG iterations

93 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Preconditioning

We iteratively solve a linear system of the form Ax = b, where A ∈ R^{n×n} is nonsingular and b, x ∈ R^n.

The convergence rate typically depends on the conditioning of the linear system, which is the ratio between the largest and smallest eigenvalue:

cond_2(A) = λ_max / λ_min = λ_min^{-1} / λ_max^{-1} = cond_2(A^{-1})

With M ≈ A^{-1} we can transform the linear system into a system with a lower condition number:

MAx = Mb (left preconditioned)
AMy = b, x = My (right preconditioned)

If we now apply the iterative solver to the preconditioned system MAx = Mb, we usually get faster convergence.
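As a hedged illustration of this effect, the SciPy snippet below runs CG on a small SPD model problem once unpreconditioned and once with M built from an incomplete-LU factorization (SciPy's spilu; the incomplete factorization idea is detailed below). For an SPD system one would use an incomplete Cholesky factorization as in ParICT, so spilu merely stands in here; none of this is the talk's code.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 64
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
A = sp.kronsum(T, T).tocsc()                  # 2D Poisson matrix (SPD)
b = np.ones(A.shape[0])

ilu = spla.spilu(A, drop_tol=1e-3)            # incomplete factorization A ~ L*U
M = spla.LinearOperator(A.shape, matvec=ilu.solve)   # applies z = M r

for label, prec in [("CG, unpreconditioned", None), ("CG + ILU", M)]:
    iters = []
    x, info = spla.cg(A, b, M=prec, callback=lambda xk: iters.append(1))
    print(f"{label}: {len(iters)} iterations (info={info})")
```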

Assume M = A^{-1}; then x = MAx = Mb: the solution is directly available, but computing M = A^{-1} is expensive…

The preconditioned system MA is rarely formed explicitly; instead, M is applied implicitly: z_{k+1} = M r_{k+1}.

Instead of forming the preconditioner M ≈ A^{-1} explicitly and applying it as z_{k+1} = M r_{k+1}, incomplete factorization preconditioners (ILU) try to find an approximate factorization A ≈ L · U.

In the application phase, the preconditioner is only given implicitly, requiring two triangular solves:

z_{k+1} = M r_{k+1}  ⟺  M^{-1} z_{k+1} = r_{k+1}  ⟺  L U z_{k+1} = r_{k+1}  ⟹  L y = r_{k+1},  U z_{k+1} = y

98 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing
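A minimal sketch of the application phase above, assuming sparse triangular factors L (lower) and U (upper) in CSR format, e.g. from an incomplete factorization; the function name is illustrative:

```python
import scipy.sparse.linalg as spla

def apply_ilu_preconditioner(L, U, r):
    """Apply z = M r with M given only implicitly by A ~ L*U:
    two triangular solves instead of ever forming M."""
    y = spla.spsolve_triangular(L, r, lower=True)    # forward solve:  L y = r
    z = spla.spsolve_triangular(U, y, lower=False)   # backward solve: U z = y
    return z
```

Inside a Krylov solver, this routine is called once per iteration on the current residual r_{k+1}.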

If we now apply the iterative solver to the preconditioned system, we usually get faster convergence. MAx = Mb

97 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Preconditioning

1 We iteratively solve a linear system of the form Ax = b Assume , then: .M = A x = MAx = Mb n n n Where nonsingular and .A R ⇥ b, x R Solution right available, but computing 2 2 1 M = A is expensive…

The convergence rate typically depends on the The preconditioned system is rarely formed explicitely,MA conditioning of the linear system, which is the ratio insted is applied implicitely:M zk+1 = Mrk+1 between the largest and smallest eigenvalue. 1 1 Instead of forming the preconditioner M A explicitly cond (A)= max = min = cond (A 1) ⇡ 2 1 2 and applying as , zk+1 = Mrk+1 min max Incomplete Factorization Preconditioner (ILU) try to 1 With we can M A transform the linear system into find an approximate factorization A L U. a system with a lower condition number:⇡ ⇡ · In the application phase, the preconditioner is only MAx = Mb (left preconditioned) given implicitly, requiring two triangular solves: AMy = b, x = My (right preconditioned) zk+1 = Mrk+1 1 M zk+1 = rk+1 If we now apply the iterative solver to the preconditioned LUz = r system, we usually get faster convergence. MAx = Mb k+1 k+1 =:y Ly = r ,Uz = y | {z)} k+1 k+1 98 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing