Numerical Linear Algebra for High Performance Computing

SBD Workshop: Transfer of HPC and Data Know-How to Scientific Communities, June 11th/12th 2018, Juelich Supercomputing Centre (JSC)

Hartwig Anzt, Terry Cojean, Goran Flegar, Thomas Grützmacher, Pratik Nayak, Tobias Ribizel
Steinbuch Centre for Computing (SCC)

KIT – The Research University in the Helmholtz Association, www.kit.edu

Algorithms reflecting hardware evolution

• Explosion in core count.
• Compute power (#FLOPs) grows much faster than bandwidth.

"Parallelism needed – synchronization kills performance."
"Operations are free, memory access is what counts."

[Figure: cost of a 64-bit read relative to one DP FLOP; John D. McCalpin (TACC).]

Task-based algorithms
• Define work packages as "tasks".
• Identify dependencies between tasks.
• Break the fork-join model.
• Synchronize locally, avoid global synchronization.
• Runtimes: OpenMP tasks, OmpSs, PaRSEC, StarPU.

[Figure: task DAG scheduled over time.]
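To make the local-synchronization idea concrete, here is a minimal OpenMP task sketch (my illustration, not from the talk): the depend clauses express the task graph, so task C synchronizes only with its producers A and B instead of waiting at a global barrier.

```cpp
#include <cstdio>

int main() {
    double a = 0.0, b = 0.0, c = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1.0;                      // task A
        #pragma omp task depend(out: b)
        b = 2.0;                      // task B, independent of task A
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                    // task C waits only on A and B
        #pragma omp taskwait          // local join for the tasks created here
    }
    std::printf("c = %f\n", c);
}
```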


Reformulate algorithms in terms of fixed-point iterations
• A fixed-point iteration that converges in the asymptotic sense can tolerate a lack of synchronization.
• Element-wise independent iterations allow scaling on multi- and many-core architectures.
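As a minimal sketch of this idea (assuming a strictly diagonally dominant A, for which the Jacobi iteration converges): one element-parallel Jacobi-type sweep that updates in place. Running it without a barrier yields an asynchronous iteration; components may read their neighbors' old or new values and the method still converges asymptotically. (Strictly, the unsynchronized reads/writes would need atomics in C++; they are kept plain for brevity.)

```cpp
#include <vector>

// One element-parallel sweep of x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii.
void async_jacobi_sweep(const std::vector<std::vector<double>>& A,
                        const std::vector<double>& b,
                        std::vector<double>& x) {
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int j = 0; j < n; ++j) {
            if (j != i) s -= A[i][j] * x[j];  // may see old or new values
        }
        x[i] = s / A[i][i];
    }
}
```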

Fixed-point based algorithms: ParILU

We want to solve a linear problem of the form Ax = b. For this, we factorize A into the product of a lower triangular matrix L and an upper triangular matrix U.


Exact LU Factorization
• Decompose the system matrix into the product A = L · U.
• Based on Gaussian elimination.
• Triangular solves to solve a system Ax = b: Ly = b ⇒ y, then Ux = y ⇒ x.
• De-facto standard for solving dense problems.
• What about sparse? Often significant fill-in…
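To spell out the two triangular solves, a minimal dense sketch (assuming a unit diagonal in L, as produced by Gaussian elimination):

```cpp
#include <vector>

// Solve Ax = b with A = L·U: forward substitution Ly = b, then
// backward substitution Ux = y. The vector b is overwritten with x.
std::vector<double> lu_solve(const std::vector<std::vector<double>>& L,
                             const std::vector<std::vector<double>>& U,
                             std::vector<double> b) {
    const int n = static_cast<int>(b.size());
    for (int i = 0; i < n; ++i)            // Ly = b  =>  y (unit diagonal)
        for (int j = 0; j < i; ++j) b[i] -= L[i][j] * b[j];
    for (int i = n - 1; i >= 0; --i) {     // Ux = y  =>  x
        for (int j = i + 1; j < n; ++j) b[i] -= U[i][j] * b[j];
        b[i] /= U[i][i];
    }
    return b;
}
```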

Incomplete LU Factorization (ILU): A ≈ L · U
• Focused on restricting fill-in to a specific sparsity pattern S:
  L ∈ R^(n×n) lower (unit-)triangular, sparse; U ∈ R^(n×n) upper triangular, sparse;
  L_ij = U_ij = 0 for all (i, j) ∉ S; R = A − L · U with R_ij = 0 for all (i, j) ∈ S.
• For ILU(0), S is the sparsity pattern of A. Works well for many problems.
• Is this the best we can get for this nonzero count?
• Fill-in in threshold ILU (ILUT) is based on the significance of elements (e.g., their magnitude).
• Often better preconditioners than level-based ILU, but difficult to parallelize.

How to generate an ILU in parallel?
• Gaussian elimination is naturally sequential.
• Level scheduling brings only limited parallelism.
• A ≈ L · U is usually only a rough approximation anyway.
• We would be better off if we could cheaply generate an approximation in parallel.

• Generate an incomplete factorization preconditioner via an iterative process.

• Exploit the property (A − L · U)|_S = 0.

[Illustration: sparsity patterns of A, L, and U in A = L × U.]

• Approximate the values in the incomplete factors L and U via the element-parallel fixed-point iteration

$$
F(l_{ij},u_{ij}) =
\begin{cases}
\dfrac{1}{u_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} l_{ik}u_{kj}\right), & i > j,\\[4pt]
a_{ij} - \sum_{k=1}^{i-1} l_{ik}u_{kj}, & i \le j.
\end{cases}
$$

• This fixed-point iteration converges in the asymptotic sense¹.

1Chow and Patel. “Fine-grained Parallel Incomplete LU Factorization”. In: SIAM J. on Sci. Comp. (2015).
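A minimal sketch of one such sweep (dense array indexing for readability; the actual algorithm¹ operates directly on the sparse factors, and all pattern entries can be updated in parallel and even asynchronously):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Entry { int i, j; };  // one location (i, j) of the sparsity pattern S

// One ParILU fixed-point sweep; L has a unit diagonal (l_ii = 1, not stored).
void parilu_sweep(const std::vector<std::vector<double>>& A,
                  std::vector<std::vector<double>>& L,
                  std::vector<std::vector<double>>& U,
                  const std::vector<Entry>& pattern) {
    #pragma omp parallel for
    for (std::size_t e = 0; e < pattern.size(); ++e) {
        const int i = pattern[e].i, j = pattern[e].j;
        double s = A[i][j];
        for (int k = 0; k < std::min(i, j); ++k) s -= L[i][k] * U[k][j];
        if (i > j)
            L[i][j] = s / U[j][j];  // strictly lower part
        else
            U[i][j] = s;            // upper part including the diagonal
    }
}
```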

• ParILU may need a few fixed-point sweeps to converge in the ILU residual norm².
• The generated incomplete factors are competitive with those of the classical algorithm.

²Chow, Anzt, and Dongarra. "Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs". In: LNCS. 2015.

[Plot: speedup from iteratively updating ILU preconditioners in a Model Order Reduction (MOR) setting³, over the discretization granularity (1,357 to 1,265,537 elements per direction).]

³Anzt et al. "Updating incomplete factorization preconditioners for model order reduction". In: Numerical Algorithms (2016).

Algorithms reflecting hardware evolution

• Compute power (#FLOPs) grows much faster than bandwidth.

"Operations are free, memory access is what counts."

• The arithmetic operations should use high precision formats natively supported by hardware.
• Data access should be as cheap as possible: reduce the precision.
• Radically decouple the storage format from the arithmetic format:
  • double precision in all arithmetic operations;
  • algorithm-aware precision when accessing data.


Challenges when using double precision + IEEE single/half precision:
• Need explicit conversion.
• Data range reduction: must protect against under-/overflow.
• Need to duplicate data in memory (half/single/double).

• Better: the Modular Precision Format.

Modular Precision Format

• Split the IEEE double precision format into segments (2-segment modular precision, 4-segment modular precision, …).

[Bit layout: IEEE double = sign | exponent (11 bit) | mantissa (52 bit), bits 63…0. The 2-segment split: head (32 bit) = sign | exponent (11 bit) | mantissa (20 bit); tail (32 bit) = the remaining mantissa bits.]

• Special "conversion" routines to double precision.
• The head's mantissa (20 bit) is shorter than in IEEE single precision, but the full 11-bit exponent is kept.
• No under-/overflow.
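A minimal sketch of the 2-segment split (plain bit manipulation, assuming IEEE-754 binary64 and 32-bit segments; not the paper's implementation): all arithmetic still happens on double values in registers, only the memory traffic shrinks when the tails are skipped.

```cpp
#include <cstdint>
#include <cstring>

// Split a double into a 32-bit head (sign, 11-bit exponent, top 20 mantissa
// bits) and a 32-bit tail (the remaining 32 mantissa bits).
void split(double v, std::uint32_t& head, std::uint32_t& tail) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    head = static_cast<std::uint32_t>(bits >> 32);
    tail = static_cast<std::uint32_t>(bits);
}

// Full-precision reconstruction from head + tail.
double join(std::uint32_t head, std::uint32_t tail) {
    std::uint64_t bits = (static_cast<std::uint64_t>(head) << 32) | tail;
    double v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}

// Reduced-precision read: only the head is fetched; the mantissa is
// truncated, but the full exponent range is kept (no under-/overflow).
double head_only(std::uint32_t head) { return join(head, 0u); }
```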

• For efficient data access (coalesced reads), interleave the data in memory.

[Memory layout: all heads stored in one contiguous block, all tails in another.]

• Data can be accessed much faster if lower precision is acceptable.

[Plot: runtime of a streaming access over the dataset size (up to 1.4·10^8 elements) on an NVIDIA P100 "Pascal" (5.3 TFLOP/s DP, 16 GB RAM @ 720 GB/s), comparing IEEE double with 2-segment accesses at 64/32 bit and 4-segment accesses at 64/48/32/16 bit.]

• Split the IEEE double precision format into segments; interleave the data in memory for coalesced reads; data can be accessed much faster if lower precision is acceptable.
• This needs precision-aware algorithms, e.g., the adaptive precision Jacobi¹: use low precision data reads whenever the accuracy allows for it.

1Anzt, Dongarra, Quintana-Orti. ”Adaptive precision solvers for sparse linear systems”. In: Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing. (2015).
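A rough sketch of the adaptive read precision (my illustration; the switching criterion and bookkeeping of the actual solvers¹ are more refined): once a component stops changing beyond a tolerance, subsequent sweeps fetch only its head. It reuses split/join/head_only from the modular precision sketch above.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Declared in the modular precision sketch above.
void split(double v, std::uint32_t& head, std::uint32_t& tail);
double join(std::uint32_t head, std::uint32_t tail);
double head_only(std::uint32_t head);

// One Jacobi sweep with precision-adaptive reads; "converged" is a
// simplified stand-in for the switching criterion.
void adaptive_jacobi_sweep(const std::vector<std::vector<double>>& A,
                           const std::vector<double>& b,
                           std::vector<std::uint32_t>& heads,
                           std::vector<std::uint32_t>& tails,
                           std::vector<char>& converged, double tol) {
    const int n = static_cast<int>(b.size());
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            double xj = converged[j] ? head_only(heads[j])      // cheap read
                                     : join(heads[j], tails[j]);
            s -= A[i][j] * xj;
        }
        const double xi_new = s / A[i][i];  // arithmetic stays in double
        const double xi_old = join(heads[i], tails[i]);
        converged[i] = std::abs(xi_new - xi_old) < tol * std::abs(xi_new);
        split(xi_new, heads[i], tails[i]);
    }
}
```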

[Plots: read precision per vector component (10,000 components) over the Jacobi iterations for the 2-segment and 4-segment formats²; as components converge, more and more of them are read in reduced precision.]

²Grützmacher, Anzt. "A Modular Precision Format for decoupling Arithmetic Format and Storage Format". Submitted to the HeteroPar workshop 2018.

[Heatmap: speedup of the adaptive precision Jacobi in the 2-segment modular precision format over vanilla Jacobi, for relative residual stopping thresholds 1e-2…1e-12 and 5…257 elements per row; observed speedups range from about 0.56 to 1.31.]

Where is the link to domain scientists?

• The scientific community only benefits if new algorithms are available in production code.
• Disseminate the algorithms in numerical linear algebra libraries.
• We should accept the general trends and use cases in the community.
• Interface and documentation have to allow for easy integration into existing software ecosystems.
• For a sustainable software effort, we need a healthy software lifecycle!

https://bssw.io/ https://xsdk.info/

Sustainable Software

• Open source software, community effort.
• Open license (MIT / modified BSD).
• Public repository, e.g., git.
• Code review process for internal / external contributions.

• User-friendly installation with CMake / autoconf.
• Clearly listed dependencies.
• Documentation of all functionality and central design principles.
• Tutorials and code integration examples.

• Unit testing for all functionality.
• Reference implementation for all functionality ensuring correctness (not tuned for performance).
• Platform portability, continuous integration (CI).
• Dashboard reporting routine status.

Ginkgo
• Open source sparse linear algebra library.
• Generic algorithm implementations + architecture-specific, highly optimized kernels.
• Focused on GPU accelerators (i.e., NVIDIA GPUs); an OpenMP backend is in the works.
• Git repository; Doxygen, CMake.
• Modified BSD license (3-clause).
• Unit testing with googletest; CI system.

https://github.com/ginkgo-project/ginkgo

• Based on C++11, C bindings planned.
• Templated precision (ValueType, Integer); by default, compiles Z, C, D, S.
• Smart pointers to avoid memory leaks (unique/shared pointers).
• Runtime polymorphism.
• Kernels have the same signature for different architectures; the Executor determines which kernel is used, where the data lives, and where the operation is executed.
• LinOp class for any linear operator (matrices, solvers, preconditioners, …), with generate and apply.
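A minimal usage sketch along these lines, based on Ginkgo's public interface as it appears in the repository (the exact API of the 2018 development state may differ): the executor is the only place where the target architecture enters.

```cpp
#include <iostream>
#include <ginkgo/ginkgo.hpp>

int main() {
    // The executor determines where the data lives and which kernels run;
    // swapping in a CudaExecutor moves the whole solve to the GPU.
    auto exec = gko::ReferenceExecutor::create();

    // Read the system matrix and the vectors from matrix-market input.
    auto A = gko::read<gko::matrix::Csr<double>>(std::cin, exec);
    auto b = gko::read<gko::matrix::Dense<double>>(std::cin, exec);
    auto x = gko::read<gko::matrix::Dense<double>>(std::cin, exec);

    // LinOp pattern: a solver factory generates a solver from the matrix...
    auto solver =
        gko::solver::Cg<double>::build()
            .with_criteria(
                gko::stop::Iteration::build().with_max_iters(1000u).on(exec))
            .on(exec)
            ->generate(gko::share(A));
    // ... and the generated LinOp is applied: x = A^{-1} b.
    solver->apply(gko::lend(b), gko::lend(x));
}
```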


├── core                  "core" contains the algorithms, like ILU, CG, GMRES, etc.;
│   ├── base              anything that is not device-specific.
│   ├── matrix
│   ├── preconditioner
│   ├── solver
│   └── test              Unit tests ensure correctness.
│       ├── base
│       ├── matrix
│       ├── preconditioner
│       ├── solver
│       └── utils
├── reference             "reference" kernels ensure correctness.
│   ├── matrix
│   ├── preconditioner
│   ├── solver
│   └── test
│       ├── base
│       ├── matrix
│       ├── preconditioner
│       └── solver
└── gpu                   "gpu" contains the hardware-tuned numerical kernels.
    ├── matrix
    ├── preconditioner
    ├── solver
    └── test
        ├── matrix
        ├── preconditioner
        └── solver

Helmholtz-Young-Investigator-Group FiNE @ SCC

• Development of new algorithms targeting exascale hardware architectures:
  • asynchronous methods,
  • task-based algorithms,
  • mixed precision kernels.

• Implementation and optimization of the novel methods in numerical linear algebra packages.

• Bridging the gap to application scientists by dissemination of sustainable software. https://github.com/ginkgo-project/ginkgo

Pratik Nayak, Terry Cojean, Goran Flegar, Thomas Grützmacher, Tobias Ribizel.
Helmholtz Impuls und Vernetzungsfond VH-NG-1241.

Motivation

We are looking for a factorization-based preconditioner such that A ≈ L · U is a good approximation with moderate nonzero count (e.g., nnz(L + U) = nnz(A)).

• Where should these nonzero elements be located?
• How can we compute the preconditioner in a highly parallel fashion?


Rethink the overall strategy!

• Use a parallel iterative process to generate the factors.
• The preconditioner should have a moderate number of nonzero elements, but we don't care too much about intermediate data.

1. Select a set of nonzero locations.
2. Compute values in those locations such that A ≈ L · U is a "good" approximation.
3. Maybe change some locations in favor of locations that result in a better preconditioner.
4. Repeat until the preconditioner quality stagnates.

Considerations

• This is an optimization problem with nnz(A − L · U) equations and nnz(L + U) variables.

[Illustration: sparsity patterns of A, L, and U, and the ILU residual R = A − L · U; the residual is generally nonzero also outside the sparsity pattern S of L + U.]

• We may want to compute the values in L, U such that R|_S = (A − L · U)|_S = 0: the approximation is exact in the locations included in S, but not outside. This gives nnz(L + U) equations for nnz(L + U) variables.

• This is the underlying idea of Edmond Chow's parallel ILU algorithm¹: a fixed point of

$$
F(l_{ij},u_{ij}) =
\begin{cases}
\dfrac{1}{u_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} l_{ik}u_{kj}\right), & i > j,\\[4pt]
a_{ij} - \sum_{k=1}^{i-1} l_{ik}u_{kj}, & i \le j,
\end{cases}
$$

satisfies $L \cdot U|_{\mathcal{S}} = A|_{\mathcal{S}}$.

¹Chow and Patel. "Fine-grained Parallel Incomplete LU Factorization". In: SIAM J. on Sci. Comp. (2015).

• The iteration converges in the asymptotic sense towards incomplete factors L, U such that R|_S = (A − L · U)|_S = 0.

• We may not need high accuracy here, because we may change the pattern again…
• One single fixed-point sweep.

• Comparing sparsity patterns is extremely difficult; maybe use the ILU residual norm as a convergence check.

• The sparsity pattern of A might be a good initial start for the nonzero locations.

• Then the approximation will be exact for all locations S₀ = S(L₀ + U₀) and the residual nonzero in the locations S₁ = (S(A) ∪ S(L₀ · U₀)) \ S(L₀ + U₀)¹.

¹Saad. "Iterative Methods for Sparse Linear Systems, 2nd Edition". (2003).

• Adding all these locations (level fill!) might be a good idea, but adding them will again generate new nonzero residuals in S₂ = (S(A) ∪ S(L₁ · U₁)) \ S(L₁ + U₁).

• At some point we should remove some locations again, e.g., the smallest elements, and start over looking at the locations of R_k = A − L_k · U_k…

ParILUT

Interleaving fixed-point sweeps approximating values with pattern-changing symbolic routines.
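Schematically, one ParILUT iteration interleaves the blocks as follows (a control-flow sketch with placeholder functions and hypothetical names; the real building blocks are discussed next):

```cpp
#include <cstdio>

void candidate_search()  { std::puts("S* = (S(A) u S(L*U)) \\ S(L+U)"); }
void add_locations()     { std::puts("extend the pattern by the candidates"); }
void fixed_point_sweep() { std::puts("update l_ij, u_ij on the current pattern"); }
void select_and_remove() { std::puts("threshold-select, drop smallest entries"); }

int main() {
    for (int step = 0; step < 3; ++step) {  // until the quality stagnates
        candidate_search();    // symbolic: find new nonzero locations
        add_locations();       // symbolic: grow the sparsity pattern
        fixed_point_sweep();   // numeric: values incl. the new locations
        select_and_remove();   // symbolic: keep the nonzero budget
        fixed_point_sweep();   // numeric: re-approximate the pruned factors
    }
}
```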

Parallelism inside the building blocks.

Parallelism inside the blocks: Fixed-point sweeps

Fixed-point sweeps approximate the values in the ILU factors and the residual¹ via the element-parallel formula F(l_ij, u_ij) given above:
• Inherently parallel operation; the bilinear fixed-point iteration can be parallelized by elements.
• Elements can be updated asynchronously.
• We can expect 100% parallel efficiency as long as the number of cores < number of elements.
• The residual norm is a global reduction.

1Chow and Patel. “Fine-grained Parallel Incomplete LU Factorization”. In: SIAM J. on Sci. Comp. (2015).

Parallelism inside the blocks: Candidate search

Identify the locations that are symbolically nonzero:

$$\mathcal{S}^* = (\mathcal{S}(A) \cup \mathcal{S}(L \cdot U)) \setminus \mathcal{S}(L + U)$$

• A combination of a sparse matrix product and sparse matrix sums.
• The building blocks are available in Sparse BLAS.
• The blocks can be combined into one kernel for higher (memory) efficiency.
• The kernel can be parallelized by rows.
• The cost depends heavily on the sparsity pattern.
• Kernel performance is bound by the memory bandwidth.
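A row-parallel sketch of the candidate search (CSR structures assumed, std::set used for clarity; a tuned kernel would merge sorted index lists instead): row i of S(L·U) is the union of the U-rows selected by the L-entries of row i.

```cpp
#include <set>
#include <vector>

struct Csr {  // minimal CSR structure for the sketch
    int n;
    std::vector<int> row_ptr, col_idx;
};

// Candidate locations S* = (S(A) ∪ S(L·U)) \ S(L+U), computed row by row.
std::vector<std::set<int>> candidates(const Csr& A, const Csr& L, const Csr& U) {
    std::vector<std::set<int>> cand(A.n);
    #pragma omp parallel for
    for (int i = 0; i < A.n; ++i) {
        std::set<int> pattern;  // S(A) ∪ S(L·U), restricted to row i
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            pattern.insert(A.col_idx[k]);
        for (int k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
            const int r = L.col_idx[k];          // row of U hit by l_ir != 0
            for (int t = U.row_ptr[r]; t < U.row_ptr[r + 1]; ++t)
                pattern.insert(U.col_idx[t]);
        }
        // Subtract the locations already present in L + U.
        for (int k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k)
            pattern.erase(L.col_idx[k]);
        for (int k = U.row_ptr[i]; k < U.row_ptr[i + 1]; ++k)
            pattern.erase(U.col_idx[k]);
        cand[i] = std::move(pattern);
    }
    return cand;
}
```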

Parallelism inside the blocks: Selecting thresholds

A threshold separating the smallest elements is needed for removing insignificant locations and keeping sparsity.
• Standard approach: sort / selection algorithms.
  • High computational cost.
  • Memory-intensive.
  • Hard to parallelize.
• Thresholds do not need to be exact:
  • Inaccurate thresholds result in a few additional / fewer elements.
  • We can use sampling to get reasonable approximations.
  • Multiple sampling-based selection runs allow generating thresholds of reasonable quality in parallel.
• Is this appropriate for many-core architectures with 5K threads executing simultaneously?

Threshold selection on parallel architectures

Selection algorithms are traditionally based on re-arranging elements in memory:
• SelectionSort: O(n²) comparisons, O(n) element swaps.
• QuickSelect: O(n) comparisons on average, O(n) element swaps.
• Floyd-Rivest algorithm: n + min(k, n − k) + O(√n) comparisons.
• IntroSelect: O(n) comparisons and O(n) element swaps in the worst case.

• Compute power (#FLOPs) grows much faster than memory bandwidth, so data-rearranging selection algorithms become inefficient.

"Operations are free, memory access is what counts."

Threshold selection on parallel architectures: StreamSelect

Rethink the overall strategy!

• Primary goal: reduce the memory traffic.
• Account for high core counts, each core having a set of registers (oversubscribe with threads to hide latency).
• Assume a "nice" (~uniform) distribution of the values.
• Accept some inaccuracy in the generated threshold (approximation).

1. Find the largest and smallest elements to get the data range.
2. Generate a fine grid of thresholds and distribute them to the cores.
3. Stream all data one single time.
4. Each core handles a set of thresholds and counts how many elements are larger/smaller.
5. Select the threshold with the element count closest to the target value.

[Hardware: NVIDIA V100 "Volta", 7.8 TFLOP/s DP, 16 GB RAM @ 900 GB/s; 80 multiprocessors, each with 32 FP64 / 64 FP32 cores.]
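A sequential stand-in for the StreamSelect kernel (a sketch under the stated assumptions: non-empty input, roughly uniform magnitudes; on the GPU, each core keeps the counters for its few thresholds in registers):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Approximate the threshold below which k of the magnitudes fall, using a
// fixed grid of candidate thresholds and one single pass over the data.
double stream_select(const std::vector<double>& vals, std::size_t k,
                     int num_thresholds = 1024) {
    double lo = std::abs(vals[0]), hi = lo;   // 1. data range (one pass)
    for (double v : vals) {
        lo = std::min(lo, std::abs(v));
        hi = std::max(hi, std::abs(v));
    }
    const double step = (hi - lo) / num_thresholds;  // 2. threshold grid
    if (step == 0.0) return hi;
    std::vector<std::size_t> count(num_thresholds, 0);
    for (double v : vals) {                   // 3./4. stream and count
        int cell = std::min<int>(num_thresholds - 1,
                                 static_cast<int>((std::abs(v) - lo) / step));
        ++count[cell];
    }
    for (int t = 1; t < num_thresholds; ++t)  // prefix sum: elements below
        count[t] += count[t - 1];             //   threshold lo + (t+1)*step
    auto miss = [&](int t) {                  // distance to the target count
        return count[t] > k ? count[t] - k : k - count[t];
    };
    int best = 0;                             // 5. closest element count
    for (int t = 1; t < num_thresholds; ++t)
        if (miss(t) < miss(best)) best = t;
    return lo + (best + 1) * step;
}
```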

68 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Threshold selection on parallel architectures: StreamSelect

Rethink the overall strategy!

• Primary goal: reduce the memory traffic. 80 Multiprocessors, each with 64 FP32 cores • Account for high core counts, each having a set of registers. [oversubscribe with threads to hide latency] • Assume “nice” distribution of values ( ~uniform ). • Accept some inaccuracy in the generated threshold (approximation). Run 5120 threads (bind to cores) each a few thresholds:

1. Find the largest and smallest elements to get the data range. 2. Generate a fine grid of thresholds, distribute them to the cores. 40 3. Stream all data one single time. Linear StreamSelect 4. Each core handles a set of thresholds and counts how many 30 elements are larger/smaller. 5. Select the threshold with the element count closest to the 20 target value. Runtime 10

0 NVIDIA V100 “Volta” 0 5 10 15 20 25 30 35 7.8 TFLOP/s DP Thresholds per thread 16GB RAM @900 GB/s 32 thresholds gives mesh granularity of 6.1035e-06

69 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Threshold selection on parallel architectures: StreamSelect

Rethink the overall strategy! • Set size m, subset size n* • Runtime increases linear with set size. • Primary goal: reduce the memory traffic. On a V100 GPU: 80 Multiprocessors • For uniform distribution: quality ~ mesh granularity. • Runtime independent of subset size. • Account for high core counts, each having a set of registers. each with 32 FP64 cores 2 • Assume “nice” distribution of values ( ~uniform ).100 10 64 FP32 cores (n-n*) / m • Accept some inaccuracy in the generated threshold (approximation). [oversubscribe with threads to hide latency]

0 1. Find the largest and smallest elements to get the data range. 10Run 5120 threads (bind to cores) each a few thresholds: 2. 10Generate a -5 fine grid of thresholds, distribute them to the cores. 40

Runtime Linear 3.Deviation Stream all data one single time. -2 10 StreamSelect 4. Each core handles a set of thresholds and counts how many 30 elements are larger/smaller. -4 5.10-10Select the threshold with the element count closest to the 10 20 3 4 5 6 7 8 103 104 105 106 107 108 target value.10 10 10 10 10 10 Runtime Subset size n* 10 Set size m

0 NVIDIA V100 “Volta” 0 5 10 15 20 25 30 35 7.8 TFLOP/s DP Thresholds per thread 16GB RAM @900 GB/s 32 thresholds gives mesh granularity of 6.1035e-06

Is this a future-oriented algorithm?

Interleaving fixed-point sweeps approximating values with pattern-changing symbolic routines; parallelism inside the building blocks.

A bulk-synchronous algorithm!

[Diagram: dependencies between the building blocks of consecutive ParILUT iterations.]

• Strong dependency: we cannot start before the previous block has finished.
• Weak dependency: if we start earlier, we get a few nonzeros more or less.
• We can already start with the next iteration.

[Diagram: pipelined ParILUT iterations, with parts offloaded to a GPU.]

Excellent candidate for hybrid hardware? Asynchronous execution?

80 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing ParILUT – A New Parallel Threshold ILU

Next steps:

• Hybrid ParILUT version utilizing GPU and CPU, overlapping communication & computation.

• Asynchronous version relaxing dependencies.

• Use a different sparsity-pattern generator:
  • Randomized?
  • Machine learning techniques?

• Increasing fill-in towards “full” factorization.

• ParILUT routines available in MAGMA-sparse – they will be in Ginkgo!

This research is in cooperation with Edmond Chow (GaTech) and Jack Dongarra (University of Tennessee). Helmholtz Impuls- und Vernetzungsfonds VH-NG-1241.

81 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Scalability: Intel Xeon Phi 7250 “Knights Landing” (68 cores @ 1.40 GHz, 16 GB MCDRAM @ 490 GB/s), thermal2 matrix from SuiteSparse, RCM ordering, 8 el/row.

[Figure: left, speedup of the ParILUT building blocks (Candidates, Residuals, ILU-norm, Select, Add, Sweeps, Remove, and CSR/CSC transposition) vs. number of threads; right, runtime fraction of each building block vs. number of threads.]

• Building blocks scale with 15% - 100% parallel efficiency.
• Transposition and sort are the bottlenecks.
• Overall speedup ~35x when using 68 KNL cores.

82 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Scalability: Intel Xeon Phi 7250 “Knights Landing” (68 cores @ 1.40 GHz, 16 GB MCDRAM @ 490 GB/s), topopt120 matrix from topology optimization, 67 el/row.

[Figure: left, speedup of the ParILUT building blocks (Candidates, Residuals, ILU-norm, Select, Add, Sweeps, Remove, and CSR/CSC transposition) vs. number of threads; right, runtime fraction of each building block vs. number of threads.]

• Building blocks scale with 15% - 100% parallel efficiency.
• Dominated by candidate search.
• Overall speedup ~52x when using 68 KNL cores.

83 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Performance: Intel Xeon Phi 7250 “Knights Landing” (68 cores @ 1.40 GHz, 16 GB MCDRAM @ 490 GB/s). Runtime of 5 ParILUT / ParICT steps and speedup over SuperLU ILUT*.

Matrix       | Origin                        | Rows      | Nonzeros  | nnz/row | SuperLU  | ParILUT          | ParICT
ani7         | 2D Anisotropic Diffusion      | 203,841   | 1,407,811 | 6.91    | 10.48 s  | 0.45 s (23.34x)  | 0.30 s (35.16x)
apache2      | SuiteSparse Matrix Collection | 715,176   | 4,817,870 | 6.74    | 62.27 s  | 1.24 s (50.22x)  | 0.65 s (95.37x)
cage11       | SuiteSparse Matrix Collection | 39,082    | 559,722   | 14.32   | 60.89 s  | 0.54 s (112.56x) | --
jacobianMat9 | Fun3D Fluid Flow Problem      | 90,708    | 5,047,042 | 55.64   | 153.84 s | 7.26 s (21.19x)  | --
thermal2     | Thermal Problem (SuiteSparse) | 1,228,045 | 8,580,313 | 6.99    | 91.83 s  | 1.23 s (74.66x)  | 0.68 s (134.25x)
tmt_sym      | SuiteSparse Matrix Collection | 726,713   | 5,080,961 | 6.97    | 53.42 s  | 0.70 s (76.21x)  | 0.41 s (131.25x)
topopt120    | Geometry Optimization         | 132,300   | 8,802,544 | 66.53   | 44.22 s  | 14.40 s (3.07x)  | 8.24 s (5.37x)
torso2       | SuiteSparse Matrix Collection | 115,967   | 1,033,473 | 8.91    | 10.78 s  | 0.27 s (39.92x)  | --
venkat01     | SuiteSparse Matrix Collection | 62,424    | 1,717,792 | 27.52   | 8.53 s   | 0.74 s (11.54x)  | --

The ParILUT and ParICT columns show the runtime and, in parentheses, the speedup over SuperLU ILUT.

*We thank Sherry Li and Meiyue Shao for technical help in generating the performance numbers.

84 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing How about GPUs? NVIDIA V100 “Volta”: 7.8 TFLOP/s DP, 16 GB RAM @ 900 GB/s

• Fine-grained parallelism.
• High bandwidth for coalesced reads.
• No deep cache hierarchy.
• We need to oversubscribe cores to hide latency.

85 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing How about GPUs?

thermal2 matrix from SuiteSparse, RCM ordering, 8 el/row.

[Figure: runtime of the individual ParILUT building blocks (Residuals, Sort, Add, Transpose, Candidates, Transpose, Threshold, Remove, Sweep 1, Sweep 2) on an NVIDIA P100 GPU vs. a 20-core Intel Haswell CPU.]

86 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing How about GPUs?

topopt120 matrix from topology optimization, 67 el/row.

[Figure: runtime of the individual ParILUT building blocks (Residuals, Sort, Add, Transpose, Candidates, Transpose, Threshold, Remove, Sweep 1, Sweep 2) on an NVIDIA P100 GPU vs. a 20-core Intel Haswell CPU.]

87 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing ParILUT quality: anisotropic fluid flow problem (n = 741, nz = 4,951)

[Figure: left, CG iterations vs. number of ParICT steps (2 sweeps per step) for IC(0), ICT, and ParICT; right, pattern discrepancy between the ParICT pattern and the ILU(0) / ILUT patterns vs. number of ParICT steps.]

• Top-level solver iterations as quality metric.
• Pattern stagnates after a few sweeps.
• A few sweeps give a “better” preconditioner than ILU(0).
• Pattern is “more like” ILUT than ILU(0).
• Better than ILUT?

90 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Test matrices

91 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Convergence: GMRES iterations

92 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Convergence: CG iterations

93 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Preconditioning

We iteratively solve a linear system of the form Ax = b, where A ∈ R^{n×n} is nonsingular and b, x ∈ R^n.

The convergence rate typically depends on the conditioning of the linear system, which is the ratio between the largest and smallest eigenvalue:

cond_2(A) = λ_max / λ_min = λ_min^{-1} / λ_max^{-1} = cond_2(A^{-1})

With M ≈ A^{-1} we can transform the linear system into a system with a lower condition number:

MAx = Mb (left preconditioned)
AMy = b, x = My (right preconditioned)

If we now apply the iterative solver to the preconditioned system MAx = Mb, we usually get faster convergence.
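As a hedged illustration of this effect, the SciPy snippet below runs CG on a small SPD model problem once unpreconditioned and once with M built from an incomplete-LU factorization (SciPy's spilu; the incomplete factorization idea is detailed below). For an SPD system one would use an incomplete Cholesky factorization as in ParICT, so spilu merely stands in here; none of this is the talk's code.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 64
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
A = sp.kronsum(T, T).tocsc()                  # 2D Poisson matrix (SPD)
b = np.ones(A.shape[0])

ilu = spla.spilu(A, drop_tol=1e-3)            # incomplete factorization A ~ L*U
M = spla.LinearOperator(A.shape, matvec=ilu.solve)   # applies z = M r

for label, prec in [("CG, unpreconditioned", None), ("CG + ILU", M)]:
    iters = []
    x, info = spla.cg(A, b, M=prec, callback=lambda xk: iters.append(1))
    print(f"{label}: {len(iters)} iterations (info={info})")
```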

Assume M = A^{-1}; then x = MAx = Mb: the solution is directly available, but computing M = A^{-1} is expensive…

The preconditioned system MA is rarely formed explicitly; instead, M is applied implicitly: z_{k+1} = M r_{k+1}.

Instead of forming the preconditioner M ≈ A^{-1} explicitly and applying it as z_{k+1} = M r_{k+1}, incomplete factorization preconditioners (ILU) try to find an approximate factorization A ≈ L · U.

In the application phase, the preconditioner is only given implicitly, requiring two triangular solves:

z_{k+1} = M r_{k+1}  ⟺  M^{-1} z_{k+1} = r_{k+1}  ⟺  L U z_{k+1} = r_{k+1}  ⟹  L y = r_{k+1},  U z_{k+1} = y

98 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing
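A minimal sketch of the application phase above, assuming sparse triangular factors L (lower) and U (upper) in CSR format, e.g. from an incomplete factorization; the function name is illustrative:

```python
import scipy.sparse.linalg as spla

def apply_ilu_preconditioner(L, U, r):
    """Apply z = M r with M given only implicitly by A ~ L*U:
    two triangular solves instead of ever forming M."""
    y = spla.spsolve_triangular(L, r, lower=True)    # forward solve:  L y = r
    z = spla.spsolve_triangular(U, y, lower=False)   # backward solve: U z = y
    return z
```

Inside a Krylov solver, this routine is called once per iteration on the current residual r_{k+1}.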

If we now apply the iterative solver to the preconditioned system, we usually get faster convergence. MAx = Mb

97 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing Preconditioning

1 We iteratively solve a linear system of the form Ax = b Assume , then: .M = A x = MAx = Mb n n n Where nonsingular and .A R ⇥ b, x R Solution right available, but computing 2 2 1 M = A is expensive…

The convergence rate typically depends on the The preconditioned system is rarely formed explicitely,MA conditioning of the linear system, which is the ratio insted is applied implicitely:M zk+1 = Mrk+1 between the largest and smallest eigenvalue. 1 1 Instead of forming the preconditioner M A explicitly cond (A)= max = min = cond (A 1) ⇡ 2 1 2 and applying as , zk+1 = Mrk+1 min max Incomplete Factorization Preconditioner (ILU) try to 1 With we can M A transform the linear system into find an approximate factorization A L U. a system with a lower condition number:⇡ ⇡ · In the application phase, the preconditioner is only MAx = Mb (left preconditioned) given implicitly, requiring two triangular solves: AMy = b, x = My (right preconditioned) zk+1 = Mrk+1 1 M zk+1 = rk+1 If we now apply the iterative solver to the preconditioned LUz = r system, we usually get faster convergence. MAx = Mb k+1 k+1 =:y Ly = r ,Uz = y | {z)} k+1 k+1 98 05/07/2018 Hartwig Anzt: Numerical Linear Algebra for High Performance Computing Steinbuch Centre for Computing