Linear Algebra, Krylov-subspace methods, and multigrid solvers for the discovery of New Physics (LyNcs): WP 8 Project Presentations - 28 September 2020

Jacob Finkenrath

CaSToRC, The Cyprus Institute

1 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu PRACE 6IP WP8 LyNcs: People

▶ Constantia Alexandrou, CaSToRC – PI ▶ Emmanuel Agullo, Inria ▶ Simone Bacchio, CaSToRC ▶ Olivier Coulaud, Inria ▶ Jacob Finkenrath, CaSToRC ▶ Luc Giraud, Inria ▶ Kyriakos Hadjiyiannakou, CaSToRC ▶ Giannis Koutsou, CaSToRC ▶ Stéphane Lanteri, Inria ▶ Gilles Marait, Inria ▶ Michele Martone, LRZ ▶ Shuhei Yamamoto, CaSToRC

2 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Project Objectives

LyNcs: Linear Algebra, Krylov-subspace methods, and multigrid solvers for the discovery of New Physics Address the Exascale challenge for sparse linear systems & iterative solvers

▶ Targeting all level of software stack ● on lowest level: sparse Matrix vector kernels: librsb ● on intermediate level: iterative linear solver libraries: fabulous, DDalphaAMG, QUDA ● on highest level : community codes: LyNcs API, tmLQCD, openQCD

Applications in: ▶ Fundamental Physics - lattice QCD ▶ Electrodynamics ▶ Computational Chemistry

3 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Talk Outline

▶ Motivation

▶ LyNcs within 6IP Prace WP8

● Project structure

● Tasks and Deliverables

▶ Overview on software developments

● Lyncs API - New community-driven software

● Fabulous - Iterative Block Krylov Solvers library

● librsb - a Sparse BLAS format library

● DDalphaAMG - Multigrid library for lattice QCD

▶ Outlook

4 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu LyNcs vision: Towards Exascale Computing

▶ Objective 1: Performance and Efficiency

● focusing on efficient node level performance by using state of the art implementations and algorithms like Block Krylov solvers and multigrid approaches

▶ Objective 2: Scaling

● extend scalability and speeding up the time to solution beyond DD+strong scaling with alternative parallelization strategies

▶ Objective 3: Diversity

● Hardware and software: Software stack of community software is rich and optimized for various hardware systems, optimizations and algorithm developments are on going in various tasks like given by DOEs Exascale initiative

Focus of LyNcs is to be complementary/in a synergy to those efforts

▶ on user-friendly API to enable the rich software stack of solver and application codes

▶ on specific software optimization and developments used in the European Community

5 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Tasks & Targets:

Project identifies 4 tasks: Task 1: Sparse linear solvers suitable for Exascale Task 2: Prototyping new methods for Exascale Task 3: Portable APIs Task 4: Community code enabling

We target software implementation, matching the objectives of the tasks: Target 1: LyNcs API (Task 3,4,2) - New community-driven software - portable, flexible, modular and user-friendly

Target 2: Fabulous (Task 1,2) - Iterative Block Krylov Solvers library - study and implementation of new algorithms

Target 3: librsb (Task 1) - Recursive Sparse Blocks format library - improved portability, benchmarking and code quality

Target 4: DDalphaAMG (Task 1,2) - Multigrid library for lattice QCD - extension to block solver and linkage to Fabulous

6 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Deliverables & contributions

PRACE WP 8 – Deliverables Contribution by LyNcs

▶ D8.2 (MS 6) Interim progress report: ▶ Finalized set of tasks and milestones, PM distribution, and staff and established project structure team, including any new staff.

▶ D8.3 (MS 12) Interim progress report: ▶ First version of APIs from Task 3. Progress of library Public Prototype software release and optimization (Task 1) and evaluation of prototypes new development infrastructure algorithms (Task 2)

▶ D8.4 (MS 24) Interim progress report: ▶ First release of software, including documentation, Public Software release (docs, testing, benchmark results and issue tracking statistics (Tasks 1, issue tracker) and integration in external and 3). Report on evaluation of new methods of Task 2, and codes candidates selected for implementation within Task 1. Progress of integration into community codes (Task 4).

▶ D8.5 (MS 30) Final report: including ▶ Performance results on community software linked to APIs performance results on (pre)Exascale (Tasks 1, 2, 3, and 4). Report on results presented in systems conferences. Publication of results of Task 2 (as peer-review publication or PRACE white paper)

7 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Deliverables & contributions

PRACE WP 8 – Deliverables Contribution by LyNcs

▶ D8.2 (MS 6) Interim progress report: ▶ Finalized set of tasks and milestones, PM distribution, and staff and established project structure team, including any new staff.

▶ D8.3 (MS 12) Interim progress report: ▶ First version of APIs from Task 3. Progress of library Public Prototype software release and optimization (Task 1) and evaluation of prototypes new development infrastructure algorithms (Task 2)

▶ D8.4 (MS 24) Interim progress report: ▶ First release of software, including documentation, Public Software release (docs, testing, benchmark results and issue tracking statistics (Tasks 1, issue tracker) and integration in external and 3). Report on evaluation of new methods of Task 2, and codes candidates selected for implementation within Task 1. Progress of integration into community codes (Task 4).

▶ D8.5 (MS 30) Final report: including ▶ Performance results on community software linked to APIs performance results on (pre)Exascale (Tasks 1, 2, 3, and 4). Report on results presented in systems conferences. Publication of results of Task 2 (as peer-review publication or PRACE white paper)

8 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Project Synergy

DDalphaAMG

Integration of major software for Lattice QCD simulations Advance Krylov block solvers for multigrid methods Lyncs API Implementation of high-quality community-software standards Fabulous librsb Performance improvements of low level kernels

9 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Lyncs API - Overview Lyncs-API A Python API for LQCD applications Flexible, portable, extendible Task and data parallelism ▶ Lyncs-API covers the entire stack, from low-level dask.distributed, mpi4py libraries to fully-fledged community codes Distributed data Python bindings dask.Array cppyy

Main features: LQCD libraries Data management DDalphaAMG, QUDA, numpy, cupy, h5py tmLQCD, OpenQCD... ➔ Flexible framework for easy-integration of external libraries /C++ libraries Python modules ➔ Parallel tasking and modular computing Developed / wrapped / contributed to

➔ Automatic benchmarking and cross Contribution to Objectives checking of different implementations ▶ O2: alternative parallelization methods will speed-up real ➔ High-level user-friendly Python API time to solution of Markov-Chains beyond strong scaling regime ➔ Low-level integration of C/C++ libraries ▶ O3: flexible to diversity in hardware by utilization of community software stack, e.g. QUDA for GPU systems

10 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Lyncs API - Targeted libraries

▶ Lyncs-API provides standalone Python In the Lyncs-API we aim to target as many LQCD interfaces to several LQCD libraries: libraries as possible, such that we can DDalphaAMG, clime, tmLQCD, QUDA, openQCD, QphiX and others to come... ▶ Optimally support various architecture with ▶ Source-codes are downloaded, patched and architecture-specific libraries compiled using git & CMake ▶ Crosscheck and benchmark various ▶ Python bindings are automatically generated implementations via cppyy ▶ Involve all the community and be open to contributions

▶ Cover most of the research interests cppyy

▶ Give a second life to legacy codes Automatic C/C++ bindings >>> cppyy.include(‘header.h’) >>> cppyy.load_library(‘lib.so’) For doing this we have created several low- and ➢ Direct access to library functions high-level interfacing tools... ➢ Automatic casting of Python types ➢ Support of templates and overloading

11 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Lyncs API - Python interface

▶ Top level: Lyncs, an API that offers a high-level common framework for LQCD Natively scales Python ▶ Task scheduling, parallelism and data distribution is managed by Dask with ➢ Management of distributed data and tasks improved support for MPI applications ➢ Task scheduling and parallelism via futures via Lyncs-mpi ➢ Distributed numpy arrays

▶ Available implementations of the same function are benchmarked, cross-checked and selected via tuneit. tuneit ▶ All the low-level interfaces to external Tune, benchmark and crosscheck libraries can be used independently of Lyncs and the Lyncs-API is independent from them ➢ Generic-purpose package developed relying on alternative implementations beside Lyncs as a necessity ➢ Function implementations/parameters >> pip install lyncs_DDalphaAMG are used for tuning to best performance >> pip install lyncs[DDalphaAMG] ➢ Benchmarking and testing tools provided

12 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Lyncs API - Development status

Available on github.com/Lyncs-API

▶ Currently in alpha-state

▶ Open-source, BSD-3-Clause license

▶ Modular structure of many packages: ○ divided purpose-wise, e.g. interfaces, common tools, target architecture, etc.. ○ with independent testing, version control and installation Contribution to Task 2, 3 and 4 ▶ High-quality community-software standards ▶ D8.4: First release of software, including ○ Well documented documentation, benchmark results and issue tracking statistics. Report on evaluation of new ○ CI/CD via Github actions methods of Task 2, Progress of integration into ○ Code formatting and linting community codes (Task 4). ▶ D8.5: Performance results on community software ▶ Available via Pipy (and conda in a future) linked to APIs

13 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Fabulous - Block Krylov solver library Block Minimum Norm Residual Techniques

▶ Solve:

▶ Search space: ● Look for approx. in nested space

Main Difficulties ▶ Individual convergence rate for each rhs ● rank deficiency in search space (inexact breakdown) ● keep most relevant direction (backward error) Contribution to Objectives ▶ O1: Algorithm development to utilize node ▶ Storage of the search space basis level performance ● restart by keeping infos from search space ▶ O2: Algorithm development to extend (spectral information recycling) strong scaling regime

14 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Fabulous - Block Krylov solver library PRELIMINARY Generalized Conjugate Residual (GCR) with Optimal Truncation

▶ GCR mathematical equivalent to GMRES ▶ requires an extra memory of 5-20 vectors ▶ “straightforward” extension to block case ▶ support of flexible preconditioners ▶ cope with memory constraints both restart and truncation policies are possible

Next Steps within LyNcs: ▶ Recycling spectral information between block solutions in Fabulous (IB-BGCRO-DR) ● additional flexible variants (IB-FBGCRO-DR) Contribution to Task 1 and 2 ● utilization of Block Krylov solver in multigrid ▶ D8.4: Report on evaluation of new methods approaches of Task 2, and candidates selected for ▶ More robust orthonormalization schemes implementation within Task 1 ● Householder block GMRES family - ongoing ▶ D8.5: Performance results on community

software linked to APIs,

15 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Librsb - Sparse BLAS Librsb as a stable Sparse BLAS implementation for use in LyNcs

▶ Implements Sparse BLAS API in C/C++/ ▶ http://librsb.sourceforge.net/ ▶ Wrappers for Python, Julia, GNU Octave ▶ C99 + OpenMP, LGPLv3-licensed

▶ 1.2.0.9 (stable branch, released in Aug. 2020) ● API is 41 functions and one struct ● readily in Debian, Ubuntu, SPACK, GUIX-HPC ● GNU Octave sparsersb plugin

▶ master ● stable API: one extra function, compatible specs changes ● experimental features: ● 64-bit indices for huge matrices ● new auto-tuning techniques, sorting, kernels Contribution to Objectives

▶ next developments ▶ O1: Kernel optimization to increase node ● performance experiments via CI/CD level performance ● improving block SpMV and autotuning techniques ▶ O3: Address coverage on diverse hardware with extensive test suite

16 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu

c Librsb - Testing and Coverage

▶ coverage ● ~ 50% reduction in (M4-generated) code lines to 150 KLOC ● reached near-total test coverage ● 99.8% 9165/9179 of functions ● 92.1% 222315/241493 of lines

▶ Testing as part of librsb infrastructure ● continuous integration (CI) on LRZ GitLab ● relied on GitHub+Travis for PyRSB

▶ Coverage rate strictly related to overall testing ● extensive bug mining: exploit parameter ranges, compiler flags, tools, use different high level approaches, random number, artificial errors ● test automation via gitlab, trigger builds and tests after CI ● extensive testing stack in librsb internal and external

▶ Bugs were related to ● extreme parameter ranges ● combination of non-standard sort options, limited-size matrices, unusual transposition, repeated autotuning (fixed bugs documented in NEWS file)

17 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu

c Librsb - Next Steps

Next steps within LyNcs: ▶ bugs regression (CI) ▶ integration within partners codes and inputs ▶ Guix-based PLAFRIM platform ▶ track performance

▶ Towards CI/CD backed performance and integration experiments ● On Inria’s Guix-powered GITLAB backed CI/CD ● LRZ test hardware systems

▶ develop improved multi-RHS SpMV kernels Contribution to Task 1 and 3

▶ D8.4: towards librsb 1.3.0, including documentation,

benchmark results and issue tracking statistics

▶ D8.5: Performance results of librsb on community

sparse matrix kernels linked to APIs

18 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu

c DDalphaAMG multiple RHS

▶ FGMRES with coarse grid correction and SAP ▶ Implements fermion actions commonly used by European community ▶ C-code with hybrid parallelization with MPI + openMP ▶ LyNcs dev-branch under : https://github.com/sy3394/DDalphaAMG

▶ Multigrid solvers in Lattice QCD ● ~ 10⨉ to 100⨉ faster convergence compared to e.g. CG ● Strong-scaling: limited due to >50% time spent in coarse grid ● Implemented in DDalphaAMG (CPUs) & QUDA (Nvidia GPUs)

Contribution to Objectives

▶ O2: Algorithm development to to extend strong scaling regime

▶ O3: Optimization of computing kernel to add flexibility for target architecture

19 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu DDalphaAMG multiple RHS Objectives for multiple rhs ▶ Increase Efficiency : Decrease bandwidth to computing ratio ▶ Enhance Flexibility : Utilization of compiler optimization ▶ Improve Convergence : Using Block-Krylov solvers from Fabulous

Implementation: ▶ Definition of a new data structure, index on vectors runs the fastest ● re-writing of all low level kernels

▶ using auto-vectorization to enhance flexibility ● # pragma unroll # pragma vector aligned # pragma ivdep ● on Skylakes chips this shifts vectorization from 128 bit to 256 bit

20 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu DDalphaAMG multiple RHS

Ongoing: ▶ Enable Fabulous ● Fabulous solvers available on all levels of DDalphaAMG ● currently tuning parameters and testing different Block Krylov solvers ● preliminary results: See a speed up for rhs>4

Next Steps: ▶ Enhance scaling of the coarse grid ● enabling pipeline versions of GMRES

Contribution to Task 1 and 2 ▶ Utilize Fabulous effort on re-usable deflation ▶ D8.4: Contribution via benchmark results (Tasks 1). space, potential speed-up of coarse grid Report on evaluation of new methods of Task 2 iterations ▶ D8.5: Publication of results (conference or ▶ Performance on ARM and AMD chips peer-review publication) ▶ Documentation and testing

21 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu Next Steps and Adjustments

▶ Dissemination ● Task 1: Presented results on DDalphaAMG-multi rhs at major community conference and gave an invited talk to RIKEN/Kobe ● Task 3-4: Outreach to the community scheduled to D8.4, i.e. if core-features of API is well tested and documented ● interest of the community is there: Nvidia developers/community scientist showed interest, ETMC is evaluating LyNcs-API

▶ possible Adjustments to LyNcs ● Impact of EuroHPC Pre-exascale machines not clear yet, possible adjustment in priority to enable potential of machines (Task 1 and 3) ● might include to enable and run ROCm/HIP supported application if an European machine is available (?)

▶ Next in WP 8 ● On a good way towards deliverable D8.4 and D8.5 ● … but still many rocks to climb to have a sustained API for the New Era of European Exa-Computing

22 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu THANK YOU FOR YOUR ATTENTION

www.prace-ri.eu

23 PRACE 6IP – WP8 – LyNcs – CaSToRC www.prace-ri.eu