Heterogeneous architectures, Hybrid methods, Hierarchical matrices for Sparse Linear Solvers
Pierre Ramet
To cite this version:
Pierre Ramet. Heterogeneous architectures, Hybrid methods, Hierarchical matrices for Sparse Linear Solvers. Distributed, Parallel, and Cluster Computing [cs.DC]. Université de Bordeaux, 2017. tel-01668740v2
HAL Id: tel-01668740 https://hal.inria.fr/tel-01668740v2 Submitted on 6 Jan 2018
UNIVERSITÉ DE BORDEAUX
LABORATOIRE BORDELAIS DE RECHERCHE EN INFORMATIQUE
HABILITATION À DIRIGER DES RECHERCHES
AU TITRE DE L’ÉCOLE DOCTORALE DE MATHÉMATIQUES ET D’INFORMATIQUE
Par Pierre RAMET
Heterogeneous architectures, Hybrid methods, Hierarchical matrices for Sparse Linear Solvers
Defended publicly on: 27 November 2017
Reviewers:
Frédéric DESPREZ, Deputy Scientific Director, Inria
Iain DUFF, Senior Scientist, Rutherford Appleton Laboratory
Yousef SAAD, Professor, University of Minnesota
Examination committee:
Iain DUFF, Senior Scientist, Rutherford Appleton Laboratory (Examiner)
Raymond NAMYST, Professor, Bordeaux University (Examiner)
Esmond NG, Senior Scientist, Lawrence Berkeley National Laboratory (Examiner)
Yves ROBERT, Professor, ENS-Lyon (Examiner)
Yousef SAAD, Professor, University of Minnesota (Examiner)
Isabelle TERRASSE, Research Director, Airbus Group (Examiner)
Sivan TOLEDO, Professor, Tel-Aviv University (Examiner)
2017

Contents
Introduction

1 Static scheduling and data distributions
  1.1 Context and algorithmic choices
  1.2 Background
  1.3 Static scheduling
    1.3.1 Preprocessing steps
    1.3.2 Parallel scheme of the factorization
    1.3.3 Hybrid MPI+Thread implementation
    1.3.4 Parallel factorization algorithm
    1.3.5 Matrix mapping and task scheduling
  1.4 Reducing the memory footprint using partial aggregation techniques

2 Dynamic scheduling and runtime systems
  2.1 A dedicated approach
    2.1.1 NUMA architectures
    2.1.2 NUMA-aware allocation
    2.1.3 Communication overlap
    2.1.4 Dynamic scheduling
    2.1.5 Numerical experiments
  2.2 A generic approach
    2.2.1 Related work
    2.2.2 Runtimes
    2.2.3 Supernodal factorization over DAG schedulers
    2.2.4 Heterogeneous experiments

3 Incomplete factorization and domain decomposition
  3.1 Amalgamation algorithm for iLU(k) factorization
    3.1.1 Methodology
    3.1.2 Amalgamation algorithm
    3.1.3 Numerical experiments
  3.2 Graph partitioning for well balanced domain decomposition
    3.2.1 Multilevel Framework
    3.2.2 Graph Partitioning Algorithms
    3.2.3 Experimental Results

4 Towards H-matrices in sparse direct solvers
  4.1 Reordering strategy for blocking optimization
    4.1.1 Intra-node reordering
    4.1.2 Related work
    4.1.3 Improving the blocking size
  4.2 On the use of low rank approximations in PaStiX
    4.2.1 Block Low-Rank solver
    4.2.2 Low-rank kernels
    4.2.3 Numerical experiments
    4.2.4 Discussion

Conclusion

A PaStiX 6.0 updated performance
  A.1 Experimental conditions
  A.2 Performance on the Kepler architecture
  A.3 Performance on the KNL architecture

B Bibliography
Introduction
Over the last few decades, there have been innumerable science, engineering and societal breakthroughs enabled by the development of high performance computing (HPC) applications, algorithms and architectures. These powerful tools have provided researchers with the ability to computationally find efficient solutions for some of the most challenging scientific questions and problems in medicine and biology, climatology, nanotechnology, energy and environment, to name a few. It is widely accepted today that numerical simulation, including data-intensive processing, is a major pillar of scientific discovery, on the same level as theory and experimentation. Numerous reports and papers have also confirmed that extreme scale simulations will open new opportunities not only for research but also for a large spectrum of industrial and societal sectors.

An important force that has continued to drive HPC has been the focus on frontier milestones, technical goals that symbolize the next stage of progress in the field. In the 1990s, the HPC community sought to achieve computing at a teraflop rate, and currently we are able to compute on the first leading architectures at more than ten petaflops. General purpose petaflop supercomputers are available and exaflop computers are foreseen in early 2020. For application codes to sustain petaflops and more in the next few years, hundreds of thousands of processor cores or more are needed, regardless of processor technology. Currently, only a few HPC simulation codes easily scale to this regime, and major algorithm and code development efforts are critical to realize the potential of these new systems. Scaling to a petaflop and beyond involves improving physical models and mathematical modeling, and designing highly scalable algorithms.

In this context, the purpose of my research is to contribute to the efficient execution of frontier simulations arising from challenging academic and industrial research. It involves massively parallel computing and the design of highly scalable algorithms and codes to be executed on emerging hierarchical many-core, possibly heterogeneous, platforms. Through this approach, I contribute to all steps, from the design of new high-performance, more scalable, robust and more accurate numerical schemes to the optimized implementation of the associated algorithms and codes on very high performance supercomputers. This research is conducted in close collaboration with European and US initiatives. Thanks to three associated teams, namely PhyLeaS (http://www-sop.inria.fr/nachos/phyleas/), MORSE (http://icl.cs.utk.edu/morse/) and FASTLA (https://www.inria.fr/en/associate-team/fastla), I have contributed with world leading groups to the design of fast numerical solvers and their parallel implementation.
For the solution of large sparse linear systems, we design numerical schemes and software packages for direct and hybrid parallel solvers. Sparse direct solvers are mandatory when the linear system is very ill-conditioned; such a situation is often encountered in structural mechanics codes, for example. Therefore, to obtain an industrial software tool that must be robust and versatile, high-performance sparse direct solvers are mandatory, and parallelism is then necessary for reasons of memory capability and acceptable solution time. Moreover, in order to efficiently solve 3D problems with more than 50 million unknowns, which is now a reachable challenge with new multicore supercomputers, we must achieve good scalability in time and control the memory overhead. Solving a sparse linear system by a direct method is generally a highly irregular problem that raises challenging algorithmic issues and requires a sophisticated implementation scheme in order to fully exploit the capabilities of modern supercomputers. It would be a daunting task to provide here a complete survey on direct methods for sparse linear systems and the corresponding software. Jack Dongarra has maintained for many years a list of
freely available software packages for linear algebra (http://www.netlib.org/utk/people/JackDongarra/la-sw.html), and a recent survey by Tim Davis et al. [44] lists available software in its last section.

New supercomputers incorporate many microprocessors, each composed of one or many computational cores. These new architectures induce strongly hierarchical topologies, called NUMA architectures. In the context of distributed NUMA architectures, we study optimization strategies to improve the scheduling of communications, threads and I/O. In the 2000s, our first attempts to use available generic runtime systems, such as PM2 [15] or Athapascan [25], failed to reach acceptable performance. We then developed dedicated dynamic scheduling algorithms designed for NUMA architectures in the PaStiX solver. The data structures of the solver, as well as the communication patterns, have been modified to meet the needs of these architectures and of dynamic scheduling. We are also interested in the dynamic adaptation of the computational granularity to efficiently use multi-core architectures and shared memory. Several numerical test cases have been used to prove the efficiency of the approach on different architectures.

The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. We have studied the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with generic runtime systems. The task graph of the factorization step is made available to the runtime systems, allowing them to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. The aim was to design algorithms and parallel programming models for implementing direct methods for the solution of sparse linear systems on emerging computers equipped with GPU accelerators. More generally, this work was performed in the context of the associated team MORSE and the ANR SOLHAR project, which aims at designing high performance sparse direct solvers for modern heterogeneous systems. The main competitors in this research area are the QR-MUMPS and SuiteSparse packages, mainly the CHOLMOD (for sparse Cholesky factorizations) and SPQR (for sparse QR factorizations) solvers, which achieve a rather good speedup using several GPU accelerators but only within one node (shared memory). In collaboration with the ICL team from the University of Tennessee, a comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, has been performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer.

More recently, many works have addressed heterogeneous architectures to exploit accelerators such as GPUs or Intel Xeon Phi with interesting speedups. Despite research towards generic solutions to efficiently exploit those accelerators, their hardware evolution requires continual adaptation of the kernels running on those architectures.
The recent Nvidia architectures, such as Kepler, present a larger number of parallel units, thus requiring more data to feed every computational unit. One solution for supplying enough computation, studied for problems with a large number of small computations, is to batch them. The batched BLAS libraries proposed by Intel, Nvidia, or the University of Tennessee are examples of this solution. We have investigated the use of variable size batched matrix-matrix multiplies to improve the performance of the PaStiX sparse direct solver. Indeed, this kernel suits the supernodal method of the solver and the multiple updates of variable sizes that occur during the numerical factorization.
In addition to the main activities on direct solvers, we also studied some robust preconditioning algorithms for iterative methods. The goal of these studies is to overcome the huge memory consumption inherent to direct solvers in order to solve 3D problems of huge size. Our study focused on building generic parallel preconditioners based on incomplete LU factorizations. The classical incomplete LU preconditioners use scalar algorithms that do not exploit CPU power well and are difficult to parallelize. Our work aimed at finding new orderings and partitionings that lead to a dense block structure of the incomplete factors. Then, based on the block pattern, efficient parallel blockwise algorithms can be devised to build robust preconditioners that are also able to fully exploit the capabilities of modern high-performance computers.

The first approach was to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such an incomplete factorization can take advantage of the latest breakthroughs
in sparse direct methods and, in particular, should be very competitive in CPU time (effective use of processor power and good scalability) while avoiding the memory limitation encountered by direct methods. In this way, we expected to be able to solve systems on the order of a hundred million unknowns and even one billion unknowns. Another goal was to analyze and justify the chosen parameters that can be used to define the block sparse pattern in our incomplete factorization. The driving rationale for this study is that it is easier to incorporate incomplete factorization methods into direct solution software than it is to develop new incomplete factorizations. Our main goal at this point was to achieve a significant reduction in the memory needed to store the incomplete factors (with respect to the complete factors) while keeping enough fill-in to make the use of BLAS3 (in the factorization) and BLAS2 (in the triangular solves) primitives profitable. In this approach, we focused on the critical problem of finding approximate supernodes of ILU(k) factorizations. The problem was to find a coarser block structure of the incomplete factors. The "exact" supernodes that are exhibited from the incomplete factor nonzero pattern are usually very small, and thus the resulting dense blocks are not large enough for an efficient use of the BLAS3 routines. A remedy to this problem was to merge supernodes that have nearly the same structure. These algorithms have been implemented in the PaStiX library.

A second approach is the use of hybrid methods that hierarchically combine direct and iterative methods. These techniques inherit the advantages of each approach, namely the limited amount of memory and natural parallelization of the iterative component and the numerical robustness of the direct part. The general underlying ideas are not new, since they have been intensively used to design domain decomposition techniques; these approaches cover a fairly large range of computing techniques for the numerical solution of partial differential equations (PDEs) in time and space. Generally speaking, it refers to the splitting of the computational domain into subdomains with or without overlap. The splitting strategy is generally governed by various constraints and objectives, but the main one is to express parallelism. The numerical properties of the PDEs to be solved are usually intensively exploited at the continuous or discrete level to design the numerical algorithms, so that the resulting specialized technique will only work for the class of linear systems associated with the targeted PDE. Under this framework, Pascal Hénon and Jérémie Gaidamour [50] have developed, in collaboration with Yousef Saad from the University of Minnesota, algorithms that generalize the notion of "faces" and "edges" of the "wire-basket" decomposition. The interface decomposition algorithm is based on defining a "hierarchical interface structure" (HID). This decomposition consists in partitioning the set of unknowns of the interface into components called connectors that are grouped in "classes" of independent connectors [78]. This has led to software developments on the design of algebraic non-overlapping domain decomposition techniques that rely on the solution of a Schur complement system defined on the interface introduced by the partitioning of the adjacency graph of the sparse matrix associated with the linear system.
Different hierarchical preconditioners can be considered, possibly multilevel, to improve the numerical behavior of the approaches implemented in HIPS [51] and MaPHYS [2]. For both software libraries, the principle is to build a decomposition of the adjacency graph of the system into a set of small subdomains. This decomposition is built from the nested dissection separator tree obtained using the sparse matrix reordering software Scotch. Thus, at a certain level of the separator tree, the subtrees are considered as the interiors of the subdomains, and the union of the separators in the upper part of the elimination tree constitutes the interface between the subdomains. The interiors of these subdomains are treated by a direct method, such as PaStiX or MUMPS. Solving the whole system is then equivalent to solving the Schur complement system on the interface between the subdomains, which has a much smaller dimension. The PDSLIN package, developed at Lawrence Berkeley National Laboratory, is another parallel software package for algebraic methods based on Schur complement techniques. Other related approaches, based on overlapping domain decomposition (Schwarz type) ideas, are more widely developed and implemented in software packages at Argonne National Laboratory (software contribution to PETSc, where Restricted Additive Schwarz is the default preconditioner) and at Sandia National Laboratories (ShyLU in the Trilinos package).
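To make the interior/interface splitting concrete, the following sketch (a minimal illustration with dense NumPy blocks, not the HIPS or MaPHYS implementation; the function name is hypothetical) eliminates the interior unknowns with a direct solve and then solves the reduced Schur complement system on the interface:

    import numpy as np

    def schur_interface_solve(A_II, A_IG, A_GI, A_GG, b_I, b_G):
        """Illustrative non-overlapping domain decomposition solve.

        Unknowns are split into interior (I) and interface (Gamma, G) sets:
            [A_II  A_IG] [x_I]   [b_I]
            [A_GI  A_GG] [x_G] = [b_G]
        """
        # Eliminate the interior unknowns with a direct solve; in the hybrid
        # solvers discussed here this step is performed by a sparse direct
        # solver (e.g. PaStiX) on each subdomain interior.
        Z = np.linalg.solve(A_II, A_IG)      # A_II^{-1} A_IG
        y = np.linalg.solve(A_II, b_I)       # A_II^{-1} b_I
        S = A_GG - A_GI @ Z                  # Schur complement on the interface
        g = b_G - A_GI @ y                   # condensed right-hand side
        x_G = np.linalg.solve(S, g)          # interface system (in practice solved
                                             # by a preconditioned iterative method)
        x_I = y - Z @ x_G                    # recover the interior unknowns
        return x_I, x_G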
Nested dissection is a well-known heuristic for sparse matrix ordering that both reduces the fill-in during numerical factorization and maximizes the number of independent computational tasks. By using the block data structure induced by the partition of separators of the original graph, very efficient parallel block solvers have been designed and implemented according to supernodal or multifrontal approaches. Considering hybrid methods mixing both direct and iterative solvers, such as HIPS or MaPHYS, obtaining a domain decomposition leading to a good balance of both the size of the domain interiors and the
size of the interfaces is a key point for load balancing and efficiency in a parallel context. We have revisited some well-known graph partitioning techniques in the light of hybrid solvers and have designed new algorithms to be tested in the Scotch package. These algorithms have been integrated in Scotch as a prototype.

On the other hand, the preprocessing steps of sparse direct solvers, ordering and block-symbolic factorization, are two major steps that lead to a reduced amount of computation and memory, and to a better task granularity to reach a good level of performance when using BLAS kernels. With the advent of GPUs, the granularity of the block computations has become more important than ever. In this work, we present a reordering strategy that increases this block granularity. This strategy relies on the block-symbolic factorization to refine the ordering produced by tools such as MeTiS or Scotch, but it has no impact on the number of operations required to solve the problem. We integrated this algorithm in the PaStiX solver and show an important reduction in the number of off-diagonal blocks on a large spectrum of matrices. Furthermore, we propose a parallel implementation of our reordering strategy, leading to a computational cost that is very low with respect to the numerical factorization and that is counterbalanced by the improvement of the factorization time. In addition, if multiple factorizations are applied on the same structure, the additional factorization and solve steps benefit at no extra cost. We proved that such a preprocessing stage is cheap in the context of 3D graphs of bounded degree, and showed that it works well for a large set of matrices. We compared this with the HSL reordering [134, 80], which targets the same objective of reducing the overall number of off-diagonal blocks. While our TSP (Travelling Salesman Problem) heuristic is often more expensive, the quality is always improved, leading to better performance. In the context of multiple factorizations, or when using GPUs, the TSP overhead is recovered by the performance improvement, while it may be better to use HSL in the other cases.
The use of low-rank approximations in sparse direct methods is ongoing work. When applying Gaussian elimination, the induced Schur complement appears to exhibit a low-rank property in several physical applications, especially those arising from elliptic partial differential equations. This property on the low rank of submatrices may then be exploited: each supernode or front may be expressed in a low-rank approximation representation. Finally, solvers that exploit this low-rank property may be used either as accurate direct solvers or as powerful preconditioners for iterative methods, depending on how much information is kept.

In the context of the FASTLA associated team, in collaboration with E. Darve from Stanford University, we have been working on applying fast direct solvers for dense matrices to the solution of sparse systems with a supernodal approach. We observed that the extend-add operation (during the sparse factorization) is the most time-consuming step. We have therefore developed a series of algorithms to reduce this computational cost. We have presented two approaches using a Block Low-Rank (BLR) compression technique to reduce the memory footprint and/or the time-to-solution of the sparse supernodal solver PaStiX. This flat, non-hierarchical compression method allows us to take advantage of the low-rank property of the blocks appearing during the factorization of sparse linear systems, which come from the discretization of partial differential equations. The first approach, called Minimal Memory, illustrates the maximum memory gain that can be obtained with the BLR compression method, while the second approach, called Just-In-Time, mainly focuses on reducing the computational complexity and thus the time-to-solution. We compare Singular Value Decomposition (SVD) and Rank-Revealing QR (RRQR), as compression kernels, in terms of factorization time, memory consumption, as well as numerical properties (a minimal sketch of such a compression kernel is given after the list of related codes below). To improve the efficiency of our sparse update kernel for both BLR (block low-rank) and HODLR (hierarchically off-diagonal low-rank) formats, we are now investigating a BDLR (boundary distance low-rank) approximation scheme to preselect rows and columns in the low-rank approximation algorithm. We also have to improve our ordering strategies to enhance data locality and compressibility. The implementation is based on runtime systems to exploit parallelism.

Regarding our activities around the use of hierarchical matrices for sparse direct solvers, this topic is also widely investigated by our competitors. Some related work is described in the last section of this report, but we mention here the following codes:
• STRUMPACK uses Hierarchically Semi-Separable (HSS) matrices for low-rank structured factor- ization with randomized sampling.
• MUMPS uses Block Low-Rank (BLR) approximations to improve a multifrontal sparse solver.
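As announced above, here is a minimal sketch of a block compression kernel of the kind used by these approaches (an illustration assuming a truncated SVD with a relative tolerance, not the PaStiX BLR implementation; the function name and the test kernel are hypothetical):

    import numpy as np

    def compress_block_svd(block, tol):
        """Return a low-rank factorization U @ V approximating `block`.

        The rank is the number of singular values above `tol` times the
        largest one; if the compressed form is not smaller than the dense
        block, the block is kept dense.
        """
        U, s, Vt = np.linalg.svd(block, full_matrices=False)
        rank = int(np.sum(s > tol * s[0]))
        m, n = block.shape
        if rank * (m + n) >= m * n:          # compression would not pay off
            return None, block
        return U[:, :rank] * s[:rank], Vt[:rank, :]   # (m x r) and (r x n)

    # Example: a smooth interaction block, hence numerically low-rank.
    x = np.linspace(0.0, 1.0, 200)
    block = 1.0 / (1.0 + np.abs(np.subtract.outer(x, x + 3.0)))
    U, V = compress_block_svd(block, tol=1e-8)
    print(U.shape, V.shape, np.linalg.norm(block - U @ V) / np.linalg.norm(block))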
This report mainly concerns my work on the PaStiX sparse linear solvers, developed thanks to a long-term collaboration with CEA/CESTA. But I also contributed to external applications through joint research efforts with academic partners, enabling efficient and effective technology transfer towards industrial R&D. Outside the scope of this document, I would like to summarize here some of these contributions.
• Plasma physics simulation: scientific simulation for ITER tokamak modeling provides a natural bridge between theory and experiment and is also an essential tool for understanding and predicting plasma behavior. Recent progress in the numerical simulation of fine-scale turbulence and in the large-scale dynamics of magnetically confined plasma has been enabled by access to petascale supercomputers. This progress would have been unreachable without new computational methods and adapted reduced models. In particular, the plasma science community has developed codes whose runtime scales quite well with the number of processors, up to thousands of cores. Other numerical simulation tools designed for the ITER challenge aim at making significant progress in understanding active control methods of the plasma edge MHD instabilities known as Edge Localized Modes (ELMs), which represent a particular danger with respect to heat and particle loads for Plasma Facing Components (PFC) in the tokamak. The goal is to improve the understanding of the related physics and to propose possible new strategies to improve the effectiveness of ELM control techniques. The simulation tool used (the JOREK code) is related to nonlinear MHD modeling and is based on a fully implicit time evolution scheme that leads to large, very badly conditioned 3D sparse linear systems to be solved at every time step. In this context, the use of the PaStiX library to efficiently solve these large sparse problems by a direct method was a challenging issue [81]. Please look at http://www.labri.fr/perso/ramet/restricted/HDR_PPCF09.pdf for a full version of this work with some experiments made during the PhD thesis of Xavier Lacoste [95].
• Nuclear core simulation: EDF R&D is developing a nuclear core simulation code named COCAGNE that relies on a Simplified PN (SPN) method for eigenvalue calculations to compute the neutron flux inside the core. In order to assess the accuracy of SPN results, a 3D Cartesian model of PWR nuclear cores has been designed, and a reference neutron flux inside this core has been computed with a Monte Carlo transport code from Oak Ridge National Laboratory. This kind of 3D whole-core probabilistic evaluation of the flux is computationally very expensive. An efficient deterministic approach is therefore required to reduce the computational cost dedicated to reference simulations. First, we completed a study to parallelize an SPN simulation code by using a domain decomposition method, applied to the solution of the neutron transport equations (Boltzmann equations), based on the Schur dual technique and implemented in the COCAGNE code from EDF. This approach was very reliable and efficient on computations coming from the IAEA benchmark and from industrial cases [22]. Secondly, we worked on the parallelization of the DOMINO code, a parallel 3D Cartesian SN solver specialized for PWR core reactivity computations. We have developed a two-level (multi-core + SIMD) parallel implementation of the sweep algorithm on top of runtime systems [112]. Please look at http://www.labri.fr/perso/ramet/restricted/HDR_JCP10.pdf and also at http://www.labri.fr/perso/ramet/restricted/HDR_JCP17.pdf (submitted to JCP) for a full version of this work with some experiments made during the PhD theses of Bruno Lathuiliere [97] and Salli Moustafa [111].
• Electromagnetism simulation: in collaboration with CEA/CESTA, during his PhD thesis [35], Mathieu Chanaud developed a new parallel platform based on a combination of a multigrid solver and a direct solver (PaStiX) to solve huge linear systems arising from Maxwell's equations discretized with first-order Nédélec elements [36]. In the context of the HPC-PME initiative, we also started a collaboration with AlgoTech, and we organized one of the first PhD-consultant actions at Inria, carried out by Xavier Lacoste and led by myself. AlgoTech is one of the most innovative SMEs (small and medium-sized enterprises) in the field of cabling for embedded systems and, more broadly, automatic devices. The main target of the project was to validate our sparse linear solvers in the area of the electromagnetic simulation tools developed by AlgoTech.
This report is organized as follows. In Chapter 1, we recall the original implementation of the PaStiX solver based on static scheduling, which was well suited for homogeneous architectures. Thanks to the expertise of Pascal Hénon, we developed our basic kernels for the analysis steps, which are still relevant for our prediction models. Chapter 2 presents the two approaches we have investigated to address the need for dynamic scheduling when architectures become too difficult to model. The second approach is probably the most promising because it allows us to deal with GPU accelerators and Intel Xeon Phi while obtaining rather good efficiency and scalability. In Chapter 3, we first recall our contribution on the parallel implementation of an ILU(k) algorithm within the PaStiX framework. This solution allows us to reduce the memory needed by the numerical factorization. Compared to other incomplete factorization methods and hybrid solvers based on domain decomposition, we do not need to consider numerical information from the linear system. But the gain in terms of flops and memory is not enough in many cases to compete with multigrid solvers, for instance. The second part of this chapter describes some preliminary work on the use of graph partitioning techniques to build a domain decomposition leading to a good balance of both the interface and the internal nodes of each subdomain, which is a relevant criterion for optimizing the parallel implementation of hybrid solvers. Finally, in Chapter 4 we introduce a reordering strategy to improve the blocking arising during the symbolic factorization. Our attempt to apply fast direct solvers for dense matrices to the solution of sparse systems is presented in the second part of this chapter and benefits from both our graph partitioning and reordering heuristics.
Chapter 1
Static scheduling and data distributions
This chapter presents some unpublished technical elements that have been introduced in the PaStiX solver to enable the MPI+Threads paradigm. This joint work with Pascal Hénon extends the contributions that have been published in [74, 75]. Please look at http://www.labri.fr/perso/ramet/restricted/HDR_PC02.pdf for a full version of this work with some experiments.
1.1 Context and algorithmic choices
Solving large sparse symmetric positive definite systems Ax = b of linear equations is a crucial and time-consuming step arising in many scientific and engineering applications. For a complete survey on direct methods, one can refer to [46, 53, 54]. Due to their robustness, direct solvers are often used in industrial codes despite their memory consumption. In addition, the factorizations used in today's direct solvers are able to take advantage of the superscalar capabilities of processors by using blockwise algorithms and BLAS primitives. Consequently, many parallel techniques for sparse matrix factorization have been studied and implemented. The goal of our work was to design efficient algorithms able to take advantage of the architectures of modern parallel computers.

In our work, we did not consider factorization with dynamic pivoting and we only considered matrices with a symmetric sparsity pattern (A + At can be used for unsymmetric cases). In this context, the block structure of the factors and the numerical operations are known in advance, which allows the use of static (i.e. computed before the effective numerical factorization) regulation algorithms for the distribution and the computational task scheduling. We therefore developed high-performance algorithms for the direct factorization that strongly exploit the specific features of the networks and processors of the target architectures. In this work, we present static assignment heuristics and an associated factorization algorithm that achieve very good performance in terms of both run time and memory overhead.

There are two main approaches for numerical factorization algorithms: the multifrontal approach [6, 47] and the supernodal one [67, 127]. Both can be described by a computational tree whose nodes represent computations and whose edges represent transfers of data. In the case of the multifrontal method, at each node, some steps of Gaussian elimination are performed on a dense frontal matrix and the remaining Schur complement, or contribution block, is passed to the parent node for assembly. In the case of the supernodal method, the distributed memory version uses a right-looking formulation which, having computed the factorization of a column-block corresponding to a node of the tree, then immediately sends the data to update the column-blocks corresponding to ancestors in the tree. In a parallel context, we can locally aggregate contributions to the same block before sending them, which can significantly reduce the number of messages. Independently of these different methods, a static or dynamic scheduling of block computations can be used. For homogeneous parallel architectures, it is useful to find an efficient static scheduling [121]. In this context, such a scheduling can be derived from a fine computation/communication cost model.
The PSPASES solver [85] is based on a multifrontal approach without pivoting for symmetric positive definite systems. It uses MeTiS [87] for computing a fill-reducing ordering based on a multilevel nested dissection algorithm; when the graph is separated into p parts, a multiple minimum degree ordering (MMD [102]) is then used. A "subtree to subcube"-like algorithm is applied to build a static mapping before the numerical factorization. In [7], the performances of MUMPS [4] and SuperLU [99] are compared for nonsymmetric problems. MUMPS uses a multifrontal approach with dynamic pivoting for stability, while SuperLU is based on a supernodal technique with static pivoting.

In previous work [74, 75], we proposed high-performance algorithms for sparse supernodal factorization without pivoting. These techniques yield a static mapping and scheduling algorithm based on a combination of 1D and 2D block distributions. Thus, we achieved very good performance by taking into account all the constraints of the problem as well as the specific communication and computation characteristics of the parallel architecture. In addition, we studied a way to control the memory overhead due to the buffering needed to reduce the communication volume. In the case of our supernodal factorization, this buffering corresponds to the local aggregation approach in which all local contributions for the same non-local block are summed in a temporary block buffer before being sent (see Section 1.3). Consequently, we improved the mechanism of local aggregation, which may otherwise lead to a large memory consumption overhead for 3D problems in a fully distributed memory context. In [76], we presented a mechanism that controls this memory overhead while preserving relatively good run-time performance.
We have implemented our algorithms in a library called PaStiX that provides LLt and LDLt factorizations, and more generally an LU factorization when the symmetrized pattern A + At is considered. Both real and complex versions of the LDLt and LU factorizations are available. The PaStiX library has been successfully used in industrial simulation codes to solve systems of several million unknowns [59, 60]. Until now, our study was suitable for homogeneous parallel/distributed architectures in the context of a message passing paradigm.
In the context of Symmetric Multi-Processing (SMP) node architectures, a relevant approach to fully exploit shared memory is to use a hybrid MPI+Thread implementation. In the framework of direct solvers, this approach aims to solve problems with more than 10 million unknowns, which is now a reachable challenge with new SMP supercomputers. The rationale that motivated this hybrid implementation is that the communications within an SMP node can advantageously be replaced by direct accesses to shared memory between the processors of the SMP node using threads. In addition, MPI communications between processes are grouped by SMP node. Consequently, whereas in the pure MPI implementation the static mapping and scheduling were computed per MPI process, in the new MPI+Thread implementation the mapping is still computed per MPI process but the task scheduling is computed per thread.
1.2 Background
This section provides a brief background on sparse Gaussian elimination techniques, including their graph models. Details can be found in [40, 46, 54] among others. Consider the linear system

    Ax = b    (1.1)

where A is a symmetric positive definite matrix of size n × n or an unsymmetric matrix with a symmetric nonzero pattern. In this context, sparse matrix techniques often utilize the non-oriented adjacency graph G = (V, E) of the matrix A, a graph whose n vertices represent the n unknowns and whose edges (i, j) represent the couplings between unknowns i and j. The Gaussian elimination process (LLt Cholesky factorization or LU factorization) introduces fill-in, i.e., new edges in the graph. The quality of the direct solver depends critically on the ordering of the unknowns, as this has a major impact on the amount of fill-in generated. Sparse direct solvers often utilize what is known as the "filled graph" of A, which is the graph of the (complete) Cholesky factor L, or rather of L + U. This is the original graph augmented with all the fill-in generated during the elimination process. We will denote this graph by G∗ = (V, E∗).

Two well-known and useful concepts will be needed in later sections. The first is that of fill paths. A fill path is a path between two nodes i and j (with i ≠ j) in the original graph G = (V, E), whose intermediate
nodes are all numbered lower than both i and j. A theorem by Rose and Tarjan [100] states that there is a fill path between i and j if and only if (i, j) is in E∗, i.e., there will be a fill-in in position (i, j). The second important concept is that of an elimination tree [103], which is useful, among other things, for scheduling the tasks of the solver in a parallel environment [8, 75]. The elimination tree captures the dependencies between columns in the factorization. It is defined from the filled graph using the parent relationship: parent(j) = min{ i | i > j and (i, j) ∈ E∗ }.

Two broad classes of reorderings for sparse Gaussian elimination have been widely utilized. The first, which tends to be excellent at reducing fill-in, is the minimum degree ordering. This method, which has many variants, is a local heuristic based on a greedy algorithm [5]. The class of nested dissection algorithms [54], considered in this work, is common in the "static" variants of Gaussian elimination which preorder the matrix and define the tasks to be executed in parallel at the outset. The nested dissection ordering recursively utilizes a sequence of separators. A separator C is a set of nodes which splits the graph into two subgraphs G1 and G2 such that there is no edge linking a vertex of G1 to a vertex of G2. This is then done recursively on the subgraphs G1 and G2. The left side of Figure 1.1 shows an example of a physical domain (e.g., a finite element mesh) partitioned recursively in this manner into a total of 8 subgraphs. The labeling used by nested dissection can be defined recursively as follows: label the nodes of the separator last, after (recursively) labeling the nodes of the children. This naturally defines a tree structure, as shown on the right side of Figure 1.1. The remaining subgraphs are then ordered by a minimum degree algorithm under constraints [117].
Figure 1.1: The nested dissection of a physical mesh and the corresponding tree.
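To make the recursive labeling rule concrete, here is a small illustrative sketch (not the Scotch implementation) that orders the vertices of a one-dimensional chain graph by nested dissection: the middle vertex plays the role of the separator and is labeled last, after both halves it splits, exactly as in Figure 1.1.

    def nested_dissection_order(vertices):
        """Recursively order a chain of vertices by nested dissection."""
        n = len(vertices)
        if n <= 2:                       # small subgraph: keep natural order
            return list(vertices)
        mid = n // 2                     # single-vertex separator of the chain
        left = nested_dissection_order(vertices[:mid])
        right = nested_dissection_order(vertices[mid + 1:])
        return left + right + [vertices[mid]]   # separator labeled last

    # A chain of 15 vertices: separators receive the highest labels.
    print(nested_dissection_order(list(range(15))))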
An ideal separator is such that G1 and G2 are of about the same size while the set C is small. A number of efficient graph partitioners which attempt to reach a compromise between these two requirements have been developed in recent years, see, e.g., [87, 116, 71] among others. After a good ordering is found, a typical solver performs a symbolic factorization which essentially determines the pattern of the factors. This phase can be elegantly expressed in terms of the elimination tree. We will denote by [i] the sparse pattern of a column i, which is the list, in increasing order, of the row indices corresponding to nonzero terms.
Algorithm 1: Sequential symbolic factorization
    Build [i] for each column i in the original graph;
    for i = 1 . . . n − 1 do
        [parent(i)] = merge([i], [parent(i)]);
where merge([i], [j]) is a function which merges the patterns of columns i and j in the lower triangular factor. In a parallel implementation, this is better done with the help of a post-order traversal of the tree: the pattern of a given node only depends on the patterns of its children and can be obtained once these are computed. The symbolic factorization is a fairly inexpensive process since it utilizes two nested loops instead of the three loops normally required by Gaussian elimination. Note that all computations are symbolic, the main kernel being the merge of two column patterns.

Most sparse direct solvers take advantage of dense computations by exhibiting a dense block structure in the matrix L. This dense block structure is directly linked to the ordering techniques based on the
nested dissection algorithm (e.g., MeTiS [87] or Scotch [118]). Indeed, the columns of L can be grouped into sets such that all columns of the same set have a similar nonzero pattern. Those sets of columns, called supernodes, are then used to prune the block structure of L. The supernodes obtained with such orderings mostly correspond to the separators found in the nested dissection process of the adjacency graph G of matrix A. An important result used in direct factorization is that the partition P of the unknowns induced by the supernodes can be found without knowing the nonzero pattern of L [104]. The partition P of the unknowns is then used to compute the block structure of the factorized matrix L during the so-called block symbolic factorization. The block symbolic factorization exploits the fact that, when P is the partition of separators obtained by nested dissection, we have:
Q(G, P)∗ = Q(G∗, P)
where Q(G, P) is the quotient graph of G with regards to partition P. Then, we can deduce the block elimination tree which is the elimination tree associated with Q(G, P)∗ and which is well suited for parallelism [75]. It is important to keep in mind that this property can be used to prune the block structure of the factor L because one can find the supernode partition from G(A) [104].
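As a small illustration of grouping columns into supernodes (an illustrative sketch only: it assumes the column patterns of L are already known, which the solver in fact avoids thanks to the result cited above), consecutive columns whose patterns coincide below the diagonal can be merged as follows:

    def find_supernodes(col_patterns):
        """Group consecutive columns of L into supernodes.

        col_patterns[j] is the sorted list of row indices of the nonzeros of
        column j (diagonal included). Column j joins the current supernode
        when its pattern equals the pattern of column j-1 restricted to rows
        strictly below j-1.
        """
        supernodes = [[0]]
        for j in range(1, len(col_patterns)):
            prev = [i for i in col_patterns[j - 1] if i > j - 1]
            curr = [i for i in col_patterns[j] if i >= j]
            if prev == curr:
                supernodes[-1].append(j)
            else:
                supernodes.append([j])
        return supernodes

    # Columns 0-2 share the same structure, and so do columns 3-5.
    patterns = [[0, 1, 2, 5], [1, 2, 5], [2, 5], [3, 4, 5], [4, 5], [5]]
    print(find_supernodes(patterns))   # -> [[0, 1, 2], [3, 4, 5]]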
Figure 1.2: Block data structure of a factorized matrix with N = 31 column-blocks. Diagonal blocks are drawn in light grey and off-diagonal blocks in dark grey.
Figure 1.2 shows the block structure obtained for a 7x7 grid ordered by nested dissection. The block symbolic structure (see Figure 1.2) consists of N column-blocks, each of them containing a dense symmetric diagonal block and a set of dense rectangular off-diagonal blocks. In this block data structure, an off-diagonal block is represented by a pair of integers (α, β) corresponding to the first and the last rows of this block in the column block. We will denote by [i] the block sparse pattern of a supernode i which is the list in increasing order of such intervals. As a result the block symbolic factorization will now need to merge two sets of intervals.
Algorithm 2: Block symbolic factorization
    Build [i] for each supernode i in the original graph;
    for i = 1 . . . N − 1 do
        [parent(i)] = Bmerge([i], [parent(i)]);
where now Bmerge merges two sorted lists of intervals and where parent() defines the father relationship in the block elimination tree. It was shown by Charrier and Roman [38] that, when a nested dissection algorithm is used, the cost of computing the block symbolic factorization is O(N) for most graphs. In other words, the total number of blocks to be handled is O(N), which means that a total of O(N) pointers and operations are required to perform the block symbolic factorization.
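A possible implementation of the interval merge at the heart of Algorithm 2 is sketched below (an illustration assuming that block patterns are kept as sorted, disjoint row intervals; the PaStiX internals may differ):

    def bmerge(pattern_a, pattern_b):
        """Merge two block sparse patterns given as sorted lists of disjoint
        row intervals (first_row, last_row), fusing overlapping or adjacent
        intervals so that the result is again sorted and disjoint."""
        merged = []
        for first, last in sorted(pattern_a + pattern_b):
            if merged and first <= merged[-1][1] + 1:   # overlaps or touches previous
                merged[-1] = (merged[-1][0], max(merged[-1][1], last))
            else:
                merged.append((first, last))
        return merged

    # Patterns of a supernode and of its parent, as row intervals.
    print(bmerge([(8, 10), (14, 14), (22, 25)], [(9, 15), (30, 31)]))
    # -> [(8, 15), (22, 25), (30, 31)]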
A sparse block factorization algorithm can be obtained by restricting the standard dense block algorithm to the sparsity pattern which has been computed by the symbolic factorization phase.
Algorithm 3: Sparse block LDLt factorization
    for k = 1 . . . N do
        Factor Akk into Akk = Lk Dk Lk^T;
        for j ∈ [k] do
            Compute Ajk^T = Dk^-1 Lk^-1 Ajk^T;
        for j ∈ [k] do
            for i ∈ [k], i ≥ j do
                Aij := Aij − Aik Dk Ajk^T;
These algorithms are parallelized and implemented in our supernodal direct solver PaStiX [75, 76]. Since PaStiX deals with matrices that have a symmetric pattern, the algorithms presented in the remainder of this work are based on the assumption that the adjacency graph is symmetric.
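For illustration, the sketch below runs the blockwise LDLt algorithm of Algorithm 3 on dense NumPy blocks stored in a dictionary indexed by (row-supernode, column-supernode). It is a toy sequential version under simplifying assumptions (fully dense blocks, no pivoting, all fill blocks preallocated by the caller), not the PaStiX kernel.

    import numpy as np
    from scipy.linalg import solve_triangular

    def dense_ldlt(A):
        """Dense LDL^t factorization without pivoting (unit lower L, diagonal d)."""
        n = A.shape[0]
        L, d = np.eye(n), np.zeros(n)
        for j in range(n):
            d[j] = A[j, j] - L[j, :j] @ (d[:j] * L[j, :j])
            for i in range(j + 1, n):
                L[i, j] = (A[i, j] - L[i, :j] @ (d[:j] * L[j, :j])) / d[j]
        return L, d

    def block_ldlt(blocks, pattern, N):
        """Blockwise LDL^t of Algorithm 3 on dense NumPy blocks.

        blocks[(i, k)] is the dense block A_ik for i >= k; pattern[k] is the
        sorted list [k] of row-supernodes i > k owning a block in column k.
        The symbolic factorization guarantees that every updated block (i, j)
        exists in `blocks`. Blocks are overwritten by the factor blocks.
        """
        D = {}
        for k in range(N):
            Lk, dk = dense_ldlt(blocks[(k, k)])          # A_kk = L_k D_k L_k^t
            blocks[(k, k)], D[k] = Lk, dk
            for j in pattern[k]:                          # A_jk <- A_jk L_k^{-t} D_k^{-1}
                X = solve_triangular(Lk, blocks[(j, k)].T, lower=True, unit_diagonal=True)
                blocks[(j, k)] = (X / dk[:, None]).T
            for j in pattern[k]:                          # right-looking updates
                for i in (s for s in pattern[k] if s >= j):
                    blocks[(i, j)] -= blocks[(i, k)] * dk @ blocks[(j, k)].T
        return blocks, D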
1.3 Static scheduling
1.3.1 Preprocessing steps

In order to achieve an efficient parallel sparse factorization, we perform three sequential preprocessing phases:

• The ordering phase, which computes a symmetric permutation of the initial matrix A such that the factorization process will exhibit as much concurrency as possible while incurring low fill-in. In the software chain, we use the package Scotch [116], which uses a tight coupling of the Nested Dissection and Approximate Minimum Degree algorithms [5, 117]. The partition of the original graph into supernodes is obtained by merging the partition of separators computed by the Nested Dissection algorithm and the supernodes amalgamated for each subgraph ordered by Halo Approximate Minimum Degree.

• The block symbolic factorization phase, which determines the block data structure of the factorized matrix L associated with the partition resulting from the ordering phase. This structure consists of N column-blocks, each containing a dense symmetric diagonal block and a set of dense rectangular off-diagonal blocks (see Figure 1.2). One can efficiently perform such a block symbolic factorization in quasi-linear space and time complexities [38]. From the block structure of L, one can deduce the weighted elimination quotient graph that describes all dependencies between column-blocks, as well as the supernodal elimination tree.

• The block repartitioning and scheduling phase, which refines the previous partition by splitting large supernodes in order to exploit, in addition to the parallelism provided by the block elimination tree, the concurrency available within dense block computations, both induced by the block computations in the supernodal solver (we use a version with a local aggregation of the outgoing contributions for the blocks mapped on other processors). Once the initial supernodal partition is refined, the scheduling phase maps the resulting blocks onto the processors according to a 1D scheme (i.e. by column-blocks) for the lower part of the tree, and according to a 2D scheme (i.e. by blocks) for the upper part of the tree. In this phase, we also compute a static optimized scheduling of the computational and communication tasks according to BLAS and communication time models calibrated on the target machine. This static scheduling drives the parallel factorization and the backward and forward substitutions [72, 76]. The block repartitioning and scheduling phase is detailed in Section 1.3.5.
1.3.2 Parallel scheme of the factorization

In this section, we describe how the factorization algorithm is parallelized. For ease of understanding, we first describe the parallelization in a classical fully distributed way: each processor is assigned a part of the matrix and all communications are made using MPI. Section 1.3.3 then describes
how the parallelization is modified to benefit from SMP node architectures through the MPI+Thread implementation, and Section 1.3.4 gives details of the parallel algorithm.
Using the block structure induced by the symbolic factorization, the sequential blockwise supernodal factorization algorithm can be written as in Algorithm 3. The parallelization of the algorithm can be done in two different ways, which come along with two block distribution schemes.

The first one is to parallelize the main loop over the column-blocks (first line of Algorithm 3). This kind of parallelization leads to a column-block distribution: each processor computes the main loop iterations for a column-block subset of the global matrix. This type of distribution is called a "one dimensional" (1D) distribution because it only involves a partition of the matrix along the column-block indices. The only communications needed are then the updates of blocks Aij in the inner loop. Generally, within the local column-block factorization, a processor has to update a block Aij owned by another processor several times. It would be costly to send each contribution for the same non-local block in separate messages. The so-called "local aggregation" variant of the supernodal parallel factorization consists in adding locally any contribution for a non-local block Aij into a local temporary block and sending it once all the contributions for this block have been added. We will denote such a temporary block by AUBij in the following sections.

The second way is to parallelize the inner loop over the blocks. This kind of parallelization leads to a block distribution: the blocks of a column-block can be mapped on different processors. This kind of distribution is called a "two dimensional" (2D) distribution. In this version, the communications are of two types:
• the first type is the updates of non-local blocks; these communications are made in the same manner as in the 1D distribution;
• the second type is the communications needed to perform the block operations when the processor in charge of these operations does not own all the blocks involved in the same column-block.
The 1D and 2D parallelizations are appealing in different ways:
• the 1D distribution allows us to use BLAS with the maximum efficiency allowed by the blockwise algorithm. Indeed, the 'solve' steps within the same column-block of Algorithm 3 can be grouped into a single TRSM (BLAS3 subroutine). In addition, for an iteration j, the 'update' steps can be grouped into a single GEMM. This compacting optimizes the pipeline effect in the BLAS subroutines and consequently the run-time performance.
• the 2D distribution has a finer grain of parallelism. It exhibits more parallelism in the factorization than the 1D distribution but loses some BLAS efficiency in comparison.
We use a parallelization that takes advantage of both the 1D and 2D approaches. Considering the block elimination tree, the nodes on the lower part usually involve one or a few processors in their elimination during the factorization, whereas the nodes in the upper part involve many of the processors, and the root potentially involves all of them. As a consequence, the factorization of the column-blocks corresponding to the lower nodes of the block elimination tree benefits more from a 1D distribution, in which BLAS efficiency is favored over the grain of parallelism. For the opposite reasons, the column-blocks corresponding to the highest nodes of the block elimination tree use a parallel elimination based on a 2D distribution scheme. The criterion to switch between a 1D and a 2D distribution depends on the machine network capabilities, the number of processors and the amount of work in the factorization. Such a criterion seems difficult to obtain from a theoretical analysis. We therefore use an empirical criterion calibrated on the target machine [74].
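The following fragment illustrates, on dense NumPy blocks, how a 1D column-block task groups its operations into one TRSM over the whole off-diagonal panel and one GEMM per updated supernode (an illustration of the grouping described above for an LLt factorization, not PaStiX code; function names are hypothetical):

    import numpy as np
    from scipy.linalg import solve_triangular

    def comp1d_llt(A_kk, A_offdiag):
        """1D column-block task: A_offdiag stacks all off-diagonal blocks A_*k
        of column-block k into a single dense panel, so the solve step is a
        single TRSM applied to the whole panel."""
        L_kk = np.linalg.cholesky(A_kk)                               # dense POTRF
        panel = solve_triangular(L_kk, A_offdiag.T, lower=True).T    # one TRSM
        return L_kk, panel

    def contribution(panel, lo, hi):
        """Contribution for supernode j, whose rows occupy panel[lo:hi]:
        a single GEMM L_[j]k @ L_jk^t covering all target blocks at once."""
        return panel[lo:, :] @ panel[lo:hi, :].T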
1.3.3 Hybrid MPI+Thread implementation

Until now, we have discussed the parallelization in a fully distributed memory environment. Each processor was assumed to manage a part of the matrix in its own memory space, and any communication needed to update non-local blocks of the matrix was handled using the message passing paradigm. Nowadays, massively parallel high performance computers are generally designed as a network of SMP nodes. Each SMP node consists of a set of processors that share the same physical memory. On most of these machines, though the MPI interface is not changed, the communication layer of the library takes advantage of the shared memory capabilities for messages between processors of the same SMP node.

Figure 1.3: MPI+Thread implementation for multicore architectures.
Nevertheless, a major memory bottleneck of the parallel supernodal factorization is caused by the local aggregation mechanism we discussed in the previous section. Indeed, an aggregate update block (AUB) resides in memory from the moment the first contribution is added until the last contribution is added. For matrices coming from 3D problems, the peak memory consumption due to the storage of AUBs during the factorization is several times greater than that of the local part of the matrix. The local aggregation mechanism is inherent to the message passing model used in the parallel factorization algorithm. So, in an SMP context, the simplest way to reduce the use of aggregate update blocks is to avoid message passing communications between processors of the same node. In order to avoid these intra-node communications, we use threads that share the same memory space. Inside a node, a unique MPI process spawns a thread on each processor. These threads are then able to address the whole memory space of the node. Each thread is therefore able to access the part of the matrix mapped on the MPI process. In this way, though the computational tasks are distributed between the threads, any update on the matrix part of the MPI process is directly added into the local matrix. The communication between MPI processes still uses the local aggregation scheme we described before. This hybrid MPI+Thread strategy is pictured in Figure 1.3.
1.3.4 Parallel factorization algorithm

Let us consider the block data structure of the factorized matrix L computed by the block symbolic factorization. Recall that each of the N column-blocks holds one dense diagonal block and some dense
off-diagonal blocks. We then define two sets: BStruct(Lk∗) is the set of column-blocks that update column-block k, and BStruct(L∗k) is the set of column-blocks updated by column-block k (see [72, 73, 74, 75] for details). Let us now consider a parallel supernodal version of sparse factorization with total local aggregation: all non-local block contributions are aggregated locally in block structures. This scheme is close to the Fan-In algorithm [19], as processors communicate using only aggregated update blocks. These aggregated update blocks, denoted in what follows by AUB, can be built from the block symbolic factorization. These contributions are locally aggregated before being sent. The proposed algorithm can yield 1D (column-block) or 2D (block) distributions [72, 74]. The pseudo-code of the LLT factorization can be expressed in terms of dense block computations or tasks; these computations are performed, as much as possible, on compacted sets of blocks for BLAS efficiency. Let us introduce the following notation:

• τ : local thread number;
• π : local process number;
• Nτ : total number of tasks mapped on thread τ;
• Kτ[i] : i-th task of thread τ;
• for column-block k, the symbol ∗ means ∀j ∈ BStruct(L∗k);
• for j ≥ k, the sequence [j] means ∀i ∈ BStruct(L∗k) ∪ {k} with i ≥ j;
• map2process(·, ·) is the 2D block mapping function by process.

Block computations can be classified into four types, and the associated tasks are defined as follows:
• COMP1D(k) : factorize the column-block k and compute all the contributions for the column-blocks in BStruct(L∗k)
    . Factorize A_kk into L_kk L_kk^t
    . Solve L_kk A_?k^t = A_?k^t
    . For j ∈ BStruct(L_∗k) Do
    .     Compute C_[j] = L_[j]k A_?k^t
    .     If map2process([j], j) == π Then A_[j]j = A_[j]j − C_[j]
    .     Else AUB_[j]j = AUB_[j]j + C_[j]

• FACTOR(k) : factorize the diagonal block k
    . Factorize A_kk into L_kk L_kk^t

• BDIV(j,k) : update the off-diagonal block j in column-block k
    . Solve L_kk A_jk^t = A_jk^t

• BMOD(i,j,k) : compute the contribution of block i in column-block k for block i in column-block j
    . Compute C_i = L_ik A_jk^t
    . If map2process(i, j) == π Then A_ij = A_ij − C_i
    . Else AUB_ij = AUB_ij + C_i
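The common pattern of COMP1D and BMOD (subtract the contribution in place when the target block is local, accumulate it into an AUB otherwise) can be sketched as follows; the mapping function and the flat block storage are hypothetical stand-ins, not the PaStiX data structures:

```c
#include <stddef.h>

/* Hypothetical 2D block-to-process mapping, standing in for map2process. */
static int map2process(int i, int j, int nprocs) { return (i + j) % nprocs; }

/* A contribution C of mn coefficients destined to block (i,j): subtract it
 * from the local factor block when (i,j) is mapped on this process (pi),
 * otherwise accumulate it into the matching aggregate update block (AUB). */
static void dispatch_contribution(double *A_ij, double *AUB_ij,
                                  const double *C, size_t mn,
                                  int i, int j, int pi, int nprocs)
{
    if (map2process(i, j, nprocs) == pi) {
        for (size_t n = 0; n < mn; n++) A_ij[n]   -= C[n];   /* A_ij  = A_ij  - C  */
    } else {
        for (size_t n = 0; n < mn; n++) AUB_ij[n] += C[n];   /* AUB_ij = AUB_ij + C */
    }
}
```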
On each thread τ, K_τ is the vector of tasks for computations (line 2 in Figure 1.4), ordered by priority. Each task should have received all its contributions and should have updated the associated local data before any new contribution is computed. When the last contribution is aggregated in the corresponding AUB, this aggregated update block is said to be “completely aggregated” and is ready to be sent. To achieve good efficiency, the sending of an AUB has to match the static scheduling of tasks on the destination processor.
1.  For n = 1 to N_τ Do
2.    Switch ( Type(K_τ[n]) ) Do
3.      COMP1D: Receive all AUB_[k]k for A_[k]k
4.              COMP1D(k)
5.              If AUB_[j]j is completed then send it to map2process([j], j)
6.      FACTOR: Receive all AUB_kk for A_kk
7.              FACTOR(k)
8.              Send L_kk D_k to map2process([k], k)
9.      BDIV:   Receive L_kk and all AUB_jk for A_jk
10.             BDIV(j,k)
11.             Send A_jk^t to map2process([j], k)
12.     BMOD:   Receive A_jk^t
13.             BMOD(i,j,k)
14.             If AUB_ij is completed then send it to map2process(i,j)
Figure 1.4: Outline of the parallel factorization algorithm.
1.3.5 Matrix mapping and task scheduling

Before running the general parallel algorithm presented above, we must perform a step consisting of partitioning and mapping the blocks of the symbolic matrix onto the set of SMP nodes. The partitioning and mapping phase aims at computing a static assignment that balances the workload and enforces the precedence constraints imposed by the factorization algorithm; the block elimination tree structure must be used there. Our main algorithm is based on a static regulation led by a time simulation during the mapping phase. Thus, the partitioning and mapping step generates a fully ordered schedule used in the parallel factorization. This schedule aims at statically regulating all of the issues that are classically managed at runtime. To make our scheme reliable, we estimate the workload and message passing latency by using a BLAS and communication network time model which is automatically calibrated on the target architecture. Unlike usual algorithms, our partitioning and distribution strategy is divided into two distinct phases. The partition phase splits column-blocks associated with large supernodes, builds, for each column-block, a set of candidate threads for its mapping, and determines whether it will be mapped using a 1D or 2D distribution. Once the partitioning step is over, the task graph is built. In this graph, each task is associated with the set of candidate threads of its column-block. The mapping and scheduling phase then maps each task onto one of these threads. An important constraint is that once a task has been mapped on a thread, all the data accessed by this task are also mapped on the process associated with the thread. This means that an unmapped task that accesses a block already mapped will necessarily be mapped on the same process (i.e. SMP node). The partitioning algorithm is based on a recursive top-down strategy over the block elimination tree provided by the block symbolic factorization. Pothen and Sun presented such a strategy in [121]. It starts by splitting the root and assigning it to a set of candidate threads Q that is the set of all threads. Given the number of candidate threads and the size of the supernodes, it chooses the strategy (1D or 2D) that the mapping and scheduling phase will use to distribute this supernode. Then each subtree is recursively assigned to a subset of Q proportionally to its workload. Once the partitioning phase has built a new partition and the set of candidate threads for each task, the election of an owner thread for each task falls to the mapping and scheduling phase. The idea behind this phase is to simulate the parallel factorization as each mapping comes along. Thus, for each thread, we define a timer that holds the current elapsed computation time, and a ready task heap. At a given time, this task heap contains all tasks that are not yet mapped, that have received all of their contributions, and for which the thread is a candidate. The algorithm starts by mapping the leaves of the elimination tree (those which have only one candidate thread). After a task has been mapped, the next task to be mapped is selected as follows: we take the first task of each ready task heap and choose the one that comes from the lowest node in the elimination tree. The communication pattern of all the contributions for a task depends on the already mapped tasks and on the candidate thread considered for the ownership of this task.
The task is mapped onto the candidate thread that will be able to compute it the soonest.
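A minimal sketch of this simulated greedy mapping is given below; the cost values and candidate sets are made up for illustration, and the real simulation also accounts for the arrival times of contributions predicted by the communication model.

```c
#include <stdio.h>

#define NTHREADS 4
#define NTASKS   6

typedef struct {
    double cost;              /* predicted computation time (BLAS model)    */
    int    ncand;
    int    cand[NTHREADS];    /* candidate threads from the partition phase */
} task_t;

int main(void)
{
    double clock[NTHREADS] = {0.0};   /* simulated elapsed time per thread  */
    task_t tasks[NTASKS] = {          /* made-up data, assumed sorted by priority */
        {3.0, 2, {0, 1}}, {2.0, 2, {0, 1}}, {4.0, 2, {2, 3}},
        {1.0, 2, {2, 3}}, {5.0, 4, {0, 1, 2, 3}}, {2.5, 4, {0, 1, 2, 3}},
    };

    for (int t = 0; t < NTASKS; t++) {
        /* map the task on the candidate that would finish it the soonest */
        int best = tasks[t].cand[0];
        for (int c = 1; c < tasks[t].ncand; c++) {
            int cnd = tasks[t].cand[c];
            if (clock[cnd] < clock[best])
                best = cnd;
        }
        clock[best] += tasks[t].cost;
        printf("task %d -> thread %d (ends at %.1f)\n", t, best, clock[best]);
    }
    return 0;
}
```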
As a conclusion about the partitioning and mapping phase, we obtain a strategy that takes into account, in the mapping of task computations, all the phenomena that occur during the parallel factorization. Thus we achieve a block computation and communication scheme that drives the parallel solver efficiently.
1.4 Reducing the memory footprint using partial aggregation techniques
In the previous section, we presented a mapping and scheduling algorithm for clusters of SMP nodes and we showed the benefits of such strategies on performance. In addition to runtime performance, another important aspect in the direct resolution of very large sparse systems is the large memory requirement usually needed to factorize the matrix in parallel. A critical point in large-scale industrial applications can be the memory overhead caused by the structures related to the distributed data management and the communications. These memory requirements can be caused either by the structures needed for communication (the AUB structures in our case) or by the matrix coefficients themselves. To deal with these memory management problems, our statically scheduled factorization algorithm can take advantage of the deterministic access (and allocation) pattern to all data stored in memory. Indeed, the data access pattern is determined by the computational task ordering that drives the access to the matrix coefficient blocks, and by the communication priorities that drive the access to the AUB structures. We can consider two ways of using this predictable data access pattern:
• the first one consists in reducing the memory used to store the AUBs by allowing some still-incomplete AUBs to be sent in order to temporarily free some memory. In this case, according to a memory limit fixed by the user, the AUB access pattern is used to determine which partially updated AUBs should be sent in advance so as to minimize the impact on runtime performance;
• the second one is applied to an “out-of-core” version of the parallel factorization algorithm. In this case, according to a memory limit fixed by the user, we use the coefficient block access pattern to reduce the I/O volume and anticipate I/O calls.
In the case of the supernodal factorization with total aggregation of the contributions, this overhead is mainly due to the memory needed to store the AUBs until they are entirely updated and sent. Indeed, an aggregated update block AUB is an overlapping block of all the contributions from a processor to a block mapped on another processor. Hence an AUB structure is present in memory from the time its first contribution is added and is released only when it has been updated by its last contribution and actually sent. In some cases, particularly for matrices coming from 3D problems, the amount of memory needed for the AUBs still in memory can become significant compared to the memory needed to store the matrix coefficients. A solution to address this problem is to reduce the number of AUBs simultaneously present in memory. The technique then consists in sending some partially updated AUBs before their actual completion, temporarily saving some memory until the next contribution is added to such an AUB. This method is called partial aggregation [20]. Partial aggregation induces a time penalty compared to total aggregation due to more dynamic memory reallocations and an increased volume of communications. Nevertheless, using the knowledge of the AUB access pattern and the priority set on the messages, one can minimize this overhead. Indeed, by using the static scheduling, we are able to know, by following the ordered task vector and the communication priorities, when an allocation or a deallocation will occur in the numerical factorization. That is to say, we are able to trace the memory consumption along the factorization tasks without actually running them. Then, given a memory limit set by the user, the technique to minimize the number of partially updated AUBs needed to enforce this limitation is to choose, among the AUBs in memory, those that will be updated again the latest in the task vector. This is done whenever the limit is exceeded in the logical traversal of the task vector.
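The selection rule can be sketched as a Belady-like eviction over the traced task vector (an illustrative sketch only; the array names are hypothetical): among the AUBs currently resident, flush the one whose next update occurs the latest.

```c
/* Illustrative sketch. resident[a] != 0 when AUB a currently holds partial
 * contributions; next_update_step[a] is the position, in the statically
 * ordered task vector, of the next task that will contribute to AUB a.   */
static int choose_aub_to_flush(int naub, const int *resident,
                               const int *next_update_step)
{
    int victim = -1;
    for (int a = 0; a < naub; a++) {
        if (!resident[a])
            continue;
        if (victim < 0 || next_update_step[a] > next_update_step[victim])
            victim = a;   /* updated again the latest: best to send early */
    }
    return victim;        /* -1 when no AUB is resident */
}
```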
Name      Columns   NNZA      NNZL          OPC           NNZL/OPC   Description
BMWCRA1   148770    5247616   6.597301e+07  5.701988e+10  1.16e-3    PARASOL
SHIP003   121728    3982153   5.872912e+07  8.008089e+10  0.73e-3    PARASOL
CUBE47    103823    1290898   4.828456e+07  6.963850e+10  0.69e-3    3D Mesh
THREAD     29736    2220156   2.404333e+07  3.884020e+10  0.62e-3    PARASOL
Table 1.1: Description of some test problems. NNZA is the number of off-diagonal terms in the triangular part of matrix A, NNZL is the number of off-diagonal terms in the factorized matrix L, and OPC is the number of operations required for the factorization. Matrices are sorted in decreasing order of NNZL/OPC, which is a measure of the potential data reuse.
Our experiments are performed on sparse matrices coming from the PARASOL project. Most of these matrices are structural mechanics and CFD matrices. The values of the associated measurements in Table 1.1 come from the scalar column symbolic factorization. Figure 1.5 shows the time penalty observed for 4 test problems with different levels of AUB memory constraints. This stresses the interest of the partial aggregation technique: there is a fine trade-off between the runtime performance provided by the total aggregation strategy and the low memory overhead provided by an aggregation-free algorithm. These results show that the memory reduction is acceptable in terms of time penalty up to a 50% extra-memory reduction. As we can see, an extra-memory reduction of about 50% induces a factorization time overhead between 2.7% and 28.4% compared to the time obtained with the total aggregation technique. In that context, our extra-memory management leads to good memory scalability. Furthermore, this technique allows us to factorize the AUDI matrix (a 3D problem from the PARASOL collection requiring more than 5 Tflop, which is a difficult problem for direct methods in terms of memory overhead) in 188s on 64 Power3 processors with a 50% reduction of the memory overhead (about 28 Gflop/s).
[Four plots (THREAD, BMWCRA1, SHIP003, CUBE47): percentage of time penalty as a function of the percentage of memory reduction, with one curve per run on 16, 32, 64 and 128 processors.]
Figure 1.5: Percentage of time penalty / percentage of memory overhead reduction.
Chapter 2
Dynamic scheduling and runtime systems
The first part of this chapter presents the work developed in the PhD thesis of Mathieu Faverge [48] (in French), which I co-advised. Some contributions have been submitted to [49] (according to the organizers, the proceedings have suffered many unfortunate delays due to extenuating circumstances, but will finally be published in two volumes, LNCS 6126 and 6127, in 2012).
The second part of this chapter presents the work developed in the PhD thesis of Xavier Lacoste [95] (in English), which I co-advised. Some contributions have been published in [96]. Please refer to http://www.labri.fr/perso/ramet/restricted/HDR_HCW14.pdf for a full version of this work with some experiments.
2.1 A dedicated approach
NUMA (Non Uniform Memory Access) architectures have an important effect on memory access costs, and introduce contention problems which do not exist on SMP (Symmetric Multi-Processor) nodes. Thus, the main data structure of the sparse direct solver PaStiX has been modified to be more suitable for NUMA architectures. A second modification, relying on overlapping opportunities, allows us to split computational or communication tasks and to anticipate as much as possible the data receptions. We also introduce a simple way to dynamically schedule an application based on a dependency tree while taking into account NUMA effects. Results obtained with these modifications are illustrated by showing the performance of the PaStiX solver on different platforms and matrices. After a short description of architectures and matrices used for numerical experiments in Section 2.1.1, we study data allocation and mapping taking into account NUMA effects (see Section 2.1.2). Section 2.1.3 focuses on the improvement of the communication overlap as preliminary work for a dynamic scheduler. Finally, a dynamic scheduler is described and evaluated on the PaStiX solver for various test cases in Sections 2.1.4 and 2.1.5.
2.1.1 NUMA architectures
Two NUMA platforms and one SMP platform have been used in this work:
• The first NUMA platform, see Figure 2.1a, denoted as “NUMA8”, is a cluster of ten nodes interconnected by an Infiniband network. Each node is made of four Dual-Core AMD Opteron(tm) processors interconnected by HyperTransport. The system memory amounts to 4GB per core, giving a total of 32GB per node.
• The second NUMA platform, see Figure 2.1b, called “NUMA16”, is a single node of eight Dual-Core AMD Opteron(tm) processors with 64GB of memory.
(a) NUMA8 : four dual-core Opterons (b) NUMA16 : eight dual-core Opterons
(c) SMP16 : 4 × 4 Power5
Figure 2.1: Architectures used for benchmarks
• The SMP platform is a cluster of six IBM p575 nodes, see Figure 2.1c, called “SMP16”. These nodes are interconnected by an IBM Federation network. Each node contains four blocks of four Power5 processors with 8GB of memory per block. However, the memory available is limited to 28GB by the system.
Tests on NUMA effects have been performed on the three platforms and tests on communications have been performed on the IBM Federation network and on the Infiniband network with respectively the IBM MPI implementation and Mvapich2 that support the MPI_THREAD_MULTIPLE threading model.
Our research targeted the parallel sparse direct solver PaStiX which uses a hybrid MPI+Threads implementation. Improvements made on the solver have been tested on a set of eight matrices described in Table 2.1. All matrices are double precision real matrices with the exception of the last one (HALTERE) which is a double precision complex matrix. MATR5 is an unsymmetric matrix (with a symmetric pattern).
Name        Columns     NNZA         NNZL            OPC
MATR5       485 597     24 233 141   1 361 345 320   9.84422e+12
AUDI        943 695     39 297 771   1 144 414 764   5.25815e+12
NICE20      715 923     28 066 527   1 050 576 453   5.19123e+12
INLINE      503 712     18 660 027     158 830 261   1.41273e+11
NICE25      140 662      2 914 634      51 133 109   5.26701e+10
MCHLNF       49 800      4 136 484      45 708 190   4.79105e+10
THREAD       29 736      2 249 892      25 370 568   4.45729e+10
HALTERE   1 288 825     10 476 775     405 822 545   7.62074e+11
NNZA is the number of off-diagonal terms in the triangular part of the matrix A, NNZL is the number of off-diagonal terms in the factorized matrix L and OPC is the number of operations required for the factorization.
Table 2.1: Matrices used for experiments
2.1.2 NUMA-aware allocation

Modern multi-processing architectures are commonly based on shared memory systems with a NUMA behavior. These computers are composed of several chipsets including one or several cores associated with a memory bank. The chipsets are linked together with a cache-coherent interconnection system. Such an architecture implies hierarchical memory access times from a given core to the different memory banks. This architecture also possibly incurs different bandwidths depending on the respective locations of a given core and of the data sets that this core is using [16, 92]. It is thus important on such platforms to take these processor/memory locality effects into account when allocating resources. Modern operating systems commonly provide APIs dedicated to NUMA architectures which allow programmers to control where threads are executed and memory is allocated. These interfaces have been used in the following to exhibit NUMA effects on our three architectures.
(a) Localization of new NUMA-aware allocation in the matrix
(b) Initial allocation (c) New NUMA-aware allocation
Figure 2.2: NUMA-aware allocation
The hybrid MPI/thread version of the PaStiX solver already uses a function to bind threads on cores for different systems (Linux, AIX, MacOS, ...). However, in its initial version, PaStiX does not take
into account NUMA effects in memory allocation. The initialization step allocates all structures needed by the computations, and especially the part of the matrix computed on the node, and fills it with the local coefficients provided by the user. After this step, as many threads as cores are created to compute the numerical factorization (eight threads for the NUMA8 platform and sixteen threads for the NUMA16 and SMP16 clusters). The problem is therefore that all data are allocated close to the core where the initialization step has occurred. Memory allocation is not evenly spread over the node and access times are thus not optimal (see Figure 2.2b). In the new version of PaStiX, data structures have been modified to allow each thread to allocate and fill its own part of the matrix, as shown in Figure 2.2a. This example shows the allocation repartition on each process and on each thread of each process. The memory is better spread over the node, as shown in Figure 2.2c, and thus allows us to obtain the best memory access as seen previously. Moreover, thanks to the method used to compute the static scheduling [74, 75], access to remote data is limited, in line with the results presented in [106].
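A minimal sketch of such a per-thread allocation, assuming a Linux system with libnuma (link with -lnuma), is shown below; PaStiX itself simply lets each computing thread allocate and fill its own column-blocks, which achieves the same locality through the first-touch policy.

```c
#include <numa.h>      /* libnuma: numa_available(), numa_alloc_local() */
#include <stdlib.h>
#include <string.h>

/* Sketch only: allocate and zero a panel of nvals coefficients on the NUMA
 * node the calling thread runs on; falls back to calloc() when NUMA is not
 * available. Memory obtained through libnuma must later be released with
 * numa_free().                                                             */
static double *alloc_local_panel(size_t nvals)
{
    if (numa_available() < 0)
        return calloc(nvals, sizeof(double));

    double *p = numa_alloc_local(nvals * sizeof(double));
    if (p != NULL)
        memset(p, 0, nvals * sizeof(double));   /* first touch stays local */
    return p;
}
```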
             NUMA8             NUMA16            SMP16
Matrix     Global   Local    Global   Local    Global   Local
MATR5       437      410      527      341      162      161
AUDI        256      217      243      185      101      100
NICE20      227      204      204      168       91.40    91
INLINE        9.70     7.31    20.90    15.80     5.80     5.63
NICE25        3.28     2.62     6.28     4.99     2.07     1.97
MCHLNF        3.13     2.41     5.31     3.27     1.96     1.88
THREAD        2.48     2.16     4.38     2.17     1.18     1.15
HALTERE     134      136      103       93       48.40    47.90
Table 2.2: Influence of NUMA-aware allocation on numerical factorization of PaStiX solver (in seconds)
Table 2.2 highlights the influence of the NUMA-aware allocation on the factorization time on the different platforms. The Global columns give the times for the numerical factorization in the initial version of PaStiX with a global allocation performed during the initialization step. The Local columns give the factorization times with the new NUMA-aware allocation: the column-blocks are allocated locally by the thread which factorizes them. A gain of 5% to 15% is observed on the NUMA8 architecture on all the matrices (except for the complex test case, where there is no improvement). The gains on the NUMA16 platform are of 10% to 35%, even for the complex test case. And, as expected, the SMP16 architecture shows no meaningful improvement. Finally, results on a high performance application confirm the outcome of the benchmarks carried out previously about the importance of taking memory locality into account on NUMA architectures. This can indeed significantly improve the execution time of algorithms with potentially huge memory requirements. And, as shown by the last results, these effects increase with the size of the platforms used.
2.1.3 Communication overlap

In large distributed applications, communications are often a critical limit to scalability, and programmers try to overlap them with computations. Often, this consists in using asynchronous communications to give the MPI implementation the opportunity to transfer data over the network while the application performs computations. Unfortunately, not all implementations make asynchronous communications progress efficiently. This is especially true when messages reach the rendezvous threshold. A non-overlapped rendezvous forces a computing thread to delay the data exchange until a call to MPI_Wait or MPI_Test. To avoid this problem, we let communications progress thanks to one dedicated thread, following the tasklet mechanism used for instance in the PIOMan library implementation [138]. Moreover, in the perspective of dynamic scheduling, it is important to receive data as soon as possible to release new tasks and provide more possibilities to choose the next task to compute. The Gantt diagrams presented at the end of the section highlight these results (see Figures 2.5b and 2.5a). In the initial version, each thread manages its own communications using the asynchronous mode. The time for a communication (white arrow) is significantly decreased thanks to a dedicated thread compared to the diagram of the initial version. A substantial overlap, obtained with the dedicated communication
thread method, allows us to reach a better factorization time, as can be seen in Table 2.5 (between versions V0 and V1).
Nb.    Nb.                  AUDI                           MATR5
Node   Thread   Initial   1 ThCom   2 ThCom   Initial   1 ThCom   2 ThCom
        1        684       670       672       1120      1090      1100
        2        388       352       354        594       556       558
2       4        195       179       180        299       279       280
        8        100        91.9      92.4       158       147       147
       16         60.4      56.1      56.1       113        88.3      87.4
        1        381       353       353        596       559       568
        2        191       179       180        304       283       284
4       4        102        91.2      94.2       161       148       150
        8         55.5      48.3      54.9        98.2      81.2      87.3
       16         33.7      32.2      32.5        59.3      56.6      56
        1        195       179       183        316       290       300
        2        102        90.7      94         187       153       164
8       4         56.4      47.1      50.7        93.7      78.8     101
        8         31.6      27.6      32.4        58.4      50        58.7
       16         21.7      20.4      32.3        49.3      41.6      43.5
Table 2.3: Impact of dedicated threads for communications on numerical factorization (in seconds) on SMP16
Another interesting question is how many threads should be dedicated to communication when several network cards are available in the platform nodes. For instance, two network cards per node are available in the SMP16 cluster. We study the impact of adding one more thread dedicated to communications in the PaStiX solver (see Table 2.3). The results show that, even with a small number of computation threads, having two threads dedicated to communication is not useful. The performance is quite similar for small numbers of MPI processes but decreases significantly for 8 MPI processes. We can conclude that this MPI implementation already makes good use of the two network cards, and that adding one more thread to stress MPI is not a good solution. This result is not surprising since the article [137] highlights the poor performance of the IBM MPI implementation in multi-threaded mode compared to the single-threaded mode.
In the remainder of this section, we choose to dedicate a single thread to manage communications.
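A sketch of such a dedicated progress thread is given below (assuming the MPI library was initialized with MPI_THREAD_MULTIPLE; the request table and its locking are simplified placeholders): the thread repeatedly tests the pending requests so that rendezvous messages progress while the computing threads keep working.

```c
#include <mpi.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_REQ 128

/* Simplified sketch; the real solver also tracks which task each completed
 * receive releases.                                                        */
static MPI_Request     pending[MAX_REQ];   /* posted asynchronous requests */
static int             npending = 0;
static atomic_int      stop     = 0;
static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;

static void *comm_progress(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop)) {
        pthread_mutex_lock(&lock);
        for (int r = 0; r < npending; r++) {
            int flag;
            MPI_Test(&pending[r], &flag, MPI_STATUS_IGNORE);
            /* flag != 0: the receive completed, the matching AUB or panel
             * can be handed over to the scheduler to release new tasks    */
        }
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
```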
2.1.4 Dynamic scheduling

We now present our work on the design and implementation of a dynamic scheduler for applications which have a tree-shaped dependency graph, such as sparse direct solvers, which are based on an elimination tree (see Figure 2.3a). In the remainder of this section, Tnode will denote a node of the elimination tree, and Cnode a node of the cluster architecture used for computations. Each Tnode corresponds to a column block to factorize and needs the updates from its descendants. In the case of a right-looking version of the standard Cholesky algorithm, it is possible to compute a dynamic scheduling on such a structure. When a Tnode of the tree is computed, it sends its contributions to Tnodes higher in the tree, and one can then determine whether a Tnode is ready to be computed or not. Hence, a simple list of ready tasks is the only data structure we need to design a dynamic algorithm. However, Section 2.1.2 highlighted the problem of NUMA effects on memory allocation. The main problem here is to find a way to preserve memory affinity between the threads that look for a task and the location of the associated data during the dynamic scheduling. The advantage of such an elimination tree is that contributions stay close along the path to the root. Thus, to preserve memory affinity, we need to assign to each thread a contiguous part of the tree and include a work stealing algorithm which takes NUMA effects into account.
Finally, there are two steps in our proposed solution. The first one distributes data efficiently among
all the Cnodes of the cluster. The second one builds a NUMA-aware dynamic scheduling of the local tasks mapped on each Cnode. We will now focus on the sparse direct linear solver PaStiX, which is our target application for this work.
Sparse direct solver and static scheduling

In order to achieve efficient parallel sparse factorization, three preprocessing phases are commonly required in direct sparse solvers (see Section 1.3). In this work, we focus on the last preprocessing phase of the PaStiX solver, detailed in [74], which computes the scheduling used during the numerical factorization phase. To build a static scheduling, a proportional mapping algorithm is first applied on the processors of the target architecture. A BLAS2 and/or BLAS3 time model gives a weight to each column block in the elimination tree. Then, a recursive top-down algorithm over the tree assigns a set of candidate processors to each Tnode (see Figure 2.3a). Processors chosen to compute a column block are assigned to its sons proportionally to the cost of each son and to the computations already assigned to each candidate. It is possible to map the same processor on two different branches to balance the computations over all available resources. The last step corresponds to the data distribution over the Cnodes. A simulation of the numerical factorization is performed thanks to an additional time model for communications. Each task is thus mapped on one of its candidate processors. Beforehand, tasks are sorted by priority based on the cost of the critical path in the elimination tree. Then a greedy algorithm distributes each task onto the candidate able to compute it the soonest. We obtain, for each processor, a vector of local tasks fully ordered by priority.
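The proportional mapping recursion can be sketched as follows (illustrative only, with a fixed maximum arity and a simple workload model); note that, as in the text, the rounding of the ranges may make one processor a candidate for two sibling subtrees.

```c
#include <stddef.h>

#define MAX_SONS 8

typedef struct node {
    double       work;                  /* workload of the whole subtree   */
    int          nsons;
    struct node *sons[MAX_SONS];
    int          first_proc, last_proc; /* resulting candidate range       */
} node_t;

/* Illustrative sketch: assign the processor range [first, last] to n, then
 * split it among the sons proportionally to their subtree workloads.      */
static void proportional_mapping(node_t *n, int first, int last)
{
    n->first_proc = first;
    n->last_proc  = last;
    if (n->nsons == 0)
        return;

    double total = 0.0;
    for (int s = 0; s < n->nsons; s++)
        total += n->sons[s]->work;

    int    nprocs = last - first + 1;
    double acc    = 0.0;
    for (int s = 0; s < n->nsons; s++) {
        int lo = first + (int)(nprocs * acc / total);
        acc   += n->sons[s]->work;
        int hi = first + (int)(nprocs * acc / total) - 1;
        if (hi < lo)   hi = lo;     /* small subtree: share a processor */
        if (hi > last) hi = last;
        proportional_mapping(n->sons[s], lo, hi);
    }
}
```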
(a) First proportional mapping
(b) Second proportional mapping
Figure 2.3: Proportional mapping in elimination tree for static and dynamic scheduler.
This static scheduling gives very good results on most platforms. However we want to implement a dynamic scheduler at least as efficient as the static one, more suitable for NUMA or heterogeneous architectures. The objective is to reduce some observed idle times due to approximations in our time cost models, especially when communications have to be estimated. This is highlighted in Figure 2.5a. The other main objective is to preserve the memory affinity and locality particularly on architectures with a large number of cores. Our static scheduling can be naturally adapted since all threads are bound
on a processor and data associated with the distributed tasks are allocated close to it.
Dynamic scheduling over an elimination tree
The dynamic scheduling algorithm has to dispatch tasks over the available threads of the same Cnode, but does not re-assign tasks between Cnodes. The first step of our new algorithm is thus the same as in the static one: apply a proportional mapping of the elimination tree over the Cnodes and simulate the numerical factorization to distribute the data. The simulation is based on the same cost model with all the processors or cores available in the system. Once we have applied the first proportional mapping, each Cnode owns a set of subtrees of the initial elimination tree, as in Figure 2.3b. These subtrees are refined with a smaller block size to obtain fine-grain parallelism. Then, a second proportional mapping is applied to them based on the local number of available processors. In that case, we do not allow a thread to be a candidate for two different subtrees, so as to ensure memory affinity between the tasks assigned to each one. This provides a tree with a list of fully ordered tasks where a set of candidate threads is mapped onto each node, as shown in Figure 2.4.
Figure 2.4: Work stealing algorithm
In the following, the nodes of the tree used for the work stealing algorithm will be called Snodes. The set of candidate threads assigned to each Snode is a subset of the candidate threads assigned to the father of this Snode. This tree, denoted by T, of task queues qn (where n is an Snode) has as many leaves as threads required to factorize the matrix. At runtime, as in the static version, threads are bound to one dedicated core in the order of a breadth-first search to ensure that closer threads will be in the same sets. During the numerical factorization, the threads mainly have to compute the tasks which belong to the Snodes of their critical path in T. Thus, tasks in the queues qn associated with the leaves of T are allocated by the only thread that is allowed to compute them. Memory affinity is then preserved for the main part of the column blocks. However, we also have to allocate the data associated with the remaining tasks, in Snodes which are not leaves, that a set of threads is able to compute. We choose, in the current implementation, to use a round-robin algorithm on the set of candidates to allocate those column blocks. Data allocation is then not optimal but, if the cores are numbered correctly, this allocation reflects the physical mapping of the cores inside a Cnode. It is important to notice that two cores with successive ids are not always close in the architecture.
Once a thread t has no more jobs in its set of ready tasks qt, it steals jobs in the queues of its critical path as described by the filled red arrows in Figure 2.4. Thus, we ensure that each thread works only on a subtree of the elimination tree and on some column blocks of its critical path. This ensures good
(a) Static scheduling
(b) Dynamic scheduling using two ways of work stealing
Figure 2.5: Gantt diagrams with 4 MPI processes of 32 threads each, for a matrix with 10 million unknowns coming from CEA. Idle time is represented by black blocks and communications by white arrows. The elapsed time is reduced by 15%.
memory affinity and improves the performance in the higher levels of the tree T especially when several threads are available. However, there still remain idle times during the execution of the lower part of the tree T (mainly due to approximations of our cost models). Performance is improved by using two ways of work stealing (see Figure 2.5b). Once a thread has no more ready tasks from its critical path, it tries to steal a task in the
sons of the Snodes that belong to its critical path, as described by the dotted blue arrows in Figure 2.4.
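The two ways of stealing can be sketched as follows (an illustration under simplified assumptions: one mutex-protected queue per Snode, whereas the actual PaStiX queues are priority heaps):

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_SONS 8

/* Simplified illustration of the two-way stealing, not the PaStiX code. */
typedef struct snode {
    pthread_mutex_t lock;
    int            *tasks;            /* ready task identifiers            */
    int             ntasks;
    struct snode   *father;
    struct snode   *sons[MAX_SONS];
    int             nsons;
} snode_t;

static int pop_task(snode_t *q)
{
    int task = -1;
    pthread_mutex_lock(&q->lock);
    if (q->ntasks > 0)
        task = q->tasks[--q->ntasks];
    pthread_mutex_unlock(&q->lock);
    return task;
}

/* leaf: the Snode owned by the calling thread. */
static int next_task(snode_t *leaf)
{
    /* first way: walk up the critical path (filled arrows in Figure 2.4)  */
    for (snode_t *n = leaf; n != NULL; n = n->father) {
        int task = pop_task(n);
        if (task >= 0)
            return task;
    }
    /* second way: steal from the sons of the critical-path Snodes
     * (dotted arrows in Figure 2.4)                                       */
    for (snode_t *n = leaf->father; n != NULL; n = n->father) {
        for (int s = 0; s < n->nsons; s++) {
            int task = pop_task(n->sons[s]);
            if (task >= 0)
                return task;
        }
    }
    return -1;   /* no ready task anywhere on the path */
}
```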
2.1.5 Numerical experiments

This section summarizes the results obtained with the improved algorithms implemented in the PaStiX solver, and especially the dynamic scheduler. Table 2.4 highlights the improvements to the solver for a single Cnode. All test cases are run using eight threads for the NUMA8 platform and sixteen threads for the NUMA16 and SMP16 clusters. Column V0 (respectively V1) presents the factorization time obtained with the initial version without the NUMA-aware allocation (respectively with the NUMA-aware allocation). The third column gives the results with the NUMA-aware allocation and with the dynamic scheduler using two ways of work stealing. Firstly, we observe that results are globally improved for all the test cases and all the architectures when the dynamic scheduler is enabled. However, in a few cases, the factorization time obtained with dynamic scheduling can be less efficient than with the static one. This is mainly due to the round-robin algorithm used to allocate data in the upper levels of the elimination tree. Secondly, the results presented on the SMP16 platform show that the dynamic scheduler can improve the performance, and thus confirm that the problems observed on NUMA architectures are mainly due to weaknesses in memory placement.
             NUMA8                NUMA16               SMP16
Matrix     V0     V1     V2     V0     V1     V2     V0      V1     V2
MATR5      437    410    389    527    341    321    162     161    150
AUDI       256    217    210    243    185    176    101     100    100
NICE20     227    204    227    204    168    162     91.40   91     90.30
INLINE       9.70   7.31   7.32  20.90  15.80  14.20   5.80    5.63   5.87
NICE25       3.28   2.62   2.82   6.28   4.99   5.25   2.07    1.97   1.90
MCHLNF       3.13   2.41   2.42   5.31   3.27   2.90   1.96    1.88   1.75
THREAD       2.48   2.16   2.05   4.38   2.17   2.03   1.18    1.15   1.06
HALTERE    134    136    129    103     93     94.80  48.40   47.90  47.40
Table 2.4: Comparison of the numerical factorization times (in seconds) for three versions of the PaStiX solver. V0 is the initial version with static scheduling and without NUMA-aware allocation. V1 is the version with NUMA-aware allocation and static scheduling. V2 is the version with NUMA-aware allocation and dynamic scheduling.
Table 2.5 presents the results of the dynamic scheduler with multiple MPI processes on the NUMA8 and SMP16 platforms. All versions have the NUMA-aware allocation enabled. Once again, all test cases are run using, for each MPI process, eight threads on the NUMA8 platform and sixteen threads on the SMP16 cluster. The first version V0 does not use a dedicated thread for communications, contrary to the two other versions, and the third version V2 uses the dynamic scheduler.
                      NUMA8                               SMP16
Nb.          AUDI              MATR5              AUDI              MATR5
Node    V0    V1    V2     V0    V1    V2     V0    V1    V2     V0    V1    V2
1       217    -    210    410    -    389    100    -    100    161    -    150
2       142   111   111    212   208   200     60.4  56.1  56.8  113   88.3   87
4        69    60.5  57.7  171   121   114     33.7  32.2  32.6   59.3  56.6  54.6
8        45.3  37.2  35.6  117    82.7  78.8
Table 2.5: Comparison of the numerical factorization times (in seconds) of the PaStiX solver with several MPI processes. The three versions (V0, V1 and V2) have the NUMA-aware allocation enabled. V0 uses the initial communication model with static scheduling. V1 uses one thread dedicated to communications with static scheduling. V2 uses one thread dedicated to communications with dynamic scheduling.
As seen in Section 2.1.3, using a thread dedicated to communications improves performance, and using the dynamic scheduler further reduces the factorization time. The first improvement allows a better communication/computation overlap, and the second one allows a better reactivity to exploit incoming contributions from other MPI processes. Results are more significant with the unsymmetric
matrix MATR5, which generates more communications, than with the symmetric matrices.
The improvements are mainly due to the communication overlap, and the gain obtained with the dynamic scheduler is about 5% on the factorization time. Even if our dynamic scheduler is not perfect, it already improves the hybrid MPI+Thread version of the PaStiX solver on all the platforms. The NUMA-aware allocation implemented in the PaStiX solver gives very good results and can easily be adapted to many applications. This points out that it is important to take care of memory allocation during the initialization steps when using threads on NUMA architectures. Splitting communication and computational tasks also brings some improvements in connection with the communication/computation overlap in the PaStiX solver. The work stealing algorithm can still be improved, but the dynamic scheduler already reduces the execution time for different test cases on platforms with or without a NUMA factor.
2.2 A generic approach
Emerging processor technologies put an emphasis on increasing the number of computing units instead of increasing their working frequencies. As a direct outcome of the physical multiplexing of hardware resources, complex memory hierarchies have to be installed to reduce the memory bottleneck and ensure a decent rate of memory bandwidth for each resource. The memory becomes divided into several independent areas, capable of delivering data simultaneously through a complex and hierarchical topology, leading to the mainstream Non Uniform Memory Access (NUMA) machines we know today. Together with the availability of hardware accelerators, this trend profoundly altered the execution model of current and future platforms, progressing them towards a scale and a complexity unattained before. Furthermore, with the established integration of accelerators into modern architectures, such as GPUs or Intel Xeon Phis, high-end multi-core CPUs are consistently outperformed by these novel, more integrated, architectures both in terms of data processing rate and memory bandwidth. As a consequence, the working environment of today's application developers has evolved towards a multi-level massively parallel environment, where computation becomes cheap but data movements expensive, driving up the energy cost and the algorithmic overheads and complexities. With the advent of APIs for GPU programming, such as CUDA or OpenCL, programming accelerators has been rapidly evolving in the past years, permanently bringing accelerators into the mainstream. Hence, GPUs are becoming a more and more attractive alternative to traditional CPUs, particularly for their more interesting cost-per-flop and watts-per-flop ratios. However, the availability of a particular programming API only partially addresses the development of hybrid algorithms capable of taking advantage of all the computational resources available, including accelerators and CPUs. Extracting a satisfactory level of performance out of such entangled architectures remains a real challenge due to the lack of consistent programming models and tools to assess their performance. In order to efficiently exploit current and future architectures, algorithm developers are required to expose a large amount of parallelism, adapt their code to new architectures with different programming models, and finally, map it efficiently on the heterogeneous resources. This is a gargantuan task for most developers as they do not possess the architectural knowledge necessary to mold their algorithms onto the hardware capabilities in order to achieve good efficiency, and/or do not want to spend new efforts with every new generation of hardware. A significant amount of research has been done on dense linear algebra, but sparse linear algebra on heterogeneous systems is still a work in progress. Multiple reasons warrant this divergence, including the intrinsic algorithmic complexity and the highly irregular nature of the resulting problem, both in terms of memory accesses and computational intensities. Combined with the heterogeneous features of current and future parallel architectures, this depicts an extremely complex software development field. In this work, we advance the state of the art in supernodal solvers by migrating PaStiX towards a new programming paradigm, one with a promise of efficiently handling hybrid execution environments while abstracting the application from the hardware constraints.
Many challenges have to be overcome, going from exposing the PaStiX algorithms using a task-based programming paradigm, to delivering a level of task parallelism, granularity, and implementation allowing the runtime to efficiently schedule the resulting, highly irregular tasks, in a way that minimizes the execution span. We exposed the original algorithm using the concept of tasks, self-contained computational entities linked to the other
tasks by data dependencies. Specialized task-graph description formats were used in accordance with the underlying runtime system (PaRSEC or StarPU). We provided specialized GPU-aware versions for some of the most compute-intensive tasks, giving the runtimes the opportunity to unroll the graph of tasks on all available computing resources. The resulting software is, to the best of our knowledge, the first implementation of a sparse direct solver with a supernodal method supporting hybrid execution environments composed of multi-cores and multi-GPUs. Based on these elements, we evaluate the usability and the appeal of using a task-based runtime as a substrate for executing this particular type of algorithm, an extremely computationally challenging sparse direct solver. Furthermore, we take advantage of the integration of accelerators (GPUs in this context) with our supporting runtimes to evaluate and understand the impact of this drastically novel and portable way of writing efficient and perennial algorithms. Since the runtime systems offer a uniform programming interface, dissociated from a specific set of hardware or low-level software entities, applications can take advantage of it to ensure their portability. Moreover, the exposed graph of tasks allows the runtime system to apply specialized optimization techniques and to minimize the application's time to solution by strategically mapping the tasks onto the computing resources using state-of-the-art scheduling strategies. The rest of the section is organized as follows. We first introduce some related work in Section 2.2.1, followed by a description of the runtimes used in Section 2.2.2. Section 2.2.3 explains the implementation over the DAG schedulers, with a preliminary study over multi-core architectures, followed by details on the extension to heterogeneous architectures. All choices are supported and validated by a set of experiments on matrices with a wide range of characteristics in Section 2.2.4.
2.2.1 Related work

The dense linear algebra community has spent a great deal of effort to tackle the challenges raised by the sharp increase in the number of computational resources. Due to their heavy computational cost, most of their algorithms are relatively simple to handle. Avoiding common pitfalls such as “fork-join” parallelism, and expertly selecting the blocking factor, provide an almost straightforward way to increase the parallelism and thus achieve better performance. Moreover, due to their natural load balance, most of the algorithms can be approached hierarchically, first at the node level, and then at the computational resource level. In a shared memory context, one of the seminal papers [28] replaced the commonly used LAPACK layout with one based on tiles/blocks. Using basic operations on these tiles exposes the algorithms as a graph of tasks augmented with dependencies between them. In shared memory, this approach quickly generates a large number of ready tasks, while, in distributed memory, the dependencies allow the removal of hard synchronizations. This idea led to the design of new algorithms for various algebraic operations [29], now at the base of well-known software packages like PLASMA [1]. This idea is recurrent in almost all novel approaches surrounding the many-core revolution, spreading outside the boundaries of dense linear algebra. Looking at sparse linear algebra, the efforts were directed towards improving the behavior of existing solvers by taking into account both task and data affinity and relying on a two-level hybrid parallelization approach, mixing multithreading and message passing. Numerous solvers are now able to efficiently exploit the capabilities of these new platforms [49, 130]. New solvers have also been designed and implemented to address these new platforms. For them, the chosen approach follows the one used for dense linear algebra: fine-grained parallelism, thread-based parallelization, and advanced data management to deal with complex memory hierarchies. Examples of this kind of solver are HSL [80] and SuperLU-MT [98] for sparse LU or Cholesky factorizations, and SPQR [42] and qr_mumps [27] for sparse QR factorizations. One commonly employed approach consists in reducing the granularity of computations and avoiding “fork-join” parallelism, as the scalability of this scheme suffers from an excessive amount of synchronization. Most of the related work focuses on intra-node parallelization with a shared memory parallel programming paradigm. To be more precise, thread-based parallelization is widely used to tackle the performance issues within a computing node. These concepts were first introduced in the field of dense linear algebra computations [28], where the main idea was to replace the commonly used data layout for dense matrices with one based on tiles/blocks and to write novel algorithms suited to this new data organization; by defining a task as the execution of an elementary operation on a tile and by expressing data dependencies among these tasks in a Directed Acyclic Graph (DAG), the number of synchronizations is drastically reduced in comparison with classical approaches (such as the LAPACK or
ScaLAPACK libraries), thanks to a dynamic data-flow parallel execution model. This idea led to the design of new algorithms for various algebra operations [29, 123], now at the base of well-known software packages like PLASMA [1] and FLAME [139]. Due to their deeply hierarchical architecture, memory accesses of multi-core based platforms are not uniform. Therefore, if the data is not carefully laid out in memory and if access to it is not coherent, multithreaded software may incur heavy performance penalties. The efficiency of the algorithms discussed above can be considerably improved by accurately placing data in memory and by employing task scheduling algorithms that aim at maximizing the locality of accesses. This has only been made possible by the availability of tools such as hwloc [26] that allow the application to discover the hardware architecture of the underlying system and take advantage of it in a portable way. With the advent of accelerator-based platforms, a lot of attention has shifted towards extending the multi-core aware algorithms to fully exploit the huge potential of accelerators (mostly GPUs). The main challenges raised by these heterogeneous platforms are mostly related to task granularity and data management: although regular cores require fine granularity of data as well as computations, accelerators such as GPUs need coarse-grain tasks. This inevitably introduces the need for identifying the parts of the algorithm which are more suitable to be processed by accelerators. As for the multi-core case described above, the exploitation of this kind of platform was first considered in the context of dense linear algebra algorithms. Moreover, one constant becomes clear: the need for a portable layer that insulates the algorithms and their developers from the rapid hardware changes. Recently, this portability layer appeared under the denomination of a task-based runtime. The algorithms are described as tasks with data dependencies in-between, and the runtime systems are used to manage the tasks dynamically and schedule them on all available resources. These runtimes can be generic, like the two runtimes used in the context of this study (StarPU [21] or PaRSEC [24]), or more specialized like QUARK [147]. These efforts resulted in the design of the DPLASMA library [23] on top of PaRSEC and the adaptation of the existing FLAME library [83]. On the sparse direct methods front, preliminary work has resulted in mono-GPU implementations based on offloading parts of the computations to the GPU [56, 105, 148]. Due to its very good data locality, the multifrontal method is the main target of these approaches. The main idea is to treat some parts of the task dependency graph entirely on the GPU.
Therefore, the main originality of these efforts was in the methods and algorithms used to decide whether or not a task can be processed on a GPU. In most cases this was achieved through a threshold-based criterion on the size of the computational tasks. From the software point of view, most of these studies have only produced software prototypes, and few sparse direct solvers exploiting GPUs have been made available to users, most of them being developed by private software companies such as MatrixPro1, Acceleware2 and BCSLib-GPU3. As far as we know, there are no publications nor reports where the algorithmic choices are described. Recent progress towards a multifrontal sparse QR factorization on GPUs has been presented in [41].
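As an illustration of such a criterion (the flop count and the threshold value below are hypothetical and not taken from any of the cited solvers), a block update may be sent to the GPU only when its operation count is large enough to amortize the data transfer:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sketch: decide whether an m-by-n update computed from a
 * rank-k product is worth offloading, by comparing its flop count to a
 * tuned threshold (hypothetical value, e.g. a few hundred Mflop).        */
static bool run_on_gpu(size_t m, size_t n, size_t k, double flop_threshold)
{
    double flops = 2.0 * (double)m * (double)n * (double)k;   /* GEMM cost */
    return flops >= flop_threshold;
}
```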
2.2.2 Runtimes

In our exploratory approach towards moving to a generic scheduler for PaStiX, we considered two different runtimes: StarPU and PaRSEC. Both runtimes have proven mature enough in the context of dense linear algebra, while providing two orthogonal approaches to task-based systems. The PaRSEC [24] distributed runtime system is a generic data-flow engine supporting a task-based implementation targeting hybrid systems. Domain specific languages are available to expose a user-friendly interface to developers and allow them to describe their algorithms using high-level concepts.
1 http://www.matrixprosoftware.com/
2 http://www.acceleware.com/matrix-solvers
3 http://www.boeing.com/phantom/bcslib/
This programming paradigm constructs an abridged representation of the tasks and their dependencies as a graph of tasks – a structure agnostic to algorithmic subtleties, where all intrinsic knowledge about the complexity of the underlying algorithm is extricated, and the only constraints remaining are the annotated dependencies between tasks [39]. This symbolic representation, augmented with a specific data distribution, is then mapped onto a particular execution environment. The runtime system supports the use of different types of accelerators, GPUs and Intel Xeon Phi, in addition to distributed multi-core processors. Data are transferred between computational resources based on coherence protocols and computational needs, with emphasis on minimizing unnecessary transfers. The resulting tasks are dynamically scheduled on the available resources following a data-reuse policy mixed with different criteria for adaptive scheduling. The entire runtime system targets very fine grain tasks (an order of magnitude under ten microseconds), with a flexible scheduling and adaptive policies to mitigate the effect of system noise and to take advantage of the algorithmic-inherent parallelism to minimize the execution span. The experiments presented in this work take advantage of a specialized domain specific language of PaRSEC, designed for affine-loop-based programming [23]. This specialized interface allows for a drastic reduction of the memory used by the runtime, as tasks do not exist until they are ready to be executed, and the concise representation of the task graph allows for an easy and stateless exploration of the graph. In exchange for the memory saving, generating a task requires some extra computation, which lies on the critical path of the algorithm. There is then no need for a window of visible tasks: the runtime can explore the graph dynamically based on the ongoing state of the execution. StarPU [21] is a runtime system aiming to allow programmers to exploit the computing power of clusters of hybrid systems composed of CPUs and various accelerators (GPUs, Intel Xeon Phi, etc.) while relieving them from the need to specially adapt their programs to the target machine and processing units. The StarPU runtime supports a task-based programming model, where applications submit computational tasks, with CPU and/or accelerator implementations, and StarPU schedules these tasks and the associated data transfers on the available CPUs and accelerators. The data that a task manipulates is automatically transferred among accelerators and the main memory in an optimized way (minimized data transfers, data prefetch, communication overlapped with computations, etc.), so that programmers are relieved of the scheduling issues and technical details associated with these transfers. StarPU takes particular care of scheduling tasks efficiently, by establishing performance models of the tasks through on-line measurements, and then using well-known scheduling algorithms from the literature. In addition, it allows scheduling experts, such as compilers or computational library developers, to implement custom scheduling policies in a portable fashion. The differences between the two runtimes can be classified into two groups: conceptual and practical differences. At the conceptual level, the main differences between PaRSEC and StarPU are the task submission process, the centralized scheduling, and the data movement strategy.
PaRSEC uses its own parameterized language to describe the DAG, in contrast with the simple sequential submission loops typically used with StarPU. As a consequence, StarPU relies on a centralized strategy that analyzes, at runtime, the dependencies between tasks and schedules these tasks on the available resources. On the contrary, through compile-time information, each computational unit of PaRSEC immediately releases the dependencies of the completed task, solely using its local knowledge of the DAG. Finally, while PaRSEC uses an opportunistic approach, the StarPU scheduling strategy exploits cost models of the computation and data movements to schedule tasks onto the right resource (CPU or GPU) in order to minimize the overall execution time. However, it does not have a data-reuse policy on CPU shared-memory systems, resulting in a lower efficiency when no GPUs are used, compared to the data-reuse heuristic of PaRSEC. At the practical level, PaRSEC supports multiple streams to manage the CUDA devices, allowing partial overlap between computing tasks and maximizing the occupancy of the GPU. On the other hand, StarPU allows data transfers directly between GPUs without going through central memory, potentially increasing the bandwidth of data transfers when data is needed by multiple GPUs.
JDF representation of a DAG
In PaRSEC, the data distribution and dependencies are specified using the Job Data Flow (JDF) format. Fig. 2.6 shows our JDF representation of the sparse Cholesky factorization using the tasks panel and gemm. The task gemm is based only on the block id parameter and computes internally the supernode (fcblk) in which to apply the update. On Line 2 of panel(j), cblknbr is the number of block columns in the Cholesky factor. Once the j-th panel is factorized, the trailing submatrix can be updated using

1. panel(j) [high_priority = on]
2.   j = 0 .. cblknbr-1
3.   ... set up parameters for the j-th task ...
4.   :A(j)
5.   RW A ← ( leaf ) ? A(j) : C gemm( lastbrow )
6.        → A gemm(firstblock+1..lastblock)
7.        → A(j)

1. gemm(k)
2.   k = 0 .. blocknbr-1
3.   ... set up parameters for the k-th task ...
4.   :A(fcblk)
5.   READ A ← diag ? A(fcblk) : A panel(cblk)
6.   RW C ← first ? A(fcblk) : C gemm( prev )
7.        → diag ? A(fcblk)
8.        → ((!diag) && (next == 0)) ? A panel( fcblk )
9.        → ((!diag) && (next != 0)) ? C gemm( next )
(a) Panel factorization (b) Trailing submatrix update
Figure 2.6: JDF representation of Cholesky.
the j-th panel. This data dependency of the submatrix update on the panel factorization is specified on Line 6, where firstblock is the block index of the j-th diagonal block, and lastblock is the block index of the last block in the j-th block column. The output dependency on Line 7 indicates that the j-th panel is written to memory at the completion of the panel factorization. The input dependency of the j-th panel factorization is specified on Lines 4 and 5, where leaf is true if the j-th panel is a leaf of the elimination tree, and lastbrow is the index of the last block updating the j-th panel. Hence, if the j-th panel is a leaf, the panel is read from memory. Otherwise, the panel is passed in as the output of the last update on the panel. Similarly, gemm(k) updates the fcblk-th block column using the k-th block, where fcblk is the index of the block row that the k-th block belongs to, and blocknbr on Line 2 is the number of blocks in the Cholesky factor. The input dependencies of gemm are specified on Lines 4 through 6, where the cblk-th panel A is being used to update the fcblk-th column C. Specifically, on these lines, diag is true if the k-th block is a diagonal block, and false otherwise, while prev is false if the k-th block is the first block in the fcblk-th block row, and is the index of the block just before the k-th block in that block row otherwise. Hence, the prev-th block updates the fcblk-th column just before the k-th block does, and the data dependency of gemm(k) is resolved once the cblk-th panel has been factorized and the fcblk-th column has been updated using the prev-th block. Notice that the diagonal blocks are not used to update the trailing submatrix, but they are included in the code to provide a continuous execution space for the tasks, as required by PaRSEC. Finally, Lines 7 through 9 specify the output dependencies of gemm(k), where next is false if the k-th block is a diagonal block, and is the index of the next block after the k-th block in the fcblk-th row otherwise. Hence, the completion of gemm(k) resolves the data dependency of the fcblk-th panel factorization if this is the last update on the panel, or it resolves the dependency of updating the fcblk-th block column with the next-th block otherwise.
StarPU task submission

StarPU builds its DAG from the task-submission order provided by the user and from the data accessed by each task. The pseudocode presented in Algorithm 4 shows the StarPU task-submission loop for the Cholesky decomposition. The tasks are submitted following the sequential algorithm, and StarPU uses the insertion order to discover the dependencies among tasks accessing the same data.
Algorithm 4: StarPU task insertion algorithm
forall Supernode S1 do
    submit_panel(S1) ;                          /* update of the panel           */
    forall extra-diagonal block Bi of S1 do
        S2 ← supernode_in_front_of(Bi) ;
        submit_gemm(S1, S2) ;                   /* S2 = S2 − B_{k, k≥i} × Bi^T    */
Whereas PaRSEC requires the update tasks to be chained (Fig. 2.7a), more parallelism can be exposed with StarPU (Fig. 2.7b) by using the STARPU_COMMUTE option, which allows two tasks targeting the same data to be executed in any order.
Figure 2.7: Task graph: (a) non-commutable GEMMs, (b) commutable GEMMs.
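To make the submission loop of Algorithm 4 more concrete, the following sketch shows how such tasks could be inserted through the StarPU C API. It is only a minimal illustration: the codelets panel_cl and gemm_cl, the data handles, and the supernode description are hypothetical placeholders, not the actual PaStiX implementation.

    #include <starpu.h>

    /* Hypothetical codelets for the panel factorization and the sparse GEMM
     * update; the kernels themselves are assumed to be defined elsewhere.    */
    extern struct starpu_codelet panel_cl, gemm_cl;

    /* Hypothetical description of a supernode: its off-diagonal blocks and
     * the supernode each of them faces in the trailing submatrix.            */
    typedef struct {
        int  nblocks;       /* number of off-diagonal blocks in the panel     */
        int *facing_cblk;   /* facing_cblk[b] = supernode updated by block b  */
    } supernode_t;

    void submit_cholesky(starpu_data_handle_t *panel, supernode_t *snode, int cblknbr)
    {
        for (int s1 = 0; s1 < cblknbr; s1++) {
            /* panel(s1): factorize the diagonal block and solve the
             * off-diagonal blocks of the panel.                              */
            starpu_task_insert(&panel_cl, STARPU_RW, panel[s1], 0);

            /* gemm: one update task per supernode faced by a block of s1.    */
            for (int b = 0; b < snode[s1].nblocks; b++) {
                int s2 = snode[s1].facing_cblk[b];
                starpu_task_insert(&gemm_cl,
                                   STARPU_R, panel[s1],
                                   /* COMMUTE: updates of s2 may run in any order */
                                   STARPU_RW | STARPU_COMMUTE, panel[s2],
                                   STARPU_VALUE, &b, sizeof(b),
                                   0);
            }
        }
    }

With this insertion order, StarPU infers the dependencies of the task graph of Fig. 2.7; removing the STARPU_COMMUTE flag falls back to the serialized chain of updates of Fig. 2.7a.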
2.2.3 Supernodal factorization over DAG schedulers

Similarly to dense linear algebra, sparse direct factorization relies on three types of operations: the factorization of the diagonal block (POTRF), the solve applied to the off-diagonal blocks of the same panel (TRSM), and the updates of the trailing panels (GEMM). Whereas the task dependency graph of a dense Cholesky factorization [29] is extremely regular, the DAG describing the supernodal method contains rather small tasks of variable granularity and less uniform ranges of execution space. This lack of uniformity makes the DAG resulting from a sparse supernodal factorization complex (cf. Fig. 2.9) and makes it difficult to schedule the resulting tasks efficiently on homogeneous and heterogeneous computing resources. The current scheduling scheme of PaStiX exploits a 1D block distribution, where a task gathers a set of operations: the factorization of one panel (POTRF and TRSM) and all the updates generated by this factorization. However, increasing the granularity of a task in such a way limits the potential parallelism and increasingly bounds the efficiency of the algorithm when using many-core architectures. To improve the efficiency of the sparse factorization on a multi-core implementation, we introduced a way of controlling the granularity of the BLAS operations. This functionality dynamically splits update tasks so that the critical path of the algorithm can be reduced. In this work, for both the PaRSEC and StarPU runtimes, we split the PaStiX tasks into two subsets of tasks:
• the factorization of the diagonal block and the update of the off-diagonal blocks, performed on one panel;

• the updates from the off-diagonal blocks of that panel to one other panel of the trailing submatrix.
Hence, the number of tasks is bounded by the number of blocks in the symbolic structure of the factorized matrix. Moreover, when taking into account heterogeneous architectures in the experiments, a finer control of the granularity of the computational tasks is needed. Benchmarks of dense linear algebra kernels, such as those described in [140], show that efficiency can be obtained on GPU devices only with relatively large blocks, and a limited number of such blocks can be found in a supernodal factorization, only near the top of the elimination tree. Similarly, the amalgamation algorithm [77], reused from the implementation of an incomplete factorization, is a crucial step to obtain larger supernodes and efficiency on GPU devices. The default parameter for amalgamation has been slightly increased to allow up to 12% more fill-in, so as to build larger blocks while maintaining a decent level of parallelism. In the remainder of this section, we present the extensions to the solver to support heterogeneous many-core architectures. These extensions were validated through experiments conducted on Mirage nodes from the PlaFRIM cluster at INRIA Bordeaux. A Mirage node is equipped with two hexa-core Westmere Xeon X5650 (2.67 GHz), 32 GB of memory and 3 Tesla M2070 GPUs. PaStiX was built without MPI support using GCC 4.6.3, CUDA 4.2, Intel MKL 10.2.7.041, and Scotch 5.1.12b. Experiments were performed on a set of nine matrices, all part of the University of Florida sparse matrix collection [43], that are described in Table 2.6.

Figure 2.8: Scaling study: GFlop/s performance of the factorization step on a set of nine matrices with the three schedulers.

Figure 2.9: DAG representation of a sparse LDLT factorization.
These matrices represent different research fields and exhibit a wide range of properties (size, arithmetic, symmetry, definiteness, etc.). The last column reports the number of floating-point operations (Flop) required to factorize those matrices, which is used to compute the performance results shown in this section.
Matrix     Prec  Method  Size    nnzA    nnzL      TFlop
Afshell10  D     LU      1.5e+6  27e+6    610e+6    0.12
FilterV2   Z     LU      0.6e+6  12e+6    536e+6    3.6
Flan       D     LLT     1.6e+6  59e+6   1712e+6    5.3
Audi       D     LLT     0.9e+6  39e+6   1325e+6    6.5
MHD        D     LU      0.5e+6  24e+6   1133e+6    6.6
Geo1438    D     LLT     1.4e+6  32e+6   2768e+6   23
PmlDF      Z     LDLT    1.0e+6   8e+6   1105e+6   28
Hook       D     LU      1.5e+6  31e+6   4168e+6   35
Serena     D     LDLT    1.4e+6  32e+6   3365e+6   47
Table 2.6: Matrix description (Z: double complex, D: double).
Multi-core architectures
As mentioned earlier, the PaStiX solver has already been optimized for distributed clusters of NUMA nodes. We use the current state-of-the-art PaStiX scheduler as a basis and compare the results obtained using the StarPU and PaRSEC runtimes from there. In this section, before studying the solver on heterogeneous platforms, we study the impact of using generic runtimes on multi-core architectures compared to the internal PaStiX schedulers. Thus, we first compared the original light-weight and finely tuned static scheduler of PaStiX against StarPU and PaRSEC. Figure 2.8 reports the results of a strong scaling experiment, where the number of computing resources varies from 1 to 12 cores and where each group represents a particular matrix. Empty bars correspond to the original PaStiX scheduler, shaded bars correspond to StarPU, and filled bars correspond to PaRSEC. The figure reports performance in GFlop/s, so a higher value on the Y-axis represents a more efficient implementation. Overall, this experiment shows that on a shared-memory architecture the performance obtained with any of the above-mentioned approaches is comparable, the differences remaining minimal on the target architecture. We can also see that, in most cases, the PaRSEC implementation is more efficient than StarPU, especially when the number of cores increases. StarPU shows an overhead on the multi-core experiments that we attribute to its lack of a cache-reuse policy compared to PaRSEC and the internal PaStiX scheduler. A careful observation highlights the fact that both runtimes obtain lower performance than PaStiX for the LDLT factorization of the PmlDF and Serena matrices. Thanks to its single-task-per-node scheme, PaStiX stores the DL^T matrix in a temporary buffer, which allows the update kernels to call a simple GEMM operation. On the contrary, both the StarPU and PaRSEC implementations use a less efficient kernel that performs the full LDLT operation at each update. Indeed, with the extended set of tasks, the longer life span of the temporary buffer could cause a large memory overhead. In conclusion, using these generic runtimes provides performance and scalability similar to the internal PaStiX solution on the majority of test cases, while offering a suitable level of performance and a desirable portability, allowing for a smooth transition towards more complex heterogeneous architectures. As PaStiX was already using a task-based algorithm, implementing the sparse Cholesky decomposition on top of the two generic runtimes barely took a month. However, an iterative process between the linear-solver development team and the runtime teams was required to reach a good scalability in both the multi-core and heterogeneous contexts. Indeed, the study of the sparse linear algebra solver on top of the runtimes highlighted a few defects of the runtime systems. As the improvements resulting from this feedback are not specific to sparse linear algebra, other applications developed on top of the two runtime systems will benefit from them.
Heterogeneous architectures implementation

While obtaining an efficient implementation was one of the goals of this experiment, it was not the major one. The ultimate goal was to develop a portable software environment allowing for an easy transition to accelerators: a software platform where the code is factorized as much as possible, and where the human cost of adapting the sparse solver to current and future hierarchical, complex, heterogeneous architectures remains consistently low. Building upon the efficient supernodal implementation on top of DAG-based runtimes, we can more easily exploit heterogeneous architectures. The GEMM updates are the most compute-intensive part of the matrix factorization, and it is important that these tasks be offloaded to the GPU. We decided not to offload to the GPU the tasks that factorize and update the panel, due to their limited computational load, which is in direct relationship with the small width of the panels. It is common in dense linear algebra to use the accelerators for the update part of a factorization while the CPUs factorize the panel, so from this perspective our approach is conventional. Such an approach, combined with look-ahead techniques, gives very good performance for a low programming effort on the accelerators [145]. The same solution is applied in this study, since the panels are split during the analysis step to fit the classic look-ahead parameters. The update is the most compute-intensive task during a factorization. Therefore, generally speaking, it is paramount to obtain good efficiency on the update operation in order to ensure a reasonable level of performance for the entire factorization. Due to the massively parallel architecture of the GPUs and to the extra cost of moving data back and forth between the main memory and the GPUs, it is of the greatest importance to maintain this property on the GPU.
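As an illustration of this CPU/GPU split, a StarPU codelet can expose both a CPU and a CUDA implementation for the update task, while the panel task exposes a CPU implementation only; the scheduler is then free to offload updates to an accelerator. This is a minimal sketch with hypothetical kernel entry points, not the actual PaStiX codelets.

    #include <starpu.h>

    extern void panel_cpu_kernel(void *buffers[], void *cl_arg);  /* POTRF + TRSM on one panel   */
    extern void gemm_cpu_kernel (void *buffers[], void *cl_arg);  /* sparse update, CPU version  */
    extern void gemm_cuda_kernel(void *buffers[], void *cl_arg);  /* sparse update, CUDA version */

    /* The panel task is restricted to CPU workers: its computational load is
     * too small to amortize an offload to the accelerator.                    */
    struct starpu_codelet panel_cl = {
        .cpu_funcs = { panel_cpu_kernel },
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    /* The update task provides both implementations, so the scheduler may
     * offload large GEMMs to a GPU and keep small ones on the CPU cores.      */
    struct starpu_codelet gemm_cl = {
        .cpu_funcs  = { gemm_cpu_kernel },
        .cuda_funcs = { gemm_cuda_kernel },
        .nbuffers   = 2,
        .modes      = { STARPU_R, STARPU_RW },
    };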
Figure 2.10: Multi-stream performance comparison on the DGEMM kernel for three implementations: cuBLAS library, ASTRA framework, and the sparse adaptation of the ASTRA framework.
The update task used in the PaStiX solver groups together all the outer products that are applied to a same panel. On the CPU side, this GEMM operation is split in two steps because of the gaps in the destination panel: the outer product is computed in a contiguous temporary buffer and, upon completion, the result is dispatched onto the destination panel. This solution was chosen to exploit the performance of vendor-provided BLAS libraries, in exchange for a constant memory overhead per working thread.
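The following sketch illustrates this two-step CPU update under simplifying assumptions: the destination panel is described by a hypothetical list of row blocks, and a thread-private contiguous workspace tmp of size M-by-N is available. It is not the actual PaStiX kernel.

    #include <cblas.h>

    typedef struct {
        int first_row;    /* first row of the block inside the destination panel */
        int nrows;        /* number of rows of the block                         */
    } blockdesc_t;

    /* panel -= A * B^T, where A is M-by-K, B is N-by-K, and the M rows of the
     * product are scattered over the blocks of the destination panel.          */
    void cpu_sparse_update(int M, int N, int K,
                           const double *A, const double *B,
                           double *panel, int panel_ld, int panel_first_row,
                           const blockdesc_t *blocks, int nblocks, double *tmp)
    {
        /* Step 1: dense outer product into the contiguous temporary buffer.   */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                    M, N, K, 1.0, A, M, B, N, 0.0, tmp, M);

        /* Step 2: dispatch each row block at its position in the panel.       */
        int src_row = 0;
        for (int b = 0; b < nblocks; b++) {
            int dst_row = blocks[b].first_row - panel_first_row;
            for (int j = 0; j < N; j++)
                for (int i = 0; i < blocks[b].nrows; i++)
                    panel[dst_row + i + j * panel_ld] -= tmp[src_row + i + j * M];
            src_row += blocks[b].nrows;
        }
    }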
For the GPU implementation, the requirements for an efficient kernel are different. First, a GPU has significantly less memory than what is available to a traditional processor, usually in the range of 3 to 6 GB. This forces us to carefully restrict the amount of extra memory needed during the update, making the temporary buffer used in the CPU version unsuitable. Second, the uneven nature of sparse irregular matrices might limit the number of active computing units per task. As a result, only a fraction of the available warps on the GPU might be active, leading to a deficient occupancy. Thus, we need the capability to submit multiple concurrent updates in order to give the GPU driver the opportunity to interleave warps from different tasks, increasing the occupancy and thus the overall efficiency. Many CUDA implementations of the dense GEMM kernel are available to the scientific community. The most widespread implementation is provided by Nvidia itself in the cuBLAS library [114]. This implementation is extremely efficient, since CUDA 4.2 allows calls on multiple streams, but it is not open source. Volkov developed an implementation in real single precision for the first generation of CUDA-enabled devices [140]. In [136], the authors propose an assembly code of the DGEMM kernel that provides a 20% improvement over the cuBLAS 3.2 implementation. The MAGMA library proposed a first implementation of the DGEMM kernel [113] for the Nvidia Fermi GPUs. Later, an auto-tuned framework, called ASTRA, was presented in [94] and included in the MAGMA library. This implementation, similar to the ATLAS library for CPUs, is a highly configurable skeleton with a set of scripts to tune the parameters for each precision. As our update operation is applied on a sparse representation of the panel and matrices, we cannot exploit an efficient vendor-provided GEMM kernel. We need to develop our own, starting from a dense version and altering the algorithm to fit our needs. Due to the availability of the source code, the coverage of the four floating-point precisions, and its tuning capabilities, we decided to use the ASTRA-based version for our sparse implementation. As explained in [94], the matrix-matrix operation is performed in two steps in this kernel: each block of threads computes the outer product tmp = AB into the GPU shared memory, and then the addition C = βC + αtmp is performed. To be able to accumulate the result of the update from one panel to another directly into C, we extended the kernel so that it is aware of the structure of each panel. This allows the kernel to compute the correct position directly in C during the addition step. This introduces a loss of memory coalescence and degrades the addition step; however, it avoids the need for an extra buffer on the GPU for each offloaded kernel. One problem with the parameters selected by the MAGMA auto-tuning for the ASTRA kernel is that the best performance for the update kernel was obtained by using textures. The functions cudaBindTexture and cudaUnbindTexture are not compatible with concurrent kernel calls on different streams. Therefore, the textures have been disabled in our kernel, reducing its performance by about 5% on large square matrices. Figure 2.10 shows the study we made on the GEMM kernel and the impact of the modifications we made to the ASTRA kernel. These experiments were done on a single GPU of the Mirage cluster; they consist of computing a matrix-matrix multiplication representative of what is typically encountered during a sparse factorization.
Each point is the average performance of 100 calls to the kernel that computes C = C − AB^T, with A, B, and C matrices of dimension M-by-K, N-by-K, and M-by-N, respectively. B is taken as the first block of K rows of A, as is the case in a Cholesky factorization. The plain lines give the performance of the cuBLAS library with 1, 2, and 3 streams. The black line represents the peak performance obtained by the cuBLAS library on square matrices; this peak is never reached in the particular configuration studied here. The dashed lines give the performance of the ASTRA library in the same configuration. We observe that this implementation already loses 50 GFlop/s, around 15%, against the cuBLAS library, which might be caused by the parameters chosen by the auto-tuning framework, which was run only on square matrices. Finally, the dotted lines illustrate the performance of the ASTRA kernel modified to include the gaps in the C matrix. For these experiments, C is a panel twice as tall as A, in which blocks are randomly generated with an average size of 200 rows. Blocks in A are also randomly generated, with the constraint that the row interval of a block of A is included in the row interval of one block of C, and that no two blocks of A overlap. We observe a direct relationship between the height of the panel and the performance of the kernel: the taller the panel, the lower the performance. The memory loaded for the outer product is the same as for the ASTRA curves, but the memory loaded for the C matrix grows twice as fast without increasing the number of Flop to perform; the Flop-per-memory-access ratio drops, which explains the decreasing performance. However, as the factorization progresses and moves up the elimination tree, supernodes get larger and the actual number of blocks encountered is smaller than the one used in this experiment, which illustrates worst cases. Regardless of the kernel choice, it is important to notice how multiple streams can have a large impact on the average performance of the kernel. For this comparison, the 100 calls made in the experiments are distributed in a round-robin manner over the available streams. One stream always gives the worst performance. Adding a second stream increases the performance of all implementations, especially for small cases where matrices are too small to feed all the resources of the GPU. A third stream is an improvement for matrices with M smaller than 1000, and gives performance similar to two streams above 1000. This kernel is the one we provide to both runtimes to offload computations to the GPUs in the case of Cholesky and LU factorizations. An extension of the kernel handles the LDLT factorization: it takes the diagonal matrix D as an extra parameter and computes C = C − LDL^T. This modified version decreases the performance by 5%.
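As a hint of how such a multi-stream experiment can be driven from the host with cuBLAS, the sketch below distributes repeated DGEMM calls over several CUDA streams in a round-robin fashion. Device allocation, error checking and the sparse scatter step are omitted; the dimensions follow the experiment above, and the function is only an illustrative benchmark driver, not the kernel used in PaStiX.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void bench_multistream(int M, int N, int K, int nstreams, int ncalls,
                           const double *dA, const double *dB, double *dC)
    {
        cublasHandle_t handle;
        cudaStream_t   streams[8];              /* nstreams <= 8 assumed       */
        const double   alpha = -1.0, beta = 1.0;

        cublasCreate(&handle);
        for (int s = 0; s < nstreams; s++)
            cudaStreamCreate(&streams[s]);

        for (int call = 0; call < ncalls; call++) {
            /* Round-robin selection of the stream receiving this kernel.     */
            cublasSetStream(handle, streams[call % nstreams]);
            /* C = C - A * B^T, with A (M-by-K), B (N-by-K) and C (M-by-N).    */
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                        M, N, K, &alpha, dA, M, dB, N, &beta, dC, M);
        }

        cudaDeviceSynchronize();                /* wait for all streams        */
        for (int s = 0; s < nstreams; s++)
            cudaStreamDestroy(streams[s]);
        cublasDestroy(handle);
    }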
Data mapping over multiple GPUs

We noticed that both runtimes had difficulties in computing the mapping of the GEMM tasks onto the GPUs. The irregularity of the tasks (size of the GEMM block, Flop count of the operation) complicates the calibration of the prediction model used by StarPU to determine which processing unit fits a task best. The dynamic task mapping of PaRSEC could also result in a largely unbalanced workload: in PaRSEC, the mapping of a panel onto a given GPU enforces the mapping of all other tasks applied to the same panel and, for the same reason as in StarPU, the irregularity of the tasks makes it difficult for the simple model used in the dense case to correctly estimate the actual load of each GPU. In order to help the runtime, the panels that will be updated on GPUs are selected at analysis time, such that the amount of data updated on a GPU does not exceed the memory of that GPU. This limit reduces the data transfers by keeping data on the GPUs. The panels must be sorted according to a criterion to decide which ones will be mapped onto a GPU. Several sorting criteria were tested:

• the size of the target panel: the larger a panel is, the more chance it has to receive updates;

• the number of updates performed on the panel, which corresponds to the number of GEMMs applied to it;

• the Flop produced by these updates: not only the number of updates matters, but the larger the updates are, the more Flop will be performed;

• the priority statically computed by PaStiX: the higher the priority, the sooner the result is required, so that the accelerators can help to provide it rapidly.

Panels are marked to be updated on a specific GPU by a greedy algorithm that, at each step, assigns the first panel according to the selected criterion to the least-loaded GPU. A check guarantees that we do not exceed the memory capacity of the GPU, to avoid an excessive use of the runtime's LRU eviction mechanism.
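A minimal sketch of this greedy assignment is given below; it assumes the panels are already sorted by the selected criterion and that each panel carries an estimated workload and memory footprint. The data structures are hypothetical and simplified compared to the actual implementation.

    #include <stddef.h>

    typedef struct {
        double load;    /* estimated Flop of the updates applied to the panel */
        size_t mem;     /* memory footprint of the panel                      */
        int    gpu;     /* assigned GPU, or -1 if the panel stays on the CPUs */
    } panel_info_t;

    /* panels[] is sorted by decreasing value of the selected criterion.      */
    void map_panels_to_gpus(panel_info_t *panels, int npanels,
                            size_t *gpu_mem_left, double *gpu_load, int ngpus)
    {
        for (int p = 0; p < npanels; p++) {
            /* Pick the least-loaded GPU.                                     */
            int best = 0;
            for (int g = 1; g < ngpus; g++)
                if (gpu_load[g] < gpu_load[best])
                    best = g;

            /* Assign the panel only if it fits in the remaining device memory;
             * otherwise it is kept on the CPU side.                          */
            if (panels[p].mem <= gpu_mem_left[best]) {
                panels[p].gpu       = best;
                gpu_load[best]     += panels[p].load;
                gpu_mem_left[best] -= panels[p].mem;
            } else {
                panels[p].gpu = -1;
            }
        }
    }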
2.2.4 Heterogeneous experiments

Figure 2.11 presents the performance obtained on our set of matrices on the Mirage platform by enabling the GPUs in addition to all available cores. The PaStiX run is shown as a reference. StarPU runs are empty bars, PaRSEC runs with 1 stream are shaded, and PaRSEC runs with 3 streams are fully colored. This experiment shows that we can efficiently use the additional computational power provided by the GPUs through the generic runtimes. In its current implementation, StarPU has either GPU or CPU worker threads, and a GPU worker executes only GPU tasks; hence, when a GPU is used, a CPU worker is removed. With PaRSEC, no thread is dedicated to a GPU, and all of them may execute CPU tasks as well as GPU tasks: the first computational thread that submits a GPU task takes over the management of the GPU until no GPU work remains in the pipeline. Both runtimes manage to obtain similar performance and a satisfying scalability over the 3 GPUs. In only two cases, MHD and pmlDF, does StarPU outperform the PaRSEC results with 3 streams. This experiment also reveals that, as expected, the computation takes advantage of the multiple streams available through PaRSEC: the tasks generated by a sparse factorization are rather small and do not occupy the entire GPU, and this PaRSEC feature compensates for the prefetch strategy of StarPU, which gave StarPU the advantage in the one-stream comparison. One can notice the poor performance obtained on the afshell test case: in this case, the amount of Flop produced is too small to benefit efficiently from the GPUs.

Figure 2.11: GPU scaling study: GFlop/s performance of the factorization step with the three schedulers on a set of 10 matrices; experiments exploit twelve CPU cores and from zero to three additional GPUs.

Memory is a critical resource for a direct solver. Figure 2.12 compares the memory peaks obtained with the three implementations of the solver, measured with the EZTrace tool. The runs were obtained with 12 cores, but the results would not be much different with another setup. The allocated memory can be separated into five categories:
• the coefficients of the factorized matrix, which are allocated at the beginning of the computation and represent the largest part of the memory;

• the structure of the factorized matrix, also built and allocated before the factorization is performed;

• the user's CSC (Compressed Sparse Column) matrix, which is the input given to PaStiX;

• the internal block-distributed CSC matrix, which corresponds to the input matrix and is used to compute the relative error or to perform iterative refinement;

• a last part of memory that includes the scheduler overhead.
As shown in the plot, a large part of the memory corresponds to the first four categories and is independent of the runtime used. The last part of each bar corresponds to the overhead of the scheduler. The values on top of the bars are the overhead ratios relative to the memory overhead obtained with the native PaStiX scheduler. We can see that we obtain a small overhead with PaRSEC, whereas StarPU allocates about 7% more memory than PaStiX.
Figure 2.12: Memory consumption; the common data structures are detailed, and the overhead ratio is shown on top of each bar.
Chapter 3
Incomplete factorization and domain decomposition
The first part of this chapter presents the joint work with Pascal Hénon to implement an ILU(k) algorithm in the PaStiX solver [77]. Please look at http://www.labri.fr/perso/ramet/restricted/HDR_PC08.pdf for a full version of this work with some experiments.
The second part of this chapter presents work that has been developed in the PhD thesis of Astrid Casadei [31] (in french) that I have co-advised. Some contributions have been published in [32]. Please look at http://www.labri.fr/perso/ramet/restricted/HDR_HIPC14.pdf for a full version of this work with some experiments.
3.1 Amalgamation algorithm for iLU(k) factorization
Solving large sparse linear systems by iterative methods [129] has often been unsatisfactory when dealing with practical “industrial” problems. The main difficulty encountered by such methods is their lack of robustness and, generally, the unpredictability and inconsistency of their performance when they are used over a wide range of different problems: some methods can work quite well for certain types of problems but fail on others. This has delayed their acceptance as “general-purpose” solvers in a number of important applications. Meanwhile, significant progress has been made in developing parallel direct methods for solving sparse linear systems, due in particular to advances in the combinatorial analysis of the Gaussian elimination process and in the design of parallel block solvers optimized for high-performance computers. For example, it is now possible to solve real-life three-dimensional problems with several million unknowns very effectively with sparse direct solvers. This is achieved by combining state-of-the-art algorithms with careful implementations that exploit the superscalar capabilities of the processors and other features of modern architectures [11, 64, 75, 76, 99]. However, direct methods still fail to solve very large three-dimensional problems, due to the potentially huge memory requirements of these cases. On the other hand, iterative methods using a generic preconditioner like an ILU(k) factorization [129] require less memory, but they are often unsatisfactory when the simulation needs a solution with a good precision or when the systems are ill-conditioned. The incomplete factorization technique usually relies on a scalar implementation and thus does not benefit from the superscalar effects provided by modern high-performance architectures. Furthermore, these methods are difficult to parallelize efficiently, particularly for high values of the level-of-fill. Some improvements to the classical scalar incomplete factorization have been studied to reduce the gap between the two classes of methods. In the context of domain decomposition, algorithms that can be parallelized in an efficient way have been investigated in [110]. In [124], the authors proposed to couple incomplete factorization with a selective inversion to replace the triangular solutions (that are not as scalable as the factorization) by scalable matrix-vector multiplications. The multifrontal method has also been adapted for incomplete
factorization, with threshold dropping in [89] or with a fill-level dropping that measures the importance of an entry in terms of its updates [30]. In [34], the authors proposed a block ILU factorization technique for block tridiagonal matrices. The approach investigated in this work consists in exploiting the parallel blockwise algorithmic approach used in the framework of high-performance sparse direct solvers in order to develop robust parallel incomplete-factorization-based preconditioners [129] for iterative solvers. The idea is to define an adaptive blockwise incomplete factorization that is much more efficient than the scalar incomplete factorizations commonly used to precondition iterative solvers. Indeed, by using the same ingredients that make direct solvers effective, these incomplete factorizations exploit the latest advances in sparse direct methods and can be very competitive in terms of CPU time, thanks to an effective usage of the CPU power. At the same time, this approach can be far more economical in terms of memory usage than direct solvers. Therefore, it should allow us to solve systems of much larger dimensions than those that can be solved by direct solvers. The section is organized as follows: Section 3.1.1 gives the principles of the block ILU factorization based on the level of fill, Section 3.1.2 presents our algorithms to obtain the approximate supernode partition which creates the sparse block structure of the incomplete factors, and finally, in Section 3.1.3, we present some experiments obtained with our method.
3.1.1 Methodology

Preconditioned Krylov subspace methods utilize an accelerator and a preconditioner [129]. The goal of ILU-based preconditioners is to find approximate LU factorizations of the coefficient matrix, which are then used to facilitate the iterative process. Incomplete LU (ILU) or Incomplete Cholesky (IC) factorizations are based on the premise that most of the fill-in entries generated during a sparse direct factorization will tend to be small. Therefore, a fairly accurate factorization of A can be obtained by dropping most of these entries during the factorization. There are essentially two classical ways of developing incomplete LU factorizations. The first (historically) consists in dropping terms according to a recursive definition of a level: a fill-in which is itself generated from another fill-in will tend to be smaller and smaller as this chain of creation continues. The notion of level-of-fill, first suggested by engineers, is described next, as it is the stepping stone of the approach described in this work. The second is based on the use of thresholds during the factorization. Thus, ILU(τ,p) [128] is commonly implemented as an upward-looking, row-oriented algorithm which computes the ILU factorization row by row. It uses two parameters: the first, τ, is a tolerance for dropping small terms relative to the norm of the row under consideration, and the second, p, determines the maximum number of nonzero elements to keep in each row.
ILU(k) preconditioners

The incomplete LU factorization ILU(k) implements dropping with the help of a level-of-fill associated with each fill-in entry introduced during the factorization. Initially, each nonzero element has a level-of-fill of zero, while each zero element has (nominally) an infinite level-of-fill. Thereafter, when node k is eliminated, the level-of-fill of l_ij is defined by the formula:
    lev_f(l_ij) = min{ lev_f(l_ij) ; lev_f(l_ik) + lev_f(l_kj) + 1 }        (3.1)
This definition and its justification were originally given by Watts for problems arising in petroleum engineering [142]. Later, Forsyth and Tang provided a graph-based definition, which was then reinterpreted by Hysom and Pothen [82] within the framework of the fill-path theorem [126]. The interpretation of the level-of-fill is that it is equal to len(i, j) − 1, where len is the length of the shortest fill-path between i and j. It can easily be seen that the path lengths follow the simpler update rule: len(i, j) = min{len(i, j); len(i, k) + len(k, j)}. During Gaussian elimination, we eliminate nodes k labelled before a certain pair of nodes (i, j). It can then be proved that the shortest fill-path from i to j is the shortest, over all such k, of the shortest fill-path from i to k plus the shortest fill-path from k to j. This is illustrated in Figure 3.1. The incomplete symbolic ILU(k) factorization has a theoretical complexity similar to that of the numerical factorization but, based on the graph interpretation of the level-of-fill, an efficient algorithm that leads
Figure 3.1: Fill-paths from i to j.
to a practical implementation, and that can be easily parallelized, has been proposed in [82]. Thus, the symbolic factorization for the ILU(k) method, though more costly than in the case of exact factorizations, is not a limitation in our approach.
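To make the update rule (3.1) concrete, the following didactic sketch computes the level-of-fill of every entry of a small matrix whose sparsity pattern is stored, for simplicity, as a dense array of levels. It is a scalar illustration only, not the blockwise parallel algorithm of [82].

    #include <limits.h>

    #define LEV_INF (INT_MAX / 2)   /* "infinite" level, safe against overflow */

    /* lev is an n*n row-major array: on input, lev[i*n+j] = 0 for the nonzeros
     * of A and LEV_INF for the structural zeros. On output it holds the
     * level-of-fill of each entry; ILU(k) keeps the entries with level <= k.  */
    void ilu_symbolic_levels(int n, int *lev, int k)
    {
        for (int kk = 0; kk < n; kk++)            /* elimination of node kk     */
            for (int i = kk + 1; i < n; i++) {
                if (lev[i * n + kk] > k)          /* l_ik is dropped: no update */
                    continue;
                for (int j = kk + 1; j < n; j++) {
                    int newlev = lev[i * n + kk] + lev[kk * n + j] + 1;
                    if (newlev < lev[i * n + j])  /* update rule (3.1)          */
                        lev[i * n + j] = newlev;
                }
            }
    }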
Block-ILU(k)

The idea of level-of-fill has been generalized to blocks in the case of block matrices with dense blocks of equal dimensions, see, e.g., [37]. Such matrices arise from discretized problems when there are m degrees of freedom per mesh point, such as fluid velocities, pressure, etc.; this is common in particular in Computational Fluid Dynamics. It is clear that for such matrices it is preferable to work with the quotient graph, since this reduces the dimension of the problem, and that the ILU(k) factorizations obtained using the quotient graph and the original graph are then identical. In other words, we consider the partition P_0 constructed by grouping sets of unknowns that have the same row and column pattern in A; these sets of unknowns are cliques of G. In this case, if we denote by G^k the adjacency graph of the elimination graph for the ILU(k) factorization, then we have:

    Q(G^k, P_0) = Q(G, P_0)^k .
For partitions coarser than P_0, these properties no longer hold in the general case. Therefore, the ILU(k) symbolic factorization can be performed with a significantly lower complexity than the numerical factorization algorithm. In addition, by using the algorithm presented in [82], which is easily parallelizable, the symbolic block incomplete factorization is not a bottleneck in our approach. For a direct factorization, the supernode partition usually produces blocks that are large enough to obtain a good superscalar effect using the BLAS 3 subroutines. The exact supernodes that are generated in the incomplete-factor nonzero pattern are usually very small. Consequently, a blockwise implementation of the ILU(k) preconditioner based on the exact supernode partition would not be very efficient, and could even be worse than a classical columnwise implementation due to the overhead of calling the BLAS subroutines. A remedy to this problem is to merge supernodes that have nearly the same structure. This process induces some extra fill-in compared to the exact ILU(k) factors, but the increase in the number of operations is largely compensated by the gain in time achieved thanks to the BLAS subroutines. The convergence of our Block-ILU(k) preconditioner is at least as good as that obtained by the scalar ILU(k). Furthermore, it can also improve the convergence of the solver, since the extra fill-in admitted in the factors mostly corresponds to numerically nonzero entries that may improve the accuracy of the preconditioner.
The principle of our heuristic to compute the new supernode partition is to iteratively merge supernodes whose nonzero patterns are the most similar, until we reach a desired extra fill-in tolerance. To summarize, our incomplete block factorization consists of the following steps:
1. find the partition P_0 induced by the supernodes of A;

2. compute the block symbolic incomplete factorization Q(G, P_0)^k;

3. find the exact supernode partition in Q(G, P_0)^k;

4. given an extra fill-in tolerance α, construct an approximate supernode partition P_α to improve the block structure of the incomplete factors (detailed in the next section);

5. apply a block incomplete factorization using the parallelization techniques developed for our direct solver PaStiX.
The incomplete factorization can then be used to precondition a Krylov method to solve the system. The next section focuses on step 4; it gives the details of the amalgamation algorithm that is used to find an approximate supernode partition.
3.1.2 Amalgamation algorithm
A blockwise implementation of the ILU(k) factorization is directly obtained using the direct blockwise Algorithm 3. In practice, the exact supernodes that are generated in the symbolic ILU(k) factor are usually too small to allow a good BLAS efficiency in the numerical factorization and in the triangular solves. To address this problem, we propose an amalgamation algorithm which aims at grouping exact supernodes that have a similar nonzero pattern in order to get bigger supernodes. By construction, the exact supernode partition found in an ILU(k) factor is always a sub-partition of the direct supernode partition (i.e., the one corresponding to the direct factorization), since G^k can be obtained by removing some edges from G^*.
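As a rough illustration of the amalgamation principle, the sketch below greedily merges consecutive supernodes as long as the accumulated extra fill-in stays below the tolerance α (given as a fraction of the exact ILU(k) fill). The cost model merge_fill and the data structure are hypothetical simplifications, not the algorithm actually implemented in PaStiX.

    typedef struct {
        int  start, end;    /* column range of the supernode                  */
        long nnz;           /* nonzeros of its block column in the factor     */
    } snode_t;

    /* Hypothetical cost model: extra fill-in created by merging s1 and s2,
     * i.e. the zero entries stored so that both share one block structure.   */
    extern long merge_fill(const snode_t *s1, const snode_t *s2);

    /* Merge neighbouring supernodes while the extra fill-in stays below
     * alpha * nnz0; returns the number of supernodes of the new partition.   */
    int amalgamate(snode_t *sn, int nsn, double alpha, long nnz0)
    {
        long budget = (long)(alpha * nnz0);
        long spent  = 0;
        int  changed = 1;

        while (changed) {
            changed = 0;
            /* Find the cheapest pair of consecutive supernodes to merge.     */
            int  best = -1;
            long best_cost = budget - spent + 1;
            for (int i = 0; i + 1 < nsn; i++) {
                long cost = merge_fill(&sn[i], &sn[i + 1]);
                if (cost < best_cost) { best_cost = cost; best = i; }
            }
            if (best < 0)
                break;
            /* Merge sn[best] and sn[best+1], then shift the remaining ones.  */
            sn[best].end  = sn[best + 1].end;
            sn[best].nnz += sn[best + 1].nnz + best_cost;
            for (int i = best + 1; i + 1 < nsn; i++)
                sn[i] = sn[i + 1];
            nsn--;
            spent  += best_cost;
            changed = 1;
        }
        return nsn;
    }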