Scalability of parallel sparse direct solvers: methods, memory and performance Alfredo Buttari

To cite this version:

Alfredo Buttari. Scalability of parallel sparse direct solvers: methods, memory and performance. Distributed, Parallel, and Cluster Computing [cs.DC]. Toulouse INP, 2018. tel-01913033

HAL Id: tel-01913033 https://hal.archives-ouvertes.fr/tel-01913033 Submitted on 5 Nov 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Institut de Recherche en Informatique de Toulouse
Centre National de la Recherche Scientifique

Scalability of parallel sparse direct solvers: methods, memory and performance

Alfredo Buttari Chargé de Recherche, CNRS

MÉMOIRE D’HABILITATION À DIRIGER DES RECHERCHES

Presented and publicly defended on 26/09/2018

Rapporteurs:
Michael A. HEROUX, Senior Scientist, Sandia National Lab. (USA)
Xiaoye Sherry LI, Senior Scientist, LBNL (USA)
Yves ROBERT, Professeur des universités, ENS-Lyon, LIP (France)

Examinateurs:
Patrick AMESTOY, Professeur des universités, Toulouse INP, IRIT (France)
Iain DUFF, Senior Scientist, STFC-RAL (UK)
Raymond NAMYST, Professeur des universités, Univ. de Bordeaux, LaBRI (France)
Stéphane OPERTO, Directeur de Recherche, CNRS, Geoazur (France)

Résumé

The fast and accurate solution of large sparse linear systems is at the heart of numerous numerical applications from a very broad range of domains including structural mechanics, fluid dynamics, geophysics, medical imaging and chemistry. Among the techniques most commonly used to solve such problems, direct methods, based on the factorization of the system matrix, are generally appreciated for their numerical robustness and ease of use. These methods, however, have a very high complexity in terms of operations and memory consumption. The work presented in this thesis focuses on improving the scalability of sparse direct methods, defined as the ability to handle problems of larger and larger size. We introduce algorithms that achieve a higher degree of parallelism by reducing communications and synchronizations, in order to improve performance scalability, that is, the ability to reduce the execution time as more computational resources become available. We investigate the use of new parallel programming paradigms and tools that allow an efficient and portable implementation of these algorithms on heterogeneous supercomputers. We address memory scalability by means of scheduling methods that exploit parallelism without increasing memory consumption. Finally, we show how the complexity of sparse direct methods, in terms of both operation count and memory footprint, can be reduced through the use of low-rank approximation techniques. The methods presented, whose effectiveness has been verified on problems from real-life applications, have been implemented in the MUMPS and qr_mumps software packages, distributed under free licenses.

Abstract

The fast and accurate solution of large size sparse systems of linear equations is at the heart of numerical applications from a very broad range of domains including structural mechanics, fluid dynamics, geophysics, medical imaging and chemistry. Among the most commonly used techniques, direct methods, based on the factorization of the system matrix, are generally appreciated for their numerical robustness and ease of use. These advantages, however, come at the price of a considerable operation count and memory footprint. The work presented in this thesis is concerned with improving the scalability of sparse direct solvers, understood as the ability to solve problems of larger and larger size. More precisely, our work aims at developing solvers which are scalable in performance, memory consumption and complexity. We address performance scalability, that is, the ability to reduce the execution time as more computational resources are available, by introducing algorithms that improve parallelism by reducing communications and synchronizations. We discuss the use of novel parallel programming paradigms and tools to achieve their implementation in an efficient and portable way on modern, heterogeneous supercomputers. We present methods that make sparse direct solvers memory-scalable, that is, capable of taking advantage of parallelism without increasing the overall memory footprint. Finally, we show how it is possible to use data sparsity to achieve an asymptotic reduction of the cost of such methods. The presented algorithms have been implemented in the freely distributed MUMPS and qr_mumps solver packages and their effectiveness assessed on real-life problems from academic and industrial applications.


Acknowledgments

I am very grateful to Michael Heroux, Xiaoye Sherry Li and Yves Robert for reporting on the manuscript and to Iain Duff, Raymond Namyst and Stéphane Operto for taking part in my jury; their insightful questions and comments on the manuscript and the presentation have been extremely rewarding and encouraging for the continuation of my career. During my career I've had the chance to meet and work with very talented researchers. I wish to thank all my previous colleagues from the University of Rome Tor Vergata and from the Innovative Computing Laboratory at UT Knoxville as well as my present colleagues from the APO team of the IRIT laboratory. I am very grateful to the people of the MUMPS team and, especially, to Patrick and Jean-Yves who have been throughout these years, and still are, a model and a great source of inspiration for me. The SOLHAR project has been a fantastic professional experience; I wish to thank all the members and, especially, Abdou and Emmanuel. Special thanks go to the PhD students I've had the pleasure to supervise: François-Henry, Clément, Florent and Theo. Their work, talent and dedication have contributed greatly to my career. Thanks to my boys, Roberto and Marco, who did their best to keep me from writing this manuscript but give me, every day, the motivation to improve myself. I owe the deepest gratitude to my wife, Federica, for her constant support and encouragement.


Contents

Contents v

1 Introduction 1

2 Background 7
  2.1 Dense matrix factorizations 7
    2.1.1 Unsymmetric matrices: the Gaussian Elimination 7
      2.1.1.1 Accuracy of the Gaussian Elimination and pivoting 8
    2.1.2 Symmetric matrices: the Cholesky and LDL^T factorizations 10
    2.1.3 QR factorization 11
      2.1.3.1 Householder QR decomposition 14
    2.1.4 Blocked variants 16
      2.1.4.1 Blocked LU factorization 16
      2.1.4.2 Blocked QR factorization 17
      2.1.4.3 QR factorization with pivoting 18
  2.2 Sparse matrix factorizations 19
    2.2.1 The Cholesky multifrontal method 19
    2.2.2 The QR multifrontal method 26
    2.2.3 Additional topics 31
      2.2.3.1 Fill-reducing orderings 31
      2.2.3.2 Pivoting 34
      2.2.3.3 Unsymmetric methods 36
      2.2.3.4 Multifrontal solve 36
      2.2.3.5 A three-phases approach 38
      2.2.3.6 Sparse, direct solver packages 38
  2.3 Supercomputer architectures 39
  2.4 Runtime systems and the STF model 44

3 Parallelism and performance scalability 47
  3.1 Task-based multifrontal QR for manycore systems 49
  3.2 A hand coded task-based parallel implementation 53
    3.2.1 Blocking of dense matrix operations 56
    3.2.2 Experimental results 56
      3.2.2.1 Understanding the memory utilization 57
      3.2.2.2 The effect of tree pruning and reordering 59
      3.2.2.3 Absolute performance and Scaling 59
  3.3 Runtime-based multifrontal QR for manycore systems 61
  3.4 Communication Avoiding QR fronts factorizations 65
    3.4.1 Communication Avoiding dense factorizations 67
    3.4.2 Using Communication Avoiding factorizations within the multifrontal method 70
  3.5 Multifrontal QR for heterogeneous systems 75
    3.5.1 Frontal matrices partitioning schemes 76
    3.5.2 Scheduling strategies 78
    3.5.3 Implementation and experimental results 81
    3.5.4 Combining communication-avoiding methods and GPUs 83
  3.6 A performance analysis approach for task-based parallelism 83
    3.6.1 Analysis for homogeneous multicore systems 87
    3.6.2 Analysis for heterogeneous systems 88

4 Memory-aware 93
  4.1 Memory-aware scheduling in shared-memory systems 95
    4.1.1 Experimental results 98
  4.2 Memory-aware scheduling and mapping in distributed-memory systems 101
    4.2.1 Mapping techniques 102
      4.2.1.1 Memory scalability of proportional mapping: an example 104
    4.2.2 Memory-aware mapping algorithms 104
    4.2.3 Experiments 106

5 Low-rank approximation techniques for sparse, direct solvers 111
  5.1 Low-rank approximations 111
    5.1.1 Low-rank matrices 111
    5.1.2 Basic linear algebra operations on low-rank matrices 112
    5.1.3 Numerically low-rank matrices 113
  5.2 Low-Rank formats 115
    5.2.1 Block-admissibility condition 116
    5.2.2 Block-partitioning 116
    5.2.3 Nested bases 117
    5.2.4 A taxonomy of low-rank formats 118
      5.2.4.1 The Block Low-Rank format 118
      5.2.4.2 The H-matrix format 119
  5.3 BLR multifrontal 120
    5.3.1 Block-partitioning of fronts 121
      5.3.1.1 Clustering of the fully-summed variables 122
      5.3.1.2 Clustering of the non fully-summed variables 124
    5.3.2 Assembly operations 126
    5.3.3 BLR fronts factorization 126
      5.3.3.1 The FSCU and UFSC variants 127
      5.3.3.2 The UFCS variant with restricted pivoting 128
      5.3.3.3 LUAR 129
      5.3.3.4 Compression of the contribution block 131
    5.3.4 Experimental results on applications 132
      5.3.4.1 3D frequency-domain Full Waveform Inversion 132
      5.3.4.2 3D Controlled-source Electromagnetic inversion 137
  5.4 Complexity of the BLR multifrontal factorization 140
    5.4.1 FE discretization of elliptic PDEs 140
    5.4.2 From Hierarchical to BLR bounds 142
      5.4.2.1 H-admissibility and properties 142
      5.4.2.2 Why this result is not suitable to compute a complexity bound for BLR 143
      5.4.2.3 BLR-admissibility and properties 143
    5.4.3 Complexity of the dense BLR factorization 146
    5.4.4 From dense to sparse BLR complexity 148
      5.4.4.1 Complexity of the BLR variants 148
    5.4.5 Experimental results 150
  5.5 Performance and scalability 154
    5.5.1 Performance analysis of sequential FSCU algorithm 155
    5.5.2 Multithreading the BLR factorization 156
      5.5.2.1 Performance analysis of multithreaded FSCU algorithm 156
      5.5.2.2 Exploiting tree-based multithreading 158
      5.5.2.3 Right-looking vs. Left-looking 159
    5.5.3 BLR factorization variants 160
      5.5.3.1 LUAR: Low-rank Updates Accumulation and Recompression 160
      5.5.3.2 UFCS algorithm 161
    5.5.4 Complete set of results 163

6 Conclusions and future work 167
  6.1 Future work in sparse direct methods 168

A Experimental setup 191
  A.1 Matrices 191
  A.2 Computers 192


Chapter 1

Introduction

We are interested in the solution of a linear system of equations

Ax = b (1.1) where the A matrix

• has m rows and n columns,

• is of large size, i.e., has millions of rows and/or columns,

• is sparse, i.e., most of its coefficients are structurally equal to zero.

In the case where m = n and A is of full rank (which we will assume in the remainder of this document, unless explicitly mentioned) Equation (1.1) admits a unique solution. If the matrix is overdetermined, i.e., it has more rows than columns, the linear system does not admit a solution in general because the number of constraints is higher than the number of degrees of freedom; in this case we instead seek the x vector which minimizes the 2-norm of the residual r = Ax − b:

min_x ∥Ax − b∥2. (1.2)

Such a problem is called a least-squares problem. If, instead, the matrix is underdetermined, i.e., has more columns than rows, the linear system admits an infinite number of solutions; in this case we seek the solution x whose norm is minimal:

min ∥x∥2, Ax = b. (1.3)

Such a problem is called a least-norm problem. One of the most used and well known definitions of a sparse matrix is attributed to James Wilkinson:

“A sparse matrix is any matrix with enough zeros that it pays to take advantage of them”.

There are three main ways to take advantage of the fact that a matrix is mostly filled up with zeros. First of all, the memory needed to store the matrix is less than the memory needed for a dense matrix of the same size because the zeros need not be stored explicitly. Second, the complexity of most operations on a sparse matrix can be greatly reduced with respect to the same operation on a dense matrix of the same size because most of the zero coefficients in the original matrix can be skipped during the computation. Finally, parallelism can be much higher than in the dense case because some operations may affect distinct subsets of the matrix nonzeros and can thus be applied concurrently. Methods for solving Equations (1.1), (1.2) and (1.3) can be roughly grouped into two families: direct and iterative methods. Direct methods proceed by computing a factorization of the system matrix A such as the LU, LDL^T or QR factorizations; the resulting factors, which are easy and cheap to invert, are then used to compute the solution of the problem. On the other hand, iterative methods start from an initial guess solution, x0 say, and iteratively improve it until the desired solution accuracy is achieved or a maximum number of iterations is reached, in which case the method has not converged to a satisfactory solution. The convergence of an iterative method essentially depends on the numerical properties of the matrix A, more precisely, on its conditioning which is defined by the matrix spectrum. Indeed, in some cases the solution of a linear system may require a very high number of iterations or convergence may not be reached at all. For this reason iterative solvers are often used in combination with preconditioning techniques. A preconditioner is usually defined as a matrix M ≈ A^{-1} which approximates the inverse of A; this is used to transform the linear system of Equation (1.1) into MAx = Mb [1] where the system matrix MA has a more suitable spectrum which leads to a faster convergence. Although general purpose preconditioning methods exist such as Incomplete LU (ILU) or Sparse Approximate Inverse (SPAI), a preconditioner that is effective yet cheap to compute and apply may be hard to find and may, again, rely on the properties of the input problem. As a result, iterative methods can hardly be regarded as "black box" solvers. On the other hand, if the input problem is well conditioned or an appropriate preconditioner is available, iterative methods can converge very quickly and with little memory consumption; moreover, iterative methods generally achieve very good scalability on large scale platforms. Despite this good scalability on distributed memory platforms, iterative methods can only achieve a modest fraction of the peak performance of modern processing units. This is because of their poor arithmetic intensity (i.e., the ratio between the number of floating-point operations and the number of memory accesses) which prevents the efficient use of cache memories; as a result, iterative methods are memory bound, i.e., run at the speed of the memory system which is commonly much slower than the processing unit. Moreover, for the same reason, iterative methods can make very poor use of multicore CPUs or accelerators because in these systems processing units are attached to a shared memory. In my work I have been mostly interested in the use of direct solvers. These methods are generally regarded as very robust tools for the solution of linear systems as they are capable of computing accurate solutions for a very wide range of problems without the need for the user to have any knowledge of linear system solvers.
Another case where direct methods are often employed is when systems with the same matrix have to be solved for multiple right-hand sides, because the matrix factorization, which is the most expensive operation, only has to be computed once and its result reused for multiple, cheap, solve operations [2]. Furthermore, direct methods may offer a very broad range of features including the computation of the Schur complement of a matrix, its determinant or its null-space. Finally, because they rely on dense matrix operations which have a very high arithmetic intensity, direct methods can run very close to the peak performance of modern processing units and can very effectively take advantage of multicore processors and accelerators. Unfortunately, these advantages come at the price of a considerable resource consumption both in terms of memory and time. Moreover, sparse direct solvers usually exhibit a large and irregular workload which is difficult to distribute on parallel computers; this may severely limit their scalability, especially on large scale, distributed memory systems.

[1] This is called left preconditioning but other options are possible.
[2] Note that there exist techniques for making iterative methods more effective in the case of multiple right-hand sides such as block methods or subspace recycling.

For the sake of completeness, it is important to mention that, in recent years, hybrid solvers have gained increasing popularity thanks to their ability to combine the strengths of iterative and direct methods. Among these we can mention Domain Decomposition [7, 62, 83] or block-projection [65] methods. The effectiveness of these methods, which heavily rely on direct solvers, is still, to some extent, dependent on the numerical properties of the target problems, though.

This thesis is concerned with the usability of sparse, direct methods for the solution of very large scale sparse linear systems. As such, it deals with the issues related to their scalability, in a broad sense:

1. It addresses the performance scalability of sparse direct solvers on modern, heterogeneous computing systems. These commonly include many CPU cores and, possibly, multiple accelerators such as GPUs. This requires algorithms that can achieve high levels of concurrency in order to feed all the working units. At the same time, it is important to develop data and workload partitioning schemes, as well as scheduling policies, which can make the most out of the computing performance available by taking into account the characteristics of the available processing units. From a strictly technological point of view, the choice of parallel programming models and tools plays an important role in achieving high performance and portability. These issues are the subject of Chapter 3 and are also addressed in Section 5.5.

2. It presents methods that aim at improving the memory scalability of a direct solver in a parallel setting. To achieve parallelism in direct solvers, more data is produced and processed in order to feed all the available working units. As a result, when executed in parallel, direct methods consume more memory than in a sequential run. In Chapter 4 we discuss techniques that allow the execution of the direct method in parallel within a prescribed memory envelope at the price of little or no loss of concurrency.

3. It investigates methods that improve how the complexity of sparse, direct solvers scales with the size of the problem. These take advantage of a property which is commonly referred to as data sparsity through the use of low-rank approximation techniques that allow for discarding redundant or unimportant data; this results in faster computations and lower memory consumption with a loss of accuracy which can be conveniently controlled through a single parameter. This subject is studied in Chapter 5.

Chapter 2 introduces the basic concepts that are necessary for a good understanding of the ensuing chapters; it also attempts to provide a concise introduction to dense and sparse matrix factorization methods and can serve as a starting point for students or researchers who are willing to learn the basics of solving linear systems by means of direct methods. Chapter 6 draws some conclusions from this work and presents some future research developments. This manuscript describes the research work that I have accomplished since receiving my PhD degree in 2006.


In 2006-2007 I joined the ICL laboratory at the University of Tennessee Knoxville as a post-doc. During that period I worked on the development of "tiled" algorithms [C13, B3, C14, J15, C15, J16, J18, J19] that improve the scalability and performance of dense matrix factorizations by breaking down into multiple steps the operations that are heavy on communications and, therefore, difficult to parallelize efficiently. During this period I also worked on the use of mixed-precision iterative refinement techniques [J7, B4, J12, J13, J18, C18] for the solution of linear systems of equations. Under the assumption that the input problem is not too badly conditioned, these methods perform the most expensive operations (e.g., the factorization) in lower-precision arithmetic and then selectively use higher precision to recover the desired accuracy by means of iterative refinement. As a result, the solution can be achieved with high accuracy at the speed of lower-precision operations. These techniques can be equally applied to dense and sparse linear systems and to direct and iterative methods. When applied to dense, direct solvers this performance benefit comes at the price of an increased memory consumption because, although the factors are stored in single precision, a double precision copy of the original matrix must be kept in memory to perform the iterative refinement. When applied to sparse, direct solvers, on the other hand, they can also reduce the storage because the size of this extra copy is negligible compared to the size of the factors, which are stored in single precision. For the sake of conciseness this work is not discussed in this manuscript. Later, in 2008, I joined the INRIA Graal (currently ROMA) team at the LIP laboratory of Lyon. Here I got acquainted with sparse direct solvers and worked on the development of a parallel symbolic analysis which was integrated into the MUMPS solver. At the end of 2008 I joined the APO team of the IRIT laboratory of Toulouse as a CNRS Chargé de Recherche (full time researcher). Since then I have devoted most of my research activity to sparse direct solvers, parallel computing and computational linear algebra. This thesis is essentially a résumé of the work I achieved during this period. Most of this work was achieved in the context of four PhDs that I co-supervised: François-Henry Rouet (INPT-IRIT, 2009-2012) on the memory scalability of sparse direct solvers, Clément Weisbecker (INPT-IRIT, 2010-2013) and Theo Mary (UPS-IRIT, 2014-2017) on low-rank approximation techniques and Florent Lopez (UPS-IRIT, 2012-2015) on parallelism for multicore and heterogeneous systems. The development of software has always been a central part of my research activity. This is mostly due to the fact that the need for fast, scalable and, at the same time, reliable solvers is motivated by the ever evolving and increasing requirements of large-scale numerical simulation applications. Through the development of production-quality software packages it is possible to validate the effectiveness of methods and algorithms in working conditions that are hard to model theoretically and to assess their practical interest. Moreover, the free distribution of these packages allows us to reach out for collaborations with experts of large-scale simulation applications, opening up opportunities for developing more research. I have contributed to the development of the MUMPS [13] sparse direct solver, and I am the principal developer of the qr_mumps [J10] one.
I have also contributed, to a minor extent, to the PSBLAS [J17] sparse, iterative solver. All the achievements of my research activity (including those that are not presented in this manuscript) result from close collaborations and exchanges with the IRIT-APO team as well as with other, external, academic and industrial partners: the ROMA, HiePACS and STORM teams of Inria, Dr Sherry Li’s team at the Lawrence Berkeley National Laboratory, the ICL laboratory at UTK, Prof Salvatore Filippone at Cranfield University (formerly at the University of Rome Tor Vergata), the Seiscope consortium, LSTC, EDF and EMGS.

My work received financial support from the following projects and contracts: ANR SOLSTICE (ANR-06-CIS-010, 2007-2010) and SOLHAR (ANR-13-MONU-0007, 2013-2018), the French-Israeli "Multicomputing" project (2008-2010), the SCEBF (2014-2015), PhD (2016-2017) and SRFCP (2018-2019) TTIL projects, the contracts with Samtech (2008-2010), EDF (2010-2013) and EMGS (2014) and the MUMPS consortium (https://mumps-consortium.org/).


Chapter 2

Background

2.1 Dense matrix factorizations

2.1.1 Unsymmetric matrices: the Gaussian Elimination

One very well known method for computing the solution of Equation (1.1) with A being square (i.e., m = n) and non-symmetric is Gaussian Elimination. Through a number of convenient linear combinations of the rows of A and b, this method reduces A to an upper-triangular matrix U such that the solution x of the linear system can be easily computed through a backward substitution. The Gaussian elimination is equivalent to computing n − 1 linear transformations L_i^{-1} such that

L_{n-1}^{-1} ... L_2^{-1} L_1^{-1} A = U.

Each L_i^{-1} is unit-diagonal and lower-triangular with all the coefficients below the diagonal equal to zero except for the i-th column and eliminates variable x_i from equations i + 1, ..., m. Setting L_1 L_2 ... L_{n-1} = L, this process yields A = LU, which is commonly known under the name of LU factorization. With a few algebraic manipulations and a few "strokes of luck" [154] it is possible to show that L is also unit-diagonal and lower-triangular. The LU factorization can be computed using Algorithm 2.1, where a_{i,j}^{(k)} denotes the coefficients of the so-called trailing submatrix A^{(k)} at step k; this algorithm costs 2/3 n^3 floating-point operations and may work in place, i.e., the L and U factors may be stored in the same memory that originally holds the A matrix (this is possible because the diagonal of L is made of ones and need not be stored). Note that the loops in Algorithm 2.1 can be nested differently, which gives rise to different variants of the LU factorization. The one in Algorithm 2.1 is the so-called right-looking variant because at step k of the factorization column k is reduced through the transformation L_k^{-1}, which is also immediately applied to the trailing submatrix at the right of it. The dual of this approach is the left-looking variant: at step k, all the L_{k-1}^{-1} ... L_1^{-1} transformations are applied to the k-th column, which is then reduced by means of the L_k^{-1} transformation. Other variants exist such as the Crout one [63]. All these variants are numerically equivalent but differ in the data access pattern, which may result in better use of the memory hierarchy or better potential for parallelism and vectorization. Once the L and U factors are computed, the solution x of the linear system can be computed through the following two operations

Ly = b
Ux = y


which are called, respectively, forward elimination and backward substitution. These can be easily and cheaply (n^2 flops each) computed with a doubly-nested loop.

Algorithm 2.1 LU factorization.
1:  a_{i,j}^{(0)} = a_{i,j} for all i, j = 1, ..., n
2:  for k = 1, ..., n do
3:    for j = k, ..., n do
4:      u_{k,j} = a_{k,j}^{(k-1)}
5:    end for
6:    for i = k + 1, ..., m do
7:      l_{i,k} = a_{i,k}^{(k-1)} / u_{k,k}
8:      for j = k + 1, ..., n do
9:        a_{i,j}^{(k)} = a_{i,j}^{(k-1)} − l_{i,k} u_{k,j}
10:     end for
11:   end for
12: end for
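The following is a minimal NumPy transcription of Algorithm 2.1 (right-looking, in-place, no pivoting), given here only as an illustration; the function name, the 0-based indexing and the diagonally dominant test matrix are choices of this sketch, not part of the original text.

```python
import numpy as np

def lu_in_place(A):
    """Unblocked right-looking LU without pivoting (a sketch of Algorithm 2.1).
    Overwrites a copy of A with U in the upper triangle and the multipliers of L below it."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n):
        A[k+1:, k] /= A[k, k]                                # multipliers l_{i,k}
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])    # trailing submatrix update
    return A

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5)) + 5 * np.eye(5)   # diagonally dominant: pivoting not needed
F = lu_in_place(M)
L, U = np.tril(F, -1) + np.eye(5), np.triu(F)
print(np.allclose(L @ U, M))
```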

2.1.1.1 Accuracy of the Gaussian Elimination and pivoting

When working in finite-precision arithmetic (as happens on a computer) we may be concerned about the accuracy of the computed solution x̂ of Equation (1.1) because of the roundoff errors that occur during the computation. Classic backward error analysis [160] states that the forward error ∥x − x̂∥/∥x∥ can be bounded by the product of the condition number and the backward error. The condition number measures how sensitive the solution is to perturbations in the input data and does not depend on the method used but solely on the input data; for the solution of Equation (1.1) the condition number is κ(A) = ∥A^{-1}∥ · ∥A∥ (see, for example, the book by Demmel [57] for further details). The backward error, instead, is a measure of the smallest perturbation of the input data such that the computed solution is the exact solution of the perturbed problem. If this quantity is small, the method is deemed backward stable. For the solution of Equation (1.1), the backward error is thus defined by the following quantity

β_{Ax=b} = min { ε : (A + δA)x̂ = b + δb, ∥δA∥ ≤ ε∥A∥, ∥δb∥ ≤ ε∥b∥ }.

A well known result by Rigal and Gaches [138] proves that

β_{Ax=b} = ∥r∥ / (∥A∥ ∥x̂∥ + ∥b∥)

where r = b − Ax̂ is the so-called residual. If Gaussian Elimination is used, and assuming δb = 0, it is possible to prove that [57]

∥δA∥_∞ ≤ 3nu ∥L∥_∞ · ∥U∥_∞,

where u is the unit roundoff, which implies that the Gaussian Elimination is backward stable if 3nu ∥L∥_∞ · ∥U∥_∞ = O(u) ∥A∥_∞. This result basically states that we have to keep the coefficients in L and U small in order to limit the backward error. Let us take this example, extracted from the book by Golub and Van Loan [86]

Ax = \begin{bmatrix} 0.001 & 1.00 \\ 1.00 & 2.00 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.00 \\ 3.00 \end{bmatrix} = b

to be solved using Gaussian Elimination on a computer using a three decimal digit floating-point representation. The LU factorization yields

L = \begin{bmatrix} 1.00 & \\ 1000 & 1.00 \end{bmatrix} \qquad U = \begin{bmatrix} 0.001 & 1.00 \\ & -1000 \end{bmatrix}

because 2.00 − 1000 = −1000 in the three-digit representation. As a result

L · U = \begin{bmatrix} 0.001 & 1.00 \\ 1.00 & 0.00 \end{bmatrix} ≠ A

and the computed solution is

x̃ = \begin{bmatrix} 0.00 \\ 1.00 \end{bmatrix} ≠ true solution = \begin{bmatrix} 1.002... \\ 0.998... \end{bmatrix}

Note that the condition number is κ(A) = 5.84 and thus the problem is well-conditioned. This poor accuracy can be explained by observing that the coefficients of L and U have grown considerably with respect to those of A – a phenomenon that can be measured by the so-called growth factor

ρ(A) = max_{i,j,k} |a_{i,j}^{(k)}| / max_{i,j} |a_{i,j}|.

One way to prevent this excessive growth of the coefficients is a technique called pivoting. Various flavors of pivoting exist in the literature but they all share the idea of using permutations in the course of the factorization such that at step k the pivotal coefficient a_{k,k}^{(k-1)} is large enough with respect to the coefficients in the trailing submatrix. The most commonly used pivoting technique is called partial pivoting and consists in scanning the k-th column a_{i,k}^{(k-1)}, for i = k, ..., m, looking for the coefficient with the largest absolute value and then swapping the corresponding row with row k. Because of the division on line 7 of Algorithm 2.1, partial pivoting ensures that all the coefficients of L are smaller than or equal to one in absolute value. In this case the bound on the backward error becomes [57]

∥δA∥_∞ ≤ 3n^3 u ρ(A) ∥A∥_∞.

This bound can be overly pessimistic, not only because of the n^3 term but also because it is not possible to compute a tight bound for ρ(A); it can actually be shown that for some classes of problems the growth factor can be as big as 2^{n-1}. In practice, however, it is very rare to observe large growth factors and the Gaussian elimination with partial pivoting can be used with confidence. As a result, the Gaussian Elimination with partial pivoting is described by a sequence of linear combinations interleaved with permutations

L_{n-1}^{-1} P_{n-1} ... L_2^{-1} P_2 L_1^{-1} P_1 A = U, (2.1)

where each P_i only permutes two rows. Through simple algebraic manipulations this can be rewritten as

L̃_{n-1}^{-1} ... L̃_2^{-1} L̃_1^{-1} P_{n-1} ... P_2 P_1 A = U. (2.2)

where

L̃_i = P_{n-1} ... P_{i+1} L_i P_{i+1}^T ... P_{n-1}^T.

We can observe that the only difference between L_i and L̃_i is in the fact that the subdiagonal nonzeros in column i are permuted and thus it is still possible to write L̃_{n-1}^{-1} ... L̃_2^{-1} L̃_1^{-1} = L^{-1}, which yields

PA = LU. (2.3)

Taking pivoting into account, the LU factorization can be rewritten as in Algorithm 2.2. Note that the P matrix is stored implicitly in the p array of integers, whereas L and U are stored in the memory that originally contained matrix A. This algorithm performs exactly the same number of operations as Algorithm 2.1 because permutations only involve memory copies; nonetheless the use of pivoting degrades performance and limits parallelism. Algorithm 2.2 is implemented in the _getrf2 routine [1] of the LAPACK [20] library.

Algorithm 2.2 LU factorization with partial pivoting.
1: for k = 1, ..., n do
2:   p(k) = argmax_{i=k,...,m} |a_{i,k}|
3:   swap rows A_{k,1:n} and A_{p(k),1:n}
4:   A_{k+1:m,k} = A_{k+1:m,k} / a_{k,k}
5:   A_{k+1:m,k+1:n} = A_{k+1:m,k+1:n} − A_{k+1:m,k} * A_{k,k+1:n}
6: end for

Gaussian Elimination with partial pivoting applied to the matrix of the example above yields

P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \qquad L = \begin{bmatrix} 1.00 & \\ 0.01 & 1.00 \end{bmatrix} \qquad U = \begin{bmatrix} 1.00 & 2.00 \\ & 1.00 \end{bmatrix}

resulting in

P^T · L · U = \begin{bmatrix} 0.01 & 1.02 \\ 1.00 & 2.00 \end{bmatrix} \qquad x = \begin{bmatrix} 1.00 \\ 0.996 \end{bmatrix}

which is a much more accurate solution with respect to the case where pivoting is not used.
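The dramatic rounding of the three-digit example cannot be reproduced in double precision, but the effect of partial pivoting on this matrix can be observed with a library factorization. The snippet below is an illustrative sketch based on scipy.linalg.lu, which returns P, L and U such that A = P·L·U.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[0.001, 1.00],
              [1.00,  2.00]])

P, L, U = lu(A)   # A = P @ L @ U; partial pivoting brings the large pivot first
print(P)          # the permutation swaps the two rows, as in the example above
print(L)          # all multipliers have absolute value <= 1
print(np.allclose(P @ L @ U, A))
```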

2.1.2 Symmetric matrices: the Cholesky and LDL^T factorizations

A matrix is symmetric positive definite if and only if A = A^T and x^T A x > 0 for all x ≠ 0. For such matrices the following proposition can be proved:

Proposition 2.1— A is symmetric positive definite if and only if there is a unique lower triangular nonsingular matrix L with positive diagonal entries such that A = LL^T.

[1] The underscore has to be replaced by d or s for computations in real, double and single precision, respectively, and by z or c for computations in complex, double and single precision, respectively.


We refer the reader to any classic linear algebra textbook (like the one by Demmel [57] or Golub and Van Loan [86]) for a proof. The factorization A = LL^T is called the Cholesky factorization and can be computed as in Algorithm 2.3, implemented in the LAPACK _potf2 routine. This algorithm

• costs n^3/3 floating-point operations, which is half as much as the LU factorization;
• does not require pivoting because it is backward stable due to the positive definiteness property;
• represents the cheapest way to verify whether A is positive definite because otherwise it would break down trying to compute the square root of a negative coefficient or by making a division by zero.

Algorithm 2.3 Cholesky factorization.
1: for k = 1, ..., n do
2:   a_{k,k} = √a_{k,k}
3:   for i = k + 1, ..., n do
4:     a_{i,k} = a_{i,k} / a_{k,k}
5:     for j = k + 1, ..., i do
6:       a_{i,j} = a_{i,j} − a_{i,k} a_{j,k}
7:     end for
8:   end for
9: end for
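A minimal NumPy sketch of Algorithm 2.3 (unblocked, right-looking, working in place on the lower triangle); the function name and the synthetic SPD test matrix are illustrative assumptions of this sketch.

```python
import numpy as np

def cholesky_in_place(A):
    """Unblocked right-looking Cholesky (a sketch of Algorithm 2.3).
    Overwrites a copy of A so that its lower triangle holds L with A = L L^T."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n):
        A[k, k] = np.sqrt(A[k, k])        # breaks down if A is not positive definite
        A[k+1:, k] /= A[k, k]             # scale column k
        for j in range(k + 1, n):         # update the trailing lower triangle
            A[j:, j] -= A[j:, k] * A[j, k]
    return np.tril(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)               # symmetric positive definite test matrix
L = cholesky_in_place(A)
print(np.allclose(L @ L.T, A))
```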

When the matrix A is indefinite, the Cholesky factorization cannot be used because it leads to an unstable computation due to an excessive element growth. Obviously, the partial pivoting technique described above can be used to stabilize (at least in practice) the method but this would destroy the symmetry and thus preclude the possibility to achieve the factorization in n^3/3 flops. On the other hand, using symmetric permutations such that P A P^T = LL^T does not always lead to a stable computation (think of the case where all the diagonal elements of A are small). For such matrices, the most commonly used technique consists in using a factorization of the form P A P^T = LDL^T where L is lower-triangular and unit-diagonal and D is a block-diagonal matrix with blocks of size 1 × 1 or 2 × 2. One pivoting technique to compute such a factorization was proposed by Bunch and Kaufman [44] and, at each step of the factorization, only involves searching in two columns of the trailing submatrix; this method can be proved to be as stable as the LU factorization with complete pivoting. The LDL^T factorization with the Bunch-Kaufman pivoting is implemented in the LAPACK _sytf2 routine.
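As an illustration of an LDL^T factorization with 1 × 1 and 2 × 2 pivots, the sketch below relies on scipy.linalg.ldl, which is backed by the LAPACK _sytrf family of routines; the small indefinite matrix is a hypothetical example, not taken from the text.

```python
import numpy as np
from scipy.linalg import ldl

# A symmetric indefinite matrix with a zero diagonal entry (hypothetical example).
A = np.array([[0., 2., 1.],
              [2., -1., 3.],
              [1., 3., 2.]])

L, D, perm = ldl(A, lower=True)     # A = L D L^T, with D block diagonal
print(np.allclose(L @ D @ L.T, A))
print(D)                            # 2x2 blocks may appear for indefinite matrices
```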

2.1.3 QR factorization

This method decomposes the input matrix A ∈ R^{m×n}, assumed to be of full rank, into the product of a square, orthogonal matrix Q ∈ R^{m×m} and an upper triangular matrix R ∈ R^{n×n}.

Theorem 2.1 Björck [36, Theorem 1.3.1].— Let A ∈ R^{m×n}, m ≥ n. Then there is an orthogonal matrix Q ∈ R^{m×m} such that

A = Q \begin{bmatrix} R \\ 0 \end{bmatrix}, (2.4)


where R is upper triangular with nonnegative diagonal elements. The decomposition (2.4) is called the QR decomposition of A, and the matrix R will be called the R-factor of A.

The columns of the Q matrix can be split in two groups

Q = [Q_1 Q_2] (2.5)

where Q_1 ∈ R^{m×n} is an orthogonal basis for the range of A, R(A), and Q_2 ∈ R^{m×(m−n)} is an orthogonal basis for the kernel of A^T, N(A^T).

Ax = b, with A ∈ Rn×n, (2.6)

as the solution x can be computed through the following three steps   A = QR z = QT b (2.7)  x = R−1z where, first, the QR decomposition is computed (e.g., using one of the methods described below), an intermediate result is computed trough a simple matrix-vector product and, finally, solution x is computed through a triangular system solve. As we will explain the next two sections, the QR decomposition is commonly unattractive in practice for solving square systems mostly due to its excessive cost when compared to other available techniques, despite its desirable numerical properties. The QR decomposition is instead much more commonly used for solving linear systems where A is overdetemined, i.e. where there are more equations than unknowns. In such cases, unless the right-hand side b is in the range of A, the system admits no solution; it is possible, though, to compute a vector x such that Ax is as close as possible to b, or, equivalently, such that the residual ∥Ax − b∥2 is minimized as in Equation (1.2). Such a problem is called a least-squares problem and commonly arises in a large variety of applications such as statistics, photogrammetry, geodetics and signal processing. One typical example is given by linear regression where a linear model, say f(x, y) = α+βx+γy has to be fit to a number of observations subject to errors (fi, xi, yi), i = 1, ..., m. This leads to the overdetermined system

    1 x y   f  1 1   1    α    1 x2 y2     f2   . . .   β  =  .  .  . . .   .  . . . γ . 1 xm ym fm Assuming the QR decomposition of A in Equation (2.4) has been computed and [ ] [ ] T T Q1 c Q b = T b = Q2 d we have ∥ − ∥2 ∥ T − T ∥2 ∥ − ∥2 ∥ ∥2 Ax b 2 = Q Ax Q b 2 = Rx c 2 + d 2. (2.8)


This quantity is minimized if Rx = c where x can be found with a simple triangular system solve. This is equivalent to saying that Ax = Q_1 Q_1^T b and thus solving Equation (1.2) amounts to finding the vector x such that Ax is the orthogonal projection of b onto the range of A, as shown in Figure 2.1. Also note that r = Q_2 Q_2^T b and thus r is the projection of b onto the null space of A^T, N(A^T).

Figure 2.1: Solution of a least-squares problem.

Another commonly used technique for the solution of the least-squares problem is the Normal Equations method. Because the residual r is in N(A^T)

A^T (Ax − b) = 0

and, thus, the solution x to Equation (1.2) can be found solving the linear system A^T A x = A^T b. Because A^T A is Symmetric Positive Definite (assuming A has full rank), this can be achieved through the Cholesky factorization. Nonetheless, the method based on the QR factorization is often preferred because the conditioning of A^T A is equal to the square of the conditioning of A, which may lead to excessive error propagation.
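A small NumPy sketch contrasting the QR approach with the normal equations on the linear regression problem above; the synthetic data and coefficient values are hypothetical and only serve to illustrate the two solution paths.

```python
import numpy as np

# Hypothetical data: fit f(x, y) = alpha + beta*x + gamma*y to noisy samples.
rng = np.random.default_rng(0)
m = 200
x, y = rng.uniform(size=m), rng.uniform(size=m)
f = 1.0 + 2.0 * x - 0.5 * y + 1e-3 * rng.standard_normal(m)

A = np.column_stack([np.ones(m), x, y])          # the overdetermined system above

# QR approach: minimize ||Ax - b||_2 by solving R x = Q1^T b
Q1, R = np.linalg.qr(A, mode='reduced')
coef_qr = np.linalg.solve(R, Q1.T @ f)

# Normal equations: A^T A x = A^T b (squares the condition number of A)
coef_ne = np.linalg.solve(A.T @ A, A.T @ f)

print(coef_qr)
print(coef_ne)
```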

min∥x∥2, Ax = b. (2.9)

The solution of this problem can be achieved by computing the QR factorization of AT

[ ] R [Q Q ] = AT 1 2 0

n×m n×(n−m) where Q1 ∈ R and Q2 ∈ R . Then

[ ] z Ax = RT QT x = [RT 0] 1 = b z2

and the minimum 2-norm solution follows by setting z2 = 0. Note that Q2 is an orthog- onal basis for N (A) and, thus, the minimum 2-norm solution is computed by removing for any admissible solution x˜ all of its components in N (A).


2.1.3.1 Householder QR decomposition

The QR decomposition of a matrix can be computed in different ways; the use of Givens rotations [84], Householder reflections [102] or the Gram-Schmidt orthogonalization [142] are among the most commonly used and best known ones. We will not cover here the use of Givens rotations, the Gram-Schmidt orthogonalization and their variants and refer, instead, the reader to classic linear algebra textbooks such as Golub et al. [86] or Björck [36] for an exhaustive discussion of such methods. We focus, instead, on the QR factorization based on Householder reflections, which has become the most commonly used method especially because of the availability of algorithms capable of achieving very high performance on processors equipped with memory hierarchies. For a given vector u, a Householder reflection is defined as

H = I − 2 (uu^T)/(u^T u) (2.10)

and u is commonly referred to as the Householder vector. It is easy to verify that H is symmetric and orthogonal. Because P_u = (uu^T)/(u^T u) is a projector over the space of u, Hx can be regarded as the reflection of a vector x on the hyperplane that has normal vector u. This is depicted in Figure 2.2 (left).

Figure 2.2: Householder reflection

The Householder reflection can be defined in such a way that Hx has all zero coefficients except the first, which means to say that Hx = ±∥x∥2e1 where e1 is the first unit vector. This can be achieved if Hx is the reflection of x on the hyperplane that bisects the angle between x and ±∥x∥2e1, as illustrated in Figure 2.2 (right). This hyperplane is orthogonal to the difference of these two vectors x ∓ ∥x∥2e1 which can thus be used to construct the vector

u = x ∓ ∥x∥2e1. (2.11)

Note that, if, for example, x is close to a multiple of e1, then ∥x∥2 ≈ x(1) which may lead to a dangerous cancellation in Equation (2.11); to avoid this problem u is commonly chosen as

u = x + sign(x1)∥x∥2e1. (2.12)


In practice, it is very convenient to scale u in such a way that its first coefficient is equal to 1 (more on this will be said later). Assuming v = u/u1, through some simple manipulations, the Householder transformation H is defined as

H = I − τ vv^T, where τ = (sign(x_1) x_1 + ∥x∥_2) / ∥x∥_2. (2.13)

HA = (I − τvvT )A = A − τv(vT A) (2.14) The Householder reflection can be computed and applied as described above using, respectively, the _larfg and _larf routines in the LAPACK[20] library. A sequence of Householder transformations can be used to zero-out all the coefficients below the diagonal of a dense matrix to compute its QR factorization:

H_n H_{n-1} ... H_2 H_1 A = R, where H_n H_{n-1} ... H_2 H_1 = Q^T.

Each transformation H_k annihilates all the coefficients below the diagonal of column k and modifies all the coefficients in the trailing submatrix A_{k:m,k+1:n}. The total cost of this algorithm is 2n^2(m − n/3). The Q matrix is implicitly represented by means of the v_k vectors and the τ_k coefficients. One extra array has to be allocated to store the τ_k coefficients, whereas the v_k vectors can be stored inside matrix A in the same memory as the zeroed-out coefficients; this is possible because the v_k have been scaled as described above and thus the unit coefficients along the diagonal need not be explicitly stored. The LAPACK _geqr2 routine implements this method, which is reported in Algorithm 2.4. In this algorithm, line 2 computes the r_{k,k} coefficient (equal, up to its sign, to ∥A_{k:m,k}∥_2), the v reflector and the τ coefficient as in Equations (2.12) and (2.13); these are stored, respectively, in a_{k,k}, A_{k+1:m,k} (the first coefficient of v, being equal to one, is not stored explicitly) and t_k. This transformation is then applied to the trailing submatrix through the operation on line 4 where the a_{k,k} coefficient is assumed to be one.

Algorithm 2.4 Householder QR factorization.
1: for k = 1, ..., n do
2:   A_{k:m,k}, t_k = house(A_{k:m,k})
3:   for j = k + 1, ..., n do
4:     A_{k:m,j} = (I − t_k A_{k:m,k} A_{k:m,k}^T) A_{k:m,j}
5:   end for
6: end for
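A compact NumPy sketch of the house kernel of Equations (2.12)-(2.13) and of Algorithm 2.4; the function names and the explicit reconstruction of Q at the end are only there to check the result and are not part of the algorithm as described above.

```python
import numpy as np

def house(x):
    """Householder vector v (v[0] = 1) and coefficient tau as in Eqs. (2.12)-(2.13),
    such that (I - tau v v^T) x = -sign(x1) ||x||_2 e1. Assumes x is nonzero."""
    normx = np.linalg.norm(x)
    s = 1.0 if x[0] >= 0 else -1.0
    u = x.copy(); u[0] += s * normx
    v = u / u[0]
    tau = (s * x[0] + normx) / normx
    return v, tau

def householder_qr(A):
    """Unblocked Householder QR (a sketch of Algorithm 2.4): the reflectors are
    stored below the diagonal, as _geqr2 does, and the taus in a separate array."""
    A = A.copy()
    m, n = A.shape
    taus = np.zeros(n)
    for k in range(n):
        v, tau = house(A[k:, k])
        taus[k] = tau
        A[k:, k:] -= tau * np.outer(v, v @ A[k:, k:])   # apply H_k to the trailing part
        A[k+1:, k] = v[1:]                              # store the reflector below R
    return A, taus

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 4))
F, taus = householder_qr(A)
R = np.triu(F[:4, :])
# rebuild Q explicitly from the stored reflectors to check that A = Q R
Q = np.eye(7)
for k in range(4):
    v = np.r_[np.zeros(k), 1.0, F[k+1:, k]]
    Q = Q @ (np.eye(7) - taus[k] * np.outer(v, v))
print(np.allclose(Q[:, :4] @ R, A))
```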

The following results define the stability of the QR factorization.

Theorem 2.2 Björck [36, Remark 2.4.2].— Let R¯ denote the computed R. It can be shown that there exists an exactly orthogonal matrix Q¯ (not the computed Q) such that

A + E = Q̄ R̄, ∥E∥_F ≤ c_1 u ∥A∥_F,

where the error constant c_1 = c_1(m, n) is a polynomial in m and n, ∥·∥_F denotes the Frobenius norm and u the unit roundoff.


In other words, the Householder QR factorization is normwise backward stable. The use of a QR factorization by means of a sequence of orthogonal transformations to solve least-squares problems was introduced by Golub [85]; this method is also proven to be backward stable:

Theorem 2.3 Björck [36, Remark 2.4.8].— Golub’s method for solving the standard least squares problem is normwise backward stable. The computed solution xˆ can be shown to be the exact solution of a slightly perturbed least squares problem

min_x ∥(A + δA)x − (b + δb)∥_2,

where the perturbations satisfy the bounds

∥δA∥_2 ≤ c u n^{1/2} ∥A∥_2, ∥δb∥_2 ≤ c u ∥b∥_2

and c = (6m − 3n + 41)n.

Despite these very favorable numerical properties, the QR factorization is rarely preferred to the Gaussian Elimination (or LU factorization) with Partial Pivoting (GEPP) for the solution of square systems because its cost is twice the cost of GEPP and because partial pivoting is considered stable in most practical cases.

2.1.4 Blocked variants

The above discussed factorization algorithms, commonly referred to as point factorizations, are actually never directly used in practice because they can only achieve a modest fraction of the peak performance of a modern processor. This is due to the fact that most of the computations, which are done in the application of elementary transformations to the trailing submatrix as in line 5 of Algorithm 2.2, are based on Level-2 (i.e., matrix-vector) Basic Linear Algebra Subroutines (BLAS) operations and thus limited by the speed of the memory rather than the speed of the processor. In order to overcome this limitation and considerably improve the performance of dense matrix factorizations on modern computers equipped with memory hierarchies, blocking techniques are commonly employed: these consist in accumulating multiple elementary transformations and applying them at once by means of Level-3 BLAS operations. The resulting factorization algorithms are commonly referred to as blocked factorizations.

2.1.4.1 Blocked LU factorization

Assume a square matrix A is partitioned as follows

A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}

where A_{1,1} and A_{2,2} are square submatrices of size b ≪ n and n − b, respectively. Executing the first b iterations of the outer loop in Algorithm 2.2 leads us to the following situation

P_{1..b} A = \begin{bmatrix} L_{1,1} & 0 \\ \hat{L}_{2,1} & I \end{bmatrix} \begin{bmatrix} U_{1,1} & U_{1,2} \\ 0 & \tilde{A}_{2,2} \end{bmatrix} (2.15)


where P_{1..b} = P_b ··· P_1 and \tilde{A}_{2,2} is the so-called Schur complement. Computing the remaining n − b iterations amounts to computing P_{b+1..n-1} \tilde{A}_{2,2} = L_{2,2} U_{2,2}, which yields the complete LU factorization

PA = \begin{bmatrix} L_{1,1} & 0 \\ L_{2,1} & L_{2,2} \end{bmatrix} \begin{bmatrix} U_{1,1} & U_{1,2} \\ 0 & U_{2,2} \end{bmatrix}

where P = P_{b+1..n-1} P_{1..b} and L_{2,1} = P_{b+1..n-1} \hat{L}_{2,1}. Note that we could get to the same situation as in Equation (2.15) through the following steps

P_{1..b} \begin{bmatrix} A_{1,1} \\ A_{2,1} \end{bmatrix} = \begin{bmatrix} L_{1,1} \\ \hat{L}_{2,1} \end{bmatrix} U_{1,1} (2.16a)

\begin{bmatrix} \hat{A}_{1,2} \\ \hat{A}_{2,2} \end{bmatrix} = P_{1..b} \begin{bmatrix} A_{1,2} \\ A_{2,2} \end{bmatrix} (2.16b)

U_{1,2} = L_{1,1}^{-1} \hat{A}_{1,2} (2.16c)

\tilde{A}_{2,2} = \hat{A}_{2,2} − \hat{L}_{2,1} U_{1,2} (2.16d)

Step (2.16a) is referred to as the panel factorization or panel reduction and can be achieved by running Algorithm 2.2 on [A_{1,1}^T A_{2,1}^T]^T, step (2.16b) consists in simple row permutations, step (2.16c) is a triangular system solve with multiple right-hand sides and, finally, step (2.16d) is a rank-b update or, simply, a matrix-matrix product; these last three steps are commonly called the trailing submatrix update. The rest of the factorization can be achieved by applying recursively the same partitioning and steps to \tilde{A}_{2,2}. The main advantage of this approach with respect to the point LU factorization of Algorithm 2.2 is that, if b ≪ n, the vast majority of the floating-point operations are done in steps (2.16c) and (2.16d), which are Level-3 BLAS matrix-matrix operations; these can run very close to the peak performance of a processor thanks to the favorable ratio between computations and memory accesses. The blocked LU factorization is implemented in the _getrf LAPACK routine. This blocking technique can be applied to the Cholesky factorization in a straightforward way and, although more difficult, it can also be applied to the LDL^T factorization; the Cholesky and LDL^T blocked factorizations are implemented, respectively, in the _potrf and _sytrf LAPACK routines.
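A minimal NumPy/SciPy sketch of the blocked LU factorization following steps (2.16a)-(2.16d), with scipy.linalg.lu_factor used as the panel kernel; the function name, block size and random test matrix are assumptions of this sketch, not part of the original text.

```python
import numpy as np
from scipy.linalg import lu_factor, solve_triangular

def blocked_lu(A, b=64):
    """Right-looking blocked LU with partial pivoting (a sketch of (2.16a)-(2.16d)).
    Returns the factors packed in one array (unit L below the diagonal, U above)
    and the row permutation applied to A."""
    A = A.copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, b):
        kb = min(b, n - k)
        # (2.16a) panel factorization of the current block column
        panel, p = lu_factor(A[k:, k:k + kb])
        A[k:, k:k + kb] = panel
        # (2.16b) apply the panel's row swaps to the rest of the matrix
        for i, pi in enumerate(p):
            if pi != i:
                A[[k + i, k + pi], :k] = A[[k + pi, k + i], :k]
                A[[k + i, k + pi], k + kb:] = A[[k + pi, k + i], k + kb:]
                piv[[k + i, k + pi]] = piv[[k + pi, k + i]]
        if k + kb < n:
            # (2.16c) U12 = L11^{-1} A12: triangular solve with multiple right-hand sides
            A[k:k + kb, k + kb:] = solve_triangular(
                A[k:k + kb, k:k + kb], A[k:k + kb, k + kb:],
                lower=True, unit_diagonal=True)
            # (2.16d) Schur complement update: a Level-3 matrix-matrix product
            A[k + kb:, k + kb:] -= A[k + kb:, k:k + kb] @ A[k:k + kb, k + kb:]
    return A, piv

rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
F, piv = blocked_lu(M, b=32)
L, U = np.tril(F, -1) + np.eye(200), np.triu(F)
print(np.allclose(L @ U, M[piv]))   # PA = LU
```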

2.1.4.2 Blocked QR factorization

Blocking is more complex for the Householder QR factorization because it is not easy to accumulate elementary transformations. Schreiber et al. [143] proposed a way of accumulating multiple Householder transformations and applying them at once by means of Level-3 BLAS operations.

Theorem 2.4 Compact WY representation (Adapted from Schreiber et al. [143]).— Let Q = H_1 ... H_{k-1} H_k, with H_i ∈ R^{m×m} a Householder transformation defined as in Equation (2.13) and k ≤ m. Then, there exist an upper triangular matrix T ∈ R^{k×k} and a matrix V ∈ R^{m×k} such that Q = I + V T V^T.


Proof. The proof is by induction on k. The case k = 0 is straightforward. Now assume that a matrix Q_k = H_1 ... H_{k-1} H_k has a compact WY representation Q_k = I + V_k T_k V_k^T and consider

Q_{k+1} = Q_k H_{k+1} = Q_k (I − τ_{k+1} v_{k+1} v_{k+1}^T).

Then a compact WY representation for Q_{k+1} is given by Q_{k+1} = I + V_{k+1} T_{k+1} V_{k+1}^T where

T_{k+1} = \begin{bmatrix} T_k & -\tau_{k+1} T_k V_k^T v_{k+1} \\ 0 & -\tau_{k+1} \end{bmatrix}, \qquad V_{k+1} = \begin{bmatrix} V_k & v_{k+1} \end{bmatrix} (2.17)

which concludes the proof.

Using this technique, matrix A can be logically split into ⌈n/b⌉ panels (block-columns) of size b and the QR factorization achieved in the same number of steps where, at step k, panel k is factorized using the _geqr2 routine, the corresponding T matrix is built as in Equation (2.17) using the _larft routine and then the set of b transformations is applied at once to the trailing submatrix through the _larfb routine. This last operation/routine is responsible for most of the flops in the QR factorization: because it is based on matrix-matrix operations it can achieve a considerable fraction of the processor's peak performance. This method is implemented in the LAPACK _geqrf routine which uses an implicitly defined blocking size b and discards the T matrices computed for each panel. More recently, the _geqrt routine has been introduced in LAPACK which employs the same algorithm but takes the block size b as an additional argument and returns the computed T matrices.
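The sketch below builds the T factor of the compact WY representation with the recurrence of Equation (2.17) and checks that I + V T V^T reproduces the accumulated product of reflectors; the Householder generation repeats, inline, the kernel sketched after Algorithm 2.4, and all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
m, n = A.shape

# Generate the n reflectors of the QR factorization and accumulate Q = H1 H2 ... Hn.
V = np.zeros((m, n)); taus = np.zeros(n)
Q = np.eye(m); R = A.copy()
for k in range(n):
    x = R[k:, k]
    s = 1.0 if x[0] >= 0 else -1.0
    u = x.copy(); u[0] += s * np.linalg.norm(x)
    v = u / u[0]
    tau = (s * x[0] + np.linalg.norm(x)) / np.linalg.norm(x)
    V[k:, k] = v; taus[k] = tau
    H = np.eye(m); H[k:, k:] -= tau * np.outer(v, v)
    R = H @ R                      # eliminate below the diagonal of column k
    Q = Q @ H                      # Q = H1 H2 ... Hn

# Recurrence (2.17): T_{k+1} = [[T_k, -tau_{k+1} T_k V_k^T v_{k+1}], [0, -tau_{k+1}]]
T = np.zeros((n, n))
T[0, 0] = -taus[0]
for j in range(1, n):
    T[:j, j] = -taus[j] * (T[:j, :j] @ (V[:, :j].T @ V[:, j]))
    T[j, j] = -taus[j]

print(np.allclose(np.eye(m) + V @ T @ V.T, Q))   # compact WY representation of Q
# The block reflector can then be applied to a trailing submatrix C with two GEMMs:
# Q^T C = C + V @ (T.T @ (V.T @ C)), which is what _larfb does, up to conventions.
```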

2.1.4.3 QR factorization with pivoting

In case the A matrix is rank deficient, the R factor resulting from the QR factorization will have zeros on the diagonal in those columns that are linearly dependent on previously eliminated ones. This result, however, cannot be used for solving a linear system because the triangular system solve involving R will fail with divisions by zero. In such cases, a QR factorization with column pivoting [45] can be used: at each step k, the column whose norm is the highest is brought into position k with a column swap. This technique yields the factorization

Q^T A Π = \begin{bmatrix} R_{1,1} & R_{1,2} \\ 0 & R_{2,2} \end{bmatrix}

where R_{1,1} is of size r × r, R_{1,2} of size r × (n − r) and R_{2,2} of size (m − r) × (n − r).

In exact arithmetic R_{2,2} = 0 and r is the rank of A; in finite arithmetic R_{2,2} is considered to be zero if it is small enough, for example, if its 2-norm or its top-left coefficient (R_{2,2})_{1,1} is smaller than a prescribed threshold. Assuming x is split into two parts, x_1 containing the first r elements and x_2 containing the remaining n − r ones, the solution of the rank-deficient least-squares problem can be computed as explained for Equation (2.8) by setting x_2 = 0. Computing the column-pivoted QR factorization requires the norms of all the columns in the trailing submatrix at each step. Instead of computing these values at each step, it is possible to compute them once before the factorization and only update them after each Householder elimination. Let c be a vector of size n containing the square of the 2-norms


of the columns of A, c_i = ∥A_{:,i}∥_2^2, and assume one Householder reflection is applied to A in order to annihilate all the coefficients in the first column except the first. Because this is a unitary transformation, the c vector can be updated to store the squared 2-norms of the columns of the trailing submatrix A^{(1)} with the simple operation c_i = c_i − r_{1,i}^2. Note that, because all the column norms have to be updated at each step, blocking techniques cannot be used to their full extent and thus, in the column-pivoted QR factorization, a large part of the operations is done in Level-2 BLAS routines.
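A short illustration of rank detection with the column-pivoted QR factorization, using scipy.linalg.qr with pivoting=True (which returns Q, R and the column permutation); the rank-deficient test matrix and the tolerance choice are hypothetical.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
# Build a 50x10 matrix of rank 7: the last 3 columns are combinations of the first 7.
B = rng.standard_normal((50, 7))
A = np.hstack([B, B @ rng.standard_normal((7, 3))])

Q, R, piv = qr(A, pivoting=True)        # A[:, piv] = Q @ R
d = np.abs(np.diag(R))                  # non-increasing thanks to column pivoting
tol = d[0] * max(A.shape) * np.finfo(float).eps
print("estimated rank:", int(np.sum(d > tol)))   # -> 7
```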

2.2 Sparse matrix factorizations

The factorization of a sparse matrix is considerably more complex than its dense counterpart because special care must be paid to the sparsity structure of the matrix in order to avoid useless computations on zero coefficients. In this document we will assume that square matrices are structurally symmetric; this greatly simplifies the handling of dependencies between computations and implies a number of other favorable structural properties. Note that, in the case where the structure of the matrix is not symmetric, it is possible to "symmetrize" it by explicitly storing zero coefficients in order to make sure that if a_{i,j} is present then a_{j,i} is as well. Of course, if the structure of the original matrix is strongly unsymmetric, many zero coefficients have to be added to the structure in order to symmetrize it, which may result in an excessive memory and computational overhead; for these cases dedicated unsymmetric factorization methods can be employed and we will briefly mention them.

2.2.1 The Cholesky multifrontal method

Consider a symmetric, positive definite matrix A with the sparsity structure shown in Figure 2.3 (left) and suppose we execute Algorithm 2.3 on it. The structure of the resulting L factor is reported in Figure 2.3 (right). Two interesting observations can be made:

1. Eliminating one variable through one iteration of the outer loop in the Cholesky algorithm does not necessarily imply updating all the coefficients in the trailing submatrix. For example, when the first iteration is executed, only the coefficients in columns 1 (obviously), 4 and 9 are modified; this is because for all the other columns j, lj,1 is equal to zero and, therefore, the update operation ai,j = ai,j − lj,1li,1 has no effect.

2. Some of the coefficients that were zero in A have been turned into nonzeros by the factorization; for example, when the second iteration of the Cholesky factorization is executed, a4,3 becomes nonzero due to the update a4,3 = a4,3 − l4,2 l3,2. As a result the structure of L + L^T is denser than that of A; this phenomenon is referred to as fill-in. Because of fill-in, the cost of a sparse factorization can be very high, both in terms of memory and time, even for very sparse matrices.

These two properties can be more formally stated by Propositions 2.2 and 2.3.

Proposition 2.2— For j > k the numerical values of column j depend on column k, denoted k → j, if and only if lj,k ≠ 0.

By saying that column j depends on column k we mean that eliminating variable k updates the numerical values of column j. This implies that the second term on the


Figure 2.3: The structure of a symmetric sparse matrix A and of the associated Cholesky factor L. The fill-in in L is represented with underlined letters.


Proposition 2.3— Let i > j > k. If column j and column i depend on column k then column i depends on column j.

Proof. The fact that both columns j and i depend on column k implies that lj,k ≠ 0 and li,k ≠ 0. Therefore, li,j will be nonzero as a result of the update on line 6 of Algorithm 2.3.

As a consequence of Proposition 2.3 it is easy to see that a fill-in coefficient li,j is created if ai,j = 0 and li,k and lj,k are nonzero for some k < i, j. The canonical tools for formalizing the theory of sparse matrix factorizations and characterizing the fill-in are graphs: a sparse matrix A can be represented by the graph G(A) = (V, E) whose vertex set V = {1, 2, ..., n} includes the unknowns of A and whose edge set E = {(i, j) | ai,j ≠ 0} includes an edge for each nonzero coefficient of A. In our case, the symmetry of the matrix implies that if (i, j) is in E then so is (j, i); therefore we will only represent one of these edges and the graph will be undirected. The graph G(A) is also commonly called the adjacency graph of A. Figure 2.4 shows, on the top-left, the adjacency graph for the matrix in Figure 2.3.

It is possible to use the adjacency graph to model the factorization of a sparse matrix. Specifically, we can build a sequence of graphs G(A) = G0(A), G1(A), ..., Gn−1(A), called elimination graphs, such that Gk(A) is the adjacency graph of the trailing submatrix after elimination of variables 1, 2, ..., k. Elimination graph Gk(A) is built from Gk−1(A) by removing node k as well as all its incident edges and by updating the connectivity of the remaining nodes. By Proposition 2.3, for any two nodes i and j neighbors of k we have to add an edge connecting them if it does not exist already; this models the occurrence of a new fill-in coefficient. In other words, upon elimination of node k, a clique is added to the graph which includes all the neighbors of k. Algorithm 2.5 shows the elimination graphs process and Figure 2.4 shows the elimination graphs for the matrix in Figure 2.3.

The elimination graphs process suggests that, if in G(A) there is a path i, k1, k2, ..., kp, j with ks < min(i, j) ∀s, then li,j must necessarily be nonzero. This is because whenever one of the ks nodes is eliminated, its two neighbors along the path become connected (if they weren't already); therefore, when the last of these ks nodes is eliminated, i and j become connected. This intuition leads to a very well known result by Rose, Tarjan and Lueker [139]:

20 2.2. Sparse matrix factorizations

Figure 2.4: Adjacency graph (top-left), elimination graphs and filled graph (bottom-right) of matrix A in Figure 2.3.

Theorem 2.5 (Rose et al. [139]).— Let i > j. Then li,j ≠ 0 if and only if there exists a path j, k1, k2, ..., kp, i in G(A) such that ks < min(i, j) for all s.

Although the “if” part of this theorem can easily be understood following the idea described above, the proof of the “only if” part is more complex and we refer the reader to the original paper for its description.

Algorithm 2.5 Elimination graphs process.
G0(A) ← G(A) = (V, E)
for k = 1, ..., n − 1 do
  V ← V − {k}
  E ← E − {(k, l) : l ∈ adj(k)} ∪ {(i, j) : i ∈ adj(k) and j ∈ adj(k)}
  Gk(A) ← (V, E)
end for
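This process lends itself to a direct, if deliberately naive, Python transcription which can be used to compute the filled graph, i.e., the structure of L + L^T. The representation of the graph as a dictionary of neighbor sets is an assumption made only for this sketch; production codes rely on far more compact quotient-graph data structures.

def symbolic_fill(adj):
    # adj: dict mapping each node (numbered 0..n-1 in elimination order) to the
    # set of its neighbors in G(A)
    n = len(adj)
    g = {i: set(adj[i]) for i in range(n)}        # working elimination graph
    filled = {i: set(adj[i]) for i in range(n)}   # will hold the filled graph
    for k in range(n):
        nbrs = [i for i in g[k] if i > k]         # not-yet-eliminated neighbors of k
        for i in nbrs:
            g[i].discard(k)                       # remove k and its incident edges
            for j in nbrs:                        # connect the neighbors of k into a clique
                if i != j and j not in g[i]:
                    g[i].add(j); g[j].add(i)
                    filled[i].add(j); filled[j].add(i)   # this is a fill-in edge
    return filled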

The graph Gf(A), obtained by adding to G(A) all the edges associated with fill-in coefficients, is called the filled graph and corresponds to the adjacency graph of L + L^T. For the matrix in Figure 2.3 the filled graph is reported in the bottom-right of Figure 2.4. The directed filled graph G⃗f(A) is obtained from the filled graph assuming that an edge (i, j) is directed from i to j if j > i. This graph cannot have cycles, for which reason it is called a Directed Acyclic Graph (DAG).

Note that G⃗f(A) represents the dependencies among the columns of the matrix as defined in Proposition 2.2 and its transitive closure states whether a variable can be eliminated before another. Note that, in order to establish a dependency between columns i and j, we are only interested in knowing whether there exists a path in G⃗f(A) connecting nodes i and j; therefore the directed filled graph can be simplified by keeping the longest path connecting i and j and discarding all the others. This amounts to computing the transitive reduction of G⃗f(A), which we call T(A). Note that T(A) is equivalent to the directed graph associated with the matrix obtained by removing from each column of L all the coefficients below the diagonal except the first, which can easily be seen using Proposition 2.3: assume that li,k ≠ 0 and lj,k ≠ 0 with i > j > k; because we know that li,j must necessarily be nonzero, we can suppress the edge associated with li,k because the dependency k → i is indirectly represented by the chain of dependencies k → j → i. Schreiber [144] shows that, if the matrix A is not reducible, this graph is an ordered tree with root n (hence the choice of the letter T for denoting it). T(A) is the most compact way of representing all the column dependencies in the factorization of a sparse matrix and is commonly referred to as the elimination tree.

Definition 2.1 Elimination tree [144].— Let A be a sparse, symmetric positive definite matrix of size n with Cholesky factor L. The elimination tree is defined to be the structure with n nodes {1, 2, ..., n} such that the node p is the parent of node j if and only if

p = min{i > j | li,j ≠ 0}.

From this definition and the theory discussed above, the two following theorems can be derived.

Theorem 2.6 ([144]).— If lj,k ≠ 0 and k < j, then the node k is a descendant of j in the elimination tree.

Theorem 2.7 ([119]).— If node k is a descendant of node j in the elimination tree, then the structure of the vector [lj,k, ..., ln,k]^T is contained in the structure of [lj,j, ..., ln,j]^T.
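The following small Python sketch builds the elimination tree directly from Definition 2.1; it assumes that the structure of L (fill-in included) is available as a list of sets of row indices, one per column, which is convenient for illustration but not how practical codes proceed (they compute the tree from the structure of A, e.g., with Liu's algorithm and path compression).

def elimination_tree(col_struct):
    # col_struct[j]: set of row indices i >= j such that l_{i,j} != 0
    n = len(col_struct)
    parent = [None] * n                       # None marks a root of the tree (or forest)
    for j in range(n):
        below = [i for i in col_struct[j] if i > j]
        if below:
            parent[j] = min(below)            # parent(j) = min{ i > j | l_{i,j} != 0 }
    return parent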

Based on the theory discussed in the previous section, various sparse matrix factorization techniques can be defined. Among the most commonly used ones is the multifrontal method introduced by Duff et al. [66]. The following discussion of the multifrontal method is extracted from a 1992 paper by Liu [119]. Let A be a sparse, SPD matrix, L its Cholesky factor and i0, i1, ..., ir be the row subscripts of column j of L, with i0 = j. We denote with T the associated elimination tree and with Tj the subtree rooted at node j. The following two definitions lay the foundations of the multifrontal method.

Definition 2.2 Subtree update matrix [119].— The j-subtree update matrix is

\bar{U}_j = - \sum_{k \in T_j - \{j\}} \begin{bmatrix} l_{j,k} \\ l_{i_1,k} \\ \vdots \\ l_{i_r,k} \end{bmatrix} \begin{bmatrix} l_{j,k} & l_{i_1,k} & \cdots & l_{i_r,k} \end{bmatrix}


Figure 2.5: The directed filled graph G⃗f(A), the elimination tree T(A) and the structure of the Cholesky factor L of the matrix A in Figure 2.3.

Definition 2.3 Frontal matrix [119].— The jth frontal matrix (or front) Fj for A is defined to be

F_j = \begin{bmatrix} a_{j,j} & a_{j,i_1} & \cdots & a_{j,i_r} \\ a_{i_1,j} & & & \\ \vdots & & 0 & \\ a_{i_r,j} & & & \end{bmatrix} + \bar{U}_j

The subtree update matrix contains all the updates related to nodes that are descendants of j in the elimination tree; these are all the updates related to elimination steps k that modify column j either directly, if lj,k ≠ 0, or indirectly, if lj,k = 0. For this reason, when Fj is computed, its first row and column are said to be fully updated (or fully summed or fully assembled) in the sense that they have received all the updates related to previous elimination steps and, therefore, correspond to row and column j of the trailing submatrix at step j of the factorization. As a result, a single elimination step on Fj generates column j of the factor L:

F_j = \begin{bmatrix} l_{j,j} & 0 & \cdots & 0 \\ l_{i_1,j} & & & \\ \vdots & & I & \\ l_{i_r,j} & & & \end{bmatrix} \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & U_j & \\ 0 & & & \end{bmatrix} \begin{bmatrix} l_{j,j} & l_{i_1,j} & \cdots & l_{i_r,j} \\ 0 & & & \\ \vdots & & I & \\ 0 & & & \end{bmatrix}    (2.18)

where Uj is a Schur complement which is commonly referred to as the update matrix (or contribution block). Note that the first row and column of Fj are full, which necessarily


implies that Uj is also full; therefore Fj can be conveniently stored as a dense matrix and the decomposition in Equation (2.18) can be computed with a single iteration of the outer loop of Algorithm 2.3. The next theorem follows from the definition of Fj and Equation (2.18).

Theorem 2.8 Liu [119].—

U_j = - \sum_{k \in T_j} \begin{bmatrix} l_{i_1,k} \\ \vdots \\ l_{i_r,k} \end{bmatrix} \begin{bmatrix} l_{i_1,k} & \cdots & l_{i_r,k} \end{bmatrix}    (2.19)

Proof. Equation (2.18) can also be written

F_j = \begin{bmatrix} l_{j,j} \\ l_{i_1,j} \\ \vdots \\ l_{i_r,j} \end{bmatrix} \begin{bmatrix} l_{j,j} & l_{i_1,j} & \cdots & l_{i_r,j} \end{bmatrix} + \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & U_j & \\ 0 & & & \end{bmatrix}

Note that Fj without the first row and column is equal to \bar{U}_j without the first row and column and therefore we can write

- \sum_{k \in T_j - \{j\}} \begin{bmatrix} l_{i_1,k} \\ \vdots \\ l_{i_r,k} \end{bmatrix} \begin{bmatrix} l_{i_1,k} & \cdots & l_{i_r,k} \end{bmatrix} = \begin{bmatrix} l_{i_1,j} \\ \vdots \\ l_{i_r,j} \end{bmatrix} \begin{bmatrix} l_{i_1,j} & \cdots & l_{i_r,j} \end{bmatrix} + U_j

which concludes the proof.

One last result is still needed to specify the multifrontal method but this requires the definition of one more tool. Consider two matrices R and S with index sets, respectively, Ir = {i1, ..., ir} and Is = {i1, ..., is}; Ir and Is may have a non-empty intersection and their union is It = Ir ∪ Is. By adding empty rows and columns, R and S can be extended to conform with It. We define T = R ↔↕ S to be the t × t matrix, with t = |It|, obtained by summing the extended R and S matrices, and we refer to the ↔↕ matrix operator as the extend-add operator.
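A minimal Python sketch of the extend-add operator on dense matrices follows; R and S are assumed to be stored together with the sorted lists of global indices Ir and Is that label their rows and columns, an assumption made only for this illustration.

import numpy as np

def extend_add(R, Ir, S, Is):
    It = sorted(set(Ir) | set(Is))            # union of the two index sets
    pos = {g: t for t, g in enumerate(It)}
    T = np.zeros((len(It), len(It)))
    pr = [pos[g] for g in Ir]                 # where the rows/columns of R land in T
    ps = [pos[g] for g in Is]                 # where the rows/columns of S land in T
    T[np.ix_(pr, pr)] += R                    # sum the two extended matrices
    T[np.ix_(ps, ps)] += S
    return T, It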

Theorem 2.9 Liu [119].— Let nodes c1, ..., cs be the children of node j in the elimination tree. Then

F_j = \begin{bmatrix} a_{j,j} & a_{j,i_1} & \cdots & a_{j,i_r} \\ a_{i_1,j} & & & \\ \vdots & & 0 & \\ a_{i_r,j} & & & \end{bmatrix} ↔↕ U_{c_1} ↔↕ \cdots ↔↕ U_{c_s}    (2.20)

Proof. Because c1, ..., cs are the children of node j, Tj − {j} corresponds to the disjoint union of all the nodes in T_{c_1}, ..., T_{c_s}. The result follows from the definition of the subtree update matrix \bar{U}_j and Equation (2.19).

The multifrontal method can finally be defined as in Algorithm 2.6. The first four steps of this algorithm on the matrix of Figure 2.3 are illustrated in Figure 2.6. The multifrontal method essentially achieves the factorization of a sparse matrix through a sequence of operations on relatively small dense matrices, i.e., the frontal matrices. This is an extremely favorable property because computations can be done using dense linear algebra kernels, which are highly efficient and can benefit from techniques such as vectorization.


Algorithm 2.6 The multifrontal method.
for j = 1, ..., n do
  INIT(Fj): Fj = 0
  ASM(Fj, A:,j): Fj = Fj ↔↕ A:,j
  for c = c1, ..., cs, children of j do
    ASM(Fj, Uc): Fj = Fj ↔↕ Uc
  end for
  FACTO(Fj):
    (Fj):,1 = (Fj):,1 / sqrt((Fj)1,1)
    Uj = (Fj)2:,2: − (Fj)2:,1 (Fj)2:,1^T
end for
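The following Python sketch puts the above pieces together for a deliberately simplified multifrontal Cholesky factorization: one elimination per front, no supernodes, no pivoting, and dense storage for the input matrix. It assumes that the elimination tree (parent) and the row structure of each column of L, fill-in included (row_struct, sorted lists starting with the diagonal index), have been computed beforehand, for instance with the sketches given earlier; none of this reflects the actual data structures of MUMPS or qr_mumps.

import numpy as np

def multifrontal_cholesky(A, parent, row_struct):
    n = A.shape[0]
    Lcols = {}                                  # columns of L, stored with their global indices
    updates = {j: [] for j in range(n)}         # contribution blocks waiting to be assembled
    for j in range(n):                          # natural order is a valid topological order here
        idx = row_struct[j]                     # global row indices of front j (idx[0] == j)
        F = np.zeros((len(idx), len(idx)))
        F[:, 0] = A[idx, j]                     # assemble row and column j of A
        F[0, :] = A[j, idx]
        for U, Ui in updates[j]:                # extend-add the children's contribution blocks
            p = [idx.index(g) for g in Ui]
            F[np.ix_(p, p)] += U
        Lj = F[:, 0] / np.sqrt(F[0, 0])         # single elimination step: column j of L
        Lcols[j] = (idx, Lj)
        U = F[1:, 1:] - np.outer(Lj[1:], Lj[1:])   # Schur complement (contribution block)
        if parent[j] is not None:
            updates[parent[j]].append((U, idx[1:]))
    return Lcols

In a real implementation the assembly is performed with indirect addressing into preallocated fronts and the eliminations are done with blocked BLAS-3 kernels on supernodes, as discussed below.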

Figure 2.6: The first four steps of the multifrontal method on the matrix of Figure 2.3.

The multifrontal method presented above, however, cannot take advantage of Level-3 BLAS routines because only one elimination step is applied to each front and thus most of the computations are done at line 9 of Algorithm 2.6, which is a Level-2 BLAS rank-1 update. This issue can be overcome by leveraging the concept of supernodes. A supernode is defined as a set of consecutive columns of L which have the same structure, apart from the related triangular block on the diagonal. Rather than having a frontal matrix for each one of these columns, they can be assembled and eliminated together in a single front; because these columns have the same structure, this does not imply any additional fill-in or computations. The advantage of using supernodes is that multiple eliminations can be done on each front and, thus, the blocking techniques

described in Section 2.1.4 can be employed; this also has the advantage that the size of the elimination tree can be considerably reduced, leading to more compact data structures and more efficient handling. Various techniques can be used to detect supernodes but they are generally referred to as amalgamation since multiple nodes of the elimination tree are merged together in a single supernode; the resulting amalgamated elimination tree is commonly referred to as the assembly tree. Note that it is also possible to amalgamate nodes related to columns of L that do not have the same structure; this necessarily implies additional fill-in but it may be worth paying this extra cost if computations can be sped up considerably. This is commonly referred to as relaxed amalgamation because it is guided by a relaxation parameter which defines what amount of extra fill-in is deemed acceptable. In the matrix of Figure 2.3, for example, nodes 3 and 4 can be amalgamated without additional fill-in because the corresponding columns have the same structure; node 2 can also be amalgamated to this supernode, but with an overhead represented by the fill-in coefficient l9,2.
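As an illustration, the following Python sketch detects (non-relaxed) supernodes by scanning the columns of L and merging a column into the current supernode when its structure equals that of the previous column deprived of its diagonal entry; col_struct, the per-column structure of L stored as sets, is an assumption of this sketch only.

def find_supernodes(col_struct):
    # col_struct[j]: set of row indices i >= j with l_{i,j} != 0 (fill-in included)
    n = len(col_struct)
    supernodes, start = [], 0
    for j in range(1, n + 1):
        if j == n or col_struct[j] != col_struct[j - 1] - {j - 1}:
            supernodes.append(list(range(start, j)))   # close the current supernode
            start = j
    return supernodes

Relaxed amalgamation would, in addition, merge a column whose structure is only close enough to that of the current supernode, at the price of some explicitly stored zeros.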

2.2.2 The QR multifrontal method

Although the first methods conceived for computing the QR factorization of a sparse matrix are based on the use of Givens rotations [75], techniques based on Householder reflections have subsequently become more popular due to their higher efficiency. Let A be a sparse matrix of size m×n with m ≥ n and suppose we apply the first step of an (unblocked) QR factorization to this matrix, that is, we want to compute a Householder reflection H that annihilates all the nonzero coefficients except one in the first column. Note that the corresponding Householder reflector, computed as in Equations (2.12) and (2.13), will have the same structure as the first column of A and therefore, when the reflection is applied to the trailing submatrix, only part of its coefficients will be updated. Specifically, only the coefficients along the rows that have a nonzero in the pivotal column (the first column in this case) and in the columns that have at least one nonzero in the same position as the pivotal column. This is formalized in Theorem 2.10; here, and in the remainder of this section, we denote by S(x) the structure of a vector, i.e., S(x) = {i | xi ≠ 0}.

Theorem 2.10 George et al. [76].— Consider HA = A − τ v (v^T A). Then (HA)_{i,:} where i ∉ S(v) is equal to row i of A. For any row i ∈ S(v), the nonzero pattern of (HA)_{i,:} is

S((HA)_{i,:}) = \bigcup_{j \in S(v)} S(A_{j,:})

That is, in HA, the nonzero pattern of any modified row i ∈ S(v) is replaced with the set union of all rows that are modified by the Householder reflection H.

The first step of the Householder QR on a sparse matrix is illustrated in Figure 2.7, where all the coefficients in the first column are annihilated except the first. This operation updates only rows 1, 6, 10 and 18 and, upon its execution, these rows will have nonzeros in columns {1, 4, 9} = S(A1,:) ∪ S(A6,:) ∪ S(A10,:) ∪ S(A18,:). Note that we have the freedom to choose which nonzero coefficient, among those in the pivotal column, to keep. Actually, any row permutation of A will always lead to the same QR factorization; the only possible difference may be in intermediate fill-in, depending on how the factorization is computed. On the right part of Figure 2.7 we report the final structure of the R factor and of the V matrix containing the computed Householder reflectors.
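The structure prediction of Theorem 2.10 can be sketched in a few lines of Python; row_struct maps each row index to the set of column indices of its nonzeros and Sv is the structure of the Householder vector, both being assumptions of this illustration.

def reflected_row_structures(row_struct, Sv):
    union = set()
    for i in Sv:                          # union of the structures of all touched rows
        union |= row_struct[i]
    new_struct = dict(row_struct)         # untouched rows keep their structure
    for i in Sv:
        new_struct[i] = set(union)        # every touched row gets the union
    return new_struct

For the matrix of Figure 2.7, S(v) = {1, 6, 10, 18} and the union of the corresponding row structures is {1, 4, 9}, in agreement with the discussion above.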


Figure 2.7: A sparse, overdetermined matrix on the left side; the nonzero coefficients of the matrix are denoted with the letter a. In the middle part, the first step of the Householder QR factorization, which modifies rows 1, 6, 10 and 18. After this step, the computed coefficients of the R factor and of the V matrix are denoted, respectively, with the letters r and v; fill-in coefficients are underlined. On the right side, the result of the complete QR factorization.

Most of the theory behind sparse QR factorizations relies on the so-called Strong Hall property which we report in Definition 2.4.

Definition 2.4 Strong Hall property [43].— A matrix A of size m × n, n ≥ 2 is Strong Hall if, for any 1 ≤ k < n, every set of k columns has nonzeros in at least k + 1 different rows.

Note that any matrix that is not strong Hall can be permuted to a block upper triangular form called the Dulmage-Mendelsohn decomposition (see, for example, the book by Brualdi et al. [42]) with diagonal blocks having this property. We report below a list of results that lay the foundation of the multifrontal QR method. For the sake of readability and without loss of generality, we will assume that a_{k,k}^{(k−1)} ≠ 0 structurally, that is to say, at elimination step k the diagonal coefficient is structurally nonzero. Note that whenever this condition does not hold, the matrix A can be permuted to have a nonzero diagonal if its structural rank (i.e., the maximum rank over all the matrices with the same sparsity pattern) is full; otherwise explicit zeros can be added to enforce it. Therefore we will assume that each Householder elimination step annihilates all the nonzero coefficients in the corresponding column except the diagonal one.

Theorem 2.11— Let A be a Strong Hall matrix. After one Householder elimination step, the trailing submatrix will also be Strong Hall.

Proof. Assume that there are s nonzeros in the first column of A, i.e., s = |S(A:,1)|. Now take, in A, any group of k + 1 columns including the first one. Because A is Strong Hall,

these k + 1 columns will have nonzeros in at least k + 2 different rows and, because the first column has s nonzeros, the remaining k columns have at least k + 2 − s nonzeros in rows other than S(A:,1). We want to show that upon the elimination of the first column, the remaining k columns have nonzeros in k + 1 different rows. If these columns do not have nonzeros in S(A:,1), then they will be unaffected by the Householder elimination and thus they will keep all of their (at least) k + 1 nonzero rows. If, instead, they have nonzeros in some rows of S(A:,1), they will certainly lose one nonzero row; however, because of Theorem 2.10, after the elimination they will have nonzeros in all the remaining s − 1 rows of S(A:,1). Therefore they will have nonzeros in at least (s − 1) + (k + 2 − s) = k + 1 rows, which concludes the proof.

Theorem 2.12— Let A be a Strong Hall matrix and QR = A its Householder QR factorization. Then, for k > j, column elimination k depends on column elimination j if and only if rj,k ≠ 0.

Proof. Suppose one Householder elimination step is applied to A in order to annihilate all the nonzeros in A:,1 except a1,1. This operation will compute R1,: and V:,1 but, because of Theorem 2.10, will also update all the coefficients with row index in S(A:,1) \ {1} and column index in S(R1,:) \ {1}; because A is Strong Hall, |S(A:,1)| > 1, which implies that this elimination step will update all and only those columns for which r1,k ≠ 0. The result follows because the trailing submatrix after this elimination step is still Strong Hall as stated by Theorem 2.11.

Theorem 2.13— Let A be a Strong Hall matrix, QR = A its Householder QR factorization and i > j > k. If column eliminations i and j depend on column elimination k, then column elimination i depends on column elimination j.

Proof. Because A^{(k−1)} is Strong Hall, upon elimination of column k, the trailing submatrix A^{(k)} will have nonzeros in columns j and i and rows S(A_{:,k}^{(k−1)}) \ {k}. This implies that, upon elimination of column j, rj,i ≠ 0, which concludes the proof.

Note that the R factor of the QR factorization of a matrix A is mathematically equivalent to the Cholesky factor of the normal equations matrix A^T A (which exists if A is full-rank) because A^T A = R^T Q^T Q R = R^T R. This may suggest that the structure of R can be computed through a symbolic analysis of the Cholesky factorization of A^T A, for example, by using the elimination graphs procedure explained in Section 2.2.1. This unfortunately does not hold in the general case because in the numerical Cholesky factorization of A^T A cancellations may occur: a nonzero coefficient in the trailing submatrix is turned into zero in the course of the factorization. Cancellations may be divided into two types: lucky and essential cancellations. The first type depends on the actual values of the coefficients of A and, therefore, is impossible to predict just by looking at its structure. The second type, however, depends solely on the structure of A and will always occur regardless of the actual numerical values. This is illustrated in Figure 2.8. Consider the sparse matrix illustrated on the left side of the figure: because A is an upper triangular matrix, its QR factorization is trivially given by Q = I and R = A and, therefore, R has the same structure as A, as shown in the middle part of the figure. Now, the (i, j) coefficient of A^T A is nonzero if in A there is a row k with ak,i and ak,j nonzero. Therefore, A^T A is a dense matrix with all nonzero coefficients because the first row of A is full; as a result, the symbolic Cholesky factorization will predict that R is a full, triangular matrix as shown on the right side of the figure. The numerical Cholesky factorization of A^T A, instead, will have essential cancellations on all the coefficients that correspond to the zero coefficients in the upper triangular part of A.


Figure 2.8: A sparse matrix A on the left, the actual R structure in the middle and the structure of R predicted by the symbolic Cholesky factorization of A^T A on the right.

Coleman et al. [50] showed that if A is a Strong Hall matrix, no essential cancellation occurs in the Cholesky factorization of A^T A; as a result, for this class of matrices, the structure of R can be correctly predicted by the symbolic Cholesky factorization of A^T A. George et al. [75] showed that for matrices not belonging to this category, the actual R factor will always fit in the structure computed by the symbolic Cholesky factorization of A^T A. Finally, as said above, non-Strong Hall matrices can always be permuted to block-triangular form with diagonal blocks having this property. It is, therefore, safe to assume that the structure of R can be computed through a symbolic Cholesky factorization of A^T A. As a consequence of all the results presented above, we can conclude that the most compact way of representing the dependencies between Householder elimination steps in the sparse QR factorization of a matrix A is the so-called column elimination tree, which corresponds to the elimination tree for the Cholesky factorization of A^T A. The structure of R and the column elimination tree can be computed without explicitly forming A^T A, as shown by Gilbert et al. [80]. Methods have also been proposed in the literature to compute the structure of the V matrix that holds the Householder reflectors [76, 80].

Based on the results reported above, it is possible to develop the multifrontal QR factorization: it consists in a traversal of the column elimination tree where, when a node is visited, a dense frontal matrix is formed through assembly operations that combine coefficients from A and from the child nodes, and it is reduced through a single elimination step. The assembly and elimination operations are clearly different with respect to the Cholesky factorization described in the previous section. The assembly of a frontal matrix Fj is achieved by means of an extend-stack operator ↔; the update matrices (or contribution blocks) from the child nodes are extended to make their column indices conform to those of the front and are stacked under those rows of A whose leftmost nonzero is in column j, denoted A_{F_j,*}:

F_j = A_{F_j,*} ↔ U_{c_1} ↔ \cdots ↔ U_{c_s} = \begin{bmatrix} \overleftrightarrow{A}_{F_j,*} \\ \overleftrightarrow{U}_{c_1} \\ \vdots \\ \overleftrightarrow{U}_{c_s} \end{bmatrix}

where \overleftrightarrow{X} denotes the column-extended matrix X. As for the elimination step, this corresponds to one single step of the dense Householder QR factorization of Algorithm 2.4:


Figure 2.9: Example of multifrontal QR factorization. On the left side, the factorized matrix (the same as in Figure 2.7 with a row permutation). On the upper-right part, the structure of the resulting R factor. On the bottom-right part, the elimination tree; the dashed boxes show how the nodes are amalgamated into supernodes.

(I − τ_j v_j v_j^T) F_j = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & U_j & \\ 0 & & & \end{bmatrix} \begin{bmatrix} r_{j,j} & r_{j,i_1} & \cdots & r_{j,i_r} \\ 0 & & & \\ \vdots & & I & \\ 0 & & & \end{bmatrix}

Note that here, instead of annihilating all the nonzero coefficients in column j except the diagonal one as assumed above, we are keeping the coefficient in the first row of Fj. As for the Cholesky case, in practical implementations of the multifrontal QR factorization, nodes of the elimination tree are amalgamated to form supernodes. The amalgamated pivots correspond to rows of R that have the same structure and can be eliminated at once within the same frontal matrix without producing any additional fill-in in the R factor. The elimination of amalgamated pivots and the consequent update of the trailing frontal submatrix can thus be performed by means of efficient Level-3 BLAS routines through the WY representation [143] described in Section 2.1.4.2. Moreover, amalgamation reduces the number of assembly operations, increasing the computations-to-communications ratio, which results in better performance. Figure 2.9 shows some details of a sparse QR factorization. The factorized matrix is shown on the left part of the figure. Note that this is the same matrix as in Figure 2.7 where the rows of A are sorted in order of increasing index of the leftmost nonzero in order to show more clearly the computational pattern of the method on the input data. On the top-right part of the figure, the structure of the resulting R factor is shown. The elimination/assembly tree is, instead, reported in the bottom-right part: dashed boxes show how the nodes can be amalgamated into supernodes, with the corresponding indices denoted by bigger-size numbers. The amalgamated

nodes have the same row structure in R modulo a full, diagonal block. It has to be noted that in practical implementations the amalgamation procedure is based only on information related to the R factor and, as such, it does not take into account fill-in that may appear in the V matrix. The supernode indices are reported on the left part of the figure in order to show the pivotal columns eliminated within the corresponding supernode and on the top-right part to show the rows of R produced by the corresponding supernode factorization. Note that this matrix is such that the structure of A^T A is the same as that of the matrix in Figure 2.3 and thus the structure of R is the same as that of the L factor in Figure 2.5 and the elimination and assembly trees are the same as those in the same figure.

In order to reduce the operation count of the multifrontal QR factorization, two optimizations are commonly applied:

1. once a frontal matrix is assembled, its rows are sorted in order of increasing index of the leftmost nonzero. The number of operations can thus be reduced, as well as the fill-in in the V matrix, by ignoring the zeros in the bottom-left part of the frontal matrix (see the sketch after this list);

2. the frontal matrix is completely factorized. Despite the fact that more Householder vectors have to be computed for each frontal matrix, the overall number of floating point operations is lower since frontal matrices are smaller. This is due to the fact that contribution blocks resulting from the complete factorization of frontal matrices are smaller. Note that this implies that contribution blocks are triangular for overdetermined fronts and trapezoidal for underdetermined ones. This makes the other optimization above necessary.

Figure 2.10 shows the assembly and factorization operations for the supernodes 1, 2 and 3 in Figure 2.9 when these optimization techniques (referred to as Strategy 3 in Amestoy et al. [15]) are applied. Note that, because supernodes 1 and 2 are leaves of the assembly tree, the corresponding assembled frontal matrices only include coefficients from the matrix A. The contribution blocks resulting from the factorization of supernodes 1 and 2 are appended to the rows of the input A matrix associated with supernode 3 in such a way that the resulting, assembled, frontal matrix has the staircase structure shown in Figure 2.10 (bottom).
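The row sorting of the first optimization can be sketched as follows (a dense frontal matrix and an exact-zero test are assumed for simplicity); once the rows are sorted, the Householder eliminations can start at the first row whose leftmost nonzero lies in the pivotal column and thus skip the zero bottom-left triangle of the staircase.

import numpy as np

def sort_staircase(F):
    m, n = F.shape
    def leftmost(i):
        nz = np.flatnonzero(F[i])         # column indices of the nonzeros of row i
        return nz[0] if nz.size else n    # all-zero rows go last
    order = sorted(range(m), key=leftmost)
    return F[order], order                # row-permuted front and the permutation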

2.2.3 Additional topics

2.2.3.1 Fill-reducing orderings

In the previous sections we have assumed that all the unknowns of the matrix are eliminated in the natural order but this does not necessarily have to be the case. Eliminating the unknowns in a different order basically amounts to applying a permutation to the matrix; this is a row and column permutation for the Gaussian Elimination of a structure-symmetric matrix and a column permutation for the QR factorization. This, however, changes the fill-in and the dependencies among columns; an extreme case is illustrated in Figure 2.11 where eliminating variables in the natural order implies filling in the whole structure whereas using the reverse order does not introduce any fill-in. The problem of finding a matrix permutation that minimizes the fill-in was proved to be NP-Complete by Yannakakis [165]. Later, Luce et al. [121] showed that finding a permutation that minimizes the number of floating point operations is a distinct, NP-hard problem. In the literature many heuristic techniques have been proposed to tackle these problems (mostly the first one). These are generally divided into two groups.


Figure 2.10: The first three steps of the multifrontal QR factorization for the matrix of Figure 2.7.

Figure 2.11: Example of matrix permutation. When factorizing the matrix on the left side, the whole structure is filled, whereas factorizing the permuted matrix on the right does not introduce any fill-in.

The first group is that of local methods. The name stems from the fact that these methods proceed by selecting pivots successively, but the selection of each pivot is based on limited information, which does not allow one to foresee what the effects of this selection will be in the remainder of the process. One such method, probably the most well known, is the Minimum Degree one. Recall the elimination graphs process described in Section 2.2.1: when a variable is eliminated, the corresponding node is retired from the adjacency graph and edges are added to connect its neighbors that weren't already connected. These newly added edges model the appearance of fill-in coefficients. Therefore, one might expect that the fill-in can be reduced by selecting, at each elimination step, the node whose degree (i.e., number of neighbors) is smallest. This is the idea at the basis of the Minimum Degree method. The Minimum Fill is a similar method where the selection of the pivots is based on a criterion that more accurately models the amount of fill-in created by an elimination step. Many variants of the Minimum Degree method have been proposed in the literature that aim at reducing its execution time or memory consumption.


Figure 2.12: An example of nested dissection on a 5 × 5 grid. The topmost separator cuts the domain horizontally in two halves which are then cut vertically by other separators. The resulting permuted matrix is shown on the right side.

Approximate Minimum Degree (AMD) by Amestoy et al. [11] and Multiple Minimum Degree (MMD) by Liu [117] are among the most commonly known and used. Also, note that a variant of AMD, called COLAMD, has been developed by Davis et al. [53] to compute an AMD ordering of A^T A without explicitly building it.
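A textbook-style Python sketch of the basic Minimum Degree heuristic on the elimination-graph model is given below; it is far too slow for real problems, whereas AMD and MMD rely on quotient graphs, approximate degrees and multiple eliminations to remain usable at scale.

def minimum_degree(adj):
    # adj: dict mapping each node to the set of its neighbors in G(A)
    g = {i: set(nbrs) for i, nbrs in adj.items()}
    order = []
    while g:
        k = min(g, key=lambda i: len(g[i]))   # node of minimum current degree
        nbrs = g.pop(k)                       # eliminate it
        for i in nbrs:
            g[i].discard(k)
            g[i] |= nbrs - {i}                # clique among its former neighbors
        order.append(k)
    return order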

The second group is that of the global methods, which rely on the knowledge of the whole adjacency graph to compute a fill-reducing matrix permutation. To this category belongs the well known and widely used nested dissection method, which was conceived by George [74]. This technique is based on the concept of separator: a subset of nodes such that, if removed, the domain is split in two parts or subdomains. Once a separator is identified, the matrix is permuted in such a way that all the variables in one subdomain are eliminated first, all the variables in the other second and all the variables in the separator are eliminated last. Because all the paths that connect the two resulting subdomains necessarily include nodes in the separator, which are eliminated last, by Theorem 2.5 there cannot be fill-in in the block that connects the two subdomains. The process is then applied recursively in the two subdomains. Figure 2.12 shows two steps of nested dissection on a 5 × 5 grid and the resulting permuted matrix; the zero blocks that connect the subdomains are shown by the empty squares. This process produces a separators tree which can also be used as an assembly tree because all the nodes in a separator are associated to columns of L that necessarily have the same structure and, therefore, form a supernode. George [74] showed that, when nested dissection is used, the complexity of the factorization is O(N^3) and O(N^6) for a square and a cubic domain of size N, respectively; the numbers of nonzeros in the factors are O(N^2 log(N)) and O(N^4), respectively. We sketch the proof of this result assuming that the input sparse matrix has been permuted using a nested dissection method with cross-shaped separators, as in George's paper, shown in Figure 2.13 for the 2D case. This choice allows for easier computations because the size of a separator is divided by 2^{d−1} and the number of nodes is multiplied by 2^d moving from one level of the tree to the one below; here d is the dimension of the domain, e.g., 2 for a two-dimensional domain (as in Figure 2.13) and 3 for a three-dimensional one. Assembly operations only account for a low-order term and thus will be ignored below. At each level ℓ of the separators tree we have (2^d)^ℓ fronts of order O((N/2^ℓ)^{d−1}), for ℓ ranging from 0 to L = log_2(N).


Figure 2.13: On the left, three levels of nested dissection with cross-shaped separators on a 2D square domain. On the right, the corresponding separators/assembly tree.

Therefore, the flop complexity C(N) to factorize a sparse matrix of order N^d is

C(N) = \sum_{\ell=0}^{L} (2^d)^\ell \left( \left( \frac{N}{2^\ell} \right)^{d-1} \right)^3 = O(N^{3d-3}) = \begin{cases} O(N^3) & \text{if } d = 2 \\ O(N^6) & \text{if } d = 3 \end{cases}    (2.21)

We similarly compute the factor size complexity:

M(N) = \sum_{\ell=0}^{L} (2^d)^\ell \left( \left( \frac{N}{2^\ell} \right)^{d-1} \right)^2 = \begin{cases} O(N^2 \log(N)) & \text{if } d = 2 \\ O(N^4) & \text{if } d = 3 \end{cases}    (2.22)

George et al. [77] showed that the same asymptotic complexity can be achieved using a QR factorization, although experimental results show that this is more expensive than Cholesky or LU by a rather substantial constant factor. Global and local methods can be combined to reduce the execution time or improve the quality of the final ordering; for instance, nested dissection can be applied until a prescribed subdomain size is reached and then an ordering such as AMD is used locally on each subdomain. This is the choice of modern ordering tools such as Metis [105] or Scotch [133], which are probably the most commonly used ordering tools, especially on large problems.
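For illustration, the following Python sketch orders an N × N grid by nested dissection with straight (one-line) separators, as in Figure 2.12: the separator of each (sub)domain is numbered after the two halves it produces. This is only a toy version of what Metis or Scotch do on general graphs.

def nested_dissection(rows, cols):
    # rows, cols: lists of grid coordinates delimiting the (sub)domain
    if len(rows) <= 2 and len(cols) <= 2:        # small enough: keep the natural order
        return [(i, j) for i in rows for j in cols]
    if len(rows) >= len(cols):                   # cut along the longer dimension
        mid = len(rows) // 2
        sep = [(rows[mid], j) for j in cols]
        halves = (nested_dissection(rows[:mid], cols),
                  nested_dissection(rows[mid + 1:], cols))
    else:
        mid = len(cols) // 2
        sep = [(i, cols[mid]) for i in rows]
        halves = (nested_dissection(rows, cols[:mid]),
                  nested_dissection(rows, cols[mid + 1:]))
    return halves[0] + halves[1] + sep           # separator variables are eliminated last

order = nested_dissection(list(range(5)), list(range(5)))   # the 5 x 5 grid of Figure 2.12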

2.2.3.2 Pivoting

Pivoting in the multifrontal method, or in sparse factorizations in general, is much more complicated than in dense factorizations. Within a frontal matrix, pivots can only be selected inside the block of fully summed variables, i.e., the top-left block, because the remaining rows are not fully summed. However, eligible pivots may be unsatisfactory because they are too small with respect to other coefficients in the unassembled rows. When this is the case, the bad pivots can be moved to the end of the fully summed block and their elimination postponed; if, when they are reached, these pivots are still unsatisfactory, their elimination can be delayed to the parent node. This technique, commonly referred to as delayed pivoting, makes the multifrontal method practically stable but may severely harm performance. It must be noted that, when pivots are delayed, the contribution block of the front they belong to and the fully-summed block of the parent are augmented by as many rows and columns as the delayed pivots. This has a twofold effect. First, it adds fill-in

because the delayed rows and columns have to be padded with zeros in order to conform to the parent indices and be assembled therein. Second, it changes the computational pattern of the multifrontal method in a way which is not possible to predict just by looking at the structure of the problem; this makes it difficult to model the workload and arrange the computations accordingly (see more details in Section 2.2.3.5) and makes the use of dynamic data structures necessary in order to accommodate the occurrence of delayed pivots. For this reason the pivot selection criterion is often relaxed by means of a technique called threshold pivoting: at step k, a pivot is accepted if it satisfies the condition

|a_{k,k}| ≥ u \max_{i=k:n} |a_{i,k}|.
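As an illustration, a minimal Python sketch of this acceptance test over the fully summed block of a front is given below; for simplicity it ignores the updates that a real factorization would apply between two successive pivot tests, and the threshold u = 0.01 is just a common default rather than the value used by any particular solver.

import numpy as np

def select_pivots(F, n_fully_summed, u=0.01):
    accepted, delayed = [], []
    for k in range(n_fully_summed):
        # a pivot is accepted only if it is not too small with respect to its column
        if abs(F[k, k]) >= u * np.max(np.abs(F[k:, k])):
            accepted.append(k)
        else:
            delayed.append(k)                # its elimination is postponed (delayed pivot)
    return accepted, delayed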

This implies that growth in a single step of Gaussian Elimination is limited to 1 + 1/u. This technique allows for limiting the number of delayed pivots and thus for having a workload that better matches what can be predicted through a symbolic analysis. In other practical approaches, pivoting is completely avoided. One such method is the so-called static pivoting [112] where the pivot a_{k,k} is replaced with ϵ∥A∥, ϵ being the machine precision, if it falls below a certain threshold. This corresponds to solving a slightly and selectively perturbed system. The static pivoting technique is not as robust as the delayed pivoting technique but works well in many practical cases.

The same issue can be encountered in the multifrontal QR factorization of rank-deficient matrices. A column-pivoting technique, such as the one discussed in Section 2.1.4.3, can be applied within a frontal matrix; the pivoting has to be restricted to the fully summed columns and, if needed, columns whose norm falls below a given threshold can be postponed to the parent front. Unfortunately, in the multifrontal QR factorization, pivoting is even more harmful for performance than in Gaussian Elimination because it can completely destroy the staircase structure of fronts, which may result in an excessive fill-in growth. Pierce et al. [135] proposed a sparse, multifrontal QR method with column pivoting where deficiencies are identified based on an incremental condition number estimation. Heath [98], instead, proposed a method for computing the QR factorization of sparse matrices which does not require pivoting: when a diagonal coefficient of R falls below a certain threshold, the whole row of R is zeroed out using Givens rotations. This technique, originally conceived for a QR factorization based on row-wise Givens rotations, was later extended to the Householder multifrontal QR method by Davis [55]. Another technique which does not require pivoting is the Tikhonov regularization [152], which consists in appending a matrix τD, with τ being a scalar and D a diagonal matrix, to the original matrix A. If τ > 0, the least squares problem is thus transformed into the full-rank problem

\min_x \left\| \begin{bmatrix} A \\ \tau D \end{bmatrix} x - \begin{bmatrix} b \\ 0 \end{bmatrix} \right\|_2^2 = \min_x \|Ax - b\|_2^2 + \tau^2 \|Dx\|_2^2

whose solution can be made close to that of the original rank-deficient problem by conveniently choosing τ and D. The regularized problem can be solved by means of a standard QR factorization although this introduces additional fill-in and computations. For this reason Avron et al. [27] proposed a technique where rows are appended to regularize only linearly dependent columns.
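The regularized problem can be illustrated with a few lines of NumPy, where the stacked matrix [A; τD] is factorized with a dense QR; the choice D = I and the function name are assumptions of this sketch only.

import numpy as np

def tikhonov_solve(A, b, tau):
    m, n = A.shape
    M = np.vstack([A, tau * np.eye(n)])       # stacked, full-rank matrix [A; tau*I]
    rhs = np.concatenate([b, np.zeros(n)])
    Q, R = np.linalg.qr(M)                    # reduced QR of the stacked matrix
    return np.linalg.solve(R, Q.T @ rhs)      # minimizes ||Ax - b||^2 + tau^2 ||x||^2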


            PA = LU     A = LL^T     A = QR       A = R^T Q^T     Order
            Ly = Pb     Ly = b       y = Q^T b    R^T y = b       Bottom-up
            Ux = y      L^T x = y    Rx = y       x = Qy          Top-down

Table 2.1: Matrix factorizations with related solve operations and tree traversal order.

2.2.3.3 Unsymmetric methods

In the previous sections and in the rest of this document we have assumed that, for the sparse Gaussian Elimination, the A matrix is structure-symmetric. Note that, when this is not the case, the structure of A can be padded with explicit zeros and the symbolic analysis can be performed on the structure of A + A^T. This approach, however, can have serious limitations on problems that are strongly unsymmetric. Various approaches have been proposed to cope with unsymmetry. Gilbert et al. [81] proposed a method where the structure of the factors is characterized by a couple of DAGs, called elimination DAGs or edags (as opposed to the elimination tree or etree of the symmetric case), which are obtained from the transitive reductions of the directed graphs of L and U, G⃗(L) and G⃗(U). In a more recently proposed method, Eisenstat et al. [68] replace these edags with a tree: j > i is the parent of i if it is the node with smallest index such that there is a path from i to j in G⃗(L) and back from j to i in G⃗(U). In a separate paper [69] they describe methods to compute this tree starting from the structure of A and how it can be used to compute the structure of the factors. Amestoy et al. [17] proposed a method which relies on the symbolic analysis of the A + A^T matrix and the associated elimination tree. However, when frontal matrices are assembled, the method can identify entire zero rows or zero columns which result from the symmetrization of the matrix and skip them both in the front assembly and factorization.

Finally, it must be noted that there is a very close relation between the LU and QR factorizations. It can be proved (see, for example, the work by Gilbert et al. [79]) that the structures of the R factor and of the V matrix (holding the Householder reflections) resulting from the QR factorization of a square matrix A provide an upper bound for those of the U and L factors, respectively. This upper bound can accommodate any possible row permutation resulting from pivoting and, for this reason, can be very loose. Nonetheless, according to this result, it is possible to use a symbolic QR factorization and the resulting column elimination tree to arrange the computations and storage in an unsymmetric LU factorization.

2.2.3.4 Multifrontal solve

Once the matrix factorization has been computed, the factors can be used to solve the system against one or more right-hand sides as explained in Section 2.1. In the case of sparse linear systems, this can be achieved in a multifrontal fashion by traversing the assembly tree either in a bottom-up or a top-down order as described in Table 2.1. We describe the multifrontal solve for the forward elimination Ly = b; the other operations proceed in a similar way. Recall that, upon factorization of frontal matrix Fj, a set of columns of the L factor is produced:

F_j \longrightarrow \begin{bmatrix} L_{j,j} \\ L_{j+1,j} \end{bmatrix}


where Lj,j denotes the triangular block associated with the fully summed variables and, with a slight abuse of notation, Lj+1,j denotes the sub-diagonal rectangular block associated with the non fully assembled variables. The multifrontal forward elimination proceeds by assembling, at each node of the tree, a local right-hand side f which we also suppose logically split into two parts fj and fj+1, conforming with the splitting of L defined above. The operations done at each node of the tree are

f = b_j ↔↕ u_{c_1} ↔↕ \cdots ↔↕ u_{c_s};    L_{j,j} y_j = f_j;    u_j = f_{j+1} − L_{j+1,j} y_j

where bj and yj denote all the coefficients of the right-hand side b and of the solution y corresponding to the front's fully summed variables and we assume that node j has children c1, ..., cs. The first operation assembles the local right-hand side by means of extend-add operations that combine bj with the update vectors resulting from the processing of the child nodes. The second operation computes yj by means of a triangular system solve and the third computes the update vector uj through a simple matrix-vector product. Note that in the case of multiple right-hand sides, f, bj and all the update vectors become matrices with as many columns as the number of right-hand sides but the “extend” part of the ↔↕ operation only concerns rows. The first three steps of the multifrontal forward elimination for the matrix of Figure 2.3, assuming nodes 3 and 4 of the elimination tree have been amalgamated, are

f = \begin{bmatrix} b_1 \\ 0 \\ 0 \end{bmatrix} \longrightarrow y_1 = f_1 / l_{1,1}, \quad u_1 = \begin{bmatrix} -l_{4,1} y_1 \\ -l_{9,1} y_1 \end{bmatrix}

f = \begin{bmatrix} b_2 \\ 0 \\ 0 \\ 0 \end{bmatrix} \longrightarrow y_2 = f_2 / l_{2,2}, \quad u_2 = \begin{bmatrix} -l_{3,2} y_2 \\ -l_{4,2} y_2 \\ -l_{7,2} y_2 \end{bmatrix}

f = \begin{bmatrix} b_3 \\ b_4 \\ 0 \\ 0 \end{bmatrix} ↔↕ u_1 ↔↕ u_2 = \begin{bmatrix} b_3 - l_{3,2} y_2 \\ b_4 - l_{4,1} y_1 - l_{4,2} y_2 \\ -l_{7,2} y_2 \\ -l_{9,1} y_1 \end{bmatrix} \longrightarrow
\begin{bmatrix} l_{3,3} & \\ l_{4,3} & l_{4,4} \end{bmatrix} \begin{bmatrix} y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} f_3 \\ f_4 \end{bmatrix}, \quad
\begin{bmatrix} u_7 \\ u_9 \end{bmatrix} = \begin{bmatrix} f_7 \\ f_9 \end{bmatrix} - \begin{bmatrix} l_{7,3} & l_{7,4} \\ l_{9,3} & l_{9,4} \end{bmatrix} \begin{bmatrix} y_3 \\ y_4 \end{bmatrix}

An interesting case is where the right-hand side is sparse. Assume, for example, that all the entries of the right-hand side are null except one. In this case, the forward elimination operation need not access the whole L factor but only part of it; specifically it only needs to do computations related to the nodes of the elimination tree that lie along the path connecting the node associated with the right-hand side nonzero and the root. This allows for considerable savings in the forward elimination. Clearly, if the right-hand side contains multiple nonzeros, then the nodes of the tree concerned by the forward elimination are given by the merge of the paths related to each one of them. By the same token, if only one nonzero of the solution vector is needed, the backward substitution only involves nodes that lie along the path that goes from the root to the node associated with the nonzero

element of the solution. Note that computing the inverse of a matrix can be achieved by solving the linear system AM = I, where I is the identity matrix and, therefore, M = A^{-1}; this implies solving the linear system against many right-hand sides, each having only one nonzero. Moreover, if only selected entries of the inverse are needed then the solution is also sparse. Taking advantage of the sparsity of right-hand sides and/or solution (also for computing selected entries of the inverse) is a subject that has been addressed in the literature, for example by Slavova [149], Amestoy et al. [14] or Amestoy et al. [18].
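The tree pruning described above amounts to collecting the union of the root paths of the nodes holding nonzeros of the right-hand side, which the following sketch does by simply climbing the parent pointers of the elimination (or assembly) tree; the mapping from right-hand side entries to tree nodes is assumed to be given.

def nodes_to_visit(parent, nonzero_nodes):
    visit = set()
    for j in nonzero_nodes:
        while j is not None and j not in visit:   # climb towards the root, stop at a visited node
            visit.add(j)
            j = parent[j]
    return visit                                  # nodes involved in the forward elimination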

2.2.3.5 A three-phase approach

Modern sparse, direct solvers combine all the theory and methods described above. Typically, they achieve the solution of a sparse linear system in three distinct steps:

1. Analysis. First, a fill-reducing matrix ordering is computed. The structure of the resulting permuted matrix is then analyzed in order to arrange the computations of the subsequent Factorization and Solve steps. Namely, a symbolic factorization is done in order to characterize the structure of the factors (and, consequently, of the frontal matrices) and compute the elimination and/or assembly trees. This phase may include computing an estimate of the memory consumption and of the workload, which is of particular importance in a parallel setting where memory and workload distribution have to be carefully done in order to achieve a good balance. No floating-point operations are involved in this phase, whose complexity and execution time are, supposedly, much smaller than those of the two subsequent steps.

2. Factorization. This is where the actual matrix factorization takes place. In this document we have focused on the multifrontal method but other approaches exist such as the left-looking or right-looking supernodal methods (also referred to as fan-in and fan-out, respectively, especially in a parallel setting).

3. Solve. The solve phase is where the factors are used to solve the linear system against one or more right-hand sides. Again, in this document we have focused on the multifrontal solve technique but left or right-looking solve techniques are also commonly used.

In some applications it is required to solve the same matrix against multiple right-hand sides which are not all available at the same time. This is, for example, the case of iterative processes where one right-hand side (or a block of them) is computed at each iteration based on the result of the previous one. In these cases, clearly, the analysis and factorization steps need be done only once and only the solve operation has to be repeated. Similarly, because the analysis phase only involves structural computations, if multiple problems have to be solved where only the coefficients of the matrices differ but not their positions, the analysis operation can be done once for all the problems. This is shown schematically in the block diagram of Figure 2.14.
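This reuse pattern can be illustrated with SciPy's sparse LU interface (a supernodal right-looking solver, not a multifrontal one): the analysis and factorization are performed once by splu and only the solve is repeated as new right-hand sides become available. The matrix and the dependence of each right-hand side on the previous solution are purely illustrative.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
A = (sp.random(n, n, density=0.01) + 10.0 * sp.eye(n)).tocsc()
lu = spla.splu(A)                  # analysis + factorization, done once
x = np.zeros(n)
for it in range(5):                # e.g., an iterative process producing one rhs per step
    b = np.random.rand(n) + x      # hypothetical dependence on the previous solution
    x = lu.solve(b)                # solve phase only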

2.2.3.6 Sparse, direct solver packages

Most of the methods and algorithms described in the upcoming sections have been implemented within the MUMPS [13] and qr_mumps [J2] software packages for the purpose of evaluating experimentally their effectiveness. Those methods whose implementation has reached a satisfactory maturity and reliability have also been made freely available in recent public releases of these packages. Both these solvers rely on the multifrontal method. The first implements LU and LDL^T factorizations, is originally designed for distributed memory parallel systems but can efficiently handle shared memory or hybrid ones, uses delayed pivoting, which makes it very robust and reliable, and provides a very wide set of features.


Figure 2.14: A block diagram showing the three phases of a typical sparse direct solver.

The qr_mumps package, instead, is based on the QR factorization and, thus, it aims primarily at linear least-squares problems; it currently runs only on parallel shared memory systems, although a development version that can also use GPUs exists (presented in Section 3.5). Many other sparse, direct solvers are available although a comprehensive list is out of the scope of this document. Other popular sparse direct solvers relying on the multifrontal method include those of the HSL collection such as MA41 [12], MA49 [15] or MA86 [101], UMFPACK [54], SuiteSparseQR [55] and WSMP [90]. Well known packages that, instead, rely on left or right-looking supernodal methods are SuperLU [113] and its variants SuperLU_MT and SuperLU_DIST, designed for sequential, shared memory and distributed memory systems, respectively, PaStiX [99], SPOOLES [25], CHOLMOD [48] or PARDISO [141].

2.3 Supercomputer architectures

The world of computing, and particularly that of High Performance Computing (HPC), has witnessed a substantial change at the beginning of the last decade as all the classic techniques used to improve the performance of microprocessors reached the point of diminishing returns [23]. These techniques, such as deep pipelining, speculative execution or superscalar execution, were mostly based on the use of Instruction Level Parallelism (ILP) and required higher and higher clock frequencies, to the point where the processors' power consumption, which grows as the cube of the clock frequency, became (or was about to become) unsustainable. This was not only true for large data or supercomputing centers but also, and even more so, for portable devices such as laptops, tablets and smartphones which have recently become very widespread. In order to address this issue, the microprocessor industry sharply turned towards a new design based on the use of Thread Level Parallelism (TLP) achieved by accommodating multiple processing units, or cores, on the same die. This led to the production of multicore processors that are nowadays ubiquitous. The main advantage over the previous design principles lies in the fact that the multicore design does not require an increase in the clock frequency but only implies an augmentation of the chip capacitance (i.e., the number of transistors), on which the power consumption depends only linearly. As a result, the multicore technology not only enables improvement in performance, but also reduces the power consumption: assuming a single-core processor with frequency f, a dual-core with frequency 0.75f is 50% faster and consumes 15% less energy. Since their introduction, multicore processors have become increasingly popular and can be found, nowadays, in almost any device that requires some computing power.


Figure 2.15: Performance share of multicore processors (on the left) and accelerators (on the right) in the Top500 list.

Because they allow for running multiple processes simultaneously, the introduction of multicore in high throughput computing or in desktop computing was transparent and immediately beneficial. In HPC, though, the switch to the multicore technology led to a sharp discontinuity with the past as methods and algorithms had to be rethought and codes rewritten in order to take advantage of the added computing power through the use of TLP. Nonetheless, multicore processors have quickly become dominant and are currently used in basically all supercomputers. Figure 2.15 (left) shows the performance share of multicore processors in the Top500 list (a list of the 500 most powerful supercomputers in the world, updated twice per year; see http://top500.org); the figure shows that after their first appearance in the list (May 2005) multicore has quickly become the predominant technology in the Top500 list and ramped up to nearly 100% of the share in only 5 years. The figure also shows that the number of cores per socket has grown steadily over the years: a typical modern processor features between 8 and 16 cores.

In addition to an ever-increasing number of cores per socket, modern processors rely more and more on SIMD (Single Instruction Multiple Data) parallelism achieved by means of vector units. At the beginning of the 2000s, processors were equipped with moderately sized vector extensions such as SSE (Streaming SIMD Extensions, in the Intel and AMD x86 architecture) or AltiVec (in the IBM POWER architecture). These were made of a set of 128-bit-wide vector registers that could host two double-precision, real coefficients (or four single-precision, real ones) and a set of instructions to perform the same operation on all the coefficients in a vector register at once. In 2011 Intel introduced the AVX (Advanced Vector eXtensions) with their Sandy Bridge family of processors, which are based on 256-bit vector units. This was later extended by AVX2, which added the Fused Multiply-Add operation. The latest evolution of this trend is the AVX-512 extensions, which brought the length of vector units to 512 bits. As a result, a modern processor can perform up to 32 double-precision, real operations per clock cycle:

² http://top500.org


8 (# data in a SIMD vector) × 2 (FMA) × 2 (dual issue). All these technological advances have allowed for increasing the peak performance while keeping the power consumption under control, as shown in Figure 2.17 (left), which plots the evolution over the last few years of the ratio of CPU speed to TDP (Thermal Design Power), measured in Gflop/s per Watt, for some of the Intel flagship processors. The latest point in the curve corresponds to the Skylake 8176 processor released in Q3 2017; this processor has 28 cores clocked at 2.1 GHz with AVX-512 extensions for a massive peak performance of 1.88 Tflop/s (for double-precision, real computations), has a TDP of 165 W and a memory bandwidth of 119.21 GB/s. As a reference, the first supercomputer capable of passing the 1 Tflop/s mark (1.3, exactly) was the ASCI Red, which was ranked number one on the Top500 list of June 2000. This computer was equipped with 9298 processors, consumed 850 kW and occupied a surface of 150 m².

Intel has pushed the envelope of multicore technologies even further with the Xeon Phi processors. The latest addition to this family is the KNL (Knights Landing) processor. Although it shares some features with accelerator boards (see below), the KNL has an x86 architecture and can therefore run any application that was developed for conventional processors, including the operating system, without the need for dedicated programming languages or paradigms. It can have as many as 72 cores clocked at a relatively low frequency (1.3 to 1.5 GHz) with AVX-512 extensions and can reach a peak performance of 3.45 Tflop/s. The bandwidth towards main memory is 115 GB/s. One distinctive feature of the KNL is that the cores can be arranged according to different pre-defined schemes to resemble more or less an SMP (Symmetric Multi-Processor) or a NUMA (Non-Uniform Memory Access) system.

Around the same period as the introduction of multicore processors, the use of accelerators or coprocessors started gaining the interest of the HPC community. Although this idea was not new (for example, FPGA boards were previously used as coprocessors), it was revamped thanks to the possibility of using extremely efficient commodity hardware for accelerating scientific applications. One such example is the Cell processor [100] produced by the STI consortium (formed by IBM, Toshiba and Sony) from 2005 to 2009. The Cell was used as an accelerator board on the Roadrunner supercomputer installed at the Los Alamos National Laboratory (USA), which was ranked #1 in the Top500 list of November 2008. The use of accelerators for HPC scientific computing, however, gained very widespread popularity with the advent of General Purpose GPU (GPGPU) computing. Specifically designed for image processing operations, Graphical Processing Units (GPUs) offer a massive computing power which is easily accessible to highly data-parallel applications (image processing often consists in repeating the same operation on a large number of pixels). This led researchers to think that these devices could be employed to accelerate scientific computing applications, especially those based on the use of operations with a very regular behaviour and data access pattern, such as dense linear algebra ones. In the last few years, GPGPU has become extremely popular in scientific computing and is employed in a very wide range of applications, not only dense linear algebra.
This widespread use of GPU accelerators was also eased by the fact that GPUs, which were initially very limited in computing capabilities and difficult to program, have become, over the years, suited to a much wider range of applications and much easier to program thanks to the development of specific high-level programming languages and development kits. Figure 2.15 (right) shows the performance share of supercomputers equipped with accelerators. Although some GPU accelerators are also produced by AMD, currently the most widely used ones are produced by Nvidia. Figure 2.16 shows a block diagram of the architecture of a recent Nvidia GPU device, the P100 of the Tesla family. This board is equipped with 56 Streaming Multiprocessors (SMX), each containing 64 single-precision cores and 32 double-precision ones, for a peak performance of 5.3 (10.6) Tflop/s for double (single) precision computations.
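These peak figures follow directly from the number of floating-point operations per cycle and the clock frequency; as a rough check (the P100 clock below is inferred from the quoted peak rather than taken from the text):

    28\ \text{cores} \times 2.1\times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} \times \underbrace{8 \times 2 \times 2}_{\text{flop/cycle}} \approx 1.88\ \text{Tflop/s} \quad (\text{Skylake 8176}),

    56\ \text{SMs} \times 32\ \text{FP64 cores} \times 2\ (\text{FMA}) \times f \approx 5.3\ \text{Tflop/s} \;\Rightarrow\; f \approx 1.48\ \text{GHz} \quad (\text{Tesla P100}).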

Figure 2.16: Block diagram of the Nvidia Tesla P100 GPU.

(Two plots: "CPU speed vs TDP", in (Gflop/s)/W, and "CPU vs Memory speed", in (Gflop/s)/(GB/s), for the Intel W5590, X7560, E7-8870, E7-8890 v2-v4, E7-8894 v4 and Skylake 8176 processors over the years 2010-2018.)
Figure 2.17: Evolution of the ratio between CPU speed (in double-precision Gflop/s) and Thermal Design Power (in W), on the left, and between CPU speed and memory speed (in GB/s), on the right. Data is taken from Intel's website.

Although processor performance has continued to increase considerably and steadily over the last years, memories have not managed to keep up with these improvements. This is a long-lasting problem, much older than multicore technologies; these, however, have made it even worse because all the cores of a processor share the same memory and the same bus to move data in and out of it. Figure 2.17 (right) shows the evolution of the CPU over memory performance ratio (measured in Gflop/s divided by GB/s) for the same processors and time frame as for the plot on the left side. In the case of operations with a high arithmetic intensity, such as Level-3 BLAS operations, this problem can be circumvented through a careful reuse of data in cache memories, but for others (think of a sparse matrix-vector product, which cannot make any effective use of caches) it makes it impossible to take advantage of the full potential of modern multicore processors. For this reason, memory producers have turned their attention towards 3D stacked technologies. Three-dimensional integration allows for stacking the memory directly on top of the processor; as a result the wire delay between the two is considerably reduced, which translates into higher bandwidths and lower latencies. HBM (High Bandwidth Memory) is one among the various standards that have been proposed for 3D stacked memories. The Tesla P100 board described above is equipped with a local HBM2 (an evolution of the HBM standard) memory of 16 GB that has a bandwidth of 732 GB/s. The Intel KNL processor is also equipped with a local 16 GB fast MCDRAM (a variant of the HBM standard) memory with a bandwidth of more than 400 GB/s, as reported by the manufacturer. Another distinctive feature of the KNL processor is that this memory can be configured according to different schemes: as an explicitly addressed memory, in which case it is up to the programmer to move data in and out of it, as an additional cache level, in which case its use is completely transparent to the programmer, or as a hybrid of these two configurations.
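A back-of-the-envelope computation illustrates the imbalance for the Skylake 8176 quoted above; the sparse matrix-vector product figure assumes the usual compressed storage with one 8-byte value and one 4-byte index per matrix entry (an accounting not made explicit in the text):

    \frac{1880\ \text{Gflop/s}}{119.21\ \text{GB/s}} \approx 15.8\ \tfrac{\text{flop}}{\text{byte}} \approx 126\ \tfrac{\text{flop}}{\text{double}},
    \qquad
    \text{SpMV:}\ \frac{2\ \text{flop}}{12\ \text{bytes}} \approx 0.17\ \tfrac{\text{flop}}{\text{byte}},

that is, a kernel such as the sparse matrix-vector product can exploit only a tiny fraction of the peak performance, whereas Level-3 BLAS kernels, which reuse each coefficient many times from cache, can approach it.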

(Diagram of a node of 10-15 Tflop/s: multicore CPUs with cores of about 40 Gflop/s each, 128 GB memory modules at about 80 GB/s, accelerators of 4-5 Tflop/s with 16 GB local memories at about 500 GB/s, and interconnect links of about 10-25 GB/s.)
Figure 2.18: Illustration of a typical HPC computing platform architecture.

Figure 2.18 shows a typical configuration of a modern HPC computing platform. This is formed by multiple (up to thousands of) nodes connected through a high-speed network; each node may include multiple processors, each connected to a NUMA memory module. A node may also be equipped with one or more accelerators. Please note that the figure reports indicative values for performance, bandwidths and memory capacities, which do not refer to any specific device. The figure shows that modern HPC platforms are based on extremely heterogeneous architectures as they employ processing units with different performance, memories with different capacities and interconnects with different latencies and bandwidths. In this context, achieving performance and performance portability is an extremely challenging task as different applications may benefit differently from all the involved technologies.


2.4 Runtime systems and the STF model

In the diverse landscape of supercomputing architectures described above, and because of the increasing complexity of algorithms for scientific computing, traditional methods for implementing parallel applications based on a blend of different technologies (for instance, MPI, OpenMP and CUDA), may fall short in achieving high performance, scalability, performance portability and code maintainability. As a result, the need has recently emerged for novel parallel programming models and tools capable of

• addressing in a consistent way the diversity and heterogeneity of modern platforms,

• relieving the programmer of the burden of dealing with the low-level details of the architecture,

• achieving code and performance portability, i.e., making a code portable across a wide range of architectures (with, possibly, minor modifications) while keeping a satisfactory level of performance.

Modern runtime systems (or, simply, runtimes) are designed to address this demand. A runtime is conceived as a software layer that abstracts the underlying computing platform by hiding most of the low-level architectural details and provides the programmer with a unified programming interface, portable across a wide range of architectures, to represent his workload. The runtime is then in charge of deploying and executing the workload on the computing platform. Generally speaking, the following four components can be found in a modern runtime system:

1. A programming model. The commonly used approach to program supercomputing platforms relies on a mixture of different technologies; this can include, for example, MPI for communications among different nodes, OpenMP for taking advantage of multicores and CUDA for GPUs. Modern runtimes, instead, provide a programming model which allows the programmer to express his workload in an abstract way, independently of the details of the underlying architecture. Obviously, one such programming model may have more or less expressivity or may be more or less suited to a specific class of algorithms.

2. A scheduler. When multiple execution units are available, the scheduler is in charge of deciding when and where a specific task (a unit of work) has to be executed. This has to account for the different speeds and capabilities of the available working units, as well as their relative distance, in order to cope with the cost of data transfers. When possible, the characteristics of the workload can also be taken into account, such as performance models for tasks; this information can either be provided by the programmer or automatically built by the runtime. The set of rules that is used to make this decision is what we call a scheduling policy. Modern runtime systems usually come with a set of pre-defined scheduling policies (such as work-stealing [22] or Minimum Completion Time [153]) suited to the most common architectures, but may also provide the necessary API for the programmer to implement his own scheduling policy.

3. Drivers. A driver is in charge of triggering the execution of tasks on a specific type of execution unit. Therefore, when a new type of execution unit appears, the runtime developer only has to write a new driver to pilot it.


4. Memory manager. A modern computing platform may have multiple memories each with its own address space; this is, for example, the case of clusters of nodes or of systems equipped with GPU boards. Therefore, when the scheduler decides to execute a task on a unit which is far away from the memory that holds the corresponding data, this has to be transferred for the task to execute. This is the role of the memory manager. Varying degrees of optimization can be implemented in a memory manager. For example, if the same data is only read by multiple tasks running on units associated with different memories, multiple copies of it may be generated by the memory manager in order to avoid unnecessary communications; in this case the memory manager has to handle the coherency among these copies.

In recent years, the use of runtime systems has become increasingly popular. Most of these runtimes rely on an interface where the workload has to be expressed as a Directed Acyclic Graph (DAG) where nodes represent tasks and edges the dependencies among them. The runtime programming model is thus just a means of expressing this DAG. Among the most well known and widely used models provided by recent runtime systems is the Sequential Task Flow (STF) one. The STF model simply consists of submitting a sequence of tasks through a non-blocking function call that delegates the execution of the task to the runtime system; this submission specifies how the task accesses its data, i.e., in read mode, write mode or both. Upon submission, the runtime system adds the task to the current DAG along with its dependencies, which are automatically computed through data dependency analysis [10]:

• for all data accessed in read mode: the submitted task will be dependent on all previously submitted tasks that access the same data in write mode;

• for all data accessed in write mode: the submitted task will be dependent on all previously submitted tasks that access the same data in read or write mode.

The actual execution of the task is then postponed to the moment when its dependencies are satisfied. This paradigm is also sometimes referred to as Superscalar since it mimics the functioning of superscalar processors, where instructions are issued sequentially from a single stream but can actually be executed in a different order and, possibly, in parallel depending on their mutual dependencies. Figure 2.19 shows a dummy sequential algorithm

  call f(x,y)          call submit(f, x:RW, y:RW)
  call g(x,u)          call submit(g, x:R,  u:RW)
  call h(y,z)          call submit(h, y:R,  z:RW)
                       call wait_tasks_completion()

Figure 2.19: Pseudo-code for a dummy sequential algorithm (left), corresponding STF version (center) and subsequent DAG (right).

and its corresponding STF version. Instead of making three function calls (f, g, h), the STF version submits the three corresponding tasks. The data on which these functions operate, as well as the corresponding access modes (Read, Write or Read/Write), are also specified. Because task g accesses data x after task f has accessed it in Write mode, the runtime infers a dependency between tasks f and g. Similarly, a dependency is inferred between tasks f and h due to data y. Figure 2.19 (right) shows the DAG corresponding to this STF dummy code. In the STF model, one thread is in charge of submitting the tasks;

we refer to this thread as the master thread. The execution of tasks is instead achieved by worker threads. The function called at the end of the STF pseudo-code is simply a barrier that prevents the master thread from continuing until all of the submitted tasks have been executed. Note that, in principle, the master thread can also act as a worker; whether this is possible or not depends on the design choices of the runtime system developers. Many runtime systems [26, 28, 108] support the STF paradigm, such as OpenMP (since standard 4.0), which provides the task directive to submit tasks and the depend clause to declare their data access mode. For the work presented in Chapter 3 we chose to rely on StarPU [26] as it provides a very wide set of features that allow, most importantly, for a better control of the task scheduling policy, and because it supports accelerators and distributed-memory parallelism, which will be addressed in future work. Other popular runtimes, such as Parsec by Bosilca et al. [38], rely on a programming model called Parametrized Task Graph. In this approach the dataflow and, thus, the DAG, is described by means of a dedicated language which allows for defining tasks and rules that are used to pass data from one task to the others. As a consequence, the DAG is never explicitly and entirely built but progressively unrolled by means of these rules; this provides a better scalability of the runtime which, however, comes at the cost of a much higher programming effort.
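As an illustration, the dummy algorithm of Figure 2.19 can be expressed in the STF style with OpenMP tasks and depend clauses (OpenMP 4.0 or later). This is only a minimal, self-contained sketch, not code taken from qr_mumps or StarPU; the bodies of f, g and h are placeholders.

  program stf_dummy
    implicit none
    real :: x(100), y(100), u(100), z(100)

    x = 1.0; y = 2.0; u = 0.0; z = 0.0

    !$omp parallel
    !$omp single
    ! f accesses x and y in read/write mode
    !$omp task depend(inout: x, y)
    call f(x, y)
    !$omp end task
    ! g reads x (RAW dependency on f) and writes u
    !$omp task depend(in: x) depend(inout: u)
    call g(x, u)
    !$omp end task
    ! h reads y (RAW dependency on f) and writes z
    !$omp task depend(in: y) depend(inout: z)
    call h(y, z)
    !$omp end task
    ! plays the role of wait_tasks_completion()
    !$omp taskwait
    !$omp end single
    !$omp end parallel

  contains

    subroutine f(x, y)
      real, intent(inout) :: x(:), y(:)
      y = y + x
    end subroutine f

    subroutine g(x, u)
      real, intent(in)    :: x(:)
      real, intent(inout) :: u(:)
      u = 2.0*x
    end subroutine g

    subroutine h(y, z)
      real, intent(in)    :: y(:)
      real, intent(inout) :: z(:)
      z = 3.0*y
    end subroutine h

  end program stf_dummy

The OpenMP runtime infers the f-to-g and f-to-h dependencies from the declared access modes exactly as described above; the programmer writes no explicit synchronization between the three tasks.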

Chapter 3

Parallelism and performance scalability

Sparse computations are well known for being hard to parallelize on shared-memory, multicore systems. This is due to the fact that the efficiency of many sparse operations, such as the sparse matrix-vector product, is limited by the speed of the memory system. This is not the case for the multifrontal method; since computations are performed as operations on dense matrices, a favorable ratio between memory accesses and computations can be achieved, which reduces the utilization of the memory system and opens opportunities for multithreaded, parallel execution. Moreover, sparse direct methods lend themselves naturally to parallelism because they benefit from two sources of concurrency. The first stems from the matrix sparsity and is commonly referred to as tree parallelism: fronts lying on distinct branches of the assembly tree are independent and can be processed concurrently. The second is inherent to the dense linear algebra operations performed on each frontal matrix and is commonly referred to as node parallelism: if a frontal matrix is big enough, multiple processes can be used to assemble and factorize it. The use of both these types of parallelism is essential to achieving good scalability with parallel implementations of sparse direct solvers. This is because they are complementary: at the bottom of the tree, node parallelism is scarce because fronts are small and tree parallelism is abundant because there are many branches; inversely, at the top of the tree fronts are large and provide much node parallelism but the tree is narrow.

Let us assume that the execution time of a sparse matrix factorization is proportional to the number of performed floating point operations. For the case of a problem from a regular square (2D) or cubic (3D) domain of size N where nested dissection is used, the sequential execution time is, therefore, O(N^3) and O(N^6), respectively; this is derived from the result of Section 2.2.3.1. Note that here, not only have we assumed that the assemblies account for a lower-order term in the flop count, but also that their execution time is negligible, which is not necessarily the case in practice because these operations are extremely inefficient and involve many communications and indirect addressing. For a parallel execution, a lower bound for the running time can be computed by modifying Equation (2.21) in order to take node and tree parallelism into account:

    T = \sum_{\ell=0}^{L} \left(2^{d\ell}\right)^{\alpha} \left(\left(\frac{N}{2^{\ell}}\right)^{d-1}\right)^{\beta}
      = N^{\beta(d-1)} \sum_{\ell=0}^{L} 2^{\ell\left(\alpha d - \beta(d-1)\right)}     (3.1)

where d = 2, 3 is the dimension of the domain and L = log2(N) is the number of levels in the tree. The α and β exponents define how tree and node parallelism, respectively, affect the execution time.
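For example, setting α = 0 and β = 1 in Equation (3.1), i.e., combining tree parallelism with an O(m) parallel front factorization, gives

    T = N^{d-1} \sum_{\ell=0}^{L} 2^{-\ell(d-1)} = O\!\left(N^{d-1}\right),

that is, O(N) in 2D and O(N^2) in 3D, which is the corresponding Ttree+node entry of Table 3.1; the other entries of the table are obtained in the same way.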


              Params          2D            3D
Tseq          α = 1, β = 3    N^3           N^6
Ttree         α = 0, β = 3    N^3           N^6
Tnode         α = 1, β = 1    N^2           N^3
              α = 1, β = 2    N^2 log(N)    N^4
Ttree+node    α = 0, β = 1    N             N^2
              α = 0, β = 2    N^2           N^4

Table 3.1: Execution time lower bounds for the sequential and parallel factorization of sparse matrices from a square (2D) or cubic (3D) domain of size N.

When tree parallelism is used, the best possible execution time is lower-bounded by the time it takes to traverse the longest branch of the assembly tree. Because in our case all the branches are equal, this lower bound is obtained by setting α = 0, which amounts to traversing any branch of the tree. When, instead, node parallelism is used, each single assembly tree node will be factorized by multiple processes and, thus, faster. How much faster is defined by the value of β and depends on the type of parallel front factorization method being used: it can easily be proved that the execution time for the parallel Cholesky factorization of a dense front of size m is Ω(m), whereas for the classical blocked LU factorization with partial pivoting (see Section 2.1.4.1) or the Householder QR factorization (see Section 2.1.4.2) it is Ω(m^2) (more details on this will be provided below). Table 3.1 reports lower bounds for the execution time of the factorization of sparse matrices from a square (2D) or cubic (3D) domain of size N derived from Equation (3.1); these are computed for a sequential execution (α = 1, β = 3) as well as for parallel executions where only tree (α = 0), only node (β = 1, 2), and both tree and node parallelism are used.

Several interesting observations can be made on these results. First, tree parallelism alone does not bring any asymptotic improvement to the execution time. This can be easily understood considering that the sequential execution time in both the 2D and 3D cases is dominated by the time for factorizing the topmost front, where no tree parallelism is available. For the same reason, node parallelism alone can actually reduce the execution time, by a factor that depends on how parallel the dense front factorization is (i.e., β = 1 or 2). In the case of Cholesky factorization (β = 1), node parallelism reduces the relative weight of the topmost nodes enough that adding tree parallelism on top of it can further reduce the execution time. This also holds for LU with partial pivoting and QR (β = 2), but only in the 2D case; in the 3D case tree parallelism does not bring any further improvement even when node parallelism is used.

In view of the above analysis, it is clear that any improvement on the parallelization of dense LU and QR factorizations will considerably benefit the scalability of sparse direct solvers. Section 3.4.1 presents dense LU and QR factorization algorithms whose parallel execution time lower bound is O(m) for a dense matrix of size m. The use of these techniques, referred to as tiled or Communication Avoiding methods, within sparse direct solvers is discussed in Section 3.4. This leads to complex parallel algorithms that are hard to implement in an efficient, scalable and portable way. This is especially the case considering that modern supercomputing architectures heavily rely on thread-level parallelism to achieve high performance and scalability; this calls for parallel programming models that can handle high levels of concurrency in order to feed all the available processing units. Sections 3.1 and 3.3 discuss the use of task-based parallelism and runtime systems for achieving efficient parallel implementations of these methods on multicore platforms.


Section 3.5 discusses how this implementation can be extended to heterogeneous platforms equipped with GPUs by means of appropriate data and task partitioning as well as scheduling methods. Finally, in Section 3.6 we present a performance analysis approach for evaluating the efficiency of parallel codes on heterogeneous platforms.

3.1 Task-based multifrontal QR for manycore systems

Because of the heavy use of Thread Level Parallelism, which is at the base of the multicore technology, and the availability of increasing levels of concurrency, the task-based parallel programming model has enjoyed a renewed popularity. In such a model, a workload is expressed as a Directed Acyclic Graph (DAG) where nodes represent tasks (i.e., units of work) and edges the dependencies among them; this representation can be explicit, in which case the DAG is entirely generated and stored in memory, or implicit, where possible, by means of a set of rules that define the tasks and their mutual dependencies. Based on this representation, an asynchronous execution approach can be employed where a ready task (i.e., one for which all the dependencies are satisfied) is executed as soon as possible, depending on the availability of a suitable processing unit. This allows for making a more effective use of all the available processing units and for easily reacting to dynamic variations of the workload or inaccuracies of performance models, which are quite likely given the complex and heterogeneous nature of modern computing platforms.

The classical approach to shared-memory parallelization of QR multifrontal solvers [15, 55, 126] is based on a complete separation of the two sources of concurrency described above. A task-based approach is commonly employed to express tree parallelism: the assembly tree is actually a DAG of tasks, where each task corresponds to the assembly and factorization of a frontal matrix. Node parallelism, instead, is commonly delegated to multithreaded BLAS or LAPACK libraries. Although this approach works reasonably well for a limited number of cores or processors, it suffers from scalability problems mostly due to two factors:

• Separation of tree and node parallelism: the degree of concurrency in both types of parallelism changes during the bottom-up traversal of the tree; fronts are relatively small at leaf nodes of the assembly tree and grow bigger towards the root node. On the other hand, tree parallelism provides a high level of concurrency at the bottom of the tree and only a little at the top part, where the tree shrinks towards the root node. Since node parallelism is delegated to an external multithreaded BLAS library, the number of threads dedicated to node parallelism and to tree parallelism has to be fixed before the execution of the factorization. Thus, a thread configuration that may be optimal for the bottom part of the tree will result in a poor parallelization of the top part, and vice versa. Although some recent parallel BLAS libraries (for example, Intel MKL) allow for changing the number of threads dynamically at runtime, this would require an accurate performance model and a rigid thread-to-front mapping in order to keep all the cores working at any time. Relying on some specific BLAS library could, moreover, limit the portability of the code.

• Synchronizations: the assembly of a front is an atomic operation. This inevitably introduces synchronizations that limit the concurrency level in the multifrontal factorization; most importantly, it is not possible to start working on a front until all of its children have been fully factorized.

The limitations of the classical approach discussed above can be overcome by employing a task-based approach for exploiting both tree and node parallelism, which is achieved through a fine-grained partitioning of data and operations. A block-column partitioning of the fronts is applied, as shown in Figure 3.1 (left), and five elementary operations are defined:

1. activate: the activation of a frontal matrix corresponds to computing its structure (row/column indices, staircase structure, etc.), allocating its memory, filling it with zeros and assembling the nonzero coefficients from the corresponding rows of the input sparse matrix;

2. _geqrt: this operation, also referred to as panel, amounts to computing the QR factorization of a block-column; Figure 3.1 (middle) shows the data modified when the panel operation is executed on the first block-column;

3. _gemqrt: this operation, also referred to as update, updates a block-column with respect to a panel and corresponds to applying to the block-column the Householder reflectors resulting from the panel reduction; Figure 3.1 (right) shows the coefficients read and modified when the third block-column is update’d with respect to the first panel;

4. assemble: for a block-column, assembles the corresponding part of the contribution block into the parent node (if it exists);

5. deactivate: stores the coefficients of the R and H factors aside and deallocates the memory needed for the frontal matrix storage.

Block-column partitioning Panel operation Update operation

Figure 3.1: Block-column partitioning of a frontal matrix (left) and panel and update operations pattern (middle and right, respectively); dark gray coefficients represent data read by an operation while black coefficients represent written data.

Based on this decomposition, a sequential pseudo-code illustrating the multifrontal QR factorization is reported in Figure 3.2. This pseudo-code loops over all the fronts in the elimination tree in a bottom-up order. When a front f is visited, first of all its structure (rows and columns indices, staircase etc.) is computed and the corresponding memory allocated by the activate routine which also initializes the front by first filling it up with zeros and then assembling the nonzero coefficients from the related rows of the input sparse matrix. Then a loop over all the front children follows where, for each child c, its columns are assembled into f; note that only those columns that overlap with the contribution block have to be assembled but we omit this detail for the sake of readability. The assemble routine assembles one column of c into multiple columns of f depending


 1  forall fronts f in topological order
 2     ! compute structure, allocate and initialize
 3     call activate(f)
 4
 5     forall children c of f
 6        forall blockcolumns j=1...n in c
 7           ! assemble column j of c into f
 8           call assemble(c(j), f)
 9        end do
10        ! deactivate child
11        call deactivate(c)
12     end do
13
14     forall panels k=1...n in f
15        ! panel reduction of column k
16        call _geqrt(f(k))
17        forall blockcolumns j=k+1...n in f
18           ! update of column j with panel k
19           call _gemqrt(f(k), f(j))
20        end do
21     end do
22  end do

Figure 3.2: Pseudo-code for the sequential multifrontal QR factorization with 1D partitioned frontal matrices.

on the c-to-f columns mapping. When all the columns of a child have been assembled, the child can be freed by means of the deactivate routine. When the front is assembled, it can be factorized; this is achieved by the doubly nested loop starting at line 14, which uses the _geqrt and _gemqrt routines. Note that, although for these routines we keep the original LAPACK names, they have been modified to take advantage of the fronts' staircase structure by means of an internal blocking ib, as explained below. Note also that the doubly nested loop starting at line 14 essentially corresponds to the blocked QR factorization described in Section 2.1.4.2.

The multifrontal factorization of a sparse matrix can thus be represented as a DAG of tasks where each task represents the execution of one of the function calls in the pseudo-code of Figure 3.2. Figure 3.3 shows the DAG associated with the subtree defined by supernodes one, two and three for the problem in Figure 2.9 for the case where the block-columns have size one¹; the dashed boxes surround all the tasks that are related to a single front. The dependencies in the DAG are defined according to the following rules (an example of each of these rules is presented in Figure 3.3 with labels on the edges):

• d1: no other elementary operation can be executed on a front or on one of its block-columns before the front is activated;

• d2: a block column can be updated with respect to a panel only if the corresponding panel factorization is completed;

¹ Figure 3.3 actually shows the transitive reduction of the DAG, i.e., the direct dependency between two nodes is not shown in the case where it can be represented implicitly by a path of length greater than one connecting them.


(DAG with nodes labelled a = activate, p = panel, u = update, s = assemble and d = deactivate; edge labels d1 to d7 refer to the dependency rules listed below.)

Figure 3.3: The DAG associated with supernodes 1, 2 and 3 of the problem in Figure 2.9; for the panel, update and assemble operations, the corresponding block-column index is specified. For this example, the block-column size is chosen to be one.

• d3: the panel operation can be executed on block-column k only if it is up-to-date with respect to panel k − 1;

• d4: a block-column can be updated with respect to a panel k in its front only if it is up-to-date with respect to the previous panel k − 1 in the same front;

• d5: a block-column can be assembled into the parent (if it exists) when it is up- to-date with respect to the last panel factorization to be performed on the front it belongs to (in this case it is assumed that block-column j is up-to-date with respect to panel k when the corresponding panel operation is executed);

• d6: no other elementary operation can be executed on a block-column until all the corresponding portions of the contribution blocks from the child nodes have been assembled into it, in which case the block-column is said to be assembled;

• d7: since the structure of a frontal matrix depends on the structure of its children, a front can be activated only if all of its children are already active.

This DAG globally retains the structure of the assembly tree but expresses a higher degree of concurrency because tasks are defined on a block-column basis instead of a front basis. Moreover, it implicitly represents both tree and node parallelism, which allows for exploiting both of them in a consistent way. Finally, it removes unnecessary dependencies, making it possible, for example, to start working on the assembled block-columns of a front even if the rest of the front is not yet assembled and, most importantly, even if the children of the front have not yet been completely factorized; we refer to this additional source of concurrency as inter-level parallelism.


The next two sections discuss two different ways of implementing this task-based parallelization approach. In the first case, described in Section 3.2, this is achieved by hand coding a complex task-queueing system where ready tasks are added through a costly search along the assembly tree. This tasking mechanism relies on some peculiar properties of the multifrontal method and, thus, can hardly be used in other applications. Efficiency is achieved thanks to a number of low-level optimizations that make the code hard to maintain and extend. In the second case, described in Section 3.3, this is achieved, in a more modular, efficient and portable way, through the use of an STF runtime system (see Section 2.4), namely StarPU.

3.2 A hand coded task-based parallel implementation

In this approach [J10] the scheduling and execution of tasks is implemented through a hand-coded system of task queues containing executable tasks (i.e., tasks whose dependencies are already satisfied). During the factorization, threads enter a loop where, at every iteration, a thread:

1. checks whether the number of tasks globally available for execution has fallen below a certain value (which depends, e.g., on the number of threads) and, if it is the case, it calls the fill_queues routine, described below, which searches for ready tasks and pushes them into the queueing system;

2. picks a task. This operation consists in popping a task from the queueing system and is executed by the pop_task routine;

3. executes the selected task if the previous step has succeeded (a minimal sketch of this loop is given below).
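The loop can be summarized as follows. This is only an illustrative sketch: the module name qrm_queues_mod, the task_t type and the helper routines other than fill_queues and pop_task (whose names appear in the text) are hypothetical, and the replenishment threshold is merely indicative.

  subroutine worker_loop(nthreads)
    ! Hypothetical interface: task_t, n_ready_tasks, execute_task and
    ! factorization_done are assumed; fill_queues and pop_task are the
    ! routines described in the text.
    use qrm_queues_mod, only: task_t, fill_queues, pop_task, execute_task, &
                              n_ready_tasks, factorization_done
    implicit none
    integer, intent(in) :: nthreads
    type(task_t)        :: task
    logical             :: found

    do while (.not. factorization_done())
       ! 1. replenish the queueing system when it runs low on ready tasks
       if (n_ready_tasks() < 2*nthreads) call fill_queues()
       ! 2. pop a task: local queue first, then architecture-aware stealing
       call pop_task(task, found)
       ! 3. execute it, if any was found
       if (found) call execute_task(task)
    end do
  end subroutine worker_loop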

The tasks are pushed into the queueing system by the fill_queues routine. At every moment during the factorization there exists a list of active fronts; the fill_queues routine goes through this list looking for ready tasks on each front. Whenever one such task is found, it is pushed into the queueing system. If no task is found related to any of the active fronts, a new ready front (if any) is scheduled for activation; the search for an activable front follows a postorder traversal of the assembly tree, which keeps the memory consumption low (as explained below) and provides good temporal locality of data. Simultaneous accesses to the same front in the fill_queues routine are prevented through the use of locks. The size of the search space for the fill_queues routine is, thus, proportional to the number of fronts that are active, at a given moment, during the factorization. The size of this search space may become excessively large and, consequently, the relative cost of the fill_queues routine intolerable in some cases, for example when the assembly tree is very large and/or when it has many nodes of small size. Two techniques are employed in order to keep the number of active nodes limited during the factorization:

• Logical pruning. Commonly, large assembly trees provide much more tree-level parallelism than what is really needed. A logical pruning can thus be applied to simplify the tree: all the subtrees whose relative computational weight is smaller than a certain threshold are made invisible to the fill_queues routine. When one of the remaining nodes is activated, all the small subtrees attached to it are processed sequentially by the same thread that performs the activation. This is a variant of the Geist-Ng algorithm [73] and, as shown in Figure 3.4, this corresponds to identifying a layer in the assembly tree such that all the subtrees below it will be processed sequentially. This layer has to be as high as possible in order to reduce the number of potentially active nodes, but low enough to provide a sufficient amount of tree-level parallelism on the top part of the tree. Note that this has the further advantage of increasing the locality of access to data because the same thread works on all the nodes within an entire subtree.

Figure 3.4: A graphical representation of how the logical amalgamation and logical pruning may be applied to an assembly tree.

• Tree reordering. Assuming that the assembly tree is processed sequentially following a postorder traversal, the maximum number of fronts active at any time in the subtree rooted at node i, denoted Pi, is defined as the maximum of two quantities:

1. nci + 1, where nci is the number of children of node i. This is the number of active nodes at the moment when i is activated;

2. maxj=1,...,nci (j − 1 + Pj). This quantity captures the maximum number of active fronts while the children of node i are being processed. In fact, at the moment when the peak Pj is reached in the subtree rooted at the j-th child, all the previous j − 1 children of i are still active.

In order to minimize the maximum number of active nodes during the traversal of the assembly tree it is, thus, necessary to minimize, for each node i, the second quantity, which is achieved by sorting all of its children j in decreasing order of Pj [118] (a small sketch of this procedure is given below). The effect of this reordering on an example tree is illustrated in Figure 3.5. If the tree is traversed in the order shown in the left part of the figure, P19 = 10 (the nodes active at the moment when the peak is reached are highlighted with a thick border); instead, if the tree is reordered as in the right part of the figure, following the method described above, P19 is equal to four. Although there is no guarantee that a postorder is followed in a parallel factorization, this tree reordering technique still provides excellent results on every problem that has been tested so far. Besides, by reducing the number of active nodes, this reordering also helps reduce the consumed memory, although it will not be optimal in this sense as the actual size of the frontal matrices is not taken into account (the reordering technique for memory consumption minimization is described in the paper by Guermouche et al. [89] and Liu [118]).
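The sketch below implements the recurrence and the sorting just described; the tree representation (an array of child indices per node) is an assumption made for the sake of the example and is not the data structure used in qr_mumps.

  module tree_reorder_mod
    implicit none

    type :: node_t
       ! indices of the children of this node in the tree array;
       ! unallocated or zero-sized for a leaf
       integer, allocatable :: children(:)
    end type node_t

  contains

    ! Returns Pi, the peak number of simultaneously active fronts in the
    ! subtree rooted at node i under a postorder traversal, after sorting
    ! the children of every visited node by decreasing Pj.
    recursive function reorder_and_peak(tree, i) result(pi)
      type(node_t), intent(inout) :: tree(:)
      integer,      intent(in)    :: i
      integer                     :: pi, nc, j, k, tmp
      integer, allocatable        :: p(:)

      nc = 0
      if (allocated(tree(i)%children)) nc = size(tree(i)%children)
      if (nc == 0) then
         pi = 1                          ! a leaf is active on its own
         return
      end if

      allocate(p(nc))
      do j = 1, nc
         p(j) = reorder_and_peak(tree, tree(i)%children(j))
      end do

      ! insertion sort of the children by decreasing Pj
      do j = 2, nc
         k = j
         do while (k > 1)
            if (p(k) <= p(k-1)) exit
            tmp = p(k); p(k) = p(k-1); p(k-1) = tmp
            tmp = tree(i)%children(k)
            tree(i)%children(k)   = tree(i)%children(k-1)
            tree(i)%children(k-1) = tmp
            k = k - 1
         end do
      end do

      ! Pi = max( nci+1, max_j (j-1+Pj) ) with the children in the new order
      pi = nc + 1
      do j = 1, nc
         pi = max(pi, j - 1 + p(j))
      end do
    end function reorder_and_peak

  end module tree_reorder_mod

Calling reorder_and_peak on the root of the assembly tree reorders the children of every node and returns the resulting peak number of simultaneously active fronts.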

Both the tree pruning and reordering are executed during the solver’s analysis phase.


Figure 3.5: The effect of leaves reordering on the number of active nodes.

In order to achieve better performance and memory consumption, a dedicated task scheduling policy was implemented that, by deciding how tasks are pushed to and popped from the queueing system, defines the behavior of the solver. This scheduling technique aims at optimizing the reuse of local data in a NUMA system while still prioritizing tasks that have a high number of outgoing edges (fan-out). Although the multifrontal method is rich in Level-3 BLAS operations (mostly in the _gemqrt tasks), there is still a considerable amount of Level-2 BLAS operations (within the _geqrt tasks) and symbolic or memory ones (the activate, assemble and deactivate tasks) whose efficiency is limited by the speed of the memory system. As the number of threads participating in the factorization increases, the relative cost of these memory-bound operations grows too high and there are fewer opportunities to hide this cost by overlapping these slow operations with faster ones. In addition, some frontal matrices, especially at the bottom of the tree, may be too small to achieve the surface-to-volume effect in Level-3 BLAS operations. Therefore, in order to improve the scalability, it is important to perform these memory-bound operations as efficiently as possible, and this can be achieved by executing each of them on the core which is closest to the data it manipulates. The proposed method is based on a concept of ownership of a front: the thread that performs the activate operation on a front becomes its owner and, therefore, becomes the privileged thread to perform all the subsequent tasks related to that front. By using methods like the "first touch rule" (memory is placed on the NUMA node which generates the first reference) or allocation routines which are specific to NUMA architectures [40], the memory needed for a front can be allocated on the NUMA node which is closest to its owner thread. No front-to-thread mapping is performed and thus the ownership of a front is dynamically set at the moment when the front is activated. Each thread is associated with a local task queue, filled by the fill_queues routine, which contains the tasks related to the fronts it owns. The pop_task routine, when executed by a thread, first tries to pop a ready task from the local queue; in case no task is available on the local queue, an architecture-aware work-stealing technique is employed, i.e., the thread will try to steal a task from queues associated with threads with which it shares some level of memory (caches or a DRAM module on a NUMA machine) and, if still no task is found, it will attempt to steal a task from any other queue. The computer's architecture is detected using the hwloc [40] tool. Although tasks are always popped from the head of each queue, they can be pushed either on the head or on the tail, which allows for prioritizing certain tasks. In our implementation, the _geqrt operations are always pushed on the head because the corresponding nodes in the execution DAG have a higher fan-out and, thus, their execution satisfies

more dependencies. Finally, it must be noted that an efficient use of tree-level parallelism makes it hard, if not impossible, to follow a postorder traversal of the tree, which results in an increased memory consumption with respect to the sequential case [89, 118]. Therefore the proposed scheduling method tries to exploit node-level parallelism as much as possible and dynamically resorts to tree-level parallelism by activating a new node only when no more tasks are found on already active fronts. This keeps the tree traversal as close as possible to the one followed in the sequential execution and prevents the memory consumption from growing out of control.

3.2.1 Blocking of dense matrix operations

It is obviously desirable to use blocked operations that rely on Level-3 BLAS routines in order to achieve a better use of the memory hierarchy and, thus, better performance. The use of blocked operations, however, introduces additional fill-in in the Householder vectors due to the fact that the staircase structure of the frontal matrices cannot be fully exploited. It can be safely said that it is always worth paying the extra cost of this additional fill-in because the overall performance will be drastically improved by the high efficiency of Level-3 BLAS routines; nonetheless, it is important to choose the blocking value that gives the best compromise between the number of operations and the efficiency of the BLAS. This blocking size, which defines the granularity of computations, has to be chosen with respect to the block-column size used for partitioning the frontal matrices, which defines the granularity of parallel tasks.

Figure 3.6: The effect of internal blocking on the generated fill-in. The light gray dots show the frontal matrix structure if no blocking of operations is applied whereas the dark gray dots show the additional fill-in introduced by blocked operations.

Figure 3.6 shows as dark gray dots the extra fill-in introduced by the blocking of operations (denoted as ib for internal blocking) with respect to the partitioning size nb on an example frontal matrix.

3.2.2 Experimental results

The method discussed in this section was implemented in version 1.0 of the qr_mumps² (or qrm for the sake of brevity) package, released in November 2012. The code is written in Fortran 2003 and OpenMP is the technology chosen to implement the multithreading.

² http://buttari.perso.enseeiht.fr/qr_mumps


The experiments were run on a set of 51 matrices from the UF Sparse Matrix Collection [56]; they were chosen from the complete set of over- and under-determined problems by excluding those that are rank-deficient (because qr_mumps cannot currently handle them), those that are too small to evaluate the scalability and those that are too big to be factorized on the computer platforms described below. More detailed experiments were performed on a subset of matrices whose details are reported in Appendix A.1. In the case of under-determined systems, the transposed matrix is factorized, as is commonly done to find the least-norm solution of a problem. All of the results presented below were produced without storing the H matrix in order to extend the test set to very large matrices that could not otherwise be factorized on the available computers. For all the 51 matrices in the test set, the blocking values were chosen among three different combinations, i.e., (nb, ib) = (120, 120), (120, 60) or (60, 60), as those that delivered the shortest factorization time using all the cores available on the system. All the experimental results presented in the rest of this section are related to that choice.

3.2.2.1 Understanding the memory utilization

As described above, the scheduling of tasks in qrm is based on a method that aims at maximizing the locality of data in a NUMA environment. The purpose of this section is to provide an analysis of the effectiveness of this approach. This analysis was conducted on the dude system, always using all of the 24 cores available, with the PAPI [128] tool. Recalling the architectural characteristics of the dude system (see Appendix A.2), the efficiency of the scheduling technique has been evaluated based on three metrics:

• the completion time;

• the amount of data transferred on the HyperTransport links³;

• the number of conflicts on the DRAM controllers⁴.

Experiments were run on all the matrices in the test set and with three different settings for the memory policy:

• no locality: this is a code variant where the multiple task queues are replaced with a single one shared by all threads. Tasks are, thus, pushed and popped from this queue regardless of their affinity with the data placement in the memory system;

• locality: this corresponds to the scheduling strategy described in Section 3.2, i.e., each task is pushed on the queue attached to the thread which owns the related front;

• round robin: this setting uses the same code variant as the "locality" one but, in this case, allocated memory pages are interleaved in a round robin fashion over all the DRAM modules. This is achieved using the numactl [107] tool with the "-i all" option. Since the code does not control the placement of data in the memory system, the ownership of fronts does not make sense anymore and, consequently, the data locality is completely destroyed.

³ This quantity was measured as the number of occurrences of the PAPI HYPERTRANSPORT_LINKx:DATA_DWORD_SENT event (where x is 0, 1, 2, 3 since each processor has 4 HyperTransport links), which counts the number of double-words transferred over the HyperTransport links.
⁴ This quantity was measured as the number of occurrences of the PAPI DRAM_ACCESSES_PAGE:DCTx_PAGE_CONFLICT event (where x is 0, 1 since each processor has two DRAM controllers), which counts the number of conflicts occurring on the DRAM controllers.


Matrix                       3       6       7        8       9       10
Time (sec.)
  no. loc.                8.97   28.48   53.38    53.07  157.20   371.78
  loc.                    7.57   25.77   48.43    48.25  143.48   345.75
  r. r.                   6.54   22.38   28.90    42.50  113.70   328.40
Dwords on HT (×10⁹)
  no. loc.               23.59   65.61   81.40   109.45  371.73   803.27
  loc.                   11.92   47.30   76.94    90.52  259.89   656.50
  r. r.                  25.00   72.60   86.13   125.66  378.09   869.87
Conflicts on DCT (×10⁹)
  no. loc.                0.26    0.68    0.98     1.07    3.29     7.36
  loc.                    0.29    0.70    0.92     1.21    4.07     8.14
  r. r.                   0.24    0.52    0.62     0.79    3.17     6.31

Table 3.2: Analysis of the memory usage for the matrix factorization on the dude system with 24 threads.

Table 3.2 shows the results of these experiments for a subset of the matrices in Table A.1. It can be seen that the locality-aware scheduling strategy described above provides better execution times than a naive dynamic scheduling policy: the improvement may be as high as almost 20% (for the EternityII_E matrix) and is quite often around 10%. This can be explained by the reduced amount of data transferred over the HyperTransport links, as shown in the middle of the table. However, the best execution times are achieved with the round robin distribution of memory allocations. In this case, the amount of data transfers is higher than in the other two cases since the frontal matrices are completely scattered over the NUMA nodes; nonetheless, this more even distribution of data reduces the number of conflicts on the DRAM controllers (see Table 3.2 (bottom)) and provides a better use of the memory bandwidth. This behavior is consistent with what was observed on the HSL_MA87 [101] code, which uses a similar approach for the Cholesky factorization of sparse linear systems. The considerable performance improvement resulting from the interleaved memory allocation suggests that reducing the memory conflicts may be more important than minimizing the amount of data transferred over the HyperTransport links. A closer look at the behavior of matrices e18 and flower_7_4 (columns three and four in Table 3.2) shows, instead, that both these objectives are very important for the efficiency of the code. The factorization of these two matrices takes roughly the same time in the "locality" and "no locality" cases, but the memory interleaving provides only a small improvement for the second matrix due to a bigger increase of the data traffic on the HyperTransport links. Maximizing the data locality and minimizing the memory conflicts are not conflicting objectives, although it may be rather complicated to achieve both of them at the same time on a very heterogeneous workload such as a sparse factorization. It has to be noted that, in the proposed locality-aware scheduling policy, the placement of data and the ownership of the associated tasks are defined on a front basis. Because the number of fronts becomes smaller than the number of working threads when the factorization approaches the root front, a lot of work stealing and memory contention take place on the top part of the assembly tree, where most of the work is done. Further improvements can be obtained by defining the data placement and the task affinity on a block-column basis; although this would make some operations, such as the front assembly, much more complex, it is


Matrix        0        1        2       4      5        10
none       3.52   385.10   151.40    7.12   9.53   9239.00
reord.     2.89   113.80   140.90    7.01   8.75   1651.00
prune      0.94    17.27     5.52    6.99   9.38    328.50
both       0.84    14.44     5.16    6.83   9.20    326.40

Table 3.3: The effect of tree pruning and tree reordering on the matrix factorization time on the dude system with 24 threads.

reasonable to expect that it would allow for distributing the data more evenly and efficiently, thus improving the locality of reference and reducing the memory contention at the same time.

3.2.2.2 The effect of tree pruning and reordering

The tree reordering and pruning techniques have been evaluated on the test matrices. For the tree pruning, the initial threshold was set to 0.01, which means that all the subtrees whose weight is smaller than 1% of the total factorization workload are pruned off. If the remaining tree does not provide enough tree-level parallelism, the threshold is divided by 2 and a new pruning is done on the original assembly tree. More precisely, this procedure is iterated until the number of leaves in the pruned tree is bigger than twice the number of working threads. Clearly, the optimal values for both the starting threshold and the stopping criterion depend on the structure of the tree and, therefore, on the specific input matrix; the values described above were determined experimentally and were found to work well on the large set of test matrices previously described. Table 3.3 shows the experimental results related to a subset of matrices for which these techniques have proved to be particularly effective. The experiments show that, when applied separately, these two methods provide considerable benefits in some specific cases. The tree reordering yields good improvements on very unbalanced and irregular trees: this is the case of the LargeRegFile and sls matrices. The tree pruning proved to be very effective on all the problems, but particularly on those with extremely large trees and extremely small frontal matrices, such as the cont11_l and the sls matrices. It can be observed that the pruning clearly reduces the need for sorting as well as its effectiveness. Nonetheless, on some matrices (see, particularly, the first three columns in Table 3.3) the best execution time is achieved when both techniques are applied. It is important to note that the size of the pruned tree increases with the number of working threads and, therefore, the improvements provided by the sorting, when both techniques are applied, are likely to be more important for higher degrees of parallelism.

3.2.2.3 Absolute performance and scaling

The qrm code was compared to SuiteSparseQR [55] (referred to as spqr), released by Tim Davis in 2009; this comparison was made on the dude system described in Appendix A.2. For both packages, the COLAMD matrix permutation was applied in the analysis phase to reduce the fill-in, and equivalent node amalgamation methods were used, so that the differences between the produced assembly trees can be considered negligible. Both packages are based on the same variant of the multifrontal method (which includes the two optimization techniques discussed in Section 2.2.2) and, thus, the number of floating point operations done in the factorization and the number of entries in


(Two profiles for qrm and spqr on the AMD Istanbul based dude system: number of problems versus relative time, left, and versus relative memory, right.)

Figure 3.7: Time and memory profiles comparing the qrm and spqr factorizations for the 51 test matrices on the dude system.

the resulting factors are comparable. For both codes, runs were executed with numactl memory interleaving. Figure 3.7 shows time and memory profiles [61] for both codes using 24 threads. For a given code, a data point (x, y) on the profile means that the code is no worse than x times the best of the two codes on y problems. This means that the (1, y) point gives the number y of problems on which the associated code was found to be the best. The time profile shows that qrm is faster than spqr on 43 problems out of 51; the remaining eight problems are very small, i.e., their factorization takes less than one second. On basically half the problems in the test set, qrm was more than twice as fast as spqr. The memory profile is more difficult to interpret. Because qrm is based on a much more eager execution model, it is reasonable to expect that it consumes more memory. Indeed, our scheduling method exploits the node parallelism as much as possible and resorts to tree parallelism, by activating new nodes, only when needed; this keeps the tree traversal close to the one followed in the sequential case and limits the memory consumption growth in parallel. Moreover, the tree reordering technique, as explained above, helps reduce the memory footprint. As a result, qrm achieves, in general, a smaller memory consumption than spqr on the 51 test matrices, as shown in the memory profile of Figure 3.7. Experimental data show that most of the matrices where qrm has a better memory consumption are relatively small, which may probably be explained by some overhead in spqr that becomes negligible for bigger problems. Note that, in both codes, the memory footprint is defined as the peak memory consumption reached during the factorization, including the input matrix and all the allocated memory areas of any type. On bigger problems, the two codes have a similar memory consumption and no clear winner can be identified. If the H matrix were kept in memory, which is not the case for the results reported in Figure 3.7, the difference between the two codes would be even smaller because the relative weight of the contribution blocks, responsible for the increased memory consumption in parallel, would be lower. Table 3.4 shows execution times for some test problems on the dude and vargas


          dude
Matrix      2      3      4      5      6      7      8      9     10     11
th.  1   48.8   88.5  103.2  134.9  392.0  474.5  774.7   1786   3301   5185
     2   33.8   49.2   52.2   65.5  209.9  250.3  357.4    961   1802   2770
     4   14.8   24.8   26.5   33.4  106.4  126.3  181.6    495    932   1372
     6   12.4   17.6   18.4   22.9   71.9   86.6  124.7    341    655    923
    12    6.2    9.8   10.4   12.8   37.8   46.2   65.8    181    421    477
    18    5.2    7.9    7.7    9.1   27.3   32.9   48.6    132    341    325
    24    4.8    6.5    6.8    8.4   22.4   28.9   42.5    114    327    260
    su   10.1   13.6   15.1   16.0   17.5   16.4   18.2   15.6   10.0   19.9

          vargas
Matrix      2      3      4      5      6      7      8      9     10     11
th.  1   36.2   51.9   60.5   75.2  227.3  290.2  426.1   1156   2142   3121
     2   24.5   27.6   32.0   38.6  117.2  151.7  192.9    600   1259   1656
     4   10.6   14.0   16.0   19.4   57.1   76.5   98.0    304    672    850
     8    5.9    7.3    8.2    9.9   28.6   39.3   49.5    155    348    458
    16    3.9    4.1    4.7    5.7   15.4   21.3   27.6     84    197    235
    24    4.5    3.1    3.7    4.3   11.3   15.5   22.8     62    183    174
    32    6.4    2.6    3.3    4.0    9.5   12.9   18.9     52    182    141
    su    5.6   19.9   18.3   18.8   23.9   22.4   22.5   22.2   11.7   22.1

Table 3.4: Factorization times, in seconds, on the dude and vargas systems for qrm.

The reported results show an overall good performance and scalability of the code, with speedups that reach around 20 and 23 on the two machines, respectively. It must be noted that a lower scaling is reached on smaller problems and on the sls matrix, whose fronts are extremely overdetermined. These classes of problems will be addressed in Section 3.4.

3.3 Runtime-based multifrontal QR for manycore systems

The multifrontal QR method described in the previous section fully conforms to the task-based parallel programming paradigm that is supported by some of the most popular, modern runtime systems. In essence, the actual implementation relies on our own, hand-written runtime system whose functioning and features are described above. This, however, has a number of drawbacks and limitations. First of all, this runtime system is deeply intertwined with the multifrontal QR method and far from usable for other applications. Second, it has a very limited set of features and, for example, cannot handle systems other than simple multicore nodes. Third, it is inefficient because the search for ready tasks implies searching in a rather large space (which required the use of techniques such as the tree reordering described in Section 3.2). It is thus a natural choice to port this method to a proper, modern runtime system in order to take advantage of better efficiency, portability and a wider set of features. As explained in Section 2.4, among all the available runtimes, we have chosen to use StarPU mostly because it relies on the Sequential Task Flow programming

model, which is extremely convenient for implementing complex algorithms such as the multifrontal one, and because it has a very wide set of features, some of which we will describe below. A first attempt to port the multifrontal QR method of Section 3.1 to StarPU [C3] was mainly meant to assess the usability of a modern runtime system for implementing a sparse direct solver. For this reason we tried to reproduce as accurately as possible the behavior of the code described in the previous section. This led to satisfactory results. Nonetheless, the resulting code had a number of shortcomings [C3], including a rather intricate task submission scheme which did not fully adhere to the STF model; this prevented the runtime from properly handling the data (which is necessary when multiple memories are available) because the flow of data between tasks was not correctly expressed. In the remainder of this section we describe an implementation that is fully compliant with and takes full advantage of the expressiveness of the STF programming model. As explained in Section 2.4, parallelizing a code with the STF model is conceptually as simple as replacing operations (or function calls) with task submissions. This can easily be achieved by replacing the function calls in Figure 3.2 with the submission of the corresponding tasks. In our case there is one caveat though: the assembly tasks can only be submitted after the structure of the current front has been computed by the activate routine. As a consequence, this routine cannot be made into a task whose execution is deferred but, instead, has to be executed synchronously by the master thread prior to submitting all the other tasks related to the corresponding front. It is thus very important to keep this routine as lightweight as possible in order not to delay the submission of tasks in the main loop. For this reason the front initialization was moved into a separate init routine and only the structure computation and the memory allocation were left in the activate one. The latter also registers the data with StarPU; once this is done, StarPU has full control of the data and the master process can only reference it by means of a handle returned by StarPU upon registration. The pseudo-code in Figure 3.8 shows the resulting STF parallel multifrontal QR factorization. Using the data access modes and the order of task submission, the runtime system automatically infers the dependencies between tasks described in Section 3.1 and thus builds the DAG. The actual code also includes the logical tree pruning optimization described in Section 3.2, which we left out of the pseudo-code for the sake of readability. Each leaf-subtree whose weight is smaller than a certain threshold is processed sequentially by a single task of type do_subtree; clearly, no partitioning is applied to fronts in a sequential subtree and a standard LAPACK-style factorization (which can take advantage of the staircase structure) is used on them. This has a threefold advantage. First of all, it drastically reduces the number of tasks and thus reduces the runtime overhead. Second, subtree tasks, which are of relatively large size, keep the worker threads busy at the very beginning of the factorization; this gives the master thread enough time to progress in the submission of other tasks, which reduces the risk of task starvation during the execution.
Finally, it improves the efficiency of operations on those parts of the elimination tree that are mostly populated with small fronts and are, thus, less performance effective. A minor but profitable improvement over the original qr_mumps solver presented in Section 3.1 is the use of a blocked storage format. In the previous version, each frontal matrix is allocated as a single memory area and the partitioning is therefore only logical. In this implementation, instead, each block-column is allocated individually; although this does not bring any improvement to the performance (because Fortran uses column-major storage), it saves some memory due to the staircase structure of the fronts, as shown in Figure 3.16 (left).
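For illustration, here is a minimal C sketch (not the actual qr_mumps code) of what the activate/submit pattern translates to with the StarPU API: each block-column of a front is registered once and is thereafter referenced only through its handle, and the panel/update tasks are submitted with R/RW access modes from which the runtime infers the dependencies. The codelets geqrt_cl and gemqrt_cl are assumed to be defined elsewhere.

#include <stdint.h>
#include <starpu.h>

/* Codelets wrapping the panel and update kernels; their definitions
   (cpu_funcs, nbuffers, modes) are assumed to exist elsewhere. */
extern struct starpu_codelet geqrt_cl, gemqrt_cl;

/* Register the block-columns of a front (stored as separate column-major
   blocks of leading dimension m) and submit the factorization tasks.
   The runtime infers the task dependencies from the access modes. */
void submit_front_factorization(double **blk, int m, int nb, int nblocks)
{
    starpu_data_handle_t h[nblocks];
    for (int j = 0; j < nblocks; j++)
        starpu_matrix_data_register(&h[j], STARPU_MAIN_RAM,
                                    (uintptr_t)blk[j], m, m, nb,
                                    sizeof(double));

    for (int k = 0; k < nblocks; k++) {
        /* panel reduction of block-column k */
        starpu_task_insert(&geqrt_cl, STARPU_RW, h[k], 0);
        /* update of the trailing block-columns with panel k */
        for (int j = k + 1; j < nblocks; j++)
            starpu_task_insert(&gemqrt_cl, STARPU_R, h[k],
                               STARPU_RW, h[j], 0);
    }

    starpu_task_wait_for_all();
    for (int j = 0; j < nblocks; j++)
        starpu_data_unregister(h[j]);
}

In the actual solver the wait is, of course, performed only once, after the tasks of all fronts have been submitted.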


 1  forall fronts f in topological order
 2    ! compute structure, allocate memory and register handles
 3    call activate(f)
 4    ! initialize front
 5
 6    call submit(init, f:RW, children(f):R)
 7
 8    forall children c of f
 9      forall blockcolumns j=1...n in c
10        ! assemble column j of c into f
11        call submit(assemble, c(j):R, f:RW|C)
12      end do
13      ! deactivate child
14      call submit(deactivate, c:RW)
15    end do
16
17    forall panels k=1...n in f
18      ! panel reduction of column k
19      call submit(_geqrt, f(k):RW)
20      forall blockcolumns j=k+1...n in f
21        ! update of column j with panel k
22        call submit(_gemqrt, f(k):R, f(j):RW)
23      end do
24    end do
25  end do
26  call wait_tasks_completion()

Figure 3.8: Pseudo-code for the STF-parallel multifrontal QR factorization with 1D partitioned frontal matrices.

The resulting implementation, which we refer to as 1D because of the front partitioning into block-columns, was tested on a subset of problems from the SuiteSparse Matrix Collection [56] listed in Table A.1. Experiments were done on one node of the ada supercomputer described in Appendix A.2. In the experiments discussed below, local and interleaved memory allocation policies were used, respectively, for sequential and parallel runs through the use of the numactl tool; these are the policies that deliver the best performance in their respective cases. The reference sequential execution times are obtained with a purely sequential code (no potential runtime overhead) with no frontal matrix partitioning, which ensures that all the LAPACK and BLAS routines execute at the maximum possible speed (no granularity trade-off). The performance of the presented approach depends on the choice of the values for a number of different parameters. These are the block-column size nb, which determines the amount of concurrency, and the internal block size ib, which determines the efficiency of the elementary BLAS operations and the global amount of flops (this parameter defines how well the staircase structure of each front is exploited). The choice of these values depends on a number of factors, such as the number of working threads, the size and structure of the matrix, the shape of the elimination tree and of the frontal matrices, and the features of the underlying architecture. It has to be noted that these parameters may be set to different values for each frontal matrix; moreover, it would be possible to let the software automatically choose values for these parameters. Both these tasks are very difficult

and challenging and have not been investigated yet. Therefore, for our experiments we performed a large number of runs with varying values for all these parameters, using the same values for all the fronts in the elimination tree, and selected the best results (shortest running time) among those. For the sequential runs, internal block sizes ib={32, 40, 64, 80, 128} were used for a total of five runs per matrix. For the 1D parallel STF case, the used values were (nb,ib)={(128,32), (128,64), (128,128), (160,40), (160,80)} for a total of five runs per matrix. All of the results presented in this section were produced without storing the factors in order to extend the tests to the largest matrices in our experimental set that could not otherwise be factorized (even sequentially) on the target platform. This was achieved by simply deallocating the tiles containing the factor coefficients at each deactivate task rather than keeping them in memory. As confirmed by experiments that we do not report here for the sake of space and readability, this does not have a significant impact on the following performance analysis.

         Sequential reference               Parallel 1D STF
Mat.    ib   Time (s.)   Gflop/s      nb    ib   Time (s.)   Gflop/s
12      40   1.00E+02      14.4      128    64   5.337E+00     272.4
13      32   1.73E+02      17.0      128   128   9.809E+00     312.0
14      80   3.40E+02      17.1      128   128   1.922E+01     309.8
15     128   5.76E+02      19.0      128   128   3.116E+01     352.0
16      80   8.71E+02      19.2      128   128   4.646E+01     362.0
17      80   1.18E+03      17.7      128    32   4.945E+01     407.8
18     128   1.58E+03      19.3      128   128   8.383E+01     365.9
19     128   3.25E+03      19.5      128   128   1.501E+02     422.4
20     128   3.99E+03      16.7      128    64   6.432E+02     102.7
21     128   9.93E+03      19.7      128   128   4.402E+02     446.8
22     128   1.13E+04      19.7      128   128   5.207E+02     430.6
23     128   1.37E+04      19.2      128   128   6.233E+02     422.7

Table 3.5: Sequential reference execution time and optimum performance for the STF 1D factorization on ada (32 cores).

Table 3.5 shows, for the STF 1D algorithm, the parameter values delivering the shortest execution time along with the corresponding factorization time and Gflop/s rate. The Gflop/s rates reported in the table are related to the operation count achieved with the internal block size ib. The smaller block-column size of 128 always delivers better performance because it offers a better compromise between concurrency and efficiency of BLAS operations, whereas a large internal block size is more desirable because it leads to better BLAS speed despite a worse exploitation of the fronts' staircase structure. Figure 3.9, generated with the timing data in Table 3.5, shows the speedup achieved by the 1D parallel code with respect to the sequential one when using all the 32 cores available on the system. This figure shows that the speedup increases with the problem size but may be extremely low on some problems, such as matrix #20 whose speedup is less than 7. Although these results are satisfactory on larger problems, on smaller ones and on matrix #20 performance and scalability are relatively poor. This is due to a lack of concurrency, especially for matrix #20 where most of the flops are done in one front which has 1.3 M rows and only 7 K columns; the 1D partitioning into block-columns is clearly not suitable for the case of strongly over-determined frontal matrices. This problem will be addressed in the next section.



Figure 3.9: Speedup of the STF 1D algorithm with respect to the sequential case on ada (32 cores).

3.4 Communication Avoiding QR fronts factorizations

The parallel factorization algorithm presented in the two previous sections relies on the use of the blocked algorithms presented in Section 2.1.4 combined with a 1D partitioning of frontal matrices into block-columns. These can be roughly described as a sequence of panel reductions and trailing submatrix updates. The first operation factorizes a panel, i.e., a block of columns, by means of a point factorization method and thus is based on Level-2 BLAS operations; the second applies all the transformations computed in a panel reduction to the trailing submatrix using Level-3 BLAS operations. In order to maximize the amount of operations in Level-3 BLAS routines, the size of the panel has to be much smaller than the size of the matrix; however, for these Level-3 BLAS routines to be efficient, the size of the panel must not be too small. In these methods, parallelism mostly comes from the update step, which is where most of the floating point operations occur. The panel step, instead, cannot be efficiently parallelized for two main reasons. First, in the blocked factorizations described in Section 2.1.4, the work in the panel reduction cannot be shared without incurring a high amount of communications; note that here by communications we mean either message exchanges in a distributed memory system or data transfers from memory and synchronizations in a shared memory one. In the case of the LU factorization this is due to partial pivoting, which implies a search along a whole column, and in the case of the QR factorization this is due to the norm computation in Equation (2.11) required to construct the Householder reflection, which also requires access to a whole column. Second, because the panel reduction is made of Level-2 BLAS routines, the cost of the communications becomes predominant and cannot be hidden behind computations. As a result, in these blocked factorizations, parallelization is achieved mainly through a fork-join approach where sequential operations (panel reductions) are alternated with parallel ones (trailing submatrix updates). Lookahead techniques can improve the efficiency by overlapping panel reductions with updates from previous steps. Nonetheless this approach presents serious scalability limits, especially for overdetermined matrices where the relative weight of the panel reduction becomes dominant.


[DAG of the block-column factorization tasks, where ge(k) = _geqrt(f(k)) and gem(k,j) = _gemqrt(f(k), f(j)).]

Figure 3.10: Task graph for the block-column QR factorization of a dense matrix with q block-columns. Arrows show task dependencies; the critical path is shown with thick solid edges.

Consider a block size b and a dense matrix of size m × n with m = pb and n = qb; for this matrix, the DAG of the dense QR factorization by block-column in line 14 of Figure 3.2 is illustrated in Figure 3.10. All the tasks that lie on a given path of this DAG must be executed sequentially because of the dependencies expressed by the edges of the path; as a result, the makespan (i.e., the total execution time) cannot be smaller than the duration of the longest path in the DAG, which we refer to as the critical path. Assuming that the costs of _geqrt and _gemqrt on a block-column of height pb are, respectively, 2(3p − 1)b³/3 and 3(4p − 2)b³/3, the length of the critical path, shown in Figure 3.10 with thick solid edges, is given by

    Σ_{k=0}^{q−1} ( 2(3(p − k) − 1) + 3(4(p − k) − 1) ) = O(2pq − q²) = O(2mn − n²)

where we have dropped the b³/3 factor assuming that b is constant (i.e., independent of the size of the problem) and b ≪ min(m, n). It is thus clear why, as mentioned in Section 3, for a square frontal matrix of size m, this approach can, at best, reduce the execution time from O(m³) to O(m²). Remember that the sequential execution time for the QR factorization is O(2n²(m − n/3)); therefore, this parallelization approach suffers from a severe lack of concurrency in the case of (strongly) overdetermined matrices because both the parallel and sequential execution times grow linearly with m for a fixed n. This is often the case in the multifrontal QR method, where fronts commonly have (many) more rows than columns. In the multifrontal method this problem is mitigated by the fact that multiple frontal matrices are factorized at the same time. Nonetheless, considering that in the multifrontal factorization most of the computational weight is related to the topmost nodes, where tree parallelism is scarce, a 1D front factorization approach can still seriously limit the scalability, as shown in the analysis of Section 3. The objective of tiled matrix factorizations is to overcome these difficulties by breaking down the panel reduction and the corresponding updates into fine-grained tasks that can be efficiently parallelized or pipelined. The case of the Cholesky factorization is trivial and will not be discussed here; we refer the reader to the original paper [J15] on tiled

factorizations for the details. In the next section we will present the tiled QR and LU factorizations.

3.4.1 Communication Avoiding dense factorizations

Algorithm 3.1 Tiled QR and LU (with block pairwise pivoting) factorizations.
 1: for k = 1, 2, ..., min(p, q) do                  for k = 1, 2, ..., min(p, q) do
 2:   _geqrt(Ak,k, Vk,k, Rk,k, Tk,k)                   _getrf(Ak,k, Lk,k, Uk,k, Pk,k)
 3:   for j = k+1, k+2, ..., q do                      for j = k+1, k+2, ..., q do
 4:     _gemqrt(Ak,j, Vk,k, Tk,k, Rk,j)                  _gessm(Ak,j, Lk,k, Pk,k, Uk,j)
 5:   end for                                          end for
 6:   for i = k+1, k+2, ..., p do                      for i = k+1, k+2, ..., p do
 7:     _tpqrt(Rk,k, Ai,k, Vi,k, Ti,k)                   _tstrf(Uk,k, Ai,k, Pi,k)
 8:     for j = k+1, k+2, ..., q do                      for j = k+1, k+2, ..., q do
 9:       _tpmqrt(Rk,j, Ai,j, Vi,k, Ti,k)                  _ssssm(Uk,j, Ai,j, Li,k, Pi,k)
10:     end for                                          end for
11:   end for                                          end for
12: end for                                          end for

These methods work by partitioning the input matrix into blocks, or tiles (we will use the term tile rather than block to avoid confusion with blocked factorization methods); for the sake of simplicity we assume that these tiles are square (of size b) but they need not be. At each step of the factorization the diagonal tile is first factorized and then it is used to annihilate all the subdiagonal tiles one after the other. Algorithm 3.1 shows these two factorization techniques; here we have assumed that the input matrix A of size m × n is partitioned into square tiles of size b such that p = m/b and q = n/b. The used kernels are:

_geqrt,_getrf : these two kernels perform the QR and LU factorization of a square tile, respectively:

_geqrt : Ak,k −→ (Vk,k, Rk,k, Tk,k) = QR(Ak,k)
_getrf : Ak,k −→ (Lk,k, Uk,k, Pk,k) = LU(Ak,k)

They both use an internal blocking for better efficiency. The _geqrt is equivalent to the LAPACK _geqrf described in Section 2.1.4 except that it does not discard the computed T matrices but it keeps them for later use in the _gemqrt kernel described below. The _getrf is the standard blocked LU factorization with partial pivoting described in Section 2.1.4.

_gemqrt,_gessm : these two kernels apply the transformations computed by the two corresponding ones above to a square tile:

_gemqrt : (Ak,j, Vk,k, Tk,k) −→ Rk,j = (I − Vk,k Tk,k Vk,k^T)^T Ak,j
_gessm  : (Ak,j, Lk,k, Pk,k) −→ Uk,j = Lk,k^-1 Pk,k Ak,j


_tpqrt, _tstrf : these two kernels are used to factorize a matrix formed by a triangular tile on top of a square one (below, [X ; Y] denotes the tile X stacked on top of the tile Y):

_tpqrt : [Rk,k ; Ai,k] −→ (Vi,k, Ti,k, Rk,k) = QR([Rk,k ; Ai,k])
_tstrf : [Uk,k ; Ai,k] −→ (Uk,k, Li,k, Pi,k) = LU([Uk,k ; Ai,k])

Both these kernels use an internal blocking; this allows for skipping the zeros below the diagonal of the top tile. The _tstrf does partial pivoting.

_tpmqrt,_ssssm : these two kernels apply the transformations computed by the two corresponding ones above to a couple of square tiles:

_tpmqrt : ([Rk,j ; Ai,j], Vi,k, Ti,k) −→ [Rk,j ; Ai,j] = (I − Vi,k Ti,k Vi,k^T)^T [Rk,j ; Ai,j]
_ssssm  : ([Uk,j ; Ai,j], Li,k, Pi,k) −→ [Uk,j ; Ai,j] = Li,k^-1 Pi,k [Uk,j ; Ai,j]

The main advantage of these tiled factorizations comes from the fact that the two problematic operations in the standard, blocked QR and LU factorizations, i.e., the column norm-2 computation and the pivot search along a column, respectively, are limited to the diagonal tile (for the _geqrt and _getrf kernels) or to a couple of tiles (for the _tpqrt and _tstrf kernels). For this reason, these methods are also referred to as Communication Avoiding [29, 58]. Although this does not have any drawback for the QR factorization, the pivoting technique used in the tiled LU factorization, called pairwise block pivoting, is less effective than the partial pivoting used in the standard blocked LU factorization. This technique can, however, be "stable enough" in many practical cases [J15]. In both methods, thanks to the use of an internal blocking within each kernel, the overall cost of the factorization is only a negligible amount higher than that of the classical methods [J15]. Figure 3.11 shows the DAG for the tiled QR factorization of a dense matrix of size m × n = pb × 4b. For a tile size b, the costs of _geqrt, _gemqrt, _tpqrt and _tpmqrt are of 4, 6, 6 and 12 times b³/3 flops, respectively. The critical path of this DAG and its length can be computed using the method proposed in the work by Bouwmeester et al. [39], which deals with a slightly different method where all the sub-diagonal tiles in a column are triangularized before being annihilated. The critical path for the DAG of Figure 3.11 is illustrated with thick, solid edges; in the general case of a dense matrix of size m × n = pb × qb its length can be computed as

      4            _geqrt    k = 1
    + 6            _gemqrt   k = 1, j = 2
    + 12(p − 2)    _tpmqrt   k = 1, j = 2, i = 2..p − 1
    + 12(q − 1)    _tpmqrt   k = 1..q − 1, i = p, j = k + 1
    + 6(q − 1)     _tpqrt    k = 2..q, i = p
    = O(2p + 3q) = O(2m + 3n)

where we report, on the right, the operation associated with each term. This result shows how, for a square front of size m, tiled algorithms can reduce the lower bound of the execution time to O(m), which brings a considerable benefit not only to the scalability of dense matrix factorizations but also to that of sparse matrix factorizations, as discussed in Section 3.



Figure 3.11: The DAG for the tiled QR factorization of a dense matrix of size m × n = pb × 4b using Algorithm 3.1. The critical path is illustrated with thick, solid edges.

It must still be noted, however, that even for these tiled factorizations, the length of the critical path still grows linearly with the number of rows m, just like the overall cost (i.e., the sequential execution time). This is due to the fact that all the subdiagonal tiles in a panel Ak+1,k...Ap,k are annihilated sequentially, i.e., one after the other; as a result the reduction of a block-column is not parallelizable. The advantage with respect to the block-column methods comes from a better pipelining between successive stages of the factorization: in order to begin reducing the (k+1)-th column, it is not necessary to wait until all of its tiles are up to date with respect to stage k; for example, operation _geqrt(Ak+1,k+1,Vk+1,k+1,Rk+1,k+1,Tk+1,k+1) can be executed as soon as the _tpmqrt(Rk,k+1,Ak+1,k+1,Vk+1,k,Tk+1,k) one is done. Although this better pipelining yields a much higher level of concurrency and a much better scalability for square or moderately overdetermined matrices, scalability can still be poor in the case of strongly overdetermined ones. For this reason the tiled QR factorization was later extended to achieve even better scalability on strongly over-determined matrices [6, 58]. As explained, in the tiled factorization presented above, at step k the subdiagonal tiles Ai,k, i = k + 1, ..., p are annihilated sequentially. This process can be described as a reduction operation based on a flat reduction tree, as shown in Figure 3.12 (left). Different reduction trees can be used to achieve concurrency in the panel reduction. For example, if a binary tree is used, first all the tiles Ai,k, i = k, ..., p are reduced to a triangle by means of the _geqrt kernel, then ⌈log2(p − k + 1)⌉ steps follow where the remaining triangular tiles are treated in couples and, for each couple, one tile is annihilated by means of the other, as shown in Figure 3.12 (middle). This is done using the _tpqrt kernel, which limits the computations to the nonzero coefficients of both triangles. A binary reduction tree clearly delivers much higher concurrency than a flat one on strongly overdetermined matrices, although it must be noted that it leads to a worse pipelining of successive factorization stages [39] and is based on less efficient operations (the triangle-triangle reduction and the associated updates). As a consequence, hybrid trees are often preferred in practice, where the panel tiles are divided into groups of size bh: within each group a flat tree is used and then

the remaining triangular tiles (one per group) are reduced using a binary tree, as shown in Figure 3.12 (right). This approach is implemented in the PLASMA [5] library, where multithreading is achieved using the QUARK [166] runtime engine, which is based on the STF programming model.

Figure 3.12: Possible panel reduction trees for the 2D front factorization. On the left, the case of a flat tree, i.e., with bh = ∞. In the middle, the case of a binary tree, i.e., with bh = 1. On the right, the case of a hybrid tree with bh = 2.
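To make the index arithmetic of the hybrid tree of Figure 3.12 concrete, the following standalone C sketch (an illustration, not code from qr_mumps) prints which tile annihilates which within one panel: first the flat reductions inside each subdomain of size bh, then the binary reductions between the subdomain leaders.

#include <stdio.h>

/* Print the reduction schedule of panel k of a tiled matrix with p
   tile-rows, using a hybrid flat/binary tree with subdomains of size bh
   (tile indices are 1-based, as in Figure 3.12). */
void print_reduction_tree(int p, int k, int bh)
{
    /* flat reductions inside each subdomain: the leader (first tile of
       the subdomain) annihilates the following ones, one after the other */
    for (int i = k; i <= p; i += bh)
        for (int l = i + 1; l <= i + bh - 1 && l <= p; l++)
            printf("flat  : tile (%d,%d) annihilates tile (%d,%d)\n", i, k, l, k);

    /* binary reductions between the subdomain leaders; the pairing
       distance doubles at every step */
    for (int s = bh; s <= p - k + 1; s *= 2)
        for (int i = k; i <= p - s; i += 2 * s)
            printf("binary: tile (%d,%d) annihilates tile (%d,%d)\n", i, k, i + s, k);
}

int main(void)
{
    /* example: 8 tile-rows, first panel, subdomains of 2 tiles */
    print_reduction_tree(8, 1, 2);
    return 0;
}

With p = 8, k = 1 and bh = 2, the flat stage produces the pairs (1,2), (3,4), (5,6), (7,8) and the binary stage the pairs (1,3), (5,7) and finally (1,5), i.e., seven annihilations for eight tiles, as expected.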

The pipelining between successive stages of the tiled QR and LU factorizations is rather difficult to achieve and take advantage of in practice with a static execution approach. In order to make the best use of the available concurrency and maximize the occupancy of computing resources, the dynamic and asynchronous execution model of task-based runtime systems is better suited. Figure 3.13 shows the implementation of the tiled QR factorization of an m × n = pb × qb dense matrix using the STF programming model; this method uses hybrid panel reduction trees with subdomains of size bh. Figure 3.14 plots the performance achieved by the block-column and tiled parallel QR factorization methods using 24 threads on the sirocco system (see Appendix A.2). Each point of the curves corresponds to the best performance achieved for a matrix size using different combinations of the b, ib and bh parameters. For the 1D method we tested b={32, 64, 128, 256} and ib= min(b, {32, 64, 128, 256}); for the 2D method we tested b={128, 256, 512}, ib={32, 64} (higher values lead to an excessive flop overhead) and bh= min(m/b, {1, 2, 4, 8, 16, 20}). These results demonstrate the superior performance and scalability of the tiled methods over the block-column ones. This is due to the higher amount of concurrency that tiled methods can deliver. In the case of square matrices all the best results were achieved with flat panel reduction trees (i.e., bh=m/b) because this variant produces enough parallelism, achieves a good pipelining of successive panels and relies on relatively efficient kernels. In the case of overdetermined matrices, a hybrid panel reduction tree always leads to better performance; nonetheless, absolute performance is not on par with the case of square matrices because, when hybrid trees are used, operations are less efficient.
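To relate the 1D and 2D bounds derived above, the following sketch evaluates, in units of b³/3 flops, the critical path of the block-column factorization and of the tiled factorization with a flat reduction tree, and the resulting upper bounds on concurrency (total cost divided by critical path length) for a single front. The front dimensions mimic the large front of matrix #20 mentioned in Section 3.3, and the tile size b = 128 is an arbitrary, illustrative choice; this is an estimate, not a measurement.

#include <stdio.h>

/* Critical path of the 1D block-column factorization, in units of
   b^3/3 flops: sum of the geqrt + gemqrt costs along the path derived
   at the beginning of Section 3.4. */
static double cp_1d(int p, int q)
{
    double len = 0.0;
    for (int k = 0; k < q; k++)
        len += 2.0 * (3.0 * (p - k) - 1.0) + 3.0 * (4.0 * (p - k) - 1.0);
    return len;
}

/* Critical path of the 2D tiled factorization with a flat reduction
   tree: 4 + 6 + 12(p-2) + 12(q-1) + 6(q-1), same units. */
static double cp_2d(int p, int q)
{
    return 4.0 + 6.0 + 12.0 * (p - 2) + 12.0 * (q - 1) + 6.0 * (q - 1);
}

int main(void)
{
    double m = 1.3e6, n = 7.0e3, b = 128.0;  /* a strongly overdetermined front */
    int p = (int)(m / b), q = (int)(n / b);
    /* total factorization cost 2n^2(m - n/3), expressed in units of b^3/3 */
    double total = 2.0 * n * n * (m - n / 3.0) / (b * b * b / 3.0);
    printf("max concurrency 1D: %.1f\n", total / cp_1d(p, q));
    printf("max concurrency 2D: %.1f\n", total / cp_2d(p, q));
    return 0;
}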

3.4.2 Using Communication Avoiding factorizations within the multifrontal method

The tiled QR factorization method presented in the previous section, namely the flat/binary hybrid approach, was integrated into our multifrontal STF parallel code described in Section 3.3. Lines 17-24 in Figure 3.8 were replaced by the pseudocode in Figure 3.13, which implements the 2D factorization algorithm with a hybrid panel reduction tree; this code had to be adapted in order to ignore the tiles that lie entirely below the staircase structure of the front.


do k=1, q  ! for all the block-columns in the front
  do i = k, p, bh
    call submit(_geqrt, f(i,k):RW)
    do j=k+1, q
      call submit(_gemqrt, f(i,k):R, f(i,j):RW)
    end do
    ! intra-subdomain flat-tree reduction
    do l=i+1, min(i+bh-1,p)
      call submit(_tpqrt, f(i,k):RW, f(l,k):RW)
      do j=k+1, q
        call submit(_tpmqrt, f(l,k):R, f(i,j):RW, f(l,j):RW)
      end do
    end do
  end do
  ! inter-subdomain binary-tree reduction; s is the distance
  ! between the subdomain leaders to be paired
  s = bh
  do while (s.le.p-k+1)
    do i = k, p-s, 2*s
      l = i+s
      if (l.le.p) then
        call submit(_tpqrt, f(i,k):RW, f(l,k):RW)
        do j=k+1, q
          call submit(_tpmqrt, f(l,k):R, f(i,j):RW, f(l,j):RW)
        end do
      end if
    end do
    s = s*2
  end do
end do

Figure 3.13: Pseudo-code showing the STF implementation of the tiled QR factorization with a hybrid panel reduction tree.

[Plots of Gflop/s versus matrix size: square matrices (m = n, left panel) and overdetermined matrices (fixed m, increasing n, right panel); curves: 1D block-column, 2D tiled flat, 2D tiled hybrid.]

Figure 3.14: Performance of the block-column and tiled parallel QR factorizations on square (left) and overdetermined (right) matrices using 24 threads on the sirocco system.

The assembly operations have also been parallelized according to the 2D frontal matrix blocking: lines 9-11 in Figure 3.8 were replaced with a double, nested loop to span all the tiles lying in the contribution block: each assembly operation reads one tile of a node c and assembles its coefficients into a subset of the tiles of its parent f. As a consequence, some tiles of a front can be fully assembled and ready to be processed before others and before the child nodes are completely factorized. This finer granularity (with respect to the 1D approach presented in Section 3.1) leads to more concurrency since a better pipelining between a front and its children is now enabled. The benefit brought by the use of tiled front factorizations to the multifrontal QR method is depicted in Figure 3.15. This figure shows the maximum achievable speedup, measured as the ratio between the overall workload cost and the cost of the critical path, on our experimental test set for four different methods. The first relies on a 1D partitioning of frontal matrices with no inter-level parallelism; this means that a node is processed only when all of its children are finished and is equivalent to the approach used in spqr [55] or MA49 [15]. The second uses a 1D partitioning with inter-level parallelism and corresponds to the approach used in Sections 3.1 and 3.3. The third and fourth, instead, use the 2D partitioning with Communication Avoiding front factorizations described in this section with and without inter-level parallelism, respectively. Clearly, the concurrency provided by the 1D methods is unsatisfactory, especially on small matrices or problems which include strongly over-determined fronts; this explains the poor performance reported, for some problems, in Figure 3.9. The 2D methods, instead, provide much more parallelism and lead to good performance and scalability as shown below.
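The 2D assembly submission described above can be sketched as follows. This is only an illustration: the front_t type, the child_cb_tile and parent_tile helpers and the assemble_cl codelet are hypothetical, and the sketch assumes, for brevity, that each child tile is assembled into a single parent tile (in reality a child tile may scatter into several tiles of the parent).

#include <starpu.h>

typedef struct front front_t;
extern struct starpu_codelet assemble_cl;  /* assembly kernel (defined elsewhere) */

/* Hypothetical helpers returning the StarPU handles of, respectively, a tile
   of the child's contribution block and the parent tile it is assembled into. */
starpu_data_handle_t child_cb_tile(front_t *c, int i, int j);
starpu_data_handle_t parent_tile(front_t *f, front_t *c, int i, int j);

/* 2D assembly of child c into parent f: one task per contribution-block
   tile. The commutative RW mode lets independent assemblies into the same
   parent tile proceed in any order, and a parent tile becomes ready as soon
   as all the child tiles contributing to it have been assembled. */
void submit_assembly(front_t *f, front_t *c, int cb_mt, int cb_nt)
{
    for (int i = 0; i < cb_mt; i++)
        for (int j = 0; j < cb_nt; j++)
            starpu_task_insert(&assemble_cl,
                               STARPU_R, child_cb_tile(c, i, j),
                               STARPU_RW | STARPU_COMMUTE, parent_tile(f, c, i, j),
                               0);
}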


Figure 3.15: Maximum degree of concurrency on ada (32 cores) coming from the DAG with and without inter-level parallelism for both 1D and 2D partitioning.

The development of this version also included a number of other, minor improvements:

• In our implementation tiles do not have to be square but can be rectangular with more rows than columns. This is only a minor detail from an algorithmic point of view but, as far as we know, it has never been discussed in the literature and, as shown by the experimental results below, provides considerable performance benefits for our case;


Figure 3.16: 1D partitioning of a frontal matrix into block-columns (left), 2D partitioning into tiles (right), with blocked storage.

• As in the 1D case presented in the previous section, block storage is also used in this case, as shown in Figure 3.16 (right). In addition to the memory savings, which are the same as in the 1D case, here the block storage also benefits the performance because of the lower leading dimension of the blocks [J15];

• As for _geqrt and _gemqrt, the _tpqrt and _tpmqrt LAPACK routines were modified in order to cope as efficiently as possible with the fronts' staircase structure by means of an internal blocking ib.

This parallelization leads to very large DAGs with tasks that are very heterogeneous, both in nature and granularity; moreover, not only are intra-front task dependencies more complex because of the 2D front factorization, but so are inter-front task dependencies, due to the parallelization of the assembly operations. The use of an STF-based runtime system relieves the developer from the burden of explicitly representing the DAG and achieving the execution of the included tasks on a parallel machine. The resulting implementation was evaluated and compared to the 1D approach; all the experiments were run in the same environment and setting as in the previous section. Table 3.6 shows the parameter values delivering the shortest execution time along with the corresponding factorization time and Gflop/s rate. As explained in the previous section for the 1D case, the performance of the factorization depends on a combination of several parameters. For the parallel 2D STF case these parameters are the size of the tiles (mb,nb), the type of panel reduction algorithm set by the bh parameter described in Section 3.4.1, and the internal block size ib. Since the optimum values for these parameters depend on a large number of factors, it is extremely difficult to choose automatically the best values for every given problem. Moreover, this choice is even harder than in the 1D case because the number of parameters is greater for 2D algorithms. Therefore, using the same experimental protocol as for the previous strategy, we performed a large number of runs for each problem with varying values for all the input parameters, using the same values for all the fronts in the elimination tree, and selected the best results (shortest running time) among those. We tested the following values: (nb,ib)={(160,32), (160,40), (192,32), (192,64)}, mb={nb, nb*2, nb*3, nb*4} and bh={4, 8, 12, 16, 20, 24, ∞} for a total of 112 runs per matrix (bh = ∞ means that a flat reduction tree was used).


                  Parallel 2D STF
Mat.    mb    nb   ib   bh    Time (s.)   Gflop/s
12     576   192   32    4    4.303E+00     321.6
13     480   160   40    8    7.217E+00     397.5
14     480   160   40   12    1.426E+01     399.6
15     480   160   40   20    2.427E+01     442.1
16     480   160   32   16    3.781E+01     435.4
17     640   160   40    4    4.784E+01     424.4
18     480   160   40   24    6.922E+01     439.0
19     480   160   32    ∞    1.408E+02     441.9
20     576   192   32   24    1.728E+02     379.5
21     576   192   32    ∞    4.286E+02     453.7
22     576   192   64    ∞    4.807E+02     462.8
23     576   192   64   20    5.642E+02     462.6

Table 3.6: Optimum performance for the STF 2D factorization on ada (32 cores).

Note that, compared to the 1D case, concurrency is abundant when using the 2D partitioning. For this reason we chose bigger values for nb: this increases the granularity of tasks, which improves the efficiency of the BLAS operations, and it limits the number of tasks in the DAG, which contributes to keeping the runtime overhead small. The internal block size ib, however, has to be relatively small to keep the flop overhead (see Section 3.4.1) under control.


Figure 3.17: Speedup of the 1D and 2D algorithms with respect to the sequential case on ada (32 cores).

The speedup achieved by the STF 2D implementation is shown in Figure 3.17 along with the speedup obtained with the STF 1D implementation previously presented in Section 3.3. Results show that the 2D algorithm provides better efficiency on all the tested matrices especially for the smaller ones and those where frontal matrices are extremely

overdetermined, such as matrix #20, where the 1D method does not provide enough concurrency (as explained in Section 3.3, in this problem most of the operations are done on a frontal matrix which has over a million rows and only a few thousand columns). The average speedup achieved by the 2D code is 23.61 with a standard deviation of 0.53, reaching a maximum of 24.71 for matrix #17. For the 1D case, instead, the average is 19.04 with a standard deviation of 4.55. In conclusion, the 2D code achieves better and more consistent scalability over our set of matrices.

3.5 Multifrontal QR for heterogeneous systems

From a technical point of view, porting the StarPU-based method described in Sections 3.3 and 3.4 to systems equipped with GPU boards is as simple as providing the StarPU runtime with a GPU-executable version of the task kernels. For example, assuming the 1D approach described in Section 3.3, it would suffice to provide StarPU with a GPU implementation of the _gemqrt routine, which is where the large majority of the flops is done, to use the accelerator; such a kernel, for instance, can be found in the MAGMA [5] library. The runtime will take care of offloading some of these tasks on the GPU board and moving there all the data that are needed for their execution. This, however, does not necessarily mean that the GPU will be used to its full potential. For example, because of the way they are created and submitted, tasks may be unsuited for the GPU; this may not only result in a poor acceleration of the tasks, but may actually reduce the overall performance because of the overhead imposed by data transfers between CPU and GPU memories. In essence, the runtime provides a means of using the accelerator but we still have to make sure that our algorithm complies with its features and capabilities. In addition, we are not only interested in efficiently using the GPU but we want to take advantage of all the computing resources of the underlying machine, that is, both the CPUs and the GPUs, whose performance and capabilities are extremely different. We are thus faced with a problem of heterogeneity. We have formulated this heterogeneity problem along three main issues:

Granularity : it is well known that GPUs achieve their full speed when the size of computations is relatively large; we would thus be tempted to choose a large-grain partitioning for our data but this could severely reduce concurrency and lead the CPUs (which may be numerous) to starvation.

Scheduling : at any time during the DAG traversal, multiple tasks may be ready for execution and multiple processing units may be idle and ready to execute one of them. Because the tasks may have radically different characteristics and the processing units different speeds and capabilities, it is not easy to choose which unit executes which task.

Communications : data has to be moved back and forth between the main memory and the GPU memory, which implies an overhead that can heavily harm performance. This problem can be overcome either by reducing the number and volume of communications, which can be formulated as a scheduling problem, or by making sure that these communications are not (or less) harmful, for example by hiding them behind useful computations.

In the remainder of this section we will discuss methods to address these issues.
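The mechanism mentioned at the beginning of this section can be sketched as follows (an illustration, not the actual qr_mumps code): the update codelet is given both a CPU and a CUDA implementation, the latter typically wrapping a MAGMA kernel, and the task submission code is left unchanged; the names and the (omitted) kernel bodies are illustrative.

#include <starpu.h>

/* CPU implementation of the update kernel (body omitted). */
static void gemqrt_cpu(void *buffers[], void *cl_arg)
{ (void)buffers; (void)cl_arg; /* ... */ }

/* CUDA implementation, e.g. a wrapper around a MAGMA routine launched
   on the stream returned by starpu_cuda_get_local_stream(). */
static void gemqrt_cuda(void *buffers[], void *cl_arg)
{ (void)buffers; (void)cl_arg; /* ... */ }

static struct starpu_codelet gemqrt_cl = {
    .cpu_funcs  = { gemqrt_cpu },
    .cuda_funcs = { gemqrt_cuda },
    .cuda_flags = { STARPU_CUDA_ASYNC },  /* CUDA kernel submitted asynchronously */
    .nbuffers   = 2,
    .modes      = { STARPU_R, STARPU_RW },
};

/* The submission code is unchanged: the runtime picks a CPU core or a
   GPU for each task and moves the data accordingly. */
void submit_update(starpu_data_handle_t panel, starpu_data_handle_t blockcol)
{
    starpu_task_insert(&gemqrt_cl, STARPU_R, panel, STARPU_RW, blockcol, 0);
}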


3.5.1 Frontal matrices partitioning schemes

Finding frontal matrix partitioning strategies that allow for an efficient use of both CPU and GPU resources represents one of the main challenges in implementing a multifrontal method for GPU-accelerated multicore systems. This results from the fact that GPUs, which are potentially able to deliver much higher performance than CPUs, require coarse-granularity operations to achieve high performance, while a CPU core reaches its peak with relatively small-granularity tasks. The approach presented in Section 3.3, where frontal matrices are uniformly partitioned into small block-columns, could be readily ported to GPU-accelerated platforms by simply providing GPU implementations for the various tasks to the StarPU runtime system. This, however, would result in an unsatisfactory performance because, due to the fine granularity of tasks, only a small fraction of the GPU performance could be used. This front partitioning strategy, which we refer to as fine-grain partitioning, shown in Figure 3.18(a), is not suited to heterogeneous architectures despite being able to deliver sufficient concurrency to feed both the CPU cores and the GPU and reduce idle times on all the resources. A radically different approach is what we refer to as coarse-grain partitioning (Figure 3.18(b)), where fine-grained panel tasks are executed on the CPU and large-grain (as large as possible) update tasks are performed on the GPU. This corresponds to the method used in the MAGMA package [5] and aims at obtaining the best acceleration of computationally intensive tasks on the GPU. In order to keep the GPU constantly busy, a static scheduling is used that allows for overlapping GPU and CPU computations thanks to a depth-1 lookahead technique; this is achieved by splitting the trailing submatrix update into two separate tasks of, respectively, fine and coarse granularity. This second approach clearly incurs the opposite problem to the one we face with the fine-grain partitioning: despite being able to maximize the efficiency of GPU operations with respect to the problem size, it severely limits the amount of node parallelism as well as of inter-level parallelism and therefore leads to resource (especially CPU) starvation.

(a) Fine-grain partitioning. (b) Coarse-grain partitioning. (c) Hierarchical-grain partitioning.

Figure 3.18: Partitioning schemes.


forall outer panels o_p=1...o_n in f
  ! partition (outer) block-column f(o_p) into
  ! i_n inner block-columns f(o_p,1) .. f(o_p,i_n)
  call submit(partition, f(o_p):R, f(o_p,1):W ... f(o_p,i_n):W)

  forall inner panels i_p=1..i_n
    ! panel reduction of inner block-column i_p
    call submit(_geqrt, f(o_p,i_p):RW)
    forall inner blockcolumns i_u=i_p+1..i_n in f(o_p)
      ! update (inner) block-column i_u with panel i_p
      call submit(_gemqrt, f(o_p,i_p):R, f(o_p,i_u):RW)
    end do
    forall outer blockcolumns o_u=o_p+1..o_n
      ! update outer block-column o_u with panel i_p
      call submit(_gemqrt, f(o_p,i_p):R, f(o_u):RW)
    end do
  end do

  ! unpartition (outer) block-column
  call submit(unpartition, f(o_p,1):R ... f(o_p,i_n):R, f(o_p):W)
end do

Figure 3.19: STF code for the hierarchical QR factorization of frontal matrices.

In order to take advantage of the fine and coarse-grain approaches and to overcome the limitations of both, we developed a hierarchical and dynamic partitioning of fronts (Figure 3.18(c)) which is similar to the approach proposed by Wu et al. [162] and corresponds to a trade-off between parallelism and GPU kernel efficiency, with task granularity suited for both types of resources. Frontal matrices are first partitioned into coarse-grain block-columns, referred to as outer block-columns, of width nbgpu suitable for GPU computation (this happens at the moment when the front is activated); then each outer block-column is dynamically re-partitioned into inner block-columns of width nbcpu, appropriate for the CPUs, immediately before being factorized. This is achieved through dedicated partitioning tasks which are subject to dependencies with respect to the other, previously submitted, tasks that operate on the same data. When these dependencies are satisfied, StarPU ensures that the block being re-partitioned is in a consistent state, in case there are multiple copies of it. Furthermore, StarPU ensures that the partitioning is done in a logical fashion: no actual copy is performed and there is no extra data allocated. The partitioning is done using two tasks: partition and unpartition. In order to partition a piece of data i represented by the handle f(i) into n pieces, it is necessary to declare the handles associated with the sub-data f(i,1)..f(i,n). The partition task takes as input the data to be partitioned with a Read access mode and the resulting sub-data with a Write access mode. The unpartition task takes as input the sub-data with a Read access mode and the original data with a Write access mode. In the STF code, as long as all tasks working on the sub-data are submitted between the partition and unpartition tasks and no task working on the partitioned data is submitted in between, the consistency between the data and the sub-data is ensured. It should be noted that, in order to avoid memory copies, both partition and unpartition tasks should be executed on the node where the data is allocated; in this case these tasks are associated with an empty function. In order to use the hierarchical-grain partitioning in the multifrontal factorization, the initial STF code corresponding to the QR factorization of a front using a fine-grain

77 3. Parallelism and performance scalability partitioning (lines 14-21 in Figure 3.8) is turned into the one proposed in Figure 3.19 for hierarchically partitioned fronts. We define inner and outer tasks depending on whether these tasks are executed on inner or outer block-columns. In order to ease the under- standing we use different names for inner and outer updates although both types oftasks perform exactly the same operation and thus employ the same code. The work described above has prompted the StarPU developers to implement the dy- namic partition and unpartition capability in the runtime system. As a result, the StarPU API now includes the starpu_data_partition_plan, starpu_data_partition_submit and starpu_data_unpartition_submit which allow for, associating a partitioning scheme with some data handle and submitting a partition and an unpartition task respectively.

3.5.2 Scheduling strategies

Along with the front partitioning strategies discussed in the previous section, task scheduling is a key factor for achieving reasonable performance on heterogeneous systems. One strategy to schedule the tasks resulting from the partitioning is to statically assign the coarse-granularity tasks to GPUs and the fine-granularity tasks to CPU cores. This is the strategy adopted in the work by Lacoste [111], where the GPU kernels are statically mapped onto the devices. However, in our problem, the variety of front shapes and staircase structures combined with this hierarchical partitioning induces an important workload heterogeneity, making load balancing extremely hard to anticipate. For this reason, we chose to rely on a dynamic scheduling strategy. In the context of a heterogeneous architecture, the scheduler should be able to handle the workload heterogeneity and distribute the tasks taking into account a number of factors, including resource capabilities or memory transfers, while ensuring a good load balance between the workers. Dynamic scheduling allows for dealing with the complexity of the workload and limits load imbalance between resources. Algorithms based on the Heterogeneous Earliest Finish Time (HEFT) scheduling strategy by Topcuoglu et al. [153] represent a commonly used and well known solution to scheduling task graphs on heterogeneous systems. These methods consist in first ranking tasks (typically according to their position with respect to the critical path) and then assigning them to resources using a minimum completion time criterion. Despite the fact that GPUs can accelerate the execution of most (if not all) tasks, not all tasks are accelerated by the same amount, depending on their type (e.g., compute or memory bound, regular or irregular memory access pattern) or granularity: therefore we say that some tasks have a better acceleration factor on the GPU than others. The main drawback of HEFT-like methods lies in the fact that the acceleration factor of tasks is ignored during the worker selection phase, i.e., these methods do not attempt to schedule a task on the unit which is best suited for its execution. In addition, the centralized decision during the worker selection potentially imposes a significant runtime overhead during the execution. A performance analysis conducted with the so-called dmdas StarPU built-in implementation of HEFT showed that these drawbacks are too severe for designing a high-performance multifrontal method. Instead, we implemented and extended a scheduling technique known as HeteroPrio, first introduced by Agullo et al. [4] in the context of Fast Multipole Methods (FMM). This technique is inspired by the observation that a DAG of tasks may be extremely irregular, with some parts where concurrency is abundant and others where it is scarce, as shown in Figure 3.20. In the first case we can perform tasks on the units where they are executed the most effectively without any risk of incurring resource starvation because parallelism is plentiful. In the second case, however, what counts most is to prioritize tasks which

lie along the critical path because delaying their execution would result in penalizing stalls in the execution pipeline. As a result, in the HeteroPrio scheduler the execution is characterized by two states: a steady-state when the number of tasks is large compared to the number of resources and a critical-state in the opposite case. The scheduler can automatically switch from one state to another depending on a configurable criterion which mostly depends on the amount of ready tasks and of computational resources.


Figure 3.20: Example of a DAG with a succession of steady and critical states, corresponding to rich and poor concurrency regions respectively, as defined in the HeteroPrio scheduler.

A complex, irregular workload, such as a sparse factorization, is typically a succession of steady and critical state phases, where the steady-state corresponds to rich concurrency regions in the DAG whereas the critical-state corresponds to scarce concurrency regions. During a steady-state phase, tasks are pushed to different scheduling queues depending on their expected acceleration factor (see Figure 3.21). In our current implementation, we have defined one scheduling queue per type of task (eight in total, as listed in column 1 of Table 3.7). When they pop tasks, CPU and GPU workers poll the scheduling queues in different orders. The GPU worker first polls scheduling queues corresponding to coarse-grain tasks such as outer updates (priority 0 on GPU in Table 3.7) because their acceleration factor is higher. On the contrary, CPU workers first poll scheduling queues corresponding to small-granularity tasks such as subtree factorizations or inner panels (as well as tasks performing symbolic work such as activation that are critical to ensure progress). Consequently, during a steady-state, workers process tasks that are best suited for their capabilities. The detailed polling orders are provided in Table 3.7. Furthermore, to ensure fairness in the progress of the different paths of the elimination tree, tasks within each scheduling queue are sorted according to the distance (in terms of flops) between the corresponding front and the root node of the elimination tree. In the original HeteroPrio scheduler [4], the worker selection is performed right before popping the task from a scheduling queue following the previously presented rules. If the data associated with the task are not present on the memory node corresponding to the selected worker, then the task completion time is increased by the memory transfers. While the associated penalty is usually limited in the FMM case [4], preliminary experiments (not reported here for the sake of conciseness) showed that it may be a severe drawback for the multifrontal method. For the purpose of the present study, we have therefore extended the original scheduler by adding worker queues (one queue per worker) along with the scheduling queues, as shown in Figure 3.21. When it becomes idle, a worker pops a task from its worker queue and then fills it up again by picking a new task from the scheduling queues through the polling procedure described above.


[Diagram: in steady state the runtime core pushes ready tasks into per-type scheduling queues with distinct CPU and GPU priorities; each worker picks tasks from these scheduling queues into its own worker queue.]

Figure 3.21: HeteroPrio steady-state policy.

The data associated with the tasks in a worker queue can be automatically prefetched to the corresponding memory node while the worker is executing other tasks. If the size of the worker queues is too large, a task may be assigned to a worker much earlier than its actual execution, which may result in a sub-optimal choice. We therefore set this size to two in our experiments, as no additional benefit was observed beyond this value.

                      Steady-state       Critical-state
Scheduling queue      CPU     GPU        CPU     GPU
activate               0       -          0       -
assemble               7       -          5       -
deactivate             1       -          1       -
do_subtree             2       2          2       0
part./unpart.          3       -          3       -
inner_panel            4       -          4       -
inner_update           5       1          6       1
outer_update           6       0          7       2

Table 3.7: Scheduling queues and polling orders in HeteroPrio.

When the number of tasks becomes low (with respect to a fixed threshold which is set depending on the amount of computational power of the platform), the scheduling algorithm switches to critical-state. CPU and GPU workers cooperate to process critical tasks as early as possible in order to produce new ready tasks quickly. For instance, because outer updates are less likely to be on the critical path, the GPU worker will select them last in spite of their high acceleration factor. The last two columns of Table 3.7 provide the corresponding polling order. Additionally, in this state, CPU workers are allowed to select a task only if its expected completion time does not exceed the total completion time of the tasks remaining in the GPU worker queue. This extra rule prevents CPU workers from selecting all the few available tasks and leaving the GPU idle whereas it could have finished processing them all quickly.
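The steady-state polling described above can be illustrated with the following schematic, runtime-agnostic sketch (this is not the actual StarPU scheduler code); the two polling orders follow the steady-state columns of Table 3.7, and pop_from_queue is a hypothetical helper.

#include <stddef.h>

/* Task-type queues, in the order they appear in Table 3.7. */
enum queue_id { ACTIVATE, ASSEMBLE, DEACTIVATE, DO_SUBTREE,
                PART_UNPART, INNER_PANEL, INNER_UPDATE, OUTER_UPDATE };

/* Steady-state polling orders (first entry = highest priority);
   -1 marks the end of the list. GPU workers only poll the three
   queues whose tasks have a GPU implementation. */
static const int cpu_order[] = { ACTIVATE, DEACTIVATE, DO_SUBTREE, PART_UNPART,
                                 INNER_PANEL, INNER_UPDATE, OUTER_UPDATE,
                                 ASSEMBLE, -1 };
static const int gpu_order[] = { OUTER_UPDATE, INNER_UPDATE, DO_SUBTREE, -1 };

/* Hypothetical queue pop; returns NULL when the queue is empty. */
void *pop_from_queue(int queue);

/* A worker refills its private queue by scanning the scheduling queues
   in its own priority order and taking the first available task. */
void *pick_task(int is_gpu)
{
    const int *order = is_gpu ? gpu_order : cpu_order;
    for (int i = 0; order[i] >= 0; i++) {
        void *task = pop_from_queue(order[i]);
        if (task) return task;
    }
    return NULL;  /* no ready task */
}

In the critical state, only the polling orders change (last two columns of Table 3.7), plus the extra rule preventing CPU workers from starving the GPU.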


3.5.3 Implementation and experimental results

We implemented the partitioning and scheduling methods described in the previous two sections within the qr_mumps solver. As explained above, only the _gemqrt (i.e., update) and do_subtree tasks can be executed on the GPU. For the first, we used the GPU implementation of the LAPACK _gemqrt routine provided by the MAGMA library; note that this version does not take into account the staircase structure and, thus, performs more flops than necessary. For the second, we wrote our own GPU-enabled implementation. This performs a sequential traversal of the associated subtree: frontal matrices are assembled on the CPU and then factorized with a dense QR factorization on the GPU using the same method as in the MAGMA _geqrt routine, which we modified to take advantage of the staircase structure. Note that this hybrid kernel is possible because in StarPU a GPU worker is associated with a GPU and a CPU core which is meant to drive it, and thus both units can be used by a single task. At first, we only implemented the 1D partitioning scheme discussed in Section 3.3 because no _tpmqrt GPU routine is available in MAGMA; results obtained with the use of 2D communication-avoiding front factorization methods are presented in Section 3.5.4. Tests were executed for a subset of the matrices of Table A.1 on the sirocco system, which is equipped with two twelve-core CPUs and four Nvidia K40 boards. For the CPU-only experiments, we used a fine-grained decomposition and the parameter values used for the experiments were (nb,ib)={(128,64), (128,128), (192,64), (192,192), (256,64), (256,128), (256,256)}. For the heterogeneous experiments, we use the hierarchical-grain partitioning presented in Section 3.5.1. The parameters defining a hierarchical block-column partitioning are the size of the outer block-columns nbgpu and the size of the inner block-columns nbcpu. The value ib is fixed such that ib = nbcpu because, as explained above, the _gemqrt operation on the GPU cannot take advantage of the staircase structure and because, as seen in Section 3.3, this is commonly the choice which yields the best performance. As for the multicore case, we performed a large set of tests with several combinations for nbcpu and nbgpu and selected the best results in terms of factorization time. The values used for the experiments were (nbgpu,nbcpu)={(256,128), (256,256), (384,128), (384,384), (512,128), (512,256), (512,512), (768,128), (768,256), (768,384), (896,128), (1024,128), (1024,256), (1024,512)}. The scheduler used for the experiments in a heterogeneous context is HeteroPrio, presented in Section 3.5.2. The performance of the code in the multicore case using a fine-grain partitioning is reported in Table 3.8 with two configurations: the first using the twelve cores of an E5-2680 processor and the second using the twenty-four cores of two E5-2680 processors available on the machine. The table shows the shortest execution time along with the corresponding Gflop/s rates obtained for the factorization of the tested matrices and the optimal parameters ib and nb for which this performance was attained. The performance of the code in the heterogeneous case using a hierarchical partitioning is reported in Table 3.9 with two configurations: the first using the twelve cores of an E5-2680 processor plus one K40M GPU and the second using the twenty-four cores of two E5-2680 processors available on the machine plus one K40M GPU.
The table shows the shortest execution time along with the corresponding Gflop/s rates obtained for the factorization of the tested matrices and the optimal parameters nbgpu and nbcpu for which this performance was attained. Note that in the case where nbgpu = nbcpu, the hierarchical-grain partitioning is equivalent to the fine-grain partitioning. Our implementation can also take advantage of multiple GPU streams. On a GPU, a stream is defined as a sequence of commands that execute in order; streams are a feature available on relatively recent GPUs5.

81 3. Parallelism and performance scalability

                12 CPUs (1×E5-2680)              24 CPUs (2×E5-2680)
 Mat.   nb   ib   Time (s.)   Gflop/s     nb   ib   Time (s.)   Gflop/s
  12   192   64   8.892E+00    138.6     128  128   4.922E+00    275.3
  13   128  128   1.188E+01    226.7     128  128   8.525E+00    316.0
  14   192  192   2.364E+01    221.0     128  128   1.444E+01    351.8
  15   128  128   4.151E+01    252.1     128  128   2.468E+01    424.0
  16   192  192   5.849E+01    272.3     128  128   3.734E+01    421.0
  17   128   64   7.181E+01    271.7     128   64   4.270E+01    457.0
  18   128  128   1.104E+02    248.8     128  128   6.209E+01    442.5
  19   192  192   2.212E+02    284.0     128  128   1.269E+02    489.3
  21   192  192   6.318E+02    294.6     192  192   3.352E+02    555.2

Table 3.8: Optimum performance for the STF fine-grain 1D factorization on sirocco in the homogeneous case with both configurations, 12 CPUs (1×E5-2680) and 24 CPUs (2×E5-2680).

The use of multiple streams allows for the concurrent execution of kernels on the device, thus increasing the occupancy of the GPU when executing small-grain tasks that are unable to feed all the available resources. This allows us to use smaller values for the parameters nbcpu and nbgpu, leading to a greater concurrency in the DAG and thus potentially better performance. Table 3.9 reports the best execution time when using either one, two or four streams (the corresponding number of streams is reported in column “s”). The results in Tables 3.8 and 3.9 show that the proposed approach can make good use of all the available computing resources because a significant speedup is achieved when more computational resources are available, regardless of their type. This leads to a very good overall performance which reaches almost 800 Gflop/s.

         12 CPUs (1×E5-2680) + 1 GPU (1×K40M)        24 CPUs (2×E5-2680) + 1 GPU (1×K40M)
 Mat.   nbgpu  nbcpu  s   Time (s.)   Gflop/s        nbgpu  nbcpu  s   Time (s.)   Gflop/s
  12     384    384   1   8.302E+00    214.1          384    384   1   6.960E+00    255.4
  13     512    256   4   7.935E+00    368.3          256    256   4   8.117E+00    360.0
  14     384    384   4   1.539E+01    365.3          256    256   1   1.404E+01    382.0
  15     768    256   4   1.748E+01    616.1          768    128   2   1.839E+01    569.0
  16     896    128   2   2.470E+01    636.4          896    128   4   2.558E+01    614.5
  17     192    192   1   8.700E+01    246.7          256    256   1   6.797E+01    329.9
  18     512    256   4   3.882E+01    710.2          512    256   4   3.838E+01    718.3
  19     768    256   4   9.610E+01    661.3          768    384   2   8.142E+01    797.9
  21     768    256   2   2.792E+02    671.4         1024    256   1   2.403E+02    780.1

Table 3.9: Optimum performance for the STF hierarchical-grain 1D factorization on sirocco in the heterogeneous case with both configurations, 12 CPUs (1×E5-2680) + 1 GPU (1×K40M) and 24 CPUs (2×E5-2680) + 1 GPU (1×K40M).

5Available on devices of compute capability 2.x and higher, but exploitable since compute capability 3.5 with the introduction of the Hyper-Q technology.


3.5.4 Combining communication-avoiding methods and GPUs

The approach described above, which we evaluated on GPU-equipped machines, relies on a 1D partitioning of frontal matrices into block-columns. In more recent evolutions of the qr_mumps code we have combined the use of a 2D decomposition with communication-avoiding front factorizations (as described in Section 3.4 for CPU-only machines) with the dynamic, hierarchical partitioning and the HeteroPrio scheduling (described in Sections 3.5.1 and 3.5.2, respectively) in order to achieve higher performance on systems equipped with multiple GPUs.

[Figure 3.22 is a bar chart (y-axis: Gflop/s, from 400 to 3200) for matrix #21 (TF18) comparing CPU-only systems — SB (4×8), HSW (2×12), BDW (2×18), Power8 (2×10), KNL (1×64), KNL (1×68) — with an HSW (2×12) system equipped with 1, 2, 3 or 4 K40 GPUs.]

Figure 3.22: Performance achieved by the factorization of matrix #21 on a range of different machines, both CPU-only or CPU+GPU.

The results obtained with this new code are reported in Figure 3.22 for a single matrix (matrix #21) on both CPU-only and CPU+GPU systems. These results show that the proposed approach is able to achieve good absolute performance and great portability over a large set of systems. These include “traditional” multicore processors with up to 36 cores, Intel Knights Landing processors with 64 and 68 cores, and a system with 24 cores plus four Nvidia K40 GPUs. On this last system, when all four GPUs are used, the achieved performance is close to 3.2 Tflop/s, which is quite remarkable for a sparse linear algebra code on a single-node machine.

3.6 A performance analysis approach for task-based parallelism

Performance profiling and analysis is one of the cornerstones of High Performance Computing. Nonetheless, it may be a challenging task to achieve, especially in the case of complex applications or algorithms and large or complex architectures. Therefore, it is no surprise that performance profiling has been the object of a vast amount of literature. In the case of sequential applications, a rather simple but effective approach consists in computing the efficiency of the code as the ratio of the attained speed to a reference

performance which depends on the peak capability of the underlying architecture. Because processing units and memories work at different speeds, this reference performance varies depending on whether the application is compute bound (i.e., limited by the speed of the processing unit) or memory bound, and to what extent. The Roofline model [161] is a popular method for computing this performance upper bound as a function of the arithmetic intensity:

\[
\text{Attainable Gflop/s} \;=\; \min\left\{\ \text{Peak Floating-point Performance},\ \ \text{Peak Memory Bandwidth}\times\text{Arithmetic Intensity}\ \right\}.
\]
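As an illustration, a minimal sketch of this bound follows; the peak values are hypothetical placeholders for what would, in practice, be measured with benchmarks such as _gemm and STREAM (as discussed next).

    def roofline_gflops(peak_gflops, peak_bw_gbs, arithmetic_intensity):
        """Attainable Gflop/s = min(peak compute, peak bandwidth x intensity)."""
        return min(peak_gflops, peak_bw_gbs * arithmetic_intensity)

    # Hypothetical machine: 500 Gflop/s measured with _gemm, 60 GB/s with STREAM.
    for ai in (0.5, 2.0, 8.0, 32.0):      # arithmetic intensity, in flops per byte
        print(ai, roofline_gflops(500.0, 60.0, ai))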

The peak floating-point performance and peak memory bandwidth can be set equal to the theoretical values of the architecture; these values, however, are commonly unattainable and therefore these parameters are instead computed using benchmarks such as the BLAS _gemm (matrix-matrix multiply) operation, which is commonly considered the fastest compute-bound operation, and the STREAM [127] benchmark, respectively. The roofline model can also be used for shared-memory parallel applications but cannot be extended to the case where accelerators are used or to distributed-memory parallel codes. Moreover, its use is difficult in the case of complex applications, such as the multifrontal method, which include both memory-bound and compute-bound operations whose relative weight varies depending on the input problem. The performance of a parallel code, either shared or distributed-memory, on a homogeneous platform (i.e., where all the processing units are the same) is commonly assessed by measuring its speedup, i.e., the ratio between the sequential and the parallel execution times t(1)/t(p), or, equivalently, the parallel efficiency

\[
e(p) = \frac{t(1)}{t(p)\times p}
\]
where p is the number of processing units used. The scalability measures the ability of a parallel code to reduce the execution time as more resources are provided. Amdahl's law can be used to define a bound on the achievable speedup (or parallel efficiency, or scaling) of a parallel code but, again, this bound is very hard to achieve for complex and irregular applications.

The emerging heterogeneous architectures represent a challenge for the performance evaluation of parallel algorithms compared to uni-processor and parallel, homogeneous environments. Accelerators not only process data at different speeds compared to the CPUs but also have different capabilities, i.e., they are more or less suited to different types of operations, and are attached to their own memory, which has different latency and bandwidth than that of the host. The performance analysis of codes that use accelerators is often limited to measuring the added performance brought by the accelerators, that is, a simple speed comparison with the CPU-only execution.

Although the techniques presented above can be used to achieve a rough evaluation of the performance of a parallel code, none of them provides any insight which can guide the HPC expert in reformulating or improving their algorithms, or the programmer in optimizing their code, in order to achieve better performance. Many factors play an important role in the performance and scalability of a code; among others, we can mention the cost of data transfers and synchronizations, the granularity of operations, the properties of the algorithm and the amount of concurrency it can deliver. A quantitative evaluation of these factors can be extremely valuable.


In order to evaluate the effectiveness of the techniques proposed above, as well as of the software that implements them, we developed a novel performance analysis approach [J2, C4]. First, we introduce a method for computing a relatively tight upper bound for the performance attainable by the parallel code, which is not merely a sum of the peak performance of the available processing units; this performance reference is computed by taking into account not only the features of the underlying architecture but also the properties of the implemented algorithm, and it allows for evaluating the efficiency of a parallel code. Then we show how it is possible to factorize this efficiency measure into a product of terms that allow for assessing, individually, the effect of several factors playing a role in the performance and scalability of a code. This analysis requires the ability to retrieve specific information from the execution, which is easy to gather when using a runtime system.

Consider the problem of evaluating the execution of a parallel application on a target computing environment composed of p heterogeneous processors such as CPU and GPU workers. Critical factors playing a role in parallel executions must be considered to compute realistic performance bounds and to understand to what extent each of these factors may limit the performance. Idle times and data transfers (communications, in general), for example, represent a major bottleneck for the performance of parallel executions. We also seek to quantify the cost of the runtime system, if any, compared to the workload in order to evaluate the effectiveness of these tools and estimate the overhead they induce. In the proposed performance evaluation approach we perform a detailed analysis of the execution times by considering the cumulative times spent by all threads in the main phases of the execution:

• tt(p): The time spent in tasks which represent the workload of the application;

• tr(p): The time spent in the runtime for handling the execution of the application (in our case, this includes building the DAG and scheduling the tasks);

• tc(p): The time spent performing communications that are not overlapped by computations. This corresponds to the time spent by workers waiting for data to be transferred to their associated memory node before being able to execute a task;

• ti(p): The idle time spent waiting for dependencies between tasks to be satisfied.

The execution time of the factorization, t(p), may be expressed using these cumulative times as follows:

\[
t(p) = \frac{t_t(p) + t_r(p) + t_c(p) + t_i(p)}{p}.
\]
The efficiency of a parallel code can be defined as

\[
e(p) = \frac{t^{min}(p)}{t(p)}
\]
where tmin(p) is a lower bound on the execution time with p processes. A possible way of computing tmin(p) for a task graph is to measure the execution time associated with the optimal schedule. However, given the complexity of the task scheduling problem in the general case, it is not reasonable to compute this tmin(p) for any input problem. Instead we choose to use a looser bound that consists in computing the optimal value of tmin(p) for a relaxed version of the initial scheduling problem built on the following assumptions:

85 3. Parallelism and performance scalability

1. There are no dependencies between tasks, which means that we consider an embarrassingly parallel problem, i.e., ti(p) = 0;

2. The runtime does not induce any overhead on the execution time, i.e., tr(p) = 0;

3. The cost of all data transfers is equal to zero, i.e., tc(p) = 0;

4. Tasks are moldable, meaning that they may be processed by multiple processors.

In order to compute tmin(p) we introduce the following notation: for a set of tasks Ω running on a set of p resources denoted by R, we define t_r^ω as the time that resource r would take to process the entire task ω, and α_r^ω as the share of work in task ω actually processed by resource r. Then tmin(p) can be computed as the solution of the following linear program:

Linear Program 1 — Minimize T such that, for all r ∈ R and for all ω ∈ Ω:
\[
\sum_{\omega\in\Omega} \alpha_r^{\omega}\, t_r^{\omega} \;\leq\; T, \qquad\qquad \sum_{r=1}^{|R|} \alpha_r^{\omega} \;=\; 1,
\]

where the t_r^ω can be computed using a performance model. Note that the problem of finding tmin(p) is equivalent to minimizing the area occupied by the α_r^ω t_r^ω for all ω ∈ Ω and r ∈ R. For this reason the optimal value tmin(p) is replaced by tarea(p) in the following. We illustrate how tarea(p) is defined in Figure 3.23 on a simple execution with three resources. The left of the figure represents the trace of the real execution, where t(p) is measured. The right represents the optimal schedule for the relaxed version of the original scheduling problem, where tarea(p) is measured.


(a) Actual schedule. (b) Area schedule.

Figure 3.23: Illustration on a simple Gantt chart of a parallel execution with three workers.
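As an illustration, Linear Program 1 can be solved with an off-the-shelf LP solver; the following minimal sketch uses scipy.optimize.linprog on hypothetical task timings t_r^ω and returns the optimal T, i.e., the value tarea(p). It is only a sketch of the formulation, not the code actually used in the analysis.

    import numpy as np
    from scipy.optimize import linprog

    def t_area(times):
        """times[w][r] is t_r^w, the (model-provided) time that resource r
        would need to process the whole task w."""
        n_tasks, n_res = times.shape
        n_vars = n_tasks * n_res + 1           # all alpha_r^w, plus T (last variable)
        c = np.zeros(n_vars); c[-1] = 1.0      # minimize T

        # For every resource r: sum_w alpha_r^w * t_r^w - T <= 0
        A_ub = np.zeros((n_res, n_vars)); b_ub = np.zeros(n_res)
        for r in range(n_res):
            for w in range(n_tasks):
                A_ub[r, w * n_res + r] = times[w, r]
            A_ub[r, -1] = -1.0

        # For every task w: sum_r alpha_r^w = 1 (the task is fully processed)
        A_eq = np.zeros((n_tasks, n_vars)); b_eq = np.ones(n_tasks)
        for w in range(n_tasks):
            A_eq[w, w * n_res : (w + 1) * n_res] = 1.0

        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * n_vars)
        return res.x[-1]                       # optimal T, i.e. t^area(p)

    # Toy example: 3 tasks on 2 resources (e.g. a CPU core and a GPU).
    print(t_area(np.array([[4.0, 1.0], [2.0, 2.0], [6.0, 1.5]])))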

Note that parallelism is normally achieved by partitioning operations and data; this implies a smaller granularity of tasks and thus, likely, a poorer performance of the operations performed by them. Moreover, parallel algorithms often trade floating-point operations for concurrency and therefore may perform more operations than the corresponding sequential ones (this is, for example, the case of 2D communication-avoiding QR factorizations, as explained in Section 3.4.1). Based on these observations, we can further refine the efficiency definition above by replacing tarea(p) with t˜area(p), computed as the solution of Linear Program 1 assuming ω ∈ Ω˜, where Ω˜ is the set of tasks of the sequential algorithm. In other words, in t˜area(p) we assume that there is no performance loss when working on partitioned data and that the parallel algorithm has the same cost as the sequential one. Please note that this also models the fact that in some cases data are inherently of small granularity and, therefore, tasks that work on them have a poor performance regardless of the partitioning; in the multifrontal method, for example, this is the case for the small frontal matrices at the bottom of the elimination tree.


By replacing the term t(p) in the expression of the parallel efficiency using cumulative times and noting, from the definition of our lower bound, that \( \tilde{t}_t^{area}(p) = p\times\tilde{t}^{area}(p) \), we may express the parallel efficiency as
\[
e(p) \;=\; \frac{\tilde{t}^{area}(p)}{t(p)}
     \;=\; \frac{\tilde{t}^{area}(p)\times p}{t_t(p)+t_r(p)+t_c(p)+t_i(p)}
     \;=\; \frac{\tilde{t}_t^{area}(p)}{t_t(p)+t_r(p)+t_c(p)+t_i(p)}
\]
\[
\;=\; \underbrace{\frac{\tilde{t}_t^{area}(p)}{t_t^{area}(p)}}_{e_g}\cdot
      \underbrace{\frac{t_t^{area}(p)}{t_t(p)}}_{e_t}\cdot
      \underbrace{\frac{t_t(p)}{t_t(p)+t_r(p)}}_{e_r}\cdot
      \underbrace{\frac{t_t(p)+t_r(p)}{t_t(p)+t_r(p)+t_c(p)}}_{e_c}\cdot
      \underbrace{\frac{t_t(p)+t_r(p)+t_c(p)}{t_t(p)+t_r(p)+t_c(p)+t_i(p)}}_{e_p}.
\]

This expression allows us to decompose the parallel efficiency as the product of five well-identified effects (a small computational sketch follows the list below):

• eg: the granularity efficiency, which measures how the overall efficiency is reduced by the data partitioning and by the use of parallel algorithms. This loss of efficiency is mainly due to the fact that, because of the partitioning of data into fine-grained blocks, elementary operations do not run at the same speed as in the purely sequential code, and to the fact that the parallel algorithm may perform more flops than the sequential one;

• et: the task efficiency, which measures how well the assignment of tasks to processing units matches the tasks' properties to the units' capabilities, as well as the exploitation of data locality (more details are provided below);

• er: the runtime efficiency, which measures the cost of the runtime system with respect to the actual work done;

• ec: the communication efficiency, which measures the cost of the data transfers between workers with respect to the actual work done;

• ep: the pipeline efficiency, which measures how well the tasks have been pipelined. This includes two effects. First, the quality of the scheduling: if the scheduling policy takes bad decisions (for example, if it delays the execution of tasks along the critical path), many stalls can be introduced in the pipeline. Second, the shape of the DAG or, more generally, the amount of concurrency it delivers: for example, in the extreme case where the DAG is a chain of tasks, any scheduling policy will do equally badly because all the workers except one will be idle at any time.
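As announced above, the following minimal sketch evaluates the five terms of the decomposition from cumulative times; the input values are hypothetical measurements (tt_area_seq stands for p·t˜area(p), i.e., the cumulative task time of the area bound built from the sequential algorithm, and tt_area for its parallel-algorithm counterpart), and the sketch is not tied to any particular runtime system.

    def efficiency_breakdown(tt_area_seq, tt_area, tt, tr, tc, ti):
        """Compute e_g, e_t, e_r, e_c, e_p and their product from the
        cumulative task/runtime/communication/idle times defined above."""
        e_g = tt_area_seq / tt_area
        e_t = tt_area / tt
        e_r = tt / (tt + tr)
        e_c = (tt + tr) / (tt + tr + tc)
        e_p = (tt + tr + tc) / (tt + tr + tc + ti)
        e = e_g * e_t * e_r * e_c * e_p     # equals tt_area_seq / (tt+tr+tc+ti)
        return e, dict(e_g=e_g, e_t=e_t, e_r=e_r, e_c=e_c, e_p=e_p)

    # Hypothetical cumulative times (in seconds) for one run:
    print(efficiency_breakdown(90.0, 100.0, 110.0, 2.0, 3.0, 10.0))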

3.6.1 Analysis for homogeneous multicore systems

In the case of a homogeneous architecture such as a multicore system with p cores, the solution of Linear Program 1 greatly simplifies and can easily be found. t˜area(p) and tarea(p) do not have to be computed but can be measured by timing, respectively, the sequential execution of the sequential algorithm and the sequential execution of the parallel algorithm (with the corresponding data partitioning), that is

\[
\tilde{t}^{area}(p) = \frac{\tilde{t}(1)}{p}, \qquad t^{area}(p) = \frac{t(1)}{p}. \tag{3.2}
\]
Note that in a sequential execution the cost of the communications and of the runtime, as well as the idle times, are all identically equal to zero and therefore t(1) = tt(1) and t˜(1) = t˜t(1).


Replacing t(p) and tarea(p) in the expression of the parallel efficiency using the cumulative times we obtain:

\[
e(p) = \frac{\tilde{t}_t(1)}{t_t(p)+t_r(p)+t_i(p)}
\]
\[
\;=\; \underbrace{\frac{\tilde{t}_t(1)}{t_t(1)}}_{e_g}\cdot
      \underbrace{\frac{t_t(1)}{t_t(p)}}_{e_t}\cdot
      \underbrace{\frac{t_t(p)}{t_t(p)+t_r(p)}}_{e_r}\cdot
      \underbrace{\frac{t_t(p)+t_r(p)}{t_t(p)+t_r(p)+t_i(p)}}_{e_p}.
\]

Here we assumed ec = 1 because there are no explicit or measurable data transfers. These, however, happen implicitly when tasks access data which are remotely located in the NUMA memory system; this makes the tasks' execution time tt(p) increase as the number of cores p increases. This effect is measured by et.

Figure 3.24 shows the efficiency analysis related to the experimental results presented in Sections 3.3 and 3.4 with the 1D and 2D algorithms, respectively. The 2D approach obviously has a lower granularity efficiency because of the smaller granularity of tasks and because of the extra flops. Note that the 1D code may suffer from poor cache behavior in the case of extremely overdetermined frontal matrices because of the extremely tall-and-skinny shape of block-columns. This explains why the granularity efficiency for the 2D code on matrix #20 is better than for the 1D code, unlike for the other matrices. 2D algorithms, however, achieve a better task (i.e., locality) efficiency than 1D ones, most likely due to the 2D partitioning of frontal matrices into tiles, which have a more cache-friendly shape and size than the extremely tall and skinny block-columns used in the 1D algorithm. The pipeline efficiency results confirm that the 1D approach produces less concurrency on matrices that are of relatively small size or on those whose fronts are extremely overdetermined (e.g., matrix #20). Not surprisingly, the 2D code achieves much better pipeline efficiency (i.e., less idle time) than the 1D code on all matrices: this results from a much higher concurrency, which is the purpose of the 2D code. As for the runtime efficiency, it is in favor of the 1D implementation due to a much smaller number of tasks with bigger granularity and simpler dependencies. However, the performance loss induced by the runtime is extremely small in both cases: less than 2% on average and never higher than 4% for the 2D implementation.

3.6.2 Analysis for heterogeneous systems

Compared to the homogeneous case, computing t˜area(p) and tarea(p) in the context of heterogeneous systems is more complex because task execution times depend on the type of resource where they are executed. Therefore we need to solve Linear Program 1 to determine these values. The first step consists in gathering the execution times for every task on every possible computational unit. Then the linear program is built, either statically if all tasks are known in advance or by registering it during an actual execution, and finally solved using a linear programming solver.

The task efficiency, in this case, still measures the effect of implicit communications but also, and more importantly, how well the heterogeneity of the processing units is exploited. In other words, et measures how well tasks have been mapped to computational units with respect to the optimal mapping computed by solving the linear program. Note that this efficiency is not necessarily lower than one and it is closely related to the pipeline efficiency:

• et < 1: tasks are globally being executed at a lower speed with respect to the optimal. This may happen because the tasks assigned to the different processing units are not


of the right type; this is, for example, the case when tasks of very small granularity are executed by GPUs;

• et > 1: note that a particularly naive scheduling policy could map all the tasks to the faster units. This would obviously result in a small cumulative task execution time tt(p) but would inevitably lead to the starvation of the slower units and, as a consequence, to a poor pipeline efficiency ep.

As a result, the quality of the scheduling policy can be measured by the product of the task and pipeline efficiencies et · ep, which is always less than one. This performance analysis can be used to evaluate the behavior of the code presented in Section 3.5 and to explain the related results. We compute t˜area(p) as an upper bound on the performance of our application and the efficiency measures eg, et, ep, ec and er to identify the main factors limiting the performance of the execution with respect to this upper bound. Technically, t˜area(p) and tarea(p) are computed by running two instances of the code, one with coarse-grain partitioning and the other with hierarchical partitioning, and solving, in both cases, Linear Program 1. Note that in order to generate the linear problem, it is necessary to run the code multiple times so that StarPU can build accurate performance profiles.

Figure 3.25 shows the efficiency analysis for our code on the test matrices. With a runtime efficiency er greater than 0.9 for the tested matrices, we see that the cost of the runtime system is negligible compared to the workload. In addition, the runtime overhead becomes relatively smaller and smaller as the size of the problems increases. These results also show that our scheduling policy does a good job in assigning tasks to the units where they can be executed more efficiently. The task efficiency et, which lies between 0.8 and 1.2 for all tested matrices except matrix #15, denotes a good balancing of the workload between the CPUs and the GPU. We observe in our experiments that the task efficiency may be greater than one, as for matrices #15, #16 and #18. As explained above, this is simply due to the fact that too many tasks are assigned to the faster units, e.g., GPUs; this implies starvation of the slower units, which translates into a weaker pipeline efficiency, as shown in Figure 3.25.

The most penalizing effect on the global efficiency is the pipeline efficiency ep. In addition, for all tested matrices except matrix #21, the pipeline efficiency decreases when the number of cores goes from twelve to twenty-four. This is mainly due to a lack of concurrency resulting from the choice of partitioning. This choice aims at achieving the best compromise between the efficiency of kernels and the amount of concurrency; it must be noted that ep could certainly be improved by using a finer-grain partitioning, but this would imply a worse efficiency of the tasks and thus, as a consequence, higher values for both t_t(p) and t_t^area(p). In addition, smaller matrices do not deliver enough parallelism to feed all the resources, which explains why the pipeline efficiency is greater on the biggest problems. Similarly to the runtime efficiency, the communication efficiency is rather good, with values greater than 0.85. This shows that the scheduler is capable of efficiently overlapping task execution with communications thanks to the data prefetching capability enabled by the use of worker threads. All in all, we can observe that the parallelization efficiency is satisfactory, especially on the biggest problems.


[Figure 3.24 contains four panels — granularity efficiency, task efficiency, pipeline efficiency and runtime efficiency — each plotting, for matrices #12–#23, the corresponding measure for the 1D and 2D algorithms (curves eg/et/ep/er, 1D and 2D).]

Figure 3.24: Efficiency measures for the STF 1D and 2D algorithms on ada (32 cores).


[Figure 3.25 contains six panels — granularity, task, pipeline, communication and runtime efficiency, plus the overall efficiency — each plotting, for matrices #12, #13, #14, #15, #16, #18, #19 and #21, the corresponding measure for the two configurations 12 CPUs + 1 GPU and 24 CPUs + 1 GPU.]

Figure 3.25: Efficiency measures for the STF heterogeneous algorithm on Sirocco with both configurations 12 CPUs (1×E5-2680) + 1 GPU (1×K40M) and 24 CPUs (2×E5- 2680) + 1 GPU (1×K40M). Note that due to technical issues in StarPU, we are currently unable to obtain the efficiency measures for matrix #17.


Chapter 4

Memory-aware

In this chapter we tackle the problem of memory scalability of a sparse, multifrontal solver. Frontal matrices are not statically allocated at once before the factorization begins; rather, each frontal matrix is allocated only when the corresponding node of the elimination tree is visited. When the processing of a front is finished, only part of this memory can be freed, i.e., the part containing the contribution block. Therefore, the memory used to store a frontal matrix can be logically split into two parts:

1. a persistent memory: once the frontal matrix is factorized, this part contains the factor coefficients and, therefore, once allocated it is never freed, except in an out-of-core execution (where factors are written to disk)1;

2. a temporary memory: this part holds the contribution block and is freed once the coefficients it contains have been assembled into the parent front.

At any time during a sequential multifrontal factorization, we refer to the memory containing the frontal matrix being processed plus all contribution blocks that are not yet assembled into their parents as Active Memory.

As a consequence of the above, the memory footprint of the multifrontal factorization (QR as well as LU or other types) in a sequential execution varies greatly throughout the factorization. Starting at zero, it grows fast at the beginning as the first fronts are activated, it then goes up and down as fronts are activated and deactivated until it reaches a maximum value, which we refer to as the sequential peak, and eventually goes down towards the end of the factorization to the point where the only thing left in memory is the factors.

Remember that, as explained in Section 2.2, the elimination or assembly tree has to be traversed in topological order, i.e., from the bottom up. Yet, many topological orders are possible for a given tree and each of them will result in a different memory consumption profile and, ultimately, in a different value for the sequential peak. Historically, postorders have been preferred: a postorder is a topological order where all the nodes in each subtree are numbered contiguously. This is because, if a postorder is followed in a sequential execution, the active memory behaves as a stack, which is a very desirable property when dynamic allocation is not available and the storage of data has to be handled manually within a large, statically declared memory region. With the availability of fast and scalable dynamic memory allocators, this has become less of an issue; nonetheless, postorders are still preferred because of the better locality of access to contribution blocks, which can improve the use of cache memories.

1In some other cases the factors can also be discarded, for example when the factorization is done for computing the determinant of the matrix.


Reducing the memory requirements of the serial multifrontal method (in particular the active memory) has been extensively studied [88, 109, 116, 118]. Such problems were originally studied by Sethi et al. [147] for the evaluation of arithmetic expressions whose computation consists in the traversal of the tree representing them. The objective of the aforementioned work is to compute these arithmetic expressions using the minimal number of registers. This problem may be formulated as a tree pebble game, which has a polynomial complexity for tree-shaped graphs [147] and is shown to be NP-hard by Sethi [146] for general DAGs if nodes cannot be pebbled more than once.

Consider the case where the memory for a frontal matrix is allocated after all its children subtrees have been processed (this is referred to as the terminal allocation scheme [109]) and no overlap is possible between the memory of a front and that of its children. In this case, the memory-minimizing postorder traversal of the multifrontal tree is given by Liu [118]; it can be explained based on the relation between the tree pebble game and the memory usage of the multifrontal method. Consider a node of the tree having nc children. We denote by mi the memory needed to store frontal matrix i and by cbi the size of its contribution block. Liu proves that, in order to minimize the total storage T or the total working storage S (i.e., the active memory), for every node of the tree its children have to be sorted so as to minimize the quantity
\[
\max_{i=1,\dots,nc}\Big( x_i + \sum_{j=1}^{i-1} y_j \Big),
\]
where x_i represents the memory needed to process child i and \(\sum_{j=1}^{i-1} y_j\) the memory remaining after processing the first i−1 children. Liu proves [118] that this quantity can be minimized by ordering the children in descending order of xi − yi. The values associated with xi and yi depend on the memory management and assembly scheme, as extensively presented and discussed by L'Excellent [109]. For example, if the objective is to minimize the total storage T, assuming a classical (i.e., non in-place) assembly is used, then xi = Ti and yj = cbj + Fj, where Ti is the total storage for processing the entire subtree rooted at node i and Fi is the size of all factors associated with the nodes in the subtree rooted at node i. If, instead, the objective is to minimize the active memory S, then xi = Si and yj = cbj, where Si is the active storage needed to process the entire subtree rooted at node i. The optimal memory ordering is obtained using a depth-first traversal of the tree with the rearranged child sequences, assuming that for a leaf node i, Ti = Si = mi. The minimal memory usage for the sequential traversal of the elimination tree is given by the memory peak computed at the root node.

Figure 4.1 shows an example of how the memory usage varies during a multifrontal factorization. The table on the right side of the figure shows the memory consumption for a sequential execution where the tree is traversed in natural order, that is a-b-c-d-e; the corresponding sequential peak is equal to 22 memory units. If we apply the above algorithm to the case of this figure, we find that the optimal memory traversal is b-c-d-a-e and the corresponding sequential memory peak is equal to 19 memory units.

Liu [116] observed that a postorder may not give the best memory usage among all topological orders and proposed an algorithm to find the memory-optimal topological order for a given tree. Motivated by tree structures emerging from the multifrontal factorization, Jacquelin et al.
[103] give an alternative algorithm for computing this memory-optimal topological order. Their experimental finding is that, on trees associated with large sparse matrices coming from real applications, the optimal traversal is a postorder 95% of the time. In the worst case, the memory overhead induced by using the best postorder instead of the best topological order is 18%. This confirms that using a postorder traversal (in the serial case) is a reasonable choice, especially since it allows for an efficient stack mechanism. Other allocation schemes exist and they have different properties with respect to memory usage and, in the out-of-core case, disk traffic [8, 88, 109].


Tree: root e (3,0) with children a (2,1) and d (8,1); d has children b (1,4) and c (1,4).

    Task             Memory
    activate(a)           3
    activate(b)           8
    activate(c)          13
    activate(d)          22
    deactivate(b)        18
    deactivate(c)        14
    activate(e)          17
    deactivate(a)        16
    deactivate(d)        15

Figure 4.1: The memory consumption (right) for a 5-node elimination tree (left) assuming a sequential traversal in natural order. Next to each node of the tree are the two values corresponding to the factor (permanent) and contribution block (temporary) sizes, in memory units.
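The following minimal Python sketch (not part of any solver code) reproduces the memory profile of this example under the terminal allocation scheme described above: activating a front allocates its factor and contribution-block storage, and the contribution blocks of its children are freed right after the activation of their parent. Applied to the natural order and to the order produced by Liu's rule, it reproduces the peaks of 22 and 19 memory units.

    # Node sizes are (factors, contribution block) as in Figure 4.1.
    nodes = {"a": (2, 1), "b": (1, 4), "c": (1, 4), "d": (8, 1), "e": (3, 0)}
    children = {"a": [], "b": [], "c": [], "d": ["b", "c"], "e": ["a", "d"]}

    def peak(order):
        """Peak of memory for a sequential traversal in the given topological order."""
        mem, mx = 0, 0
        for n in order:
            f, cb = nodes[n]
            mem += f + cb                  # activate the front
            mx = max(mx, mem)
            for c in children[n]:
                mem -= nodes[c][1]         # deactivate child: free its CB only
        return mx

    print(peak(["a", "b", "c", "d", "e"]))   # natural order  -> 22 memory units
    print(peak(["b", "c", "d", "a", "e"]))   # Liu's ordering -> 19 memory units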

The use of parallelism increases the peak of active memory. Specifically, this is due to tree parallelism, because it implies traversing multiple branches of the elimination tree concurrently, which results in more contribution blocks and fronts being stored in the global active memory. The memory-minimization problem, extensively studied in the sequential case, has received little attention in the parallel case. In recent work, Eyraud-Dubois et al. [72] show that the parallel variant of the pebble game is NP-complete and prove that no approximation algorithm can be designed to tackle the problem. They propose, instead, several heuristics for scheduling task trees which aim at reducing the memory consumption of a multifrontal factorization2 in a shared-memory, parallel setting. One of these heuristics, MemBookingInnerFirst, is such that the parallel processing of the elimination tree can be achieved while respecting a prescribed memory bound. This is achieved through a memory reservation system; roughly speaking, when a tree node is activated, not only is the memory needed for its processing allocated but the memory needed to process its parent is also reserved. This reservation ensures that, when the parent node becomes ready, enough memory is available to process it; this prevents memory deadlocks (see the next section for an explanation of this issue). The authors experimentally validated this heuristic, as well as the others, by simulating a rather simplistic shared-memory parallelization scheme. It must be noted that this problem is quite common in parallel computing whenever temporary data is generated as a result of the parallel processing. Sid-Lakhdar [148], for example, deals with memory deadlocks that may occur because of the temporary buffers used for exchanging messages in distributed-memory parallel codes. We will describe methods for controlling the memory consumption in the multifrontal factorization first in a shared-memory, parallel setting and then in a distributed-memory one, in Sections 4.1 and 4.2, respectively.

4.1 Memory-aware scheduling in shared-memory systems

Figure 4.2 shows, in the blue curve, the typical memory consumption profile of a sequential multifrontal factorization; the data in this figure results from the multifrontal QR factorization of matrix #12 from Table A.1.

2Note that the problem is presented in a much more general form in their paper and not necessarily related to the multifrontal factorization.

The memory footprint varies considerably throughout the factorization and presents spikes immediately followed by sharp decreases; each spike corresponds to the activation of a front whereas the following decrease corresponds to the deactivation of its children once its assembly is completed.

[Figure 4.2 plots memory (in MB, up to about 4000) against time (s., up to about 250) for three runs: sequential, 6 threads with no constraint, and 6 threads with a constraint of 2.0 times the sequential peak.]

Figure 4.2: The memory profiles for matrix #12 from Table A.1 for a sequential (in blue) and 6-threaded parallel QR factorization without memory constraint (in red) and with a constraint equal to 2.0 times the sequential peak (in green).

As explained above, the memory consumption of the STF parallel code discussed so far can be considerably higher than the sequential peak (up to 3 times or more). This is due to the fact that the runtime system tries to execute tasks as soon as they are available and to the fact that activation tasks are extremely fast and only depend upon each other; as a result, all the fronts in the elimination tree are almost instantly allocated at the beginning of the factorization. This behavior, moreover, is totally unpredictable because of the very dynamic execution model of the runtime system. This is depicted in Figure 4.2 by the red curve, which shows the memory consumption for the case where our runtime-based STF factorization is run with six threads; in this case the memory consumption is roughly five times higher than the sequential peak. Figure 4.3 shows the memory consumption of the STF parallel code relative to the sequential code on a number of matrices when executed on the ada computer using 32 threads. As shown, in several cases the memory increase can be considerable, especially in the case where the factors are discarded because, in that case, the temporary memory accounts for a larger share of the total footprint.

This section proposes a method for limiting the memory consumption of parallel executions of our STF code by forcing it to respect a prescribed memory constraint, which has to be equal to or larger than the sequential peak. This technique shares commonalities with the MemBookingInnerFirst heuristic proposed by Eyraud-Dubois et al. [72]. On the other hand, whereas Eyraud-Dubois et al. [72] only consider the theoretical problem, the present study proposes a new and robust algorithm which ensures that the imposed memory constraint is respected while allowing a maximum amount of concurrency on shared-memory multicore architectures. In the remainder of this section we assume that the tree traversal order is fixed and we do not tackle the problem of finding a different traversal that minimizes the memory footprint. We rely on the STF model to achieve this objective with a relatively simple algorithm. In essence, the proposed technique amounts to subordinating the submission of tasks to the availability of memory. This is done by suspending the execution of the outer loop in Figure 3.2 if not enough memory is available to activate a new front, until the required memory amount is freed by already submitted deactivate tasks.



Figure 4.3: Memory consumption of the parallel STF multifrontal QR factorization with 32 threads relative to the sequential memory peak on the ada computer for the matrices in Table A.1, either discarding or keeping the factors in memory. Matrices #22 and #23 could only be factorized discarding the factors due to the limited memory available on the machine.

Special attention has to be devoted to avoiding memory deadlocks, though. A memory deadlock may happen because the execution of a front deactivation task depends (indirectly, through the assembly tasks) on the activation of its parent front; therefore the execution may end up in a situation where no more fronts can be activated due to the unavailability of memory and no more deactivation tasks can be executed because they depend on activation tasks that cannot be submitted. An example of memory deadlock can be shown using Figure 4.1. Remember that the memory-minimizing traversal for the tree in this figure is b-c-d-a-e, which leads to a memory consumption of 19 units. Assume a parallel execution with a memory constraint equal to the sequential peak. If no particular care is taken, nothing prevents the runtime system from activating nodes a, b and c at once, thus consuming 13 memory units; this would result in a deadlock because no other front can be activated or deactivated without violating the constraint.

This problem can be addressed by ensuring that the fronts are allocated in exactly the same order as in a sequential execution: this condition guarantees that, if the task submission is suspended due to low memory, it will be possible to execute the deactivation tasks that free the memory required to resume the execution. Note that this only imposes an order on the allocation operations and that all the submitted tasks related to activated fronts can still be executed in any order, provided that their mutual dependencies are satisfied. This strategy is related to the Banker's Algorithm proposed by Dijkstra in the early 60's [59, 60].

In our implementation this was achieved as shown in Figure 4.4. Before performing a front activation (line 4), the master thread, in charge of the submission of tasks, checks whether enough memory is available to perform the corresponding allocations (line 2); if this is the case, the allocation of the frontal matrix (and of the other associated data) is performed within the activate routine. This activation is a very lightweight operation which consists in simple memory bookkeeping (due to the first-touch rule) and therefore does not


     1  forall fronts f in topological order
     2     do while (size(f) > avail_mem) wait
     3     ! allocate and initialize front: avail_mem -= size(f)
     4     call activate(f)
     5
     6     ! initialize the front structure
     7     call submit(init, f:RW, children(f):R)
     8
     9     ! front assembly
    10     forall children c of f
    11        ...
    12        ! Deactivate child: avail_mem += size(cb(c))
    13        call submit(deactivate, c:RW)
    14     end do
    15
    16     ! front factorization
    17     ...
    18  end do
    19
    20  call wait_tasks_completion()

Figure 4.4: Pseudo-code showing the implementation of the memory-aware task submission.

substantially slow down the task submission. The front initialization is done in the init task (line 7), which is submitted to the runtime system and can potentially be executed on any worker thread, as described in Section 3.3. If the memory is not available, the master thread suspends the submission of tasks until enough memory is freed to continue. In order not to waste resources, the master thread is actually put to sleep rather than left in an active wait. This was manually implemented through the use of POSIX thread locks and condition variables: the master thread goes to sleep waiting for a condition which is signaled by any worker thread that frees memory by executing a deactivate task. When woken up, the master checks again for the availability of memory. This work has prompted the StarPU developers to extend the runtime API with routines (starpu_memory_allocate/deallocate() and starpu_memory_wait_available()) that implement this mechanism and easily allow for implementing memory-aware algorithms.

The use of this technique on the case of Figure 4.2 leads to the memory profile depicted by the green curve in the figure: the imposed memory constraint, equal to twice the sequential peak, is never exceeded whereas the execution time is barely affected. A more detailed analysis is provided in the experimental section below.
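For illustration, the suspension mechanism can be sketched as follows with a condition variable; this is a simplified Python stand-in for the POSIX locks and condition variables (or the StarPU memory routines) mentioned above, not the actual implementation.

    import threading

    class MemoryBudget:
        def __init__(self, limit_bytes):
            self.avail = limit_bytes
            self.cond = threading.Condition()

        def acquire(self, size):
            """Called by the master thread before activating a front; sleeps
            until enough memory has been released by deactivate tasks."""
            with self.cond:
                while size > self.avail:
                    self.cond.wait()        # master goes to sleep, no busy wait
                self.avail -= size

        def release(self, size):
            """Called by any worker that executes a deactivate task."""
            with self.cond:
                self.avail += size
                self.cond.notify_all()      # wake the master so it re-checks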

4.1.1 Experimental results

This section describes and analyses experiments that aim at assessing the effectiveness of the memory-aware scheduling presented above. Here we present only results related to an Out-Of-Core (OOC) QR factorization of the matrices in Table A.1; in this scenario the factors are written to disk as they are computed in order to save memory. In this case the memory consumption is more irregular and is increased more considerably by parallelism. We simulate this scenario by discarding the factors as we did in Sections 3.3 and 3.4; note that by doing so we are assuming that the overhead of writing data to disk has a negligible effect on the experimental analysis reported here. We refer the reader to our original paper on this subject [J2] or to Florent Lopez's PhD thesis [120] for results related to the in-core case.


[Figure 4.5 contains two panels — “Relative time constrained vs not, 2D, discard” and “Relative time constrained vs not, 1D, discard” — showing, for matrices #12–#23, the factorization time under memory constraints of 1.0, 1.2, 1.4, 1.6, 1.8 and 2.0 times the sequential peak, relative to the unconstrained run.]

Figure 4.5: Memory-constrained factorization times relative to the unconstrained execution for the 2D and 1D methods in an OOC execution.

These experiments measure the performance of both the 1D and 2D factorizations (with the parameter values in Tables 3.5 and 3.6) within an imposed memory footprint. Experiments were performed using 32 cores with memory constraints equal to {1.0, 1.2, 1.4, 1.6, 1.8, 2.0} × Sseq, where Sseq denotes the peak of active memory of a sequential execution.

For almost all 1D and 2D tests, a performance as high as in the unconstrained case (presented in Sections 3.3 and 3.4) could be achieved with a memory footprint exactly equal to the sequential peak, which is the lower bound that a parallel execution can achieve. This shows the extreme efficiency of the memory-aware mechanism for achieving high performance within a limited memory footprint. Combined with the 2D numerical scheme, which delivers abundant node parallelism, the memory-aware algorithm is thus extremely robust since it could process all considered matrices at maximum speed with the minimum possible memory consumption. In a few cases a slight increase (always lower than 20%) in the factorization time can be observed, especially when the constraint is set equal to the sequential peak. In only three cases is it possible to observe a smooth decrease of the factorization time as the constraint on the memory consumption is relaxed: these are the 1D factorizations of matrices #12, #14 and #18.


To explain this extreme efficiency, we performed the following analysis. As explained in the previous section, prior to activating a front, the master thread checks whether enough memory is available to achieve this operation. If it is not the case, the master thread is put to sleep and later woken up as soon as one deactivate task is executed; at this time the master thread checks again for the availability of memory. The master thread stays in this loop until enough deactivation tasks have been executed to free up the memory needed to proceed with the next front activation. Every time the master thread was suspended or resumed we recorded the time stamp and the number of ready tasks (i.e., those whose dependencies were all satisfied).

[Figure 4.6 contains two panels — “1D Factorization of the Hirlam matrix on 32 cores” and “2D Factorization of the Hirlam matrix on 32 cores” — plotting the number of ready tasks (from 0 to 80) against time (s.).]

Figure 4.6: Concurrency under a memory constraint for the Hirlam matrix on ada (32 cores).

Figure 4.6 shows the collected data for matrix #12 with an imposed memory consumption equal to the sequential peak, in the OOC case, using both the 1D (left) and 2D (right) methods. In this figure, each (x, y) point means that at time x the master thread was suspended or resumed and that, at that time, y tasks were ready for execution or being executed. The width of each graph shows the execution time of the memory-constrained factorization whereas the vertical dashed line shows the execution time when no limit on the memory consumption is imposed. The figure leads to the following observations:

• in both the 1D and 2D factorizations, the number of ready tasks falls, at some point, below the number of available cores (the horizontal, solid line); this lack of tasks is responsible for a longer execution time with respect to the unconstrained case.

• in the 1D factorization this lack of tasks is more evident; this can be explained by the fact that the 1D method delivers much lower concurrency than the 2D one and therefore, suspending the submission of tasks may lead more quickly to thread starvation. As a result, the difference in the execution times of the constrained and unconstrained executions is more evident in the 1D factorization.

For all other tests, either the number of tasks is always (much) higher than the number of workers or the tasks submission is never (or almost never) interrupted due to the lack of memory; as a result, no relevant performance degradation was observed with respect to the case where no memory constraint is imposed. This behavior mainly results from two properties of the multifrontal QR factorization:


1. the size of the contribution blocks is normally very small compared to the size of factors, especially in the case where frontal matrices are overdetermined;

2. the size of a front is always greater than or equal to the sum of the sizes of all the contribution blocks associated with its children (because in the assembly operation, contribution blocks are not summed to each other but stacked).

As a result, in the sequential multifrontal QR factorization, the memory consumption grows almost monotonically and in most cases the sequential peak is achieved on the root node or very close to it. For this reason, when the tasks submission is interrupted in a memory-constrained execution, a large portion of the elimination tree has already been submitted and the number of available tasks is considerably larger than the number of working threads. Other types of multifrontal factorizations (LU, for instance) are likely to be more sensitive to the memory constraint because they do not possess the two properties described above. By the same token, it is reasonable to expect that imposing a memory constraint could more adversely affect performance when larger numbers of threads are used.

4.2 Memory-aware scheduling and mapping in distributed-memory systems

In this section we turn our attention to the problem of memory consumption in a distributed-memory parallel context, which is much harder to address than in the shared-memory case discussed in the previous section. Obviously, in a distributed-memory context, minimizing the total memory consumption is desirable, but it is also crucial to maintain a balanced memory usage between the processes, to prevent a process from running out of memory (assuming that the same amount of main memory is available on all the processes, which is the most frequent setting). Minimizing the overall memory consumption in a parallel setting is complicated unless a schedule with very specific properties is used. In our distributed-memory context, the maximum peak of active memory (over the set of processes) is the target of our optimization problem. Note that the sum of the peaks of the different processes provides an upper bound on the total consumption (which might be loose). For a parallel execution on p processes, we denote by Smax(p) and Savg(p) the maximum and average peaks of active memory among the p processes, respectively. Savg(p) is computed as the sum of the p peaks divided by p; with the above observation, p · Savg(p) is an upper bound on the total consumption. The performance of a parallel algorithm executed on p processes is often assessed using a notion of efficiency; given the above observations, we consider two kinds of memory efficiency metrics that depend on p:

• eavg(p) = Sseq / (p · Savg(p)): this metric compares the total memory usage (using the upper bound described above) to that of a sequential execution and is relevant in a shared-memory context as well as in a distributed-memory one;

• emax(p) = Sseq / (p · Smax(p)): this metric detects whether one (or more) process(es) use(s) too much memory, which is only relevant in a distributed-memory context (a small computational sketch of both metrics is given after this list).
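As a small illustration, assuming the per-process peaks of active memory have been recorded, both metrics can be computed as follows (the values below are hypothetical).

    def memory_efficiencies(seq_peak, peaks):
        """Return (e_avg, e_max) given the sequential peak and the list of
        per-process peaks of active memory."""
        p = len(peaks)
        s_avg = sum(peaks) / p
        s_max = max(peaks)
        return seq_peak / (p * s_avg), seq_peak / (p * s_max)

    # Hypothetical peaks, in GB, for a run on 4 processes:
    e_avg, e_max = memory_efficiencies(seq_peak=8.0, peaks=[0.9, 1.1, 0.7, 1.8])
    print(e_avg, e_max)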


Assuming pi processes are mapped on the subtree rooted at node i of the tree, a lower bound on the average storage needed by these processes to process the subtree is defined as S̄i = Si/pi. This quantity corresponds to the memory consumption per process in the ideal case where the peak Si is uniformly distributed; as we explain below, this case is very unlikely in practice, unless very specific mapping schemes are used within the subtree.

Consider the following parallel scheme: the tree is processed following the postorder used in the sequential case (tree serialization) and each node of the tree is mapped on all processes; this is what we will refer to as the tree serialization mapping. If we assume a perfect memory balance within each node and within each contribution block, this tree serialization mapping clearly provides a perfect memory scalability (Savg = Smax = S̄i and eavg(p) = emax(p) = 1). However, it does not take advantage of tree parallelism because all the branches are serialized. It also leads, by forcing a large number of processes to be used at each node, to an unnecessary increase in communication (both within nodes when performing dense partial factorizations, and between nodes when communicating contribution blocks). Finally, it does not deliver an adequate granularity of operations within nodes. These drawbacks are likely to induce a significant performance penalty, as we assess in Section 4.2.3. This is why no solver that we are aware of relies on this strategy.

In practice, minimizing the active memory is not the best formulation of the problem we want to solve. Instead of minimizing it, we would rather control the active memory by enforcing a given memory constraint that would be provided by the user or defined by hardware specifications, as in the approach described in the previous section. Then, for this memory constraint, we would like to maximize tree parallelism and parallelism granularity, in order to avoid as much as possible the aforementioned drawbacks related to communication volume and small task granularity; this is exactly the aim of the mapping technique we describe in Section 4.2.2.

4.2.1 Mapping techniques

Different mapping strategies are commonly used in distributed-memory, parallel multifrontal codes. The subtree-to-subcube mapping by Liu, George and Ng [115] and the proportional mapping by Pothen and Sun [136] are popular strategies and are the basis for more sophisticated schemes. These are special cases of the wider family of tree partitioning methods, where the set of processes mapped on a node of the tree is partitioned into disjoint subsets and each subset is assigned to a child subtree. Tree partitioning mappings are well appreciated because they help reduce the volume of data transfers thanks to a good data locality and because they allow for a good use of both tree and node parallelism.

In the proportional mapping method, this recursive splitting of process sets is guided by a balancing criterion. This mapping technique consists in a top-down traversal of the tree where every node is assigned a set of processes. All the processes are assigned to work on the root node. This is a natural choice since the root node is the last task to be performed in the factorization. Then, for every node in the tree, the set of processes working at that node is split among its children, proportionally to the weights (determined according to a given metric) of the subtrees rooted at these children. Denoting by wi the weight of the subtree rooted at node i, and by par(i) the parent of node i, the number of processes pi given to node i is
\[
p_i = \frac{w_i}{\sum_{j:\ par(j)=par(i)} w_j} \cdot p_{par(i)}. \tag{4.1}
\]
This procedure is applied in a recursive fashion to the whole tree, starting from the root r; the recursion stops when leaf nodes are reached or when entire subtrees are mapped onto single

processes, which happens because the number of nodes in the tree is commonly much larger than the number of processes. Not considering the case where fractions of processes are allowed to be mapped on different subtrees (meaning that such a process would work less than the others on a given subtree, and be assigned less memory), rounding is performed in (4.1) in order to ensure that:

• pi is an integer for all nodes i; and

• the tree partitioning property holds, i.e., p_{par(i)} = Σ_{j : par(j)=par(i)} pj.
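To make the recursion concrete, the following Python sketch applies Equation (4.1) in a top-down traversal of a tree given as children lists, with a simple largest-remainder rounding that preserves the tree partitioning property. The function and variable names are purely illustrative and do not correspond to any actual solver code; corner cases (e.g., children that would receive zero processes) are ignored.

```python
def proportional_mapping(children, weights, root, p_root):
    """Assign a process count to every node of a tree by proportional mapping.

    children[i] : list of children of node i
    weights[i]  : weight of the subtree rooted at i (workload- or memory-based)
    Returns a dict: node -> number of processes p_i.
    """
    procs = {root: p_root}
    stack = [root]
    while stack:
        node = stack.pop()
        p, kids = procs[node], children[node]
        if p <= 1 or not kids:
            continue  # whole subtree mapped on a single process, or leaf reached
        total = sum(weights[c] for c in kids)
        shares = [p * weights[c] / total for c in kids]          # Equation (4.1)
        alloc = [int(s) for s in shares]                         # round down first...
        leftover = p - sum(alloc)
        order = sorted(range(len(kids)),                         # ...then give the
                       key=lambda j: shares[j] - alloc[j],       # remaining processes
                       reverse=True)                             # to the largest remainders
        for j in order[:leftover]:
            alloc[j] += 1                  # now sum(alloc) == p (tree partitioning property)
        for c, pc in zip(kids, alloc):
            procs[c] = pc
            stack.append(c)
    return procs
```

Using the subtree peaks of active memory as weights gives the memory-based variant discussed below; using estimated workloads gives the original workload-based proportional mapping.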

The metric used at each step of the mapping in the original method proposed by Pothen and Sun is the workload of each subtree; we refer to this case as the workload-based proportional mapping. Clearly, this criterion can be replaced by another one depending on the objectives. If one aims to achieve a memory balance rather than a load balance, a possible heuristic consists in using a memory-based variant; we report on experimental results using this strategy in the experimental section. Prasanna and Musicus proposed a scheduling strategy for tree-shaped task graphs when the time for computing a parallel task (a malleable task) using p processes is exactly L/p^α (with 0 < α ⩽ 1), where L is the length of the task [137]. Beaumont and Guermouche evaluated this approach in the multifrontal method [30]. An interesting property of the tree partitioning mapping is that the traversal of every process, i.e., the set of tasks that a process executes and the order in which they are processed, is deterministic. Indeed, every process is in charge of a sequential subtree and takes part in the computation of the parallel nodes in the path between that subtree and the root of the elimination tree; this defines a single possible traversal. In the remainder, we will thus use the proportional mapping as a representative of the whole family of tree partitioning mappings; it must be noted, though, that the novel techniques proposed in Section 4.2.2 are perfectly compatible with any tree partitioning technique. Proportional mapping (tree partitioning mappings in general) is not a memory-friendly strategy. Actually, it is possible to demonstrate that the memory efficiency resulting from this mapping technique tends to zero as the number of processors goes to infinity, as established by Theorem 4.1.

Theorem 4.1 Sub-optimality of the proportional mapping ([140, Theorem 8.1]).— Let T be an elimination tree corresponding to a nested dissection of a regular 2D square grid with a nine-point stencil. For a given number of processes p, the memory efficiency of a strict memory-based proportional mapping of T verifies:

emax(p) = Sseq / (p · Smax(p)) ⩽ (6.25/29) p^{-0.2} ≃ 0.22 p^{-0.2}

(for p sufficiently large).

The proof of Theorem 4.1 is very long and tedious and thus we do not report it for the sake of conciseness; we refer the interested reader to the PhD thesis of Rouet [140]. Nonetheless, in the next section we show through an example how the proportional mapping may result in very poor memory scalability.


4.2.1.1 Memory scalability of proportional mapping: an example

We illustrate the behavior of the proportional mapping on a simple yet realistic elimination tree. We consider a memory-based proportional mapping strategy, but we could make the same observations about a workload-based strategy. We consider the tree in Figure 4.7(a), which is to be mapped on 64 processes. First, these 64 processes are assigned to the root node l. Then, a first step of memory-based proportional mapping is used to distribute these 64 processes among the four children of l: a, e, f, and k, as illustrated in Figure 4.7(b). The sequential peaks of active memory of the subtrees rooted at a, e, f, and k are 8 GB, 5 GB, 3 GB and 3 GB, respectively. Therefore, a gets 8/(8+5+3+3) · 64 ≈ 27 processes, e gets 5/(8+5+3+3) · 64 ≈ 17 processes, and f and k get 3/(8+5+3+3) · 64 ≈ 10 processes each. At this stage, we can compute a lower bound on the peak of active memory of every process. Consider the 27 processes working on the subtree rooted at a. The sequential peak of active memory of this subtree is 8 GB. Therefore, at best, i.e., assuming a perfect memory scalability can be attained for this subtree, the maximum peak of active memory among these 27 processes will be S¯a = 8 GB/27 ≈ 296 MB. Similarly, the maximum peaks of active memory for the processes working on the subtrees rooted at e, f and k are bounded from below by S¯e = 5 GB/17, S¯f = 3 GB/10 and S¯k = 3 GB/10, respectively; ignoring the rounding applied to obtain integer numbers of processes, all these peaks are the same (S¯a ≈ S¯e ≈ S¯f ≈ S¯k ≈ 0.3 GB) since we have applied a memory-based proportional mapping. We can, thus, derive a first upper bound on the memory efficiency for this problem; since the sequential peak of active memory for the whole tree is 8 GB, the memory efficiency is bounded as follows:

emax(p) = Sseq / (p · Smax(p)) ⩽ 8 GB / (64 · 0.3 GB) ⩽ 0.42

It is fairly easy to see why the efficiency is low. The sequential peaks of the whole tree and the subtree rooted at a are the same; however, only a small subset of the processes work on the latter subtree. Therefore, even if a perfect memory scalability is attained on the subtree rooted at a, the memory usage for the 27 processes working on that subtree is more than twice what we are targeting (8 GB/27 instead of 8 GB/64). We can refine this upper bound on memory efficiency by looking at the lower levels of the tree. By considering the grandchildren of the root node, we see in Figure 4.7(c) that the memory usage Smax(p) is at least max(S¯b, S¯c, S¯d, S¯i, S¯j) = 0.5 GB, yielding

emax(p) ⩽ Sseq / (p · 0.5 GB) = 0.25

We assumed that a perfect memory scalability was reached in the different subtrees rooted at the grandchildren of the root node, which means that this bound on memory efficiency is likely to be optimistic. Finally, by considering the lowermost level (see node h in Figure 4.7(c)), we obtain emax(p) ⩽ 0.125.
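The arithmetic of this example can be checked with a few lines of Python; the per-process peak lower bounds are those read off the text above (8 GB/27 at the children of the root, 0.5 GB at the grandchildren level, 1 GB at node h).

```python
p, S_seq = 64, 8.0                         # number of processes, sequential peak (GB)

# lower bounds on S_max(p), in GB, obtained at deeper and deeper levels of the tree
for S_max_lower_bound in (8.0 / 27, 0.5, 1.0):
    e_max_upper_bound = S_seq / (p * S_max_lower_bound)
    print(f"e_max <= {e_max_upper_bound:.3f}")
# prints 0.422, 0.250 and 0.125: the deeper we look, the worse the bound
```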

4.2.2 Memory-aware mapping algorithms

We demonstrated that a memory-based proportional mapping leads to a low scalability of the active memory. However, this mapping is interesting in terms of performance since it maximizes tree parallelism and reduces the volume of communication within parallel nodes and between nodes of the tree.


(a) Elimination tree. (b) Proportional mapping. (c) Best-case average peak.

Figure 4.7: An example of memory-based proportional mapping on p = 64 processes. In (a), we indicate next to each node the sequential peak of active memory Si (in GB) of the corresponding subtree, and within each node, the size of its contribution block cbi (in GB). In (b), we indicate within each node the number of processes pi assigned by the proportional mapping. In (c), we indicate within each node the lower bound S¯i = Si/pi on the average storage required per process (in MB).

We also introduced a "tree serialization mapping", which consists in a constrained traversal of the tree where all the processes work at every node, following a postorder. However, this solution generates prohibitive amounts of communications, it does not exploit tree parallelism and yields small granularity of computations on nodes at the bottom of the tree. Therefore it is not time efficient; we assess this in the experimental section. Here, we introduce a "memory-aware mapping" that hybridizes these two techniques and tries to enforce a given memory constraint (the maximum amount of active memory that a process can use). The idea was first presented in Agullo's PhD thesis [2] and is described below. It consists in a tree serialization mapping in memory-demanding parts of the tree, and a proportional mapping whenever we can be sure that it will not violate the memory constraint.

We assume that we are given a memory constraint M0 that represents the maximum amount of active memory that a process is allowed to use. We will also use the notation Mi, i > 0, to denote the memory constraint for a subtree Ti rooted at i (M0 = Mr in case r is the root of the entire tree). The memory-aware mapping works as follows. We assume that the tree has been reordered to reduce the sequential peak of active memory and that the sequential peaks Si have been computed for every subtree Ti. Then, a top-down traversal of the tree is performed to compute the mapping. All the processes are first assigned to the root node r. Then, recursively, once a subtree Tr rooted at r is mapped on pr processes, its children are mapped as follows. We first check whether a proportional mapping of the children is feasible by simulating a proportional mapping and verifying that the memory constraint is respected at every child i. Denote pi the number of processes that a proportional mapping would assign to child i (Equation 4.1). For every child i, we check the condition S¯i = Si/pi ⩽ Mr:

• If all child subtrees Ti respect this condition, then the step of proportional mapping is accepted; the child subtrees will be processed in parallel on the number of processes provided by the step of proportional mapping. For the subsequent steps of the mapping procedure, the memory constraint is unchanged: Mi = Mr.

• If at least one of the subtrees does not respect the condition, then the step of proportional mapping is rejected. All the child subtrees Ti inherit the processes of their parent (pi = pr) and will be processed one after another during the factorization, following the same order as the one of the sequential execution. In this case, when a child subtree Ti is processed, the contribution blocks of the previous siblings j (par(j) = par(i) = r, j < i, assuming the order of the siblings is in agreement with the postorder) are stacked and equally distributed in the memory of the pj = pi = pr processes. Therefore, for the next steps of the mapping procedure, the memory constraint is modified in order to take into account these contribution blocks: Mi = Mr − (Σ_{par(j)=r, j<i} cbj)/pr (where pj = pi = pr).

At each step of the traversal, the condition S¯i ⩽ Mr means "is it possible to process the subtree Ti on pi processes, using a memory at most equal to Mi = Mr on each process?". Thus, when a step of proportional mapping is accepted, we ensure that every subtree will respect the memory constraint. In the end, this algorithm yields a hybrid mapping in-between a proportional mapping and a tree serialization mapping.

We illustrate a few steps of memory-aware mapping in Figure 4.8, using the tree of Figure 4.7(a) again. The tree is to be mapped on p = 64 processes and we choose a rather tight memory constraint M0 = 160 MB (i.e., we target emax = 8 GB/(64 · 160 MB) = 0.8). First, the 64 processes are assigned to the root node l; then the four children a, e, f and k of l are mapped. The first step consists in computing a proportional mapping of the four child nodes: a is given 27 processes, e is given 17 processes and f and k are given 10 processes each. Then, the memory constraint is checked for the subtrees. At node a, the sequential peak of active memory is 8 GB; thus S¯a = 8 GB/27 = 296 MB is greater than M0. Therefore, the subtree rooted at a cannot be processed using 27 processes without violating the memory constraint. Thus the step of proportional mapping is rejected; the four child subtrees are mapped on the 64 processes and are serialized (Ta will be processed first, then Te, and so on). Then the four subtrees are mapped using the same procedure, taking into account the fact that the memory constraint has to be updated because of the stacked contribution blocks. For example, consider the mapping of the subtree rooted at e. Since we have serialized the four child subtrees of l, we have to take into account that, when processing Te, the contribution block of a is stacked in memory and equally distributed among the processes. For a given process, the available memory is no longer Ml = M0 but Me = Ml − 2 GB/64 = 129 MB.

It must be noted that, whenever a proportional mapping step is rejected, even if the memory constraint is violated on a single subtree, a whole set of siblings is serialized. This may lead to an excessive loss of tree parallelism. Rather, the set of siblings can be split into groups; groups are treated sequentially one after the other but within each group proportional mapping is still possible. This clearly allows for a better compromise between memory consumption and concurrency. Forming groups, though, is not trivial and can actually be modeled as a bin-packing problem; heuristics can be used instead. We call this technique Aggregated Memory Aware Mapping and, for further details, we refer the reader to our original work [J1] describing this method as well as the heuristics we implemented for forming groups; experimental results extracted from this article will be presented below.
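The following Python sketch illustrates the flat memory-aware mapping described above (names and data structures are illustrative only, not those of MUMPS): a proportional mapping step is simulated and accepted only if every child respects the memory constraint; otherwise the siblings are serialized on all the processes of the parent and the per-process constraint is reduced by the stacked contribution blocks. Grouping (the aggregated variant) and exact rounding are deliberately left out.

```python
def memory_aware_mapping(children, S, cb, root, p_root, M0):
    """Flat memory-aware mapping (no grouping).

    children[i] : children of node i, ordered consistently with the postorder
    S[i]        : sequential peak of active memory of the subtree rooted at i
    cb[i]       : size of the contribution block of node i
    M0          : per-process memory constraint
    Returns a dict: node -> (number of processes, per-process memory constraint).
    """
    mapping = {root: (p_root, M0)}
    stack = [root]
    while stack:
        node = stack.pop()
        p, M = mapping[node]
        kids = children[node]
        if not kids or p <= 1:
            continue
        total = sum(S[c] for c in kids)
        pm = {c: max(1, round(p * S[c] / total)) for c in kids}  # simulated PM step
        if all(S[c] / pm[c] <= M for c in kids):
            # accepted: children are processed in parallel, constraint unchanged
            for c in kids:
                mapping[c] = (pm[c], M)
                stack.append(c)
        else:
            # rejected: the siblings are serialized on all p processes; while a
            # child is processed, the contribution blocks of its already-processed
            # siblings are stacked and spread over the same p processes
            stacked = 0.0
            for c in kids:
                mapping[c] = (p, M - stacked / p)
                stack.append(c)
                stacked += cb[c]
    return mapping
```

On the tree of Figure 4.7(a) with M0 = 160 MB, this procedure reproduces the behavior described above: the first proportional mapping step is rejected and the four subtrees of the root are serialized on the 64 processes with progressively tighter constraints.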

4.2.3 Experiments

We implemented the proposed algorithms within the MUMPS (MUltifrontal Massively Parallel Solver) software package [13, 16].


Figure 4.8: Simple example of memory-aware mapping. The tree shown in Figure 4.7(a) is mapped using the flat memory-aware mapping with M0 = 160 MB. The scheduling constraints resulting from tree serialization are shown with arrows. The subtree rooted at a is processed first and is followed by the subtrees rooted at e and f, and the subtree rooted at k is processed last. Similarly, the subtree rooted at i is mapped using a local tree serialization mapping.

In this section, we assess different mapping strategies on the set of matrices described in Table 4.1. The largest problem (Geoazur_192) is processed using 256 MPI processes while the other matrices are processed using 64 MPI processes. The Geoazur_192 matrix [131] is unsymmetric and complex and corresponds to a 27-point stencil discretization of a 3D visco-acoustic wave propagation model on a grid of size 192 × 192 × 192. All the matrices are ordered using MeTiS. The experiments were carried out on the eos system (see Appendix A.2). For the Geoazur_192 problem, we used 16 nodes of the system, and we used 4 nodes for the other matrices (i.e., we used 16 MPI processes per node). In Table 4.1, we present the characteristics of the matrices and also statistics for the amount of active memory used by the MUMPS solver, version 5.0.0, for the factorization. The mapping strategy used in MUMPS is described by Amestoy et al. [16]: at the top of the tree, a relaxed proportional mapping is used whereas on lower layers of the tree, a strategy that aims at balancing memory requirements and workloads is used. The results reported in the table show that the memory efficiency of MUMPS, although slightly better than with a strict proportional mapping, is quite low: emax lies between 0.07 and 0.32 (the average is 0.19). First, we assess the behavior of the following strategies for matrix Geoazur_192 in Table 4.2:

• A plain, memory-based, proportional mapping;

• A flat memory-aware mapping, with different constraints M0;

• An aggregated memory-aware mapping, with different constraints M0;

• A tree serialization mapping on the whole tree, where all processes work at every node (except in the case of frontal matrices with fewer rows than processes).


Matrix        Order N     Entries   Factors  Sseq   Smax   emax  Savg  eavg  Time    Description; origin
                          (×10^6)   (GB)     (GB)   (MB)         (MB)        (s)
cage13          445,315       7.5      30.7  21.4    1050  0.32   709  0.47   385.1  Directed weighted graph; Utrecht Univ.
pancake2_3    1,004,060      49.1      39.8  10.5     832  0.21   481  0.36   278.0  3D electromagnetism; Padova Univ.
as-Skitter    1,696,415      23.9      17.7   3.8     566  0.11   197  0.30   171.1  Internet topology graph; SNAP
HV15R         2,017,169     283.1     366.4  88.6   10638  0.13  4883  0.28     N/A  CFD, 3D engine fan; FLUOREM
MORANSYS1     2,734,008      81.3      63.5  17.9     998  0.28   715  0.39   390.3  Model Order Reduction; CADFEM
meca_raff6    3,269,763     130.2      63.5   6.6    1393  0.07   859  0.12   335.3  Thermo-mechanical coupling; EDF
Geoazur_192   7,077,888     189.1     251.9  65.9    1151  0.22   827  0.31  1308.2  3D Geophysics; Seiscope consortium

Table 4.1: Set of matrices used for the experiments; active memory (Smax and Savg) and run time for the factorization using MUMPS (with p = 64 except for Geoazur_192 for which p = 256). For the symmetric matrices as-Skitter and meca_raff6, an LDLt decomposition is computed; for the other (unsymmetric) matrices an LU decomposition is computed. Matrices pancake2_3 and Geoazur_192 use single precision, complex, arithmetic; the other matrices use double precision, real, arithmetic.

Mapping                    M0 (MB)  Smax (MB)  emax  Savg (MB)  eavg  Time (s)
Proportional                    —        4417  0.06       1852  0.14      1465
Memory-aware                 1288         940  0.27        672  0.38      1369
Aggregated memory-aware      1288         875  0.29        676  0.38      1381
Memory-aware                  644         478  0.54        364  0.71      2061
Aggregated memory-aware       644         399  0.65        390  0.66      1961
Memory-aware                  429         453  0.57        294  0.87      2695
Aggregated memory-aware       429         392  0.66        289  0.89      2301
Memory-aware                  322         292  0.88        258  0.99      3141
Aggregated memory-aware       322         279  0.92        258  0.99      2799
Memory-aware                  258         259  0.99        258  1.00      3765
Aggregated memory-aware       258         259  0.99        258  1.00      3370
Tree serialization              —         260  0.99        258  1.00      9070

Table 4.2: Experiments with the Geoazur_192 matrix. For the memory-aware mapping, the imposed memory bounds of 1288, 644, 429, 322, and 258 MB correspond, respectively, to a memory efficiency of 0.2, 0.4, 0.6, 0.8 and 1.0.


As expected, the tree serialization approach delivers a near-perfect memory scalability (emax = 0.99 and eavg = 1.00). However, the run time is over six times higher than the one obtained with the plain proportional mapping strategy. This is due to the prohibitive amount of communications generated by this mapping. For the Geoazur_192 problem, and using 256 MPI processes, the volume of communication with the tree serialization mapping is roughly nine times the volume generated by the proportional mapping strategy, and the number of messages is over 30 times larger than the number of messages with the default mapping. We also experimented with different numbers of processes and observed that the performance of the tree serialization strategy degrades significantly when the number of processes increases, which is due to the large number of messages and the small granularity of tasks. These observations show that the tree serialization mapping is in general prohibitive in practice, especially for large numbers of processes. For matrix Geoazur_192, we also report results with five values of M0 to illustrate how the memory-aware algorithm can be used. These values correspond to constraining the mapping such that emax ⩾ 0.2, emax ⩾ 0.4, emax ⩾ 0.6, emax ⩾ 0.8, and emax ≈ 1.00, respectively. Note that the value 0.2 is similar to the value of emax obtained using the default mapping in MUMPS (see Table 4.1), while the others are significantly higher. Thanks to the memory-aware mapping, the memory constraint is respected and performance is interesting since the run time remains comparable to our references (proportional mapping and default mapping in MUMPS 5.0.0), which need significantly more memory. On this matrix, we also observe that the larger the memory constraint, the lower the run time. This makes sense although it is not guaranteed by our memory-based mapping heuristics. It is also interesting to see that, with M0 = 258 MB, we indeed reach a near-perfect memory scalability, but with a much better performance than the one obtained with the tree serialization strategy. When groups are added to our memory-aware algorithm, we generally observe an improvement in the run time. This strategy exploits more tree parallelism but enforces the same memory constraints as the baseline approach. Aggregated memory-aware mapping is thus the most robust approach and will be used in the rest of this experimental section. In Table 4.3 we report results on the other matrices of our set; we compare a proportional mapping strategy and the aggregated memory-aware algorithm. We observe that using a memory-aware strategy that targets emax = 0.8, we are able to decrease the memory peak by factors between 2.5 (matrix cage13) and 23.8 (matrix meca_raff6). This comes at the price of a moderate increase in run time for most problems. The worst case is meca_raff6, for which the increase in run time is about 70%. For matrix cage13, the memory-aware mapping actually delivers slightly better performance with emax = 0.8 than with both emax = 0.4 and proportional mapping. Although counter-intuitive, it may happen that smaller task granularities yield better time performance, especially since the proportional mapping heuristic is designed to balance memory rather than optimize time. Matrix HV15R is particularly interesting because the factorization cannot complete when proportional mapping (or the default mapping in MUMPS 5.0.0) is used.
Indeed, we use 16 MPI processes per node and the average memory peak (estimated during the analysis phase) is 23.6 GB per MPI process with the default strategy, while each node of our system only has 64 GB of memory. Overall, the memory-aware mapping exhibits very interesting results compared to the default strategy in MUMPS, since we are able to significantly decrease the memory footprint without dramatically decreasing performance. For all matrices, we have decreased the maximum memory peak by an important factor that increases with our target for memory efficiency emax. The penalty in run time also depends on the target for memory efficiency and is typically between 40% and 60% on 64 processes with respect to the performance of the proportional mapping strategy.


Matrix       Mapping     Smax (MB)  emax  Savg (MB)  eavg  Time (s)
cage13       PM                912  0.37        737  0.45       406
             MA e = 0.4        639  0.52        516  0.66       381
             MA e = 0.8        360  0.93        338  0.99       372
pancake2_3   PM               1723  0.10        619  0.27       425
             MA e = 0.4        324  0.51        257  0.63       538
             MA e = 0.8        177  0.93        176  0.93       560
as-Skitter   PM                556  0.11        190  0.31       144
             MA e = 0.4        142  0.42         76  0.78       168
             MA e = 0.8         72  0.83         61  0.98       232
HV15R        PM              23624  0.07      10126  0.15       N/A
             MA e = 0.4       2778  0.50       1855  0.75      4718
             MA e = 0.8       1407  0.98       1390  0.99      4511
MORANSYS1    PM               1733  0.16        939  0.30       320
             MA e = 0.4        695  0.40        477  0.59       392
             MA e = 0.8        322  0.87        285  0.98       475
meca_raff6   PM               2951  0.04       1741  0.06       305
             MA e = 0.4        226  0.46        128  0.81       433
             MA e = 0.8        124  0.84        103  1.00       514

Table 4.3: Comparison of the memory-based proportional mapping with the aggregated memory-aware algorithm for two target efficiencies.

When comparing with the default mapping used in MUMPS 5.0.0 (Table 4.1), we observe similar results. Although the mapping strategy of MUMPS typically yields lower memory consumption than a proportional mapping, using a memory-aware algorithm can significantly reduce memory usage, at the price of a moderate penalty in factorization time.

Chapter 5

Low-rank approximation techniques for sparse, direct solvers

Sparse matrices that we commonly have to deal with only have a few nonzeros per row. For example, if the matrix results from the discretization of a Partial Differential Equation, the number and position of the nonzero coefficients are determined by the discretization mesh. We commonly refer to this property of the matrix as its structural sparsity. In Section 2.2 we have discussed how, once the elimination order is fixed, the position of nonzeros in the factors resulting from a sparse factorization is also fully determined; that is to say, the factors too have a well determined structural sparsity, although they are much denser than the factorized matrix because of fill-in. In some cases, however, some of the information carried by these factors can be considered irrelevant. This may be, for example, the case in applications where this information has lower magnitude than the measurement errors of the instruments that were used to generate the input data, or because only a very approximate factorization is needed to be used as a preconditioner for an iterative method. We refer to this property as data sparsity. In this chapter we will discuss how to take advantage of this property to reduce the cost (both in terms of floating point operations and memory) of sparse factorizations through the use of low-rank approximation techniques.

5.1 Low-rank approximations

5.1.1 Low-rank matrices

Low-rank approximation techniques rely on the definition of a low-rank matrix, which is introduced below.

Definition 5.1 Range and null space of a matrix.— Let A ∈ Rm×n. The range of A is

range(A) := {Ax ∈ Rm : x ∈ Rn}.

The null space of A is

null(A) := {x ∈ Rn : Ax = 0}.


Definition 5.2 Rank of a matrix.— Let A ∈ Rm×n. The rank of A is

rank(A) := dim(range(A)).

Note that rank(A) + dim(null(A)) = n. Any matrix A of rank r = rank(A) admits a factorization of the type

A = XY^T,    X ∈ R^{m×r},  Y ∈ R^{n×r}        (5.1)

that is to say, matrix A can be represented in the form of a rank-r product; note that this representation takes O((m + n)r) memory as opposed to O(mn) for the standard form. This observation leads to the definition of a low-rank matrix.

Definition 5.3 Low-rank matrix.— Let A ∈ Rm×n be a matrix of rank r. A is low-rank if

(m + n)r < mn. (5.2)

With a slight abuse of notation, we will refer to matrices that do not respect Equation (5.2) as full-rank matrices even though their rank may not actually be full (that is, equal to n).

5.1.2 Basic linear algebra operations on low-rank matrices

Note that Definition 5.3 simply states that a low-rank matrix is one that can be more conveniently stored in low-rank form (as in Equation (5.1)) rather than in standard (or full-rank) form; however, the low-rank representation not only allows for reducing the storage but also the complexity of operations. In the remainder of this section we present some basic linear algebra operations involving low-rank matrices and the associated cost; a short code sketch at the end of the section illustrates them.

Triangular solve. This operation computes A ← L^{-1}A where L is a full-rank lower-triangular matrix and A ∈ R^{m×n} is of rank rA. It can be computed as

A ← L^{-1}A = (L^{-1}X_A)Y_A^T = W Y_A^T,    with W = L^{-1}X_A ∈ R^{m×rA}.

This costs m^2 rA floating point operations as opposed to m^2 n for the case where A is in full-rank form. The cases where L^{-1} is multiplied on the right or where an upper-triangular matrix U is used instead of L can be treated in a similar way.

Matrix sum. This operation computes C = A + B where A, B and C are all in R^{m×n} and A = X_A Y_A^T and B = X_B Y_B^T are of rank rA and rB, respectively. This can be computed as

C = A + B = X_A Y_A^T + X_B Y_B^T = [X_A X_B][Y_A Y_B]^T.

This does not involve any floating point operation but leads to a rank-(rA + rB) representation for C, where rA + rB may be greater than the actual rank of C.

Matrix product. This operation computes C = C + AB with A ∈ R^{m×k}, B ∈ R^{k×n} and C ∈ R^{m×n}. When all three matrices are full-rank, the cost of this operation is 2mnk. We will examine the following three cases:


• A, B and C are all low-rank of ranks rA, rB and rC, respectively. This is computed as

  X_C Y_C^T ← X_C Y_C^T + X_A (Y_A^T X_B) Y_B^T.

  The inner product Y_A^T X_B costs 2k rA rB flops. The result of this product is a matrix W of size rA × rB and can then be multiplied either to the right or to the left:

  – right: W Y_B^T costs 2n rA rB flops and yields a matrix Z of size rA × n. The resulting rank-rA matrix X_A Z can then be summed with X_C Y_C^T at no cost. The final cost is thus 2 rA rB (k + n) and the result is expressed in rank-(rC + rA) form.

  – left: X_A W costs 2m rA rB flops and yields a matrix Z of size m × rB. The resulting rank-rB matrix Z Y_B^T can then be summed with X_C Y_C^T at no cost. The final cost is thus 2 rA rB (k + m) and the result is expressed in rank-(rC + rB) form.

  As a result, the right or the left case can be more convenient depending on the sign of (m − n).

• A and B are low-rank of ranks rA and rB, respectively, and C is full-rank. This is similar to the above case except that the final sum is not free. Before being added to C, the low-rank matrix resulting from the X_A (Y_A^T X_B) Y_B^T product has to be brought back into full-rank form by means of an outer product operation. This costs 2mnrA or 2mnrB in the right or left cases described above, respectively. Therefore, the choice between the right and left cases depends on the sign of

  (m − n) rA rB + mn (rB − rA).

• A is low-rank of rank rA, and B and C are full-rank. This is computed as

  C ← C + X_A (Y_A^T B).

  The product W = Y_A^T B costs 2 rA k n flops. It is then followed by the outer product C = C + X_A W, which costs 2 m rA n. The total cost is thus 2 rA n (m + k). The case where B is low-rank and A and C are full-rank can be treated the same way. Also, the case where C is low-rank is treated by performing a low-rank sum (at no cost) without the outer product.
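The operation counts above translate directly into code when a low-rank matrix is kept as a pair (X, Y) representing X Y^T. The NumPy sketch below is purely illustrative (random data, no rank truncation after the sum); it shows the triangular solve, the low-rank sum and the "left" variant of the product with a full-rank C.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
m, n, k, rA, rB = 200, 150, 100, 7, 5

# low-rank operands stored as pairs (X, Y) representing X @ Y.T
XA, YA = rng.standard_normal((m, rA)), rng.standard_normal((k, rA))   # A, m x k
XB, YB = rng.standard_normal((k, rB)), rng.standard_normal((n, rB))   # B, k x n
L = np.tril(rng.standard_normal((m, m))) + m * np.eye(m)              # full-rank triangular

# triangular solve A <- L^{-1} A: only the X factor is touched (~ m^2 rA flops)
W = solve_triangular(L, XA, lower=True)
assert np.allclose(W @ YA.T, np.linalg.solve(L, XA @ YA.T))

# low-rank sum: concatenate the bases, no flops, the rank grows to rA + rB
XA2, YA2 = rng.standard_normal((m, rB)), rng.standard_normal((k, rB))
XS, YS = np.hstack([XA, XA2]), np.hstack([YA, YA2])
assert np.allclose(XS @ YS.T, XA @ YA.T + XA2 @ YA2.T)

# product C <- C + A B with A, B low-rank and C full-rank ("left" variant)
C = rng.standard_normal((m, n))
Wab = YA.T @ XB                  # rA x rB inner product, 2 k rA rB flops
Z = XA @ Wab                     # m x rB,                2 m rA rB flops
C_new = C + Z @ YB.T             # outer product,         2 m n rB flops
assert np.allclose(C_new, C + (XA @ YA.T) @ (XB @ YB.T))
```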

5.1.3 Numerically low-rank matrices

A matrix A which is not low-rank can be approximated with a low-rank matrix Ã with controlled accuracy. To see how this is possible, we first have to introduce the singular value decomposition of a matrix.

Theorem 5.1 Singular value decomposition [86, Theorem 2.4.1].— If A is a real m × n matrix then there exist orthogonal matrices

U = [u1 ··· um] ∈ R^{m×m}  and  V = [v1 ··· vn] ∈ R^{n×n} such that

U^T A V = Σ = diag(σ1, ··· , σp) ∈ R^{m×n},   p = min(m, n),  where σ1 ≥ σ2 ≥ ··· ≥ σp ≥ 0.


The values σ1, ··· , σp are called the singular values of A and UΣV^T its singular value decomposition. The following theorem defines how to compute the best rank-r approximation of a matrix A.

Theorem 5.2 Best rank-r approximation [67, 95].— Let A be a matrix in R^{m×n} and A = UΣV^T its singular value decomposition. The minimization problem

min_{rank(Ã)≤r} ∥A − Ã∥_2

is solved by

Ã := U_r Σ_r V_r^T,   U_r = [u1 ··· ur],  V_r = [v1 ··· vr],  Σ_r = diag(σ1, ··· , σr).

The arising error is ∥A − Ã∥_2 = σ_{r+1}.

Proof. Let Ur+1 = [ur+1 ··· um], Vr+1 = [vr+1 ··· vn] and Σr+1 = diag(σr+1, ··· , σp). Then

∥A − Ã∥_2 = ∥A − U_r Σ_r V_r^T∥_2 = ∥U_{r+1} Σ_{r+1} V_{r+1}^T∥_2 = σ_{r+1}.

In order to prove that Ã is the best rank-r approximation of A we need to prove that for any other rank-r approximation B = XY^T of A, the error is at least σ_{r+1}. Because Y is of rank r, dim(null(Y^T)) = n − r; therefore

span(v1, ··· , v_{r+1}) ∩ null(Y^T) ≠ {0}.

That is to say, there exists a linear combination w = γ1 v1 + ··· + γ_{r+1} v_{r+1} such that Y^T w = 0. Without loss of generality, we can assume that ∥w∥_2 = 1, which implies γ1^2 + ··· + γ_{r+1}^2 = 1. Therefore

∥A − B∥_2^2 ≥ ∥(A − B)w∥_2^2 = ∥Aw∥_2^2 = γ1^2 σ1^2 + ··· + γ_{r+1}^2 σ_{r+1}^2 ≥ σ_{r+1}^2,

which concludes the proof.

Theorem 5.2 states that the best rank-r approximation of A is a matrix Ã obtained by computing the singular value decomposition of A and by dropping the last p − r singular values. Note that U_r Σ_r V_r^T is a rank-r representation of Ã; alternatively one can use U_r W^T or Z V_r^T with W = V_r Σ_r and Z = U_r Σ_r, respectively. In practice, however, it is more desirable to control the accuracy of the approximation rather than its rank: given a threshold ε we want to compute the representation of smallest rank with an accuracy that is better than or equal to ε.

Definition 5.4— Let A be a matrix in Rm×n and ε a prescribed threshold. The ε-rank of A (also called numerical rank with threshold ε) is

rε = min{rank(B): ∥A − B∥2 ≤ ε}.

Theorem 5.2 implies that rε is equal to the number of singular values of A that are greater than ε. The B matrix for which the minimum in Definition 5.4 is achieved is thus

B = U_{rε} Σ_{rε} V_{rε}^T,   U_{rε} = [u1 ··· u_{rε}],  V_{rε} = [v1 ··· v_{rε}],  Σ_{rε} = diag(σ1, ··· , σ_{rε})

with

σ1 ≥ ··· ≥ σ_{rε} > ε ≥ σ_{rε+1} ≥ ··· ≥ σp.

Figure 5.1: Clusters, block-cluster and matrix block.

Clearly, if rε fulfills the condition in Equation (5.2), B is a low-rank approximation of A with accuracy ε. For the sake of readability, we will drop the ε in rε. The operation of computing a low-rank approximation of a matrix A is referred to as compression in the remainder of this document. As we have seen above, this can be achieved by computing an SVD of A, but this implies a cost of O(mn^2) flops, assuming m ≥ n. Other, cheaper but less accurate, methods exist. One commonly used technique uses a QR factorization with column pivoting AP = QR, such as the one described in Section 2.1.4.3. This factorization is stopped at step k if the diagonal coefficient r_{k,k} of R is smaller than the prescribed threshold ε. The trailing submatrix is discarded and the partially computed factors Q_{:,1:k} and (RP^T)_{1:k,:} are used as a rank-k approximation of A. Because of the early stoppage, this operation only costs O(mnk) flops. Other well known and commonly used methods include randomized sampling [97, 114] or Adaptive Cross Approximation [32].
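A minimal compression kernel based on the truncated SVD can be written in a few lines of NumPy; a pivoted QR or a randomized method would replace the SVD call while keeping the same interface. The helper name and the test kernel below are illustrative only.

```python
import numpy as np

def compress(A, eps):
    """Return (X, Y) with ||A - X @ Y.T||_2 <= eps and the epsilon-rank of A,
    or None if the block is not low-rank at this accuracy ((m + n) r >= m n)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > eps))                 # number of singular values above eps
    m, n = A.shape
    if (m + n) * r >= m * n:
        return None
    return U[:, :r] * s[:r], Vt[:r, :].T     # X = U_r Sigma_r,  Y = V_r

# a numerically low-rank block: smooth interaction between two distant clusters
x = np.linspace(0.0, 1.0, 300)
A = 1.0 / (4.0 + np.abs(x[:, None] - x[None, :] - 3.0))
X, Y = compress(A, eps=1e-10)
print(X.shape[1], np.linalg.norm(A - X @ Y.T, 2))   # small rank, error below eps
```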

5.2 Low-Rank formats

In most practical cases, the matrix A that we have to deal with is not only of full rank, but its singular values decay so slowly that it is not possible to compute a compact and accurate low-rank approximation. In some cases, however, it has been shown that conveniently defined blocks of A or of its Schur complement can be effectively approximated by a low-rank product [31]. Assume A has row index set I and column index set J. Assume σ ⊂ I and τ ⊂ J are subsets of I and J; we refer to them as clusters and to s = σ × τ as a block-cluster or, simply, block. A matrix block of A is thus defined as the restriction of A on the index subsets σ and τ: A|s. This notation is illustrated in Figure 5.1. When it is clear from the context, we will use the word block to refer to a block-cluster or a matrix block. Different low-rank matrix formats can be used to take advantage of this property; they can be classified based on the following three properties:

• block-admissibility condition: a criterion that tells whether a matrix block can be effectively approximated with a low-rank product;


• block-partitioning: a method for defining matrix blocks;

• low-rank bases: how matrix blocks should be compressed.

In the following sections we will provide some details of these properties; we refer the reader to the book by Hackbusch [95] for a thorough analysis of the subject.

5.2.1 Block-admissibility condition

A block-admissibility condition states whether a block can be approximated with a low-rank product or not. In the following, we will assume that each index of I is related to a point in the domain where the problem is defined. We will denote diam(σ) the diameter of the subdomain formed by the points associated with σ and dist(σ, τ) the distance between the two subdomains associated with σ and τ. One commonly employed condition is the so-called strong block-admissibility condition:

s = σ × τ is admissible iff max(diam(σ), diam(τ)) ≤ η dist(σ, τ).        (Adm_b^s)

This formalizes a very simple intuition. A matrix block defines the interaction between the two subdomains related to σ and τ; if these are relatively small and far away, their interaction is weak and the corresponding matrix block has low numerical rank. More or less admissible blocks can be obtained depending on the value of the scalar η. The choice

η = η_max = max_{dist(σ,τ)>0} max(diam(σ), diam(τ)) / dist(σ, τ)

amounts to saying that all non-contiguous clusters define an admissible block. This is referred to as the least-restrictive strong block-admissibility condition:

s = σ × τ is admissible iff dist(σ, τ) > 0.        (Adm_b^lrs)

Finally, we can be even more tolerant and assume that all disjoint clusters define an admissible block:

s = σ × τ is admissible iff σ ∩ τ = ∅.        (Adm_b^w)

This is called the weak admissibility condition.
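When the geometry (or a proxy for it) is available, these conditions are straightforward to evaluate. The sketch below is an illustration on small point clouds, with brute-force diameters and distances; all names are hypothetical.

```python
import numpy as np

def diam(points):
    """Diameter of a point cloud (brute force; fine for small clusters)."""
    d = points[:, None, :] - points[None, :, :]
    return np.sqrt((d ** 2).sum(-1)).max()

def dist(p, q):
    """Distance between two point clouds."""
    d = p[:, None, :] - q[None, :, :]
    return np.sqrt((d ** 2).sum(-1)).min()

def admissible(points, sigma, tau, eta=None, weak=False):
    """Check the block s = sigma x tau.

    weak=True   -> sigma and tau disjoint                         (Adm_b^w)
    eta is None -> dist(sigma, tau) > 0                           (Adm_b^lrs)
    otherwise   -> max(diam(sigma), diam(tau)) <= eta * dist      (Adm_b^s)
    """
    if weak:
        return not set(sigma) & set(tau)
    ps, pt = points[list(sigma)], points[list(tau)]
    d = dist(ps, pt)
    if eta is None:
        return d > 0
    return max(diam(ps), diam(pt)) <= eta * d

# two index clusters on a 1D mesh: well separated vs. overlapping
pts = np.linspace(0.0, 1.0, 100)[:, None]
print(admissible(pts, range(0, 20), range(70, 90), eta=1.0))   # True
print(admissible(pts, range(0, 20), range(15, 40), eta=1.0))   # False (they overlap)
```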

5.2.2 Block-partitioning

Without loss of generality, we assume that I = J. Here we are interested in finding block-partitions of I × I to identify matrix blocks that can be compressed. First, we introduce the definition of a partition of an index set.

Definition 5.5 Partition.— A partition S(I) of an index set I is

1) S(I) = {σ1, ··· , σp}, σi ⊂ I

2) σi ∩ σj = ∅ for i ≠ j;

3) ∪_{i=1}^{p} σi = I.

This leads to the definition of a block-partition.


Definition 5.6 Block-partition.— A block-partition S(I × I) is

1) S(I × I) = {s1, ··· , sp}, sk = σi × τj, σi ⊂ I, τj ⊂ I

2) si ∩ sj = ∅ for i ≠ j;

3) ∪_{k=1}^{p} sk = I × I.

The most commonly used methods to compute a suitable block-clustering rely on the use of cluster trees and block-cluster trees.

Definition 5.7 Cluster tree.— Given an index set I, a cluster tree T (I) is a tree structure whose nodes are clusters of I that respect the following conditions:

1) I ∈ T (I) is the root of the tree.

2) σ1 ∩ σ2 = ∅ for σ1, σ2 ∈ S_{T(I)}(τ) and σ1 ≠ σ2;

3) ∪_{σ ∈ S_{T(I)}(τ)} σ = τ for all τ ∈ T(I) \ L_{T(I)},

where S_{T(I)}(τ) is the set of child nodes of τ and L_{T(I)} is the set of leaves of the tree.

Note that, by definition, the sons of node τ define a partition of τ and thus, by recursion, the leaves of the cluster tree define a partition of I. The definition of block-cluster tree can now be derived.

Definition 5.8 Block-cluster tree.— Let T(I) be a cluster tree for the index set I. A block-cluster tree T(I × I) is a tree structure whose nodes are block-clusters of I × I that respect the following conditions:

1) I × I ∈ T(I × I) is the root of the tree.

2) for every non-leaf node s = σ × τ: S_{T(I×I)}(s) := {σ′ × τ′ : σ′ ∈ S_{T(I)}(σ), τ′ ∈ S_{T(I)}(τ)}.

Note that S_{T(I)}(σ) = ∅ or S_{T(I)}(τ) = ∅ implies that s is a leaf.

Because of the recursive nature of Definition 5.8, we will refer to matrix block-partitions obtained through the use of a block-cluster tree as hierarchical. A much simpler block-partitioning technique, which we refer to as flat, consists in defining a partition S(I) of the index space I into subsets of (roughly) equal size b; the block-partition is then obtained as the cross-product of S(I) by itself, i.e., S(I × I) = S(I) × S(I). Note that a flat block-partition can be thought of as resulting from a two-level block-cluster tree made of a root node I × I with ⌈#I/b⌉^2 sons, but this is unnecessarily complex; its simplicity, instead, is one of the properties that make a flat block-partitioning preferable to a hierarchical one in some situations.
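The two partitioning styles can be illustrated with a few lines of Python: a flat partition simply slices the index set into chunks of size b, whereas a hierarchical partition is obtained here by recursive bisection of the index set down to a minimum cluster size. This is only a sketch of the idea, not the way an actual hierarchical solver organizes its cluster trees.

```python
def flat_partition(n, b):
    """Partition {0, ..., n-1} into consecutive chunks of (roughly) size b."""
    return [list(range(i, min(i + b, n))) for i in range(0, n, b)]

def cluster_tree(indices, cmin):
    """Cluster tree by recursive bisection, returned as (cluster, [children])."""
    if len(indices) <= cmin:
        return (indices, [])
    h = len(indices) // 2
    return (indices, [cluster_tree(indices[:h], cmin),
                      cluster_tree(indices[h:], cmin)])

def leaves(tree):
    node, kids = tree
    return [node] if not kids else [l for k in kids for l in leaves(k)]

n = 16
print(flat_partition(n, 4))                     # four clusters of size 4
print(leaves(cluster_tree(list(range(n)), 4)))  # same leaf clusters, organized in a tree
```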

5.2.3 Nested bases

Nested bases are commonly used in combination with hierarchical block-partitionings; let T(I × I) be the block-cluster tree resulting from a cluster tree T(I). Nested bases rely on the following two properties:


Format   Admissibility  Partitioning  Nested bases  Authors
BS       Weak           Flat          no            Cheng et al. [49]
BLR      Strong/Weak    Flat          no            Amestoy et al. [J4]
H        Strong         Hierarchical  no            Hackbusch [94]
HODLR    Weak           Hierarchical  no            Aminfar et al. [19]
H2       Strong         Hierarchical  yes           Börm et al. [37]
HSS      Weak           Hierarchical  yes           Chandrasekaran et al. [47]
HBS      Weak           Hierarchical  yes           Gillman et al. [82]

Table 5.1: A taxonomy of the most well known low-rank formats.

• The same row-basis Xσ can be used for all blocks A|_{σ×τ} with the same σ, i.e., all the A|_{σ×τi} with the same σ will have a rank-r approximation of the form Xσ Y_{τi}^T, i = 1, 2, .... The same holds for the column-basis Yτ.

• Let σ be a node of the cluster tree T(I) and σ′ be one of its sons; assume, also, that 𝒳σ = range(Xσ) and 𝒳σ′ = range(Xσ′). Then 𝒳σ|σ′ = 𝒳σ′. This means that the row-basis of a large block can be defined as a combination of the row-bases of the smaller blocks associated with the clusters in S_{T(I)}(σ).

These two properties can be used to achieve further reductions in the complexity of operations and in the storage.

5.2.4 A taxonomy of low-rank formats

Table 5.1 contains a taxonomy of the most commonly used low-rank formats along with the admissibility condition, partitioning technique and nested bases use they rely on; references to the authors of these formats are also provided. We discuss the details of two of them, namely the BLR and H formats, below.

5.2.4.1 The Block Low-Rank format

The Block Low-Rank format, introduced by Amestoy et al. [J4], uses a flat block-partitioning scheme and, thus, it does not use nested bases. BLR can use either a weak or a strong admissibility condition. The construction of a BLR matrix is achieved by simply partitioning the index set I into parts of a prescribed size b, which induces the matrix block-partition illustrated in Figure 5.2.

Definition 5.9 BLR matrix.— Assuming the unknowns have been partitioned into p clusters, and that a permutation P has been defined so that permuted variables of a given cluster are contiguous, a BLR representation Ã of a dense matrix A is of the form:

Ã = [ A_{1,1}  Ã_{1,2}  ···  Ã_{1,p}
      Ã_{2,1}  A_{2,2}  ···  Ã_{2,p}
        ⋮        ⋮      ⋱      ⋮
      Ã_{p,1}  Ã_{p,2}  ···  A_{p,p} ]


Subblocks A_{i,j} = (P^T A P)_{i,j}, of size m_{i,j} × n_{i,j} and numerical rank k^ε_{i,j}, are approximated by a low-rank product Ã_{i,j} = X_{i,j} Y_{i,j}^T at accuracy ε, where X_{i,j} is an m_{i,j} × k^ε_{i,j} matrix and Y_{i,j} is an n_{i,j} × k^ε_{i,j} matrix.

Figure 5.2: An example of clustering and its associated flat block-clustering and BLR representation, where we have assumed a weak admissibility condition.

Although either weak or strong admissibility conditions can be used, in order to achieve favorable theoretical properties the block-partition has to respect some constraints; specifically, we will consider a block-partition BLR-admissible if the maximum number of non-admissible blocks in every block-row or block-column is bounded by a constant (more on this will be discussed in Section 5.4).
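Putting a flat partition together with a block-wise compression kernel gives a toy BLR constructor: under a weak admissibility condition the diagonal blocks are kept full-rank and the off-diagonal ones are compressed whenever it pays off. This is only a sketch (no factorization, no pivoting), and the compress argument can be, for instance, the truncated-SVD helper sketched at the end of Section 5.1.3.

```python
import numpy as np

def blr_compress(A, b, eps, compress):
    """Toy BLR constructor with a weak admissibility condition.

    A        : dense square matrix
    b        : cluster (block) size
    compress : function returning (X, Y) or None if the block is not low-rank
    Returns a dict (i, j) -> dense block or (X, Y) pair, and the storage ratio.
    """
    n = A.shape[0]
    bounds = list(range(0, n, b)) + [n]
    blocks, entries = {}, 0
    for i in range(len(bounds) - 1):
        for j in range(len(bounds) - 1):
            blk = A[bounds[i]:bounds[i + 1], bounds[j]:bounds[j + 1]]
            xy = None if i == j else compress(blk, eps)   # diagonal kept full-rank
            if xy is None:
                blocks[i, j] = blk.copy()
                entries += blk.size
            else:
                blocks[i, j] = xy
                entries += xy[0].size + xy[1].size
    return blocks, entries / A.size

def svd_compress(blk, eps):                  # plain truncated-SVD compressor
    U, s, Vt = np.linalg.svd(blk, full_matrices=False)
    r = int(np.sum(s > eps))
    if (blk.shape[0] + blk.shape[1]) * r >= blk.size:
        return None
    return U[:, :r] * s[:r], Vt[:r, :].T

x = np.linspace(0.0, 1.0, 256)
A = np.exp(-np.abs(x[:, None] - x[None, :]))        # smooth kernel, data-sparse blocks
blocks, ratio = blr_compress(A, b=32, eps=1e-8, compress=svd_compress)
print(f"BLR storage: {100 * ratio:.0f}% of the dense storage")
```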

5.2.4.2 The H-matrix format

The H-matrix, introduced by Hackbusch [94], is probably the most well known and widely used low-rank approximation format. It relies on a hierarchical blocking with a strong admissibility condition and does not use nested bases. The block-partition is built through a block-cluster tree starting from the root I × I and applying the recursive process of Definition 5.8: if a block-cluster is admissible then it is kept in the block-partition; otherwise it is replaced by its sons, computed as in step 2) of Definition 5.8. The recursion is stopped when a minimum prescribed block size is reached. The resulting block-partition is called H-admissible.

Definition 5.10 H-admissible block partition.— A block-partition is H-admissible if all of its blocks are either admissible, according to a strong admissibility condition, or of size smaller than cmin:

S(I × I) is admissible ⇔ ∀σ × τ ∈ S(I × I), σ × τ is admissible

or min(#σ, #τ) ≤ cmin

Note that, by replacing the strong admissibility condition with the weak one, the HODLR format is obtained rather than the H-matrix one.


Figure 5.3: An example of block-cluster tree on top (σi × σj has been abbreviated as i × j) and its associated H-admissible block-partition at the bottom; the white blocks are admissible whereas the gray ones are non-admissible but smaller than cmin.

5.3 BLR multifrontal

It has been shown that, for matrices arising from the discretization of elliptic PDEs, conveniently defined off-diagonal blocks of Schur complements can be accurately approximated with low-rank products [46]. Because fronts are closely related to Schur complements and because they are dense matrices, any of the low-rank formats discussed above can be used to compress them in order to reduce the storage and the complexity of the multifrontal factorization and solve operations. The use of low-rank approximations in direct solvers for sparse linear systems has received considerable attention in recent years. For example, Wang et al. [158], Xia [163], Xia et al. [164] and Ghysels et al. [78] have investigated the use of the HSS format in the multifrontal method, Gillman et al. [82] have studied the use of the HBS format, and Aminfar et al. [19] and Coulier et al. [52] have focused on the use of HODLR and H2, respectively. Other recent efforts by Pichon et al. [134] have investigated the use of the HODLR format in a right-looking supernodal solver. We are interested in the use of the BLR format within a general purpose, algebraic, parallel multifrontal solver. The main advantage of this format over hierarchical ones is its simplicity, which makes it a better candidate for being used in a context where the geometry of the problem is not known (and, therefore, a weak admissibility is easier to check) and where special care must be taken of the efficiency of operations and of the possibility to develop and implement scalable parallel algorithms.


Figure 5.4: Some separators of a 2D regular grid (S1 being the topmost) and the associated fronts.

In order to integrate the BLR format within a multifrontal solver, three main issues have to be tackled. First, it is necessary to develop a method for computing the block-partition of all the frontal matrices; second, we must conceive methods for the factorization of frontal matrices that can take advantage of the low-rank property; and, finally, we have to take care of how to perform assembly operations. In the following sections we will provide details of how we dealt with these issues. In Section 5.4 we will present an analysis of the complexity of the BLR-based multifrontal solver and then, in Section 5.5, we will address aspects related to the efficiency and scalability of the code in a shared-memory, parallel setting.

5.3.1 Block-partitioning of fronts

Remember, from Section 2.2, that the unknowns associated with a frontal matrix can be split into two parts. The first contains the so-called fully-summed variables, which are eliminated in that front; the second contains the non fully-summed variables, which are only updated and which appear as fully-summed variables in an ancestor node of the elimination tree and as non fully-summed on all the nodes lying along the path that connects the node to that ancestor. If, for example, the input sparse matrix has been permuted using a nested dissection method (see Section 2.2.3.1), the fully-summed variables of a front correspond to a separator whereas the non fully-summed correspond to pieces of other separators associated with ancestor nodes; this is illustrated in Figure 5.4. For the sake of simplicity, in the rest of this chapter we will always assume that the matrix was permuted according to a nested dissection ordering and, thus, that the fully-summed variables of a front are associated with a separator. Although all the methods and results described below also apply to the case of other matrix permutations, we observed experimentally that using nested dissection resulted in the best compromise between reduction of structural fill-in and improvements of the low-rank approximations. Defining a blocking of a frontal matrix therefore implies computing a clustering of two different index sets, which we will explain in the following two sections.


5.3.1.1 Clustering of the fully-summed variables

Defining a blocking of the fully-summed part of a frontal matrix (i.e., the left-topmost submatrix) amounts to clustering the variables of the associated separator. It must be noted, though, that in a completely algebraic setting, which we assume, the geometry of the problem is not available and, thus, the admissibility condition cannot be checked. Grasedyck et al. [87] propose to compute the clustering of variables using the adjacency graph G(A) of the matrix rather than the geometry of the problem; in this approach, the admissibility condition can thus be checked using the distance distG and diameter diamG as defined in standard graph theory. Therefore, a blocking of the fully-summed part of a frontal matrix is achieved by extracting the subgraph associated with the corresponding separator and clustering it. This is guided by several criteria. First of all, as explained above, we will use a weak admissibility condition because this is easier to check when the geometry is not available. Moreover, as mentioned above and explained in Section 5.4, it is important to minimize the number of neighbors of each cluster, which can be achieved by defining clusters with a low aspect ratio. Finally, the size of clusters has to be carefully chosen. In Section 5.4 we will explain why this size has to grow with the size of the separator; moreover, in practice, we want this size to be neither excessively small, because it defines the granularity of computations (and thus the efficiency of Level-3 BLAS operations), nor too big, in order to achieve a satisfactory amount of concurrency in a multithreaded parallelization (see Section 5.5). Once, for a given separator S, the desirable cluster size is defined, the clustering can be achieved by applying to the related subgraph GS a K-way partitioning method such as those implemented in the METIS or SCOTCH tools. It must be noted, however, that if nested dissection was used to permute the matrix in an algebraic setting (e.g., using, again, METIS or SCOTCH), the separators may be disconnected. Take, as an extreme example, the square domain depicted in Figure 5.5(a), for which a separator is defined by the nodes lying on the diagonal; this separator is totally disconnected and, thus, running a graph partitioning tool on its subgraph GS trivially yields clusters of size one, which is clearly undesirable. Although in reality such an extreme situation would rarely occur, it is quite common that a separator is formed by multiple connected components that are close to each other in the global graph G. This may lead to a sub-optimal clustering because variables that strongly interact due to their adjacency will end up in different clusters. In order to overcome this issue, we have developed a method which allows for reconnecting the disconnected components of the separator subgraph [J4, 159] in a way that takes into account the geometry and shape of the separator: variables close to each other in the original graph G have to be close to each other in the reconnected GS in order to satisfy the admissibility condition. We describe an approach that achieves this objective by extending the subgraph induced by each separator with a relatively small number of level sets (a vertex v belongs to the level set Li of S if and only if there exists s ∈ S such that the distance between v and s is exactly i). The union of these level sets for i = 1 to p (together with the original nodes of the separator) is called a halo and p its depth.
The graph of the halo will be referred to as GH . Note that GS is included in GH . The graph partitioning tool is then run on GH and the resulting partitioning projected back on the original subgraph GS. Figure 5.5 shows how this is done on the example above using just one level set. For a limited number of level sets, the extended graph preserves the shape of the separator, keeps the cost of computing the clustering limited and allows us to compute clusterings that better comply with the strategy presented in the previous section. In practice, we observed [125, 159] that one or two layers are enough to reconnect the separator and to obtain good performance (in 2D, one layer is sufficient). However,

on very complex domains, the optimal value may have to be found experimentally.
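The halo extension itself is a plain breadth-first search in the global adjacency graph. The sketch below uses illustrative data structures and a trivial partitioner standing in for METIS or SCOTCH; it builds the level sets up to the requested depth, partitions the extended subgraph and projects the resulting parts back onto the separator.

```python
from collections import deque

def halo(adj, separator, depth):
    """Vertices at graph distance <= depth from the separator, by BFS in the
    global adjacency graph adj: {vertex: iterable of neighbors}."""
    dist = {v: 0 for v in separator}
    queue = deque(separator)
    while queue:
        v = queue.popleft()
        if dist[v] == depth:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

def halo_clustering(adj, separator, depth, nparts, partition):
    """Cluster the separator: partition the halo subgraph G_H and keep, for each
    separator vertex, the part it falls into. `partition` stands in for a real
    K-way graph partitioner such as METIS or SCOTCH."""
    H = halo(adj, separator, depth)
    parts = partition(sorted(H), adj, nparts)     # {vertex: part id}
    return {v: parts[v] for v in separator}

def naive_partition(vertices, adj, nparts):       # placeholder partitioner
    size = -(-len(vertices) // nparts)            # ceiling division
    return {v: i // size for i, v in enumerate(vertices)}

# tiny example: path graph 0-1-2-3-4 with separator {2} and a depth-1 halo
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(halo_clustering(adj, [2], depth=1, nparts=2, partition=naive_partition))
```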

(a) separator subgraph GS; (b) halo subgraph GH; (c) 3-way partitioning of GH; (d) 3-way partitioning of GS.

Figure 5.5: Halo-based partitioning of GS with depth 1. GH is partitioned using METIS.

Figures 5.6(a) and 5.6(b) show the result of the graph partitioning tool on the topmost separator of a cubic 3D domain discretized with a 7-point stencil without and with the halo, respectively; it is easy to see why the clustering on the right figure better complies with the weak admissibility condition. Figure 5.7, instead, shows a 3D view of the topmost separator with the computed clustering. In this figure we can see that the separator itself is skewed and not just a flat surface as one would obtain with a geometric nested dissection; nonetheless, the halo method can achieve a clustering which is suited to the BLR format.

(a) Without halo. (b) With halo.

Figure 5.6: The clustering of the top level separator computed with SCOTCH on a 3D 7-point stencil Laplacian problem, with and without the halo method.


Figure 5.7: BLR clustering of the root separator of a 128^3 3D cubic problem.

5.3.1.2 Clustering of the non fully-summed variables

One straightforward approach to clustering the non fully-summed variables amounts to extracting the associated subgraph, extending it with a halo and partitioning it exactly as it is done for the fully-summed ones; we refer to this approach as the explicit clustering. Although this approach yields good clusterings, it can incur an excessive cost because each variable is clustered multiple times, i.e., once for each contribution block it appears in. This issue can be overcome by observing that, as explained above, the subgraph associated with the non fully-summed variables of a front is made of pieces of separators that are higher in the tree. Therefore, clustering all the separators in the tree also induces a clustering of the contribution blocks; we refer to this approach as inherited clustering since the clustering of a contribution block is defined by the clustering of separators in ancestor nodes. This is illustrated in Figure 5.8. Depending on where the separators intersect, small clusters, on which the compression cannot provide much memory and operation reduction, may be formed; as a result, the CB may include blocks which are too small to be effectively compressed and to achieve a good BLAS efficiency for the related operations. Note that this problem also affects the blocking of the bottom-left submatrix (i.e., the interaction between separator and CB variables), which we refer to as the L2,1 block, but to a lesser extent because the effect is damped by the good clustering computed for the separator (see Figure 5.9). To recover BLAS efficiency, a reclustering step can be performed by merging neighbor clusters together in order to increase the block size, as Figure 5.9(c) shows. Recovering the low-rank compression is less straightforward. Indeed, it is not guaranteed that two neighbor blocks in the front correspond to two neighbor clusters in the graph. Constraining the clustering strategy to obtain this property considerably increases its complexity (by the addition of notions such as global cluster ordering, or recursive ordering as in H-matrices for instance) with a small payoff, since the proportion of small clusters in a front is usually very small. It has to be noted that the inherited clustering also provides another convenient property: the blocking of a frontal matrix is compatible with the blocking of its parent front. This translates into the fact that one block of its Schur complement will be assembled


(a) A nested dissection of the domain where the variables of each separator have been clustered with the halo method, i.e., each segment is a group of variables. (b) Zoom on the current separator and border. The separator's clustering has been computed with the halo method. The border's clustering is inherited from several separator clusterings. At some corners, a cluster may not be entirely involved in the current front, so that tiny clusters may appear.

Figure 5.8: Inherited clustering where the separators are partitioned with the halo method and the borders inherit their clustering.


(a) L11 blocking: optimal blocks, because any block interconnects two optimal clusters from the corresponding original separator clustering. (b) L21 blocking: close-to-optimal blocks, because any block interconnects at least one optimal cluster from the corresponding original separator clustering. (c) CB blocking: not always optimal, because a block possibly interconnects two non-optimal small clusters from the inherited clustering of the border. Braces show a reclustering possibility.

Figure 5.9: Relation between inherited clustering and front blocking. The small clusters correspond to clusters located in corners in Figure 5.8(b). The large clusters correspond to optimal clusters which are entirely kept in the current front. Note that not all the large clusters of Figure 5.8(b) are represented here.

into exactly one block of the parent front because, by construction, a block of a Schur complement is always included in a single block of the fully-summed part of another frontal matrix. This considerably eases the assembly of frontal matrices in the case of fully-structured solvers (more details in the next section).
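As an illustration of the inherited clustering just described, the following Python sketch derives the blocking of a contribution block from the clusterings already computed for the ancestor separators. It is a minimal sketch with hypothetical data structures (a plain dictionary mapping each variable to its separator cluster), not the actual MUMPS implementation.

    # Minimal sketch (not the MUMPS implementation): derive the inherited
    # clustering of the non fully-summed (CB) variables of a front from the
    # clusterings already computed for separators higher in the tree.
    from collections import defaultdict

    def inherited_clustering(cb_variables, separator_clusters):
        """cb_variables: global indices of the variables in the contribution block.
        separator_clusters: dict mapping a global variable index to a
        (separator_id, cluster_id) pair, filled when each separator was
        partitioned with the halo method.
        Returns the CB blocking: (separator_id, cluster_id) -> list of variables."""
        blocks = defaultdict(list)
        for v in cb_variables:
            blocks[separator_clusters[v]].append(v)
        return dict(blocks)

    # Toy example: variables 10..19 belong to one ancestor separator split into
    # two clusters of five variables each; only part of each cluster appears in
    # this front's CB, which is how the small ("tiny") clusters arise.
    separator_clusters = {v: (1, v // 5) for v in range(10, 20)}  # hypothetical
    cb = [10, 11, 12, 17, 18, 19]
    print(inherited_clustering(cb, separator_clusters))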

125 5. Low-rank approximation techniques for sparse, direct solvers

5.3.2 Assembly operations

In our approach frontal matrices are always assembled in full-rank (i.e., uncompressed) form and gradually converted to BLR form in the course of the frontal matrix factorization; this is achieved by interleaving factorization operations with compression ones according to the several variants described in Section 5.3.3. Note that full-rank assemblies imply that if a contribution block has been compressed (which does not have to be the case, as explained in Section 5.3.3) then it has to be decompressed, possibly block by block, prior to being assembled into the parent front. This approach allows for easier and more efficient assembly operations and an easier handling of pivoting, as explained in the next section, but may incur excessive storage requirements since the peak memory consumption cannot be smaller than the size of the largest full-rank front in the assembly tree. As an alternative, assembly operations can be performed using low-rank blocks, which allows for always storing frontal matrices in compressed form and ultimately results in a lower memory consumption. This yields what is commonly referred to as a fully-structured solver (see, for example, the work by Xia [163]). As explained in the previous section, by using an inherited clustering we can ensure that a block of a contribution block is assembled into only one block of the parent front; nonetheless, this block can be smaller than the corresponding block of the parent (see Figure 5.8(b)) and, therefore, it has to be padded with explicit zeros in order to make its size conform to that of the parent block and make the low-rank sum possible. As a result, the low-rank extend-add operation involves extra computations which can considerably limit the gains provided by the use of low-rank approximations, as shown by Pichon et al. [134]. This issue can be overcome using randomization techniques [123] which make low-rank assembly operations easier, as shown by Martinsson [124], Xia [163] or Ghysels et al. [78]. A further option consists in assembling a frontal matrix in full-rank form one block at a time; once a block is assembled, it is immediately compressed, which means that the whole frontal matrix is compressed prior to its factorization. This allows for achieving the same memory consumption as a fully-structured solver while taking advantage of the simpler full-rank assembly operations; the handling of pivoting in the subsequent factorization may be harder though.
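The full-rank assembly itself is a scatter-add (extend-add) of a child's contribution block into the parent front, driven by the global variable indices of the two fronts. The following sketch illustrates the operation, assuming that fronts are stored as dense NumPy arrays together with their global index lists; names and data structures are illustrative, not those of an actual solver.

    # Minimal sketch of a full-rank extend-add: the contribution block of a
    # child front is scattered-added into the parent front using the global
    # variable indices attached to each front.
    import numpy as np

    def extend_add(parent_front, parent_index, cb, cb_index):
        """parent_front: dense (np x np) array being assembled.
        parent_index: global variable index of each row/column of the parent.
        cb: dense (nc x nc) contribution block of a child front.
        cb_index: global variable index of each row/column of cb."""
        pos = {g: i for i, g in enumerate(parent_index)}   # global -> local position
        loc = np.array([pos[g] for g in cb_index])
        parent_front[np.ix_(loc, loc)] += cb                # scatter-add
        return parent_front

    parent = np.zeros((5, 5))
    parent_idx = [3, 7, 8, 12, 15]    # global variables of the parent front
    cb = np.ones((3, 3))              # child's contribution block
    cb_idx = [7, 12, 15]              # its global variables (a subset of the parent's)
    print(extend_add(parent, parent_idx, cb, cb_idx))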

5.3.3 BLR fronts factorization

As explained in the previous section, we assume that a frontal matrix F has been assembled in full-rank form. We will also assume that it has been partitioned (either logically or actually) into blocks according to the clustering of the associated index set; Fi,j denotes the block along block-row i and block-column j whereas F̃i,j denotes its low-rank form. We also assume that a frontal matrix is of size p = pfs + pnfs blocks, where pfs and pnfs are the number of blocks in the fully-summed and non fully-summed parts, respectively. Algorithm 5.1 describes the LDLT full-rank factorization of a frontal matrix. The Factor+Solve step is essentially equivalent to the operation in Equation (2.16a) except that it takes advantage of the symmetry and uses a Bunch-Kaufman type of pivoting in order to preserve it; this operation is based on Level-2 BLAS operations and is thus not very efficient. Note that, if pivoting is not done or is restricted to the diagonal block, this step can be split into distinct operations, one that factorizes the diagonal block and others that update the sub-diagonal blocks by means of triangular solve operations; these last can be done using Level-3 BLAS routines and are, thus, very efficient. In order to take advantage of low-rank approximations, the blocks of the frontal matrix have to be compressed. This does not necessarily have to be done before the factorization


Algorithm 5.1 Frontal full-rank LDLT (Right-looking) factorization.

1: {Input: a p × p block frontal matrix F; F = [Fi,j], i = 1:p, j = 1:p; p = pfs + pnfs}
2: for k = 1 to pfs do
3:   Factor+Solve: Fk:p,k ← Lk:p,k Dk,k Lk,k^T
4:   for i = k + 1 to p do
5:     for j = k + 1 to i do
6:       Update: Fi,j ← Fi,j − Fi,k Dk,k Fj,k^T
7:     end for
8:   end for
9: end for

starts: Compress operations can be interleaved with Factor, Solve and Update ones and, depending on when the compress operations are done, different variants can be defined. In the following sections we describe some of them; for a complete taxonomy of all the BLR factorization variants we refer the reader to the PhD thesis of Mary [125].
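For reference, the Compress operation just mentioned replaces a block by a product X·Yᵀ whose rank is determined by an accuracy threshold ε. The sketch below uses a truncated SVD for clarity, whereas production codes typically rely on a cheaper truncated rank-revealing QR; it is illustrative Python/NumPy code, not the actual kernel.

    # Minimal sketch of the Compress step: approximate a block by X @ Y.T with
    # absolute 2-norm accuracy eps. A truncated SVD is used here for clarity.
    import numpy as np

    def compress(block, eps):
        U, s, Vt = np.linalg.svd(block, full_matrices=False)
        r = int(np.sum(s > eps))                  # numerical rank at threshold eps
        if r * sum(block.shape) >= block.size:
            return None                           # not worth storing in low-rank form
        X = U[:, :r] * s[:r]                      # absorb the singular values into X
        Y = Vt[:r, :].T
        return X, Y                               # block ~= X @ Y.T

    rng = np.random.default_rng(0)
    b = rng.standard_normal((128, 16)) @ rng.standard_normal((16, 128))  # rank-16 block
    X, Y = compress(b, eps=1e-8)
    print(X.shape, Y.shape, np.linalg.norm(b - X @ Y.T, 2))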

5.3.3.1 The FSCU and UFSC variants

The objective of this variant, introduced by Amestoy et al. [J4], is to preserve the ability to perform pivoting as is done in Algorithm 5.1, which implies that the Factor and Solve operations have to be fused together; therefore, a block cannot be compressed until the panel it belongs to has been reduced. The name FSCU results from the order in which operations are done, i.e., Factor+Solve, Compress and Update. This variant is illustrated in Algorithm 5.2. In this variant, the only operation to benefit from the low-rank approximation is the Update one, while the Factor+Solve is still done in full-rank. It must be noted that the C̃i,j^(k) block is actually full-rank if both Fi,k and Fj,k are, and low-rank otherwise. In the latter case it must be decompressed by means of an outer product operation in order to be summed to the Fi,j block. This also holds for the other variants presented below but, as discussed by Mary [125], it is possible to conceive alternatives where the block being updated is already compressed and thus the outer product (and the associated cost) is unnecessary. The UFSC variant is the left-looking equivalent of the FSCU one and is presented in Algorithm 5.3. These two variants are mathematically equivalent but the left-looking one has a different data access pattern which makes it possible to use the LUAR technique described in Section 5.3.3.3 and allows for achieving better performance, as shown in Section 5.5. As explained above, the FSCU variant, as well as the UFSC one, are designed to be compatible with the delayed pivoting technique described in Section 2.2.3.2. In these variants, pivoting is performed in the panel reduction, i.e., the Factor+Solve operation. A pivot can only be chosen within the diagonal BLR block using a threshold partial pivoting technique, but its quality is checked against all the coefficients in the column. If no more satisfactory pivots can be found before the panel reduction is finished, the uneliminated pivots are merged into the following BLR panel, if any, or delayed to the parent front, as explained in Section 2.2.3.2. This is illustrated in Figure 5.10. A consequence of postponing rows and columns from one BLR panel to the next is that the defined frontal matrix blocking is not respected anymore, which may have an impact on the effectiveness of the low-rank approximations. A second consequence of postponing pivots is that, within a front, BLR blocks become unaligned with respect to previously eliminated ones. This raises an issue when a


Algorithm 5.2 Frontal BLR LDLT (Right-looking) factorization: standard FSCU variant.

1: {Input: a p × p block frontal matrix F; F = [Fi,j], i = 1:p, j = 1:p; p = pfs + pnfs}
2: for k = 1 to pfs do
3:   Factor+Solve: Fk:p,k ← Lk:p,k Dk,k Lk,k^T
4:   for i = k + 1 to p do
5:     Compress: Fi,k ≈ F̃i,k = Xi,k Yi,k^T
6:   end for
7:   for i = k + 1 to p do
8:     for j = k + 1 to i do
9:       Update Fi,j:
10:        Inner Product: C̃i,j^(k) ← Xi,k (Yi,k^T Dk,k Yj,k) Xj,k^T
11:        Outer Product: Ci,j^(k) ← C̃i,j^(k)
12:        Fi,j ← Fi,j − Ci,j^(k)
13:      end for
14:    end for
15: end for
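The following Python/NumPy sketch mimics the structure of the FSCU loop (panel Factor+Solve, Compress, right-looking low-rank Update) on a dense symmetric positive definite test matrix. For readability, pivoting is omitted, the whole matrix is treated as fully-summed and the diagonal blocks are factorized with Cholesky (so Dk,k is the identity), whereas the actual solver performs an LDLT factorization with threshold partial pivoting as in Algorithm 5.2; all names are illustrative.

    # Illustrative FSCU-style BLR factorization of a dense SPD matrix
    # (Cholesky on the diagonal blocks instead of LDL^T with pivoting).
    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def compress(block, eps):
        U, s, Vt = np.linalg.svd(block, full_matrices=False)
        r = max(int(np.sum(s > eps)), 1)
        return U[:, :r] * s[:r], Vt[:r, :].T               # block ~= X @ Y.T

    def fscu(F, b, eps):
        """F: dense n x n SPD matrix (overwritten), b: block size, eps: threshold."""
        n = F.shape[0]
        panels = range(0, n, b)
        low_rank = {}                                      # (i, k) -> (X, Y)
        for k in panels:
            K = slice(k, k + b)
            Lkk = cholesky(F[K, K], lower=True)            # Factor
            F[K, K] = Lkk
            for i in panels:
                if i <= k:
                    continue
                I = slice(i, i + b)
                F[I, K] = solve_triangular(Lkk, F[I, K].T, lower=True).T   # Solve
                low_rank[i, k] = compress(F[I, K], eps)                    # Compress
            for i in panels:                               # right-looking Update
                if i <= k:
                    continue
                for j in panels:
                    if j <= k or j > i:
                        continue
                    Xi, Yi = low_rank[i, k]
                    Xj, Yj = low_rank[j, k]
                    inner = Yi.T @ Yj                      # small r x r Inner Product
                    F[slice(i, i + b), slice(j, j + b)] -= Xi @ inner @ Xj.T  # Outer Product
        return F, low_rank

    n, b = 256, 64
    A = np.random.default_rng(1).standard_normal((n, n))
    A = A @ A.T + n * np.eye(n)                            # SPD test matrix
    L_blr, _ = fscu(A.copy(), b, eps=1e-10)
    L_ref = np.linalg.cholesky(A)
    print(np.linalg.norm(np.tril(L_blr) - L_ref) / np.linalg.norm(L_ref))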

Algorithm 5.3 Frontal BLR LDLT (Left-looking) factorization: UFSC variant.

1: {Input: a p × p block frontal matrix F; F = [Fi,j], i = 1:p, j = 1:p; p = pfs + pnfs}
2: for k = 1 to pfs do
3:   for i = k to p do
4:     for j = 1 to min(k − 1, pfs) do
5:       Update Fi,k:
6:         Inner Product: C̃i,k^(j) ← Xi,j (Yi,j^T Dj,j Yk,j) Xk,j^T
7:         Outer Product: Ci,k^(j) ← C̃i,k^(j)
8:         Fi,k ← Fi,k − Ci,k^(j)
9:     end for
10:  end for
11:  Factor+Solve: Fk:p,k ← Lk:p,k Dk,k Lk,k^T
12:  for i = k + 1 to p do
13:    Compress: Fi,k ≈ F̃i,k = Xi,k Yi,k^T
14:  end for
15: end for

LAPACK-style pivoting is used, as in Equation (2.2) or Algorithm 2.2, where the swap operations also concern the submatrix to the left of the panel being factorized. Indeed, this implies that rows belonging to different compressed blocks may have to be swapped. Mary [125] proposes two different techniques for dealing with this issue.

5.3.3.2 The UFCS variant with restricted pivoting

For many real-life problems, a practically stable solution can be obtained even if pivoting is restricted to a smaller area of the current panel, such as the BLR diagonal block, or completely avoided. Without the need for pivoting, the Factor and Solve operations need not be fused: the Factor operation only reduces the diagonal block (possibly with pivoting) and is followed by a sequence of Solve operations that concern the subdiagonal blocks.



This has a twofold advantage. First, the subdiagonal blocks can be compressed prior to the Solve operations, thereby reducing their cost; second, these can be done using Level-3 BLAS operations, which ultimately results in better performance. Note that this last advantage also applies to the standard full-rank factorization. The resulting variant, called UFCS, is illustrated in Algorithm 5.4. In Section 5.4 we will show that this variant achieves a better asymptotic complexity than the ones in the previous section thanks to the reduced complexity of the Solve operations. In Section 5.5 we will show that this variant can also improve performance thanks to the higher efficiency of the Level-3 BLAS operations. It must be noted that, in this variant, more operations (specifically, the Solve ones) are done in low-rank compared to the FSCU and UFSC variants. This may lead to an additional loss of accuracy of the solver, as speculated by Amestoy et al. [J4] and Weisbecker [159]. Our experimental results in Section 5.5 show only a moderate effect on the scaled residual for a number of real-life problems, though. Mary [125] proposes a variant, called UCFS, that can perform pivoting even if subdiagonal blocks are compressed before the solve operation. He presents experimental results on a number of real-life problems showing that this variant is comparable to the FSCU and UFSC ones in terms of stability while achieving the same complexity as the UFCS one.

Figure 5.10: BLR factorization with numerical pivoting. Postponed pivots after the elimination of the current panel are shown in red; they are merged with the next panel, whose size increases.

5.3.3.3 LUAR

One of the limiting factors of the previously presented variants is the outer product operation, which is needed to decompress a low-rank update in order to sum it to a full-rank block. The LUAR (Low-rank Update Accumulation and Recompression) technique allows for reducing the cost of the outer product and improving its efficiency. With this technique, which can only be applied to left-looking variants, lines 4-9 of Algorithm 5.3 and lines 4-9 of Algorithm 5.4 are replaced by Algorithm 5.5. LUAR consists in accumulating the update matrices C̃ik^(j) together, as shown on line 5 of Algorithm 5.5:

C̃ik^(acc) := C̃ik^(acc) + C̃ik^(j)


Algorithm 5.4 Frontal BLR LDLT (Left-looking) factorization: UFCS variant.

1: {Input: a p × p block frontal matrix F; F = [Fi,j], i = 1:p, j = 1:p; p = pfs + pnfs}
2: for k = 1 to pfs do
3:   for i = k to p do
4:     for j = 1 to min(k − 1, pfs) do
5:       Update Fi,k:
6:         Inner Product: C̃i,k^(j) ← Xi,j (Yi,j^T Dj,j Yk,j) Xk,j^T
7:         Outer Product: Ci,k^(j) ← C̃i,k^(j)
8:         Fi,k ← Fi,k − Ci,k^(j)
9:     end for
10:  end for
11:  Factor: Fk,k ← Lk,k Dk,k Lk,k^T
12:  for i = k + 1 to p do
13:    Compress: Fi,k ≈ F̃i,k = Xi,k Yi,k^T
14:  end for
15:  for i = k + 1 to p do
16:    Solve: F̃i,k ← F̃i,k Lk,k^{-T} Dk,k^{-1} = Xi,k (Yi,k^T Lk,k^{-T} Dk,k^{-1})
17:  end for
18: end for

Algorithm 5.5 LUAR-Update step.

1: {Input: a block Fi,k to be updated.}
2: Initialize C̃i,k^(acc) to zero
3: for j = 1 to min(k − 1, pfs) do
4:   Inner Product: C̃i,k^(j) ← Xi,j (Yi,j^T Dj,j Yk,j) Xk,j^T
5:   Accumulate update: C̃i,k^(acc) ← C̃i,k^(acc) + C̃i,k^(j)
6:   C̃i,k^(acc) ← Recompress(C̃i,k^(acc))
7: end for
8: Outer Product: Ci,k^(acc) ← C̃i,k^(acc)
9: Fi,k ← Fi,k − Ci,k^(acc)

Note that in the previous equation, the + sign denotes a low-rank sum operation. Specifically, if we note A = Cik^(acc) and B = Cik^(j), then

B̃ = C̃ik^(j) = Xij (Yij^T Djj Ykj) Xkj^T = XB CB YB^T

with XB = Xij, CB = Yij^T Djj Ykj, and YB = Xkj. Similarly, Ã = C̃ik^(acc) = XA CA YA^T. Then the low-rank sum operation is defined by:

Ã + B̃ = XA CA YA^T + XB CB YB^T = [XA, XB] · blkdiag(CA, CB) · [YA, YB]^T = XS CS YS^T = S̃

where S̃ is a low-rank approximant of S = A + B. Accumulated updates can, optionally, be recompressed in order to reduce the complexity of the outer product. Many different strategies are possible to recompress the accumulated updates, as discussed by Mary [125] and Anton et al. [21]; in the rest of this

document and, especially, in the experimental results of Sections 5.4 and 5.5, we will assume that this is achieved by recompressing only the middle block CS (see Figure 5.11), as this is the strategy that captures most of the recompression potential with an acceptable overhead:

Ã + B̃ = XS CS YS^T ≈ XS (XCS YCS^T) YS^T = X̂S ŶS^T,   with C̃S = XCS YCS^T, X̂S = XS XCS, ŶS = YS YCS.


Figure 5.11: Low-rank Updates Accumulation.

LUAR has a twofold advantage. First, it improves the granularity of the outer product operation, especially when the recompression is not done, because, in essence, multiple outer products are performed with a single kernel; this will be discussed and illustrated by means of experimental results in Section 5.5. Second, when recompression is done, it reduces the complexity of the outer product operation, which yields a reduction of the overall complexity of the multifrontal factorization; this will be analyzed theoretically in Section 5.4 and assessed experimentally.
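The following Python sketch illustrates the LUAR mechanism just described: several low-rank updates of the same block are accumulated as a three-term product and only the middle block is recompressed. It is an illustrative NumPy sketch, with the Dj,j factors folded into the middle blocks, and not the actual solver kernel.

    # Minimal LUAR sketch: accumulate low-rank updates X_j C_j Y_j^T as
    # [X_1..X_k] blkdiag(C_1..C_k) [Y_1..Y_k]^T, then recompress the middle block only.
    import numpy as np

    def svd_compress(M, eps):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = max(int(np.sum(s > eps)), 1)
        return U[:, :r] * s[:r], Vt[:r, :].T               # M ~= X @ Y.T

    def luar(updates, eps):
        """updates: list of (X, C, Y) triples, each one representing a low-rank
        update X @ C @ Y.T of the same block."""
        XS = np.hstack([u[0] for u in updates])            # accumulated left factors
        YS = np.hstack([u[2] for u in updates])            # accumulated right factors
        CS = np.zeros((XS.shape[1], YS.shape[1]))          # block-diagonal middle part
        r0 = c0 = 0
        for X, C, Y in updates:
            CS[r0:r0 + X.shape[1], c0:c0 + Y.shape[1]] = C
            r0 += X.shape[1]
            c0 += Y.shape[1]
        Xc, Yc = svd_compress(CS, eps)                     # recompress the middle block
        return XS @ Xc, YS @ Yc                            # accumulated update ~= Xhat @ Yhat.T

    rng = np.random.default_rng(2)
    ups = []
    for _ in range(4):                                     # four updates of a 128 x 128 block
        X, Y = rng.standard_normal((128, 8)), rng.standard_normal((128, 8))
        C = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))   # rank-2 middle block
        ups.append((X, C, Y))
    Xhat, Yhat = luar(ups, eps=1e-10)
    full = sum(X @ C @ Y.T for X, C, Y in ups)             # reference: decompress everything
    print(Xhat.shape[1], np.linalg.norm(full - Xhat @ Yhat.T) / np.linalg.norm(full))

In this toy example the accumulated update ends up being represented with rank 8 instead of 4 × 8 = 32, so the single outer product that follows is correspondingly cheaper.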

5.3.3.4 Compression of the contribution block

It must be noted that upon completion of Algorithms 5.2, 5.3 and 5.4, the contribution block is left uncompressed, i.e., in full-rank form. This, however, does not have to be the case because the contribution block can be compressed in exactly the same way as the fully-summed part of a frontal matrix. This can easily be achieved by appending to Algorithms 5.2, 5.3 and 5.4 a doubly-nested loop that sweeps over all the blocks of the BLR contribution block and compresses them one by one. As explained in Section 5.3.2, compressing the contribution block does not necessarily imply that the assembly operations are done in low-rank because it can be decompressed (either block-by-block or all at once) prior to being assembled into the parent front. Compressing the contribution block has a number of consequences:

1. Unless a fully-structured factorization variant is used, the overall number of floating-point operations is not decreased but, rather, increased. This is because compression clearly has a cost but, once compressed, the contribution block is not used in other operations apart from assemblies. If these are done in full-rank, decompression implies an additional cost; if, instead, they are done in low-rank (as in a fully-structured solver) the overhead may be even higher due to the padding of BLR blocks (see Section 5.3.2).

2. Memory consumption is reduced. Although a contribution block is temporary data that is discarded after being assembled into the parent front, its storage can consume


a large amount of memory. Actually, as explained in Chapter 4, the memory used to store all the contribution blocks – the so-called active memory (see Chapter 4 for a more accurate definition) – can represent a large fraction of the total memory consumption in a sequential execution, can become dominant in a parallel execution, or can even be the totality of the memory consumption in an Out-Of-Core [2] execution. In these settings, therefore, compressing the contribution blocks is the only effective way of reducing the memory footprint of a multifrontal solver.

3. Communications are reduced in a distributed-memory parallel setting. Compressing the fully-summed part reduces the total volume of communications associated with the factorization of frontal matrices. Assembly operations also involve communications if a front and its parent are not mapped on the same process. Mary [125] showed that these communications can become dominant and harm the scalability. Compressing the contribution block allows for reducing these communications.

4. As for the execution time, it is clearly increased in a serial or shared-memory parallel execution because of point 1 above, but in a distributed-memory parallel setting it can be decreased thanks to a lower volume of communications, as explained in the previous point.

In the remainder of this work we will assume that the contribution block is not compressed.

5.3.4 Experimental results on applications

In this section, we evaluate the effectiveness of the BLR approximations on two real-life applications. The use of the BLR format through the techniques presented in the previous section was integrated into the MUMPS [13] parallel multifrontal solver. Note, however, that low-rank approximations are not applied to the smallest frontal matrices, where the compression overhead exceeds the actual benefits brought by the LR format. In the BLR-based version, distributed-memory parallelism is achieved in the same way as in the standard, full-rank (FR) MUMPS solver, with the exception that the messages exchanged during the factorization of each front are reduced thanks to the low-rank data compression. Note that in the code used for the experiments reported in this section, the contribution block is not compressed and, therefore, the volume of messages related to the assembly of fronts is not reduced. For a thorough discussion of this issue, please refer to Mary's PhD thesis [125]. As for shared-memory parallelism, the FR MUMPS code heavily relies on multithreaded BLAS routines; this approach is not suited to the BLR factorization due to the small granularity of the compressed blocks. Therefore, multithreading is achieved by parallelizing the loops on lines 4 and 7 of the FSCU factorization in Algorithm 5.2; more details on this will be provided in Section 5.5. The BLR factorization variant used for these experiments is the FSCU one; the details of a full-featured, distributed-memory parallel implementation of the other variants can be found in the PhD thesis of Mary [125].

5.3.4.1 3D frequency-domain Full Waveform Inversion

This section discusses the use of a BLR multifrontal solver in 3D seismic modeling by means of frequency-domain Full Waveform Inversion (FWI) [151]; this evaluation was

conducted in collaboration with the SEISCOPE consortium1. Although this method is now routinely used in the oil industry as part of the seismic imaging work-flow, it remains a computational challenge due to the huge number of full-waveform seismic modelings to be performed over the iterations of the FWI optimization. In our case, both seismic modeling and FWI are performed in the frequency domain [156]. Solving the time-harmonic wave equation is a stationary boundary-value problem which requires solving a large, sparse, complex-valued system of linear equations with multiple right-hand sides per frequency [122]. The sparse right-hand sides of these systems are the seismic sources, the solutions are monochromatic wavefields and the coefficients of the so-called impedance matrix depend on the frequency and the subsurface properties we want to image. Processing a large number of right-hand sides leads us naturally toward direct solvers because the computation of the solutions by forward/backward substitutions is quite efficient once an LU factorization of the impedance matrix has been performed. In the following analysis, the subsurface target (the Valhall oil field located in the North Sea in a shallow water environment) and the dataset are the same as in the work by Operto et al. [130]. For the details of the 3D visco-acoustic equation, instead, we refer the reader to the paper by Amestoy et al. [J3] from which this section was extracted. This leads to the three matrices 5Hz, 7Hz and 10Hz reported in Appendix A.1, Table A.2. For each of these matrices 4604 right-hand sides were used. The following results were measured on the licallo machine described in Appendix A.2. For each node we used two MPI processes, each using ten cores by means of multithreading. For the 5Hz, 7Hz and 10Hz matrices we used 12, 16 and 34 nodes for a total of 240, 320 and 680 cores, respectively. Computations are done in single precision, complex arithmetic. A summary of the experimental setting is provided in Table 5.2.

Freq. (Hz)  h (m)  Grid dimensions   npml  #u (10^6)  #n  #MPI  #th  #c   #rhs
5           70     66 × 130 × 230    8     2.9        12  24    10   240  4604
7           50     92 × 181 × 321    8     7.2        16  32    10   320  4604
10          35     131 × 258 × 458   4     17.4       34  68    10   680  4604

Table 5.2: North Sea case study. Problem size and computational resources. h (m): grid interval. npml: number of grid points in the absorbing perfectly-matched layers. #u (10^6): number of unknowns. #n: number of compute nodes. #MPI: number of MPI processes. #th: number of threads per MPI process. #c: number of cores. #rhs: number of right-hand sides processed per FWI gradient.

First, we show the nature of the errors introduced in the wavefield solutions by the BLR approximation. Figure 5.12(a) shows a 5Hz monochromatic common-receiver gather computed with the FR solver in the FWI models obtained after the inversions. Figures 5.12(b-d) show the differences between the common-receiver gathers computed with the FR solver and those computed with the BLR solver using ε = 10−5, ε = 10−4 and ε = 10−3 (the same subsurface model is used to perform the FR and the BLR simulations). These differences are shown after multiplication by a factor of 10. A direct comparison between the FR and the BLR solutions along a shot profile intersecting the receiver position is also shown in Figure 5.12(e-g). Similar figures can be drawn for the 7Hz and 10Hz problems.

1https://seiscope2.osug.fr


Table 5.3 reports the ratio between the scaled residual obtained with the BLR and the full-rank (FR) solver, where the scaled residual is given by δ = ∥Ah p̃h − b∥∞ / (∥Ah∥∞ ∥p̃h∥∞) and p̃h denotes the computed solution. Three conclusions can be drawn for this case study. First, for the used values of ε, the magnitude of the errors generated by the BLR approximation relative to the reference full-rank solutions is small. Second, these relative errors mainly concern the amplitude of the wavefields, not the phase. Third, for a given value of ε, the error ratio δBLR/δFR decreases with frequency, as shown by the results in Table 5.3. From these results and from an analysis of the overall FWI ones [J3], it can be concluded that a BLR threshold of 10−3 produces satisfactory results.

F (Hz)/h (m)   δ (FR)           δ (BLR, ε = 10−5)   δ (BLR, ε = 10−4)   δ (BLR, ε = 10−3)
5Hz/70m        2.3 × 10−7 (1)   4.6 × 10−6 (20)     6.7 × 10−5 (291)    5.3 × 10−4 (2292)
7Hz/50m        7.5 × 10−7 (1)   4.6 × 10−6 (6)      6.9 × 10−5 (92)     7.5 × 10−4 (1000)
10Hz/35m       1.3 × 10−6 (1)   2.9 × 10−6 (2.3)    3.0 × 10−5 (23)     4.3 × 10−4 (331)

Table 5.3: North Sea case study. Modeling error introduced by BLR for different low-rank thresholds ε and different frequencies F. δ: scaled residual defined as ∥Ah p̃h − b∥∞ / (∥Ah∥∞ ∥p̃h∥∞), for b being one of the RHSs in B. The number between brackets is δBLR/δFR. Note that, for a given ε, this ratio decreases as frequency increases.
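As a side note, the scaled residual δ defined above is straightforward to compute; the following Python/SciPy sketch shows the computation on an artificial sparse system (all names and data are illustrative).

    # Minimal sketch of the scaled residual delta = ||A p - b||_inf / (||A||_inf ||p||_inf).
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import norm as sparse_norm

    def scaled_residual(A, p, b):
        return np.abs(A @ p - b).max() / (sparse_norm(A, np.inf) * np.abs(p).max())

    A = sp.random(1000, 1000, density=1e-2, format="csr", random_state=0) + 10 * sp.eye(1000)
    p_exact = np.ones(1000)
    b = A @ p_exact
    p_approx = p_exact + 1e-6 * np.random.default_rng(0).standard_normal(1000)  # perturbed solution
    print(scaled_residual(A, p_approx, b))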

Table 5.4 shows the details of the computational savings provided by the BLR approximation techniques. Compared to the FR factorization, when the BLR solver (with ε = 10−3) is used, the number of flops for the factorization (field FLU in the table) decreases by factors of 8, 10.7 and 13.3 for the 5Hz, 7Hz and 10Hz frequencies, respectively. Moreover, the LU factorization time is decreased by factors of 1.9, 2.7 and 2.7 (field TLU in Table 5.4). The time reduction achieved by the BLR solver tends to increase with the frequency. The elapsed time to perform the FWI is provided for each grid in Table 5.4. The entire FWI application takes 49hr, 40hr, 36hr and 37.8hr with the FR solver and the BLR solver with ε = 10−5, ε = 10−4 and ε = 10−3, respectively. We conclude from this analysis that, for this case study, the BLR solver with ε = 10−4 provides the best FWI time. At the 7Hz and 10Hz frequencies, the BLR solver with ε = 10−3 provides the smallest computational cost without impacting the quality of the FWI results. It must be noted that the implementation used for these tests does not take advantage of the BLR format in the solve phase, whose computational cost is relatively high due to the high number of right-hand sides. Mary [125] provides indications of the potential time reductions that can be achieved with a BLR-based solve operation. Finally, in Figure 5.13, we provide a strong scalability analysis of both the FR and BLR factorizations with the FSCU variant. These experiments were run on the eos supercomputer (see Appendix A.2), which has characteristics similar to those of the licallo machine. We use the 10Hz matrix on an increasing number of nodes (from 30, the minimal number for the problem to fit in-core, to 90, the maximal number available). The number of threads is fixed to 10 per node. While both FR and BLR scale reasonably well and the ratio between FR and BLR factorization times is relatively stable, the FR strong scalability is clearly better than the BLR one. This can be ascribed to a number of issues. For example, it can be due to the fact that the relative weight of communications is higher in BLR than in FR; this issue is particularly important in these tests because the



Figure 5.12: North Sea case study. BLR modeling errors. (a) 5Hz receiver gather (real part) computed with the FR solver. (b) Difference between the receiver gathers computed with the BLR (ε = 10−5) and the FR (a) solvers. (c) Same as (b) for ε = 10−4. (d) Same as (b) for ε = 10−3. Residual wavefields in (b-d) are multiplied by a factor of 10 before plotting. The FR wavefield (a) and the residual wavefields after multiplication by a factor of 10 (b-d) are plotted with the same amplitude scale, defined by a clip percentage equal to 85% of the FR-wavefield amplitudes (a). (e-g) Direct comparison between the wavefields computed with the FR solver (dark gray) and the BLR solver (light gray) for ε = 10−5 (e), ε = 10−4 (f), ε = 10−3 (g) along an X profile intersecting the receiver position (dashed line in (a)). The difference is shown by the thin black line. Amplitudes are scaled by a linear gain with offset.


F (Hz)/h (m)   ε      FLU (×10^12)   TLU (s)      TFWI (hr)
5Hz/70m        FR     66 (1.0)       78 (1.0)     14.0
(240 cores)    10−5   17 (3.8)       48 (1.6)     12.9
               10−4   12 (5.3)       46 (1.7)     12.7
               10−3   8 (8.0)        41 (1.9)     14.0
7Hz/50m        FR     410 (1.0)      322 (1.0)    14.5
(320 cores)    10−5   90 (4.5)       157 (2.1)    12.0
               10−4   63 (6.5)       136 (2.4)    9.1
               10−3   38 (10.7)      121 (2.7)    9.0
10Hz/35m       FR     2600 (1.0)     1153 (1.0)   20.7
(680 cores)    10−5   520 (4.9)      503 (2.3)    14.5
               10−4   340 (7.5)      442 (2.6)    14.2
               10−3   190 (13.3)     424 (2.7)    14.8

Table 5.4: North Sea case study. Computational savings provided by the BLR solver during the factorization step. The factor of improvement due to BLR is indicated in parentheses. The elapsed times required to perform the multi-RHS substitution step and to compute the gradient are also provided. FLU: flops for the LU factorization. TLU (s): elapsed time for the LU factorization. TFWI: the overall FWI time.

CBs are not compressed and, thus, the volume of messages related to the assembly of fronts is not reduced. Moreover, because the BLR compression rate cannot be estimated prior to the actual factorization, the workload mapping is based on full-rank estimates, which can be very inaccurate and lead to load imbalance. Mary [125] provides a detailed analysis of these issues (and more) and proposes methods to overcome them.

(a) Strong scalability figure (dashed lines represent ideal scalability). (b) Associated table:

Number of MPIs × Number of cores    30 × 10    45 × 10    60 × 10    75 × 10    90 × 10
Time (s)  FR                        1257.2     874.8      722.5      667.2      617.0
          BLR                       366.6      281.5      258.0      242.3      231.2
          ratio                     3.4        3.1        2.8        2.8        2.7

Figure 5.13: Strong scalability of the FR and BLR factorizations (10Hz matrix).


5.3.4.2 3D Controlled-source Electromagnetic inversion

Our second application is marine Controlled-Source ElectroMagnetic (CSEM) surveying, which is a widely used method for detecting hydrocarbon reservoirs and other resistive structures embedded in conductive formations [51, 70, 106]; the analysis reported in this section was achieved in collaboration with the EMGS company2. The conventional method uses a high-powered electric dipole as a current source, which excites low-frequency (0.1-10 Hz) EM fields in the surrounding media, and the responses are recorded by electric and magnetic seabed receivers. In an industrial CSEM survey, data from a few hundred receivers and thousands of source positions are inverted to produce a 3D distribution of subsurface resistivity. In order to invert and interpret the recorded EM fields, a key requirement is to have an efficient 3D EM modeling algorithm. Common approaches for the numerical modeling of the EM fields include the finite-difference (FD), finite-volume (FV), finite-element (FE) and integral equation methods. In the frequency domain, these methods reduce the governing Maxwell equations to a system of linear equations Mx = s for each frequency, where M is the system matrix defined by the medium properties and grid discretization, x is a vector of unknown EM fields, and s represents the current source and boundary conditions. For the FD, FV and FE methods, the system matrix M is sparse, and hence the corresponding linear system can be efficiently solved using sparse iterative or direct solvers. Note that, because many sources are used, the M matrix has to be solved against multiple right-hand sides which, as in the case of the application discussed in the previous section, suggests that direct methods may behave well. In all simulations the system matrix was generated using the finite-difference modeling code presented by Jaysaval et al. [104]. The simulations were carried out on either the eos supercomputer or the farad machine (see Appendix A.2). Two different models were used for this analysis. The first, referred to as the H-model, is a simple, synthetic model, whereas the second, referred to as the S-model, is the SEAM (SEG Advanced Modeling Program) Phase I salt resistivity model, representative of the geology in the Gulf of Mexico. It is a complex 3D earth model designed by the hydrocarbon exploration community and widely used to test 3D modeling and inversion algorithms. This led to the generation of four matrices, namely the H3, H17 (for the H-model) and S3, S21 (for the S-model) reported in Table A.2. Computations are done in double precision, complex arithmetic. For the details of the equations, the discretization approach and the models used in this study, we refer the reader to the paper by Shantsev et al. [J20], from which this section was extracted. First, we investigate the accuracy of the BLR solution xε for different values of ε and analyze the spatial distribution of the solution error. The error is defined as the relative difference between the BLR solution xε and the full-rank solution x:

ξ_{m,i,j,k} = sqrt( |xε_{m,i,j,k} − x_{m,i,j,k}|^2 / ( ( |xε_{m,i,j,k}|^2 + |x_{m,i,j,k}|^2 )/2 + η^2 ) ),   (5.3)

for m = x, y and z; i ∈ [1; Nx], j ∈ [1; Ny], and k ∈ [1; Nz]. Here, x_{m,i,j,k} represents the m-component of the electric field at the (i, j, k)-th node of the grid, while η = 10−16 V/m represents the ambient noise level.
Figure 5.14 shows 3D maps of the relative difference ξ_{x,i,j,k} between xε and x for the x-component of the electric field for matrix H3. In all

2http://www.emgs.com/

maps, the relative error in the air is orders of magnitude larger than in the water or formation. Fortunately, large errors in the air do not create a problem in most practical CSEM applications. For marine CSEM inversion one needs very high accuracy for the computation of the EM fields at the seabed receivers (to compare them to the measured data), as well as reasonably accurate fields in the whole inversion domain (to compute the corresponding Jacobians and/or gradients). However, one never inverts for the air resistivity, hence we can exclude the air from the analysis and focus on the solution errors only in the water and the earth. One can see from Figure 5.14 that for the smallest low-rank threshold, ε = 10−10, the relative error ξx,i,j,k in water and formation is negligible (≈ 10−4), but it increases for larger ε, and for ε = 10−8 and 10−7 it reaches 1−2% at depth, though it remains negligible close to the seabed and at shallow depths. For ε = 10−6 the error exceeds 10% in the deeper part of the model, implying that the BLR solution xε obtained with ε = 10−6 is of poor quality. At the same time, solutions obtained with ε ≤ 10−7 are accurate enough and can be considered appropriate for CSEM modeling and inversion. Next, we turn our attention to the gains provided by the BLR multifrontal factorization to the CSEM inversion. We consider the inversion of synthetic CSEM data over the S-model. We assume that ns = 3300 horizontal electric dipole (HED) sources are used and nr = 121 receivers are used to record simulated responses; the model is discretized with a 181 × 160 × 237 grid, which results in the system matrix S21 with 20.6 million unknowns. The frequency is 0.25 Hz. To invert the CSEM responses with the above acquisition parameters, we consider two inversion schemes: (1) a quasi-Newton inversion scheme and (2) a Gauss-Newton inversion scheme. An inversion based on the Gauss-Newton scheme converges faster and is less dependent on the starting model as compared to the quasi-Newton inversion, but this comes at the cost of an increased computational complexity. We refer to Habashy et al. [92] for a detailed discussion of the theoretical differences between the two inversion schemes. One key difference is the number of RHSs that needs to be handled at each inversion iteration. For the quasi-Newton scheme it scales with the number of receivers nr, while in the Gauss-Newton scheme one should include computations also for all source shot points ns. In a typical marine CSEM survey one has ns ≫ nr, hence the number of RHSs required by the Gauss-Newton scheme is much larger than that required by the quasi-Newton scheme. For the chosen example based on the SEAM model, the quasi-Newton and Gauss-Newton schemes require 968 and 3784 RHSs per inversion iteration for one frequency, respectively. In Table 5.5, we report the time for the complete resolution (analysis, factorization, and forward and backward substitutions for all RHSs) using the FR and BLR solvers on the eos supercomputer with a 90 MPI × 10 threads setting and ParMETIS [105] for the ordering. For comparison, time estimates for an iterative solver are also presented. This iterative solver was developed following the ideas of Mulder [129]: a complex biconjugate-gradient-type solver, BiCGStab(2) [91, 155], is used in combination with a multigrid preconditioner and a block Gauss-Seidel type smoother. The first two rows of Table 5.5 show the results reported in Shantsev et al.
[J20], which were obtained with MUMPS 5.0; in the case of the BLR solver, the factorization uses the FSCU variant and the solution phase is still performed in FR. While the BLR solver already showed potential to accelerate the factorization, the overall BLR solver remained 2.5 and 1.5 times slower than the iterative one, using the quasi-Newton and Gauss-Newton schemes, respectively. Since then, many improvements have been made to both the FR and BLR solvers. The last two rows of Table 5.5 show the new results, obtained with the development


Figure 5.14: Relative difference between the BLR solution xε for different low-rank thresholds ε, and the FR solution x for a linear system corresponding to matrix H3. For ε = 10−7, the solution accuracy is acceptable everywhere except in the air layer at the top. The results are for the x-component of the electric field. The air and PML layers are not to scale.


                                             FR solver                    BLR solver (ε = 10−7)        Iterative
Version  Inversion scheme (number of RHSs)   Ta   Tf    Ts    Ttotal      Ta   Tf    Ts    Ttotal      solver
Old      Quasi-Newton (968)                  87   2803  965   3856        103  1113  965   2181        803
Old      Gauss-Newton (3784)                 87   2803  3772  6663        103  1113  3772  4988        3141
New      Quasi-Newton (968)                  87   1254  329   1670        103  232   301   636         803
New      Gauss-Newton (3784)                 87   1254  1287  2628        103  232   1177  1512        3141

Table 5.5: Run times for the FR and BLR direct solvers on the eos supercomputer and for a multigrid-preconditioned iterative solver to perform CSEM simulations. On top, the old results presented in Shantsev et al. [J20] (based on MUMPS 5.0 and using the FSCU BLR variant); at the bottom, the new ones (MUMPS development version, using the UFCS+LUAR BLR variant, and a preliminary version of the BLR solution phase). We consider two different inversion schemes applied to CSEM data over the SEAM model: a quasi-Newton scheme with 968 RHSs and a Gauss-Newton scheme with 3784 RHSs. The simulations are carried out for the system matrix S21 using 900 computational cores. For the direct solvers, Ta is the analysis time, Tf is the factorization time, Ts is the solve time (forward-backward substitutions for all RHSs), and Ttotal is the total time, all measured in seconds.

version of MUMPS. Both the factorization and solution phases of the FR solver have been accelerated, although the solve phase is still carried out in full-rank; moreover, the gains due to the BLR solver are higher thanks to the use of the improved UFCS+LUAR factorization variant presented in this thesis, and a preliminary version of the BLR solution phase is also used, providing a moderate speedup with respect to the FR solver. With these improvements, the BLR solver achieves moderate gains with respect to the iterative solver using the quasi-Newton scheme, and outperforms it by a factor over 2 using the Gauss-Newton scheme. These new results show the suitability of BLR solvers for CSEM inversion, as they start to become more attractive than iterative solvers for 800 or more RHSs.

5.4 Complexity of the BLR multifrontal factorization

In this section we develop a theoretical analysis of the complexity (both in terms of floating-point operations and size of factors) of the BLR factorization of dense and sparse matrices. This analysis is based on the work by Bebendorf et al. [35], which assumes the use of the strong admissibility condition. Hackbusch et al. [93], however, show that using the weak block-admissibility condition instead leads to a smaller constant in the complexity estimates. The extension to the weak admissibility condition in the BLR case is out of the scope of our work. In the following, when referring to the BLR case, we simplify the notation S(I × I) to S(I), as in most cases we do not need a different partitioning of the row and column variables.

5.4.1 FE discretization of elliptic PDEs

We consider a Partial Differential Equation of the form:

Lu = f in Ω ⊂ R^d, with Ω convex and d ≥ 2,   u = g on ∂Ω,   (5.4)

where L is a uniformly elliptic operator in divergence form:

Lu = −div[C∇u + c1u] + c2 · ∇u + c3u

C is a d × d matrix of functions such that, for all x, C(x) ∈ R^{d×d} is symmetric positive definite with entries cij ∈ L^∞(Ω). Furthermore, c1(x), c2(x) ∈ R^d and c3(x) ∈ R. We consider the resolution of problem (5.4) by the Finite Element (FE) method. Let D = H^1_0(Ω) be the domain of definition of operator L. We consider a FE discretization, with step size h, that defines the associated approximation of D, the space Dh. Let n = N^d = dim Dh be its dimension and {φi}, i ∈ I, the basis functions, with I = [1, n] the index set. Similarly to the work of Bebendorf et al. [35], we assume that a quasi-uniform and shape-regular triangulation is used. We define Xi, the support of φi, and generalize the definition of support to subdomains: Xσ = ∪_{i∈σ} Xi. We denote by J the bijection defined by:

J : R^n → Dh,  x ↦ Σ_{i∈I} xi φi.

To compute an approximated solution of Equation (5.4), we solve the discretized problem Ax = b where A is the stiffness matrix defined by A = J*LJ. We assume that this linear system of equations is solved using the multifrontal method to factorize A. We also define B = J*L^{-1}J and M = J*J; B is the Galerkin discretization of L^{-1} and M the mass matrix. A matrix of the form

S = AΨΨ − AΨΦ AΦΦ^{-1} AΦΨ   (5.5)

for some Φ, Ψ ⊂ I such that Φ ∪ Ψ = I is called a Schur complement of A. One of the main results of Bebendorf ([34], Section 3) states that the Schur complements of A can be approximated if an approximant of the inverse stiffness matrix A^{-1} is known. Therefore, we are interested in finding Ã^{-1}, an approximant of the inverse stiffness matrix A^{-1}. The following result from FE theory will be used ([35], Subsection 5.2): the discretization of the inverse of the operator is approximated by the inverse of the discretized operator, i.e.,

∥A^{-1} − M^{-1}BM^{-1}∥2 ≤ O(εh)   (5.6)

where εh is the accuracy associated with the step size h of the FE discretization. In the following, for the sake of simplicity, we assume that the low-rank threshold ε is set to be equal to εh. Then, assuming we can find M̃^{-1} and B̃, approximants of the inverse mass matrix M^{-1} and of the B matrix, we have ([35], Subsection 5.3):

M^{-1}BM^{-1} − M̃^{-1}B̃M̃^{-1} = (M^{-1} − M̃^{-1})BM^{-1} + M̃^{-1}(B − B̃)M^{-1} + M̃^{-1}B̃(M^{-1} − M̃^{-1})   (5.7)

Thus M^{-1}BM^{-1} can be approximated by M̃^{-1}B̃M̃^{-1} and therefore so can A^{-1}.


5.4.2 From Hierarchical to BLR bounds

The existence of H-matrix approximants of the Schur complements of A has been shown by Bebendorf et al. [35] and Bebendorf [34]. In this section, we summarize the main ideas of the proof and give the necessary ingredients to extend it to the BLR case. The reader can refer to the work by Hackbusch & Bebendorf [33, 34, 35] for the details of the proof for hierarchical matrices.

5.4.2.1 H-admissibility and properties

We assume that an H-matrix is built as described in Section 5.2.4.2 and, thus, its block-partition is H-admissible as in Definition 5.10. We note H(S(I × I), r) the set of hierarchical matrices such that r is the maximal rank of the blocks defined by the admissible partition S(I × I). We define the so-called sparsity constant:

csp = max( max_i #{Ij : Ii × Ij ∈ S(I × I)}, max_j #{Ii : Ii × Ij ∈ S(I × I)} )   (5.8)

where #E denotes the cardinality of a set E (we will use this notation from now on). Thus, the sparsity constant is the maximum number of blocks of a given level in the block cluster tree that are in the same row or column of the matrix. Hackbusch & Bebendorf [33, 34, 35] prove that the Schur complements of A possess H-approximants using (5.6). They first establish that B and M^{-1} possess H-approximants ([35], Theorems 3.4 and 4.3). More precisely, they can be approximated with accuracy ε by H-matrices B̃ and M̃^{-1} such that

B̃ ∈ H(S(I × I), rG)   (5.9)
M̃^{-1} ∈ H(S(I × I), |log ε|^d)   (5.10)

where S(I × I) is an H-admissible partition and rG is the rank resulting from the approximation of the degenerate Green function's kernel. rG can be shown to be small for many problem classes [33, 35]. Then, the following H-arithmetic theorem is used.

Theorem 5.3 H-matrix product, Theorem 2.20 in Börm et al. [37].— Let H1 and H2 be two hierarchical matrices of order n, such that H1 ∈ H(S(I × I), r1) and H2 ∈ H(S(I × I), r2). Then, their product is also a hierarchical matrix and it holds:

H1H2 ∈ H(S(I × I), csp max(r1, r2) log n)

In Theorem 5.3, csp is the sparsity constant, defined by (5.8). Then, using the fact that rG > |log ε|^d [35], and applying (5.6), (5.7), and Theorem 5.3, it is established ([35], Theorem 5.4) that

Ã^{-1} ∈ H(S(I × I), rH), with rH = csp^2 rG log^2 n   (5.11)

Furthermore, if an approximant Ã^{-1} exists, then for any Φ ⊂ I, an approximant of AΦΦ^{-1} must also exist, since AΦΦ is simply the restriction of A to the subdomain XΦ [34]. Thus, using (5.5), in combination with the fact that the stiffness matrix A can also be approximated by Ã ∈ H(S(I × I), O(1)), the existence of S̃, an H-approximant of any Schur

complement S of A, is guaranteed by (5.11) and it is shown [34] that the maximal rank of the blocks of S̃ is rH, i.e.

S̃ ∈ H(S(I × I), rH)   (5.12)

Finally, it can be shown that the complexity of factorizing an H-matrix of order m and of maximal rank r is [35, 94]:

C(m) = O(csp^2 r^2 m log^2 m)   (5.13)

Equation (5.13) relies on the assumption that the factorization is fully-structured, i.e. the compressed form Ã of A is available at no cost. To conclude, in the H case, applying (5.13) to the (dense) factorization of S̃ leads to a cost which is almost linear when r = O(1) and almost O(mN^2) when r = O(N). As will be explained in Section 5.4.4, both cases lead to a near-linear complexity of the multifrontal (sparse) factorization [163].

5.4.2.2 Why this result is not suitable to compute a complexity bound for BLR

One might think that, since BLR is a specific type of H-matrix, the previous result can be used to derive the complexity of the BLR factorization. However, the bound obtained by simply applying H-matrix theory to BLR is useless because it is equivalent to bounding all the ranks kij^ε by the same bound r, the maximal rank. The problem is that this necessarily implies r = b, because there will always be some blocks of size b such that dist(Xσ, Xτ) = 0 (i.e., non-admissible blocks, which will be considered full-rank). Thus, the best we can say about a BLR matrix is that it belongs to H(S(I), b), which is obvious and overly pessimistic. In addition, with a BLR partitioning, the sparsity constant csp (defined by (5.8)) is not bounded, as it is equal to p = m/b. Thus, (5.13) leads to a factorization complexity bound in O((m/b)^2 b^2 m log^2 m) = O(m^3 log^2 m), even worse than the full-rank factorization.

5.4.2.3 BLR-admissibility and properties

To compute a meaningful complexity bound for BLR, we divide the BLR blocks into two groups: the blocks that satisfy the block-admissibility condition (whose rank can be bounded by a meaningful bound r), and those that do not, which we assume are left in full-rank form. We show that the number of non-admissible blocks in A can be asymptotically negligible, provided an appropriate partitioning S(I) is used. This leads us to introduce the notion of BLR-admissibility of a partition S(I), and we establish for such a partition a bound on the maximal rank of the admissible blocks. In the following, we note BA the set of admissible blocks. We also define

Nna = max_{σ∈S(I)} #{τ ∈ S(I) : σ × τ ∉ BA}   (5.14)

the maximum number of non-admissible blocks on any row. Note that, because we have assumed for simplicity that the row and column partitionings are the same, Nna is also the maximum number of non-admissible blocks on any column. We then recast the H-admissibility of a partition to the BLR uniform blocking. We propose the following BLR-admissibility condition:

S(I) is admissible ⇔ Nna ≤ q (AdmBLR)


where q is a positive constant. With AdmBLR, we want the number of blocks (on any row or column) that are not admissible (and thus whose rank is not bounded by r) to be itself bounded by q. For example, if the least-restrictive strong block-admissibility condition Admb^lrs is used, AdmBLR means that a partition is admissible if, for any subdomain, its number of neighbors (i.e. the number of subdomains at distance zero) is smaller than q. The BLR-admissibility condition is illustrated in Figure 5.15, where we have assumed that Admb^lrs is used for simplicity; a small sketch of how Nna can be computed is given after the figure. In Figure 5.15 (left), the vertical subdomain (in gray) is at distance zero of O(m/b) blocks and thus Nna is not constant. In Figure 5.15 (right), the maximal number of blocks at distance zero of any block is at most 9 and thus the partition is BLR-admissible for q ≥ 9. Note that if a general strong admissibility condition Admb^s is used, the same reasoning applies, as in Figure 5.15 (right), since Nna only depends on η and d, which are both constant. We note BLR(S(I), r, q) the set of BLR matrices such that r is the maximal rank of the admissible blocks defined by the BLR-admissible partition S(I).


Figure 5.15: Illustration of the BLR-admissibility condition. On the left, an example of a non-BLR-admissible partition: here, the gray subdomain has Nna = O(m/b) ≠ O(1) neighbors. On the right, an example of a BLR-admissible partition where q ≥ Nna = 9 = O(1), the maximal number of neighbors of any block.
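The quantity Nna of Equation (5.14) is easy to evaluate for simple geometries. The following Python sketch computes it for a square domain split into uniform tiles, approximating "distance zero" between two subdomains by mesh adjacency of their grid points (which mimics the overlap of basis-function supports); it is a purely illustrative example with assumed data, not part of the solver.

    # Minimal sketch: compute N_na for a regular 2D grid partitioned into square
    # tiles, under the least-restrictive strong admissibility condition (a block
    # sigma x tau is admissible iff the two subdomains are not adjacent).
    import itertools
    import numpy as np
    from scipy.spatial.distance import cdist

    def n_na(clusters):
        """clusters: list of (n_i x 2) arrays of grid-point coordinates."""
        worst = 0
        for sigma in clusters:
            # tiles whose closest grid points are adjacent (Chebyshev distance <= 1)
            non_adm = sum(1 for tau in clusters if cdist(sigma, tau, "chebyshev").min() <= 1)
            worst = max(worst, non_adm)
        return worst

    # 16 x 16 grid split into 4 x 4 tiles: every tile touches at most itself and
    # its 8 neighbors, so the partition is BLR-admissible for q >= 9.
    pts = np.array(list(itertools.product(range(16), range(16))), dtype=float)
    tiles = [pts[(pts[:, 0] // 4 == i) & (pts[:, 1] // 4 == j)]
             for i in range(4) for j in range(4)]
    print(n_na(tiles))   # prints 9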

The following lemma, whose proof is provided by Amestoy et al. [J6], proves that a BLR-admissible partitioning exists and can be obtained by refining an H-admissible one.

Lemma 5.1— Let S(I × I) be a given H-partitioning and let S(I) be the corresponding BLR partitioning obtained by refining the H one. Let us note Nna^(H) and Nna^(BLR) the value of Nna for the H and BLR partitionings, respectively. Then: (a) Provided b ≥ cmin, it holds Nna^(BLR) ≤ Nna^(H); (b) Under the assumption that S(I × I) is defined by a geometrically balanced block cluster tree, it holds Nna^(H) = O(1).

In view of this lemma, we assume in the following that the partition S(I) is defined by a geometrically balanced cluster tree and is thus admissible for q = Nna = O(1). The next step is to find BLR approximants B̃ and M̃^{-1} of B and M^{-1}, respectively, that verify:

B̃ ∈ BLR(S(I), rG, Nna)   (5.15)
M̃^{-1} ∈ BLR(S(I), 0, Nna)   (5.16)

The construction of B̃ is the same for a BLR or an H-partitioning, and we can thus rely on the work of Hackbusch and Bebendorf [35]. The main idea behind this construction



(a) Example of an H-partitioning, with Nna^(H) = 4 ≤ csp = 6 = O(1). Here, we have assumed that the bottom-left and top-right blocks are non-admissible for illustrative purposes. (b) BLR refining of the H-partitioning 5.16(a). csp = O(m/b) is not bounded anymore but Nna^(BLR) = 3 = O(1) remains constant.

Figure 5.16: Illustration of Lemma 5.1 (proof of the boundedness of Nna).

is to exploit the decay property of Green functions. As shown by Hackbusch and Bebendorf ([35], Theorem 3.4), for any admissible block σ × τ ∈ BA, Bσ×τ can be approximated by a low-rank matrix Bσ×τ^ε of numerical rank less than rG. Therefore, we construct B̃ ∈ BLR(S(I), rG, Nna) as follows:

∀σ × τ ∈ S(I),  B̃σ×τ = Bσ×τ^ε if σ × τ ∈ BA,  and B̃σ×τ = Bσ×τ otherwise.   (5.17)

The construction of M̃^{-1} is also very similar to the one in Hackbusch & Bebendorf [35]. The main idea is that the inverse mass matrix asymptotically tends towards a block-diagonal matrix. More precisely, it is shown that, for any block σ × τ ∈ S(I)^2,

∥M^{-1}_{σ×τ}∥ ≤ O( sqrt(#σ #τ) · c^{ (#σ #τ)^{1/(2d)} dist(Xσ, Xτ) } ) ∥M^{-1}∥

where c < 1 ([35], Lemma 4.2). Therefore, ∥M^{-1}_{σ×τ}∥ tends towards zero when #σ, #τ tend towards infinity (which is the case for a non-constant block size b), as long as dist(Xσ, Xτ) > 0, i.e., as long as σ × τ ∈ BA. Therefore, we construct M̃^{-1} ∈ BLR(S(I), 0, Nna) as follows:

∀σ × τ ∈ S(I),  M̃^{-1}_{σ×τ} = 0 if σ × τ ∈ BA,  and M̃^{-1}_{σ×τ} = M^{-1}_{σ×τ} otherwise.   (5.18)

Note that Equations 5.15 and 5.16 are the BLR equivalents of Equations 5.9 and 5.10, respectively. It now remains to derive a BLR arithmetic property similar to Theorem 5.3 which is given by Theorem 5.4 (for its proof we refer the reader to the appendix of the work by Amestoy et al. [J6]).

Theorem 5.4 BLR matrix product.— If A ∈ BLR(S(I), rA, qA) and B ∈ BLR(S(I), rB, qB) are BLR matrices then their product P = AB is a BLR matrix such that

P ∈ BLR(S(I), rP , qP ) with rP = csp min(rA, rB) + qArB + qBrA and qP = qAqB.


Note that the sparsity constant csp is not bounded but only appears in the term csp min(rA, rB) that will disappear when one of rA or rB is zero. Since A−1 can be approximated by M −1BM −1 (Equation 5.6), applying Theorem 5.4 on Equations 5.15 and 5.16 leads to

Ã^{-1} ∈ BLR(S(I), Nna^2 rG, Nna^3)   (5.19)

and from this approximant of A^{-1} we can derive an approximant of AΦΦ^{-1} for any Φ ⊂ I. For any Φ, Ψ ⊂ I, we also have AΦΨ ∈ BLR(S(I), 0, Nna), and therefore applying Theorem 5.4 to (5.5) and (5.19) implies in turn:

S̃ ∈ BLR(S(I), Nna^4 rG, Nna^5)   (5.20)

Therefore, there are at most Nna^5 = O(1) non-admissible blocks (on any row or column) that are not considered low-rank candidates and are left full-rank. The rest are low-rank and their rank is bounded by Nna^4 rG. In addition to the bound rG, which is already quite large [33], the constant Nna^4 can be very large. However, our bound is extremely pessimistic: in Section 5.4.5, we will experimentally validate that, in reality, the ranks are much smaller. Similarly, the bound Nna^5 on the number of non-admissible blocks is also very pessimistic. In conclusion, the ranks are bounded by O(rG), i.e. the BLR bound only differs from the hierarchical one by a constant. In the following, the bound Nna^4 rG will simply be referred to as r.

5.4.3 Complexity of the dense BLR factorization

In this section we compute the complexity of the BLR-based factorization of a dense matrix. As long as a bound on the ranks holds, similar to the one we have established in Section 5.4.2, the complexity computations reported in this section hold, and thus the following results may be applicable to a broader context than the resolution of discretized PDEs. Here we derive bounds for the complexity of the factorization of a dense matrix by means of the FSCU or UFSC methods (which are equivalent) described in Algorithms 5.2 and 5.3. Although we compute the complexity for the LDLT factorization, note that the complexity of the BLR factorization is the same in LU or LDLT, up to a constant. Note also that, in Algorithms 5.2 and 5.3, operations on non-admissible blocks are omitted for the sake of simplicity (but are taken into account in the complexity computations). We will extend the computation of the complexity to the sparse multifrontal case in Section 5.4.4. First, we compute the complexity of factorizing a dense matrix of order m. The costs of the main steps Factor, Solve, Compress, Inner and Outer Product necessary to compute the factorization of a matrix of order m are shown in Table 5.6 (third column) and are derived from Section 5.1.2. This cost depends on the type (full-rank or low-rank) of the block(s) on which the operation is performed (second column). Note that the Inner Product operation can take the form of a product of two low-rank blocks (LR-LR), two full-rank blocks (FR-FR) or a low-rank block and a full-rank one (LR-FR). We note b the block size and p = m/b the number of blocks per row and/or column. We assume here that the cost of compressing an admissible block is O(b^2 r), which is the case if a truncated rank-revealing QR factorization is used. We can then use (5.20) to compute the cost of the factorization: the boundedness of Nna^5 = O(1) ensures that only a constant number of blocks in each row are full-rank.


step | type | cost | number | C_step(b, p) | C_step(m, x)
Factor | FR | O(b^3) | O(p) | O(p b^3) | O(m^{1+2x})
Solve | FR-FR | O(b^3) | O(p^2) | O(p^2 b^3) | O(m^{2+x})
Compress | LR | O(b^2 r) | O(p^2) | O(p^2 b^2 r) | O(m^2 r)
Inner Prod. | LR-LR | O(b r^2) | O(p^3) | O(p^3 b r^2) | O(m^{3-2x} r^2)
Inner Prod. | LR-FR | O(b^2 r) | O(p^2) | O(p^2 b^2 r) | O(m^2 r)
Inner Prod. | FR-FR | O(b^3) | O(p) | O(p b^3) | O(m^{1+2x})
Outer Prod. | LR | O(b^2 r) | O(p^3) | O(p^3 b^2 r) | O(m^{3-x} r)

Table 5.6: Main operations for the BLR (FSCU and UFSC variants) factorization of a dense matrix of order m, with blocks of size b, and low-rank blocks of rank at most r. We note p = m/b. type: type of the block(s) on which the operation is performed. cost: cost of performing the operation once. number: number of times the operation is performed. C_step(b, p): obtained by multiplying the cost and number columns (Equation 5.21). C_step(m, x): obtained with the assumption that b = O(m^x) (and thus p = O(m^{1-x})), for some x ∈ [0, 1].

From that we derive the fourth column of Table 5.6, which counts the number of blocks on which the step is performed. The BLR factorization cost of each step is then equal to

C_step(b, p) = cost_step × number_step    (5.21)

and is reported in the fifth column of Table 5.6. Then, we assume the block size b is of order O(m^x), where x is a real value in [0, 1], and thus the number of blocks p per row and/or column is of order O(m^{1-x}). By substituting b and p by their values, we compute C_step(m, x) in the last column. We can then compute the total flop complexity of the dense BLR factorization as the sum of the costs of all steps:

C(m, x) = O(r m^{3-x} + m^{2+x})    (5.22)
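To make the summation explicit (an illustrative intermediate step, using only the last column of Table 5.6), the total cost is

C(m, x) = O( m^{1+2x} [Factor] + m^{2+x} [Solve] + m^2 r [Compress] + m^{3-2x} r^2 + m^2 r + m^{1+2x} [Inner Product] + m^{3-x} r [Outer Product] ),

and, since the rank of a b × b block cannot exceed b, we may assume r ≤ b = O(m^x); the terms m^{3-2x} r^2, m^2 r and m^{1+2x} are then dominated by m^{3-x} r and m^{2+x}, which leaves the two terms of Equation 5.22.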

Similarly, the factor size complexity of a dense BLR matrix can be computed as

O(N_LR · br + N_FR · b^2) = O(p^2 br + N_na^5 p b^2) = O(p^2 br + p b^2)    (5.23)

where N_LR = O(p^2) and N_FR = O(p) are the numbers of low-rank and full-rank blocks in the matrix, respectively. Thus, the factor size complexity is:

M(m, x) = O(r m^{2-x} + m^{1+x})    (5.24)
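The same substitution spells out Equation 5.24 (an explicit intermediate step): with b = O(m^x) and p = O(m^{1-x}),

p^2 b r = O(m^{2-2x} · m^x · r) = O(r m^{2-x}),    p b^2 = O(m^{1-x} · m^{2x}) = O(m^{1+x}).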

It then remains to compute the optimal x^* which minimizes the complexity. We consider a general rank bound r = O(m^α), with α ∈ [0, 1]. Equations 5.22 and 5.24 become

C(m, x) = O(m^{3+α-x} + m^{2+x})    (5.25)
M(m, x) = O(m^{2+α-x} + m^{1+x})    (5.26)

respectively. Then, the optimal x^* is given by

x^* = (1 + α)/2    (5.27)
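This value follows from a simple balancing argument (an illustrative step): in Equation 5.25 the low-rank term decreases with x while the full-rank term increases with x, so the sum is minimized where the two exponents match, i.e.

3 + α - x = 2 + x  ⟺  x^* = (1 + α)/2,

and the same balancing applied to Equation 5.26 gives 2 + α - x = 1 + x, i.e. again x^* = (1 + α)/2,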

which leads to the optimal complexities

C(m) = C(m, x^*) = O(m^{2.5+α/2}) = O(m^{2.5} √r),    (5.28)
M(m) = M(m, x^*) = O(m^{1.5+α/2}) = O(m^{1.5} √r).    (5.29)

It is remarkable that the value of x^* is the same for both the flop and factor size complexities, i.e. that both complexities are minimized by the same x. This was not guaranteed, and is a desirable property as we do not need to choose which complexity to minimize at the expense of the other.

In particular, the case r = O(1) leads to complexities in O(m^{2.5}) for flops and O(m^{1.5}) for factor size, while the case r = O(√m) leads to O(m^{2.75}) for flops and O(m^{1.75}) for factor size. The link between dense and sparse rank bounds will be made in Section 5.4.4.

Note that the fully-structured BLR factorization (when A is available in compressed form at no cost, i.e. when the Compress step does not need to be performed) has the same complexity as the non-fully-structured factorization, since the Compress is asymptotically negligible with respect to the Solve step. This is not the case for hierarchical formats, where the construction of the compressed matrix, whose cost is in O(m^2 r) [37], becomes the bottleneck when it has to be performed.

5.4.4 From dense to sparse BLR complexity

The results derived in the previous section can be readily used to compute the complexity of a multifrontal factorization based on the use of the BLR format at each front. Here we rely on the result by George [74] discussed in Section 2.2.3.1. For the sake of readability, we report it below. We deal with a sparse matrix resulting from a square (d = 2) or cubic (d = 3) domain of dimension N. At each level ℓ of the separator tree, we need to factorize (2^d)^ℓ fronts of order O((N/2^ℓ)^{d-1}), for ℓ ranging from 0 to L = log_2(N). Therefore, the flop complexity C_MF(N) to factorize a sparse matrix of order N^d is

C_MF(N) = ∑_{ℓ=0}^{L} C_ℓ(N) = ∑_{ℓ=0}^{L} (2^d)^ℓ C((N/2^ℓ)^{d-1}),    (5.30)

where C_ℓ(N) is the cost of factorizing all the fronts on the ℓ-th level, i.e. C_ℓ(N) = (2^d)^ℓ C(m_ℓ) with m_ℓ = (N/2^ℓ)^{d-1}. Using the dense complexity Equation 5.28, we compute and report the value of C_ℓ(N) in Table 5.7 (second column). The overall complexity of the multifrontal factorization is obtained by solving the geometric series in Equation 5.30 and is reported in Table 5.7 (third column). Using Equation 5.29, we similarly compute the factor size complexity:

M_MF(N) = ∑_{ℓ=0}^{L} M_ℓ(N) = ∑_{ℓ=0}^{L} (2^d)^ℓ M((N/2^ℓ)^{d-1}),    (5.31)

and report the results in Table 5.7.
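As a worked instance of how Table 5.7 is obtained (an illustrative derivation for the 3D case, d = 3), each front at level ℓ has order m_ℓ = (N/2^ℓ)^2, so Equation 5.28 gives

C_ℓ(N) = (2^3)^ℓ O(m_ℓ^{2.5+α/2}) = O(2^{3ℓ} (N/2^ℓ)^{5+α}) = O(2^{-(2+α)ℓ} N^{5+α}),

which is a geometric series in ℓ with ratio 2^{-(2+α)} < 1; its sum is dominated by the ℓ = 0 term, so C_MF(N) = O(N^{5+α}) = O(N^5 √r), since r = O(m^α) = O(N^{2α}), as reported in Table 5.7.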

5.4.4.1 Complexity of the BLR variants

In this section we focus on the complexity of the variants described in Section 5.3.3. This is computed in a very similar way to the standard version (see Section 5.4.3). We provide the equivalent of Tables 5.6 and 5.7 for the BLR variants in Tables 5.8 and 5.9, respectively.


Flop complexity:
d | C_ℓ(N) | C_MF(N)
2D | O(2^{-(1+α)ℓ/2} N^{2.5+α/2}) | O(N^{2.5+α/2}) = O(N^{2.5} √r)
3D | O(2^{-(2+α)ℓ} N^{5+α}) | O(N^{5+α}) = O(N^5 √r)

Factor size complexity:
d | M_ℓ(N) | M_MF(N), α = 0 | M_MF(N), α > 0
2D | O(2^{(1-α)ℓ/2} N^{1.5+α/2}) | O(N^2) | O(N^2)
3D | O(2^{-αℓ} N^{3+α}) | O(N^3 log N) | O(N^{3+α}) = O(N^3 √r)

Table 5.7: Flop and factor size complexity of the BLR (standard UFSC variant) multifrontal factorization of a sparse matrix of order N^d. d: dimension. C_ℓ(N)/M_ℓ(N): flop/factor size complexity at level ℓ in the separator tree, computed using the dense complexity Equations 5.28 and 5.29. C_MF(N)/M_MF(N): total multifrontal flop/factor size complexity, computed using Equations 5.30 and 5.31.

In Table 5.8, we report the cost of each step of the factorization. Using the LUAR technique implies that the cost of the Outer Product operation is reduced thanks to the recompression of the accumulated updates; the recompression, however, has a cost which has to be accounted for in the complexity. This is illustrated in Table 5.8 with grayed-out rows. The UFCS (possibly with LUAR) variant, instead, reduces the complexity further thanks to the fact that Solve operations are mostly done on low-rank blocks rather than full-rank ones. By summing the cost of all steps, we obtain the flop complexity of the dense factorization. In the UFSC+LUAR variant, it is given by:

C(m, x) = O(r^2 m^{3-2x} + m^{2+x})    (5.32)

Compared to Equation 5.22, the low-rank term of the complexity has thus been reduced from O(r m^{3-x}) to O(r^2 m^{3-2x}) thanks to the recompression of the accumulated updates. The full-rank term O(m^{2+x}) remains the same. By recomputing the value of x^*, we achieve flop complexity gains: for r = O(m^α), C(m) becomes

C(m) = O(m^{2+(2α+1)/3}) = O(m^{7/3} r^{2/3}),    (5.33)

which yields in particular O(m^{7/3}) for r = O(1) and O(m^{8/3}) for r = O(√m). In the same way, the flop complexity of the dense factorization with the UFCS+LUAR variant is given by

C(m, x) = O(r^2 m^{3-2x} + m^{1+2x}).    (5.34)

This time, the full-rank term has been reduced from O(m^{2+x}) to O(m^{1+2x}). By recomputing x^*, we achieve further flop complexity gains:

C(m) = O(m^{2+α}) = O(m^2 r),    (5.35)

which yields in particular O(m^2) for r = O(1) and O(m^{2.5}) for r = O(√m). Note that for the UFCS+LUAR variant, the Compress step has become asymptotically dominant and thus the assumption that its cost is O(b^2 r) is now necessary to obtain the complexity reported in Equation 5.34. Note also that the factor size complexity is not affected by the BLR variant used.
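The optimal block sizes behind Equations 5.33 and 5.35 follow from the same exponent-balancing argument as before (an illustrative step, with r = O(m^α)):

UFSC+LUAR:  3 + 2α - 2x = 2 + x  ⟹  x^* = (1 + 2α)/3,  C(m) = O(m^{2+(2α+1)/3});
UFCS+LUAR:  3 + 2α - 2x = 1 + 2x  ⟹  x^* = (1 + α)/2,  C(m) = O(m^{2+α}).

In the latter case the Compress term O(m^2 r) = O(m^{2+α}) is of the same order as the total, which is why the O(b^2 r) assumption on its cost becomes necessary.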


UFSC+LUAR variant

step | type | cost | number | C_step(b, p) | C_step(m, x)
Factor | FR | O(b^3) | O(p) | O(p b^3) | O(m^{1+2x})
Solve | FR-FR | O(b^3) | O(p^2) | O(p^2 b^3) | O(m^{2+x})
Compress | LR | O(b^2 r) | O(p^2) | O(p^2 b^2 r) | O(m^2 r)
Inner Prod. | LR-LR | O(b r^2) | O(p^3) | O(p^3 b r^2) | O(m^{3-2x} r^2)
Inner Prod. | LR-FR | O(b^2 r) | O(p^2) | O(p^2 b^2 r) | O(m^2 r)
Inner Prod. | FR-FR | O(b^3) | O(p) | O(p b^3) | O(m^{1+2x})
Recompress | LR | O(b p r^2) | O(p^2) | O(p^3 b r^2) | O(m^{3-2x} r^2)
Outer Prod. | LR | O(b^2 r) | O(p^2) | O(p^2 b^2 r) | O(m^2 r)

UFCS+LUAR variant

step | type | cost | number | C_step(b, p) | C_step(m, x)
Factor | FR | O(b^3) | O(p) | O(p b^3) | O(m^{1+2x})
Solve | FR-FR | O(b^3) | O(p) | O(p b^3) | O(m^{1+2x})
Solve | LR-FR | O(b^2 r) | O(p^2) | O(p^2 b^2 r) | O(m^2 r)
Compress | LR | O(b^2 r) | O(p^2) | O(p^2 b^2 r) | O(m^2 r)
Inner Product | LR-LR | O(b r^2) | O(p^3) | O(p^3 b r^2) | O(m^{3-2x} r^2)
Inner Product | LR-FR | O(b^2 r) | O(p^2) | O(p^2 b^2 r) | O(m^2 r)
Inner Product | FR-FR | O(b^3) | O(p) | O(p b^3) | O(m^{1+2x})
Recompress | LR | O(b p r^2) | O(p^2) | O(p^3 b r^2) | O(m^{3-2x} r^2)
Outer Product | LR | O(b^2 r) | O(p^2) | O(p^2 b^2 r) | O(m^2 r)

Table 5.8: Main operations for the factorization of a dense matrix of order m, with blocks of size b, and low-rank blocks of rank at most r. We note p = m/b. type: type of the block(s) on which the operation is performed. cost: cost of performing the operation once. number: number of times the operation is performed. C_step(b, p): obtained by multiplying the cost and number columns (Equation 5.21). C_step(m, x): obtained with the assumption that b = O(m^x) (and thus p = O(m^{1-x})), for some x ∈ [0, 1].

The sparse flop complexities are derived from the dense ones in the same way as they are for the standard FSCU/UFSC variant. The results are reported in Table 5.9. A summary of the sparse complexities for all BLR variants, as well as the full-rank and H complexities, is given in Table 5.10 for the cases where r = O(1) and r = O(√m).

5.4.5 Experimental results

In this section we compare the experimental complexity of the full-rank solver with each of the previously presented BLR variants (FSCU/UFSC, UFSC+LUAR, UFCS+LUAR). We refer the reader to our original paper on the complexity of the BLR factorization [J6] or to Mary's PhD thesis [125] for a richer set of experiments that include an analysis of the influence of the threshold ε and the block size. All the experiments in this chapter were performed on the brunch system (see description in Section A.2).

To compute our complexity estimates, we use least-squares estimation to compute the coefficients {β_i}_i of a regression function f such that X_fit = f(N, {β_i}_i) fits the observed data X_obs. We use the following regression function:

X_fit = e^{β_1^*} N^{β_2^*}  with  (β_1^*, β_2^*) = argmin_{β_1, β_2} ‖log X_obs - β_1 - β_2 log N‖^2.    (5.36)
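A minimal sketch of how the fit of Equation 5.36 can be computed (illustrative code only, not the scripts actually used for the experiments; the sample mesh sizes and flop counts below are hypothetical): in log-log space the problem reduces to ordinary linear least squares, solved here in closed form.

```c
#include <math.h>
#include <stdio.h>

/* Fit X ~ exp(beta1) * N^beta2 by linear least squares on (log N, log X). */
static void loglog_fit(const double *N, const double *X, int n,
                       double *beta1, double *beta2)
{
    double su = 0.0, sv = 0.0, suu = 0.0, suv = 0.0;
    for (int i = 0; i < n; i++) {
        double u = log(N[i]), v = log(X[i]);
        su += u; sv += v; suu += u * u; suv += u * v;
    }
    /* Closed-form solution of the 2x2 normal equations. */
    *beta2 = (n * suv - su * sv) / (n * suu - su * su);
    *beta1 = (sv - *beta2 * su) / n;
}

int main(void)
{
    /* Hypothetical observed flop counts for increasing mesh sizes N. */
    double N[] = {64, 96, 128, 160, 192};
    double X[] = {8.1e11, 2.9e12, 7.3e12, 1.5e13, 2.8e13};
    double beta1, beta2;
    loglog_fit(N, X, 5, &beta1, &beta2);
    printf("X_fit = %.3g * N^%.2f\n", exp(beta1), beta2);
    return 0;
}
```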


UFSC+LUAR
d | C_ℓ(N) | C_MF(N)
2D | O(2^{-(1+2α)ℓ/3} N^{2+(2α+1)/3}) | O(N^{2+(2α+1)/3}) = O(N^{7/3} r^{2/3})
3D | O(2^{-(5+4α)ℓ/3} N^{4+(4α+2)/3}) | O(N^{4+(4α+2)/3}) = O(N^{14/3} r^{2/3})

UFCS+LUAR / UCFS+LUAR / CUFS
d | C_ℓ(N) | C_MF(N), α = 0 | C_MF(N), α > 0
2D | O(2^{-αℓ} N^{2+α}) | O(N^2 log N) | O(N^{2+α}) = O(N^2 r)
3D | O(2^{-(1+2α)ℓ} N^{4+2α}) | O(N^{4+2α}) = O(N^4 r) | O(N^{4+2α}) = O(N^4 r)

Table 5.9: Flop complexity of the BLR multifrontal factorization of a sparse matrix of order N^d. d: dimension. C_ℓ(N): flop complexity at level ℓ in the separator tree, computed using the dense complexity Equations 5.33 and 5.35 for the UFSC+LUAR and UFCS+LUAR variants, respectively. C_MF(N): total multifrontal flop complexity, computed using Equation 5.30.

r = O(1)
 | operations 2D | operations 3D | factor size 2D | factor size 3D
BLR UFSC | O(n^1.25) | O(n^1.67) | O(n) | O(n log n)
BLR UFSC+LUAR | O(n^1.17) | O(n^1.56) | O(n) | O(n log n)
BLR UFCS+LUAR | O(n log n) | O(n^1.33) | O(n) | O(n log n)
H | O(n log n) | O(n^1.33) | O(n) | O(n)
H (fully-structured) | O(n) | O(n) | O(n) | O(n)

r = O(√m)
 | operations 2D | operations 3D | factor size 2D | factor size 3D
BLR UFSC | O(n^1.5) | O(n^1.83) | O(n log n) | O(n^1.17)
BLR UFSC+LUAR | O(n^1.5) | O(n^1.78) | O(n log n) | O(n^1.17)
BLR UFCS+LUAR | O(n^1.5) | O(n^1.67) | O(n log n) | O(n^1.17)
H | O(n^1.5) | O(n^1.67) | O(n) | O(n)
H (fully-structured) | O(n) | O(n^1.33) | O(n) | O(n)

Table 5.10: Flop and factor size complexities of the BLR and H multifrontal factorization of a system of n unknowns, considering the cases r = O(1) and r = O(√m) = O(N^{(d-1)/2}).

We provide the experimental complexities for two different problems: the Poisson problem and the Helmholtz problem. The Poisson problem generates the symmetric positive definite matrix A from a 7-point finite-difference discretization of equation

∆u = f.

We perform the computations in real double-precision arithmetic. For the Poisson problem, we will use a low-rank threshold ε = 10^{-10} with no particular application in mind. The Helmholtz problem builds the matrix A as the complex-valued unsymmetric impedance matrix resulting from the finite-difference discretization of the heterogeneous Helmholtz equation, that is the second-order visco-acoustic time-harmonic wave equation

for the pressure p:

(-∆ - ω^2 / v(x)^2) u(x, ω) = s(x, ω),

where ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is the time-harmonic wavefield solution to the forcing term s(x, ω). The aim is the modeling of visco-acoustic wave propagation in a 3D visco-acoustic homogeneous medium (see Section 5.3.4.1) parameterized by wavespeed (4000 m/s), density (1 kg/m^3), and quality factor (10000, no attenuation). The matrix A is built for an infinite medium; this implies that the input grid is augmented with PML absorbing layers. The frequency is fixed and equal to 4 Hz. The grid interval h is computed such that it corresponds to 4 grid points per wavelength. Computations are done in complex single-precision arithmetic and the chosen low-rank threshold is ε = 10^{-4}.

For the Poisson problem, the rank bound is in O(1) [35]. For the Helmholtz problem, although there is no rigorous proof of it, the rank bound is assumed to be O(N) = O(√m) in the related literature [71, 157, 163]. Thus, we will use the Poisson and Helmholtz problems to experimentally validate the complexities computed in Table 5.10. For both Poisson and Helmholtz, in all the following experiments, the backward error is in good agreement with the low-rank threshold used. Both the Poisson and Helmholtz problems were discretized using the finite-difference method rather than the finite-element one, but this is acceptable as both methods are equivalent on equispaced meshes [132].

In Figures 5.17 and 5.18, we compare the flop complexity of the full-rank solver with each of the BLR variants previously presented (UFSC, UFSC+LUAR, UFCS+LUAR) for the Poisson problem and the Helmholtz problem, respectively. The FSCU variant is equivalent to the UFSC one and, thus, it will not be presented here. The results show that each new variant improves the complexity. Note that we obtain the expected quadratic complexity of the full-rank version. Results with both geometric nested dissection (Figures 5.17(a) and 5.18(a)) and with a purely algebraic ordering computed by METIS (Figures 5.17(b) and 5.18(b)) are also reported.

We first analyze the results obtained with geometric nested dissection and compare them with our theoretical results. For Poisson, the standard BLR (UFSC) version achieves a complexity in O(n^1.45). Moreover, the constant in the big O is equal to 2105, which is quite reasonable, and leads to a substantial improvement of the number of flops performed with respect to the full-rank version. This confirms that the theoretical rank bounds (N_na^4 r_G) are very pessimistic, as the experimental constants are in fact much smaller. Further compression in the UFSC+LUAR variant lowers the complexity to O(n^1.39), while UFCS+LUAR reaches the lowest complexity of the variants, in O(n^1.29). Although the constants increase with the new variants, they also remain relatively small and they effectively reduce the number of operations with respect to the standard variant, even for the smaller mesh sizes. The same trend is observed for Helmholtz, with complexities in O(n^1.85) for UFSC, O(n^1.79) for UFSC+LUAR, and finally O(n^1.74) for UFCS+LUAR. Thus, the numerical results are in good agreement with the theoretical bounds reported in Table 5.10. We also analyze the influence of the ordering on the complexity.
We observe that even though the METIS ordering slightly degrades the complexity, results remain close to the geometric nested dissection ordering and still in good agreement with the theoretical bounds. This is a very important property of the BLR factorization as it allows us to remain in a purely algebraic (black box) framework, an essential property for a general purpose solver.


[Figure: flop count vs. mesh size N. (a) Nested Dissection ordering (geometric): FR ≈ 5n^2.02, UFSC ≈ 2105n^1.45, UFSC+LUAR ≈ 3626n^1.39, UFCS+LUAR ≈ 11405n^1.29. (b) METIS ordering (purely algebraic): FR ≈ 3n^2.05, UFSC ≈ 1068n^1.50, UFSC+LUAR ≈ 2235n^1.42, UFCS+LUAR ≈ 6175n^1.33.]

Figure 5.17: Flop complexity of each BLR variant (Poisson, ε = 10^{-10}).

[Figure: flop count vs. mesh size N. (a) Nested Dissection ordering (geometric): FR ≈ 12n^2.01, UFSC ≈ 31n^1.85, UFSC+LUAR ≈ 48n^1.79, UFCS+LUAR ≈ 89n^1.74. (b) METIS ordering (purely algebraic): FR ≈ 8n^2.04, UFSC ≈ 22n^1.87, UFSC+LUAR ≈ 34n^1.82, UFCS+LUAR ≈ 60n^1.77.]

Figure 5.18: Flop complexity of each BLR variant (Helmholtz, ε = 10^{-4}).

To compute the factor size complexity of the BLR solver, we study the evolution of the number of entries in the factors, i.e., the compression rate of L and U. Note that the global compression rate would be even better, because the local matrices that need to be stored during the multifrontal factorization compress more than the factors. In Figure 5.19, we plot the factor size complexity using the METIS ordering for both the Poisson and Helmholtz problems. The different BLR variants do not impact the factor size complexity. Here again, the results are in good agreement with the bounds computed in Table 5.10. The complexity is of order O(n^1.05 log n) for Poisson and O(n^1.26) for Helmholtz.


[Figure: factor size vs. mesh size N. (a) Poisson problem (ε = 10^{-10}): FR ≈ 3n^1.40, BLR ≈ 12n^1.05 log n. (b) Helmholtz problem (ε = 10^{-4}): FR ≈ 15n^1.36, BLR ≈ 32n^1.26.]

Figure 5.19: Factor size complexity with METIS ordering.

5.5 Performance and scalability

In this section we address the performance of a BLR-based multifrontal solver. Specifically, we will study how the reduction of the complexity of the multifrontal method shown in the previous section can be converted into an actual reduction of the execution time. We discuss the main limitations of the basic FSCU variant and show how the other variants presented in Section 5.3.3 can improve the performance by reducing the complexity of the factorization as well as by improving the efficiency of operations.

Throughout this section we will provide experimental results on matrix S3 (see Appendix A.1) from the electromagnetics application described in Section 5.3.4.2 to illustrate and analyze the behavior of the different algorithms. In Section 5.5.4 we will provide experimental results on the complete set of matrices in Table A.2. This includes the matrices described in Sections 5.3.4.1 and 5.3.4.2; consistently with what is illustrated in these sections, the BLR approximation threshold for these problems was set to 10^{-3} and 10^{-7}, respectively. In addition to these, we included matrices from an industrial application in 3D structural mechanics provided by Électricité De France (EDF); these are the perf* matrices. EDF has to guarantee the technical and economical control of its means of production and transportation of electricity. The safety and the availability of the industrial and engineering installations require mechanical studies, which are often based on numerical simulations. These simulations are carried out using Code_Aster³ and require the solution of sparse linear systems such as the ones considered here. A previous study [159] showed that using BLR with ε = 10^{-9} leads to an accurate enough solution for this class of problems. For all experiments, we have used a right-hand side b such that the solution x is the vector containing only ones. The experiments were done on the brunch machine (see Appendix A.2 for the details); the sustained peak of a core, measured with a dense matrix-matrix product (the BLAS DGEMM routine), is 47.1 Gflop/s.

Both the nested-dissection matrix reordering and the BLR clustering of the unknowns are computed with METIS in a purely algebraic way (i.e., without any knowledge of the geometry of the problem domain).

³ http://www.code-aster.org

5.5.1 Performance analysis of sequential FSCU algorithm

In this section, we analyze the performance of the FSCU algorithm (described in Algorithm 5.2) in a sequential setting. Our analysis underlines several issues, which will be addressed in subsequent sections. In Table 5.11, we compare the number of flops and the execution time of the sequential FR and BLR factorizations. While the use of BLR reduces the number of flops by a factor 7.7, the time is only reduced by a factor 3.3. Thus, the potential gain in terms of flops is not fully translated in terms of time.

 | FR | BLR | ratio
flops (×10^12) | 77.97 | 10.19 | 7.7
time (s) | 7390.1 | 2241.9 | 3.3

Table 5.11: Sequential run (1 thread) on matrix S3.

To understand why, we report in Table 5.12 the time spent in each step of the factorization, in the FR and BLR cases. The relative weight of each step is also provided as a percentage of the total. In addition to the four main steps Factor, Solve, Compress and Update, we also provide the time spent in parts with low arithmetic intensity (LAI parts). This includes the time spent in assembly, memory copies and factorization of the fronts at the bottom of the tree, which are too small to benefit from BLR and are thus treated in FR.

step | FR flops (×10^12) | FR % | FR time (s) | FR % | BLR flops (×10^12) | BLR % | BLR time (s) | BLR %
Factor+Solve | 1.51 | 1.9 | 671.0 | 9.1 | 1.51 | 14.9 | 671.0 | 29.9
Update | 76.22 | 97.8 | 6467.0 | 87.5 | 7.85 | 77.0 | 1063.7 | 47.4
Compress | 0.00 | 0.0 | 0.0 | 0.0 | 0.59 | 5.8 | 255.1 | 11.4
LAI parts | 0.24 | 0.3 | 252.1 | 3.4 | 0.24 | 2.3 | 252.1 | 11.2
Total | 77.97 | 100.0 | 7390.1 | 100.0 | 10.19 | 100.0 | 2241.9 | 100.0

Table 5.12: Performance analysis of the sequential run of Table 5.11 on matrix S3.

The FR factorization is clearly dominated by the Update, which represents 87.5% of the total time. In BLR, the Update operations are done exploiting the low-rank property of the blocks and thus the number of operations performed in the Update is divided by a factor 9.7. The Factor+Solve and LAI steps remain in FR and thus do not change. From this result, we can identify three main issues with the performance of the BLR factorization:

Issue 1 (lower granularity): the flop reduction by a factor 9.7 in the Update is not fully captured, as its execution time is only reduced by a factor 6.1. This is due to the lower granularity of the operations involved in low-rank products, which thus have a lower performance: the speed of the Update step is 47.1 GF/s in FR and 29.5 GF/s in BLR.


Issue 2 (higher relative weight of the FR parts): because the Update is reduced in BLR, the relative weight of the parts that remain FR (Factor, Solve, and LAI parts) increases from 12.5% to 41.1%. Thus, even if the Update step is further accelerated, we cannot expect the global reduction to follow as the FR part will become the bottleneck.

Issue 3 (cost of the Compress step): even though the overhead cost of the Compress step is negligible in terms of flops (5.8% of the total), it is a very slow operation (9.2 GF/s) and thus represents a non-negligible part of the total time (11.4%).

A visual representation of this analysis is given in Figure 5.20 (compare Figures 5.20(a) and 5.20(b)). In the next section, we first extend the BLR factorization to the multithreaded case, for which the previous observations are even more critical. Issues 1 and 2 will then be addressed by the algorithmic variants of the BLR factorization in Section 5.3.3. Issue 3 is currently being investigated and we will not address it in this document.

[Figure: normalized (FR = 100%) breakdown into LAI parts, Factor+Solve, Update and Compress. (a) Normalized flops. (b) Normalized time (1 thread). (c) Normalized time (24 threads).]

Figure 5.20: Normalized (FR = 100%) flops and time on matrix S3.

5.5.2 Multithreading the BLR factorization

In this section, we describe the basic shared-memory parallelization of the BLR FSCU factorization (Algorithm 5.2) and present a detailed analysis of its behavior. We will then present techniques that aim at improving the performance in a parallel, shared-memory setting.

5.5.2.1 Performance analysis of multithreaded FSCU algorithm

Our reference Full-Rank implementation is based on a fork-join approach combining OpenMP directives with multithreaded BLAS libraries. While this approach can have limited performance on very small matrices, on the set of problems considered it achieves quite satisfactory speedups on 24 threads (around 20 for the largest problems) because the bottleneck consists of matrix-matrix product operations. This approach will be taken as a reference for our performance analysis.

In the BLR factorization, the operations have a finer granularity and thus a lower speed and a lower potential for efficiently exploiting multithreaded BLAS. To overcome this obstacle, more OpenMP-based multithreading exploiting serial BLAS has been introduced. This allows for a larger granularity of computations per thread than multithreaded BLAS on low-rank kernels. In our implementation, we simply parallelize the loops of the Compress and Update operations on different blocks (lines 4 and 7-8) of Algorithm 5.2.
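A minimal sketch of this loop-level parallelization (illustrative code only, not the actual MUMPS kernels; blr_block, compress_block and update_block are hypothetical stand-ins for serial-BLAS routines), using the dynamic schedule with a chunk size of 1 discussed below to cope with the irregular, rank-dependent cost of each block:

```c
#include <omp.h>

/* Hypothetical per-block kernels (stubs standing in for serial-BLAS code). */
typedef struct { int rows, cols, rank; double *data; } blr_block;

static void compress_block(blr_block *b) { (void)b; /* truncated RRQR here */ }
static void update_block(const blr_block *bik, const blr_block *bkj,
                         blr_block *bij)
{ (void)bik; (void)bkj; (void)bij; /* LR or FR product and subtraction here */ }

/* One step k of the BLR factorization of a front split into p x p blocks:
   compress the blocks of panel k, then update the trailing sub-matrix.
   The dynamic schedule (chunk size 1) copes with the rank-dependent,
   irregular cost of each block. */
void blr_compress_and_update(blr_block **B, int k, int p)
{
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = k + 1; i < p; i++)
        compress_block(&B[i][k]);

    #pragma omp parallel for collapse(2) schedule(dynamic, 1)
    for (int i = k + 1; i < p; i++)
        for (int j = k + 1; j < p; j++)
            update_block(&B[i][k], &B[k][j], &B[i][j]);
}
```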


The Factor+Solve step remains full-rank, as well as the FR factorization of the fronts at the bottom of the assembly tree. Because each block has a different rank, the task load of the parallel loops is very irregular in the BLR case. To account for this irregularity, we use the dynamic OpenMP schedule (with a chunk size equal to 1), which achieves the best performance. In Table 5.13, we compare the execution time of the FR and BLR factorizations on 24 threads. The multithreaded FR factorization achieves a speedup of 14.5 on 24 threads. However, the BLR factorization achieves a much lower speedup of 7.3. The gain factor of BLR with respect to FR is therefore reduced from 3.3 to 1.7.

 | FR | BLR | ratio
time (1 thread) | 7390.1 | 2241.9 | 3.3
time (24 threads) | 508.5 | 306.8 | 1.7
speedup | 14.5 | 7.3 |

Table 5.13: Multithreaded run on matrix S3.

The BLR multithreading is thus less efficient than the FR one. To understand why, we provide in Table 5.14 the time spent in each step for the multithreaded FR and BLR factorizations. We additionally provide for each step the speedup achieved on 24 threads.

step | FR time (s) | FR % | FR speedup | BLR time (s) | BLR % | BLR speedup
Factor+Solve | 38.9 | 7.7 | 17.3 | 38.9 | 12.7 | 17.3
Update | 361.2 | 71.0 | 17.9 | 121.6 | 39.6 | 8.8
Compress | 0.0 | 0.0 | | 37.9 | 12.4 | 6.7
LAI parts | 108.4 | 21.3 | 2.3 | 108.4 | 35.3 | 2.3
Total | 508.5 | 100.0 | 14.5 | 306.8 | 100.0 | 7.3

Table 5.14: Performance analysis of multithreaded run (24 threads) of Table 5.13 on matrix S3.

From this analysis, one can identify two additional issues related to the multithreading of the BLR factorization:

Issue 4 (low arithmetic intensity parts become critical): the LAI parts expectedly achieve a very low speedup of 2.3. While their relative weight with respect to the total remains reasonably limited in FR, it becomes quite significant in BLR, with over 35% of time spent in them. Thus, the impact of the poor multithreading of the LAI parts is higher on the BLR factorization than on the FR one.

Issue 5 (scalability of the BLR Update): not only is the BLR Update less efficient than the FR one in sequential, it also achieves a lower speedup of 8.8 on 24 threads, compared to a FR speedup of 17.9. This comes from the fact that the BLR Update, due to its smaller granularities, is limited by the speed of memory transfers instead of the CPU peak as in FR. As a consequence, the Outer Product operation runs at the poor speed of 8.8 GF/s, compared to 35.2 GF/s in FR.

A visual representation of this analysis is given on Figure 5.20 (compare Figures 5.20(b) and 5.20(c)).


In the rest of this section, we will revisit our algorithmic choices to address both of these issues.

5.5.2.2 Exploiting tree-based multithreading

In our standard shared-memory implementation, multithreading is exploited at the node parallelism level only, i.e. different fronts are not factored concurrently. However, in multifrontal methods, multithreading may exploit both node and tree parallelism. Such an approach has been proposed, in the FR context, by L'Excellent et al. [110] and relies on the idea of separating the fronts by a so-called L0 layer, as illustrated in Figure 5.21. Each subtree rooted at the L0 layer is treated sequentially by a single thread; therefore, below the L0 layer pure tree parallelism is exploited by using all the available threads to process multiple sequential subtrees concurrently. When all the sequential subtrees have been processed, the approach reverts to pure node parallelism: all the fronts above the L0 layer are processed sequentially (i.e., one after the other) but all the available threads are used to assemble and factorize each one of them.

[Figure: assembly tree split by the L0 layer; fronts above the layer are processed with node parallelism (all threads thr0-3 on each front), while the subtrees below the layer are processed with tree parallelism (one thread per subtree: thr0, thr1, thr2, thr3).]

Figure 5.21: Illustration with four threads of how both node and tree multithreading can be exploited.

In Table 5.15, we quantify and analyze the impact of this strategy on the BLR factorization. The majority of the time spent in LAI parts is localized under the L0 layer. Indeed, all the fronts too small to benefit from BLR are under it; in addition, the time spent in assembly and memory copies for the fronts under the L0 layer represents 60% of the total time spent in the assembly and memory copies. Therefore, the LAI parts are significantly accelerated, by a factor over 2, by exploiting tree multithreading. In addition, the other steps (the Update and especially the Compress) are also accelerated thanks to the improved multithreading behavior of the relatively smaller BLR fronts under the L0 layer, which do not expose much node parallelism. Please note that the relative gain due to introducing tree multithreading can be large even in FR, for 2D or very small 3D problems, for which the relative weight of the LAI parts is important. However, for large 3D problems the relative weight of the LAI parts is limited, and the overall gain in FR remains marginal. In BLR, the weight of the LAI parts is much more important, so that exploiting tree parallelism becomes critical: the overall gain is significant in BLR. We have thus addressed Issue 4, identified in Subsection 5.5.2.1. Exploiting tree multithreading is thus very critical in the BLR context. It will be used for the rest of the experiments for both FR and BLR.


step | FR time (s) | FR % | FR speedup | BLR time (s) | BLR % | BLR speedup
Factor+Solve | 33.2 | 7.9 | 20.2 | 33.2 | 15.1 | 20.2
Update | 331.7 | 79.4 | 19.5 | 110.2 | 50.0 | 9.7
Compress | 0.0 | 0.0 | | 24.1 | 10.9 | 10.6
LAI parts | 53.0 | 12.7 | 4.8 | 53.0 | 24.0 | 4.8
Total | 417.9 | 100.0 | 17.4 | 220.5 | 100.0 | 10.2

Table 5.15: Execution time of FR and BLR factorizations on matrix S3 on 24 threads, exploiting both node and tree parallelism.

5.5.2.3 Right-looking vs. Left-looking

Algorithm 5.2 has been presented in its Right-looking (RL) version. In Table 5.16, we compare it to its Left-looking (LL) equivalent, referred to as UFSC (Algorithm 5.3). The RL and LL variants perform the same operations but in a different order, which results in a different memory access pattern [63].

parallelism | step | FR RL | FR LL | BLR RL | BLR LL
1 thread | Update | 6467.0 | 6549.8 | 1063.7 | 899.1
1 thread | Total | 7390.1 | 7463.9 | 2241.9 | 2074.5
24 threads, node+tree// | Update | 331.7 | 335.6 | 110.2 | 66.9
24 threads, node+tree// | Total | 417.9 | 420.6 | 220.5 | 174.7

Table 5.16: Execution time of Right- and Left-looking factorizations on matrix S3.

The impact of using a RL or LL factorization is mainly observed on the Update step. In FR, there is almost no difference between the two, RL being slightly (less than 1%) faster than LL. In BLR, however, the Update is significantly faster in LL than in RL. This effect is especially clear on 24 threads (40% faster Update, which leads to a global gain of 20%). We explain this result by a lower volume of memory transfers in LL BLR than in RL BLR. As illustrated in Figure 5.22, during the BLR LDL^T factorization of a p × p block matrix, the Update will require loading the following blocks stored in main memory:

• in RL (Figure 5.22(a)), at each step k, the FR blocks of the trailing sub-matrix are written and therefore they are loaded many times (at each step of the factorization), while the LR blocks of the current panel are read once and never loaded again.

• in LL (Figure 5.22(b)), at each step k, the FR blocks of the current panel are written for the first and last time of the factorization, while the LR blocks of all the previous panels are read, and therefore they are loaded many times during the entire factorization.

Thus, while the number of loaded blocks is roughly the same in RL and LL (which explains the absence of difference between the RL FR and LL FR factorizations), the difference lies in the fact that the LL BLR factorization tends to load more often LR blocks and fewer FR blocks, while the RL one has the opposite behavior. To be precise:


[Figure: in RL, the blocks of the trailing sub-matrix are written at each step while the current panel is read once; in LL, the current panel is written once while the previous panels are read at each step.]

Figure 5.22: Illustration of the memory access pattern in the RL and LL BLR Update during step k of the factorization of a matrix of p × p blocks (here, p = 8 and k = 4).

• Under the assumption that one FR block and two LR blocks fit in cache, the LL BLR factorization loads O(p2) FR blocks and O(p3) LR blocks.

• Under the assumption that one FR block and an entire LR panel fit in cache (which is a stronger assumption so the number of loaded blocks may in fact be even worse), the RL BLR factorization loads O(p3) FR blocks and O(p2) LR blocks.
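The two memory access patterns discussed above essentially correspond to two loop orderings (an illustrative sketch only; update_block is a hypothetical kernel applying the contribution of one panel to one block):

```c
/* Hypothetical kernel: apply the contribution of panel k to block (i,j). */
static void update_block(int i, int j, int k) { (void)i; (void)j; (void)k; }

/* Right-looking: after factoring panel k, immediately update the whole
   trailing sub-matrix; the (mostly FR) trailing blocks are rewritten at
   every step, while the LR blocks of panel k are read only once. */
void update_right_looking(int k, int p)
{
    for (int i = k + 1; i < p; i++)
        for (int j = k + 1; j < p; j++)
            update_block(i, j, k);
}

/* Left-looking: before factoring panel k, apply to it all the updates from
   the previously factored panels; the (FR) blocks of panel k are written
   once and for all, while the LR blocks of panels 0..k-1 are re-read at
   every step of the factorization. */
void update_left_looking(int k, int p)
{
    for (int j = 0; j < k; j++)
        for (int i = k; i < p; i++)
            update_block(i, k, j);
}
```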

Thus, switching from RL to LL reduces the volume of memory transfers and therefore accelerates the BLR factorization, which addresses Issue 5, identified in Section 5.5.2.1. Throughout the rest of this chapter, the best algorithm is considered: LL for BLR and RL for FR. Thanks to both the tree multithreading and the Left-looking BLR factorization, the factor of gain due to BLR with respect to FR on 24 threads has increased from 1.7 (Table 5.13) to 2.4 (Table 5.16). Next, we show how the algorithmic variants of the BLR factorization can further improve its performance.

5.5.3 BLR factorization variants

In this section, we study the UFSC+LUAR and UFCS+LUAR BLR factorization variants. In Section 5.4, we have proved that they lead to a lower theoretical complexity. Here, we quantify the flop reduction achieved by these variants and how well this flop reduction can be translated into a time reduction. We analyze how they can improve the efficiency and scalability of the factorization.

5.5.3.1 LUAR: Low-rank Updates Accumulation and Recompression

We begin with the UFSC+LUAR variant, i.e. Algorithm 5.3 with the modified LUAR-Update of Algorithm 5.5.


 | UFSC | +LUA | +LUAR
average size of Outer Product | 16.5 | 61.0 | 32.8
flops: Outer Product (×10^12) | 3.76 | 3.76 | 1.59
flops: Recompress (×10^9) | 0.00 | 0.00 | 5.39
flops: Total (×10^12) | 10.19 | 10.19 | 8.15
time (s): Outer Product | 21.4 | 14.0 | 6.0
time (s): Recompress | 0.0 | 0.0 | 1.2
time (s): Total | 174.7 | 167.1 | 160.0
speed (GF/s): Outer Product | 29.3 | 44.7 | 44.4
speed (GF/s): Recompress | | | 0.7
speed (GF/s): Total | 9.7 | 10.2 | 8.5

Table 5.17: Performance analysis of the UFSC+LUAR factorization on matrix S3 on 24 threads.

The LUAR algorithm has two advantages: first, accumulating the update matrices together leads to higher granularities in the Outer Product step (line 8 of Algorithm 5.5), which is thus performed more efficiently. This should address Issue 1, identified in Section 5.5.1. Second, it allows for additional compression, as explained in Section 5.4.4.1.

In Table 5.17, we analyze the performance of the UFSC+LUAR variant. We separate the gain due to the accumulation (UFSC+LUA, without recompression) and the gain due to the recompression (UFSC+LUAR). We provide the flops, time and speed of both the Outer Product (which is the step impacted by this variant) and the total (to show the global gain). We also provide the average (inner) size of the Outer Product operation, which corresponds to the rank of the accumulated update C̃_ik^(acc) on line 8 in Algorithm 5.5. It also corresponds to the number of columns of X_S and Y_S in Figure 5.11.

Thanks to the accumulation, the average size of the Outer Product increases from 16.5 to 61.0. As illustrated by Figure 5.23, this higher granularity improves the speed of the Outer Product from 29.3 to 44.7 GF/s (compared to a peak of 47.1 GF/s) and thus accelerates it by 35%. The impact of the accumulation on the total time depends on both the matrix and the computer properties and will be further discussed in Section 5.5.4.

Next, we analyze the gain obtained by recompressing the accumulated low-rank updates (Figure 5.11). While the total flops are reduced by 20%, the execution time is only accelerated by 5%. This is partly due to the fact that the Outer Product only represents a small part of the total, but could also come from two other reasons:

• The recompression decreases the average size of the Outer Product back to 32.8. As illustrated by Figure 5.23, its speed remains at 44.4 GF/s and is thus not significantly decreased, but this can be the case for other matrices or machines.

• The speed of the Recompress operation itself is 0.7 GF/s, an extremely low value. Thus, even though the Recompress overhead is negligible in terms of flops, it can limit the global gain in terms of time. Here, the time overhead is 1.2s for an 8s gain, i.e. 15% overhead.
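To make the notion of inner size concrete (an illustrative sketch consistent with the description above, not an excerpt from the original text): if the low-rank updates accumulated for a block are X_1 Y_1^T, ..., X_k Y_k^T with ranks r_1, ..., r_k, their accumulation can be written as

C^(acc) = X_S Y_S^T,    X_S = [X_1 X_2 ... X_k],    Y_S = [Y_1 Y_2 ... Y_k],

so the inner size of the Outer Product becomes the sum of the r_i instead of a single r_i, and the Recompress step reduces this inner size again (here, from 61.0 to 32.8 columns on average) by recompressing X_S and/or Y_S.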

5.5.3.2 UFCS algorithm

In all the previous experiments, threshold partial pivoting was performed during the FR and BLR factorizations, which means the Factor and Solve steps were merged together


[Figure: speed (Gflop/s) of the Outer Product step as a function of its inner size, for block sizes b = 256 and b = 512.]

Figure 5.23: Performance benchmark of the Outer Product step on brunch. Please note that the average sizes (first line) and speed values (eighth line) of Table 5.17 cannot be directly linked using this figure because the average size would need to be weighted by its number of flops.

as described in Section 5.3.3.1. For many problems, numerical pivoting can be restricted to a smaller area of the panel (for example, the diagonal BLR blocks). In this case, the Solve step can be separated from the Factor step and applied directly on the entire panel, thus solely relying on BLAS-3 operations. Furthermore, in BLR, when numerical pivoting is restricted, it is natural and more efficient to perform the Compress before the Solve (thus leading to the so-called UFCS factorization). Indeed, UFCS makes further use of the low-rank property of the blocks since the Solve step can then be performed in low-rank, as shown at line 16 in Algorithm 5.4. In Table 5.18, we report the gain achieved by UFCS and its accuracy. We measure the scaled residual ‖Ax - b‖_∞ / (‖A‖_∞ ‖x‖_∞). We first compare the factorization with either standard or restricted pivoting. Restricting the pivoting allows the Solve to be performed with more BLAS-3 and thus the factorization is accelerated. This does not degrade the solution because on this test matrix restricted pivoting is enough to preserve accuracy. We refer the reader to the PhD thesis of Theo Mary [125] for a larger set of experiments showing how the scaled residual is affected by the use of the BLR factorization variants.

 | standard pivoting: FR | standard pivoting: UFSC+LUAR | restricted pivoting: FR | restricted pivoting: UFSC+LUAR | restricted pivoting: UFCS+LUAR
flops (×10^12) | 77.97 | 8.15 | 77.97 | 8.15 | 3.95
time (s) | 417.9 | 160.0 | 401.3 | 140.4 | 110.7
scaled residual | 4.5e-16 | 1.5e-09 | 5.0e-16 | 1.9e-09 | 2.7e-09

Table 5.18: Performance and accuracy of UFSC and UFCS variants on 24 threads on matrix S3.


We then compare UFSC and UFCS (with LUAR used in both cases). The flops for the UFCS factorization are reduced by a factor 2.1 with respect to UFSC. This can at first be surprising as the Solve step represents less than 20% of the total flops of the UFSC factorization.

step | flops (×10^12) UFSC | flops (×10^12) UFCS | time (s) UFSC | time (s) UFCS
Factor+Solve | 1.52 | 0.36 | 12.4 | 6.6
Update | 5.78 | 2.93 | 53.4 | 34.0
Compress | 0.62 | 0.43 | 24.1 | 20.4
LAI parts | 0.24 | 0.24 | 50.5 | 49.7
Total | 8.15 | 3.95 | 140.4 | 110.7

Table 5.19: Detailed analysis of UFSC and UFCS results of Table 5.18 on matrix S3.

To explain the relatively high gain observed in Table 5.18, we analyze in detail the difference between UFSC and UFCS in Table 5.19. By performing the Solve in low-rank, we reduce the number of operations of the Factor+Solve step by a factor 4.2, which translates into a time reduction of this step by a factor of 1.9. Furthermore, the flops of the Compress and Update steps are also significantly reduced, leading to a time reduction of 15% and 35%, respectively. This is because the Compress is performed earlier, which decreases the ranks of the blocks. On our test problem, the average rank decreases from 21.6 in UFSC to 16.2 in UFCS, leading to a very small relative increase of the scaled residual. The smaller ranks also lead to a smaller average size of the Outer Product, which decreases from 32.8 (last column of Table 5.17) to 24.4. This makes the LUAR variant even more critical when combined with UFCS: with no accumulation, the average size of the Outer Product in UFCS would be 10.9 (to compare to 16.5 in UFSC, first column of Table 5.17). Thanks to both the LUAR and UFCS variants, the factor of gain due to BLR with respect to FR on 24 threads has increased from 2.4 (Table 5.16) to 3.6 (Table 5.18).

5.5.4 Complete set of results

The results on the matrices coming from the three real-life applications from SEISCOPE, EMGS and EDF are reported in Table 5.20. We recall that the test matrices are described and assigned an ID in Table A.2. A richer set of experiments on matrices from the SuiteSparse Matrix Collection (SSMC) [56] can be found in our report from which this section is extracted [B1]. We report the flops and time on 24 threads for all variants of the FR and BLR factorizations, and report the speedup and scaled residual ‖Ax - b‖_∞ / (‖A‖_∞ ‖x‖_∞) for the best FR and BLR variants. The scaled residual in FR is taken as a reference. In BLR, the scaled residual also depends on the low-rank threshold ε. One can see in Table 5.20 that in BLR the scaled residual correctly reflects the influence of the low-rank approximations with threshold ε on the FR precision.

On this set of problems, BLR always reduces the number of operations with respect to FR by a significant factor. This factor is never fully translated in terms of time, but the time gains remain important, even for the smaller problems. Tree parallelism (tree//), the Left-looking factorization (UFSC) and the accumulation (LUA) always improve the performance of the BLR factorization.


Even though the recompression (LUAR) is always beneficial in terms of flops, it is not always the case in terms of time. Especially for the smaller problems, the low speed of the computations may lead to slowdowns. When LUAR is not beneficial (in terms of time), the "+UFCS" line in Table 5.20 corresponds to a UFCS factorization without Recompression (LUA only). For most of the problems, the UFCS factorization obtained a scaled residual of the same order of magnitude as the one obtained by UFSC. This was the case even for some matrices where pivoting cannot be suppressed, but can be restricted to the diagonal BLR blocks, such as perf008{d,ar,cr} (matrix ID 8-10). Only for problem perf009ar (matrix ID 11 and 21-22), standard threshold pivoting was needed to preserve accuracy and thus the restricted pivoting and UFCS results are not available.

We now analyze how these algorithmic variants evolve with the size of the matrix, by comparing the results on matrices of different sizes from the same problem class, such as perf008{d,ar,cr} (matrix ID 8-10) or {5,7,10}Hz (matrix ID 1-3). Tree parallelism becomes slightly less critical as the matrix gets bigger, due to the decreasing weight of the bottom of the assembly tree. On the contrary, improving the efficiency of the BLR factorization (UFSC+LUA variant, with reduced memory transfers and increased granularities) becomes more and more critical (e.g., 16% gain on perf008d compared to 40% gain on perf008cr). Both the gains due to the Recompression (LUAR) and the Compress before Solve (UFCS) increase with the problem size (e.g., 20% gain on perf008d compared to 34% gain on perf008cr), which is due to the improved complexity of these variants (cf. Section 5.4).

We also analyze the parallel efficiency of the FR and BLR factorizations by reporting the speedup on 24 threads. The speedup achieved by the FR factorization is 17.1 on average and goes up to 18.8. As for the biggest problems, they would take too long to run sequentially in FR; this is indicated by a "—" in the corresponding row of Table 5.20. However, for these problems, we can estimate the speedup assuming they would run at the same speed as the fastest problem of the same class that can be run sequentially. Under this assumption (which is conservative because the smaller problems already run very close to the CPU peak speed), these big problems all achieve a speedup close to or over 20. Overall, this shows that our parallel FR solver is a good reference to be compared with. The speedups achieved in BLR are lower than in FR, but they remain satisfactory, averaging at 11.7 and reaching up to 13.8, and leading to quite interesting overall time ratios between the best FR and the best BLR variants. It is worth noting that bigger problems do not necessarily lead to better speedups than smaller ones, because they achieve higher compression and thus lower efficiency.

164 5.5. Performance and scalability 10 11 — 14.2 5.5 1.6 9 − 10 9 3.1 8 6.4 10.0 17.3 4.2 2.1 21.417.715.9 55.2 43.3 157.2 37.6 110.2 5.6 5.2 93.5 — 99.397.3 245.292.7 232.278.2 632.9 216.5 570.2 176.7 50.0 15.9 515.8 49.1 10.8 377.9 79.6 17.5 — 12.2 13.5 12.3 101.0 377.5 1616.0 23.6 211.7174.3 662.2163.8 577.3161.5 2272.1 544.1115.6 2145.0 166.7 416.7 2066.9 77.6 313.0 1129.0 — 151.5 944.5 56.3 9.1e-151.4e-08 5.2e-15 3.7e-08 7.1e-15 5.0e-08 1.4e-15 4.5e-13 7 — — 48.1 12.2 6 8.13.9 105.6 9.0 3.6 18.8 7 − 10 5 — 18.2 95.2 53.1 11.5 4 8.3 3.7 9.6 3.3 14.8 57.910.4 2188.0 159.2 78.0 10.2 3119.0 163.3 91.0 652.717.7 110.7 736.1 15.8 41.2 19.7 64.8 376.4321.4 10089.4304.2 9779.2229.0 508.5 9655.6161.2 14362.3 417.9 2967.5150.1 401.3 2586.7 13979.7 145.6 306.8 1688.4 13842.7 138.3 220.5 1643.7 3702.9 174.7 1509.4 3165.9 167.1 1971.6 160.0 1856.6 3.7e-163.1e-10 7.0e-16 2.8e-10 5.0e-16 2.7e-09 8.1e-16 2.0e-10 3 — between best FR and best BLR 7.3 ∗ 2 Experimental results on real-life matrices from SEISCOPE, EMGS, and EDF. 3 5.2 − 10 1 9.37.06.4 48.4 34.4 222.8 34.3 146.1 100.1 3.3 69.5 471.1 2703.0 92.888.1 373.684.0 347.791.0 1497.3 327.549.7 1334.3 362.5 1245.4 194.8 1196.9 17.811.3 773.7 18.8 13.8 12.6 10.8 13.7 27.0 235.2196.2 1295.9163.2 1100.0146.2 6312.5 1013.0 5844.8 537.7 5649.5 1998.8 Table 5.20: 1.7e-043.1e-02 3.5e-04 2.9e-02 2.9e-04 4.2e-02 ∗ ∗ ε + LUAR + UFCS + tree// + rest. piv. + tree// + UFSC + LUA + LUAR + UFCS FR BLR flop ratio FR BLR time ratio Best FR Best BLR Best FR Best BLR matrix ID ) 12 low-rank threshold time 10 flops speedup scaled × (24 threads) (24 threads) residual (


Chapter 6

Conclusions and future work

The solution of large-scale, sparse linear systems of equations lies at the heart of numerous applications from science and academia as well as industry. Among the most popular techniques used to solve these problems, sparse direct methods are appreciated for their ease of use and robustness. These favorable properties come at the price of a significant memory and time consumption. As a consequence, the use of parallel computers is unavoidable. The work described in this thesis is concerned with the scalability of sparse, direct solvers where, by scalability, we mean their ability to solve problems of very large size (currently, up to 10^7 unknowns) and/or to make good use of the computing power and memory of modern, large-scale, heterogeneous computing platforms.

In Chapter 3 we have addressed the implementation of sparse, direct methods on modern computing systems. These are typically equipped with multi- or manycore processors and, possibly, multiple accelerators, such as GPUs. The high number of working cores and the fact that they all share the same memory call for the development of algorithms that can achieve high degrees of concurrency and do not suffer from excessive synchronization. The tiled (also referred to as communication-avoiding) algorithms for the factorization of dense matrices presented in Section 3.4.1 respond to this necessity. As discussed in Section 3.4, these methods can effectively be used for the factorization of frontal matrices within a multifrontal method for the factorization of sparse matrices. These algorithms have complex data access patterns, especially when used within the multifrontal method, where multiple matrices can be assembled and factorized concurrently. Therefore, traditional parallelization techniques such as the fork-join one cannot take full advantage of the concurrency delivered by tiled or communication-avoiding algorithms. For this reason we have investigated the use of a task-based parallel computing paradigm combined with an asynchronous and fully dynamic execution pattern, as described in Chapter 3. This approach proved to be capable of achieving high performance and scalability on shared-memory systems with high core counts. This was first achieved "by hand", that is, by manually implementing (either using Pthreads or a minimal subset of the OpenMP standard) our own tasking systems, described in Sections 3.4.1 and 3.1. Later we have investigated the use of modern runtime engines, such as StarPU, that rely on task-based parallelism, as described in Sections 3.3, 3.4 and 3.5. Besides being very efficient, these tools provide a very wide set of features, like the support for architectures equipped with accelerators or the possibility to implement and integrate task scheduling policies.

The results reported in Chapter 3 allow us to conclude that the use of task-based parallelism (where possible), combined with the use of modern runtime engines, delivers code that is capable of achieving high performance and portability on single-node, manycore systems equipped with multiple GPUs. Whether this approach can deliver the same good results on larger scale, distributed-memory computers remains to be investigated and is the subject of ongoing and future research, as described below. In addition, the use of task-based parallel programming models, such as the Sequential Task Flow, eases the development of some algorithms and renders the code easier to develop and maintain.

Sparse, direct methods are generally very demanding in terms of memory. Additionally, when executed in parallel, their memory consumption can be (much) higher than in a sequential setting. Consequently, reducing, if possible, and controlling the memory consumption is extremely important to make these solvers scalable on large-size supercomputers. In Chapter 4 we have presented techniques for controlling the memory footprint of a parallel sparse direct method. These basically consist in ensuring that the factorization is achieved within a prescribed memory envelope, which has to be greater than or equal to the sequential memory consumption. This is achieved by means of careful scheduling or mapping policies that trade parallelism for memory; therefore, the stricter the memory constraint, the higher the factorization execution time. Experimental results show that the proposed techniques allow for reliably controlling the memory consumption and for processing problems that would not fit in memory with traditional and commonly used mapping techniques such as the proportional mapping; the performance penalty is acceptable and, for some specific problems and settings, very small.

Finally, in Chapter 5 we investigated the use of low-rank approximation techniques within multifrontal solvers; these allow for asymptotically reducing the complexity (both in terms of memory and operations) of sparse, direct solvers by giving up some accuracy. This compromise between cost and accuracy can be safely and reliably chosen through a threshold that the application expert can set based on her/his needs. We have proposed a novel low-rank format called Block Low-Rank (BLR); unlike other commonly used low-rank formats, BLR does not use a hierarchical partitioning of data but, rather, a flat one, which makes it very convenient for integration within a complex, parallel, algebraic multifrontal solver. We have discussed the details of the implementation of a BLR multifrontal solver and presented different factorization variants that aim at reducing the cost or improving the performance of operations. We have presented a detailed theoretical analysis which shows that BLR can actually reduce the complexity of the factorization asymptotically and make it comparable to that of complex hierarchical formats. We have studied the parallel implementation of a BLR multifrontal solver on shared-memory systems and presented techniques to improve its performance and scalability. The correctness of all theoretical results and the effectiveness of all the proposed techniques have been assessed on real-life, large-scale problems. The results presented in Chapter 5 lead us to conclude that the use of low-rank approximation techniques, although not suited for all kinds of problems, can considerably reduce the complexity of sparse, direct solvers and make them competitive even on those problems or settings where they are commonly regarded as too resource consuming.
We can, moreover, conclude that, despite its slightly higher asymptotic complexity, the BLR format can provide considerable time and memory gains that are on par with those offered by more complex hierarchical formats [125].

6.1 Future work in sparse direct methods

Although sparse, direct solvers have been the object of a vast research effort, there are still numerous directions left to explore, and new ones will emerge due to the evolution of supercomputer architectures and the ever-increasing needs of applications.


First of all, it must be noted that the three main contributions presented in this thesis, i.e., task-based parallelism with runtimes, memory-aware scheduling and mapping, and the use of low-rank approximations, were achieved separately. Combining these methods opens up numerous opportunities for research.

It must be noted that the parallelization of the BLR factorization discussed in Section 5.5 essentially relies on a fork-join approach achieved through simple constructs like OpenMP parallel loops. The BLR format lends itself very naturally to task-based parallelism because of the decomposition into blocks of homogeneous sizes (within each front). The use of task-based parallelism can further improve the performance and scalability of BLR multifrontal solvers beyond the already good results presented in Section 5.5. This, moreover, could ease the porting of BLR solvers onto machines equipped with accelerators. Besides the technical issues related to the implementation on heterogeneous platforms, this poses a number of algorithmic challenges. As explained in Section 5.5, the performance of the BLR factorization suffers from the smaller granularity of operations with respect to the full-rank case; if the performance penalty is already considerable on standard multicore platforms, it can be catastrophic on devices such as GPUs, which require operations of larger granularity in order to achieve a satisfactory performance. One possible way of addressing this issue is to use batched routines [41, 96], which allow for executing multiple operations of the same type at once within the same kernel; as a result, a higher occupancy of the GPU is achieved, which ultimately leads to higher performance. This, however, necessarily introduces synchronizations which may harm the scalability of the code. This topic has been partially investigated by Akbudak et al. [9] and Sergent et al. [145].

The unpredictability of the BLR workload and memory consumption also makes the development of reliable memory-aware mapping algorithms challenging. First of all, it is not possible to compute beforehand the sequential peak memory, which basically makes the techniques presented in Chapter 4 unusable without major modifications. Another issue comes from the fact that the branches of the assembly tree may have different compression rates, which makes it impossible to compute static mappings that achieve a good memory balance. As a result, it may be necessary to develop (partially) novel dynamic mapping and scheduling techniques even in distributed memory environments, which is rather complex to achieve or may suffer from an excessive overhead. The implementation of such techniques may be eased by the use of a task-based parallel programming paradigm and, possibly, a runtime system.

Performance and scalability on large scale heterogeneous platforms The work described in Chapter 3, achieved, for the most part, in the context of the SOLHAR project, targets single-node, multi- or manycore systems possibly equipped with multiple accelerators. Typical supercomputers, however, include many such nodes connected through a fast network. Achieving high performance and scalability on such architectures through the use of task-based parallelism and runtime systems is one of the research topics that we intend to investigate in the forthcoming years. This is, however, a challenging task that requires addressing several issues that include (but are not limited to):

• Dynamic generation or update of the computational workload. In modern task-based runtime systems such as StarPU, the DAG of tasks is often defined statically. In a large scale, distributed memory system, the tasks of the DAG, as well as the related data, are also statically mapped onto the available resources. This static DAG generation and mapping, which is hard to achieve because of the irregular nature of the target algorithms and because of the hierarchical and heterogeneous nature of supercomputers, may lead to scalability issues due to load imbalance. One way to address this issue is to dynamically generate the DAG. This allows for an up-to-date view of the load of the computing system, but may lead to suboptimal allocation decisions due to the short-sightedness of the scheduler. Alternatively, a statically generated DAG and its associated mapping can be modified at run-time. Hybrid options are also possible, where a high-level DAG containing macro-tasks is statically generated and its macro-tasks dynamically generate the actual computational tasks.

• Partitioning of data and workload. Parallelization often implies partitioning data and operations. Depending on how this partitioning is done, more or less concurrency becomes available, more or less communication is performed, and different levels of granularity of operations can be achieved. Because of heterogeneity, uniform partitioning schemes may lead to a suboptimal use of the available resources. One way to overcome this issue is to design specific heterogeneous allocation schemes, to merge resources into homogeneous groups, and to assign macro-tasks or sub-DAGs to each group. Alternatively, the partitioning of data and operations can be done dynamically.

• DAG generation and pruning. In very large problems, the DAG of tasks can become extremely large and costly to handle. However, in a distributed memory context, each compute node only needs to have a partial view of the entire DAG. In other words, the DAG can be pruned locally (see the sketch after this list), which was shown in previous work [3] to be effective for large-scale scalability. How to achieve this efficiently and automatically remains an open challenge.

• Memory scalability. As explained above, the use of memory-aware scheduling and mapping techniques is necessary when targeting very large scale systems. This requires the runtimes to provide methods and APIs that allow for reliably controlling the memory consumption of tasks.

• Scheduling and resource allocation. Static offline allocation and scheduling problems are known to be computationally hard, and it is extremely difficult to come up with a model able to estimate communication and computation times and to reflect co-scheduling and co-allocation effects. On the other hand, purely dynamic runtime strategies may take inefficient resource allocation decisions and lead to unbalanced computations, especially in a distributed setting. It is therefore crucial to design intermediate strategies based on good offline allocation decisions that may be revisited with fast runtime corrections, based on possibly partial information on the state of the platform.

• Communication avoiding algorithms. These algorithms are extremely important to achieve scalability, especially on distributed memory systems. Previous studies on dense linear algebra methods [64] suggest that hierarchical communication-avoiding methods can be conceived to better match the hierarchical and heterogeneous nature of modern supercomputing platforms. These techniques must be extended to the case of sparse computations or, more generally, to algorithms that have a less regular communication pattern. Other techniques, instead, achieve a lower communication overhead at the cost of a higher memory consumption. One such example is the supernodal fan-both method proposed by Ashcraft [24], which is a variant of the better known fan-in and fan-out ones. The use of a runtime can allow for developing a general method that seamlessly switches between these three variants and that can, moreover, selectively do so only when more concurrency is needed. Another option is to investigate the use, within sparse factorizations, of methods that replicate some data in order to reduce the volume of communications, such as the one proposed by Solomonik et al. [150] for dense matrix factorizations.
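
The local DAG pruning mentioned in the list above can be illustrated with the following small, purely schematic sketch: given a task DAG and a static task-to-rank mapping, each rank keeps only its own tasks plus the remote tasks it must directly exchange data with. All names are hypothetical and this is only a view of the principle, not the runtime mechanism described in [3].

```python
def prune_dag_locally(tasks, deps, owner, my_rank):
    """Return the subset of a task DAG that a given rank needs to know about.

    tasks   : iterable of task identifiers
    deps    : dict task -> set of predecessor tasks
    owner   : dict task -> rank that executes the task
    my_rank : rank doing the pruning

    A rank keeps (i) the tasks it executes and (ii) the remote tasks that are
    direct predecessors or successors of its own tasks (needed to post the
    corresponding communications); everything else is dropped.
    """
    mine = {t for t in tasks if owner[t] == my_rank}
    boundary = set()
    for t in tasks:
        preds = deps.get(t, set())
        if t in mine:
            boundary |= {p for p in preds if owner[p] != my_rank}   # remote inputs
        elif preds & mine:
            boundary.add(t)                                         # remote consumers
    kept = mine | boundary
    pruned_deps = {t: deps.get(t, set()) & kept for t in kept}
    return kept, pruned_deps


# Tiny example: a four-task chain a -> b -> c -> d mapped on two ranks.
tasks = ["a", "b", "c", "d"]
deps = {"b": {"a"}, "c": {"b"}, "d": {"c"}}
owner = {"a": 0, "b": 0, "c": 1, "d": 1}
kept, _ = prune_dag_locally(tasks, deps, owner, my_rank=0)
print(sorted(kept))   # ['a', 'b', 'c']: rank 0 keeps its tasks plus the remote successor c
```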

Block Low-Rank The use of low-rank approximation within sparse direct solvers is a relatively new subject and, although some solvers are already routinely used in production environments (including MUMPS, which has provided the BLR feature since version 5.1, released in February 2017), numerous related research topics are still to be explored.

Among the things that we plan to investigate is the Multilevel BLR. In Section 5.4 we have shown that the BLR format can asymptotically reduce the cost (both in terms of memory and operations) of dense and sparse factorizations. Although the provided gain is considerable, the BLR format does not achieve the same asymptotic complexity as hierarchical ones. It is possible to extend the BLR format with multiple levels. This basically amounts to having nested BLR representations with a fixed number of levels. Through this technique it is possible to further reduce the cost of the storage and operations with respect to the single-level BLR. Although the multilevel BLR format can only reach the same complexity as hierarchical formats with an infinite number of levels, its use is of great interest in the context of sparse factorizations, where linear complexity can be achieved if the cost of the dense front factorizations is O(m^1.5) (see Section 5.4.3). This can be obtained with the Multilevel BLR using only a few levels. These results have been theoretically assessed in our recent work [B2], but the practical interest of the Multilevel BLR format still has to be validated.

A better understanding of the accuracy and stability of BLR-based solvers is necessary to develop methods that are not only faster and with a lower memory consumption, but also reliable and robust from a numerical point of view. We have already started addressing this topic [159] for the case of the basic FSCU factorization, but our analysis has to be improved and extended to the other variants, taking into account different pivoting policies.

Finally, although the advantages of using low-rank compression can be clearly shown in theory, their practical use opens a number of challenges, especially related to the efficiency of the implementation, that are hard to address. Scalability in a distributed memory parallel setting, for example, is difficult to achieve because the workload cannot be accurately modeled prior to the factorization itself and because the number of operations is reduced more than the volume of communications. In such a parallel context, we also plan to achieve an implementation of the forward elimination and backward substitution operations that makes effective use of the BLR format; this can be difficult to do, especially in the case of a single right-hand side, because of the unfavorable ratio between communications and computations.
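
To make the last point more concrete, here is a minimal, purely illustrative sketch (in Python/NumPy, with made-up block sizes and ranks, not the MUMPS solve phase) of a block forward elimination in which the off-diagonal blocks of the lower triangular factor are kept in compressed form U V^T: for a single right-hand side each block update then costs O(b r) operations instead of O(b^2), which is precisely what makes the communication-to-computation ratio unfavorable.

```python
import numpy as np
from scipy.linalg import solve_triangular

def blr_forward_solve(diag_blocks, low_rank_blocks, rhs, block_size):
    """Forward elimination L x = rhs, where L is block lower triangular with
    dense triangular diagonal blocks and off-diagonal blocks L_ij ~= U_ij V_ij^T."""
    p = len(diag_blocks)
    x = np.zeros_like(rhs)
    for i in range(p):
        rows = slice(i * block_size, (i + 1) * block_size)
        y = rhs[rows].copy()
        for j in range(i):
            if (i, j) in low_rank_blocks:
                U, V = low_rank_blocks[(i, j)]
                cols = slice(j * block_size, (j + 1) * block_size)
                y -= U @ (V.T @ x[cols])       # low-rank update: O(b*r) flops
        x[rows] = solve_triangular(diag_blocks[i], y, lower=True)
    return x


# Tiny consistency check with exactly low-rank off-diagonal blocks.
rng = np.random.default_rng(1)
bs, r, p = 32, 4, 3
diag = [np.tril(rng.standard_normal((bs, bs))) + bs * np.eye(bs) for _ in range(p)]
lr = {(i, j): (rng.standard_normal((bs, r)), rng.standard_normal((bs, r)))
      for i in range(p) for j in range(i)}
L = np.zeros((p * bs, p * bs))
for i in range(p):
    L[i*bs:(i+1)*bs, i*bs:(i+1)*bs] = diag[i]
for (i, j), (U, V) in lr.items():
    L[i*bs:(i+1)*bs, j*bs:(j+1)*bs] = U @ V.T
b = rng.standard_normal(p * bs)
assert np.allclose(blr_forward_solve(diag, lr, b, bs), np.linalg.solve(L, b))
```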

Rank-deficient QR In numerous applications, especially mathematical optimization problems, it is required to solve least-squares problems with rank-deficient matrices. In such a case, the least-squares problem admits infinitely many solutions; among these, the one of minimum norm is required. When the problem matrix is dense, the canonical approach relies on the use of a QR factorization with column pivoting, as explained in Section 2.1.4.3. Although this technique can also be applied to the sparse QR factorization [135], it is very hard to implement in an efficient way and may destroy the staircase structure of fronts, which results in an increased cost. Davis [55] recently adapted to the case of the Householder multifrontal QR factorization a technique which was originally proposed by Heath [98] for sparse QR factorizations based on Givens transformations: when a diagonal entry of R is found to be lower than a prescribed threshold in the course of the factorization, the corresponding column is skipped, which leads to an R factor which is not triangular but can be permuted into a trapezoid. This approach, unfortunately, is totally incompatible with the 2D partitioning of fronts into tiles that we rely on (see Section 3.4.1). Therefore, we have already started working on a different variant of the Heath method which achieves the factorization of a rank-deficient matrix in two stages. In a first stage a regular, non rank-revealing factorization is computed; in a second stage the R factor is inspected and, for all its diagonal entries smaller than a prescribed threshold, the corresponding rows are zeroed out by means of Householder or Givens transformations. This second stage is achieved through a topological-order traversal of the assembly tree and can, thus, be pipelined with the previous one, which allows for an efficient use of parallelism. This technique can possibly be improved numerically by means of regularization techniques such as the Riley-Golub iteration [1] or the one proposed by Avron et al. [27].
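
For reference, the canonical dense approach mentioned above can be sketched as follows (a minimal example with SciPy on an arbitrary, deliberately rank-deficient test matrix; it illustrates the column-pivoted QR baseline, not the two-stage variant discussed in this paragraph). Column pivoting reveals the numerical rank through the decay of the diagonal of R, after which a basic solution is obtained from the leading r-by-r triangle; the minimum-norm solution requires an additional complete orthogonal (or SVD-based) step, obtained here from lstsq for comparison.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(2)

# A rank-deficient least-squares problem (rank 3 out of 5 columns).
m, n, true_rank = 20, 5, 3
A = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
b = rng.standard_normal(m)

# Column-pivoted QR: A[:, piv] = Q @ R, with |diag(R)| non-increasing, so the
# numerical rank can be read off the diagonal against a threshold.
Q, R, piv = qr(A, mode="economic", pivoting=True)
tol = max(m, n) * np.finfo(float).eps * abs(R[0, 0])
r = int(np.sum(np.abs(np.diag(R)) > tol))

# Basic solution: use only the first r pivoted columns (not minimum norm).
y = solve_triangular(R[:r, :r], (Q.T @ b)[:r], lower=False)
x_basic = np.zeros(n)
x_basic[piv[:r]] = y

# Minimum-norm solution (SVD-based), for comparison.
x_min = np.linalg.lstsq(A, b, rcond=None)[0]

print("numerical rank:", r)
print("residual norms:", np.linalg.norm(A @ x_basic - b), np.linalg.norm(A @ x_min - b))
print("solution norms:", np.linalg.norm(x_basic), np.linalg.norm(x_min))
```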

Bibliography

Published work

Journal Articles [J1] E. Agullo, P. R. Amestoy, A. Buttari, A. Guermouche, J.-Y. L’Excellent, and F.-H. Rouet. “Robust Memory-Aware Mappings for Parallel Multifrontal Factorizations”. In: SIAM Journal on Scientific Computing 38.3 (2016), pp. C256–C279. doi: 10. 1137/130938505. eprint: http://dx.doi.org/10.1137/130938505. url: http: //dx.doi.org/10.1137/130938505. [J2] E. Agullo, A. Buttari, A. Guermouche, and F. Lopez. “Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems”. In: ACM Trans. Math. Softw. 43.2 (Aug. 2016), 13:1–13:22. issn: 0098- 3500. doi: 10.1145/2898348. url: http://doi.acm.org/10.1145/2898348. [J3] P. R. Amestoy, R. Brossier, A. Buttari, J.-Y. L’Excellent, T. Mary, L. Métivier, A. Miniussi, and S. Operto. “Fast 3D frequency-domain full waveform inversion with a parallel Block Low-Rank multifrontal direct solver: application to OBC data from the North Sea”. In: Geophysics 81.6 (2016), R363–R383. [J4] P. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L’Excellent, and C. Weis- becker. “Improving Multifrontal Methods by Means of Block Low-Rank Represen- tations”. In: SIAM Journal on Scientific Computing 37.3 (2015), A1451–A1474. doi: 10.1137/120903476. eprint: http://dx.doi.org/10.1137/120903476. url: http://dx.doi.org/10.1137/120903476. [J5] P. Amestoy, A. Buttari, G. Joslin, J.-Y. L’Excellent, M. Sid-Lakhdar, C. Weis- becker, M. Forzan, C. Pozza, R. Perrin, and V. Pellissier. “Shared-Memory Paral- lelism and Low-Rank Approximation Techniques Applied to Direct Solvers in FEM Simulation”. In: IEEE Transactions on Magnetics 50.2 (Feb. 2014), pp. 517–520. issn: 0018-9464. doi: 10.1109/TMAG.2013.2284024. [J6] P. Amestoy, A. Buttari, J.-Y. L’Excellent, and T. Mary. “On the Complexity of the Block Low-Rank Multifrontal Factorization”. In: SIAM Journal on Scientific Computing 39.4 (2017), A1710–A1740. doi: 10.1137/16M1077192. eprint: https: //doi.org/10.1137/16M1077192. url: https://doi.org/10.1137/16M1077192. [J7] M. Baboulin, A. Buttari, J. Dongarra, J. Kurzak, J. Langou, J. Langou, P. Luszczek, and S. Tomov. “Accelerating scientific computations with mixed precision algo- rithms”. In: Computer Physics Communications 180.12 (2009), pp. 2526–2533. doi: 10.1016/j.cpc.2008.11.005. [J8] L. Bouchet, P. Amestoy, A. Buttari, F.-H. Rouet, and M. Chauvin. “INTEGRAL/SPI data segmentation to retrieve sources intensity variations”. In: Astronomy & As- trophysics A52 (July 2013), (on line). doi: 10.1051/0004-6361/201219605.

[J9] L. Bouchet, P. Amestoy, A. Buttari, F.-H. Rouet, and M. Chauvin. “Simultane- ous analysis of large INTEGRAL/SPI datasets: optimizing the computation of the solution and its variance using sparse matrix algorithms”. In: Astronomy and Computing 1 (2013), pp. 59–69. doi: 10.1016/j.ascom.2013.03.002. [J10] A. Buttari. “Fine-Grained Multithreading for the Multifrontal QR Factorization of Sparse Matrices”. In: SIAM Journal on Scientific Computing 35.4 (2013), pp. C323– C345. eprint: http://epubs.siam.org/doi/pdf/10.1137/110846427. url: http://epubs.siam.org/doi/abs/10.1137/110846427. [J11] A. Buttari, P. D’Ambra, D. Di Serafino, and S. Filippone. “2LEV-D2P4: a package of high-performance preconditioners for scientific and engineering applications”. In: Appl. Algebra Eng., Commun. Comput. 18.3 (2007), pp. 223–239. issn: 0938-1279. doi: 10.1007/s00200-007-0035-z. [J12] A. Buttari, J. Dongarra, J. Kurzak, P. Luszczek, and S. Tomov. “Using Mixed Pre- cision for Sparse Matrix Computations to Enhance the Performance while Achiev- ing 64-bit Accuracy”. In: ACM Trans. Math. Softw. 34.4 (2008), pp. 1–22. issn: 0098-3500. doi: 10.1145/1377596.1377597. [J13] A. Buttari, J. Dongarra, J. Langou, J. Langou, P. Luszczek, and J. Kurzak. “Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Sys- tems”. In: Int. J. High Perform. Comput. Appl. 21.4 (2007), pp. 457–466. issn: 1094-3420. doi: 10.1177/1094342007084026. [J14] A. Buttari, V. Eijkhout, J. Langou, and S. Filippone. “Performance Optimization and Modeling of Blocked Sparse Kernels”. In: Int. J. High Perform. Comput. Appl. 21.4 (2007), pp. 467–484. issn: 1094-3420. doi: 10.1177/1094342007083801. [J15] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. “A class of parallel tiled linear algebra algorithms for multicore architectures”. In: Parallel Comput. 35 (1 Jan. 2009), pp. 38–53. issn: 0167-8191. doi: 10.1016/j.parco.2008.10.002. url: http://dl.acm.org/citation.cfm?id=1486274.1486415. [J16] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. “Parallel tiled QR factorization for multicore architectures”. In: Concurr. Comput. : Pract. Exper. 20.13 (2008), pp. 1573–1590. issn: 1532-0626. doi: 10.1002/cpe.v20:13. [J17] S. Filippone and A. Buttari. “Object-Oriented Techniques for Sparse Matrix Com- putations in Fortran 2003”. In: ACM Transactions on Mathematical Software 38.4 (Aug. 2012), 23:1–23:20. doi: 10.1145/2331130.2331131. [J18] J. Kurzak, A. Buttari, and J. Dongarra. “Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization”. In: IEEE Trans. Parallel Distrib. Syst. 19.9 (2008), pp. 1175–1186. issn: 1045-9219. doi: 10.1109/TPDS. 2007.70813. [J19] J. Kurzak, A. Buttari, P. Luszczek, and J. Dongarra. “The PlayStation 3 for High-Performance Scientific Computing”. In: Computing in Science and Eng. 10.3 (2008), pp. 84–87. issn: 1521-9615. doi: 10.1109/MCSE.2008.85. [J20] D. Shantsev, P. Jaysaval, S. de la Kethulle de Ryhove, P. Amestoy, A. Buttari, J.-Y. L’Excellent, and T. Mary. “Large-scale 3D EM modeling with a Block Low-Rank multifrontal direct solver”. In: Geophysical Journal International (Mar. 2017). doi: 10.1093/gji/ggx106.

Conference Proceedings [C1] E. Agullo, P. R. Amestoy, A. Buttari, A. Guermouche, G. Joslin, J.-Y. L’Excellent, X. S. Li, A. Napov, F.-H. Rouet, M. Sid-Lakhdar, et al. “Recent advances in sparse direct solvers”. In: Conference on Structural Mechanicsin Reactor Technology. 2013. [C2] E. Agullo, G. Bosilca, A. Buttari, A. Guermouche, and F. Lopez. “Exploiting a Parametrized Task Graph Model for the Parallelization of a Sparse Direct Multi- frontal Solver”. In: Euro-Par 2016: Parallel Processing Workshops: Euro-Par 2016 International Workshops, Grenoble, France, August 24-26, 2016, Revised Selected Papers. Ed. by F. Desprez et al. Cham: Springer International Publishing, 2017, pp. 175–186. isbn: 978-3-319-58943-5. doi: 10.1007/978-3-319-58943-5_14. url: http://dx.doi.org/10.1007/978-3-319-58943-5_14. [C3] E. Agullo, A. Buttari, A. Guermouche, and F. Lopez. “Multifrontal QR Factor- ization for Multicore Architectures over Runtime Systems”. In: Euro-Par 2013 Parallel Processing. Springer Berlin Heidelberg, 2013, pp. 521–532. isbn: 978-3- 642-40046-9. url: http://dx.doi.org/10.1007/978-3-642-40047-6_53. [C4] E. Agullo, A. Buttari, A. Guermouche, and F. Lopez. “Task-Based Multifrontal QR Solver for GPU-Accelerated Multicore Architectures.” In: HiPC. IEEE Computer Society, 2015, pp. 54–63. isbn: 978-1-4673-8488-9. [C5] P. R. Amestoy, R. Brossier, A. Buttari, J.-Y. L’Excellent, T. Mary, L. Métivier, A. Miniussi, S. Operto, J. Virieux, and C. Weisbecker. “3D frequency-domain seismic modeling with a Parallel BLR multifrontal direct solver”. In: SEG Technical Program Expanded Abstracts 2015. 2015. Chap. 692, pp. 3606–3611. doi: 10.1190/ segam2015-5811693.1. eprint: http://library.seg.org/doi/pdf/10.1190/ segam2015- 5811693.1. url: http://library.seg.org/doi/abs/10.1190/ segam2015-5811693.1. [C6] P. R. Amestoy et al. “Efficient 3D frequency-domain full-waveform inversion of ocean-bottom cable data with sparse block low-rank direct solver: a real data case study from the North Sea”. In: SEG Technical Program Expanded Abstracts 2015. 2015. Chap. 251, pp. 1303–1308. doi: 10.1190/segam2015- 5713962.1. eprint: http://library.seg.org/doi/pdf/10.1190/segam2015- 5713962.1. url: http://library.seg.org/doi/abs/10.1190/segam2015-5713962.1. [C7] P. Amestoy, A. Buttari, G. Joslin, J.-Y. L’Excellent, M. Sid-Lakhdar, C. Weis- becker, M. Forzan, C. Pozza, R. Perrin, and V. Pellissier. “Shared memory par- allelism and low-rank approximation techniques applied to direct solvers in FEM simulation”. In: IEEE International Conference on the Computation of Electromag- netic Fields (COMPUMAG), Budapest, Hungary, 30/06/2013-04/07/2013. IEEE, June 2013. doi: 10.1109/TMAG.2013.2284024. [C8] G. Antoniu et al. “Towards exascale with the ANR-JST Japanese-French Project FP3C”. In: Ninth International Conference on Computer Science and Informa- tion Technologies Revised Selected Papers. Sept. 2013, pp. 1–10. doi: 10.1109/ CSITechnol.2013.6710357. [C9] G. Bella, A. Buttari, A. De Maio, F. Del Citto, S. Filippone, and F. Gasperini. “FAST-EVP: an Engine Simulation Tool”. In: High Perfromance Computing and Communications. First International Conference, HPCC 2005, Proceedings. Ed. by Springer. Vol. 3726. Lecture Notes in Computer Science. [doi:10.1007/11557654_108]. Sept. 2005, pp. 976–986.

[C10] L. Bouchet, P. Amestoy, A. Buttari, F.-H. Rouet, and M. Chauvin. “INTEGRAL/SPI data segmentation to retrieve sources intensity variations (regular paper)”. In: An INTEGRAL view of the high-energy sky (the first 10 years), Paris, France, 15/10/2012-19/10/2012. Ed. by A. Goldwurm, F. Lebrun, and C. Winkler. 2013. [C11] A. Buttari. “Fine granularity sparse QR factorization for multicore based systems”. In: Proceedings of the 10th international conference on Applied Parallel and Scien- tific Computing - Volume 2. PARA’10. Reykjavik, Iceland: Springer-Verlag, 2012, pp. 226–236. isbn: 978-3-642-28144-0. url: http://dx.doi.org/10.1007/978- 3-642-28145-7_23. [C12] A. Buttari, P. D’Ambra, D. S. Di Serafino, and S. Filippone. “Extending PSBLAS to Build Parallel Schwarz Preconditioners”. In: Applied Parallel Computing. State of the Art in Scientific Computing: 7th International Conference, PARA 2004, Lyngby, Denmark, June 20-23, 2004. Ed. by Springer. Vol. 3732. Lecture Notes in Computer Science. Feb. 2006, pp. 593–602. doi: 10.1007/11558958\_71. [C13] A. Buttari, J. Dongarra, P. Husbands, J. Kurzak, and K. Yelick. “Multithread- ing for Synchronization Tolerance in Matrix Factorization”. In: Proceedings of the SciDAC 2007 Conference. Boston, Massachusetts: Journal of Physics: Conference Series, 2007. doi: 10.1088/1742-6596/78/1/012028. [C14] A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. “The impact of multicore on math software”. In: Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing. PARA’06. Umeå;, Sweden: Springer-Verlag, 2007, pp. 1–10. isbn: 3-540-75754-6, 978-3-540-75754-2. url: http : / / dl . acm . org / citation . cfm ? id = 1775059 . 1775061. [C15] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. “Parallel tiled QR factor- ization for multicore architectures”. In: PPAM’07: Proceedings of the 7th interna- tional conference on Parallel processing and applied mathematics. Gdansk, Poland: Springer-Verlag, 2008, pp. 639–648. isbn: 3-540-68105-1, 978-3-540-68105-2. doi: 10.1007/978-3-540-68111-3\_67. [C16] J. Demmel et al. “Prospectus for the Next LAPACK and ScaLAPACK Libraries”. In: PARA’06: State-of-the-Art in Scientific and Parallel Computing. High Per- formance Computing Center North (HPC2N) and the Department of Computing Science, UmeåUniversity. Umeå, Sweden: Springer, June 2006. doi: 10.1007/978- 3-540-75755-9\_2. [C17] G. Hautreux et al. “Pre-exascale Architectures: OpenPOWER Performance and Usability Assessment for French Scientific Community”. In: High Performance Computing: ISC High Performance 2017 International Workshops, DRBSD, Ex- aComm, HCPM, HPC-IODC, IWOPH, IXPUG, P3MA, VHPC, Visualization at Scale, WOPSSS, Frankfurt, Germany, June 18-22, 2017, Revised Selected Papers. Ed. by J. M. Kunkel, R. Yokota, M. Taufer, and J. Shalf. Cham: Springer Interna- tional Publishing, 2017, pp. 309–324. isbn: 978-3-319-67630-2. doi: 10.1007/978- 3-319-67630-2_23. url: https://doi.org/10.1007/978-3-319-67630-2_23. [C18] J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. Dongarra. “Ex- ploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit ac- curacy (revisiting iterative refinement for linear systems)”. In: SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing. Tampa, Florida: ACM, 2006, p. 113. isbn: 0-7695-2700-0. doi: 10.1145/1188455.1188573.

[C19] L. Stanisic, E. Agullo, A. Buttari, A. Guermouche, A. Legrand, F. Lopez, and B. Videau. “Fast and Accurate Simulation of Multithreaded Sparse Linear Algebra Solvers”. In: Parallel and Distributed Systems (ICPADS), 2015 IEEE 21st Inter- national Conference on. Dec. 2015, pp. 481–490. doi: 10.1109/ICPADS.2015.67. [C20] C. Weisbecker, P. Amestoy, O. Boiteau, R. Brossier, A. Buttari, J.-Y. L’Excellent, S. Operto, and J. Virieux. “3D frequency-domain seismic modeling with a block low-rank algebraic multifrontal direct solver”. In: SEG Technical Program Expanded Abstracts 2013. 2013. Chap. 662, pp. 3411–3416. doi: 10.1190/segam2013-0603. 1.

Book Chapters

[B1] P. Amestoy, A. Buttari, I. Duff, A. Guermouche, J.-Y. L’Excellent, and B. Uçar. “MUMPS”. In: Encyclopedia of Parallel Computing. Ed. by D. Padua. Springer Verlag, 2011.
[B2] P. Amestoy, A. Buttari, I. Duff, A. Guermouche, J.-Y. L’Excellent, and B. Uçar. “The Multifrontal Method”. In: Encyclopedia of Parallel Computing. Ed. by D. Padua. Springer Verlag, 2011.
[B3] A. Buttari, J. Dongarra, J. Kurzak, and J. Langou. “Parallel Dense Linear Algebra Software in the Multicore Era”. In: Cyberinfrastructure Technologies and Applications. Ed. by J. Cao. Nova Science Publishers, 2007.
[B4] A. Buttari, J. Dongarra, J. Kurzak, J. Langou, J. Langou, P. Luszczek, and S. Tomov. “Exploiting Mixed Precision Floating Point Hardware in Scientific Computations”. In: High Performance Computing and Grids in Action. Ed. by L. Grandinetti. IOS Press, 2007.
[B5] J. Demmel et al. “Prospectus for a Linear Algebra Software Library for Dense Matrix Problems”. In: Handbook of Parallel Computing: Models, Algorithms and Applications. Ed. by S. Rajasekaran and J. Reif. 1st ed. Vol. 17. Chapman & Hall/CRC Computer & Information Science. ISBN: 9781584886235. CRC Press, Dec. 2007.

To appear or submitted

[B1] P. R. Amestoy, A. Buttari, J.-Y. L’Excellent, and T. Mary. Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures. Research Report. Submitted to ACM TOMS. INPT-IRIT ; CNRS-IRIT ; INRIA-LIP ; UPS-IRIT, Apr. 2017. url: https://hal.archives-ouvertes.fr/hal-01505070.
[B2] P. Amestoy, A. Buttari, J.-Y. L’Excellent, and T. Mary. Bridging the gap between flat and hierarchical low-rank matrix formats: the multilevel BLR format. Tech. rep. Submitted to the SIAM Journal on Scientific Computing. IRIT, 2018.

References

[1] D. Achiya and E. Lars. “Approximating minimum norm solutions of rank‐deficient least squares problems”. In: Numerical Linear Algebra with Applications 5.2 (1998), pp. 79–99. doi: 10.1002/(SICI)1099-1506(199803/04)5:2<79::AID-NLA126> 3.0.CO;2-4.

[2] E. Agullo. “On the out-of-core factorization of large sparse matrices”. PhD thesis. École Normale Supérieure de Lyon, Nov. 2008. [3] E. Agullo, O. Aumage, M. Faverge, N. Furmento, F. Pruvost, M. Sergent, and S. Thibault. Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model. Research Report RR-8927. June 2016, p. 27. url: https://hal.inria.fr/hal-01332774. [4] E. Agullo, B. Bramas, O. Coulaud, E. Darve, M. Messner, and T. Takahashi. Task- based FMM for heterogeneous architectures. Research Report RR-8513. Inria, Apr. 2014, p. 29. url: https://hal.inria.fr/hal-00974674. [5] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. “Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects”. In: Journal of Physics: Conference Series 180.1 (2009), p. 012037. url: http://stacks.iop.org/1742-6596/180/i=1/a=012037. [6] E. Agullo, J. Dongarra, R. Nath, and S. Tomov. “Fully Empirical Autotuned QR Factorization For Multicore Architectures”. In: CoRR abs/1102.5328 (2011). [7] E. Agullo, L. Giraud, S. Nakov, and J. Roman. Hierarchical hybrid sparse linear solver for multicore platforms. Research Report RR-8960. INRIA Bordeaux, Oct. 2016, p. 25. url: https://hal.inria.fr/hal-01379227. [8] E. Agullo, A. Guermouche, and J.-Y. L’Excellent. “Reducing the I/O Volume in Sparse Out-of-core Multifrontal Methods”. In: SIAM Journal on Scientific Com- puting 31.6 (2010), pp. 4774–4794. [9] K. Akbudak, H. Ltaief, A. Mikhalev, A. Charara, and D. E. Keyes. Exploiting Data Sparsity for Large-Scale Matrix Computations. Tech. rep. KAUST, 2018. url: http://hdl.handle.net/10754/627403. [10] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann, 2002. [11] P. R. Amestoy, T. A. Davis, and I. S. Duff. “An Approximate Minimum Degree Ordering Algorithm”. In: SIAM J. Matrix Anal. Appl. 17.4 (Oct. 1996), pp. 886– 905. issn: 0895-4798. doi: 10.1137/S0895479894278952. url: http://dx.doi. org/10.1137/S0895479894278952. [12] P. R. Amestoy and I. S. Duff. “Memory Management Issues in Sparse Multifrontal Methods On Multiprocessors”. In: The International Journal of Supercomputing Applications 7.1 (1993), pp. 64–82. doi: 10.1177/109434209300700105. eprint: https://doi.org/10.1177/109434209300700105. url: https://doi.org/10. 1177/109434209300700105. [13] P. R. Amestoy, I. S. Duff, J. Koster, and J.-Y. L’Excellent. “A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling”. In: SIAM Journal on Matrix Analysis and Applications 23.1 (2001), pp. 15–41. [14] P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, Y. Robert, F.-H. Rouet, and B.Uçar. “On computing inverse entries of a sparse matrix in an out-of-core environment”. In: SIAM Journal on Scientific Computing 34.4 (2012), A1975–A1999. [15] P. R. Amestoy, I. S. Duff, and C. Puglisi. “Multifrontal QR factorization ina multiprocessor environment”. In: Int. Journal of Num. Linear Alg. and Appl. 3(4) (1996), pp. 275–300.

[16] P. R. Amestoy, A. Guermouche, J.-Y. L’Excellent, and S. Pralet. “Hybrid schedul- ing for the parallel solution of linear systems”. In: Parallel Computing 32.2 (2006), pp. 136–156. [17] P. R. Amestoy and C. Puglisi. “An unsymmetrized multifrontal LU factorization”. In: SIAM Journal on Matrix Analysis and Applications 24 (2002), pp. 553–569. [18] P. Amestoy, J.-Y. L’Excellent, and G. Moreau. On Exploiting Sparsity of Multiple Right-Hand Sides in Sparse Direct Solvers. Research Report RR-9122. Submitted to SIAM SISC. ENS de Lyon ; INRIA Grenoble - Rhone-Alpes, Dec. 2017, pp. 1– 28. url: https://hal.inria.fr/hal-01649244. [19] A. Aminfar, S. Ambikasaran, and E. Darve. “A fast block low-rank dense solver with applications to finite-element matrices”. In: Journal of Computational Physics 304 (2016), pp. 170–188. [20] E. Anderson et al. LAPACK Users’ Guide. Third. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999. isbn: 0-89871-447-8 (paperback). [21] J. Anton, C. Ashcraft, and C. Weisbecker. “A Block Low-Rank multithreaded fac- torization for dense BEM operators”. In: SIAM Conference on Parallel Processing (SIAM PP16). Paris, France, Apr. 2016. [22] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. “Thread Scheduling for Multi- programmed Multiprocessors”. In: Theory Comput. Syst. 34.2 (2001), pp. 115–144. doi: 10.1007/s00224-001-0004-z. [23] K. Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. Tech. rep. UCB/EECS-2006-183. TECHNICAL REPORT, UC BERKE- LEY, 2006. [24] C. Ashcraft. “The Fan-Both Family of Column-Based Distributed Cholesky Fac- torization Algorithms”. In: Graph Theory and Sparse Matrix Computation. Ed. by A. George, J. R. Gilbert, and J. W. H. Liu. New York, NY: Springer New York, 1993, pp. 159–190. isbn: 978-1-4613-8369-7. [25] C. Ashcraft and R. G. Grimes. “SPOOLES: An object oriented sparse matrix library”. In: Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing. San Antonio, Texas, Mar. 1999. [26] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. “StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures”. In: Con- currency and Computation: Practice and Experience, Special Issue: Euro-Par 2009 23 (2 Feb. 2011), pp. 187–198. doi: 10.1002/cpe.1631. url: http://hal.inria. fr/inria-00550877. [27] H. Avron, E. Ng, and S. Toledo. “Using Perturbed QR Factorizations to Solve Linear Least-Squares Problems”. In: SIAM Journal on Matrix Analysis and Ap- plications 31.2 (2009), pp. 674–693. doi: 10 . 1137 / 070698725. eprint: https : //doi.org/10.1137/070698725. url: https://doi.org/10.1137/070698725. [28] R. M. Badia, J. R. Herrero, J. Labarta, J. M. Pérez, E. S. Quintana-Ortí, and G. Quintana-Ortí. “Parallelizing dense and banded linear algebra libraries using SMPSs”. In: Concurrency and Computation: Practice and Experience 21.18 (2009), pp. 2438–2456.

[29] G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. “Communication lower bounds and optimal algorithms for numerical linear al- gebra”. In: Acta Numerica 23 (May 2014), pp. 1–155. issn: 1474-0508. doi: 10. 1017/S0962492914000038. url: http://journals.cambridge.org/article_ S0962492914000038. [30] O. Beaumont and A. Guermouche. “Task scheduling for parallel multifrontal meth- ods”. In: Euro-Par 2007 Parallel Processing (2007), pp. 758–766. [31] M. Bebendorf. Hierarchical Matrices: A Means to Efficiently Solve Elliptic Bound- ary Value Problems (Lecture Notes in Computational Science and Engineering). 1st ed. Springer, 2008. isbn: 3540771468. [32] M. Bebendorf. “Approximation of boundary element matrices”. In: Numerische Mathematik 86.4 (2000), pp. 565–589. issn: 0945-3245. doi: 10.1007/PL00005410. url: http://dx.doi.org/10.1007/PL00005410. [33] M. Bebendorf. “Efficient inversion of Galerkin matrices of general second-order el- liptic differential operators with nonsmooth coefficients”. In: Mathematics of Com- putation 74 (2005), pp. 1179–1199. [34] M. Bebendorf. “Why finite element discretizations can be factored by triangular hierarchical matrices”. In: SIAM Journal on Numerical Analysis 45 (2007), p. 1472. [35] M. Bebendorf and W. Hackbusch. “Existence of H-matrix approximants to the in- verse FE-matrix of elliptic operators with L∞-coefficients”. In: Numerische Math- ematik 95.1 (2003), pp. 1–28. [36] Å. Björck. Numerical methods for Least Squares Problems. Philadelphia: SIAM, 1996. [37] S. Börm, L. Grasedyck, and W. Hackbusch. “Introduction to hierarchical matrices with applications”. In: Engineering analysis with boundary elements 27.5 (2003), pp. 405–422. issn: 0955-7997. doi: 10.1016/S0955-7997(02)00152-2. url: http: //www.mis.mpg.de/de/publications/preprints/2002/prepr2002-18.html. [38] G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault, and J. J. Dongarra. “PaRSEC: Exploiting Heterogeneity to Enhance Scalability”. In: Computing in Science and Engineering 15.6 (2013), pp. 36–45. doi: 10.1109/MCSE.2013.98. url: http://dx.doi.org/10.1109/MCSE.2013.98. [39] H. Bouwmeester, M. Jacquelin, J. Langou, and Y. Robert. “Tiled QR Factorization Algorithms”. In: Proceedings of 2011 International Conference for High Perfor- mance Computing, Networking, Storage and Analysis. SC ’11. Seattle, Washington: ACM, 2011, 7:1–7:11. isbn: 978-1-4503-0771-0. doi: 10.1145/2063384.2063393. url: http://doi.acm.org/10.1145/2063384.2063393. [40] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault, and R. Namyst. “hwloc: A Generic Framework for Managing Hardware Affinities in HPC Application”. In: Parallel, Distributed and Network-Based Pro- cessing (PDP), 2010 18th Euromicro International Conference. Feb. 2010, pp. 180– 186. doi: 10.1109/PDP.2010.67. [41] W. J. Brouwer and P.-Y. Taunay. “Efficient Batch LU and QR Decomposition on GPU”. In: Numerical Computations with GPUs. Ed. by V. Kindratenko. Cham: Springer International Publishing, 2014, pp. 69–86. isbn: 978-3-319-06548-9. doi: 10.1007/978-3-319-06548-9_4. url: https://doi.org/10.1007/978-3-319- 06548-9_4.

[42] R. A. Brualdi and H. J. Ryser. Combinatorial Matrix Theory. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1991. doi: 10. 1017/CBO9781107325708. [43] R. A. Brualdi and B. L. Shader. “Strong Hall Matrices”. In: SIAM J. Matrix Anal. Appl. 15.2 (Apr. 1994), pp. 359–365. issn: 0895-4798. doi: 10.1137/S0895479892225142. url: http://dx.doi.org/10.1137/S0895479892225142. [44] J. R. Bunch and L. Kaufman. “Some Stable Methods for Calculating Inertia and Solving Symmetric Linear Systems”. In: Mathematics of Computation 31 (1977), pp. 162–179. [45] P. Businger and G. H. Golub. “Linear Least Squares Solutions by Householder Transformations”. In: Numer. Math. 7.3 (June 1965), pp. 269–276. issn: 0029-599X. doi: 10.1007/BF01436084. url: http://dx.doi.org/10.1007/BF01436084. [46] S. Chandrasekaran, P. Dewilde, M. Gu, and N. Somasunderam. “On the Numer- ical Rank of the Off-Diagonal Blocks of Schur Complements of Discretized Ellip- tic PDEs”. In: SIAM Journal on Matrix Analysis and Applications 31.5 (2010), pp. 2261–2290. doi: 10.1137/090775932. url: http://link.aip.org/link/ ?SML/31/2261/1. [47] S. Chandrasekaran, M. Gu, and T. Pals. “A fast ULV decomposition solver for hi- erarchically semiseparable representations”. In: SIAM Journal on Matrix Analysis and Applications 28.3 (2006), pp. 603–622. [48] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam. “Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate”. In: ACM Trans. Math. Softw. 35.3 (Oct. 2008), 22:1–22:14. issn: 0098-3500. doi: 10.1145/1391989.1391995. url: http://doi.acm.org/10.1145/1391989. 1391995. [49] H. Cheng, Z. Gimbutas, P. G. Martinsson, and V. Rokhlin. “On the Compression of Low Rank Matrices”. In: SIAM Journal on Scientific Computing 26.4 (2005), pp. 1389–1404. doi: 10.1137/030602678. eprint: http://dx.doi.org/10.1137/ 030602678. url: http://dx.doi.org/10.1137/030602678. [50] T. F. Coleman, A. Edenbrandt, and J. R. Gilbert. “Predicting Fill for Sparse Orthogonal Factorization”. In: J. ACM 33.3 (May 1986), pp. 517–532. issn: 0004- 5411. doi: 10.1145/5925.5932. url: http://doi.acm.org/10.1145/5925.5932. [51] S. Constable. “Ten years of marine CSEM for hydrocarbon exploration”. In: Geo- physics 75.5 (2010), 75A67–75A81. [52] P. Coulier, H. Pouransari, and E. Darve. “The inverse fast multipole method: using a fast approximate direct solver as a preconditioner for dense linear systems”. In: ArXiv e-prints (Aug. 2015). arXiv: 1508.01835 [math.NA]. [53] T. A. Davis, J. R. Gilbert, S. I. Larimore, and E. G. Ng. “A column approximate minimum degree ordering algorithm”. In: ACM Trans. Math. Softw. 30.3 (Sept. 2004), pp. 353–376. issn: 0098-3500. doi: 10.1145/1024074.1024079. url: http: //doi.acm.org/10.1145/1024074.1024079. [54] T. A. Davis. “Algorithm 832: UMFPACK V4.3 — an unsymmetric-pattern mul- tifrontal method”. In: ACM Transactions On Mathematical Software 30.2 (2004), pp. 196–199.

[55] T. A. Davis. “Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank- revealing sparse QR factorization”. In: ACM Trans. Math. Softw. 38.1 (Dec. 2011), 8:1–8:22. issn: 0098-3500. doi: 10.1145/2049662.2049670. url: http://doi. acm.org/10.1145/2049662.2049670. [56] T. A. Davis and Y. Hu. “The university of Florida sparse matrix collection”. In: ACM Trans. Math. Softw. 38.1 (Dec. 2011), 1:1–1:25. issn: 0098-3500. url: http: //doi.acm.org/10.1145/2049662.2049663. [57] J. Demmel. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997. doi: 10.1137/1.9781611971446. eprint: http://epubs. siam.org/doi/pdf/10.1137/1.9781611971446. url: http://epubs.siam.org/ doi/abs/10.1137/1.9781611971446. [58] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. “Communication-optimal Parallel and Sequential QR and LU Factorizations”. In: SIAM J. Sci. Comput. 34.1 (Feb. 2012), pp. 206–239. issn: 1064-8275. url: http://dx.doi.org/10. 1137/080731992. [59] E. W. Dijkstra. “Een algorithme ter voorkoming van de dodelijke omarming”. cir- culated privately. 1965. url: http://www.cs.utexas.edu/users/EWD/ewd01xx/ EWD108.PDF. [60] E. W. Dijkstra. “The Mathematics Behind the Banker’s Algorithm”. English. In: Selected Writings on Computing: A personal Perspective. Texts and Monographs in Computer Science. Springer New York, 1982, pp. 308–312. isbn: 978-1-4612-5697-7. doi: 10.1007/978-1-4612-5695-3_54. url: http://dx.doi.org/10.1007/978- 1-4612-5695-3_54. [61] E. D. Dolan and J. J. Moré. “Benchmarking Optimization Software with Perfor- mance Profiles”. In: Mathematical Programming 91 (2002), pp. 201–213. [62] V. Dolean, P. Jolivet, and F. Nataf. An Introduction to Domain Decomposition Methods. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2015. doi: 10.1137/1.9781611974065. eprint: https://epubs.siam.org/doi/pdf/ 10.1137/1.9781611974065. url: https://epubs.siam.org/doi/abs/10.1137/ 1.9781611974065. [63] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. Van der Vorst. Numerical Linear Algebra for High-Performance Computers. Philadelphia: SIAM Press, 1998. [64] J. Dongarra, M. Faverge, T. Hérault, M. Jacquelin, J. Langou, and Y. Robert. “Hi- erarchical QR factorization algorithms for multi-core clusters”. In: Parallel Com- puting 39.4-5 (2013), pp. 212–232. url: http://hal.inria.fr/hal-00809770. [65] I. S. Duff, R. Guivarch, D. Ruiz, and M. Zenadi. “The Augmented BlockCim- mino Distributed Method”. In: SIAM Journal on Scientific Computing 37.3 (2015), A1248–A1269. doi: 10.1137/140961444. eprint: https://doi.org/10.1137/ 140961444. url: https://doi.org/10.1137/140961444. [66] I. S. Duff and J. K. Reid. “The multifrontal solution of indefinite sparse symmet- ric linear systems”. In: ACM Transactions On Mathematical Software 9 (1983), pp. 302–325. [67] C. Eckart and G. Young. “The approximation of one matrix by another of lower rank”. In: Psychometrika 1.3 (Sept. 1936), pp. 211–218. issn: 1860-0980. doi: 10. 1007/BF02288367. url: https://doi.org/10.1007/BF02288367.

[68] S. C. Eisenstat and J. W. H. Liu. “A tree-based dataflow model for the unsym- metric multifrontal method”. In: Electronic Transactions on Numerical Analysis 21 (2005), pp. 1–19. [69] S. C. Eisenstat and J. W. H. Liu. “Algorithmic Aspects of Elimination Trees for Sparse Unsymmetric Matrices”. In: SIAM Journal on Matrix Analysis and Applications 29.4 (2008), pp. 1363–1381. [70] S. Ellingsrud, T. Eidesmo, S. Johansen, M. C. Sinha, L. M. MacGregor, and S. Constable. “Remote sensing of hydrocarbon layers by seabed logging (SBL): Re- sults from a cruise offshore Angola”. In: The Leading Edge 21.10 (2002), pp. 972– 982. doi: 10.1190/1.1518433. eprint: https://doi.org/10.1190/1.1518433. url: https://doi.org/10.1190/1.1518433. [71] B. Engquist and L. Ying. “Sweeping preconditioner for the Helmholtz equation: Hi- erarchical matrix representation”. In: Communications on Pure and Applied Math- ematics 64.5 (2011), pp. 697–735. issn: 1097-0312. doi: 10.1002/cpa.20358. url: http://dx.doi.org/10.1002/cpa.20358. [72] L. Eyraud-Dubois, L. Marchal, O. Sinnen, and F. Vivien. “Parallel Scheduling of Task Trees with Limited Memory”. In: ACM Trans. Parallel Comput. 2.2 (June 2015), 13:1–13:37. issn: 2329-4949. doi: 10.1145/2779052. url: http://doi. acm.org/10.1145/2779052. [73] A. Geist and E. G. Ng. “Task scheduling for parallel sparse Cholesky factorization”. In: Int J. Parallel Programming 18 (1989), pp. 291–314. [74] A. J. George. “Nested dissection of a regular finite-element mesh”. In: SIAM J. Numer. Anal. 10.2 (1973), pp. 345–363. [75] A. J. George and M. T. Heath. “Solution of Sparse Linear Least Squares Problems Using Givens Rotations”. In: Linear Algebra and its Applications 34 (1980), pp. 69– 83. [76] A. George, J. Liu, and E. Ng. “A Data Structure for Sparse $QR$ and $LU$ Fac- torizations”. In: SIAM Journal on Scientific and Statistical Computing 9.1 (1988), pp. 100–121. doi: 10 . 1137 / 0909008. eprint: https : / / doi . org / 10 . 1137 / 0909008. url: https://doi.org/10.1137/0909008. [77] A. George and E. Ng. “On the Complexity of Sparse $QR$ and $LU$ Factor- ization of Finite-Element Matrices”. In: SIAM Journal on Scientific and Statisti- cal Computing 9.5 (1988), pp. 849–861. doi: 10.1137/0909057. eprint: https: //doi.org/10.1137/0909057. url: https://doi.org/10.1137/0909057. [78] P. Ghysels, X. S. Li, F.-H. Rouet, S. Williams, and A. Napov. “An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling”. In: SIAM Journal on Scientific Computing 38.5 (2016), S358–S384. doi: 10.1137/15M1010117. eprint: https://doi.org/10.1137/ 15M1010117. url: https://doi.org/10.1137/15M1010117. [79] J. R. Gilbert and E. G. Ng. “Predicting structure in nonsymmetric sparse matrix factorizations”. In: Graph Theory and Sparse Matrix Computations. Ed. by J. G. A. George and J. Liu. Springer-Verlag NY, 1993, pp. 107–140. [80] J. R. Gilbert, X. S. Li, E. G. Ng, and B. W. Peyton. “Computing Row and Column Counts for Sparse QR and LU Factorization”. In: BIT Numerical Mathematics 41.4 (2001), pp. 693–710. issn: 1572-9125. doi: 10.1023/A:1021943902025. url: http://dx.doi.org/10.1023/A:1021943902025.

[81] J. R. Gilbert and J. W. H. Liu. “Elimination Structures for Unsymmetric Sparse $LU$ Factors”. In: SIAM Journal on Matrix Analysis and Applications 14.2 (1993), pp. 334–352. doi: 10 . 1137 / 0614024. eprint: https : / / doi . org / 10 . 1137 / 0614024. url: https://doi.org/10.1137/0614024. [82] A. Gillman, P. Young, and P.-G. Martinsson. “A direct solver with O(N) com- plexity for integral equations on one-dimensional domains”. In: Frontiers of Math- ematics in China 7 (2 2012), pp. 217–247. issn: 1673-3452. [83] L. Giraud and A. Haidar. “Parallel algebraic hybrid solvers for large 3D convection- diffusion problems”. In: Numerical Algorithms 51.2 (June 2009), pp. 151–177. issn: 1572-9265. doi: 10.1007/s11075-008-9248-x. url: https://doi.org/10.1007/ s11075-008-9248-x. [84] W. Givens. “Computation of Plane Unitary Rotations Transforming a General Matrix to Triangular Form”. English. In: Journal of the Society for Industrial and Applied Mathematics 6.1 (1958), pp. 26–50. issn: 03684245. url: http://www. jstor.org/stable/2098861. [85] G. Golub. “Numerical methods for solving linear least squares problems”. English. In: Numerische Mathematik 7.3 (1965), pp. 206–216. issn: 0029-599X. doi: 10. 1007/BF01436075. url: http://dx.doi.org/10.1007/BF01436075. [86] G. H. Golub and C. F. Van Loan. Matrix Computations. 4th ed. Baltimore, MD.: Johns Hopkins Press, 2012. [87] L. Grasedyck, R. Kriemann, and S. Le Borne. “Parallel black box H-LU precon- ditioning for elliptic boundary value problems”. In: Computing and Visualization in Science 11.4 (2008), pp. 273–291. issn: 1433-0369. doi: 10.1007/s00791-008- 0098-9. url: http://dx.doi.org/10.1007/s00791-008-0098-9. [88] A. Guermouche and J. Y. L’excellent. “Constructing memory-minimizing sched- ules for multifrontal methods”. In: ACM Transactions on Mathematical Software (TOMS) 32.1 (2006), pp. 17–32. [89] A. Guermouche, J.-Y. L’Excellent, and G. Utard. “Impact of Reordering on the Memory of a Multifrontal Solver”. In: Parallel Computing 29.9 (2003), pp. 1191– 1218. [90] A. Gupta. “A Shared- and distributed-memory parallel general sparse direct solver”. In: Appl. Algebra Eng. Commun. Comput. 18.3 (2007), pp. 263–277. [91] M. H. Gutknecht. “Variants of BICGSTAB for matrices with complex spectrum”. In: SIAM Journal on Scientific Computing 14.5 (1993), pp. 1020–1033. [92] T. M. Habashy and A. Abubakar. “A general framework for constraint minimiza- tion for the inversion of electromagnetic measurements”. In: Progress in electro- magnetics Research 46 (2004), pp. 265–312. [93] W. Hackbusch, B. N. Khoromskij, and R. Kriemann. “Hierarchical Matrices Based on a Weak Admissibility Criterion”. English. In: Computing 73.3 (2004), pp. 207– 243. issn: 0010-485X. doi: 10.1007/s00607-004-0080-4. url: http://dx.doi. org/10.1007/s00607-004-0080-4. [94] W. Hackbusch. “A sparse matrix arithmetic based on H-matrices. Part I: intro- duction to H-matrices”. In: Computing 62.2 (1999), pp. 89–108. issn: 0010-485X. doi: http://dx.doi.org/10.1007/s006070050015.

[95] W. Hackbusch. Hierarchical matrices : algorithms and analysis. Vol. 49. Springer series in computational mathematics. Berlin: Springer, 2015, pp. xxv, 511. doi: 10.1007/978-3-662-47324-5. [96] A. Haidar, A. Abdelfattah, M. Zounon, S. Tomov, and J. Dongarra. “A Guide For Achieving High Performance With Very Small Matrices On GPU: A case Study of Batched LU and Cholesky Factorizations”. In: IEEE Transactions on Parallel and Distributed Systems PP.99 (2017), pp. 1–1. issn: 1045-9219. doi: 10.1109/TPDS.2017.2783929. [97] N. Halko, P.-G. Martinsson, and J. A. Tropp. “Finding Structure with Random- ness: Probabilistic Algorithms for Constructing Approximate Matrix Decompo- sitions”. In: SIAM Review 53.2 (2011), pp. 217–288. doi: 10.1137/090771806. eprint: http://dx.doi.org/10.1137/090771806. url: http://dx.doi.org/10. 1137/090771806. [98] M. T. Heath. “Some Extensions of an Algorithm for Sparse Linear Least Squares Problems”. In: SIAM Journal on Scientific and Statistical Computing 3.2 (1982), pp. 223–237. doi: 10.1137/0903014. eprint: http://dx.doi.org/10.1137/ 0903014. url: http://dx.doi.org/10.1137/0903014. [99] P. Hénon, P. Ramet, and J. Roman. “PaStiX: A High-Performance Parallel Direct Solver for Sparse Symmetric Definite Systems”. In: Parallel Computing 28.2 (Jan. 2002), pp. 301–321. [100] H. P. Hofstee. “Power Efficient Processor Architecture and The Cell Processor”. In: Proceedings of the 11th International Symposium on High-Performance Computer Architecture. HPCA ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 258–262. isbn: 0-7695-2275-0. doi: 10 . 1109 / HPCA . 2005 . 26. url: http : //dx.doi.org/10.1109/HPCA.2005.26. [101] J. Hogg, J. K. Reid, and J. A. Scott. “Design of a Multicore Sparse Cholesky Fac- torization Using DAGs”. In: SIAM J. Scientific Computing 32.6 (2010), pp. 3627– 3649. [102] A. S. Householder. “Unitary Triangularization of a Nonsymmetric Matrix”. In: J. ACM 5.4 (Oct. 1958), pp. 339–342. issn: 0004-5411. doi: 10.1145/320941. 320947. url: http://doi.acm.org/10.1145/320941.320947. [103] M. Jacquelin, L. Marchal, Y. Robert, and B. Uçar. “On Optimal Tree Traversals for Sparse Matrix Factorization”. In: Proceedings of 25th International Parallel and Distributed Processing Symposium (IPDPS’11). IEEE Computer Society, 2011. [104] P. Jaysaval, D. Shantsev, and S. de la Kethulle de Ryhove. “Fast multimodel finite-difference controlled-source electromagnetic simulations based on a Schur complement approach”. In: Geophysics 79.6 (2014), E315–E327. [105] G. Karypis and V. Kumar. “A Fast and High Quality Multilevel Scheme for Parti- tioning Irregular Graphs”. In: SIAM J. Sci. Comput. 20 (1 Dec. 1998), pp. 359–392. issn: 1064-8275. doi: http://dx.doi.org/10.1137/S1064827595287997. url: http://dx.doi.org/10.1137/S1064827595287997. [106] K. Key. “Marine Electromagnetic Studies of Seafloor Resources and Tectonics”. In: Surveys in Geophysics 33.1 (2012), pp. 135–167. issn: 1573-0956. doi: 10.1007/ s10712-011-9139-x. url: http://dx.doi.org/10.1007/s10712-011-9139-x. [107] A. Kleen. An NUMA API for Linux. Tech. rep. SUSE Labs, 2004.

[108] J. Kurzak and J. Dongarra. “Fully Dynamic Scheduler for Numerical Computing on Multicore Processors”. In: LAPACK working note lawn220 (2009). [109] J.-Y. L’Excellent. “Multifrontal methods for large sparse systems of linear equa- tions: parallelism, memory usage, performance optimization and numerical issues”. Habilitation. École Normale Supérieure de Lyon, 2012. [110] J.-Y. L’Excellent and M. W. Sid-Lakhdar. “A study of shared-memory parallelism in a multifrontal solver”. In: Parallel Computing 40.3-4 (2014), pp. 34–46. [111] X. Lacoste. “Scheduling and memory optimizations for sparse direct solver on multi-core/multi-GPU cluster systems”. PhD thesis. Talence, France: LaBRI, Uni- versité Bordeaux, Feb. 2015. url: http://www.labri.fr/~ramet/restricted/ these_lacoste_preprint.pdf. [112] X. S. Li and J. W. Demmel. “Making Sparse Gaussian Elimination Scalable by Static Pivoting”. In: Supercomputing, 1998.SC98. IEEE/ACM Conference on. Nov. 1998, pp. 34–34. doi: 10.1109/SC.1998.10030. [113] X. S. Li and J. W. Demmel. “SuperLU_DIST: A Scalable Distributed-memory Sparse Direct Solver for Unsymmetric Linear Systems”. In: ACM Trans. Math. Softw. 29.2 (June 2003), pp. 110–140. issn: 0098-3500. doi: 10.1145/779359. 779361. url: http://doi.acm.org/10.1145/779359.779361. [114] E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert. “Randomized algorithms for the low-rank approximation of matrices”. In: Proceedings of the National Academy of Sciences 104.51 (2007), pp. 20167–20172. [115] J. W. H. Liu, A. George, and E. Ng. “Communication results for parallel sparse Cholesky factorization on a hypercube”. In: Parallel Computing 10 (1989), pp. 287– 298. [116] J. W. H. Liu. “An Application of Generalized Tree Pebbling to Sparse Matrix Factorization”. In: SIAM J. Algebraic Discrete Methods 8.3 (July 1987), pp. 375– 395. issn: 0196-5212. doi: 10.1137/0608031. url: http://dx.doi.org/10. 1137/0608031. [117] J. W. H. Liu. “Modification of the Minimum-degree Algorithm by Multiple Elim- ination”. In: ACM Trans. Math. Softw. 11.2 (June 1985), pp. 141–153. issn: 0098- 3500. doi: 10.1145/214392.214398. url: http://doi.acm.org/10.1145/ 214392.214398. [118] J. W. H. Liu. “On the storage requirement in the out-of-core multifrontal method for sparse factorization”. In: ACM Transactions On Mathematical Software 12 (1986), pp. 127–148. [119] J. W. H. Liu. “The Multifrontal Method for Sparse Matrix Solution: Theory and Practice”. In: SIAM Review 34.1 (1992), pp. 82–109. doi: 10.1137/1034004. url: https://doi.org/10.1137/1034004. [120] F. Lopez. “Task-based multifrontal QR solver for heterogeneous architectures”. PhD thesis. Université Paul Sabatier, 2015. url: http : / / www . theses . fr / 2015TOU30303/document. [121] R. Luce and E. G. Ng. “On the Minimum FLOPs Problem in the Sparse Cholesky Factorization”. In: SIAM Journal on Matrix Analysis and Applications 35.1 (2014), pp. 1–21. doi: 10 . 1137 / 130912438. eprint: https : / / doi . org / 10 . 1137 / 130912438. url: https://doi.org/10.1137/130912438.

[122] K. Marfurt. “Accuracy of finite-difference and finite-element modeling of the scalar and elastic wave equations”. In: Geophysics 49 (1984), pp. 533–549. [123] P. G. Martinsson. “A Fast Randomized Algorithm for Computing a Hierarchically Semiseparable Representation of a Matrix”. In: SIAM Journal on Matrix Analysis and Applications 32.4 (2011), pp. 1251–1274. doi: 10.1137/100786617. eprint: https://doi.org/10.1137/100786617. url: https://doi.org/10.1137/ 100786617. [124] P. G. Martinsson. “Compressing Rank-Structured Matrices via Randomized Sam- pling”. In: SIAM Journal on Scientific Computing 38.4 (2016), A1959–A1986. doi: 10.1137/15M1016679. eprint: https://doi.org/10.1137/15M1016679. url: https://doi.org/10.1137/15M1016679. [125] T. Mary. “Block Low-Rank multifrontal solvers: complexity, performance, and scalability”. PhD thesis. Toulouse, France: EDMITT, Université Paul Sabatier, Nov. 2017. [126] P. Matstoms. Parallel Sparse QR factorization on shared memory architectures. Tech. rep. LiTH-MAT-R-1993-18. Department of Mathematics, 1993. [127] J. D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. Tech. rep. Charlottesville, Virginia: University of Virginia, 1991-2007. url: http://www.cs.virginia.edu/stream/. [128] P. J. Mucci, S. Browne, C. Deane, and G. Ho. PAPI: A Portable Interface to Hardware Performance Counters. Proceedings of Department of Defense HPCMP Users Group Conference. 1999. [129] W. Mulder. “A multigrid solver for 3D electromagnetic diffusion”. In: Geophysical prospecting 54.5 (2006), pp. 633–649. [130] S. Operto, A. Miniussi, R. Brossier, L. Combe, L. Métivier, V. Monteiller, A. Ribodetti, and J. Virieux. “Efficient 3-D frequency-domain mono-parameter full- waveform inversion of ocean-bottom cable data: application to Valhall in the visco- acoustic vertical transverse isotropic approximation”. In: Geophysical Journal In- ternational 202.2 (2015), pp. 1362–1391. [131] S. Operto, J. Virieux, P. R. Amestoy, J.-Y. L’Excellent, L. Giraud, and H. Ben Hadj Ali. “3D finite-difference frequency-domain modeling of visco-acoustic wave propagation using a massively parallel direct solver: A feasibility study”. In: Geo- physics 72.5 (2007), pp. 195–211. doi: 10.1190/1.2759835. url: http://link. aip.org/link/?GPY/72/SM195/1. [132] J. Peiró and S. Sherwin. “Finite difference, finite element and finite volume meth- ods for partial differential equations”. In: Handbook of materials modeling. Springer, 2005, pp. 2415–2446. [133] F. Pellegrini and J. Roman. “Scotch: A Software Package for Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs”. In: Proceedings of HPCN’96, Brussels, LNCS 1067. Apr. 1996, pp. 493–498. [134] G. Pichon, E. Darve, M. Faverge, P. Ramet, and J. Roman. “Sparse Supernodal Solver Using Block Low-Rank Compression”. In: 2017 IEEE International Par- allel and Distributed Processing Symposium Workshops (IPDPSW). May 2017, pp. 1138–1147. doi: 10.1109/IPDPSW.2017.86.

[135] D. J. Pierce and J. G. Lewis. “Sparse Multifrontal Rank Revealing QR Factoriza- tion”. In: SIAM J. Matrix Anal. Appl. 18.1 (1997), pp. 159–180. issn: 0895-4798. doi: http://dx.doi.org/10.1137/S0895479893244353. [136] A. Pothen and C. Sun. “A mapping algorithm for parallel sparse Cholesky factor- ization”. In: SIAM Journal on Scientific Computing 14 (1993), pp. 1253–1253. [137] S. G. N. Prasanna and B. R. Musicus. “Generalized multiprocessor scheduling and applications to matrix computations”. In: IEEE Transactions on Parallel and Distributed Systems 7.6 (1996), pp. 650–664. [138] J. Rigal and J. Gaches. “On the compatibility of a given solution with the data of a linear system”. In: J. Assoc. Comput. Mach. 14 (1967), pp. 526–543. [139] D. J. Rose, R. E. Tarjan, and G. S. Lueker. “Algorithmic Aspects of Vertex Elim- ination on Graphs”. In: SIAM Journal on Computing 5.2 (1976), pp. 266–283. doi: 10.1137/0205021. eprint: http://dx.doi.org/10.1137/0205021. url: http://dx.doi.org/10.1137/0205021. [140] F.-H. Rouet. “Memory and performance issues in parallel multifrontal factoriza- tions and triangular solutions with sparse right-hand sides”. anglais. Thèse de doc- torat. Toulouse, France: Institut National Polytechnique de Toulouse, Oct. 2012. url: http://tel.archives-ouvertes.fr/tel-00785748. [141] O. Schenk, K. Gärtner, and W. Fichtner. “Efficient Sparse LU Factorization with Left-Right Looking Strategy on Shared Memory Multiprocessors”. In: BIT Numer- ical Mathematics 40.1 (Mar. 2000), pp. 158–176. issn: 1572-9125. doi: 10.1023/A: 1022326604210. url: https://doi.org/10.1023/A:1022326604210. [142] E. Schmidt. “Über die Auflösung linearer Gleichungen mit Unendlich vielen un- bekannten”. German. In: Rendiconti del Circolo Matematico di Palermo (1884- 1940) 25.1 (1908), pp. 53–77. doi: 10.1007/BF03029116. url: http://dx.doi. org/10.1007/BF03029116. [143] R. Schreiber and C. Van Loan. “A storage-efficient WY representation for products of Householder transformations”. In: SIAM J. Sci. Stat. Comput. 10 (1989), pp. 52– 57. [144] R. Schreiber. “A new implementation of sparse Gaussian elimination”. In: ACM Transactions On Mathematical Software 8 (1982), pp. 256–276. [145] M. Sergent, D. Goudin, S. Thibault, and O. Aumage. “Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System”. In: 2016 IEEE International Parallel and Distributed Processing Symposium Work- shops (IPDPSW). May 2016, pp. 318–327. doi: 10.1109/IPDPSW.2016.105. [146] R. Sethi. “Complete Register Allocation Problems”. In: Proceedings of the Fifth Annual ACM Symposium on Theory of Computing. STOC ’73. Austin, Texas, USA: ACM, 1973, pp. 182–195. doi: 10.1145/800125.804049. url: http://doi. acm.org/10.1145/800125.804049. [147] R. Sethi and J. D. Ullman. “The Generation of Optimal Code for Arithmetic Expressions”. In: J. ACM 17.4 (Oct. 1970), pp. 715–728. issn: 0004-5411. doi: 10.1145/321607.321620. url: http://doi.acm.org/10.1145/321607.321620. [148] W. M. Sid-Lakhdar. “Scaling multifrontal methods for the solution of large sparse linear systems on hybrid shared-distributed memory architectures”. Ph.D. disser- tation. ENS Lyon, Dec. 2014.

188 To appear or submitted

[149] T. Slavova. “Parallel triangular solution in the out-of-core multifrontal approach for solving large sparse linear systems”. Available as CERFACS Report TH/- PA/09/59. Ph.D. dissertation. Institut National Polytechnique de Toulouse, Apr. 2009. [150] E. Solomonik and J. Demmel. “Communication-optimal Parallel 2.5D Matrix Mul- tiplication and LU Factorization Algorithms”. In: Proceedings of the 17th Inter- national Conference on Parallel Processing - Volume Part II. Euro-Par’11. Bor- deaux, France: Springer-Verlag, 2011, pp. 90–109. isbn: 978-3-642-23396-8. url: http://dl.acm.org/citation.cfm?id=2033408.2033420. [151] A. Tarantola. “Inversion of seismic reflection data in the acoustic approximation”. In: Geophysics 49.8 (1984), pp. 1259–1266. [152] A. N. Tikhonov. “Regularization of incorrectly posed problems”. In: Soviet Math. Dokl. 4 (1963), pp. 1624–1627. [153] H. Topcuouglu, S. Hariri, and M.-Y. Wu. “Performance-effective and low-complexity task scheduling for heterogeneous computing”. In: IEEE Transactions on Paral- lel and Distributed Systems 13.3 (Mar. 2002), pp. 260–274. issn: 1045-9219. doi: 10.1109/71.993206. [154] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, 1997. isbn: 0898713617. [155] H. A. Van der Vorst. “Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems”. In: SIAM Journal on scientific and Statistical Computing 13.2 (1992), pp. 631–644. [156] J. Virieux and S. Operto. “An overview of full waveform inversion in exploration geophysics”. In: Geophysics 74.6 (2009), WCC1–WCC26. [157] S. Wang, X. S. Li, F.-H. Rouet, J. Xia, and M. V. De Hoop. “A Parallel Geometric Multifrontal Solver Using Hierarchically Semiseparable Structure”. In: ACM Trans- actions on Mathematical Software 42.3 (May 2016), 21:1–21:21. issn: 0098-3500. doi: 10.1145/2830569. url: http://doi.acm.org/10.1145/2830569. [158] S. Wang, X. S. Li, F.-H. Rouet, J. Xia, and M. Van de Hoop. “A Parallel Geometric Multifrontal Solver Using Hierarchically Semiseparable Structure”. In: Submitted to ACM Trans. Math. Softw. (2013). [159] C. Weisbecker. “Improving multifrontal solvers by means of algebraic Block Low- Rank representations”. PhD thesis. Toulouse, France: Institut National Polytech- nique de Toulouse, Oct. 2013. url: http://ethesis.inp-toulouse.fr/archive/ 00002471/01/thesis.pdf. [160] J. H. Wilkinson. Rounding Errors in Algebraic Processes. Englewood Cliffs, New Jersey: Prentice-Hall, 1963. [161] S. Williams, A. Waterman, and D. Patterson. “Roofline: An Insightful Visual Performance Model for Multicore Architectures”. In: Commun. ACM 52.4 (Apr. 2009), pp. 65–76. issn: 0001-0782. doi: 10.1145/1498765.1498785. url: http: //doi.acm.org/10.1145/1498765.1498785. [162] W. Wu, A. Bouteiller, G. Bosilca, M. Faverge, and J. Dongarra. “Hierarchical DAG scheduling for Hybrid Distributed Systems”. In: 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS). Hyderabad, India, May 2015, pp. 156–165.

189 To appear or submitted

[163] J. Xia. “Efficient Structured Multifrontal Factorization for General Large Sparse Matrices”. In: SIAM Journal on Scientific Computing 35.2 (2013), A832–A860. doi: 10.1137/120867032. eprint: http://dx.doi.org/10.1137/120867032. url: http://dx.doi.org/10.1137/120867032. [164] J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li. “Superfast Multifrontal Method for Large Structured Linear Systems of Equations”. In: SIAM Journal on Matrix Analysis and Applications 31.3 (2009), pp. 1382–1411. doi: 10.1137/09074543X. url: http://link.aip.org/link/?SML/31/1382/1. [165] M. Yannakakis. “Computing the Minimum Fill-In is NP-Complete”. In: SIAM Journal on Algebraic Discrete Methods 2.1 (1981), pp. 77–79. doi: 10 . 1137 / 0602010. eprint: https://doi.org/10.1137/0602010. url: https://doi.org/ 10.1137/0602010. [166] A. YarKhan, J. Kurzak, and J. Dongarra. QUARK Users’ Guide: QUeueing And Runtime for Kernels. Tech. rep. Innovative Computing Laboratory, University of Tennessee, 2011.

Appendix A

Experimental setup

A.1 Matrices

Tables A.1 and A.2 list the matrices used for the experiments presented in Chapters 3, 4 and 5.

 #  Mat. name     Source  m        n        nz        Ordering  op. count (Gflops)
 0  image_interp  UFMC    240000   120000   711683    COLAMD         30.2
 1  LargeRegFile  UFMC    2111154  801374   4944201   COLAMD         84.7
 2  cont11_l      UFMC    1468599  1961395  5382999   COLAMD        184.5
 3  EternityII_E  UFMC    11077    262144   1572792   COLAMD        544.0
 4  degme         UFMC    185501   659415   8127528   COLAMD        591.9
 5  cat_ears_4_4  UFMC    19020    44448    132888    COLAMD        716.1
 6  Hirlam        UFMC    1385270  452200   2713200   COLAMD       2339.9
 7  e18           UFMC    24617    38602    156466    COLAMD       3399.5
 8  flower_7_4    UFMC    27693    67593    202218    COLAMD       4261.2
 9  Rucci1        UFMC    1977885  109900   7791168   COLAMD      12768.5
10  sls           UFMC    1748122  62729    6804304   COLAMD      22716.2
11  TF17          UFMC    38132    48630    586218    COLAMD      38203.1
12  hirlam        Hirlam  1385270  452200   2713200   SCOTCH       1384
13  flower_8_4    UFMC    55081    125361   375266    SCOTCH       2851
14  Rucci1        UFMC    1977885  109900   7791168   SCOTCH       5671
15  ch8-8-b3      UFMC    117600   18816    470400    SCOTCH      10709
16  GL7d24        UFMC    21074    105054   593892    SCOTCH      16467
17  neos2         UFMC    132568   134128   685087    SCOTCH      20170
18  spal_004      UFMC    10203    321696   46168124  SCOTCH      30335
19  n4c6-b6       UFMC    104115   51813    728805    SCOTCH      62245
20  sls           UFMC    1748122  62729    6804304   SCOTCH      65607
21  TF18          UFMC    95368    123867   1597545   SCOTCH     194472
22  lp_nug30      UFMC    95368    123867   1597545   SCOTCH     221644
23  mk13-b5       UFMC    135135   270270   810810    SCOTCH     259751

Table A.1: Complete set of matrices for the experiments in Chapter 3.
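As a side note (not part of the original experimental pipeline), the basic statistics reported in this table, namely the number of rows m, of columns n and of nonzeros nz, can be recomputed directly from the Matrix Market files distributed by the SuiteSparse (formerly University of Florida, here denoted UFMC) collection. The following minimal Python sketch illustrates this; the file name is only a placeholder and the matrix is assumed to have been downloaded beforehand.

```python
# Minimal sketch: recompute the m, n and nz columns of Table A.1 for one matrix.
# "Rucci1.mtx" is a placeholder path for a matrix downloaded from the
# SuiteSparse (UFMC) collection in Matrix Market format.
import scipy.io

A = scipy.io.mmread("Rucci1.mtx").tocsr()  # read as COO, convert to sparse CSR
m, n = A.shape
print(f"m = {m}, n = {n}, nz = {A.nnz}")
```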


application                matrix         ID  arith.  fact.  n      nnz   flops     factor size
seismic modeling           5Hz             1  c       LU     2.9M   70M   69.5 TF   61.4 GB
(SEISCOPE)                 7Hz             2  c       LU     7.2M   177M  471.1 TF  219.6 GB
                           10Hz            3  c       LU     17.2M  446M  2.7 PF    728.1 GB
electromagnetic modeling   H3              4  z       LDLT   2.9M   37M   57.9 TF   77.5 GB
(EMGS)                     H17             5  z       LDLT   17.4M  226M  2.2 PF    891.1 GB
                           S3              6  z       LDLT   3.3M   43M   78.0 TF   94.6 GB
                           S21             7  z       LDLT   20.6M  266M  3.2 PF    1.1 TB
structural mechanics       perf008d        8  d       LDLT   1.9M   81M   101.0 TF  52.6 GB
(EDF Code_Aster)           perf008ar       9  d       LDLT   3.9M   159M  377.5 TF  129.8 GB
                           perf008cr      10  d       LDLT   7.9M   321M  1.6 PF    341.1 GB
                           perf009ar      11  d       LDLT   5.4M   209M  23.6 TF   40.5 GB
computational fluid        StocF-1465     12  d       LDLT   1.5M   11M   4.7 TF    9.6 GB
dynamics (SSMC)            atmosmodd      13  d       LU     1.3M   9M    13.8 TF   16.7 GB
                           HV15R          14  d       LU     2.0M   283M  1.9 PF    414.1 GB
structural problems        Serena         15  d       LDLT   1.4M   33M   31.6 TF   23.1 GB
(SSMC)                     Geo_1438       16  d       LU     1.4M   32M   39.3 TF   41.6 GB
                           Cube_Coup_dt0  17  d       LDLT   2.2M   65M   98.9 TF   55.0 GB
                           Queen_4147     18  d       LDLT   4.1M   167M  261.1 TF  114.5 GB
DNA electrophoresis        cage13         19  d       LU     0.4M   7M    80.1 TF   35.9 GB
(SSMC)                     cage14         20  d       LU     1.5M   27M   4.1 PF    442.7 GB
optimization               nlpkkt80       21  d       LDLT   1.1M   15M   15.1 TF   14.4 GB
(SSMC)                     nlpkkt120      22  d       LDLT   3.5M   50M   248.4 TF  86.5 GB

Table A.2: Complete set of matrices for the experiments in Section 5.5 and their Full-Rank statistics: arithmetic (c=single complex, z=double complex, d=double real), order (n), number of nonzeros (nnz), number of operations for the factorization (flops), and memory required to store the factor entries (factor size).
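For clarity, the factor size column is simply the number of factor entries multiplied by the storage size of one entry: 8 bytes for d (double real) and c (single complex), 16 bytes for z (double complex). The sketch below, added here as an illustration and assuming binary gigabytes, inverts this relation to recover the approximate number of factor entries implied by a reported factor size.

```python
# Illustrative only: relation between the "factor size" column of Table A.2
# and the number of factor entries, i.e. size = entries x bytes_per_entry.
BYTES_PER_ENTRY = {"c": 8, "z": 16, "d": 8}  # single complex, double complex, double real

def factor_entries(size_gb, arith):
    """Approximate number of factor entries implied by a factor size in (binary) GB."""
    return size_gb * 2**30 / BYTES_PER_ENTRY[arith]

# Example: matrix H3 (z arithmetic, 77.5 GB) holds roughly 5.2e9 factor entries.
print(f"{factor_entries(77.5, 'z'):.2e}")
```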

A.2 Computers

The following list describes the systems used for the experiments presented in Chapters 3, 4 and 5:

• brutus: this system is equipped with eight dual-core AMD Opteron 8214 processors clocked at 2.2 GHz, for a total of 16 cores and a peak performance of 70.4 Gflop/s for real, double precision computations. It has 65 GB of main memory.

• dude: this system is equipped with four hexa-core AMD Istanbul processors clocked at 2.4 GHz, for a total of 24 cores. Each CPU is attached to a DRAM module through two DRAM controllers, and the CPUs are connected to each other through HyperTransport links in a ring layout.

• vargas: this machine, installed at the IDRIS supercomputing center (grant x2012065063), is made of IBM Power6 p575 nodes. Each node is equipped with 16 dual-core Power6 processors clocked at 4.7 GHz, grouped in sets of four called MCMs (Multi-Chip Modules); with four MCMs per node, this amounts to 32 cores per node. Processors within an MCM are fully interconnected, as are the MCMs within a node.

• ada: this is an IBM x3750-M4 system installed at the IDRIS supercomputing center (grant 2014-i2014065063), equipped with four eight-core Intel Sandy Bridge E5-4650 processors and 128 GB of memory per node. The cores are clocked at 2.7 GHz and are equipped with Intel AVX SIMD units; the peak performance is 21.6 Gflop/s per core and thus 691.2 Gflop/s per node for real, double precision computations (see the sketch after this list for the underlying arithmetic).

• sirocco: this is a five-node cluster that is part of the PlaFRIM platform. Each node is equipped with two twelve-core Intel Xeon E5-2680 (Haswell) processors and 124 GB of memory. The cores are clocked at 2.5 GHz and are equipped with Intel AVX SIMD units. In addition, each node is accelerated with four Nvidia K40M GPUs; the peak performance is 40.0 Gflop/s per core, 1.4 Tflop/s per GPU and thus 6.0 Tflop/s per node for real, double precision computations.

• brunch: a shared-memory machine installed at the LIP laboratory of ENS-Lyon, equipped with 1.5 TB of memory and four 24-core Intel Broadwell E7-8890v4 processors running at a frequency that varies between 2.2 and 3.4 GHz due to the Turbo Boost technology.

• eos: the supercomputer of the Calcul en Midi-Pyrénées (CALMIP) center (grant P0989, since 2008). Each of its 612 nodes is equipped with 64 GB of memory and two 10-core Intel Ivy Bridge processors running at 2.8 GHz. The nodes are interconnected with an Infiniband FDR network with a bandwidth of 6.89 GB/s.

• licallo: the supercomputer of the SIGAMM mesocenter at the Observatoire de la Côte d'Azur (OCA). Each of its 102 nodes is equipped with 64 GB of memory and two 10-core Intel Ivy Bridge processors running at 2.5 GHz. The nodes are interconnected with an Infiniband FDR network.

• farad: a shared-memory machine equipped with 264 GB of memory and two 16-core Intel Sandy Bridge processors running at 2.9 GHz.
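The per-core and per-node peak figures quoted above for brutus, ada and sirocco follow from the usual estimate peak = cores × clock × flops per cycle. The sketch below, added here as an illustration, makes this arithmetic explicit; the flops-per-cycle values are assumptions based on the SIMD width and FMA capability of each architecture, not measured values.

```python
# Illustrative sketch of the arithmetic behind the peak-performance figures
# quoted in the list above: peak = cores x clock (GHz) x flops per cycle per core.
# Assumed flops/cycle for real, double precision: SSE2 -> 2, AVX -> 8,
# AVX2 with two FMA units -> 16.

def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in Gflop/s for real, double-precision computations."""
    return cores * clock_ghz * flops_per_cycle

print(peak_gflops(16, 2.2, 2))   # brutus:   70.4 Gflop/s per node
print(peak_gflops(1,  2.7, 8))   # ada:      21.6 Gflop/s per core
print(peak_gflops(32, 2.7, 8))   # ada:     691.2 Gflop/s per node
print(peak_gflops(1,  2.5, 16))  # sirocco:  40.0 Gflop/s per core
```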
