A Distributed and Parallel Asynchronous Unite and Conquer Method to Solve Large Scale Non-Hermitian Linear Systems
Xinzhe Wu, Serge Petiton
HAL Id: hal-01677110, https://hal.archives-ouvertes.fr/hal-01677110, submitted on 8 Jan 2018.

Xinzhe WU
Maison de la Simulation, USR 3441
CRIStAL, UMR CNRS 9189
Université de Lille 1, Sciences et Technologies
F-91191, Gif-sur-Yvette, France

Serge G. Petiton
Maison de la Simulation, USR 3441
CRIStAL, UMR CNRS 9189
Université de Lille 1, Sciences et Technologies
F-91191, Gif-sur-Yvette, France

ABSTRACT
Parallel Krylov subspace methods are commonly used to solve large-scale sparse linear systems. Facing the development of extreme-scale platforms, minimizing synchronous global communication becomes critical to obtaining good efficiency and scalability. This paper highlights a recent development of a hybrid (unite and conquer) method, which combines three computation algorithms with asynchronous communication to accelerate the resolution of non-Hermitian linear systems and to improve its fault tolerance and reusability. Experiments show that our method achieves up to a 5× speedup and better scalability than conventional methods on hierarchical clusters with hundreds of nodes.

KEYWORDS
linear algebra, iterative methods, asynchronous communication

ACM Reference Format:
Xinzhe WU and Serge G. Petiton. 2018. A Distributed and Parallel Asynchronous Unite and Conquer Method to Solve Large Scale Non-Hermitian Linear Systems. In HPC Asia 2018: International Conference on High Performance Computing in Asia-Pacific Region, January 28–31, 2018, Chiyoda, Tokyo, Japan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3149457.3154481

1 INTRODUCTION
Scientific applications require solving large-scale non-Hermitian linear systems Ax = b. Krylov iterative methods such as the Generalized Minimal Residual method (GMRES) [20], the Conjugate Gradient method (CG) [13] and the Biconjugate Gradient Stabilized method (BiCGSTAB) [22] are used to solve different kinds of linear systems. These iterative methods approximate a solution x_m for a given matrix A and right-hand side b starting from an initial guess x_0. In practice, these methods are always restarted after a specific number of iterations, because their memory and computational requirements grow with the iteration number. These methods are already well implemented in parallel to profit from the large number of computation cores on large clusters. However, basic iterative methods do not always converge quickly; the convergence rate depends on the spectral properties of the operator matrix. Researchers therefore introduced preconditioners, which combine stationary and iterative methods to improve the spectral properties of the operator A and to accelerate convergence. Such preconditioners include the incomplete LU factorization preconditioner (ILU) [6], the Jacobi preconditioner [5], the successive over-relaxation preconditioner (SOR) [1], etc. Meanwhile, there is a class of deflated preconditioners which use the eigenvalues approximated during the solving procedure to form a new initial vector for the next restart, which speeds up further computation. Erhel [11] studied a deflation technique for the restarted GMRES algorithm, based on an invariant subspace approximation which is updated at each cycle. Lehoucq [15] introduced a deflation procedure to improve the convergence of an Implicitly Restarted Arnoldi Method (IRAM) for computing the eigenvalues of large matrices. Saad [21] presented a deflated version of the conjugate gradient algorithm for solving linear systems. These iterative methods have been good tools for solving linear systems during the past decades.
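To make the restart and preconditioning ingredients above concrete, the following is a minimal sketch, not code from the paper, of how a restarted and preconditioned GMRES solve is typically configured with PETSc. The function name, the restart length 30, the tolerance values and the choice of block Jacobi are illustrative assumptions, and the PetscCall macro requires a recent PETSc release.

```c
#include <petscksp.h>

/* Illustrative sketch only: restarted GMRES with a classic preconditioner in
 * PETSc. A, b and x are assumed to be an already assembled parallel Mat/Vec. */
PetscErrorCode solve_restarted_gmres(Mat A, Vec b, Vec x)
{
  KSP ksp;
  PC  pc;

  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPGMRES));
  PetscCall(KSPGMRESSetRestart(ksp, 30));   /* GMRES(30): restart every 30 iterations */
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCBJACOBI));      /* block Jacobi; PCSOR or PCILU are other classic choices */
  PetscCall(KSPSetTolerances(ksp, 1e-8, PETSC_DEFAULT, PETSC_DEFAULT, 5000));
  PetscCall(KSPSolve(ksp, b, x));
  PetscCall(KSPDestroy(&ksp));
  return 0;
}
```

Each restart discards the Krylov basis built so far; the deflated variants cited above instead reuse the eigenvalue information gathered before the restart to improve the next cycle.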
Nowadays, HPC cluster systems continue not only to scale up in compute node and Central Processing Unit (CPU) core count, but also to increase the heterogeneity of their components with the introduction of Graphics Processing Units (GPUs) and other accelerators. This trend leads to multi- and many-core computing nodes which communicate explicitly through fast interconnection networks. These hierarchical supercomputers can be seen as the intersection of distributed and parallel computing. Indeed, with a large number of cores, the global reductions and overall synchronizations of an application become the bottleneck. When solving a large-scale problem on parallel architectures with preconditioned Krylov methods, the cost per iteration of the method becomes the most significant concern, typically because of communication and synchronization overheads [8]. Consequently, large scalar products, overall synchronization, and other operations involving communication among all cores have to be avoided, and numerical applications should be optimized for more local and less global communication. To benefit from the full computational power of such hierarchical systems, it is essential to explore novel parallel methods and models for solving linear systems. These methods should not only accelerate convergence but also be able to adapt to multi-grain, multi-level memory, improve fault tolerance, reduce synchronization and promote asynchronous execution. Conventional preconditioners require much additional global communication and many sparse matrix-vector products, and they lose their advantages on large-scale hierarchical platforms.
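As a concrete illustration of the global synchronization cost mentioned above, consider the distributed scalar products that appear in every Krylov iteration. The sketch below is not from the paper and its names are illustrative; it shows the collective reduction on which all processes must block.

```c
#include <mpi.h>

/* Illustration of the synchronization point hidden in every Krylov iteration:
 * a distributed dot product ends in a global reduction over all processes. */
double distributed_dot(const double *x, const double *y, int n_local, MPI_Comm comm)
{
  double local = 0.0, global = 0.0;
  for (int i = 0; i < n_local; ++i)
    local += x[i] * y[i];
  /* Collective call: every process waits here until all others have
   * contributed, which is exactly the kind of global communication that
   * becomes the bottleneck at large core counts. */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return global;
}
```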
In order to explore novel methods for modern computer architectures, Emad [9] proposed the unite and conquer approach. This approach is a model for the design of numerical methods that combines different computation components working toward the same objective, with asynchronous communication among them. Unite refers to the combination of different calculation components, and conquer means that these components work together to solve one problem. In unite and conquer methods, the different parallel computation components work independently and exchange information through asynchronous communication. These components can be deployed on different platforms such as P2P, cloud and supercomputer systems, or on the same platform on different processors. The idea of the unite and conquer approach came from an article of Saad [17] in 1984, where he suggested using Chebyshev polynomials to accelerate the convergence [...]

[...] are the eigenvalues. With the help of asynchronous communication, we can choose to save the eigenvalues computed by the ERAM component into a local file and reuse them for other solving procedures with the same matrix.

We implement the UCGLE method on top of the scientific libraries PETSc and SLEPc, with both CPU and GPU versions. We use these mature libraries in order to focus on the prototype of the asynchronous model rather than on optimizing the code performance inside each component. PETSc also provides different kinds of preconditioners which can easily be used for the performance comparison with the UCGLE method.

In Section 2, we present in detail the three basic numerical algorithms that form the computation components of the UCGLE method. The implementation of the different levels of parallelism and communication is described in Section 3. In Section 4, we evaluate the convergence and performance of the UCGLE method with our large-scale scientific sparse matrices on hierarchical CPU/GPU clusters. We give conclusions and perspectives in Section 5.
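To picture the asynchronous exchange described above, the following is a minimal MPI sketch written under our own assumptions; the message tag, function names and the one-array-per-message layout are illustrative and are not the paper's implementation. The eigenvalue component publishes its latest approximations without blocking, and the solver component polls for them between restarts, so neither side ever waits on the other.

```c
#include <mpi.h>

#define EIG_TAG 42   /* hypothetical message tag, not taken from the paper */

/* Eigenvalue-component side: post the latest approximate eigenvalues with a
 * non-blocking send and return immediately; the caller must MPI_Wait/MPI_Test
 * on *req before reusing the eigs buffer. */
void publish_eigenvalues(const double *eigs, int n, int solver_rank,
                         MPI_Comm comm, MPI_Request *req)
{
  MPI_Isend(eigs, n, MPI_DOUBLE, solver_rank, EIG_TAG, comm, req);
}

/* Solver-component side: poll once per restart cycle; if no eigenvalues have
 * arrived yet, return 0 without waiting. Assumes the sender never sends more
 * than max_n values. */
int try_fetch_eigenvalues(double *eigs, int max_n, MPI_Comm comm)
{
  int flag = 0, count = 0;
  MPI_Status st;

  /* Non-blocking probe: returns immediately whether or not a message is pending. */
  MPI_Iprobe(MPI_ANY_SOURCE, EIG_TAG, comm, &flag, &st);
  if (!flag)
    return 0;
  /* The matched message has already arrived, so this receive completes at once. */
  MPI_Recv(eigs, max_n, MPI_DOUBLE, st.MPI_SOURCE, EIG_TAG, comm, &st);
  MPI_Get_count(&st, MPI_DOUBLE, &count);
  return count;
}
```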