A Splitting Approach for the Parallel Solution of Linear Systems on Gpu Cards

Total Page:16

File Type:pdf, Size:1020Kb

A Splitting Approach for the Parallel Solution of Linear Systems on Gpu Cards A SPLITTING APPROACH FOR THE PARALLEL SOLUTION OF LINEAR SYSTEMS ON GPU CARDS by Ang Li A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical and Computer Engineering) at the UNIVERSITY OF WISCONSIN–MADISON 2016 Date of final oral examination: 05/09/16 The dissertation is approved by the following members of the Final Oral Committee: Dan Negrut, Associate Professor, Mechanical Engineering Mikko Lipasti, Professor, Electrical and Computer Engineering Yu-Hen Hu, Professor, Electrical and Computer Engineering Parameswaran Ramanathan, Professor, Electrical and Computer Engineering Radu Serban, Associate Scientist, Mechanical Engineering © Copyright by Ang Li 2016 All Rights Reserved i I would like to dedicate this dissertation to Prof. Dan Negrut, my advisor, without whom none of my accomplishment during my PhD life would have been possible. While I realize it’s unusual to include here, I think I need to share with the world how Dan has also been helpful in changing my attitude towards others and life. ii acknowledgments I would like to give my thanks to my advisor, Professor Dan Negrut, for his guidance and support while working with him. I would also like to thank the committee members for their time and suggestions. I enjoyed the time I spent with all my colleagues and friends in the Simulation-Based Engineering Lab. I benefited a lot from their support, especially Dr. Radu Serban’s instructions. Finally, to my beloved family, I miss YOU, and – I am YOU. iii contents Contents . iii List of Tables . xii List of Figures . xiv Abstract . xviii 1 Introduction . 1 1.1 The Science and Engineering algorithmic backdrop ........... 1 1.2 Solving linear systems .......................... 2 1.3 Contributions ............................... 3 2 Motivation: the rise of parallel computing and emergence of GPU computing 5 2.1 Three efficiency walls in sequential computing ............. 5 2.2 GPU computing and lack of GPU linear system solvers . 6 2.3 Brief overview of relevant hardware ................... 8 2.3.1 The Tesla K40 Graphics Processing Unit . 9 2.3.2 Intel Xeon processor . 10 2.4 Brief overview of Compute Unified Device Architecture . 11 3 The “Split and Parallelize” Method . 12 3.1 SaP for dense banded linear systems ................... 12 3.1.1 SPIKE algorithm . 12 3.1.2 A generalization of the Cyclic Reduction: the Block Cyclic Reduction algorithm . 17 3.1.3 Nomenclature, solution strategies . 20 3.1.4 A comparison of SPIKE algorithm and Block Cyclic Reduction algorithm . 20 3.2 SaP for general sparse linear systems . 25 iv 3.3 SaP components and computational flow . 26 4 Reordering Methodologies . 29 4.1 Diagonal Boosting ............................. 29 4.2 Bandwidth Reduction ........................... 31 4.3 Third-stage reordering .......................... 33 5 Krylov Subspace Method . 35 5.1 A brief introduction to preconditioning . 36 5.2 Conjugate Gradient method ....................... 37 5.2.1 Backgrounds . 37 5.2.2 Further discussion . 39 5.3 Biconjugate Gradient Stabilized method and its generalization . 40 5.4 A profiling of Krylov subspace method . 44 6 Implementation Details . 48 6.1 Dense banded matrix factorization details . 48 6.1.1 Number of partitions and partition size . 48 6.1.2 Matrix storage . 49 6.1.3 LU/UL/Cholesky factorization . 50 6.1.4 Optimization techniques . 51 6.2 Implementation details for reordering methodologies . 53 6.2.1 DB reordering implementation details . 53 6.2.2 CM reordering implementation details . 54 6.3 CPU-GPU data transfer in SaP::GPU . 56 6.4 Multi-GPU support ............................ 59 7 Numerical Experiments . 64 7.1 Numerical experiments related to dense banded linear systems . 64 7.1.1 Sensitivity with respect to P ................... 64 7.1.2 Sensitivity with respect to d ................... 67 v 7.1.3 Comparison with LAPACK and Intel’s MKL over a spectrum of N and K ............................ 73 7.1.4 Profiling results for dense banded linear system solver . 78 7.2 Numerical experiments related to sparse matrix reorderings . 91 7.2.1 Assessment of the diagonal boosting reordering solution . 91 7.2.1.1 Efficiency evaluation of diagonal boosting algorithm . 91 7.2.1.2 Performance comparison against Harwell Sparse Li- brary’s solution . 95 7.2.2 Assessment of the bandwidth reduction reordering solution . 97 7.3 Numerical experiments related to sparse linear systems . 100 7.3.1 Profiling results . 100 7.3.2 The impact of the third stage reordering . 112 7.3.3 Comparison against state-of-the-art direct linear solvers on the CPU . 114 7.3.4 Comparison against a state-of-the-art GPU linear solver . 118 7.4 Numerical experiments highlighting multi-GPU support . 120 8 Use of SaP in computational multibody dynamics problems . 124 9 Conclusions and future work . 133 A Namespace Index . 135 A.1 Namespace List ..............................135 B Class Index . 136 B.1 Class Hierarchy ..............................136 C Class Index . 138 C.1 Class List .................................138 D File Index . 140 D.1 File List ..................................140 vi E Namespace Documentation . 142 E.1 sap Namespace Reference . 142 E.1.1 Detailed Description . 144 E.1.2 Enumeration Type Documentation . 145 E.1.2.1 KrylovSolverType . 145 E.1.2.2 PreconditionerType . 145 E.1.3 Function Documentation . 145 E.1.3.1 bicgstab . 145 E.1.3.2 bicgstabl . 145 E.1.3.3 minres . 146 F Class Documentation . 147 F.1 sap::Graph< T >::AbsoluteValue< VType > Struct Template Reference147 F.2 sap::Graph< T >::AccumulateEdgeWeights Struct Reference . 147 F.3 sap::BandedMatrix< Array > Class Template Reference . 147 F.3.1 Detailed Description . 148 F.3.2 Constructor & Destructor Documentation . 149 F.3.2.1 BandedMatrix . 149 F.4 sap::BiCGStabLMonitor< SolverVector > Class Template Reference . 149 F.4.1 Detailed Description . 150 F.5 sap::Graph< T >::ClearValue Struct Reference . 151 F.6 sap::Graph< T >::CompareValue< VType > Struct Template Reference151 F.7 sap::CPUTimer Class Reference . 151 F.7.1 Detailed Description . 152 F.8 sap::Graph< T >::Difference Struct Reference . 152 F.9 sap::Graph< T >::EdgeLength Struct Reference . 153 F.10 sap::Graph< T >::EqualTo< Type > Struct Template Reference . 153 F.11 sap::Graph< T >::Exponential Struct Reference . 154 F.12 sap::Graph< T >::GetCount Struct Reference . 154 F.13 sap::GPUTimer Class Reference . 154 F.13.1 Detailed Description . 155 vii F.14 sap::Graph< T > Class Template Reference . 155 F.14.1 Detailed Description . 159 F.15 sap::Graph< T >::is_not Struct Reference . 160 F.16 sap::Graph< T >::Map< Type > Struct Template Reference . 160 F.17 sap::Monitor< SolverVector > Class Template Reference . 160 F.17.1 Detailed Description . 162 F.18 sap::Multiply< T > Struct Template Reference . 162 F.19 sap::MVBanded< Matrix > Class Template Reference . 162 F.19.1 Detailed Description . 163 F.20 sap::Options Struct Reference . 164 F.20.1 Detailed Description . 165 F.20.2 Constructor & Destructor Documentation . 165 F.20.2.1 Options . 165 F.20.3 Member Data Documentation . 165 F.20.3.1 absTol . 165 F.20.3.2 applyScaling . 165 F.20.3.3 dbFirstStageOnly . 165 F.20.3.4 dropOffFraction . 165 F.20.3.5 factMethod . 166 F.20.3.6 gpuCount . 166 F.20.3.7 ilu_level . 166 F.20.3.8 isSPD . 166 F.20.3.9 maxBandwidth . 166 F.20.3.10maxNumIterations . 166 F.20.3.11performDB . 166 F.20.3.12performReorder . 166 F.20.3.13precondType . 166 F.20.3.14relTol . 167 F.20.3.15safeFactorization . 167 F.20.3.16saveMem . 167 F.20.3.17solverType . 167 viii F.20.3.18testDB . 167 F.20.3.19trackReordering . 167 F.20.3.20variableBandwidth . 167 F.21 sap::Graph< T >::PermutedEdgeLength Struct Reference . 168 F.22 sap::Graph< T >::PermuteEdge Struct Reference . 168 F.23 sap::Precond< PrecVector > Class Template Reference . 169 F.23.1 Detailed Description . 172 F.23.2 Constructor & Destructor Documentation . 172 F.23.2.1 Precond . 172 F.23.2.2 Precond . 173 F.23.2.3 Precond . 173 F.23.3 Member Function Documentation . 173 F.23.3.1 operator() . 173 F.23.3.2 setup . 173 F.23.3.3 solve . 174 F.23.3.4 update . 174 F.24 sap::SegmentedMatrix< Array, MemorySpace > Class Template Reference174 F.25 sap::SmallerThan< T > Struct Template Reference . 176 F.26 sap::Solver< Array, PrecValueType > Class Template Reference ..
Recommended publications
  • Enhanced Capabilities of the Spike Algorithm and a New Spike- Openmp Solver
    University of Massachusetts Amherst ScholarWorks@UMass Amherst Masters Theses Dissertations and Theses November 2014 Enhanced Capabilities of the Spike Algorithm and a New Spike- OpenMP Solver Braegan S. Spring University of Massachusetts Amherst Follow this and additional works at: https://scholarworks.umass.edu/masters_theses_2 Recommended Citation Spring, Braegan S., "Enhanced Capabilities of the Spike Algorithm and a New Spike-OpenMP Solver" (2014). Masters Theses. 116. https://doi.org/10.7275/5912243 https://scholarworks.umass.edu/masters_theses_2/116 This Open Access Thesis is brought to you for free and open access by the Dissertations and Theses at ScholarWorks@UMass Amherst. It has been accepted for inclusion in Masters Theses by an authorized administrator of ScholarWorks@UMass Amherst. For more information, please contact [email protected]. ENHANCED CAPABILITIES OF THE SPIKE ALGORITHM AND A NEW SPIKE-OPENMP SOLVER A Thesis Presented by BRAEGAN SPRING Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE IN ELECTRICAL AND COMPUTER ENGINEERING September 2014 Electrical and Computer Engineering ENHANCED CAPABILITIES OF THE SPIKE ALGORITHM AND A NEW SPIKE-OPENMP SOLVER A Thesis Presented by BRAEGAN SPRING Approved as to style and content by: Eric Polizzi, Chair Zlatan Aksamija, Member Hans Johnston, Member Christopher V. Hollot , Department Chair Electrical and Computer Engineering ACKNOWLEDGMENTS I would like to thank Dr. Zlatan Aksamija & Dr. Hans Johnston for assisting me in this project, Dr. Polizzi for guiding me through it, and Dr. Anderson for his help beforehand. iii ABSTRACT ENHANCED CAPABILITIES OF THE SPIKE ALGORITHM AND A NEW SPIKE-OPENMP SOLVER SEPTEMBER 2014 BRAEGAN SPRING B.Sc., UNIVERSITY MASSACHUSETTS AMHERST M.S.E.C.E., UNIVERSITY OF MASSACHUSETTS AMHERST Directed by: Professor Eric Polizzi SPIKE is a parallel algorithm to solve block tridiagonal matrices.
    [Show full text]
  • A Parallel Solver for Incompressible Fluid Flows
    Available online at www.sciencedirect.com Procedia Computer Science 00 (2013) 000–000 International Conference on Computational Science, ICCS 2013 A parallel solver for incompressible fluid flows Yushan Wanga,∗, Marc Baboulina,b, Jack Dongarrac,d,e, Joel¨ Falcoua, Yann Fraigneauf, Olivier Le Maˆıtref aLRI, Universit´eParis-Sud, France bInria Saclay Ile-de-France,ˆ France cUniversity of Tennessee, USA dOak Ridge National Laboratory, USA eUniversity of Manchester, United Kingdom fLIMSI-CNRS, France Abstract The Navier-Stokes equations describe a large class of fluid flows but are difficult to solve analytically because of their nonlin- earity. We present in this paper a parallel solver for the 3-D Navier-Stokes equations of incompressible unsteady flows with constant coefficients, discretized by the finite difference method. We apply a prediction-projection method that transforms the Navier-Stokes equations into three Helmholtz equations and one Poisson equation. For each Helmholtz system, we ap- ply the Alternating Direction Implicit (ADI) method resulting in three tridiagonal systems. The Poisson equation is solved using partial diagonalization which transforms the Laplacian operator into a tridiagonal one. We present an implementation based on MPI where the computations are performed on each subdomain and information is exchanged at the interfaces be- tween subdomains. We describe in particular how the solution of tridiagonal systems can be accelerated using vectorization techniques. Keywords: Navier-Stokes equations ; prediction-projection method; ADI method; partial diagonalization; SIMD ; parallel computing, tridiagonal systems. 1. Introduction The incompressible Navier-Stokes (NS) equations express the Newton’s second law applied to fluid motion, with the assumptions that the fluid density is constant and the fluid stress is the sum of a viscous term (proportional to the gradient of the velocity) and an isotropic pressure term.
    [Show full text]
  • PSPIKE: a Parallel Hybrid Sparse Linear System Solver *
    PSPIKE: A Parallel Hybrid Sparse Linear System Solver ? Murat Manguoglu1, Ahmed H. Sameh1, and Olaf Schenk2 1 Department of Computer Science,Purdue University, West Lafayette IN 47907 2 Computer Science Department,University of Basel,Klingelbergstrasse 50,CH-4056 Basel Abstract. The availability of large-scale computing platforms comprised of tens of thousands of multicore processors motivates the need for the next generation of highly scalable sparse linear system solvers. These solvers must optimize parallel performance, processor (serial) perfor- mance, as well as memory requirements, while being robust across broad classes of applications and systems. In this paper, we present a new parallel solver that combines the desirable characteristics of direct meth- ods (robustness) and effective iterative solvers (low computational cost), while alleviating their drawbacks (memory requirements, lack of robust- ness). Our proposed hybrid solver is based on the general sparse solver PARDISO, and the “Spike” family of hybrid solvers. The resulting algo- rithm, called PSPIKE, is as robust as direct solvers, more reliable than classical preconditioned Krylov subspace methods, and much more scal- able than direct sparse solvers. We support our performance and parallel scalability claims using detailed experimental studies and comparison with direct solvers, as well as classical preconditioned Krylov methods. Key words: Hybrid Solvers, Direct Solvers, Krylov Subspace Methods, Sparse Linear Systems 1 Introduction The emergence of extreme-scale parallel platforms, along with increasing number of cores available in conventional processors pose significant challenges for algo- rithm and software development. Machines with tens of thousands of processors and beyond place tremendous constraints on the communication requirements of algorithms.
    [Show full text]
  • Dense and Sparse Parallel Linear Algebra Algorithms on Graphics Processing Units
    Departament de Sistemes Informatics` i Computacio´ Dense and sparse parallel linear algebra algorithms on graphics processing units Author: Alejandro Lamas Davi~na Director: Jos´eE. Rom´anMolt´o October 2018 To the extent possible under law, the author has waived all copyright and related or neighboring rights to this work. To one, two, and three. Acknowledgments I would like to express my gratitude to my director Jos´eRom´an,for his permanent support during all these years of work. His wise advice and selfless guidance have been decisive for the culmination of this thesis. His door has always been open for me, and he has solved all my doubts with unlimited patience. For all that and for more, I thank him. I would like to extend my gratitude to my colleagues of the SLEPc project. Here I thank again to Jos´eRom´anfor his unique humor sense. To Carmen, for showing me the way and for all her good advices. To Enrique, who helped me to get rolling. And to the former members Andr´esand Eloy, to whom I had the opportunity to meet and who enliven the group meals. I will keep good memories from these years. I do not want to forget to mention to Xavier Cartoix`a,Jeff Steward and Altuˇg Aksoy, great researchers with whom I have had the opportunity to collaborate. The afternoon snacks would not have been the same without the excellent discus- sions and comments of Fernando, David and of course Salva, who, without noticing it, also helped to improve this dissertation. Last, I would like to thank to Jos´eLuis, IT staff of the department, for his high valuable work behind the scenes and his promptly response to any incidence.
    [Show full text]
  • A Banded Spike Algorithm and Solver for Shared Memory Architectures Karan Mendiratta University of Massachusetts Amherst
    University of Massachusetts Amherst ScholarWorks@UMass Amherst Masters Theses 1911 - February 2014 2011 A Banded Spike Algorithm and Solver for Shared Memory Architectures Karan Mendiratta University of Massachusetts Amherst Follow this and additional works at: https://scholarworks.umass.edu/theses Mendiratta, Karan, "A Banded Spike Algorithm and Solver for Shared Memory Architectures" (2011). Masters Theses 1911 - February 2014. 699. Retrieved from https://scholarworks.umass.edu/theses/699 This thesis is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for inclusion in Masters Theses 1911 - February 2014 by an authorized administrator of ScholarWorks@UMass Amherst. For more information, please contact [email protected]. A BANDED SPIKE ALGORITHM AND SOLVER FOR SHARED MEMORY ARCHITECTURES A Thesis Presented by KARAN MENDIRATTA Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE IN ELECTRICAL AND COMPUTER ENGINEERING September 2011 Electrical & Computer Engineering A BANDED SPIKE ALGORITHM AND SOLVER FOR SHARED MEMORY ARCHITECTURES A Thesis Presented by KARAN MENDIRATTA Approved as to style and content by: Eric Polizzi, Chair David P Schmidt, Member Michael Zink, Member Christopher V Hollot, Department Chair Electrical & Computer Engineering For Krishna Babbar, and Baldev Raj Mendiratta ACKNOWLEDGMENTS I would like to thank my advisor and mentor Prof. Eric Polizzi for his guidance, patience and support. I would also like to thank Prof. David P Schmidt and Prof. Michael Zink for agreeing to be a part of my thesis committee. Last but not the least, I am grateful to my family and friends for always encouraging and supporting me in all my endeavors.
    [Show full text]
  • Some Efficient Methods for Computing the Determinant of Large Sparse
    Some efficient methods for computing the determinant of large sparse matrices Emmanuel Kamgnia, Louis Bernard Nguenang To cite this version: Emmanuel Kamgnia, Louis Bernard Nguenang. Some efficient methods for computing the determi- nant of large sparse matrices. Revue Africaine de la Recherche en Informatique et Mathématiques Appliquées, INRIA, 2014, 17, pp.73-92. hal-01300060 HAL Id: hal-01300060 https://hal.inria.fr/hal-01300060 Submitted on 8 Apr 2016 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Special issue CARI’12 Some efficient methods for computing the determinant of large sparse matrices E. KAMGNIA(1) — L. B. NGUENANG(1) (1) Department of Computer Science, University of Yaounde I, Yaounde, Cameroon. Emails: [email protected], [email protected] ABSTRACT. The computation of determinants intervenes in many scientific applications, as for ex- ample in the localization of eigenvalues of a given matrix A in a domain of the complex plane. When a procedure based on the application of the residual theorem is used, the integration pro- cess leads to the evaluation of the principal argument of the complex logarithm of the function g(z) = det((z + h)I − A)/ det(zI − A), and a large number of determinants is computed to insure that the same branch of the complex logarithm is followed during the integration.
    [Show full text]
  • Scalable Parallel Tridiagonal Algorithms with Diagonal Pivoting and Their Optimization for Many-Core Architectures
    c 2014 Li-Wen Chang SCALABLE PARALLEL TRIDIAGONAL ALGORITHMS WITH DIAGONAL PIVOTING AND THEIR OPTIMIZATION FOR MANY-CORE ARCHITECTURES BY LI-WEN CHANG THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2014 Urbana, Illinois Adviser: Professor Wen-mei W. Hwu ABSTRACT Tridiagonal solvers are important building blocks for a wide range of scientific applications that are commonly performance-sensitive. Recently, many-core architectures, such as GPUs, have become ubiquitous targets for these ap- plications. Therefore, a high-performance general-purpose GPU tridiagonal solver becomes critical. However, no existing GPU tridiagonal solver provides comparable quality of solutions to most common, general-purpose CPU tridi- agonal solvers, like Matlab or Intel MKL, due to no pivoting. Meanwhile, conventional pivoting algorithms are sequential and not applicable to GPUs. In this thesis, we propose three scalable tridiagonal algorithms with diag- onal pivoting for better quality of solutions than the state-of-the-art GPU tridiagonal solvers. A SPIKE-Diagonal Pivoting algorithm efficiently par- titions the workloads of a tridiagonal solver and provides pivoting in each partition. A Parallel Diagonal Pivoting algorithm transforms the conven- tional diagonal pivoting algorithm into a parallelizable form which can be solved by high-performance parallel linear recurrence solvers. An Adaptive R-Cyclic Reduction algorithm introduces pivoting into the conventional R- Cyclic Reduction family, which commonly suffers limited quality of solutions due to no applicable pivoting. Our proposed algorithms can provide com- parable quality of solutions to CPU tridiagonal solvers, like Matlab or Intel MKL, without compromising the high throughput GPUs provide.
    [Show full text]
  • A Scalable, Numerically Stable, High-Performance Tridiagonal Solver Using Gpus
    A Scalable, Numerically Stable, High-performance Tridiagonal Solver using GPUs Li-Wen Chang, John A. Stratton, Hee-Seok Kim, and Wen-Mei W. Hwu Electrical and Computer Engineering University of Illinois at Urbana-Champaign Urbana, Illinois, USA flchang20, stratton, kim868, [email protected] Abstract—In this paper, we present a scalable, numerically of a single system at a time. Assuming that the systems stable, high-performance tridiagonal solver. The solver is based are all large enough to require partitioning, the amount of on the SPIKE algorithm for partitioning a large matrix into communication and transfer overhead will grow with the small independent matrices, which can be solved in parallel. For each small matrix, our solver applies a general 1-by-1 or number of times each system is partitioned. The number of 2-by-2 diagonal pivoting algorithm, which is also known to be partitions will grow directly in proportion to the number of numerically stable. Our paper makes two major contributions. unrelated systems being solved simultaneously on the same First, our solver is the first numerically stable tridiagonal solver device. All else being equal, we should then prefer to send for GPUs. Our solver provides comparable quality of stable large integrated systems to the GPU for minimal transfer solutions to Intel MKL and Matlab, at speed comparable to the GPU tridiagonal solvers in existing packages like CUSPARSE. It overhead. Internally, the GPU solver is then free to perform is also scalable to multiple GPUs and CPUs. Second, we present further partitioning as necessary to facilitate parallel execution, and analyze two key optimization strategies for our solver: a high- which does not require external communication.
    [Show full text]
  • A Feature Complete SPIKE Banded Algorithm and Solver
    A A Feature Complete SPIKE Banded Algorithm and Solver BRAEGAN S. SPRING and ERIC POLIZZI, University of Massachusetts, Amherst AHMED H. SAMEH, Purdue University New features and enhancements for the SPIKE banded solver are presented. Among all the SPIKE algorithm versions, we focus our attention on the recursive SPIKE technique which provides the best trade-off between generality and parallel efficiency, but was known for its lack of flexibility. Its application was essentially limited to power of two number of cores/processors. This limitation is successfully addressed in this paper. In addition, we present a new transpose solve option, a standard feature of most numerical solver libraries which has never been addressed by the SPIKE algorithm so far. A pivoting recursive SPIKE strategy is finally presented as an alternative to non-pivoting scheme for systems with large condition numbers. All these new enhancements participate to create a feature complete SPIKE algorithm and a new black-box SPIKE-OpenMP package that significantly outperforms the performance and scalability obtained with other state-of-the-art banded solvers. arXiv:1811.03559v1 [cs.NA] 8 Nov 2018 A:2 Spring, Polizzi, Sameh Contents 1 Introduction 2 2 SPIKE background 4 2.1 Central concept of SPIKE . .4 2.2 Recursive reduced system . .8 2.3 Optimizing per-partition costs . 11 3 Flexible partitioning scheme for recursive SPIKE 13 3.1 Distribution of threads . 14 3.2 Reducing factorization stage sweeps . 15 3.3 Load balancing scheme . 16 3.4 Performance measurements . 19 3.4.1 Partition ratio accuracy . 19 3.4.2 Scalability and performance comparisons .
    [Show full text]
  • Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with Opencl to Target Fpgas, Gpus, and Cpus
    Hindawi International Journal of Reconfigurable Computing Volume 2019, Article ID 3679839, 13 pages https://doi.org/10.1155/2019/3679839 Research Article Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs Hamish J. Macintosh ,1,2 Jasmine E. Banks ,1 and Neil A. Kelson2 1School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Queensland 4001, Australia 2eResearch Office, Division of Research and Innovation, Queensland University of Technology, Brisbane, Queensland 4001, Australia Correspondence should be addressed to Hamish J. Macintosh; [email protected] Received 27 February 2019; Revised 3 August 2019; Accepted 6 September 2019; Published 13 October 2019 Guest Editor: Sven-Bodo Scholz Copyright © 2019 Hamish J. Macintosh et al. /is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Solving diagonally dominant tridiagonal linear systems is a common problem in scientific high-performance computing (HPC). Furthermore, it is becoming more commonplace for HPC platforms to utilise a heterogeneous combination of computing devices. Whilst it is desirable to design faster implementations of parallel linear system solvers, power consumption concerns are in- creasing in priority. /is work presents the oclspkt routine. /e oclspkt routine is a heterogeneous OpenCL implementation of the truncated SPIKE algorithm that can use FPGAs, GPUs, and CPUs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. /e routine is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator in a heterogeneous environment depending on the accelerator’s compute performance.
    [Show full text]
  • SPIKE::GPU a SPIKE-Based Preconditioned GPU Solver for Sparse Linear Systems
    SPIKE::GPU A SPIKE-based preconditioned GPU Solver for Sparse Linear Systems Ang Li Andrew Seidl Radu Serban Dan Negrut January 13, 2014 Abstract This contribution outlines an approach that draws on general purpose graphics processing unit (GPGPU) computing to solve large linear systems. To methodology proposed relies on a SPIKE-based preconditioner with a Krylov-subspace method and has the following three stages: (i) row/column reordering for boosting diagonal dominance and reducing bandwidth; (ii) applying single precision truncated SPIKE on a dense banded matrix obtained after dropping small elements that fall outside a carefully selected bandwidth; and (iii) preconditioning within the BiCGStab(2) framework. The reordering strategy adopted aims at generating a narrow bandwidth dense matrix that is diagonally heavy. The results of several numerical experiments indicate that when used in conjunction with large dense banded matrices, the proposed approach is two to three times faster than the latest version of the MKL dense solver as soon as d > 0:23. When handling sparse systems, synthetic results obtained for large random matrices suggest that the proposed approach is three to five times faster than the PARDISO solver as long as the reordered matrices are close to being diagonally dominant. For a random set of smaller dimension application matrices, some out of the University of Florida Sparse Matrix Collection, the proposed approach is shown to be better or comparable in performance to PARDISO. 1 Introduction and Motivation Recently, GPGPU computing has been adopted at an increasingly faster pace in a wide spectrum of scientific and engineering data analysis, simulation, and visualization tasks.
    [Show full text]
  • A Spike Algorithm for Solving Tridiagonal Block Matrix Equations on a Multicore Machine Shinji TOKUDA Research Organization for Information Science and Technology
    20th NEXT Workshop, Kyoto TERSA 2015 January 13 A Spike Algorithm for Solving Tridiagonal Block Matrix Equations on a Multicore Machine Shinji TOKUDA Research Organization for Information Science and Technology Acknowledgement: N. Aiba (JAEA), M. Yagi (JAEA) 2015/1/9 1 Abstract We are developing a recursive direct-solver for linear equations with the coefficient matrix of a tridiagonal block form by adopting the spike algorithm. The solver is parallelized by open MP that supports sections and task constructs. The block structure in the coefficient matrix naturally leads to nested parallelism. Some tests are reported on the performance of the parallelized solver, which shows the spike algorithm is expected to provide an efficient solver on modern multicore machines. 2015/1/9 2 Contents ●Introduction ●Spike Algortithm for Tridiagonal Block Matrix ◇ What is Spike? ◇ Recursive Spike Algorithm and Its Parallelization by open MP ◇ Nested Parallelism for Spike Algorithm ●Summary ◇ Numerical Experiments of the parallelized solever ◇ Future Works 2015/1/9 3 Introduction We want to solve linear equations that have a coefficient matrix of a tridiagonal block form (1) Ax = b on a multicore machine (for example, a personal computer with an accelerator ). Assumed Problem Sizes: The size of each block (□) (Nblock)2 = (128)2~ (512)2 The block can be sparse or dense. A = ● ● ● The number of diagonal blocks msize0 = 512~1024 2015/1/9 4 Such equations are familiar in computational physics. In tokamak fusion research they appear in MHD (MagnetoHydroDymanic ) equilibrium and stability analyses and in nonlinear MHD simulations. Algorithms for tridiagonal block matrix equations has been studied since 1970’s, and several solvers have been developed[1].
    [Show full text]